HEAD DETECTION IN DENSELY CROWDED SCENES
Submitted by
Ekambaram Rajmadhan
Department of Electrical & Computer Engineering
In partial fulfillment of the
Requirements for the Degree of
Master of Science in Electrical Engineering
National University of Singapore
2009
ABSTRACT
False alarms output by object detectors reduce their reliability and increase the
computation of subsequent processing units. In this project the Viola-Jones face
detector approach [4] is compared with the Dalal and Triggs pedestrian detector
approach [5], modified for head detection in densely crowded scenes. Based on the
results of this experiment, the Viola-Jones approach is chosen for its higher
detection rate. A method is then developed to reduce the false alarms produced by
the Viola-Jones detector when it is applied to head detection in densely crowded
scenes. A new type of Haar-like feature, called the “triangle feature”, is
introduced and its efficient computation is shown. The experimental results show
that triangle features together with rectangle Haar-like features reduce the false
alarm rate from 26% to 18% with a small decrease in the detection rate from 85% to
82.22%. This project also supports the multi-view face detector approach for head
detection. For multi-view detectors the training samples need to be grouped by
viewpoint or orientation, and manual classification of training samples by
orientation is a time-consuming task. A novel method to automate the classification
of human head images by orientation is also developed in this project.
ACKNOWLEDGEMENTS
I offer my sincere gratitude to my project supervisor, Professor Dr. Surendra
Ranganath, who has supported me throughout the project with his guidance and
encouragement. I would like to thank Sim Chern-Horng from Vision and Image
Processing Laboratory for giving me invaluable technical support, mentoring and
advice on this project. They opened up the unknown areas of computer vision,
machine learning and image processing to me, and enlightened me along the way.
Without them, this project would not have been possible. I would like to thank my
parents for their love and support. My family is always beside me and gives me
courage and motivation to proceed in this project.
CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Background
1.2 Related Works
1.3 Overview
CHAPTER 2 HEAD DETECTIONS IN DENSELY CROWDED SCENES
2.1 Introduction
2.2 AdaBoost
2.3 Viola and Jones Face Detector
2.3.1 Rectangular Haar-like Features
2.3.2 Integral Image
2.3.3 Weak Learning Algorithm
2.3.4 Cascade of Classifiers
2.4 Head Detector
CHAPTER 3 HISTOGRAM OF ORIENTED GRADIENTS FOR HEAD DETECTIONS
3.1 Introduction
3.2 Histogram of Oriented Gradients
3.2.1 HOG Introduction
3.2.2 HOG Descriptor
3.2.3 HOG Parameters
3.3 HOG for Head Detections
3.4 Experimental Details
3.5 Testing Results
CHAPTER 4 TRIANGLE FEATURES FOR HEAD DETECTIONS
4.1 Introduction
4.2 Triangle Feature
4.3 Triangle Feature Evaluation
4.4 A Single Stage AdaBoost Classifier To Reduce The False Alarms
4.5 Results
4.6 Sample Outputs
4.7 Triangle Feature Limitations and Possible Extensions
CHAPTER 5 CLASSIFICATION OF HUMAN HEAD IMAGES BASED ON ORIENTATION
5.1 Introduction
5.2 Overview of Head Classification Based on Orientation
5.3 Formation of Image Descriptor
5.4 Classification Method
5.4.1 Bhattacharyya Coefficient
5.4.2 Clustering
5.4.3 Classification Algorithm
5.5 Observation
5.6 Sample Results
CHAPTER 6 CONCLUSIONS
REFERENCES
LIST OF FIGURES
Figure 2.1 AdaBoost Algorithm
Figure 2.2 Integral Image
Figure 2.3 Sum of the pixels within rectangle can be computed with 4 memory references
Figure 3.1 Sample detections of a crowd facing away from the camera using HOG descriptors and linear SVM classifier
Figure 3.2 Sample detections for a crowd facing the camera using HOG descriptors and linear SVM classifier
Figure 4.1 Triangle feature approximation
Figure 4.2 Example triangle features
Figure 4.3 Example triangle images. The value of pixel at location 'P' is the sum of the pixels shown in shaded region
Figure 4.4 Example triangle image to show the calculation of sum of the pixels within a triangle shown in the shaded region
Figure 4.5 Two triangular regions P1 and P2
Figure 4.6 Difference of two triangular regions P1 and P2
Figure 4.7 Two rectangular regions P3 and P4
Figure 4.8 Sum of pixels in the triangle can be evaluated from the triangle and rectangular regions
Figure 4.9 Different triangle features
Figure 4.10 Triangle feature detector
Figure 4.11 The outputs of the detector [3] and the detections which are rejected by Triangle Feature detector for the heads back facing the camera
Figure 4.12 The outputs of the detector [3] and the detections which are rejected by Triangle Feature detector for the heads facing the camera
Figure 4.13 Triangle feature depicted in a square of area n * n
Figure 4.14 Scalable Triangle feature
Figure 5.1 Images which are closer to the cluster center 1
Figure 5.2 Images which are closer to the cluster center 2
LIST OF TABLES
Table 3.1 Comparison of descriptors used in head detections and in pedestrian detection
Table 4.1 Detection rate and false alarm rate of detector [3] and HOG for head detections
Table 4.2 Detection rate and false alarm rate of Triangle feature detector for head detections
CHAPTER 1
INTRODUCTION
1.1 Background
Tracking people in densely crowded scenes is of wide interest in the computer vision
community because of its potential in surveillance applications. Tracking is necessary
to monitor the actions of individuals. All tracking algorithms require information
about the object to track, and object detection algorithms can be used to provide
this input. To track people in densely crowded scenes, a head detector can be used to
provide input to any tracking algorithm. State-of-the-art object detectors are
available only for face detection (Viola and Jones [1]) and for sparsely crowded
pedestrians (HOG [2]). The only available detector that can be used for head
detection is [3]. The detection rate of [3] is satisfactory but its false alarm rate
is not. In this project the false alarm rate of this detector is reduced by
developing a new type of feature. To further reduce the false alarms, this project
suggests using multi-view detectors. The detector in [16] can be used only for
multi-view face detection, not for heads. To train such a multi-view detector, the
first step is to classify the training samples by viewpoint or orientation. A novel
method is developed in this project to classify heads by orientation.
1.2 Related works
The problem considered in this project arises in surveillance applications, so it is
assumed that the cameras are mounted at an elevation. People's faces are not fully
visible from the top view in densely crowded scenes because of occlusion and their
direction of movement. People moving away from the camera have their backs to the
camera, so their faces are not visible. For people moving sideways, from right to left
or from left to right, only a partial profile view of their faces is visible. Because
of the nature of this problem, the state-of-the-art face detector [1] cannot be used
directly in this application. A pedestrian detector like [2] is good for sparsely
crowded scenes where the full body of the pedestrian is visible. This project,
however, deals with densely crowded scenes, where the full body is rarely seen because
of occlusion, which makes [2] difficult to use in this application. The pedestrian
detector [11] uses a Bayesian combination of different body part detectors. It cannot
be used in this application for the same reasons as [2], and because its basic concept
involves combining detectors for different body parts. The detector in [15] uses
Haar-like features to detect pedestrians in a video sequence. In this project heads
have to be detected in still images, so [15] cannot be used. The detector [3] shows
good results for head detection in densely crowded scenes, but its false alarm rate is
higher than that of other detectors like [1], [2] and [11]. In this project the false
alarm rate of [3] is reduced from 26% to 18% with a small decrease in the detection
rate from 85% to 82.22%. This project supports the multi-view face detector approach
for head detection, to further reduce the false alarms without decreasing the
detection rate, or even to improve the detection rate. Training a multi-view detector
is a tiresome task because the training samples have to be separated by view or
orientation. The detector in [16] is a good multi-view face detector; its authors
manually classified faces by view direction for training. Separating human heads by
orientation is difficult and time consuming. Inspired by [16], a method is developed
to automate the head orientation classification process, which can be used to classify
training samples for future multi-view head detectors.
1.3 Overview
A new type of feature is introduced in this project to reduce the false alarms of the
detector [3]. A single stage classifier is constructed using the AdaBoost machine
learning algorithm with triangle Haar-like features. This project proposes a new
classifier which is a cascade of the classifier [3] and the newly constructed AdaBoost
triangle feature classifier. The output of the detector [3] is the co-ordinates and
size of each detection in the image. In the new classifier, the sub-images
corresponding to these co-ordinates and sizes are cropped, resized to 24*24 pixels
using bilinear interpolation, and passed to the AdaBoost classifier built from triangle
features and rectangle Haar-like features. The outputs of this new classifier are the
heads in the image. The objective of this project is to improve the detector [3] or to
find a detector which performs better than [3]. Detector [3] uses rectangle Haar-like
features to learn the similarity among positive samples and their difference from the
negative samples. It is difficult for Haar-like features to learn the similarity among
positive samples of different orientations: because each Haar-like feature is applied
over a fixed sub-window of the image, large structural variations among the positive
samples make it difficult to learn a good threshold. Based on intuition and on the
performance of the detector [16], this project supports a multi-view detector for head
detection. Although no multi-view detector is developed and no existing multi-view
detector is tested in this project, a novel method is developed to automate the
tiresome task of classifying training samples by viewpoint or orientation for
multi-view detector training. To classify the images by orientation, normalized
descriptors similar to HOG are created and clustered using fuzzy c-means clustering.
The clustering causes heads of different orientations to fall into different clusters.
The remaining chapters are organized as follows. Chapter 2 describes how the Viola and
Jones face detection approach [1] is applied to the problem of head detection in
densely crowded scenes [3]. Chapter 3 explains how the Dalal and Triggs pedestrian
detector approach [2] is applied to head detection and compares its performance with
the Viola-Jones approach [3]. Chapter 4 introduces a new type of feature called the
“triangle feature” and shows how it is used to reduce the false alarms of [3]. A novel
method for classifying objects by orientation is developed in Chapter 5. Conclusions
and future work are in Chapter 6.
CHAPTER 2
HEAD DETECTIONS IN DENSELY CROWDED SCENES
2.1 Introduction
The Viola and Jones face detection algorithm [1] uses rectangle Haar-like features to
learn the critical visual features of faces. These features are combined to form a
strong classifier using a machine learning algorithm called Adaptive Boosting
(AdaBoost). In detector [3] the Viola-Jones face detection approach is used for head
detection in densely crowded scenes. This chapter explains the AdaBoost machine
learning algorithm in Section 2.2, the Viola-Jones face detector in Section 2.3 and
detector [3] in Section 2.4.
2.2 AdaBoost
Boosting is a supervised machine learning algorithm. Machine learning algorithms aim
to automatically learn complex patterns and make intelligent decisions based on data.
In supervised learning, the algorithm is presented with sample inputs and outputs and
is expected to learn the association between them, so that it classifies unknown
examples correctly. Boosting is a way of combining the results of weak learners to
produce a strong classifier. It is an iterative algorithm: in each step a simple
classifier is selected by a weak learning algorithm based on a distribution over the
training samples, and is added to the final classifier. A learning algorithm which
selects the simple classifiers is called a weak learner, and the chosen classifiers
are expected to be only slightly better than random guessing, i.e., their probability
of correct classification should be greater than 50%. Each simple classifier selected
by the weak learner contributes a parameter (confidence value) to the final
classifier, which measures the importance of that simple classifier. The value of the
parameter is based on the classification accuracy on the training samples. In adaptive
boosting (AdaBoost), a variant of the boosting algorithm, the parameter (confidence
value or strength of the weak classifier) is calculated using the probability
distribution of the training samples. The probability distribution is calculated from
the weights of the training samples. At the start of training, weights are assigned to
all the training samples, either uniformly or according to the importance of the
samples. In each iteration, the weak classifier which produces the largest sum of
weights of correctly classified samples is chosen. The weights are then updated after
the selection of the weak classifier. The training samples which are classified
correctly by the selected weak classifier are considered less important and their
weights are reduced, while the incorrectly classified samples are considered more
important and their weights are increased. The AdaBoost algorithm is presented in
Figure 2.1.
Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$
Initialize $D_1(i) = 1/m$
For $t = 1, \ldots, T$:
    Train weak learner using distribution $D_t$
    Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error
        $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$
    Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
    Update:
        $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
    where $Z_t$ is the normalization factor (chosen so that $D_{t+1}$ will be a distribution)
Output the final hypothesis:
    $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
Figure 2.1 AdaBoost Algorithm
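The algorithm of Figure 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the training code used in this project: the weak learner is assumed to be an exhaustive search over threshold stumps, and the names `stump_predict`, `train_adaboost` and `classify` are hypothetical.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Threshold stump: returns +1 or -1 for one sample x (a list of numbers)."""
    return 1 if polarity * (x[feature] - threshold) >= 0 else -1

def train_adaboost(X, y, T):
    """Discrete AdaBoost as in Figure 2.1; every label y[i] must be -1 or +1."""
    m = len(X)
    D = [1.0 / m] * m                       # D_1(i) = 1/m
    ensemble = []                           # (alpha, feature, threshold, polarity)
    for _ in range(T):
        best = None
        # Assumed weak learner: try every feature, threshold and polarity and
        # keep the stump with the lowest weighted error under D_t.
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    eps = sum(D[i] for i in range(m)
                              if stump_predict(X[i], f, thr, pol) != y[i])
                    if best is None or eps < best[0]:
                        best = (eps, f, thr, pol)
        eps, f, thr, pol = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)   # keep the logarithm finite
        alpha = 0.5 * math.log((1 - eps) / eps)
        # D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        D = [D[i] * math.exp(-alpha * y[i] * stump_predict(X[i], f, thr, pol))
             for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, f, thr, pol))
    return ensemble

def classify(ensemble, x):
    """Final hypothesis H(x) = sign(sum_t alpha_t * h_t(x))."""
    score = sum(a * stump_predict(x, f, thr, pol) for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1
```

The clamping of the error is an addition made here so that a perfect stump does not cause a division by zero; it is not part of Figure 2.1.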
2.3 Viola and Jones Face detector
2.3.1 Rectangular Haar-Like Features
The Viola and Jones face detector [1] uses rectangular Haar-like features to encode
the information contained in images of human faces. The rectangular Haar-like features
are combined to form a strong classifier using the AdaBoost machine learning
algorithm. The rectangular Haar-like features are reminiscent of Haar basis functions,
but are overcomplete. The value of a simple rectangular Haar-like feature is defined
as the difference between the sums of the pixels in two adjacent rectangular regions,
which can be at any position and scale within the original image. This feature set is
called 2-rectangle features. Viola and Jones [1] also defined 3-rectangle features and
4-rectangle features. A 3-rectangle feature value is calculated as the difference
between the sum of the pixels of the two outer rectangles and the sum of the pixels of
the inner rectangle. A 4-rectangle feature value is calculated as the difference
between the sums of the pixels of the two diagonal pairs of rectangles. The values of
these features indicate certain characteristics of a particular area of the image.
Each feature type can indicate the existence (or not) of certain characteristics in
the image, such as edges or changes in texture. For example, a 2-rectangle feature can
indicate where the border between a dark region and a light region lies.
2.3.2 Integral Image
The rectangular Haar-like features are computationally efficient and can be computed
rapidly using integral images. The integral image at location (x, y) contains the sum
of the pixels above and to the left of (x, y), inclusive, as shown in Figure 2.2.
Using integral images, the sum of the pixels of a rectangular region in an image can
be computed using 4 memory references. Another advantage of the integral image is that
it avoids computing the pyramid of images which is used in other methods for finding
objects of different sizes. With integral images, the rectangular Haar-like features
are scaled instead of the image, and features of all dimensions are computed using the
same number of operations. The following example shows the computation of the sum of
the pixels within a rectangle as shown in Figure 2.3:
Sum of pixels within a rectangle = P1 – P2 – P3 + P4
Figure 2.2 Integral Image
Figure 2.3 Sum of the pixels within rectangle can be computed with 4 memory references. Sum of pixels within rectangle R = P1 – P2 – P3 + P4
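A minimal sketch of the integral image and of the four-reference rectangle sum is given below. It assumes the image is a list of rows of numbers, and it assumes that P1, P2, P3 and P4 in Figure 2.3 are the integral image values at the top-left, top-right, bottom-left and bottom-right corners of R; the function names are hypothetical.

```python
def integral_image(img):
    """ii[y][x] = sum of img over all rows <= y and all columns <= x (inclusive)."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0                      # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom and columns left..right (inclusive)."""
    def at(y, x):
        # Positions just outside the top or left border contribute zero.
        return ii[y][x] if y >= 0 and x >= 0 else 0
    # P1 - P2 - P3 + P4, with the corner values read from the integral image.
    return (at(top - 1, left - 1) - at(top - 1, right)
            - at(bottom, left - 1) + at(bottom, right))
```

Because the cost of `rect_sum` does not depend on the size of the rectangle, a feature can be evaluated at any scale with the same four lookups.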
2.3.3 Weak Learning Algorithm
A feature together with a threshold is called a weak classifier. The weak learning
algorithm selects the feature which best classifies the positive and negative training
samples. For each feature the weak learner determines the optimum threshold, such
that the weighted error of the misclassified samples is minimal. The weak learner selects a
feature as follows. For each feature, the examples are sorted based on feature value.
The optimal threshold for that feature can be computed in a single pass over this
sorted list. For each element in this sorted list, four sums are evaluated: the total sum
of positive sample weights T+, the total sum of negative sample weights T-, the sum of
positive weights below the current example S+, the sum of negative weights below the
current example S-. The error for a threshold which splits the range between the
current and previous example in this sorted list is:
e = min( S+ + ( T- - S- ), S- + ( T+ - S+ ) ),
or the minimum of the error of labeling all examples below the current example
negative and labeling the examples above positive versus the error of the converse.
These sums are easily updated as the search proceeds. The feature which generates the
lowest error is chosen, together with its threshold, as a weak classifier. The weak
classifiers selected by the weak learner are combined to form a strong classifier
as given in Figure 2.1.
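The single-pass threshold search described above can be sketched as follows. The sketch assumes labels of +1 and -1, distinct feature values (ties would need to be grouped), and a hypothetical function name `best_threshold`; it illustrates the error formula rather than reproducing the implementation of [1] or [3].

```python
def best_threshold(values, labels, weights):
    """Return (error, threshold, polarity) for one feature.

    polarity +1 means "predict positive when the value is >= threshold",
    polarity -1 means "predict positive when the value is < threshold".
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    t_pos = sum(w for w, l in zip(weights, labels) if l == 1)     # T+
    t_neg = sum(w for w, l in zip(weights, labels) if l == -1)    # T-
    s_pos = s_neg = 0.0                                           # S+ and S-
    best = (float("inf"), None, None)
    for i in order:
        # Everything below the current example labelled negative ...
        err_negative_below = s_pos + (t_neg - s_neg)
        # ... or everything below the current example labelled positive.
        err_positive_below = s_neg + (t_pos - s_pos)
        err, polarity = min((err_negative_below, 1), (err_positive_below, -1))
        if err < best[0]:
            best = (err, values[i], polarity)
        # Update the running sums before moving to the next example.
        if labels[i] == 1:
            s_pos += weights[i]
        else:
            s_neg += weights[i]
    return best
```

Running this for every feature and keeping the one with the smallest error gives the weak classifier for the current boosting round.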
2.3.4 Cascade of classifiers
The advantage of the Viola and Jones face detector [1] is its speed of detection,
which is achieved by using a cascade of classifiers. The idea is to reject the
majority of the negative regions with little computation using simple classifiers, and
to spend more computation on regions which have a high probability of being positive
using complex classifiers. The AdaBoost algorithm described in Figure 2.1 is used to
create a strong classifier, and the strong classifiers are cascaded to form the face
detector. The earlier stages of the cascade are simple classifiers built using a small
number of features, and their detection rates are close to 100%. The classifiers in
the later stages are complex and are built with a large number of features to reduce
the false alarms. The detection process follows a degenerate decision tree: only the
sub-windows which are classified as positive in an earlier stage are sent to the
successive stages, and a sub-window which is classified as negative in any stage is
rejected immediately. The time required to process an image depends on the amount of
computation performed over each sub-window, which is directly proportional to the
number of features evaluated by the classifier. Since the majority of the sub-windows
in an image are negative, removing most of them at the earliest possible stage, i.e.,
using few features, reduces computation drastically.
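The early-rejection behaviour of the cascade can be illustrated with a short sketch. The data layout is an assumption made for this example: each stage is a pair of a list of weak classifiers and a stage threshold, and each weak classifier is a tuple of a confidence value, a feature function, a threshold and a polarity. The name `cascade_classify` is hypothetical.

```python
def cascade_classify(stages, window):
    """Return True only if every stage of the cascade accepts the window."""
    for weak_classifiers, stage_threshold in stages:
        score = 0.0
        for alpha, feature_fn, threshold, polarity in weak_classifiers:
            vote = 1 if polarity * (feature_fn(window) - threshold) >= 0 else -1
            score += alpha * vote
        if score < stage_threshold:
            # Rejected immediately: the features of later stages are never computed.
            return False
    return True
```

A window that fails the first stage costs only the features of that stage, which is why the early stages are kept small.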
2.4 Head detector
In the head detector [3], straight and tilted rectangle Haar-like features are used to
learn the critical visual features, and the Gentle AdaBoost algorithm is used to
combine these features into strong classifiers. These strong classifiers are connected
in a cascade to form the head detector, and each strong classifier is called a stage
classifier. In [1] the criterion for training a stage classifier in the cascade is the
number of features, but in [3] the detection rate and the false alarm rate are used. A
single stage classifier is created by selecting and boosting the Haar-like features as
described in Section 2.3.3 until it classifies the training samples with a given
detection rate and false alarm rate. Since the classifiers are cascaded, the detection
rate of the detector is equal to the product of the detection rates of all the stage
classifiers, and likewise for the false alarm rate. Let $D_1, D_2, \ldots, D_N$ be the
detection rates of the stage classifiers. Then the detection rate of the cascade is
$D_1 \cdot D_2 \cdots D_N$. The false alarm rate of the detector is calculated in the
same way. The detection rate and the false alarm rate of each stage classifier can be
chosen as follows. Let D and F be the expected detection rate and false alarm rate,
respectively, of the detector and let N be the number of stages. Then the detection
rate and the false alarm rate of all the stage classifiers can be chosen equally as
$D^{1/N}$ and $F^{1/N}$, respectively.
The Gentle AdaBoost algorithm is chosen for its performance compared to other AdaBoost
variants, as mentioned in [4]. The tilted rectangle Haar-like features [18] are
extensions of the rectangle Haar-like features and are added to increase the dimension
of the feature set in order to improve detection. The training set consists of 4016
positive samples and 1704 negative samples. More training samples are generated by
flipping the positive samples horizontally and the negative samples both horizontally
and vertically. The resulting 8032 positive samples and 6816 negative samples are used
for training. Each stage classifier is trained with a detection rate of 99.9% and a
false alarm rate of 50%. The number of stages trained is 20. The final classifier
gives an 85% detection rate and a 26% false alarm rate when tested with 1010 head
samples from 30 images.
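The product rule for cascaded stages can be checked numerically. The sketch below uses the per-stage training targets quoted above (99.9% detection rate, 50% false alarm rate, 20 stages); the helper names are hypothetical, and the resulting figures are the targets implied on the training data, not the rates measured on the test images.

```python
def cascade_rate(stage_rate, n_stages):
    """Rate of a cascade whose stages all have the same rate."""
    return stage_rate ** n_stages

def per_stage_rate(overall_rate, n_stages):
    """Equal per-stage rate needed to reach a given overall rate."""
    return overall_rate ** (1.0 / n_stages)

detection_target = cascade_rate(0.999, 20)     # roughly 0.98
false_alarm_target = cascade_rate(0.5, 20)     # roughly 9.5e-7 per sub-window
```

The two functions are inverses of each other, so `per_stage_rate(detection_target, 20)` recovers the per-stage value of 0.999.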
CHAPTER 3
HISTOGRAM OF ORIENTED GRADIENTS FOR HEAD DETECTIONS
3.1 Introduction
Tracking of people in video can be performed by using initial detections from a head
detector. For such an application the detection rate of the detector should be high
and its false alarm rate low. A suitable head detector for this application is [3].
The detection rate of the detector [3] is good, but its false alarm rate needs to be
reduced before it can be used in a tracking application. Before reducing the false
alarms of this detector, it is also necessary to check the performance of other
available detectors. Potential state-of-the-art detectors for head detection in
densely crowded scenes are [2], [11] and [15]. However, the detectors [11] and [15]
cannot be used for this application, for the reasons explained in Section 1.2. Hence,
we chose to experiment with detector [2].
3.2 Histogram of Oriented Gradients
3.2.1 HOG Introduction
The detector [2] is based on the idea that local object appearance and shape can be
characterized by the distribution of intensity gradients or edge directions, even
without precise knowledge of the corresponding gradient or edge positions. In this
detector the orientation of the gradient is computed and a histogram of the
orientations is calculated for overlapping image blocks. The calculated histograms are
used as a descriptor to detect pedestrians in an image using a linear SVM classifier.
3.2.2 HOG Descriptor
A descriptor is created for each of the training samples using the orientation of the
gradient. A linear SVM classifier is trained to classify the descriptors of positive
and negative training samples. The descriptor for a training sample is created as
follows. The gradient of the image is calculated using a simple centered mask like
[-1 0 1] in both the x and y directions. The orientation of the gradient at each pixel
is calculated from these two gradient images. The gradient image is divided into
non-overlapping rectangles of size m*n, each of which is called a cell. A histogram of
h bins is calculated for each cell, in which each pixel votes for the bin of its
gradient orientation with a weight given by its gradient magnitude. The cells are
grouped to form blocks of size p*q cells. The histograms of all the cells in a block
are concatenated and normalized using the L1 or L2 norm. The normalized histograms of
all the blocks are concatenated to form the image descriptor. The blocks overlap,
meaning that a cell contributes to more than one block. Though each cell contributes
to several blocks, the values it contributes differ because of block normalization.
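The steps above can be sketched in plain Python. This is a simplified illustration and not the implementation of [2]: it assumes a grayscale image given as a list of rows, unsigned orientations in the range 0 to 180 degrees, votes cast into a single bin without interpolation, L2 block normalization, and blocks moved one cell at a time. The default cell and block sizes are taken from the head-detection column of Table 3.1, and the name `hog_descriptor` is hypothetical.

```python
import math

def hog_descriptor(img, cell=6, block=3, bins=9):
    """Simplified HOG descriptor of a grayscale image (list of rows of numbers)."""
    h, w = len(img), len(img[0])
    n_cy, n_cx = h // cell, w // cell
    # One magnitude-weighted orientation histogram per cell.
    hist = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(n_cy * cell):
        for x in range(n_cx * cell):
            # Centered [-1 0 1] mask, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0    # unsigned orientation
            b = min(int(angle / (180.0 / bins)), bins - 1)
            hist[y // cell][x // cell][b] += magnitude
    descriptor = []
    # Overlapping blocks: the block window moves one cell at a time.
    for by in range(n_cy - block + 1):
        for bx in range(n_cx - block + 1):
            v = [value
                 for cy in range(by, by + block)
                 for cx in range(bx, bx + block)
                 for value in hist[cy][cx]]
            norm = math.sqrt(sum(value * value for value in v)) + 1e-6
            descriptor.extend(value / norm for value in v)      # L2 normalization
    return descriptor
```

For a 30 * 30 image with these defaults the descriptor has 9 blocks of 81 values each.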
3.2.3 HOG Parameters
The following parameters yield the best results for the pedestrian detector: cell size
of 8 * 8 pixels, linear gradient voting into 9 orientation bins in 0°-180°, block size
of 2 * 2 cells, L2-Hys (Lowe-style clipped L2 norm) block normalization, a detection
window of 64 * 128 pixels and a linear SVM classifier. To reduce aliasing while forming
the histogram, votes are interpolated bilinearly between the neighboring bin centers
in both orientation and position. Image gradients are usually calculated after
smoothing the image using a Gaussian filter. For the pedestrian detector, however, the
authors found by experiment that performance decreases as the Gaussian smoothing
parameter σ is increased from 0 to 2, so no Gaussian smoothing is performed. Gradient
strengths vary over a wide range owing to local variations in illumination and
foreground-background contrast, so effective local contrast normalization turns out to
be essential for good performance. In this experiment block normalization acts as
local contrast normalization. The gradient is calculated for each of the color
channels in either the RGB or the LAB color space, and at each pixel the gradient of
the channel with the largest norm is used in the histogram.
3.3 HOG for Head Detections
For head detection, not all the steps and ideas of the HOG pedestrian detector are
used, because some of them are not valid for head detection in densely crowded scenes.
In pedestrian detection, local contrast normalization is a good idea, and the results
of [2] show its importance in improving the performance of the classifier. Local
contrast normalization may not suit all types of objects, however, and for head
detection in particular it may not improve performance. In pedestrian detection, the
camera field of view includes the entire pedestrian, and occlusion and self-shadowing
affect the features extracted for classification. In densely crowded scenes, however,
a head occupies only a small portion of the image, and if there is a shadow it will
usually cover the whole head. So local contrast normalization, which is of critical
importance in pedestrian detection, is of little use in head detection. To increase
the performance of HOG for head detection, we experimented with increasing the number
of negative samples, but the results were contrary to expectations. When the number of
negative training samples is increased, the linear SVM is not able to find a
classifier. Heuristically, this suggests that the HOG descriptors created for human
heads are not sufficiently distinctive for classification using a linear SVM and share
characteristics with the natural and man-made objects which are used as negative
samples.
3.4 Experiment details
Table 3.1 compares the parameters used and the results obtained for head detection and
pedestrian detection using HOG.

| Data | Head detection | Pedestrian detection |
|---|---|---|
| Number of histogram bins | 9 | 9 |
| Cell size (pixels) | 6 * 6 | 8 * 8 |
| Block size (cells) | 3 * 3 | 2 * 2 |
| Image size (pixels) | 30 * 30 | 64 * 128 |
| Total number of blocks in an image (assuming window stride is by one cell) | 9 | 105 |
| Number of positive training samples | 8032 | 2478 |
| Number of negative training samples | 6816 | 12180 |
| Detection rate | 66% | 84-89% |
| False alarm rate | 39% | 10^-4 false positives per window |

Table 3.1 Comparison of descriptors used in head detections and in pedestrian detection

Image descriptor dimension = Number of histogram bins * Number of cells per block *
Total number of blocks
Descriptor dimension in pedestrian detection = 9 * 4 * 105 = 3780 features (a 64 * 128
window contains 8 * 16 cells, which gives 7 * 15 = 105 overlapping blocks of 2 * 2 cells)
Descriptor dimension in our problem = 9 * 9 * 9 = 729 features
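The descriptor dimension follows directly from the window, cell and block sizes. The helper below is a small sketch of that arithmetic under the one-cell stride assumed in Table 3.1; `hog_dimension` is a hypothetical name.

```python
def hog_dimension(window_w, window_h, cell, block, bins, stride_cells=1):
    """Return (number of blocks, descriptor length) for a square-cell HOG layout."""
    cells_x, cells_y = window_w // cell, window_h // cell
    # Number of block positions when the block moves stride_cells cells at a time.
    blocks_x = (cells_x - block) // stride_cells + 1
    blocks_y = (cells_y - block) // stride_cells + 1
    n_blocks = blocks_x * blocks_y
    return n_blocks, bins * block * block * n_blocks

# Head detection settings: 30 * 30 window, 6 * 6 cells, 3 * 3 blocks, 9 bins.
print(hog_dimension(30, 30, cell=6, block=3, bins=9))   # (9, 729)
```

The same function can be called with other window and cell sizes to see how quickly the descriptor grows with the size of the detection window.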
From Table 3.1, it is observed that the performance of HOG descriptors for head
detection is not as good as for pedestrian detection, and is also lower than that of
the detector [3]. So the Viola and Jones approach is preferred over HOG for head
detection in densely crowded scenes.
3.5 Testing Results
Figures 3.1 and 3.2 show the performance of HOG on different images. From these
figures it can be observed that there are many false detections and missed detections.
Figure 3.1 Sample detections of a crowd facing away from the camera using HOG descriptors and linear SVM classifier
Figure 3.2 Sample detections for a crowd facing the camera using HOG descriptors and linear SVM classifier
CHAPTER 4
TRIANGLE FEATURES FOR HEAD DETECTIONS
4.1 Introduction
Features can encode ad-hoc information which is difficult to learn from raw pixels
with a small number of training samples. To use a feature-based detector in real time,
the features should be computationally efficient. One of the reasons for the success
of the face detector [1] is its speed of computation. In this chapter we introduce a
new type of feature, called the triangle feature, which is simple and fast to compute
and is a logical extension of the rectangle Haar-like features in [1]. The motivation
for this feature is that triangles can approximate diagonal curves better than
rectangles, as can be observed from Figure 4.1. In this figure the area under the
curve can be approximated by the triangle below the diagonal of the square. This
chapter shows by experiment that the combination of triangle features and rectangle
Haar-like features improves the performance of the detector [3].
Figure 4.1 Triangle feature approximation (figure labels: diagonal, curve)
4.2 Triangle Feature
The value of a triangle feature is the difference between the sums of the pixels within the two triangular regions of a square. Because the pixels on the diagonal belong to only one of the two regions, the pixels of a square cannot be divided equally between the two triangles. The sum of the pixels in the smaller triangle is therefore multiplied by the ratio of the two areas before it is subtracted from the sum of the pixels in the larger triangle. Example triangle features are shown below:
Figure 4.2 Example triangle features
The value of triangle feature = Sum of pixels in Area A1 – m * Sum of pixels in Area A2,
where 'm' is the ratio of the number of pixels in A1 to the number of pixels in A2.
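The definition above can be checked with a brute-force sketch in Python. The layout is an assumption made for illustration only: A1 is taken as the lower-left triangle including the main diagonal, A2 as the strictly upper-right triangle, and the function name `triangle_feature` is hypothetical. The sketch loops over every pixel, so it illustrates the definition rather than the fast evaluation of Section 4.3.

```python
from fractions import Fraction


def triangle_feature(window):
    """Brute-force value of one triangle feature on an n x n window (n >= 2).

    Assumed layout (illustrative only): A1 is the lower-left triangle
    including the main diagonal, A2 is the strictly upper-right triangle.
    """
    n = len(window)
    sum_a1 = sum(window[r][c] for r in range(n) for c in range(n) if c <= r)
    sum_a2 = sum(window[r][c] for r in range(n) for c in range(n) if c > r)
    # m is the ratio of the pixel counts of A1 and A2.
    m = Fraction(n * (n + 1) // 2, n * (n - 1) // 2)
    return sum_a1 - m * sum_a2
```

With this weighting, a window of uniform intensity gives a feature value of zero, which is the purpose of the factor 'm'.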
4.3 Triangle Feature Evaluation
To calculate the sum of the pixels in a triangular region of an image, an intermediate representation called the triangle image, which is similar to the integral image in [1], is used. The triangle image simplifies the evaluation of the features and hence speeds up detection. The triangle image at location (x, y) contains the sum of the pixels that lie on or above the line drawn at 45° from the pixel towards the upper right. Example triangle images are shown in Figure 4.3.
Figure 4.3 Example triangle images. The value of pixel at location 'P' is the sum of the pixels shown in shaded region.
The triangle image can be efficiently calculated in one pass of the image using the
following pair of recurrences:
s(x,y) = s(x,y-1) + i(x,y)
ti(x,y) = s(x,y) + ti(x+1,y-1)
where s(x,y) is the column sum of the input image and ti(x,y) is the pixel value of the
triangle image. The recurrences should be evaluated from top to bottom and from right
to left of the image.
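The pair of recurrences can be sketched in Python as follows. This is a minimal illustration, assuming the image is a list of rows indexed as `img[y][x]` and that values outside the image (s(x, -1), ti(x, -1) and ti(x, y) for x beyond the last column) are zero; the function name is hypothetical.

```python
def triangle_image(img):
    """Triangle image ti(x, y), returned as a list of rows ti[y][x].

    Implements s(x, y) = s(x, y-1) + i(x, y) and
    ti(x, y) = s(x, y) + ti(x+1, y-1), with out-of-image values taken as 0.
    """
    h, w = len(img), len(img[0])
    col_sum = [0] * w        # s(x, y) for the row currently processed
    prev = [0] * (w + 1)     # ti(., y-1); the extra entry is ti(w, y-1) = 0
    ti = []
    for y in range(h):                     # top to bottom
        cur = [0] * (w + 1)
        for x in range(w - 1, -1, -1):     # right to left
            col_sum[x] += img[y][x]
            cur[x] = col_sum[x] + prev[x + 1]
        ti.append(cur[:w])
        prev = cur
    return ti
```

For a 3 x 3 image of ones, the bottom-left entry is 3 + 2 + 1 = 6: three pixels from the first column, two from the second and one from the third.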
The evaluation of triangle features requires both the triangle image and the integral image. The computation of the integral image is explained in Section 2.3.2. The sum of the pixels within a triangle is calculated using 4 array accesses: two in the triangle image and two in the integral image. The following steps show how to find the sum of the pixels within the triangle shown in Figure 4.4.
Figure 4.4 Example triangle image to show the calculation of sum of the pixels within a triangle shown in the shaded region.
Step 1
Using the triangle image obtain the sum of pixels in two triangles P1 and P2. The
smaller triangle is P1 and the larger triangle is P2.
Figure 4.5 Two triangular regions P1 and P2
Step 2
Find sum of pixels in (P2 – P1). This gives an intermediate form as shown in Figure
4.6.
Figure 4.6 Difference of two triangular regions P1 and P2
Step 3
Using integral image obtain the sum of pixels in the two rectangular regions shown in
Figure 4.7 and represent them as P3 and P4.
Figure 4.7 Two rectangular regions P3 and P4
Step 4
Using the rectangle and triangle areas mentioned in the steps above, the sum of the
pixels in the triangle in Figure 4.4 can be evaluated as follows:
sum of pixels in triangular region (P2 - P1) - sum of pixels in rectangular region (P3 –
P4)
Figure 4.8 Sum of pixels in the triangle can be evaluated from the triangle and rectangular regions
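Steps 1 to 4 can be sketched in Python as below. The exact pixel conventions are assumptions, since the figures are not reproduced here: P2 is taken as the lower-left corner (x, y) of the triangle, P1 as the point (x + d, y - d) on the same 45° line, and P3 and P4 as integral-image values on the row of P1. Both tables are padded with zeros so that no boundary checks are needed, and all function names are hypothetical.

```python
def integral_image(img):
    """Padded integral image: ii[r][c] = sum of img rows 0..r-1, cols 0..c-1."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def padded_triangle_image(img):
    """Padded triangle image: ti[y + 1][x] holds ti(x, y); row 0 and column w are 0."""
    h, w = len(img), len(img[0])
    ti = [[0] * (w + 1) for _ in range(h + 1)]
    col_sum = [0] * w
    for y in range(h):
        for x in range(w - 1, -1, -1):
            col_sum[x] += img[y][x]
            ti[y + 1][x] = col_sum[x] + ti[y][x + 1]
    return ti


def triangle_sum(ii, ti, x, y, d):
    """Sum of the d*(d+1)/2 pixels in columns x..x+d-1 lying between
    row y (bottom edge) and the 45-degree line through (x, y).

    Requires 1 <= d <= y + 1 and x + d <= image width.
    """
    p2 = ti[y + 1][x]           # larger triangular region (Step 1)
    p1 = ti[y - d + 1][x + d]   # smaller triangular region (Step 1)
    p3 = ii[y - d + 1][x + d]   # rectangle up to the right edge (Step 3)
    p4 = ii[y - d + 1][x]       # rectangle up to the left edge (Step 3)
    return (p2 - p1) - (p3 - p4)  # Step 4: four array accesses in total
```

Under these conventions, P2 - P1 is a trapezoid made of the wanted triangle plus the rectangle above it, and P3 - P4 removes that rectangle.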
The number of operations required to find the sum of the pixels in a rectangle using the integral image is the same as the number required to find the sum of the pixels in a triangle using both the triangle image and the integral image. Computing the triangle image, however, is an additional cost that rectangle Haar features do not incur. Figure 4.9 shows the other orientations of the triangle feature. These features can be evaluated in the same way as explained above by using flipped input images.
Figure 4.9 Different triangle features
4.4 A Single stage AdaBoost classifier to reduce the false alarms
A simple discrete AdaBoost classifier, as explained in Section 2.2, is built with the rectangle Haar-like features and the triangle features as weak classifiers to reduce the false alarms. This single-stage classifier is trained using the same positive samples that were used for training the head detector [3], together with additional negative samples. During head detection, the head detector [3] is used as the initial detector; its outputs are resized to 24*24 pixels using bilinear interpolation and passed as inputs to the triangle feature classifier. The output of this chain gives the heads in the image, as shown in Figure 4.10.
Figure 4.10 Triangle feature detector
A cascade of classifiers is normally used to speed up the detection process: negatives are rejected with few computations, and more computations are performed over sub-windows that have a high probability of being positive. The objective of the classifier described here is only to reduce the false alarms of the detector [3]; it is not involved in the initial detection over the image. The number of inputs to this classifier is therefore usually very small compared to the number of sub-windows in an image, and a single-stage classifier is sufficient.
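The verification stage can be sketched in Python as below. This sketch assumes the Viola-Jones form of the discrete AdaBoost decision rule (a weighted vote of threshold stumps compared against half the total weight), which may differ in detail from the formulation in Section 2.2. The stump tuple layout, `strong_classify`, `filter_detections` and the `extract_features` callback are hypothetical names introduced for illustration.

```python
def strong_classify(feature_values, stumps):
    """Return True if the weighted vote of the weak classifiers accepts the window.

    stumps: list of (feature_index, threshold, polarity, alpha) tuples, where
    polarity is +1 or -1 and alpha is the weight learned by AdaBoost.
    """
    vote_sum = 0.0
    alpha_sum = 0.0
    for index, threshold, polarity, alpha in stumps:
        # A weak classifier votes 1 when the feature lies on its positive side.
        vote = 1 if polarity * feature_values[index] < polarity * threshold else 0
        vote_sum += alpha * vote
        alpha_sum += alpha
    return vote_sum >= 0.5 * alpha_sum


def filter_detections(candidates, extract_features, stumps):
    """Keep only the candidate windows accepted by the single-stage classifier.

    candidates: outputs of the initial detector, already resized to 24*24.
    extract_features: callable mapping a window to its list of feature values.
    """
    return [c for c in candidates if strong_classify(extract_features(c), stumps)]
```

Because only the outputs of the initial detector reach `filter_detections`, the cost of evaluating every weak classifier on each input stays small.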
4.5 Results
The following tables show the detection rate and false alarm rate of the detector [3], HOG and the triangle feature detector for head detection in densely crowded scenes:
Method                             Detection %    False Alarm %
AdaBoost rectangle Haar feature    85             26
HOG                                66             39
Table 4.1 Detection rate and false alarm rate of detector [3] and HOG for head detections
Number of features    Detection %    False Alarm %
3                     82.22          18
10                    83.94          21.63
30                    84.51          22.67
50                    84.46          22.76
Table 4.2 Detection rate and false alarm rate of Triangle feature detector for head detections
4.6 Sample outputs
Figures 4.11 and 4.12 show the output of the detector [3] and the detections that are removed by the triangle feature classifier.
Figure 4.11 Heads facing away from the camera. The red and blue circles are the detections of the detector [3]; the blue circles are the detections rejected by the triangle feature detector.
Figure 4.12 Heads facing the camera. The red and blue circles are the detections of the detector [3]; the blue circles are the detections rejected by the triangle feature detector.
4.7 Triangle Feature Limitations and Possible Extensions
In the face detector [1], the rectangle features are scaled, instead of the image, to find faces of different scales. This is one of the main reasons for the speed of the face detector [1]. When a rectangle feature is scaled, the threshold associated with it is scaled as well. The disadvantage of the triangle features is that they cannot be scaled in this way, because the threshold associated with the feature cannot be scaled. When a triangle feature is scaled by a factor 'a', the proportionality constant 'm' (the ratio of the number of pixels in A1 to the number of pixels in A2) is not scaled by 'a'. Instead, 'm' has to be recalculated for the triangle feature of the new dimension, because the pixel counts of the two triangles inside the square are not scaled by the same factor. The triangle feature scaled by 'a' is therefore a different feature, and its threshold cannot be obtained using the scale factor 'a'. This is shown for the triangle feature in Figure 4.13 by the derivation below:
Figure 4.13 Triangle feature depicted in a square of area n * n
Number of pixels in the square = n * n
Number of pixels in triangle A1 = n * (n + 1) / 2
Number of pixels in triangle A2 = n * (n - 1) / 2
m = Area of triangle A1 / Area of triangle A2
  = Number of pixels in triangle A1 / Number of pixels in triangle A2
⇒ m = (n + 1) / (n - 1)
Let 'a' be the scale factor for the feature.
Number of pixels in the scaled square = (a * n) * (a * n)
Number of pixels in scaled triangle A1 = (a * n) * ((a * n) + 1) / 2
Number of pixels in scaled triangle A2 = (a * n) * ((a * n) - 1) / 2
m_scaled = ((a * n) + 1) / ((a * n) - 1)
m_scaled / m = (((a * n) + 1) * (n - 1)) / (((a * n) - 1) * (n + 1)) ≠ a
The relation m_scaled / m ≠ a shows that the proportionality constant, and hence the threshold, of a scaled triangle feature cannot be obtained from that of the original feature using the scale factor. The triangle feature in Figure 4.13 is therefore not scalable.
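The dependence of 'm' on the feature size can be checked numerically with a short Python sketch. The window size n = 24 and the scale factor a = 2 are example values chosen here for illustration, and the function name is hypothetical.

```python
from fractions import Fraction


def proportionality_constant(n):
    """m = (n + 1) / (n - 1) for a triangle feature in an n x n square (n >= 2)."""
    return Fraction(n + 1, n - 1)


n, a = 24, 2                                  # example values only
m = proportionality_constant(n)               # 25/23
m_scaled = proportionality_constant(a * n)    # 49/47
ratio = m_scaled / m                          # neither 1 nor a
```

Since 'm' changes with n, it has to be recomputed for every feature size, as the derivation states.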
Triangle features can be made scalable if they are represented as shown in Figure 4.14. From the figure it can be observed that the triangles A1 and A2 contain the same number of pixels, so the proportionality constant 'm' is not needed. The threshold for this feature can therefore be scaled, and the feature can be used in the same way as the rectangle features in [1].
Figure 4.14 Scalable Triangle feature
CHAPTER 5
CLASSIFICATION OF HUMAN HEAD IMAGES BASED ON ORIENTATION
5.1 Introduction
Feature-based detectors evaluate each feature at the same pixel locations over all the training samples, which share the same resolution. For a feature to classify the training samples properly, the ranges of its values for the positive and the negative samples should be separated. If the feature values vary greatly over the positive training samples, the values evaluated over the negative samples may lie between those of the positive samples, and such a feature cannot separate the two classes. Structural variations among the training samples, e.g. heads of different orientations, cause the values of the majority of the features to vary greatly, and the thresholds of the selected features are then poor. If the positive samples are divided according to structural similarity and each group is trained separately, features can be selected independently for each group, which helps to increase the detection rate and reduce the false alarm rate. An example of such a detector is the multi-view face detector [16]. To train such a detector, the training samples have to be classified based on structural similarity, which is defined according to the object to be detected. For example, in [16] structural similarity is defined by the out-of-plane rotation angle, and human faces are divided into 5 categories: full left profile, half left profile, frontal, half right profile and full right profile. Similarly, human heads can be coarsely classified by orientation (facing right, facing left, facing the camera and back of the head) and can be divided more finely by the angle of orientation.
Instead of specifying requirements for manual classification, a new approach that classifies the heads automatically is developed in this project. Automatic classification greatly reduces the time required and also helps in understanding the problem, i.e., how close or far apart heads of different orientations are. Gradient histogram information appears well suited to classifying objects by orientation, so the HOG descriptor is tested for head classification based on orientation. To calculate a HOG descriptor, an image is divided into blocks, an orientation histogram is calculated for each block separately, and the histograms are concatenated to form the image descriptor. Dividing an image into blocks separates structurally different regions of the object. When images with slightly different orientations are divided into blocks, a given block therefore covers approximately the same structural region in all the images, which helps in selecting features that classify the training samples properly. In this experiment, natural grouping is preferred over fixing the number of orientations in advance.
5.2 Overview of Head Classification Based on Orientation
Although the results obtained from HOG are not impressive for head detection, the gradient orientation information is still useful for classifying heads. The proposed approach uses gradient orientation histograms and the Bhattacharyya distance to classify heads based on orientation. Image descriptors are formed for all the training samples from gradient orientation histograms and are clustered using the Bhattacharyya distance. Each cluster is expected to contain only heads of a particular orientation.
5.3 Formation of Image Descriptor
From the surveillance videos, it is observed that the size of a head ranges between 20*20 and 40*40 pixels. Before the image descriptors are created, all the images are resized to a common size. The midpoint of the range is an intuitive choice, so all the training images are resized to 30*30 pixels. Each resized image is divided into four quadrants of 15*15 pixels, and a gradient orientation histogram of 9 bins is computed for each quadrant. Based on the analysis done for the HOG descriptor (see Section 3.4 for a detailed description), overlapping blocks do not improve the performance for human head classification. The object is also very small in this application, so the image is divided into only 4 non-overlapping blocks. Each block thus contributes 225 pixels; fewer than 225 pixels appears to be too few to form a histogram of 9 bins. After each histogram is computed, it is normalized to a pdf, which compensates for brightness variation between images. The concatenation of the four block histograms forms the descriptor. If the normalization were applied after concatenation, a histogram peak in one block would affect the other blocks, so each block histogram is normalized before concatenation.
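The descriptor described above can be sketched in Python as follows. Several details are assumptions not stated in the text: gradients are taken as central differences clamped at the image border, orientations are unsigned (0° to 180°), and each pixel votes with its gradient magnitude. The function names are hypothetical, and the input is assumed to be a grayscale image already resized to 30*30, stored as a list of rows.

```python
import math


def block_histogram(img, top, left, size, bins=9):
    """Gradient orientation histogram of one block, normalized to sum to 1."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(top, top + size):
        for x in range(left, left + size):
            # Central differences, clamped at the image border (assumption).
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            index = min(int(angle / (180.0 / bins)), bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    # A completely flat block has no gradient and is left as all zeros.
    return [v / total for v in hist] if total > 0 else hist


def head_descriptor(img, bins=9):
    """Concatenate the normalized histograms of the four quadrants."""
    size = len(img) // 2
    descriptor = []
    for top in (0, size):
        for left in (0, size):
            descriptor.extend(block_histogram(img, top, left, size, bins))
    return descriptor
```

For a 30*30 input this gives a descriptor of 4 * 9 = 36 values in which every block with a non-zero gradient sums to 1, since each block is normalized before concatenation.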
5.4 Classification Method
5.4.1 Bhattacharyya Coefficient
The Bhattacharyya distance measures the similarity of two discrete probability
distributions. It is normally used to measure the separability of classes in classification.
For discrete probability distributions p and q over the same domain X, it is defined as:
D_B(p,q) = -ln(BC(p,q)),
where
BC(p,q) = Σ_{x ∈ X} (p(x) q(x))^0.5
The Bhattacharyya coefficient (BC(p,q)) is an approximate measurement of the amount
of overlap between two statistical samples. The coefficient can be used to determine
the relative closeness of the two samples being considered.
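The two quantities can be computed with a few lines of Python. This is a minimal sketch for discrete distributions given as equal-length lists of probabilities; the function names are hypothetical, and returning infinity for non-overlapping distributions is a convention chosen here.

```python
import math


def bhattacharyya_coefficient(p, q):
    """BC(p, q): sum over the domain of sqrt(p(x) * q(x))."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same domain")
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def bhattacharyya_distance(p, q):
    """D_B(p, q) = -ln(BC(p, q)); infinite when the distributions do not overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0 else -math.log(bc)
```

Identical distributions give a coefficient of 1 and a distance of 0, while distributions with disjoint support give a coefficient of 0.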
5.4.2 Clustering
In pattern recognition applications, one way to find natural groupings is clustering. Clustering can be defined as “the process of organizing objects into groups whose members are similar in some way”. The members of a group or cluster are similar to one another and dissimilar to the members of other groups. A “similarity measure” quantifies how much more alike the samples within one cluster are than samples from different clusters. No single “similarity measure” suits all problems; the user must define it based on the problem at hand. Several clustering algorithms are available: K-means, Fuzzy C-means (FCM), Hierarchical Clustering, Mixture of Gaussians, etc. Fuzzy C-means is a clustering method that allows one piece of data to belong to two or more clusters, with a degree of membership in each of the clusters to which it belongs.
5.4.3 Classification Algorithm
Image descriptors are formed for all the samples as described in Section 5.3. Since the objective of this experiment is to find natural groupings, clustering is a reasonable choice. Fuzzy c-means clustering with the Bhattacharyya coefficient as the similarity measure is used for classification. The descriptor is a concatenation of four pdfs, so the maximum Bhattacharyya coefficient between two descriptors is 4, i.e. 1 for each pdf. The basic assumption of this method is that the Bhattacharyya coefficient is large between training samples of the same orientation and small between samples of different orientations. The outputs of clustering are the cluster centers and the degree of membership of every training sample in each cluster. Each cluster is expected to contain samples of a specific orientation. The number of clusters has to be decided iteratively based on the experimental results.
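The membership computation can be sketched in Python as below. The sketch covers only the membership step of fuzzy c-means for given cluster centers, not the full iterative algorithm. Two choices are assumptions made for illustration: the similarity is converted into a dissimilarity as (number of blocks - summed coefficient), and the fuzzifier is set to 2. The function names are hypothetical.

```python
import math


def descriptor_similarity(d1, d2, bins=9):
    """Sum of the per-block Bhattacharyya coefficients (at most one per block)."""
    return sum(math.sqrt(p * q) for p, q in zip(d1, d2))


def fuzzy_memberships(sample, centers, fuzzifier=2.0, bins=9):
    """Degrees of membership of one sample in each cluster (they sum to 1)."""
    blocks = len(sample) // bins
    # Assumed dissimilarity: maximum possible similarity minus actual similarity.
    dists = [max(blocks - descriptor_similarity(sample, c, bins), 0.0) for c in centers]
    exact = [i for i, d in enumerate(dists) if d < 1e-12]
    if exact:
        # The sample coincides with one or more centers: share membership among them.
        return [1.0 / len(exact) if i in exact else 0.0 for i in range(len(centers))]
    power = 2.0 / (fuzzifier - 1.0)
    # Standard fuzzy c-means membership formula.
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
```

A sample that is equally similar to every center receives equal memberships, which is the pattern reported for part of the data in Section 5.5.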
5.5 Observation
The training is started with 4 clusters and the degree of membership is observed for each of the training samples. For 13% of the samples, the degree of membership is shared equally among all the clusters, and for 26% of the samples it is shared equally among 3 of the clusters. As the number of clusters is increased, the degree of membership is shared approximately equally among more clusters. When the number of clusters is reduced to 2, more reliable degrees of membership are obtained. This suggests that only 2 natural groups can be found in this data set using the Bhattacharyya coefficient similarity measure. The samples fall into two groups, frontal and back of the head, as shown in Figures 5.1 and 5.2, respectively.
5.6 Sample Results
The following figures show how the algorithm classifies human head images into
different clusters based on orientation.
Figure 5.1 Images which are closer to the cluster center 1
Figure 5.2 Images which are closer to the cluster center 2
CHAPTER 6
CONCLUSIONS
Tracking people in densely crowded scenes is a complicated task in the field of computer vision and has large potential in surveillance applications. With the advances in the computing power of processors, this task can be done with reasonable accuracy. Head detection is explored in this project as a step towards this task. The current best performing head detection algorithm [3] is chosen and improved in this work by reducing its false alarms. We also discussed possible improvements of the detector [3] using the multi-view face detector approach [16], and an initial step was taken by developing a novel method for head classification based on orientation.
The potential of current state-of-the-art object detectors is analyzed for head detection in densely crowded scenes. A modified version of the pedestrian detector [2] was tested and compared with the available head detector [3]. Based on the test results, [3] was chosen and improved by reducing its false alarms. A new type of feature called “triangle features”, similar to the rectangle Haar-like features, was developed. The usefulness of this feature was shown experimentally by the reduction in the false alarms of the detector [3] from 26% to 18%.
The appearance of human heads differs structurally between viewpoints or orientations, so a multi-view object detector is very useful for head detection in densely crowded scenes. A novel method based on HOG was developed to automate the initial process of classifying training samples by orientation for training a multi-view detector. The performance of this method was shown by classifying human head images into frontal and rear heads. This generic method can be adapted for classifying any object groups.
REFERENCES
1. Paul Viola, Michael J. Jones: Robust Real-Time Face Detection. In: ICCV, vol. 20(11),
pp. 1254-1259 (2001)
2. Navneet Dalal and Bill Triggs: Histograms of oriented gradients for human detection.
In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2 (2005)
3. Chern-Horng Sim, Ekambaram Rajmadhan and Surendra Ranganath: A Two-Step
Approach for Detecting Individuals within Dense Crowds. In: AMDO: 166-174 (2008)
4. Jerome Friedman, Trevor Hastie, Robert Tibshirani: Additive Logistic Regression: A statistical view of boosting. Technical Report, Department of Statistics, Stanford University.
5. Yoav Freund and Robert E. Schapire: A Short Introduction to Boosting. In: Journal of
Japanese Society for Artificial Intelligence, vol. 14(5), pp. 771–780, (September 1999)
6. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: IEEE
Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 878–885 (2005)
7. Rittscher, J., Tu, P.H., Krahnstoever, N.: Simultaneous estimation of segmentation and
shape. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp.
486–493 (2005)
8. Zhu, Q., Yeh, M.C., Cheng, K.T., Avidan, S.: Fast human detection using a cascade of
histograms of oriented gradients. In: IEEE Conference on Computer Vision and Pattern
Recognition, vol. 2, pp. 1491–1498 (2006)
9. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic
assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004.
LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004)
10. Tuzel, O., Porikli, F.M., Meer, P.: Human detection via classification on riemannian
manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1
(2007)
11. Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans by
bayesian combination of edgelet based part detectors. In: International Journal of
Computer Vision, vol. 75(2), pp. 247–266 (2007)
12. Casas, J.R., Sitjes, A.P., Folch, P.P.: Mutual feedback scheme for face detection and
tracking aimed at density estimation in demonstrations. In: Vision, Image and Signal
Processing, vol. 152(3), pp. 334–346 (2005)
13. Brostow, G.J., Cipolla, R.: Unsupervised bayesian detection of independent motion in
crowds. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp.
594–601 (2006)
14. Rabaud, V., Belongie, S.: Counting crowded moving objects. In: IEEE Conference on
Computer Vision and Pattern Recognition, vol. 1, pp. 705–711 (2006)
15. Viola, P.; Jones, M.J.; Snow, D: Detecting Pedestrians Using Patterns of Motion and
Appearance. In: ICCV, vol. 1, pp. 734–741 (2003)
16. Chang Huang, Haizhou Ai, Bo Wu, Shihong Lao: Boosting Nested Cascade Detector for Multi-View Face Detection. In: ICPR, vol. 2, pp. 415-418 (2004)
17. http://wapedia.mobi/en/Haar-like_features
18. Lienhart, R. and Maydt, J., "An extended set of Haar-like features for rapid object
detection", In: ICIP, pp. 900-903 (2002)