
DETECTING HUMANS IN VIDEO SEQUENCES USING STATISTICAL COLOR AND SHAPE MODELS

BY

IVÁN R. ZAPATA

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2001


ACKNOWLEDGMENTS

I would like to thank my parents for their endless support, guidance, and love. Their

unbounded faith makes nothing seem unconquerable. I would also like to thank Dr. Michael

Nechyba for sharing his extensive knowledge, as well as for his patience and time. I would like to

thank Dr. A. Antonio Arroyo for giving me the opportunity to find and explore new fields. Special

thanks go to Dr. Keith Doty who gave me the tools to explore these fields, and allowed me the free-

dom to do so on my own terms. Thanks go to Scott Nichols for providing an avenue to express and

develop new ideas, and delivering undissembled feedback at all times. Additional thanks go to

Clara Zapata, Eric Henderson, and Hadar Vinayi for their editorial input and moral support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION
    1.1 Objectives
    1.2 Related Work
    1.3 Approach

2 STATISTICAL CLASSIFIERS FOR SKIN DETECTION
    2.1 Introduction
    2.2 Skin and Non-Skin Classes
    2.3 Linear Classifier
    2.4 Bayes Classifier
    2.5 Nearest Neighbor Classifier
    2.6 Vector Quantization
    2.7 Comparison of Classifier Results
    2.8 Designing a Skin Detector

3 OBJECT SEGMENTATION AND CLASSIFICATION
    3.1 Overview
    3.2 Motion Segmentation
    3.3 Blob Detection
    3.4 Initial Color Classification
    3.5 The Object List
    3.6 Processing Human Objects
    3.7 Processing Ball Objects

4 RESULTS AND DISCUSSION
    4.1 Experimental Setup
    4.2 Motion Detection
    4.3 Color Classification
    4.4 Human Detection
    4.5 Human Representation
    4.6 Ball Processing
    4.7 Future Work
    4.8 Conclusions

APPENDICES

A THE HSV COLOR SPACE

B ELLIPTICAL BLOB REPRESENTATION

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF FIGURES

1-1 Basketball scene capture system diagram.
2-1 Distribution of skin (red) and non-skin (blue) data in color space.
2-2 Data histogram mapped onto discriminant function.
2-3 Training data (a) raw and (b) quantized.
2-4 Skin/non-skin results for statistical classifiers.
3-1 Sample reference image.
3-2 Sample (a) original image with corresponding (b) difference image.
3-3 Detected moving blobs.
3-4 Sample training data for (a) ball and (b) skin classes.
3-5 Pixel classification and corresponding binary masks.
3-6 Detected objects: (a) ball, (b) humans, and (c) unknown.
3-7 Body part search regions in the human cardboard model.
3-8 Inaccurate arm representation.
3-9 Corrected arm representation.
4-1 Object pixels are misclassified due to a high differencing threshold.
4-2 Several small motion blobs.
4-3 Incorrect pixel masks due to shadows.
4-4 Sample video sequence frames for color model training.
4-5 Decision volumes for skin.
4-6 Decision volumes for two skin models.
4-7 Color classification.
4-8 Human detection.
4-9 Human detection failure mode.
4-10 Separation of skin regions.
4-11 Results of ball detection.
4-12 Correction.
A-1 HSV Color Cone.
B-1 Sample pixel blob.
B-2 Blob with representative ellipse.

LIST OF TABLES

2-1 Combinations of skin and non-skin prototype vectors.
2-2 Performance comparison of statistical classifiers.
4-1 Error rates on a validation data set given a Mahalanobis distance threshold of 6.

Abstract of Thesis Presented to the Graduate School

of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

DETECTING HUMANS IN VIDEO SEQUENCES USING

STATISTICAL COLOR AND SHAPE MODELS

By

Iván R. Zapata

August 2001

Chairman: Dr. Michael C. Nechyba

Major Department: Electrical and Computer Engineering

We present a method for detecting and classifying humans and other objects from video

images in a known environment. Specifically, we have developed and implemented part of a real-

time scene reconstruction system for live basketball games. This thesis covers the problem of

detecting known objects and persons in real-time within the given environment using nonintrusive

techniques. In other words, our object detection requires no markers or transmitters on the sub-

jects, and no special equipment other than a common video camera and personal computer. The

system detects foreground objects using prior knowledge of a scene’s background, and classifies

these objects using pre-determined statistical color models into three categories: human, basket-

ball, or other. These color models consist of several unimodal Gaussian distributions which repre-

sent each player’s skin, the basketball, and any other objects to be recognized. For detected

humans, a shape model detects the head and limbs, and provides a rough estimation of a player’s

pose. The system runs on a Pentium II 450MHz computer, with run-time speeds as fast as 10

frames per second, depending on the input image size.

CHAPTER 1 INTRODUCTION

1.1 Objectives

Technologies for tracking persons in a given environment have been explored in varied

forms, and for several types of applications. Video surveillance and security immediately come to

mind, and in fact a great deal of research has been driven by these fields, such as the systems outlined by Haritaoglu et al. [3] and Lipton et al. [6]. A system that can find, track and identify people

is of obvious value for such an application, especially if the system is able to do these things with-

out specialized hardware, other than already existing video cameras. The entertainment field also

has plenty of ways to exploit such a system, one of which is motion capture for video games,

motion pictures or sports. Our work is aimed at applications in the latter field, specifically, for

motion capture of televised basketball games. We envision a system consisting of several video

cameras capturing different views of the same game. Each camera is connected to a computer

which recognizes elements in the camera’s view, such as the players, referees, basketball, and the

court. Given enough views, a computer recreates the three-dimensional scene from several two-

dimensional camera views, as shown in Figure 1-1. The three-dimensional scene data can be trans-

mitted over a low-bandwidth channel, such as a telephone line, for reconstruction on a personal

computer. This research focuses on the two-dimensional scene reconstruction of raw video images.

There are a few restrictions imposed by the nature of this application, most notably that any

approach must be completely unobtrusive to the game, and must work in real-time. There must be

no markers or transmitting devices on the players, ball, or court, as they would likely interfere with


the game. Therefore, image capture must be from video cameras alone, which makes this a com-

plex computer vision problem. Furthermore, the vision algorithms must be computationally effi-

cient, as we wish to transmit the game information as close as possible to a video camera’s frame

rate of thirty frames per second. At the same time, the restricted environment allows for some sim-

plified processing, and exploitation of prior knowledge, such as the size and appearance of the

court, as well as information about each player’s appearance.

Figure 1-1: Basketball scene capture system diagram. (Blocks: Raw Video Frame #1 through Raw Video Frame #n, 2-D Scene Recognition, Fuser, Time delay, 3-D Scene Reconstruction.)

1.2 Related Work

1.2.1 Skin Detection

Face and gesture recognition systems have driven much of the development of skin detection algorithms, with a focus on systems that recognize a wide range of skin tones under different lighting conditions. The system proposed by Raja et al. [7] detects skin regions in real-time through a mixture-of-Gaussians model in color space, for potential gesture- and face-recognition applications. Starting with a single Gaussian, expectation maximization (EM) fits the training data to the first-order model, and the log-likelihood of a validation set is computed and recorded. This process is repeated, increasing the model order by one at each step, until a peak in the log-likelihood function is found, which corresponds to the appropriate mixture of Gaussians. To minimize the effect of shadows and lighting intensity changes, color data is sampled from HSV space (see Appendix A), ignoring the “value” channel. To account for changes in lighting over time, the mixture model is dynamically adapted at each new frame, using discounted weighted sums of the model’s parameters (mean and covariance) at previous time steps. By customizing the order and Gaussian parameters for each color model, this system can robustly track skin regions over widely changing conditions, and could easily be adapted to track any other type of (non-skin) object with an arbitrary color distribution.
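As a rough illustration of why discarding the “value” channel reduces sensitivity to lighting intensity, the following Python sketch converts RGB pixels to hue and saturation with the standard-library colorsys module. The function name hue_saturation and the sample pixel values are assumptions made for this example; they are not taken from the system of Raja et al. [7].

```python
import colorsys


def hue_saturation(rgb):
    """Map an 8-bit (R, G, B) pixel to (hue, saturation), discarding value."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, _value = colorsys.rgb_to_hsv(r, g, b)
    return h, s


# Hypothetical pixel pair: the second is the first at half the intensity,
# as might happen when the same surface falls into shadow.
bright = hue_saturation((200, 120, 80))
dark = hue_saturation((100, 60, 40))
```

Under these assumptions the two pixels map to the same hue and saturation, so a color model built on those two channels would treat them alike.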

While tracking a single person, first-order Gaussian distribution models for skin perform comparably to the multimodal system of Raja et al. [7], since skin hue for an individual is essentially constant. Darrell et al. [1] use a single Gaussian probability model in log-opponent space to classify skin-colored pixels. The log-opponent space is a transformation of RGB values into tuples of the form $\left(\log G,\ \log R - \log G,\ \log B - (\log R + \log G)/2\right)$, which describe the hue of a pixel. User-defined thresholds in this space determine whether pixels are recognized as skin.

1.2.2 Motion Detection and Object Classification

The system proposed by Lipton et al. [6] implements a motion-detection scheme that relies on temporal differencing of frames in a video stream. Consecutive frames are subtracted from each other and the result is run through a threshold function, leaving regions that are considered moving objects. Moving objects are classified using a dispersedness metric, which essentially relates the complexity of an object’s outline to its total area. Shape statistics (perimeter and area) are gathered over time for each moving region, and are then used to classify the region using a Maximum Likelihood Estimation (MLE) technique over several frames. To account for the possible lack of spatial constancy of objects over time, or possible ambiguities in an object’s shape, multiple hypotheses are postulated for the object’s class at each frame. A classification histogram is computed for each frame, and no decision is made on the object’s class until peaks in this histogram have remained for some given amount of time, resulting in a high-confidence decision. An advantage of this method is that it is robust to transient or random background motion, since its classification histograms will likely not remain constant over many frames.
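A minimal sketch of temporal differencing followed by a shape statistic is given below, assuming grayscale frames stored as nested lists. The helper names, the threshold value, the edge-counting perimeter, and the definition of dispersedness as perimeter squared over area are assumptions made for illustration; they are not taken from the implementation of Lipton et al. [6].

```python
def temporal_difference(prev, curr, threshold):
    """Return a binary mask: 1 where |curr - prev| exceeds the threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def dispersedness(mask):
    """Perimeter squared over area for the foreground pixels of a mask.

    The perimeter is counted as the number of foreground pixel edges that
    touch background or the image border (an assumed convention).
    """
    rows, cols = len(mask), len(mask[0])
    area = perimeter = 0
    for y in range(rows):
        for x in range(cols):
            if not mask[y][x]:
                continue
            area += 1
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < rows and 0 <= nx < cols)
                if outside or not mask[ny][nx]:
                    perimeter += 1
    return perimeter ** 2 / area if area else 0.0


# Hypothetical 4x4 frames: a 2x2 bright object appears in the second frame.
frame_a = [[0] * 4 for _ in range(4)]
frame_b = [[0, 0, 0, 0], [0, 100, 100, 0], [0, 100, 100, 0], [0, 0, 0, 0]]
moving = temporal_difference(frame_a, frame_b, threshold=30)
shape_score = dispersedness(moving)
```

A compact blob such as this square yields a low score, while an object with a long, irregular outline relative to its area would score higher.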

Schneiderman [11] has applied a generalizable method of object detection to the segmentation of faces and cars in images. Statistical distributions are built for each object, $P(\text{image} \mid \text{object})$, and for the rest of the training image, $P(\text{image} \mid \text{non-object})$, so that the class of an object in a new image is computed using a ratio test:

$$\frac{P(\text{image} \mid \text{object})}{P(\text{image} \mid \text{non-object})} > \lambda, \qquad \lambda = \frac{P(\text{non-object})}{P(\text{object})} \tag{1-1}$$

If the ratio on the left side is greater than $\lambda$, the object is determined to be present. Since the true form of the distribution (e.g., Gaussian, Poisson, multimodal) is not really known for an object or an image, histograms are used to estimate it. The histograms are built from seventeen visual features extracted through a wavelet transform, which has localized components in space, frequency and orientation.

1.2.3 Human Detection

Prior knowledge of the appearance or physical characteristics of the human body can be used to build templates or spatial models to aid in human detection. Such a model might, for example, include information about where different body parts are in relation to the rest of the body, or information about the shape of a person’s silhouette while walking.

W4 [3] is a system designed to function in nighttime or low-light surveillance applications, and uses exclusively monochromatic images to detect and track people. Haritaoglu et al. therefore chose not to rely on color information, which is unavailable in infrared or low-light cameras, but instead on shape information. The system obtains video images directly from a stationary infrared or low-light camera, and all processing occurs under the assumption that the camera is never moved, panned or zoomed, as the algorithm does not generalize well to a moving camera. Foreground regions are detected by background analysis and background differencing for each new frame. Objects are classified with second-order motion models, which describe the motion of a person and all their parts. This motion model is refined over time, and is used as a temporal template to aid in detection and tracking of the same object in later frames. W4 finds and tracks people’s parts (head, torso, hands, and legs) through a combination of template matching and spatial heuristics, based on the cardboard model in [4].

Pfinder [14] is another real-time system which uses trained statistical models of a person’s

appearance to track a single user in a known, static environment. Pfinder represents large color

regions as two-dimensional Gaussian blobs in the image space, and classifies body parts (head,

hands, torso) using five-dimensional feature vectors. These feature vectors are trained by incorpo-

rating color information in the YUV color space with the pixel coordinates of a point. A feature

vector is built for each body part to be tracked.

While the work of Rigol et al. [8] detects humans based on shape information, its shape

models are generated dynamically from input data, without prior knowledge about the human

form. This work uses a stochastic technique of so-called Pseudo-2D Hidden Markov Models

(P2DHMMs), based on the Hidden Markov Models made popular in speech recognition applica-

tions. To form the P2DHMM, the columns of an image are scanned and arranged as a vector; then

a single HMM models the relationships between adjacent image blocks (pixels) which are treated

as states in a Markov chain. After each column in the image has an HMM associated with it, each

HMM is treated as a single state and transitions between them are extracted. Thus, the P2DHMM

is essentially a series of nested HMMs. For training, this algorithm runs on a set of hand-seg-


mented training images, where each pixel is previously marked as a “person” or “other” pixel. The

P2DHMM labels certain states as “person” states, and others as “other” states during training, so

that when the algorithm runs on a new image, pixels in the image are recognized as “person” pixels

or “other” pixels. Since there is no use of color information, this system is indifferent to lighting

changes, as well as skin tones or clothing. Furthermore, since the segmentation of humans is

entirely static, i.e., there is no motion segmentation step or shape information from previous

frames, the system is also quite stable during camera motion or zoom, and is quite robust to mov-

ing objects that are not humans.

1.3 Approach

We make use of prior knowledge about the color and shape of any objects that will be

present in the video image, namely the court, players, and basketball. We describe each of these

through statistical color models and pre-defined shape templates. Statistical models allow for large

noise and variation margins, as well as the flexibility to describe many different types of objects.

We build these models off-line, prior to any real-time processing, in order to expedite on-line pro-

cessing times. As each frame is captured in real-time, we detect moving objects in the scene, and

calculate their position and size. A set of detectors and classifiers determines whether each of these

objects is a human, a basketball, or something else. Each type of object is further processed by spe-

cialized algorithms to clean up noise, correct misclassified objects, or further classify each type of

object. The output of the system will be a list of objects with their specific attributes, such as size,

position and the relative positions of limbs in humans.

This thesis is organized as follows. Chapter 2 explains the development of our skin detection

algorithm. We discuss our initial work classifying pixels into skin and non-skin classes using dif-

ferent types of statistical models. We show the relative merits of four such classifiers, as well as the

representational properties of different types of statistical and memory-based models. From these


comparisons we justify our final choice of statistical color representation, and derive the algorith-

mic basis for the skin detection algorithm.

Chapter 3 describes the process of detecting and classifying objects from a raw video image.

We explain our motion segmentation strategy, which is similar to the background subtraction

methods of Haritaoglu et al. [3] and Wren et al. [14]. We describe the different types of noise created by the background subtraction method, and strategies for reducing each type of noise. The

skin detector from Chapter 2 is explained in further detail, and its application to other types of

color classification is demonstrated. We describe how we use the output from these color classifi-

ers as a basis for object classification, where we classify all detected objects as human, ball, or

unknown. Lastly, we outline some specialized algorithms for processing possible errors in detected

human and ball objects.

Chapter 4 presents experimental results for each of the systems implemented in Chapters 2

and 3. We present both qualitative and quantitative results from each stage of processing, and dis-

cuss error rates, failure modes, and computational efficiency for each. We propose viable solutions

for each algorithm’s modes of failure, as well as methods to expand the work to other applications

of human and object detection. We submit that this work is a very effective open-loop object detec-

tion scheme, and that as part of the closed-loop feedback system of Figure 1-1, system robustness

could be increased significantly.

CHAPTER 2 STATISTICAL CLASSIFIERS FOR SKIN DETECTION

2.1 Introduction

A reliable skin detection algorithm is critical to this research, as the performance of the skin

detector directly determines the accuracy of the entire system. We must be able to differentiate skin

regions from all other objects in an image, and we need to detect different skin tones under varied

lighting conditions. Here we attempt to solve this detection problem with a statistical classifier, by

comparing statistical models to decide whether skin regions are present in an image or not.

2.2 Skin and Non-Skin Classes

We initially build two classes from a set of training images: a “skin” class and a “non-skin” class. Every pixel in the training images is marked manually¹ as skin-colored or not skin-colored. Points in the “skin” class represent pixel color values that are skin-colored. The “non-skin” class represents all other pixels in an image, so that the classifier is in fact acting as a skin detector, by “detecting” pixels which are skin-colored. Pixel data is captured in HSV space (see Appendix A), and is shown in Figure 2-1. Data points are 3-dimensional vectors of color values, $\mathbf{x} = \{H, S, V\}^T$. The skin class contains just over 5,000 data points, and the non-skin class over 12,000, which is sufficient training data to build an accurate statistical model for each class.

¹ There exist automatic methods of region-of-interest demarcation from training images that are superior to hand-segmentation. Recent work in our lab has demonstrated a simplified method for segmentation of arbitrary regions of interest in an image by clustering of spatial and chromatic characteristics. Integrating such work into our system would minimize the need for human intervention in the class-training process.

Figure 2-1: Distribution of skin (red) and non-skin (blue) data in color space.

2.3 Linear Classifier

The data plot in Figure 2-1 shows that the two classes are, to some extent, linearly separable. That is to say, there exists a plane in 3-dimensional space that will separate the two classes with reasonably low classification error. Under this assumption, we can find a linear function, $g(\mathbf{x})$, which will discriminate new data points using such a plane as a class boundary, so that points on either side of the plane correspond to each of the two classes. Such a function must satisfy the Fisher criterion [10], and is of the form:

$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + \omega_0 \tag{2-1}$$

$$\mathbf{w} \propto \mathbf{S}_w^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \tag{2-2}$$

$$\mathbf{S}_w = \boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2 \tag{2-3}$$

where $\mathbf{x}$ is a coordinate in color space, and $(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ are the mean and covariance matrix, respectively, for the ith class. Given these parameters, we can project the training data points onto $\mathbf{w}$, effectively collapsing the color space into one dimension along the direction of $\mathbf{w}$. This allows us to visualize how effectively the plane separates the two classes. We find the intercept along $\mathbf{w}$, $\omega_0$ in (2-1), through a linear search, looking for the point yielding the smallest classification error. Figure 2-2 shows a histogram of each class as it projects onto the vector $\mathbf{w}$, as well as the optimal value of $\omega_0$, marked as a vertical green line. The intercept shown gives a total classification error of 3.59%, as calculated on a validation data set. While this classifier performs acceptably, a higher-order discriminating surface, such as a quadratic, will likely reduce the number of misclassified skin-colored pixels.

Figure 2-2: Data histogram mapped onto discriminant function. (Axis annotations: $\mathbf{w}$, $\omega_0$.)
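The construction of the Fisher direction in (2-2) and (2-3) and the linear search for the intercept in (2-1) can be sketched in plain Python as follows. The function names, the midpoint-based threshold search, and the eight sample HSV points are assumptions made for this example; the actual training data and search procedure are not reproduced here.

```python
def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def covariance(points, mu):
    """Sample covariance matrix of the points about the mean mu."""
    d, n = len(mu), len(points)
    return [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / (n - 1)
             for b in range(d)] for a in range(d)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fisher_direction(class1, class2):
    """w proportional to Sw^-1 (mu1 - mu2), with Sw = Sigma1 + Sigma2."""
    mu1, mu2 = mean(class1), mean(class2)
    s1, s2 = covariance(class1, mu1), covariance(class2, mu2)
    sw = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    return solve(sw, [a - b for a, b in zip(mu1, mu2)])


def project(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def best_intercept(w, class1, class2):
    """Linear search over thresholds along w; returns (omega0, errors)
    such that g(x) = w.x + omega0 > 0 selects class 1."""
    p1 = [project(w, x) for x in class1]
    p2 = [project(w, x) for x in class2]
    candidates = sorted(p1 + p2)
    best = None
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        errors = sum(p <= t for p in p1) + sum(p > t for p in p2)
        if best is None or errors < best[1]:
            best = (-t, errors)
    return best


# Hypothetical (H, S, V) samples, far fewer than the real training set.
skin = [(0.05, 0.40, 0.60), (0.07, 0.50, 0.70),
        (0.04, 0.45, 0.80), (0.06, 0.55, 0.65)]
other = [(0.50, 0.20, 0.30), (0.60, 0.10, 0.40),
         (0.55, 0.30, 0.20), (0.65, 0.25, 0.35)]
w = fisher_direction(skin, other)
omega0, errors = best_intercept(w, skin, other)
```

Classifying a new point then costs one dot product and one comparison, which is consistent with the low operation count that makes the linear classifier attractive for real-time use.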

2.4 Bayes Classifier

To improve on the error of the linear classifier, we now implement a quadratic discriminant function using a Bayes classifier. We do this under the assumption that our classes exhibit normal (i.e., Gaussian) distributions, such that the ith class has a probability density function of the form:

$$p_i(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^3 |\boldsymbol{\Sigma}_i|}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right] \tag{2-4}$$

Assuming equal prior probabilities for both classes, a Bayes classifier gives us the following discriminant function, $g_i(\mathbf{x})$, for class i:

$$g_i(\mathbf{x}) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \boldsymbol{\omega}_i^T \mathbf{x} + \omega_{0i} \tag{2-5}$$

$$\mathbf{W}_i = -\frac{1}{2} \boldsymbol{\Sigma}_i^{-1} \tag{2-5a}$$

$$\boldsymbol{\omega}_i = \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i \tag{2-5b}$$

$$\omega_{0i} = -\frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| \tag{2-5c}$$

A new data point classifies to the highest-valued discriminant function as given in (2-5). Validation data was classified with a 2.67% misclassification rate, lower than that of the linear classifier. Clearly, a quadratic decision surface works better on this data than a decision plane, so Gaussian probability distributions model the training data more accurately. We calculate this classifier’s expected error bound to be 2.34%, using the Bhattacharyya bound method in [10].

2.5 Nearest Neighbor Classifier

The nearest-neighbor classifier is a non-parametric approach that makes no assumptions about the data’s distribution or separability. New data is classified by finding the Euclidean distance to all training data points, and choosing the class of the closest point. Validation error for this

classifier is 1.3%, the lowest of any of the classifiers implemented here. Its major drawback, however, is computational cost: new data must be compared to the entire training data set of over 19,000 points. In a 320x240 pixel image, we must perform over $1.504 \times 10^9$ point comparisons to process the entire image. This number corresponds to long computation times per image, which are unacceptable in a real-time system. The nearest-neighbor approach, however, offers great data modeling capabilities which are quite useful once the size of the training data set is properly reduced.

2.6 Vector Quantization

To relieve the computational load of the nearest-neighbor approach, we need to reduce the size of the training data set in such a way that its representational accuracy is kept as high as possible. Given our training data set, $X = \{\mathbf{x}_j\}$, $j \in \{1, 2, \ldots, n\}$, of n 3-dimensional data points, we find a set of prototype vectors $Z = \{\mathbf{z}_i\}$, $i \in \{1, 2, \ldots, L\}$, $L \ll n$, such that the total distortion D,

$$D = \sum_{j=1}^{n} \min_i \lVert \mathbf{x}_j - \mathbf{z}_i \rVert \tag{2-6}$$

is minimized [9]. We calculate a set of prototype vectors Z for each class using the LBG vector quantization algorithm [5]. Table 2-1 tabulates the classification error for different numbers of skin and non-skin prototype vectors. From this table, we choose a suitable quantized representation by choosing the combination that yields minimal error, highlighted in gray. Figure 2-3 shows how the data space is quantized from X (left plot) to Z (right plot).

2.7 Comparison of Classifier Results

All four statistical classifiers discriminate well between skin and non-skin colored pixels. Error rates for all classifiers are comparable, and all are quite acceptable.

Table 2-1: Combinations of skin and non-skin prototype vectors.

# of skin vectors    # of non-skin vectors    Percent error
128                  128                      2.97
64                   128                      3.70
32                   128                      5.55
16                   128                      9.95
8                    128                      16.2
4                    128                      24.3
2                    128                      30.6
128                  64                       2.52
64                   64                       3.40
32                   64                       3.79
16                   64                       7.12
8                    64                       11.9
4                    64                       18.2
2                    64                       27.9
128                  32                       2.95
64                   32                       3.18
32                   32                       4.04
16                   32                       5.93
8                    32                       9.60
4                    32                       14.8
2                    32                       25.4
128                  16                       3.35
64                   16                       3.67
32                   16                       4.63
16                   16                       6.71
8                    16                       10.2
4                    16                       15.5
2                    16                       21.5

An important performance measure in this work is the computational complexity of an algorithm,

as the principal goal is to produce a system that will work in real-time. Table 2-2 compares these

performance metrics for all four statistical classifiers. The error rate includes all misclassified pix-

els for both classes, and the number of operations counts the number of multiplications and addi-

tions that must be evaluated for a single new data point. From Table 2-2 we conclude that a suitable

compromise between efficiency and accuracy is the Bayes classifier using Gaussian distributions.

Figure 2-4 represents a more qualitative comparison of the four statistical classifiers on a test

image. These images reflect the fact that the nearest neighbor classifier and the Bayes classifier

exhibit the lowest misclassification rates.

2.8 Designing a Skin Detector

The discussion of results in Section 2.7 leads us to design a skin detector based on a Bayes

classifier with Gaussian distributions. While the exact design of the skin detector is described in

fuller detail in Section 3.4, its major points are highlighted here. First, we will not implement the

full discriminant function as given in equation (2-5); instead the discriminant function is based on

the Mahalanobis distance measure, which requires less computation and allows the possibility of rejecting a point from a particular class by implementing threshold rules for new data. This ability to reject points from the skin class makes it unnecessary to train a non-skin class, thus reducing the complexity of the system.

Figure 2-3: Training data (a) raw and (b) quantized.

Table 2-2: Performance comparison of statistical classifiers.

Classifier Type        Error Rate    Number of Operations
Linear                 3.59%         3 mult., 3 add.
Bayes                  2.34%*        30 mult., 22 add.
Nearest Neighbor       1.30%         38,000 mult., 38,000 add.
Vector Quantization    2.95%         320 mult., 320 add.

* This is the expected error bound, rather than the measured error.


Figure 2-4: Skin/non-skin results for statistical classifiers. Panels: Linear Classifier, Bayes Classifier, Nearest Neighbor Classifier, Vector Quantization Classifier.

CHAPTER 3

OBJECT SEGMENTATION AND CLASSIFICATION

3.1 Overview

We divide the processing of an image into the following series of steps: motion segmenta-

tion, blob detection, color classification, object classification, and object-specific processing. The

motion segmentation stage marks the foreground objects of a raw video image as a binary pixel

mask. Contiguous clusters of pixels from the binary mask are grouped together by a blob detection

algorithm, and each grouping is marked as an individual object. Each individual object’s color distribution determines whether the system will consider that object a human. The system recognizes different types of objects and processes each according to its type: human objects are further examined to determine limb locations and orientation, while objects unknown to the system are marked as such and left for possible later processing.

3.2 Motion Segmentation

3.2.1 The Reference Image

The reference image, $R(x, y, c)$, is a matrix of dimension $XDIM \times YDIM \times 3$ containing the color information of a scene’s background; that is, an image taken with no humans or objects present, as shown in Figure 3-1. We obtain the reference image off-line, since it will remain essentially unchanged throughout the on-line image processing.¹ Since all on-line image processing is done in the HSV color space (see Appendix A), we convert $R(x, y, c)$ into $R_{hsv}(x, y, c)$.

¹ We have also experimented with a reference image that is periodically updated in order to account for gradual background changes, which greatly increases the system robustness.


Figure 3-1: Sample reference image $R(x, y, c)$.

3.2.2 The Foreground Image

Foreground objects are extracted from a scene using a frame differencing technique which has been used previously by Haritaoglu et al. [3] and Wren et al. [14]. As a new frame $I(x, y, c)$ comes into the video sequence, we convert it into HSV space and subtract the reference image from it:

$$D(x, y, c) = I_{hsv}(x, y, c) - R_{hsv}(x, y, c) \tag{3-1}$$

producing a difference image, $D(x, y, c)$. From this difference image, we generate a binary image by running each of its three color channels through a specific channel threshold:

$$B(x, y) = \begin{cases} 1, & D(x, y, \text{hue}) > H_{min} \text{ and } D(x, y, \text{saturation}) > S_{min} \text{ and } D(x, y, \text{value}) > V_{min} \\ 0, & \text{otherwise} \end{cases} \tag{3-2}$$

where $H_{min}$, $S_{min}$, and $V_{min}$ are carefully chosen threshold levels for the hue, saturation, and value channels, respectively. This operation generates the foreground image, as non-zero pixels in $B(x, y)$ belong to objects in the scene’s foreground, which we define to be moving objects. Figure 3-2 shows a sample difference image, and the resulting binary foreground image.

Figure 3-2: Sample (a) original image with corresponding (b) difference image and (c) binary mask image.

The channel threshold levels in (3-2) set the sensitivity of the system to camera noise, sudden lighting changes, and shadows. Camera noise may appear as random fluctuations on any of the three channels, and is due to physical sensor inaccuracies in the camera’s circuitry. Since it is small in magnitude (1-2 bits per channel) and area (5-10 contiguous pixels), a low-valued threshold for each channel will

generally filter out camera noise. A subsequent pass through a $3 \times 3$ median filter removes any noise that remains after thresholding. Shadows generally affect the value channel almost exclusively, and unpredictably. Consequently, the value channel threshold should be higher in magnitude than the other two channels to allow for a larger variation margin for shadows.
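As a rough illustration, the thresholding of (3-2) followed by a 3 × 3 median filter might be sketched in Python as below. The image layout (nested lists of (h, s, v) tuples), the function names, the use of the absolute difference, the zero-padded borders, and the neglect of hue wrap-around are all assumptions made for this sketch rather than details of the implementation described here.

```python
def foreground_mask(frame_hsv, ref_hsv, h_min, s_min, v_min):
    """Binary mask B(x, y): 1 where all three channel differences exceed their thresholds."""
    mask = []
    for frame_row, ref_row in zip(frame_hsv, ref_hsv):
        mask_row = []
        for (h, s, v), (rh, rs, rv) in zip(frame_row, ref_row):
            # Assumption: thresholds apply to the magnitude of the difference.
            moving = (abs(h - rh) > h_min
                      and abs(s - rs) > s_min
                      and abs(v - rv) > v_min)
            mask_row.append(1 if moving else 0)
        mask.append(mask_row)
    return mask


def median_filter_3x3(mask):
    """3x3 median filter on a binary mask; pixels outside the image count as 0."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ones = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        ones += mask[ny][nx]
            # The median of nine binary values is 1 when at least five are 1.
            out[y][x] = 1 if ones >= 5 else 0
    return out
```

An isolated noise pixel has too few set neighbors to survive the median filter, while the interior of a larger region is preserved.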

3.3 Blob Detection

We define a blob as a contiguous grouping of pixels that are similar by some given measure (e.g., color, position). We initially segment moving objects by running the pixels in $B(x, y)$ through a blob detection algorithm, which returns a list of blobs. Blobs are detected by a recursive search through the binary image’s nonzero pixels. Each blob in the list has the following parameters:

• Size of the blob in pixels, n.
• A unique blob label.
• $2 \times n$ matrix of pixel coordinates.
• $3 \times n$ matrix of pixel colors.
• Spatial information vector.

The pixel coordinate and color vectors are sampled from $I_{hsv}(x, y, c)$, and the blob label is generated automatically for each new blob. The spatial information vector is a collection of pixel statistics that describe each blob’s size, position, and rough shape using an ellipse; Appendix B provides a complete description of these statistics and the ellipse representation. Figure 3-3 shows the estimated position and shape for all detected blobs in the sample $B(x, y)$ image. Each ellipse corresponds to a moving blob, which may be a single moving object, two or more objects merged together, or noise that was not filtered. For example, the blob list for the sample image in Figure 3-3 would contain one blob which is a single human (person on the left), one blob made up of two separate objects (basketball and person on the right), and several small noise blobs. We eliminate most noise blobs by rejecting blobs smaller than some arbitrary size, since they usually contain few pixels.
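The blob search described above can be sketched as a flood fill over the binary mask. This sketch swaps the recursive search for an explicit stack (to stay clear of Python's recursion limit) and assumes 8-connectivity; the function name, the dictionary fields, and the `min_size` parameter are hypothetical, and the color matrix and spatial information vector are omitted.

```python
def find_blobs(mask, min_size=1):
    """Group nonzero pixels of a binary mask into 8-connected blobs."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Explicit stack in place of the recursive search.
            stack = [(x0, y0)]
            seen[y0][x0] = True
            pixels = []
            while stack:
                x, y = stack.pop()
                pixels.append((x, y))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((nx, ny))
            # Reject blobs below the minimum size as noise.
            if len(pixels) >= min_size:
                blobs.append({"label": len(blobs),
                              "size": len(pixels),
                              "pixels": pixels})
    return blobs
```

Raising `min_size` corresponds to the size-based rejection of noise blobs mentioned in the text.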

Figure 3-3: Detected moving blobs.

3.4 Initial Color Classification

After the initial spatial segmentation of the image, we use color information to classify the detected blobs. We initially detect two types of pixels: skin-colored and ball-colored. Both of these classes are fairly homogeneous and continuous in the HSV color space, as shown in Figure 3-4, so we model each with a single 3-dimensional Gaussian pdf, as given by,

$$p(y) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\left[ -\frac{1}{2} (y - \mu)^T \Sigma^{-1} (y - \mu) \right] \tag{3-3}$$

The Gaussian parameters $(\Sigma, \mu)$ are obtained from color sampling of hand-segmented training images, as previously explained in Section 2.2. Only a few training images are necessary, as long as care is taken to select images that exhibit a wide range of lighting intensities for skin and ball pixels. To classify a new pixel, $y$, we calculate the Mahalanobis distance from that pixel to each of the two classes:

$$M_i^2 = (y - \mu_i)^T \Sigma_i^{-1} (y - \mu_i), \quad i \in \{0, 1\} \tag{3-4}$$

22

In order to simplify on-line evaluation of (3-4), we assume that the covariance matrix $\Sigma_i$ for class i is diagonal. Since all off-diagonal correlation terms in $\Sigma_i$ are at least two orders of magnitude smaller than the diagonal terms, we diagonalize $\Sigma_i$ by forcing off-diagonal terms to zero, with no appreciable effect on classification accuracy. This simplification substantially reduces the computation of $\Sigma_i^{-1}$. After evaluating the distance metric (3-4) for each class, we compare the smallest distance to a pre-determined threshold distance, $D^2_{threshold}$. If $M_i^2 < D^2_{threshold}$, $y$ is classified as a member of class i. Figure 3-5 shows examples of image pixels that have been classified as ball pixels, skin pixels, or “other” pixels; the latter are rejected by the classifier and marked for later processing. After classifying every pixel in a blob, a binary mask image is created for each of the two classes and for unclassified pixels, as shown in Figure 3-5. The ball and skin binary masks are run once again through the blob detection algorithm, returning a new list of blobs for each object.
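With a diagonal covariance matrix, the distance in (3-4) reduces to a sum of squared differences, each divided by the variance of its channel. A minimal sketch of the resulting classifier follows; the function names and the dictionary layout of the models are assumptions, and any numeric model parameters supplied to it would be placeholders rather than values from the trained models.

```python
def mahalanobis_sq_diag(pixel, mean, variances):
    """Squared Mahalanobis distance (3-4) for a diagonal covariance matrix."""
    return sum((p - m) ** 2 / var
               for p, m, var in zip(pixel, mean, variances))


def classify_pixel(pixel, models, threshold_sq):
    """Return the nearest class name, or "other" if the pixel is rejected.

    models maps a class name to (mean, variances), both length-3 sequences.
    """
    best_name, best_dist = None, float("inf")
    for name, (mean, variances) in models.items():
        dist = mahalanobis_sq_diag(pixel, mean, variances)
        if dist < best_dist:
            best_name, best_dist = name, dist
    # Reject the pixel when even the nearest class is too far away.
    return best_name if best_dist < threshold_sq else "other"
```

Each distance costs three subtractions, three multiplications, and three divisions, and the divisions can be replaced by multiplications with precomputed reciprocals of the variances.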

Figure 3-4: Sample training data for (a) ball and (b) skin classes.

3.5 The Object List

We consider all non-noise moving blobs to be foreground objects. An object’s type depends on the color distribution of its pixels, i.e., the output of the color classifiers. For instance, to classify an object as a ball, we say that some percentage of the object must be made up of ball-colored pixels; to classify an object as a human, some percentage of its pixels must be skin-colored. An object that fails to be classified as a human or a ball is left as an unknown, and marked as such for possible later processing. In our controlled environment, unknown objects will generally be false objects produced by pixel noise, as shown in Figure 3-6. We keep track of all the objects in an image through an object list; each object in the list has the following parameters:

• Size, in pixels, of the object.
• Object type.
• Unique object label.
• Spatial information vector for the object.
• Number of detected color regions within the object.
• Spatial information vectors for each color region.
• The class of each color region (skin, ball, other).
• Size, in pixels, of each color region.

All these parameters are available as the output of either the blob detector or the color classifiers.
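One way to picture an entry of the object list is as a small record holding the parameters above, with the object type derived from the share of skin-colored or ball-colored pixels. The class and field names below are hypothetical. The 10% skin fraction follows the figure quoted in Section 4.4, whereas the ball fraction and the order in which the two tests are applied are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ColorRegion:
    color_class: str   # "skin", "ball", or "other"
    size: int          # number of pixels in the region
    spatial: tuple     # spatial information vector (layout not specified here)


@dataclass
class SceneObject:
    label: int
    size: int          # number of pixels in the object
    spatial: tuple
    regions: list = field(default_factory=list)
    object_type: str = "unknown"

    def fraction(self, color_class):
        """Share of the object's pixels that belong to the given color class."""
        total = sum(r.size for r in self.regions
                    if r.color_class == color_class)
        return total / self.size if self.size else 0.0


def assign_type(obj, min_skin=0.10, min_ball=0.50):
    """Set and return obj.object_type from its color fractions."""
    if obj.fraction("ball") > min_ball:      # assumed ball fraction
        obj.object_type = "ball"
    elif obj.fraction("skin") > min_skin:    # 10% figure from Section 4.4
        obj.object_type = "human"
    else:
        obj.object_type = "unknown"
    return obj.object_type
```

The number of color regions in an object is simply `len(obj.regions)`, so it is not stored separately in this sketch.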

Figure 3-5: Pixel classification and corresponding binary masks of (a) ball pixels, (b) skin pixels, and (c) other pixels.

3.6 Processing Human Objects

After classifying an object as human, we wish to extract the person’s pose, which we do very

roughly by estimating head and limb positions from spatial statistics of skin regions. To determine

correspondence between skin regions and known body parts, we implement a simplified human

cardboard model similar to that proposed by Ju et al.[4] and implemented in [3] as shown in Figure

3-7. The human cardboard model defines a set of general search regions for specific body parts.

The simplified cardboard model only sets search regions for the head, arm(s), and leg(s). The algo-

rithm checks the spatial means of detected skin regions in an object against these search areas,

classifying each skin region to the search area encompassing its mean. This method of spatial classification, combined with the ellipsoid shape representation, models the head and legs very closely. Therefore, if a skin blob is identified as a leg or head, its elliptical representation is assumed to hold accurate position and size information. Arms, however, require further processing, because generally an arm’s ellipse data will be an inaccurate representation of the arm’s position and shape. Figure 3-8 shows a typical arm misrepresentation, caused by the fact that a single ellipse cannot describe a bent arm’s true position. To correct this problem, we split up each arm’s pixel data, and assume a new two-ellipse representation. We split the arm blob data along its minor axis in the following manner:

$$P(x) \in \begin{cases} \theta_1, & \omega^T \cdot x > 0 \\ \theta_2, & \omega^T \cdot x \le 0 \end{cases} \tag{3-5}$$

where $x$ is the pixel’s position relative to the ellipse centroid, $\omega^T$ is a vector in the direction of the ellipse’s major axis, and $\theta_1$ and $\theta_2$ are new distinct labels that distinguish pixels on either side of the minor axis. We calculate ellipse parameters for each of the newly labeled sets, which together model the two arm segments more accurately, as shown in Figure 3-9.

Figure 3-6: Detected objects: (a) ball, (b) humans, and (c) unknown.

Figure 3-7: Body part search regions in the human cardboard model.

Figure 3-8: Inaccurate arm representation.
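The split in (3-5) can be sketched as follows. The major-axis direction is taken here from the second-order moments of the pixel coordinates, which is one common way to obtain it and is an assumption of this sketch; the function name is hypothetical, and the ellipse parameters of the two resulting sets are not computed.

```python
import math


def split_along_minor_axis(pixels):
    """Split (x, y) pixels into two sets by the sign of omega . x, as in (3-5)."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Second-order central moments of the pixel positions.
    sxx = sum((x - cx) ** 2 for x, _ in pixels) / n
    syy = sum((y - cy) ** 2 for _, y in pixels) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # Orientation of the major axis of the corresponding ellipse.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    omega = (math.cos(angle), math.sin(angle))
    theta1, theta2 = [], []
    for x, y in pixels:
        # Projection of the centroid-relative position onto the major axis.
        projection = omega[0] * (x - cx) + omega[1] * (y - cy)
        (theta1 if projection > 0 else theta2).append((x, y))
    return theta1, theta2
```

Each returned set can then be passed to the same ellipse-fitting step used for any other blob.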

3.7 Processing Ball Objects

After calculating the spatial characteristics of a ball blob, very little processing remains for

that blob. Conveniently, a ball looks exactly the same from any viewing angle, such that a ball

object which is not circular clearly has incomplete or incorrect data. If the pixel data is incomplete

(due to obstruction from other objects), we rely on past unobstructed frames to correct the ball’s

position and size. A ball may also be split into two or more regions due to partial occlusion from

an object such as an arm or leg. To detect this problem, we search for large ball blobs which are

close to each other in the image. If the ball blobs are close enough, we determine that they must

come from the same ball, so we join the two through a statistical method used in [12]. We can fuse two normal distributions, $\{\mu_1, \Sigma_1\}$ and $\{\mu_2, \Sigma_2\}$, by computing a new covariance matrix $\Sigma$ and a new mean vector $\mu$, as shown in equations (3-6) and (3-7), respectively.

$$\Sigma = \Sigma_1 - \Sigma_1 [\Sigma_1 + \Sigma_2]^{-1} \Sigma_1 \tag{3-6}$$

$$\mu = \mu_1 + \Sigma_1 [\Sigma_1 + \Sigma_2]^{-1} (\mu_2 - \mu_1) \tag{3-7}$$

We modify each covariance matrix before this operation, by setting both diagonal terms equal to the largest diagonal term, as equation (3-8) shows.

$$\Sigma = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \sigma_{22}^2 \end{bmatrix}, \qquad \Sigma' = \begin{bmatrix} \max(\sigma_{11}, \sigma_{22})^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \max(\sigma_{11}, \sigma_{22})^2 \end{bmatrix} \tag{3-8}$$

This is to ensure that, as the blobs are merged, the new blob’s axes will correspond to the unmerged blobs’ major axes.

Figure 3-9: Corrected arm representation.
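For two-dimensional blob statistics, the fusion of two normal distributions can be written out directly. The sketch below is a plain-Python illustration for 2 × 2 covariance matrices stored as nested tuples; the function names are hypothetical. The difference of means is taken as the second mean minus the first, so that the fused mean lies between the two original means.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return tuple(tuple(sum(p[i][k] * q[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))


def equalize(cov):
    """Set both diagonal terms to the larger one, as in (3-8)."""
    big = max(cov[0][0], cov[1][1])
    return ((big, cov[0][1]), (cov[1][0], big))


def fuse(mu1, cov1, mu2, cov2):
    """Fuse two 2-D normal distributions into a single mean and covariance."""
    cov1, cov2 = equalize(cov1), equalize(cov2)
    total = tuple(tuple(cov1[i][j] + cov2[i][j] for j in range(2))
                  for i in range(2))
    gain = matmul2(cov1, inv2(total))   # Sigma_1 [Sigma_1 + Sigma_2]^-1
    correction = matmul2(gain, cov1)
    cov = tuple(tuple(cov1[i][j] - correction[i][j] for j in range(2))
                for i in range(2))      # fused covariance
    diff = (mu2[0] - mu1[0], mu2[1] - mu1[1])
    mu = tuple(mu1[i] + gain[i][0] * diff[0] + gain[i][1] * diff[1]
               for i in range(2))       # fused mean
    return mu, cov
```

When both blobs have equal covariance after the adjustment of (3-8), the fused mean is the midpoint of the two blob centers and the fused covariance is half of either one.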

CHAPTER 4

RESULTS AND DISCUSSION

4.1 Experimental Setup

The system runs on a Pentium II 450 MHz computer under Red Hat Linux. A digital video

camera captures images at 15 frames per second, which are recorded by the computer through an

S-video connection. The camera is fixed in position, and is never panned or zoomed. All images

are taken with fixed lighting and background. While this setup may seem restrictive, it is not unlike

the broadcasting setup at any basketball game, where the background (court) remains unchanged

through a game, and lighting is constant throughout a single game. Processing speed varies with

the dimensions of a captured image, the number of objects in the scene, and the size of those

objects. We have achieved frame rates as fast as 10 frames per second for a fairly simple (no more

than 2 objects), 160x120 pixel image. However, the system is less prone to errors with larger

images, so all experiments were run on 320x240 pixel images.

4.2 Motion Detection

The background differencing algorithm works very consistently as long as the camera

remains stationary and the lighting remains constant. Performance, however, does hinge on choos-

ing an appropriate set of HSV thresholds for the image differencing, as described in Section 3.2.

Choosing unsuitable thresholds will cause the motion detection to fail. For instance, a threshold

that is too high will reject a pixel that is part of an object, but is of a color similar to the back-

ground pixel at the same coordinates. Figure 4-1 shows an example of such an error, where a per-

son’s white socks and shoes are confused with the white floor. There is essentially no way to detect


this problem for a single pixel, though a solution can be implemented by using neighboring pixel

information, for example, through repetitive dilation and erosion of the binary mask image [3].

Conversely, when a threshold is set too low, pixels that do not actually belong to an object appear

as moving pixels. This case tends to occur from increased camera noise or momentary changes in

light reflections, as in Figure 4-2, where pixel colors fluctuate by a few values. While these false

moving blobs are undesirable, they are filtered out by later processing, so they pose no major con-

cerns. More frequently, false positives occur when objects cast shadows on the background of a

scene. Shadows are particularly troublesome because pixel color can change significantly when

part of a scene is shaded. It is not as simple to fix this problem through size discrimination, as it is with noise blobs, because quite frequently shadow pixels are detected as part of a large moving blob. Figure 4-3 shows a case where the binary pixel mask for the person on the left is not a true representation of that person’s shape. The binary mask includes the moving pixels for the person, as well as a large shadow cast by him on the wall. The same effect is observed on the ball and the shadow it casts on the floor. These errors can be lessened by setting the threshold on the value channel at a higher level than the other two channels. Color processing at later stages will completely eliminate the residual shadows that are still present in the binary mask.

Figure 4-1: Object pixels are misclassified due to a high differencing threshold. Pixels in the shoes and socks are misclassified because they are similar in color to the floor.

Figure 4-2: Several small motion blobs are detected where there are no objects, likely due to inaccuracies in the camera sensor and small lighting changes.

Figure 4-3: Incorrect pixel masks due to shadows.


4.3 Color Classification

Obtaining an accurate color model for each class we expect to process is extremely impor-

tant. Color models are generated off-line, by capturing a series of video images that show each

color type to be classified (ball, human, etc.). The video sequence should show a wide range of

variation in the object’s lighting and position in the room, in order to allow the model to generalize

well. Frames from a sample training sequence are shown in Figure 4-4. To train a statistical model,

each of these frames is segmented manually for desired color models. For the example in Figure 4-

4, we would mark the skin regions and the ball regions in each frame, and use those pixels to build

a skin model and a ball model. The raw data from these hand-segmented regions is shown in Figure 3-4; this data set should be as large as possible. The hand-segmentation process can be tedious for a large training sequence; fortunately, a training set of fewer than 10 images will produce somewhere between five and ten thousand training points, which is sufficient to build a satisfactory statistical model. Once this data is collected, pixel color statistics are computed. After establishing a suitable classification threshold for the Mahalanobis distance of each class, we can plot a decision volume for each class in HSV space, as shown in Figure 4-5. On this plot we can observe that the two classes are separable, and that, as expected, they show significantly more variance along the saturation and value axes than along the hue axis. We also note qualitatively that there is very little correlation between the three dimensions of each class. Since the major axes for both decision regions are parallel to the three dimensional axes, we are empirically justified in decoupling and diagonalizing each class’s covariance matrix.

Figure 4-4: Sample video sequence frames for color model training.

Figure 4-5: Decision volumes for skin (red, Mahalanobis distance < 6) and ball (blue, Mahalanobis distance < 6) models.

To assure the lowest possible classification error rate, we build separate skin models for

every person we expect to see during run-time processing. Figure 4-6 shows the decision volumes

for two skin models trained on two separate individuals. We note that although we build unique

skin color models for each person, we consider them distributions of the same class, and our multi-

ple-model approach is effectively a multimodal distribution of a single class. The large amount of

overlap between the two skin distributions in Figure 4-6 makes it impossible to treat them as sepa-

rate classes without a large misclassification error—the main reason we treat all skin models as

part of a single class. We can observe some typical results of this color classifier in Figure 4-7. The

classifier works well on the test images shown, though not without some classification errors.

Table 4-1 gives the average classification error rate for a validation data set, which consists of

labeled, hand-segmented images similar to Figure 4-7. Error rates vary according to several parameters, but are most influenced by the choice of Mahalanobis distance threshold. Generally, raising this threshold for a class will reduce the number of unclassified pixels, but will also increase the number of misclassified pixels for that class (e.g., non-skin pixels classified as skin). We have chosen to keep the class threshold relatively low, reducing misclassification of out-of-class pixels, but in turn detecting incomplete skin regions. We choose this trade-off because the missing skin regions can be corrected with a median filter, or other methods such as dilation and erosion, while incorrect skin regions are not always trivial to detect. Furthermore, it is still possible to construct a representative ellipse with incomplete skin data, and it is unlikely that an entire leg or arm will fail to be recognized as a skin region.

Table 4-1: Error rates on a validation data set given a Mahalanobis distance threshold of 6.

Type of error                                    Average Error Rate
Overall misclassification (1)                    2.1%
Non-skin pixels classified as skin pixels (1)    1.1%
Unclassified skin pixels (2)                     1.2%
Non-ball pixels classified as ball (1)           0.01%
Unclassified ball pixels (2)                     0.8%

(1) Percentage given is out of the total number of validation pixels.
(2) Percentage given is out of the number of pixels in that class only.

Figure 4-6: Decision volumes for two skin models (red and green, Mahalanobis distance < 6) and one ball model (blue, Mahalanobis distance < 6).

4.4 Human Detection

A moving object that has significant skin-colored areas is determined to be a human. We

base this decision solely on color, not on object shape or size, since the human body can assume an

immense variety of shapes and sizes in a video image. The amount of skin that must be part of a

human object is found experimentally, though there is no need for extensive fine-tuning of this parameter; for example, we can reasonably determine that an object with more than 10% skin pixels is a human. This number is high enough that noise on a non-human object will not cause it to be misclassified as a human, while at the same time being low enough so that a person without much visible skin will still be classified correctly. Some results of the human detection algorithm are shown in Figure 4-8. As shown, each ellipse gives a reasonable measure of the size of each person, and of their position in the image frame. Most importantly, no stray objects are classified as human. We observe in the two leftmost frames of Figure 4-8 one of the failure modes of the human detection algorithm, where a human’s size is estimated inaccurately because of shadows misidentified as part of a moving object. The two ellipses in these frames do not encircle the human’s outline as closely as those in the other frames, because nearby shadows (on the wall and floor) are detected as part of the human. These shadows were not filtered through earlier processing, and cannot be easily differentiated from the human motion blob. This is not an irrecoverable error, and a possible solution is discussed in Section 4.7.

Figure 4-9 presents a more severe failure mode, where two persons are recognized as a single human. This error occurs because the motion blobs of both persons are connected in the binary mask image, $B(x, y)$, and there is no way to discern between the two in the early processing stages. Thus, both persons are processed as a single motion blob with several skin regions, which, by our simplified initial classification assumption, classifies as a single human. This is in fact a mild example of a complicated processing problem which this system does not attempt to correct: object occlusion. When a foreground object (in this case, a human) moves in front of another and obstructs the camera’s view, there is not always a clear way to distinguish the two objects, unless we have more complete statistics on the objects’ colors (e.g., clothing). A more robust solution, and one which is beyond the scope of this particular system, is to use past blob information (position, velocity) to predict an object’s motion through occlusion. Provided that the occlusion does not last very long, temporal motion filtering has been shown to be quite effective during object occlusion [3, 6, 7, 8], and is discussed further for our context in Section 4.7.

Figure 4-7: Color classification. Yellow pixels are classified as part of the ball, white pixels are classified as skin pixels.

Figure 4-8: Human detection. Each ellipse is considered a separate human.

Figure 4-9: Human detection failure mode, two humans recognized as one due to partial occlusion.

4.5 Human Representation

The human cardboard model described in Section 3.6 provides very effective search regions

for human limbs, provided that each person remains relatively upright, and there is no occlusion

from other objects. Figure 4-10 shows sample frames with detected limb region ellipses, including

head, arms, and legs. Skin regions that fall within the arm search regions in the cardboard model

have been split into two ellipses, as outlined in Section 3.6. The top left frame in Figure 4-10

shows a failure case of the limb detection algorithm. Because the legs of the person to the right are

very close together, the color classification and blob detection algorithms cannot recognize two distinct limbs. Temporal filtering and tracking of individual limbs can correct this problem, as discussed further in Section 4.7.

Figure 4-10: Separation of skin regions. Arms are split into two ellipses; legs and head are each left as a single ellipse.

4.6 Ball Processing

We assume that during a basketball game, only one basketball will be visible at any time.

This reduces the complexity of processing of ball objects, since we can assume that if we detect

multiple ball objects, the largest one is very likely the basketball, and all others are noise. The

color classifier’s misclassification rate for ball-colored pixels is low enough that the detected basketball is always significantly larger than noise blobs. Figure 4-11 shows typical results of this size-based ball detection. This detection scheme would fail if more than one basketball were present in the image, or if the ball were missing completely from the frame. In the latter case, the largest noise blob would be detected as the basketball, although this failure mode is eliminated through size discrimination, i.e., setting a minimum possible size for the basketball. A partial failure mode occurs if a ball is partially occluded by another object, in which case its size and position will be miscalculated depending on the severity of the occlusion. A single ball may also split into two separate blobs after frame differencing and blob detection, as shown in Figure 4-12. This ball splitting may occur, as detailed in Section 3.7, from partial occlusion by a thin object, such as an arm or leg. In this example, the blobs are close enough to each other that the algorithm will merge their ellipse statistics into one, recognizing that both blobs are most likely part of the same ball object.

Figure 4-11: Results of ball detection.
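The size-based selection described in this section amounts to keeping the largest ball-colored blob that passes a minimum size. A minimal sketch follows, assuming each blob is a dictionary with a "size" entry; the function name and the blob layout are hypothetical.

```python
def select_ball(ball_blobs, min_size):
    """Return the largest ball-colored blob of at least min_size pixels, or None."""
    candidates = [b for b in ball_blobs if b["size"] >= min_size]
    if not candidates:
        # No blob is large enough: treat the ball as absent from the frame.
        return None
    return max(candidates, key=lambda b: b["size"])
```

Returning `None` when no candidate is large enough covers the case in which the ball is missing from the frame and only noise blobs remain.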

Figure 4-12: Correction of (a) split ball blobs into (b) a single ellipsoid.

4.7 Future Work

As mentioned in Section 1.1, this thesis forms a building block for a three-dimensional scene reconstruction system. As such, it remains to be integrated with other parts of that system, though it still stands on its own as a reliable object detection scheme. Nonetheless, the system presented here stands to gain improved performance from its incorporation into the complete system. Most

improvement would come from the addition of a time-delay feedback loop for the tracking of

objects, a method widely used in other tracking systems [3, 6, 7, 8]. There are several ways to implement temporal feedback in this sort of system. One possible method is to calculate the velocity and acceleration of an object of interest (e.g., limbs or the ball), and use that information to

predict the position and speed of that object in future frames [3,6]; this method assumes strictly

linear motion from one frame to the next—not unreasonable given the speeds of the objects

tracked. A popular motion tracking technique uses Kalman filters as predictive motion filters. For

each object being tracked, a Kalman filter can take a number of motion descriptors such as position

and velocity, along with size or shape descriptors such as a bounding box, and predict the object’s

position and size. This approach would be very effective during temporary object occlusions, as it

has been shown to predict an object’s position if it disappears for a short time [8].
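The first method above, prediction from velocity and acceleration, might be sketched with finite differences over the last three observed positions. This is only an illustration of the idea, not part of the system described here; the function name is hypothetical, positions are assumed to be sampled at a fixed frame rate, and a Kalman filter would replace this simple extrapolation with a noise-weighted estimate.

```python
def predict_next(positions):
    """Predict the next (x, y) position and per-frame velocity.

    positions is a sequence of at least three (x, y) observations, oldest first.
    """
    (x0, y0), (x1, y1), (x2, y2) = positions[-3:]
    vx, vy = x2 - x1, y2 - y1                  # latest per-frame displacement
    ax, ay = vx - (x1 - x0), vy - (y1 - y0)    # change in displacement per frame
    next_velocity = (vx + ax, vy + ay)
    next_position = (x2 + next_velocity[0], y2 + next_velocity[1])
    return next_position, next_velocity
```

During a short occlusion, the predicted position can stand in for the missing measurement until the object is detected again.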

There is ample room for future work in the object classification sections, as currently this

system will not distinguish between different individuals. With few modifications, the existing

color classifiers could recognize different persons based on uniform color. In the context of a bas-

ketball game, this would then give us the ability to discern between players of opposing teams,

although more sophisticated tracking of jersey numbers or facial features would be required to dis-

tinguish individuals on the same team.

4.8 Conclusions

In this thesis, we present a system that can detect humans and other objects using non-inva-

sive techniques in real-time. We use color information from video images to robustly detect and

classify moving objects within a known environment, eliminating the need for specialized trans-

mitting devices or markers. Using statistical color models, the system classifies each detected

object as a human, basketball, or unknown object, according to the object’s color distribution. The


system outputs an estimate of each object’s size and position through an elliptical representation;

human objects are further processed through a shape model, which extracts the size and position of

the head, arms, and legs of each person. While we designed this system with a very specific appli-

cation in mind—three-dimensional reconstruction of a basketball game—its possible applications

range from three-dimensional reconstruction of any arbitrary scene, to intelligent surveillance sys-

tems.

APPENDIX A

THE HSV COLOR SPACE

Choosing a suitable color space from the start is crucial to system robustness. Although

color information in computer applications is usually kept in the RGB (Red, Green, Blue) color

space, this type of description is not the most convenient or intuitive for our purposes. Instead, we

choose to convert all color information to the HSV (Hue, Saturation, Value) color space. In the

RGB representation, each color is described by three 8-bit values (three channels), corresponding

to the amount of red, green and blue in a pixel. We can then think of the RGB space as a 3-dimen-

sional cube, with red, green and blue as the axes, and different positions within the cube represent-

ing different colors. A drawback to the RGB representation is that the relative location of different

shades of the same color (e.g., light blue and dark blue) is not readily known, and may not even be continuous.

Thus, we need a color space where similar shades of a color cluster together

in a continuous pattern. The HSV color space is a non-linear transformation from RGB, where colors

are represented by angular coordinates along a cone, rather than Cartesian coordinates. The value and

saturation conversions are shown in equations (A-1) and (A-2).

$$v = \max(r, g, b) \tag{A-1}$$

$$s = \frac{\max(r, g, b) - \min(r, g, b)}{\max(r, g, b)} \tag{A-2}$$

The hue channel is the “tint” or “tone” of the color, and is an angle along the circumference

of the color cone; the angle distinguishes whether a color is blue, yellow, purple, etc. Complementary

colors are 180 degrees apart along the hue angle, as shown in Figure A-1. The saturation channel

is a measure of the amount of white in a color, or its “brightness”, which corresponds to the

radial distance from the center axis of the cone. The value channel is a measure of the amount of black in a color, or its

“intensity”, which is measured along the vertical axis of the cone. Thus, similar shades of the same

color cluster together nicely in this color space, since their hue would remain constant, and only

the saturation or value channels would vary. This property of the HSV space allows us to ignore

changes in lighting intensity on an object, since we can choose to put more importance on the hue

channel and less importance on the other two channels.
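The following is a minimal Python sketch of the conversion described above, not the thesis implementation. It computes value and saturation directly from equations (A-1) and (A-2). The appendix does not list a hue formula, so hue is taken from the standard-library `colorsys` module as an assumption. The function names `value_and_saturation` and `rgb8_to_hsv` and the sample colors are hypothetical.

```python
import colorsys


def value_and_saturation(r, g, b):
    """Return (v, s) for r, g, b in [0, 1], following (A-1) and (A-2)."""
    v = max(r, g, b)  # equation (A-1)
    # Equation (A-2); for pure black (v == 0) the ratio is undefined, so
    # this sketch assumes a saturation of 0 in that case.
    s = 0.0 if v == 0 else (v - min(r, g, b)) / v
    return v, s


def rgb8_to_hsv(r8, g8, b8):
    """Convert 8-bit RGB channels to (hue in degrees, saturation, value)."""
    r, g, b = r8 / 255.0, g8 / 255.0, b8 / 255.0
    # ASSUMPTION: hue comes from colorsys, since the appendix gives no
    # hue equation. colorsys returns hue as a fraction of a full turn.
    h, _, _ = colorsys.rgb_to_hsv(r, g, b)
    v, s = value_and_saturation(r, g, b)
    return h * 360.0, s, v


# Two shades of the same color: a light blue and a dark blue.
light_blue = rgb8_to_hsv(128, 128, 255)
dark_blue = rgb8_to_hsv(0, 0, 128)
print("light blue (H, S, V):", light_blue)
print("dark blue  (H, S, V):", dark_blue)
```

Both shades land on the same hue angle and differ only in saturation and value, which is the clustering property the appendix relies on when it weights the hue channel more heavily than the other two.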

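Figure A-1: HSV Color Cone (hue H is the angle around the cone through red, yellow, green, cyan, blue, and magenta; saturation S is the radial distance; value V runs along the vertical axis, with black at the apex)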

APPENDIX B ELLIPTICAL BLOB REPRESENTATION

We face the problem of describing an arbitrary collection of pixels in such a way that shape

and size information are not lost, while still keeping the size of the representation small, so that it

can be handled efficiently. Consider the pixel cluster of Figure B-1, a blob of n contiguous image

pixels that we wish to describe. Clearly, the most accurate representation of this blob would be a

list of its pixel elements $P_i = (x_i, y_i)$, $i \in \{1, 2, \ldots, n\}$. This would be a straightforward solution, but one that may

become cumbersome when having to handle several large blobs. Furthermore, a pixel list may not

necessarily provide immediate useful information without some sort of processing, e.g., pixel

mean position, median position, etc. We instead choose to represent this pixel blob with an ellipsoid,

as shown in Figure B-2. While this ellipsoid does not exactly represent the blob’s shape or

boundary, we trade this loss of information for the compactness of the ellipsoid model. We can

estimate the ellipsoid in Figure B-2 by estimating the spatial mean and covariance for these pixels:

Figure B-1: Sample pixel blob

$$\mu = \frac{1}{n}\left(\sum_{i=1}^{n} x_i,\; \sum_{i=1}^{n} y_i\right) \tag{B-1}$$

$$\Sigma = \frac{1}{n-1}\sum_{i=1}^{n} (P_i - \mu)(P_i - \mu)^T \tag{B-2}$$

If we assume that the pixel data is distributed more or less normally (in the Gaussian sense), the

eigenvectors of $\Sigma$, $\phi_1$ and $\phi_2$, correspond to the blob’s major and minor axes, as shown in Figure

B-2. If we assume that $\lambda_1$ is the larger eigenvalue of $\Sigma$, then its eigenvector $\phi_1$ points in the direction of the

major axis, and $\phi_2$ points in the direction of the minor axis. We now define the ellipse in the image

space using three parameters: its spatial mean, $\mu$, and its axes, $A\phi_1$ and $A\phi_2$, where $A$ is a scalar.

$A$ corresponds to a distribution boundary that encompasses a large percentage (> 90%) of the corresponding

Gaussian distribution.

Figure B-2: Blob with representative ellipse.
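As an illustration of equations (B-1) and (B-2), the sketch below fits an ellipse to a list of pixel coordinates using only the Python standard library. It is a hedged example rather than the thesis code. Two points are assumptions not stated in the text: each unit eigenvector is scaled by the square root of its eigenvalue, and the scalar A is derived from the chi-square boundary (two degrees of freedom) that encloses a chosen fraction of a two-dimensional Gaussian. The function name `blob_ellipse` and the `coverage` parameter are hypothetical.

```python
import math


def blob_ellipse(pixels, coverage=0.9):
    """Fit an ellipse to (x, y) pixels; return (mean, major_axis, minor_axis)."""
    n = len(pixels)
    if n < 2:
        raise ValueError("need at least two pixels to estimate a covariance")

    # Spatial mean, equation (B-1).
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n

    # Entries of the symmetric 2x2 covariance matrix, equation (B-2).
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pixels) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix, with lam1 >= lam2.
    half_trace = (sxx + syy) / 2.0
    root = math.hypot((sxx - syy) / 2.0, sxy)
    lam1 = half_trace + root
    lam2 = max(half_trace - root, 0.0)  # guard against round-off below zero

    # Unit eigenvectors: phi1 along the major axis, phi2 perpendicular to it.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    phi1 = (math.cos(theta), math.sin(theta))
    phi2 = (-math.sin(theta), math.cos(theta))

    # ASSUMPTION: A is the Mahalanobis radius enclosing `coverage` of a 2-D
    # Gaussian, and each axis is additionally scaled by sqrt(eigenvalue).
    a = math.sqrt(-2.0 * math.log(1.0 - coverage))
    len1 = a * math.sqrt(lam1)
    len2 = a * math.sqrt(lam2)
    major = (len1 * phi1[0], len1 * phi1[1])
    minor = (len2 * phi2[0], len2 * phi2[1])
    return (mx, my), major, minor


# A 5-by-3 rectangular blob: wider than it is tall.
blob = [(x, y) for x in range(5) for y in range(3)]
mean, major, minor = blob_ellipse(blob)
print("mean:", mean)
print("major axis:", major)
print("minor axis:", minor)
```

For the rectangular blob the major axis comes out horizontal and the minor axis vertical, so the whole blob is summarized by a mean and two axis vectors instead of a pixel list.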

REFERENCES

[1] T. Darrell, G. Gordon, M. Harville and J. Woodfill, “Integrated Person Tracking Using Stereo,Color, and Pattern Detection,” Proc. IEEE Conf. on Computer Vision and Pattern Recogni-tion, pp. 601-9, 1998.

[2] D. M. Gavrila, “The Visual Analysis of Human Movement: A Survey,” Computer Vision andImage Understanding, vol. 73, no. 1, pp. 82-98, 1999.

[3] I. Haritaoglu, D. Harwood and L. Davis, “ : Who? When? Where? What? A Real TimeSystem for Detecting and Tracking People,” Proc. IEEE Int. Conf. on Face and Gesture Rec-ognition, pp. 222-7, 1998.

[4] S. X. Ju, M. J. Black and Y. Yacoob, “Cardboard People: A Parameterized Model of Articu-lated Image Motion,” Proc. Second Int. Conf. on Automatic Face and Gesture Recognition,pp. 38-44, 1996.

[5] Y. Linde, A. Buzo and R. M.Gray, “An Algorithm for Vector Quantizer Design,” IEEE Trans.on Communication, vol. COM-28, no. 1, pp. 84-95, 1980.

[6] A J. Lipton, H. Fujiyoshi and R. S. Patil, “Moving Target Classification and Tracking fromReal-Time Video,” Proc. IEEE Workshop on Applications of Computer Vision, pp. 8-14,1998.

[7] Y. Raja, S. J. McKenna and S. Gong, “Tracking and Segmenting People in Varying LightingConditions Using Colour,” Proc. Fourth Int. Conf. on Automatic Face and Gesture Recogni-tion, pp. 228-33, 1998.

[8] G. Rigoll, S. Eickeler and S. Müller, “Person Tracking in Real-World Scenarios Using Statis-tical Methods,” Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, pp.342-7,2000.

[9] M. C. Nechyba. “Vector Quantization: A Special Case of EM,” EEL6935 Spring 2000 LectureNotes, Dept. of Electrical and Computer Engineering, University of Florida, 2000.

[10] R. Schalkoff, Pattern Recognition: Statistical, Structural, and Neural Approaches. JohnWiley and Sons, 1992.

[11] H. Schneiderman, “A Statistical Approach to 3D Object Detection Applied to Faces andCars,” CMU-RI-TR-00-06, Ph.D. Thesis, The Robotics Institute, Carnegie Mellon Univer-sity, 2000.

W4

47

48

[12] A. W. Stroupe, M. C. Martin and T. Balch, “Distributed Sensor Fusion for Object PositionEstimation by Multi-Robot Systems,” Proc. IEEE Int. Conf. on Robotics and Automation,May, 2001.

[13] A. Ude, “Robust Estimation of Human Body Kinematics from Video,” Proc. IEEE Int. Conf.on Intelligent Robots and Systems, vol. 3, pp. 1489-94, 1999.

[14] C. Wren, A. Azarbayejani, T. Darrell and A. Pentland, “Pfinder: Real-Time Tracking of theHuman Body,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp.780-5, 1997.

49

BIOGRAPHICAL SKETCH

Iván Zapata was born in Bogotá, Colombia, in 1975 and has lived in the United States since

1988. He earned a Bachelor of Science degree in electrical engineering from the University

of Florida in December of 1998, with special emphasis on autonomous robots and intelligent sys-

tems. He has since worked as a research assistant at the Machine Intelligence Laboratory, working

toward a Master of Science degree in electrical engineering.
