
Electronics and Computer Science
Faculty of Physical Sciences and Engineering

University of Southampton

Christopher J. Watts
April 28, 2015

Estimating Full-Body Demographics via Soft Biometrics

Project Supervisor: Professor Mark S Nixon
Second Examiner: Professor George Chen

A project report submitted for the award of Bachelor of Science (BSc) in Computer Science


Abstract

Soft biometrics is becoming an increasingly realistic means of identifying individuals in the field of computer vision. This project proposes a novel method of automatic demographic annotation using categoric labels for a wide range of body features, including height, leg length, and shoulder width, where previous research has been limited to facial images and very few biometric features. Using common computer vision techniques, it is possible to categorise subjects' body features from still images or video frames and directly compare them to other known subjects, with high levels of resistance to noise and image compression. This project explores the viability of this new technique and its impact on soft biometrics as a whole.


Contents

1 Introduction
2 Requirements of Solution
3 Consideration of Approaches and Literature Review
  3.1 Code Libraries
  3.2 Region of Interest Extraction
    3.2.1 Locating the Subject
  3.3 Training
    3.3.1 Feature Extraction
    3.3.2 Categoric Labelling
    3.3.3 Weighting Function
4 Final Design and Justification
  4.1 Technologies and Methods
    4.1.1 Code Libraries and Project Setup
    4.1.2 Subject Location
    4.1.3 Feature Extraction
    4.1.4 Labelling
    4.1.5 Weighting Function
  4.2 Processing Pipeline
5 Implementation
  5.1 Loading Training Data
    5.1.1 Interpreting the GaitAnnotate Database
  5.2 Image Processing
    5.2.1 Limiting the Size of the Input
    5.2.2 Background Removal
    5.2.3 Processing Video
  5.3 Component Analysis
  5.4 Training Algorithms
    5.4.1 Further Improvements
  5.5 Storing Training Data
    5.5.1 Heuristics
    5.5.2 Principal Component Data and Training Sets
    5.5.3 Query Engine
6 Results and Evaluation
  6.1 Ability to Estimate Body Demographics
    6.1.1 How to Measure Success
    6.1.2 Results on Test Subjects
    6.1.3 Robustness
  6.2 Performance
    6.2.1 Training
    6.2.2 Querying
  6.3 Viability of Use as a Human Identification System
  6.4 Background Invariance
  6.5 Evaluation against Requirements
  6.6 Evaluation against Other Known Techniques
    6.6.1 As Demographic Estimation
    6.6.2 As Human Identification
7 Summary and Conclusion
  7.1 Future Work
    7.1.1 Migrating to C++ and OpenCV
    7.1.2 Dataset Improvements
    7.1.3 Use of Comparative Labels
    7.1.4 Feature Extraction
    7.1.5 Weighting Functions
Appendices
A Project Management
B Results
C Design Archive
D Project Brief


Preface

This project makes use of the terms subject and suspect. A subject is a person who is scanned by the system to serve as training data or test data. A suspect is a person who will be input as a query to identify matches from the known dataset of subjects.


Acknowledgements

I, Christopher J. Watts, certify that this project represents solely my own work and that all work referenced has been acknowledged appropriately. Further to this, I would like to thank those involved in the creation of the Southampton Gait Database and GaitAnnotate projects, which have been used extensively in this work, and my supervisor Mark S. Nixon for providing guidance and feedback throughout the duration of this project.


Chapter 1

Introduction

The original motivation for this project was to be able to identify a criminal suspect’s presence in surveillance footage given one or more reference images of the suspect — even if part of the subject, such as the face, is concealed. This would help particularly in law enforcement for finding fugitives who appear in CCTV footage with traditional biometric data (such as facial features) being obscured. During the course of this project, the goal developed into a generalised target of calculating individual body features from an image to assist in identification processes associated with finding criminals.

The method proposed in this project is a mixture of computer vision and machine learning to approximate the metrics of an individual. Approaches of this sort reside under the category of "soft biometrics", and there have been several attempts to identify people this way with "acceptable rates of accuracy" in comparison to traditional biometrics [3].

The focus of this implementation is to identify the body demographics of subjects from still images using a set of categoric labels. An example is to categorise height by [Very Short, Short, Average, Tall, Very Tall]. There are two distinct advantages to using categories for features rather than attempting to estimate an absolute value:

1. Estimating labels from video footage is more robust, with greater invariance to noise, skew, and low camera resolution.

2. The accuracy of training data for each subject (which must be generated by hand) becomes more reliable, since research shows humans perform poorly when estimating absolute values [19], [15].

These demographic categories have been used before by Sina Samangooei [19], who created a database of subjects in collaboration with Mark Nixon for the GaitAnnotate project [20]. Previous research has had some success with automatic demographic recognition [1], [25], [6], [7], but only in the domain of facial imagery. Furthermore, these projects have typically been limited to a small number of demographics — age, gender and race. This project builds on existing research to generalise the techniques and detect a wide range of demographic information from images of full bodies, further pushing the possibility of automatic human identification using computer vision.


Chapter 2

Requirements of Solution

In order to successfully achieve the goal of identifying body demographics, the following minimal criteria were referred to when justifying all major decisions on the project.

FR1 (Functional): The system must accept colour images and video frames of any size as an input (although it may then convert the image to greyscale).
FR2 (Functional): The system must process video frame-by-frame autonomously.
FR3 (Functional): The system must be able to perform all calculations without any additional user input.
FR4 (Functional): The system must be invariant to the background of each image.
FR5 (Functional): The system must be self-contained with any learning processes — storing all heuristic knowledge needed to perform identification.
FR6 (Functional): The system must be able to read from a database of subjects and feature categories (GaitAnnotate DB).
FR7 (Functional): Once trained, the system must produce an estimate for each body feature on the subject given one or more query images.
FR8 (Functional): Once trained, the system must produce the best guess or best guesses from the database of known subjects when given one or more query images.
R1 (Non-functional): The system must not rely on pixel-for-pixel measurements when estimating lengths.
R2 (Non-functional): The system must demonstrate a measurable level of invariance to noise and resolution (as if a CCTV camera is in use).
R3 (Non-functional): Once trained, the system should satisfy queries within one minute on a standard desktop or laptop computer (although 1/30th of a second is preferable).
R4 (Non-functional): The accuracy when estimating each feature should be greater than random (100 / number of categories, in percent).
R5 (Non-functional): The accuracy of subject retrieval should be better than random (retrieval correctly matches more subjects than a random test case for a sufficiently large number of queries).

Understanding the research-based nature of the project, the requirements have been kept to a minimum to avoid over-specifying the system and eliminating possibilities before they are explored.


Chapter 3

Consideration of Approaches and Literature Review

3.1 Code Libraries

As computer vision has progressed into a relatively mature field, there are programming tools to abstract much of the functionality this project requires. One such tool is OpenCV (http://opencv.org/) — a computer vision library written in C/C++ with interfaces available for Java and Python. Due to the heavy optimisation of its binaries, this library is computationally quick and memory efficient, which makes it ideal for real-time applications.

Another option is OpenIMAJ (http://www.openimaj.org/) — a modernised approach to computer vision libraries, written in pure Java, that makes best use of the object-orientated paradigm.

Finally, MATLAB provides a computer vision toolbox (http://uk.mathworks.com/products/computer-vision/) that offers rapid prototyping, with many important algorithms built in and support for generating C code.

3.2 Region of Interest Extraction

Working out the soft biometrics of subjects directly from the raw input images is intractable. A series of filters must be used to abstract the necessary features for labelling.

3.2.1 Locating the Subject

One of the most-used methods of finding a person in an image is the Viola-Jones algorithm [22]. It is commonly used today to detect faces on smartphones and cameras, and to automatically tag friends on social networks. Using an alternative set of Haar-like features to the ones used for face detection, the Viola-Jones algorithm is able to detect full bodies, and hence extract them from an image.

Figure 3.1: An example of the Viola-Jones algorithm detecting full bodies. Image credit: mzacha; RGBstock.com

Alternatively, background subtraction can be used. This is plausible in the solution domain because most CCTV cameras are static, therefore a background reference model can be extracted over a period of time using algorithms such as Temporal Median [16]. The de facto algorithm for background subtraction described by Horprasert et al. [10] illustrates a way of obtaining a foreground subject from a background model by examining the change in brightness and the change in colour separately — allowing for shadow elimination. The largest resulting connected components can then be masked from the background — hopefully containing the subject with minimal background-pixel false positives.

There have since been several extensions to the Horprasert algorithm, such as the approach taken by Kim et al. [13], which combines the four-class thresholding of Horprasert et al. with silhouette extraction to smooth out noise and false negatives in connected components.

3.3 Training

3.3.1 Feature Extraction

In order to learn labels, a feature vector needs to be made for each subject. A basic example could be the pixel vectors from head to toe, shoulder to shoulder, pelvis to knee etc., but the solution is unlikely to be robust if the pose were to change.

The route recommended by Hare et al. is auto-annotation with Latent Semantic Analysis (LSA) [8].

LSA works by finding the eigenvectors ($Q$, $Q'$) and eigenvalues ($\Lambda$) of a matrix of the common terms between a set of documents ($A$) using the eigendecomposition equation:

$$A = Q \Lambda Q'^{-1}$$

In this case, $Q$ is the matrix of eigenvectors for $AA^T$ and $Q'$ is for $A^TA$. The eigenvectors can then be used for many purposes. A common task is to find the similarity between any two documents. This is achieved by finding the cosine similarity between any two rows of the eigenvector matrix $Q'$.
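For illustration, the cosine similarity between two such rows reduces to a few lines of Java (a minimal sketch; the plain-array representation of the rows is an assumption):

/** Cosine similarity between two rows of the eigenvector matrix Q'. */
static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}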

LSA was originally used in document analysis to find common terms between many large documents. By considering images as ’documents’ and features as ’terms’, Hare et al. describe how the process of finding and sorting the principal components of the image terms can be used for image retrieval [8]. Weights can then be assigned to the individual principal components to describe how relevant each component is with respect to a particular body feature. This type of process is known as Principal Component Analysis (PCA).

To implement PCA, the eigenvectors are organised in descending order by the corresponding eigenvalue $\lambda \in \Lambda$ for each row. The eigenvector with the largest eigenvalue represents the principal component: the most important variance that contributes the majority of change in an image.

Applying this to a matrix of images and their features or pixels requires finding the covariance matrix. A quick and dirty trick is to approximate it with $A = II^T$, where $I$ is the matrix of features for $n$ images. Since $A$ is a covariance matrix, and therefore a real symmetric matrix, the eigendecomposition can be reduced to

$$A = Q \Lambda Q^T$$

PCA has previously been used by Klare et al. [14] as part of the process in facial demographic recognition for age, gender and race, producing respectable results when trained with Linear Discriminant Analysis as a classification-type technique.

3.3.2 Categoric Labelling

The training data lists a set of categoric measurements for each subject in the database of footage. Since describing the relative traits of individuals varies based on personal experience [18], the training data must be derived from the average ’vote’ of many judges. This project makes use of Samangooei’s collection of categorical labels [19] for the subjects of the Southampton Gait Database (SGDB, http://www.gait.ecs.soton.ac.uk/database/), which have been derived using this method.

Further work in the field of annotation revealed that using comparative descriptions in place of categoric labels when identifying suspects from witness statements is more reliable [17] than absolute labelling alone. The primary advantage is that it eliminates the bias of previous experience (e.g. what a witness thinks is tall or short) by making the witness estimate whether the suspect was taller/shorter/slimmer/fatter than the subject shown. While this technique is particularly suited to identifying an individual through iteratively narrowing down possibilities, it is less suited to identifying the individual categories of demographic features. Instead, the bias of the witness-generated categories is minimised by taking the average of all witness statements when preparing the training data.

3.3.3 Weighting Function

Modelling the correlation of principal components to semantic labels requires machine learning. Given an $n$-dimensional vector of principal components $\mathbf{p}$ and an expected category $y$, a model of weights $\mathbf{w}$ can be learned such that $\mathbf{p}\mathbf{w} \approx y$. Therefore, over a training set of $m$ samples, an error function can be defined as the squared sum of errors:

$$E = \|X\mathbf{w} - \mathbf{y}\|^2$$

where

$$X_{(m,n)} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mn} \end{pmatrix} \qquad \mathbf{y}_{(m)} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$

Perceptron Learning

One of the most common ways of learning weights is the perceptron training algorithm. The principle is to adjust the weights iteratively until the error falls below a certain threshold:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E$$
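A minimal plain-Java sketch of this update, assuming the gradient $\nabla E = 2X^T(X\mathbf{w} - \mathbf{y})$ quoted in Chapter 5 (the zero initialisation and names are illustrative):

/** Gradient-descent weight learning: w <- w - eta * 2 X^T (Xw - y). */
static double[] trainWeights(double[][] X, double[] y, double eta, int iterations) {
    int m = X.length, n = X[0].length;
    double[] w = new double[n]; // zero-initialised weights
    for (int it = 0; it < iterations; it++) {
        double[] residual = new double[m]; // residual = Xw - y
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) residual[i] += X[i][j] * w[j];
            residual[i] -= y[i];
        }
        for (int j = 0; j < n; j++) { // w_j -= eta * 2 * (X^T residual)_j
            double grad = 0;
            for (int i = 0; i < m; i++) grad += X[i][j] * residual[i];
            w[j] -= 2 * eta * grad;
        }
    }
    return w;
}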

Linear Regression

Using the sum-of-squared-errors function, linear regression offers a very simple (although numerically unstable without a regularisation term) way of guessing the ideal set of weights for $\mathbf{w}$:

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$$
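For illustration, the closed form with the regularisation term $\lambda I$ mentioned in Chapter 5 can be computed with an off-the-shelf linear algebra library. This sketch assumes Apache Commons Math, which is not a dependency of the project:

import org.apache.commons.math3.linear.*;

/** Ridge-regularised least squares: w = (X^T X + lambda * I)^-1 X^T y. */
static RealVector fitLinear(double[][] xData, double[] yData, double lambda) {
    RealMatrix X = MatrixUtils.createRealMatrix(xData);
    RealVector y = MatrixUtils.createRealVector(yData);
    RealMatrix Xt = X.transpose();
    RealMatrix A = Xt.multiply(X).add(
            MatrixUtils.createRealIdentityMatrix(X.getColumnDimension())
                       .scalarMultiply(lambda));
    return new LUDecomposition(A).getSolver().solve(Xt.operate(y));
}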

Radial Basis Function

The Radial Basis Function (RBF) regression algorithm improves upon linear regression. It performs clustering, then uses some non-linear function $\phi(\alpha)$ on the distances from each cluster to map the data onto new axes. From there, it is possible to find a linear classifier that models a non-linear classifier on the real data, improving on both perceptron learning and linear regression.

$$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$$

where

$$\Phi_{(m,n)} = \begin{pmatrix} \phi(\|\mathbf{p}_1 - C_1\|) & \phi(\|\mathbf{p}_1 - C_2\|) & \cdots & \phi(\|\mathbf{p}_1 - C_n\|) \\ \phi(\|\mathbf{p}_2 - C_1\|) & \phi(\|\mathbf{p}_2 - C_2\|) & \cdots & \phi(\|\mathbf{p}_2 - C_n\|) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(\|\mathbf{p}_m - C_1\|) & \phi(\|\mathbf{p}_m - C_2\|) & \cdots & \phi(\|\mathbf{p}_m - C_n\|) \end{pmatrix}$$

An example $\phi$ may be

$$\phi(\alpha) = e^{-\alpha/\sigma^2}$$
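A sketch of assembling $\Phi$ from a set of cluster centres under the example $\phi$ above (plain Java; the centres are assumed to come from a prior clustering step):

/** Phi[i][j] = phi(||p_i - C_j||) with phi(a) = exp(-a / sigma^2). */
static double[][] buildPhi(double[][] p, double[][] centres, double sigma) {
    double[][] phi = new double[p.length][centres.length];
    for (int i = 0; i < p.length; i++) {
        for (int j = 0; j < centres.length; j++) {
            double dist2 = 0; // squared Euclidean distance from p_i to centre j
            for (int k = 0; k < p[i].length; k++) {
                double d = p[i][k] - centres[j][k];
                dist2 += d * d;
            }
            phi[i][j] = Math.exp(-Math.sqrt(dist2) / (sigma * sigma));
        }
    }
    return phi;
}

The weights then follow from the same closed form as linear regression, with $\Phi$ in place of $X$.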


Neural Networks

Neural networks are another form of iterative learning, in which multiple sigmoid-response perceptrons are linked across multiple layers. Although training is much more complex and requires differential equations to solve, using the network is relatively fast. Unlike perceptron learning, which is limited to linearly separable problems, neural networks are "capable of approximating any Borel measurable function from one finite dimensional space to another" [9], which is comparable to the non-linear attributes of RBF.


Chapter 4

Final Design and Justification

4.1 Technologies and Methods

4.1.1 Code Libraries and Project Setup

This project is written in Java 7 using the OpenIMAJ library described previously, for the reasons of prior experience with the language and object-orientated finesse. Maven is used for dependency resolution. However, knowing some of the flaws of OpenIMAJ, choosing C or C++ would have been beneficial for both performance and utility reasons.

4.1.2 Subject Location

Initially, the Viola-Jones algorithm seemed to be the ideal choice: it is well used and trusted in computer vision. However, preliminary tests indicated several issues:

1. The algorithm runs slowly on the large images in the training data (approximately 1900ms per full-scale image and 350ms per image when scaled to 800x600 pixels)

2. There were up to 15 false positives for each true positive (an example of which is shown in figure 4.1)

3. For each true positive, only 60% of the bounding boxes contained the entire body (example shown in figure 4.2)

Efforts to redeem the algorithm, detailed in section 5.2.2, were not producing results to a high enough standard, so a decision was made to revert to background subtraction and silhouetting. Basic subtraction proved to be fast, but with too much noise to perform any cropping. The Horprasert algorithm worked much better, but was slow and voluminous in code. Better performance was achieved by using the "robust" algorithm of Kim et al. [13] up until the labelling phase. This provides the same functionality as the Horprasert algorithm, but working in HSI colour space rather than RGB gives a significant efficiency boost: calculating the changes in luminance and the changes in saturation become much more intuitive.

Figure 4.1: A false positive
Figure 4.2: A true positive that has not been bounded correctly

After background subtraction, assuming zero noise around the subject, the black-and-white mask can be cropped to the smallest bounding box containing all the white pixels in the image. This exclusively contains the subject and provides satisfactory alignment for Principal Component Analysis to work correctly. The mask can then be multiplied onto the image to extract the subject onto a fully-black background before normalising the image.

4.1.3 Feature Extraction

Figure 4.3: Action Diagram for background removal (Subtract Background → Largest Connected Component → Trim → Apply Mask). Image credit: mzacha; RGBstock.com

Principal Component Analysis is used to extract feature vectors from each image using OpenIMAJ’s EigenImages class — an implementation of PCA for image sets.

4.1.4 Labelling

Each feature (e.g. height, arm thickness) of each subject is categorised by an enumerated type which represents a number in the range [0, 6] — this is necessary to allow for some of the more diverse categories, such as age, which requires 7 categories. This is replaceable by an enumerated category type for each class to resolve the issue of some features requiring more categories than others; a sketch is given below. An added benefit is improved comprehensibility by using labels such as Age.Category.YOUNG rather than Category.LOW. In the implementation, only generic categories are utilised, but these class-based categories can be applied to final results before outputting to the terminal.
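An illustrative sketch of the two styles of category type (the label names here are hypothetical, not the project's exact ones):

/** Generic seven-point scale; ordinal() gives the value in [0, 6]. */
enum Category { LOWEST, VERY_LOW, LOW, MEDIUM, HIGH, VERY_HIGH, HIGHEST }

/** Feature-specific labels read better at output time, e.g. Age.Category.YOUNG. */
final class Age {
    enum Category { INFANT, CHILD, ADOLESCENT, YOUNG, ADULT, MIDDLE_AGED, SENIOR }
}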

4.1.5 Weighting Function

During preliminary training, benchmarking tests (described in section 5.4) were performed on each weighting function to discern the best-performing algorithm. Due to the lack of a reliable neural network framework, there is no implementation of feed-forward neural networks. In further work, it may be worthwhile writing the code to explore this option.


4.2 Processing Pipeline

Overall, the application will be trained and queried as follows:

Figure 4.4: Action Diagram for training the system (Load Training Dataset → Split Dataset into Training and Testing Subsets → Preprocess Images → Extract Subjects → Crop and Normalise → Train PCA Algorithm → Analyse Training Set → Learn Weighting Function → Analyse Testing Set → Check Weighting Function)


Figure 4.5: Action Diagram for querying the system (Load Footage → Preprocess Images → Extract Suspect → Crop and Normalise → Analyse Footage → Apply Weighting Function → Run Query Against Database → Return Closest Match(es))


Figure 4.6: Class Diagram for the various processing filters. An abstract, generic SubjectProcessor decorator (processSubject, processImage, process, processAll, processAllInplace) sits alongside the ImageProcessor and VideoProcessor interfaces; InputLimiter, SubjectNormaliser, SubjectResizer and SubjectTrimmer implement processSubject; SubjectVideoProcessor provides processFrame; and an abstract BackgroundRemover is specialised by BasicBackgroundRemover, HorprasertBackgroundRemover and TsukabaBackgroundRemover, the latter two holding background models and constructed around a SubjectProcessor.


Chapter 5

Implementation

5.1 Loading Training Data

Using OpenIMAJ’s library for datasets, all training and test data is grouped by the subject in the image. For example, a dataset can contain 50 subjects, but multiple images per subject. This way, it is possible to train with both side-view and frontal-view images from the gait database without using complex iterators. Furthermore, OpenIMAJ keeps all images in datasets on the disk until they are needed; if all images were loaded at runtime, memory would be an issue.

In order to make training valid, it is important to choose random splits each time the system is tested, which is done directly with OpenIMAJ’s group splitting class. Typically, the system is trained with N − 20 subjects, and tested with 20.
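For illustration, a subject-level random split can be expressed in a few lines of plain Java (a sketch only; the project uses OpenIMAJ's splitting class directly):

import java.util.*;

/** Randomly hold out `testCount` subjects for testing; the rest are for training. */
static <K> Map<String, List<K>> splitSubjects(List<K> subjectIds, int testCount) {
    List<K> shuffled = new ArrayList<>(subjectIds);
    Collections.shuffle(shuffled);
    Map<String, List<K>> split = new HashMap<>();
    split.put("test", new ArrayList<>(shuffled.subList(0, testCount)));
    split.put("train", new ArrayList<>(shuffled.subList(testCount, shuffled.size())));
    return split;
}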

5.1.1 Interpreting the GaitAnnotate Database

Since the demographics associated with each subject were kept in a separate MySQL database with a table layout that was not ideally suited for this project, it was decided to migrate the database into a new structure.

Initially, a database was set up using JavaDB/Derby to store the learned weightings. However, it became clear that this was a heavy-weight solution for a light-weight problem, so it was decided to store weights using XML instead, via the JAXB framework, for simplicity and ease of manual tweaking.

Using a PHP script, an XML file was created for each subject in the training images, containing the human-estimated categories for each feature.

In a retrospective decision, since the querying engine requires an element of speed and efficient use of memory (R3), XML could not be used to match subjects on the trained system further into the development timeline. Instead, the JavaDB solution was re-implemented using the same data as in the XML files, with the added advantage of being able to use primary keys for searching, limiting the amount of data required in memory at the time of execution. XML is still in use for the training data due to its simplicity.

5.2 Image Processing

5.2.1 Limiting the Size of the Input

It is clear that the larger an image is, the more pixels need to be processed for subject extraction and training. If an input image is very large (> 1000px, for example), then a large amount of processing time is wasted on insignificant details, such as the buttons on a subject’s shirt, when all that is really needed is enough resolution to identify demographics. The first processing filter is therefore to limit the size of the image to a constant. In the default case, all images with a height or a width greater than 800px are resized so the longest length is exactly 800px. The aspect ratio remains the same.
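A sketch of this limiting filter (assuming OpenIMAJ's ResizeProcessor accepts a scale factor; the method name is illustrative):

import org.openimaj.image.MBFImage;
import org.openimaj.image.processing.resize.ResizeProcessor;

/** Scale an image so its longest side is at most `max` pixels, keeping aspect ratio. */
static MBFImage limitSize(MBFImage image, int max) {
    int w = image.getWidth(), h = image.getHeight();
    if (Math.max(w, h) <= max) return image; // already small enough
    float scale = max / (float) Math.max(w, h);
    return image.process(new ResizeProcessor(scale));
}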


5.2.2 Background Removal

As described in section 3.3.3, several algorithms were shortlisted to further remove unnecessary details from the input. Work first started on implementing the Viola-Jones algorithm to remove the bulk of the background. OpenIMAJ has built-in methods to run the algorithm, and contains many Haar cascades for detecting different parts of the body.

The first runs of the algorithm picked up many false positives, as described in section 4.1.2. To remedy this, the images were run through basic background subtraction to remove as much of the background as possible. The still-image training data does not contain an image of just the background itself, so one was derived using image editing software. Background removal resulted in fewer erroneous detections, but there were still as many as 15 false positives per subject. Knowing of the existence of a voting variant of Viola-Jones, the following basic voting algorithm was implemented:

detected ← ViolaJones(image)
votes ← ∅
seen ← ∅
for all rectangle ∈ detected do
    if rectangle ∉ seen ∧ Area(rectangle) > 10000 then
        seen ← seen ∪ {rectangle}
        overlaps ← {rectangle}
        for all other ∈ detected do
            if other ∉ seen ∧ Overlapping(rectangle, other) ∧ Area(other) > 10000 then
                seen ← seen ∪ {other}
                overlaps ← overlaps ∪ {other}
            end if
        end for
        votes ← votes ∪ {overlaps}
    end if
end for
regions ← ∅
for all candidates ∈ votes do
    if |candidates| ≥ 3 then
        regions ← regions ∪ {MeanAverage(candidates)}
    end if
end for
return regions

This dramatically decreased the rate of false positives, and gave a more accurate subject boundary. Figure 5.1 shows the result of running the voting algorithm. Taking the median of voted images was also attempted, but with results inferior to taking the mean.

From here, edge detection was performed using the Canny operator to assist in isolating the boundary of the subject — the top of their head, the bottom of their feet (although these are mostly cropped out of the training images), and the sides of their shoulders.

However, it soon became apparent after running the application multiple times that using the full-body cascades was too slow — much slower than using face cascades and upper-body cascades. In some cases, a single image sized 579x1423px would take 20 seconds to process. Upon speaking with OpenIMAJ’s author, Jonathon Hare, he advised that while the cascades were taken straight from OpenCV, some cascades perform quite poorly compared to others in the library and "probably need improving". In order to avoid the lengthy process of creating cascades, methods such as subtraction and silhouetting became favourable.

Figure 5.1: An example result of using the above voting algorithm. (a) The candidate images produced by the Viola-Jones algorithm; (b) the returned result.

Second Approach at Background Removal

The Horprasert algorithm immediately showed better results at background subtraction — but still with a large amount of background being erroneously detected. The cause of the false positives arose from the positioning of the camera in the training images. Although care had been taken to minimise the variance of the images, occasionally the camera may have been kicked and the treadmill moved — the repercussions of which are demonstrated in figure 5.2.

Compared with video imagery, it is exceptionally difficult to work out the background of a still image. Methods such as temporal median exist for video, but the backgrounds of still images must be computed manually. To rectify the background issue, the training images were imported into image editing software, and a script was run to automatically align and crop the images. The images were cropped so as not to contain the treadmill, but as some subjects were standing behind the handles, not all subjects fit fully into the image bounds (further experiments will need to be conducted to examine whether this has adverse effects on the results). From this new aligned dataset, a suitable background was derived. Figure 5.3 shows the result of the Horprasert algorithm using the new dataset.

The second background algorithm, written by Kim et al., begins with a method very similar to Horprasert’s, and this proved to be faster at the same job. However, the remainder of the algorithm, which includes labelling and silhouette extraction, could not be implemented due to the large amount of time OpenIMAJ’s default connected component labeller takes to execute on the training images. Further work must be undertaken to make this possible.

The goal of background removal was to reveal a single silhouette mask that encompassed most of the subject’s outline. A cropping algorithm was therefore designed to remove all black areas, leaving the subject’s silhouette in full-frame.

Figure 5.2: Results of using background subtraction on two different training images — one of which is misaligned with the assumed background. The thresholds are manually guessed. (a) A correctly aligned input image; (b) result of the Horprasert algorithm; (c) an incorrectly aligned input image; (d) result of the Horprasert algorithm.

Figure 5.3: Results of using background subtraction on the cropped and aligned dataset. (a) A correctly aligned and cropped input image; (b) result of the Horprasert algorithm.

minX ← ∞; minY ← ∞; maxX ← 0; maxY ← 0
for all pixel ∈ image do
    if pixel.value > 0.5 then
        minX ← min(minX, pixel.x); maxX ← max(maxX, pixel.x)
        minY ← min(minY, pixel.y); maxY ← max(maxY, pixel.y)
    end if
end for
bounds ← Rectangle(minX, minY, maxX − minX, maxY − minY)

The Deprecation of Background Removal

While use of both the Horprasert [10] and Tsukaba [13] algorithms seemed to effectively remove the background in the test examples, Horprasert was not robust enough to remove all non-subject areas, which prevented the cropping algorithm from working as planned. The Tsukaba algorithm incorporates a stage for connected component labelling which mitigates this issue, but sadly the performance of OpenIMAJ’s labeller was far too slow for realistic use. The feasibility of this algorithm was further reduced when it was noted that the novel ’elastic’ borders would be almost impossible to process at an acceptable speed in Java — only a native C compiled library would be realistic.

In order to remain on track with the crucial research-based components of the project, background removal was deprecated in favour of manually cropping the images and keeping the background as the solid green screen from the laboratory. While this means the system will not work in non-controlled conditions, it still serves as a convincing proof of concept with the potential to work on more ambitious footage given a fast and optimal background removal implementation.

5.2.3 Processing Video

It was initially intended that the project could take videos as an input rather than singular still images. Using Temporal Median (or Temporal Mode, to eliminate the need for expensive sorting operations), a representative background could be generated autonomously for use in image segmentation, provided there is enough movement in the scene. Another algorithm was discovered offering a fast implementation of Temporal Median [11], but since this method relies on pixel values being in the relatively low greyscale range of [0, 255], the 16 million possible values of an RGB image made this algorithm unlikely to show any benefit (although this theory has not been tested). Instead, the next best alternative was to use the Quickselect method for finding the median [21].
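A sketch of a per-pixel temporal median over greyscale frames (a naive sort-based version for clarity; Quickselect would replace the sort):

import java.util.Arrays;

/** Background model: each output pixel is the median of that pixel across frames. */
static float[][] temporalMedian(float[][][] frames) { // indexed [frame][y][x]
    int h = frames[0].length, w = frames[0][0].length;
    float[][] background = new float[h][w];
    float[] samples = new float[frames.length];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            for (int f = 0; f < frames.length; f++) samples[f] = frames[f][y][x];
            Arrays.sort(samples);
            background[y][x] = samples[samples.length / 2];
        }
    }
    return background;
}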

To further reduce processing time, the background image could be used to remove all frames with no subjects present. This is simple to achieve by setting some threshold of foreground-to-background pixels in the subtraction mask, or by requiring that the primary connected component has a sufficiently large area.

Unfortunately, certain problems emerged when attempting to load video footage. The Southampton Gait Database contains a very large repository of videos for each subject in the GaitAnnotate database. All videos are encoded as raw DV footage, which should be simple enough to process. Since OpenIMAJ uses a library that relies on FFMPEG (https://www.ffmpeg.org/ffplay.html), a well-known and respected video codec pack, there shouldn’t be any issues. Despite this, there appears to be either some level of corruption in the headers of the files or a bug that causes FFMPEG to load the frames but skip all metadata required for seeking, which is essential for preparing video filters.

The only option left was to load the frames as images and work with them as such, but training the system this way is impossible without the hideous amounts of computer memory required to hold each image and the principal component data that results. It remains possible to query the trained system with image sequences, however.


5.3 Component Analysis

Despite not having completed a robust method of background subtraction, Principal Component Analysis could still be performed on the raw cropped images of the training set, on the basis that the background should be represented by the least significant components. As PCA requires the inputs to have the same number of rows and columns, a normalising class was written to resize each input image to exactly the same size without losing any of the original image (padding is added to the outside if a dimension is too short). The PCA interfacing class directly invokes OpenIMAJ’s EigenImages class, but also allows for Java serialization so it can be stored for later use — an important time-saving method for rapid training and a necessity for performing PCA on any successive queries.
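A minimal sketch of the EigenImages workflow (based on OpenIMAJ's documented API; the loader, query image, and component count are assumptions):

import java.util.List;
import org.openimaj.feature.DoubleFV;
import org.openimaj.image.FImage;
import org.openimaj.image.model.EigenImages;

// Train a PCA basis on the normalised training images, then project a query onto it.
List<FImage> normalised = loadNormalisedTrainingImages(); // hypothetical loader
EigenImages eigen = new EigenImages(100);                  // keep the top 100 components
eigen.train(normalised);
DoubleFV components = eigen.extractFeature(normalisedQueryImage);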

5.4 Training Algorithms

To gauge the success of a training algorithm, an error function of the total ’distance’ between the features of each subject ($f$) and the features of the estimated subject ($g$) was devised such that

$$\text{distance} = \sum_{i=1}^{\text{num features}} |f_i - g_i|$$

When choosing the best algorithm to use, a set of tests was run for each algorithm using a 50/50 split of training to testing, with 58 subjects in each group, and using both frontal and side-view images.

Testing began with a perceptron learning algorithm to iteratively converge upon ideal weights for each principal component using gradient descent. The weights are initialized as a uniform random guess, and updated as described earlier using the gradient $\nabla E = 2X^T(X\mathbf{w} - \mathbf{y})$. After adjusting the learning rate and iteration count, an overall error of ≈ 80 at $\eta = 10^{-5}$ and 1000 iterations was achieved. By comparison, a random guess produced an error of ≈ 115. A bias term was then added to the data to minimize bias error (an extra input with a constant value of 1), but this made no significant difference to gradient descent.

This means that using perceptron-trained weights is not much better than random guessing, but it is most certainly an improvement. Higher iteration counts were also tried, up to 10,000, but this seemed to overfit the data: error rates increased. In hindsight, using a validation set in addition to training and testing could have prevented this and attained better results at higher iteration counts.

Linear regression was implemented next using the formula described earlier. The average error was slightly higher, ≈ 83, but the algorithm took much less time to train than the perceptron (4ms on average for each feature of 58 subjects, compared to 80ms). After further optimization by adding a regularization term $\lambda I$ for variance error and a bias term as described above, the error decreased to ≈ 45, which is much more realistic, but still not very useful.

The final algorithm tested was the Radial Basis Function (RBF). After multiple failures and several test cases that were statistically worse than random (error ≈ 140), results were achieved in the range (30, 40) with the a priori variable α = 5. By manually adjusting α, it was found that a value of 20 produces the most accurate and consistent results.

During development, it was conceived that there should be separate weightings for frontal-view and side-view images to classify either style more accurately. However, this proved to be a bad idea, as real CCTV footage won’t guarantee a frontal or side view, but rather a range of oblique angles. Training is therefore best done with multiple angles incorporated, in an effort to reduce generalization error.

5.4.1 Further Improvements

Shortly after the initial algorithm tests, a bug was discovered in the implementation of finding the total distance between subjects, which meant scores were more than double what they should have been. After fixing this issue, along with some smaller discrepancies, gradient descent reduced to an average distance of ≈ 36, linear regression to ≈ 8, and RBF to ≈ 7, using 50/50 training-to-testing splits. It was noted at this point that linear regression is definitely a contender for the final solution due to its speed and accuracy. Despite this, since RBF adds little extra time to training for a small decrease in distance, it is still preferred. For larger training sets in more extreme conditions, it is quite possible that the non-linear training algorithm will have distinct advantages over linear regression due to the flexibility of the model.

5.5 Storing Training Data

5.5.1 Heuristics

Since the JAXB library was already being used for loading demographic data, using XML to store trained heuristics seemed a logical solution. The implementation is a trainable Heuristic class with a subclass for each body feature. A storable JAXB version of a Heuristic was then created with containers for the class name, the weightings map, and any serialized data that the training algorithm may need to set up again, such as centroid data for the RadialBasisFunctionTrainer. This method proved to be effective when debugging, as individual weights could be manually tweaked with a standard text editor, and changes are easier to notice.
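A sketch of what such a storable heuristic might look like with JAXB annotations (the field names are hypothetical, not the project's actual schema):

import java.util.HashMap;
import java.util.Map;
import javax.xml.bind.annotation.*;

/** XML-serialisable snapshot of a trained heuristic. */
@XmlRootElement(name = "heuristic")
@XmlAccessorType(XmlAccessType.FIELD)
class StoredHeuristic {
    String className;                              // concrete Heuristic subclass
    Map<String, Double> weights = new HashMap<>(); // per-component weightings
    byte[] trainerState;                           // e.g. serialized RBF centroid data
}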

5.5.2 Principal Component Data and Training Sets

Since Principal Component Analysis is a costly procedure but its results are reusable, heuristic training times could be reduced by caching both the principal component data and any generated training sets (containing mappings of component data to categories) to disk. Since this data is not editable, it is simply serialized using Java serialization.

5.5.3 Query Engine

With a trained set of heuristics, one or more images should produce an estimation for each body feature and, ultimately, a guess of who the subject may be. In order to achieve the latter, a database needed to be re-implemented to avoid loading every single subject’s XML file separately for each query, as it would need to find the subjects with the closest matching features.

The database is SQL-based, which makes querying relatively simple. Stored procedures and functions were considered to calculate the distance between a subject in the database and a suspect probe, but rejected in favour of dynamically building a SQL statement, which takes the following format (where question marks are replaced with the respective values of the probe’s features):

SELECT id, SUM(ABS(`age` - ?) + ... + ABS(`weight` - ?)) AS distance
FROM subjects
GROUP BY id
ORDER BY distance ASC
FETCH FIRST 5 ROWS ONLY

This produces a list of the top 5 matching subjects with their total distance in ascending order. This can be used directly to identify the suspect, or to narrow down the possibilities in a wider search.
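Building and executing that statement dynamically might look like the following (a sketch; the feature column names and JDBC connection are assumptions, and the GROUP BY is omitted since each subject occupies a single row):

import java.sql.*;
import java.util.List;

/** Query the k closest subjects to a probe's feature categories. */
static ResultSet closestSubjects(Connection conn, List<String> featureColumns,
                                 int[] probe, int k) throws SQLException {
    StringBuilder sql = new StringBuilder("SELECT id, (");
    for (int i = 0; i < featureColumns.size(); i++) {
        if (i > 0) sql.append(" + ");
        sql.append("ABS(").append(featureColumns.get(i)).append(" - ?)");
    }
    sql.append(") AS distance FROM subjects ORDER BY distance ASC FETCH FIRST ")
       .append(k).append(" ROWS ONLY");
    PreparedStatement stmt = conn.prepareStatement(sql.toString());
    for (int i = 0; i < probe.length; i++)
        stmt.setInt(i + 1, probe[i]); // JDBC parameters are 1-indexed
    return stmt.executeQuery();
}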


Chapter 6

Results and Evaluation

6.1 Ability to Estimate Body Demographics

6.1.1 How to Measure Success

In this project, the success of recognition is measured by three metrics. The first is the percentage accuracy for a particular feature. An example is age, which has seven categories. The accuracy of the system for a feature is determined by the number of correct categorisations divided by the total number of subjects used for testing.

Note that the system can either guess right or wrong — the distance from the correct category is not taken into account. For an entirely random guess approach on a sufficiently large dataset of test subjects, the accuracy for the seven-category age feature will be 100/7 = 14.29%.

The second metric, as described previously, is the total ’distance’ between the guessed demographics ($g$) and the actual demographics ($f$) for all body features, such that:

$$\text{distance} = \sum_{i=1}^{\text{num features}} |f_i - g_i|$$

This metric does allow for guesses to be one or two categories out, which gives a clearer overall picture of how close the match is.

The final metric is the index of the actual subject when querying the database for the most probable subjects. For example, if the system guesses the suspect’s body demographics slightly wrong, it may mean that the closest match in the database is not the correct person, and instead the correct subject is ranked as the 6th most probable.

6.1.2 Results on Test Subjects

As mentioned with the implementation of different training algorithms, the best results were obtained using a Radial Basis Function $\phi(\alpha)$ with α = 20. The results in table B.1 show the performance of the RBF trainer as a function of the average percentage of correct estimations, in comparison to the ’expected’ accuracy and two random techniques — the median category, and a random category between 1 and the maximum value within the training set. For this set of tests, the training set was kept the same. Typically, the correct classification rate for a particular body feature is 72 ± 3.8%. Some features are particularly accurate (e.g. 90% for Proportions), and some are close to useless (namely 55.25% for Skin Colour, and 10.5% for Ethnicity).

It is safe to assume that human demographics approximate a Gaussian distribution per continuously measured biometric, so it is not surprising that choosing the median category yields better results than choosing a random category between sensible limits. However, both random methods produced results significantly worse than the informed method, which validates that it is possible to use principal component analysis to estimate body demographics.


Figure 6.1: The correct classification rates of each biometric for the RBF method. [Bar chart of percentage accuracy for: Age, Arm Length, Arm Thickness, Chest, Ethnicity, Facial Hair Colour, Facial Hair Length, Figure, Hair Colour, Hair Length, Height, Hips, Leg Length, Leg Shape, Leg Thickness, Muscle Build, Neck Length, Neck Thickness, Proportions, Sex, Shoulder Shape, Skin Colour, Weight.]

Algorithm   Average Distance   % Recalls within top 5
RBF         7.2                51.75%
Random      26.0               5.0%
Median      10.3               5.0%

It is clear that while the difference in average distance between the RBF and Median algorithms is relatively small, the slight advantage of RBF yields a large increase in the percentage of correct recalls (defined as the proportion of test subjects that were recalled as one of the top 5 most likely from the database). The relative success of guessing with the median indicates the category setup is not unique enough to separate individuals — this is explored in further detail in section 6.3. Comparison box-and-whisker plots for the accuracy of each algorithm are shown in full detail in figure B.1.

The Effects of Occluded Body Features

Surprisingly, it seems that the difference in accuracy between frontal and side views when estimating categories is minimal. The expectation was for features such as hips, chest and shoulder shape to become less accurate, as they can no longer be easily determined. However, it seems that the proposed method infers information from alternate variances that aren’t affected by the view angle.

In the example shown in figure 6.2, a subject (not used in training) has been queried against the known information in the database for both frontal and side views. However, the differences are minor and unrelated to the aforementioned categories that should be affected in the side image. This phenomenon may be of great help when identifying suspects at odd angles where humans are unable to determine certain traits.

Difficulties and Limitations with Training Data

It is apparent from the results in table B.1 that certain body features are more easily detectable than others. The likely explanations are the following:

1. Features such as ethnicity do not have a natural order or scale. Far-Eastern and Black, for example, are next to each other by their categorical order, yet they share few similarities. This effect can reduce the accuracy of classification.


2. Subjects in the training set are predominantly male and have normal proportions, which makes guessing by median more accurate than informed estimation for certain characteristics.

3. The dataset is limited in size — with just 198 images to learn from.

4. The training data is written by humans in a consensus approach [19], which means that while an individual’s error is overruled, the entire group of annotators can be biased towards a particular category based on their experience — especially if the annotators are mostly white British males in their early twenties.

In addition to these, there are issues with the images themselves, such as body parts being obstructed by a treadmill, poor exposure, and white balance issues leading to skin colour looking darker than normal. It is entirely plausible that a more comprehensive, higher-quality training set could greatly augment the trained system — a suggestion that should be taken into account in future research.

6.1.3 Robustness

The most surprising results come with robustness testing, which surpassed expectations. In particular, the system appears to be strong against resolution, quality, and noise constraints, as the following tests show:

(a) Response in accuracy as noise levels increase (100% noise = every pixel is different from its original):

Test                         Total Distance   Recall Rate
Normal Image (subject 010)   5.0              10/116
+10% Uniform Noise           5.0              10/116
+20% Uniform Noise           5.0              10/116
+30% Uniform Noise           5.0              10/116
+40% Uniform Noise           6.0              21/116
+50% Uniform Noise           7.0              48/116
+60% Uniform Noise           7.0              48/116
+70% Uniform Noise           8.0              56/116
+80% Uniform Noise           10.0             67/116
+90% Uniform Noise           11.0             69/116
+100% Uniform Noise          11.0             69/116

(b) Response in accuracy as resolution decreases (original size = 579x1423px; 72dpi):

Test                         Total Distance   Recall Rate
Normal Image (subject 053)   5.0              1/116
50% Resolution               5.0              1/116
25% Resolution               6.0              1/116
12.5% Resolution             6.0              1/116
6.25% Resolution             6.0              1/116
3.13% Resolution             11.0             64/116
1.57% Resolution             15.0             107/116

(c) Response in accuracy as a result of image compression:

Test                         Total Distance   Recall Rate
Normal Image (subject 053)   5.0              1/116
Lowest JPEG Quality          5.0              1/116
100 Colour GIF               5.0              1/116
20 Colour GIF                6.0              1/116

Table 6.1: Results of degraded image quality tests

Though these tests are limited to a select few subjects (as each image requires manual editing), repeating the tests with different subjects and training sets yields similar results. These are crucially important findings, as they show a strong invariance to factors commonly associated with CCTV footage, which would aid deployment using existing inexpensive camera equipment. By these findings, even a cheap webcam could be used as a state-of-the-art demographic estimator, whereas other methods such as gait recognition are more susceptible to resolution and frame rate [5].

To showcase the importance of these results, figure 6.3 shows the input images for the 6.25% resolution and 20-colour GIF tests.


Subroutine                Execution Time
Image Loading             2.565s
PCA Analysis              170.9s
Training Set Generation   81.99s
Training (RBF)            15.48s
Total (Complete)          271.1s
Total (Cached)            18.22s

Table 6.2: Performance of Training Engine

Subroutine         Execution Time
Image Analysis     0.367s
Categorisation     0.072s
Subject Matching   0.150s
Total              1.237s

Table 6.3: Performance of Query Engine on Desktop Computer

6.2 Performance

6.2.1 Training

Training performance depends on whether prior training has taken place. Since gathering principal component data and creating training sets take a long time and are both reusable, they are cached for future use. On a desktop computer (Intel Core i7 2600K @ 3.6GHz; 8GB 1600MHz RAM; 7200RPM HDD; Java 8), table 6.2 shows the mean execution time for each stage on the GaitAnnotate database with 198 images.

6.2.2 Querying

For the same desktop computer, the mean execution times for taking a single image and identifying a list of 10 potential matches are shown in table 6.3. The code was also run on a Raspberry Pi 2 Model B (ARM Cortex-A7 quad core @ 900MHz; 1GB RAM; microSD; Java 8), the results of which are shown in table 6.4.

Although subject matching is relatively fast when performing individual queries, multiple queries made less than one second apart can form a queue on the database, which can delay queries by up to 20 seconds each if many thousands of requests are made.

6.3 Viability of Use as a Human Identification System

While the estimation of demographics is certainly an effective breakthrough, there appears to be little hope of the current system becoming a way to identify masked criminals. With the small population of 116 subjects in the database, there is already a significant overlap of ’average’ people. This has meant that the fully trained system can only manage to retrieve a subject in the top 5 50-60% of the time, as opposed to retrieving the correct subject 80-100% of the time, which would be more realistic. Notwithstanding, random estimation has an average top-5 recall rate of 5%, so the result is certainly significant, if not ideal.

Subroutine         Execution Time
Image Analysis     9.489s
Categorisation     1.876s
Subject Matching   7.216s
Total              12.60s

Table 6.4: Performance of Query Engine on Raspberry Pi


Despite the low recall rate, the proposed system may decrease the amount of searching required if it can narrow down the possible candidates in a manhunt from thousands to just a few hundred.

The system can also be used simply to aid with witness descriptions. The labels in table 6.4b were generated from a single image of ’Jihadi John’ — the masked murderer of the Islamic State — which seems to identify features with respectable accuracy despite the system not being trained for backgrounds other than green baize. While some features are misguided (e.g. skin colour), the majority of estimations are certainly fitting. The unusual proportions, short legs, and small figure are explained by the legs being cropped out.

While this system was trained to use discrete categoric labels for identification, it may be possible to use the comparative labels described in section 3.3.2, whereby metrics are given relative to other subjects (e.g. taller, fatter), with the intention of reducing the number of conflicts by increasing the uniqueness of each database entry, ultimately leading to a system that can more accurately identify a single person.

6.4 Background Invariance

Though the training data is limited to laboratory conditions, a limited set of tests, including the annotation of ’Jihadi John’, indicates a slight invariance to background images that may render background subtraction obsolete if results improve when using training data with non-uniform, indoor and outdoor backgrounds. See figure B.2 for the results of trying to classify an individual standing outside, which show mostly correct or reasonable categories. It is interesting to note that both frontal and side-on views under the same lighting produce more-or-less the same results. This further validates the decision, made in section 5.4, to use a single set of heuristics for all view angles.

6.5 Evaluation against Requirements

FR1 (Pass): The system resizes colour images to a constant dimension before processing.
FR2 (Fail): Video footage could not be loaded successfully using OpenIMAJ.
FR3 (Pass): Both the training and querying engines require no user input other than the number of matches to retrieve and the images to use.
FR4 (Fail*): While background removal was not successfully implemented, tests in table 6.4b and figure B.2 demonstrate the possibility that the background could be rendered insignificant given a comprehensive training set.
FR5 (Pass): Heuristics are stored as XML; principal component data is stored in serialized form.
FR6 (Pass): XML files for each subject contain training inputs, and the query database contains all required information to find matches derived from the GaitAnnotate project.
FR7 (Pass): Example: table 6.4b.
FR8 (Pass): The system produces the n top matching subjects from the database when using the query engine.
R1 (Pass): The system uses statistical analysis for estimations.
R2 (Pass): See section 6.1.3.
R3 (Pass): A single query, with the overhead of loading principal component data and heuristics, takes an average of 5.9 seconds on a standard desktop computer.
R4 (Pass): See table B.1.
R5 (Pass): Average recall within top 5 for the Radial Basis Function is between 50% and 60%; average recall for random or median is less than 10%.


6.6 Evaluation against Other Known Techniques

6.6.1 As Demographic Estimation

There has been previous research in this domain of soft-biometrics which offers some comparison to the performance of this project. The comparisons will be made mostly against research conducted by Hu Han et al. [7] and their "biologically inspired" framework, which includes the performance of human estimation for the demographics of age, gender, and race.

However, this is as far as evaluation for demographic estimation can extend, as nobody has yet published a generalised form that is not limited to a select few categories. This project is not only able to identify twenty-three different traits, but it can also use images of whole bodies, which is more suitable for low-quality CCTV: a domain where research of this kind is most important.

In terms of simplicity, this project also takes a more general and malleable approach compared to facial processing and biologically inspired features [7], facial surface normals [24], and Active Appearance Models (AAM) [4], all of which require human faces as training images. By contrast, the proposed method makes use of natural variances in any image (not necessarily that of a human), and classifies with any regression-based or classification-based machine learning algorithm (RBF, linear regression, neural networks), resulting in a more generic framework which can be built on and improved for greater accuracy across a limitless number of demographic features.

Age Estimation

Because age was estimated with an absolute figure rather than in categories in [7], it is not directly comparable to the technique presented in this project. Han et al. achieved a mean absolute error of 3.8 ± 3.3 years on the MORPH II facial image database (without Quality Assessment), compared to 6.2 ± 4.9 years for human estimation. By comparison, assuming each of the 7 categories represents a range of 10 years, this solution obtains a mean absolute error of 4.3 ± 0.9 years, which by the research presented above is potentially better than human estimation. The measurement is flawed, however, as it assumes correct categorisation has an error of exactly 0 years (and 10 years for a one-category difference).
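
Stated explicitly, the conversion assumed here is, with \(\hat{c}_i\) and \(c_i\) the predicted and true category indices for sample \(i\) of \(N\), and \(w\) the assumed category width:

    \mathrm{MAE}_{\mathrm{years}} \approx w \cdot \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{c}_i - c_i \rvert, \qquad w = 10 \text{ years}

so a mean category error of 0.43 corresponds to the 4.3-year figure quoted.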

Gender Classification

For the gender demographic, Han et al. achieved an accuracy of 97.6%, while Guodong Guo and Guowang Mu's method [6] achieves 98.5% on the MORPH II dataset. These are both higher than human performance on this dataset, which is 96.9%. Both results exceed the accuracy of this project, which has a mean accuracy of 85%. The decrease in accuracy could be a result of using a relatively small dataset for training: just 198 training images, compared to the 2,000 images in the MORPH II dataset used by the majority of papers on demographic estimation in facial images.

Race Classification

Han et al. classify races as black or white only, with an accuracy of 99.1% compared to the human performance of 97.8% on the MORPH II dataset. Guodong Guo and Guowang Mu achieve an accuracy of 99.0%. In contrast, this project performs poorly when judging both skin colour and race: 55.25% for the former and 10.5% for the latter. The most likely reason is a mix of incorrect data in the database (some samples are clearly European, yet labelled as 'Far-Eastern' or 'Other') and poor colour balance in the photos, which can make some subjects appear over-saturated and thus more tanned. Race and skin colour are also highly subjective judgements: a white person may think an Indian is black, while a black person might say they are white. Because of these conflicts, it is understandable why some of the training data is inconsistent.

6.6.2 As Human Identification

Gait biometrics has been an active area of research for human identification since the 1990s and has proved to be a viable and robust method. Goffredo et al. have compiled a report on the performance of gait recognition using the techniques in use at the University of Southampton [5].


In that report, the recognition rates for gait techniques peak at 96% on a database of 12 subjects over 275 video sequences, tailing off to circa 52% for more acute view angles. The proposed method is evidently not suitable for identification in its current state, with a mean recall rate of 51.75% for matches within the top-5 retrievals. If recall is restricted to the top-ranking match only, it decreases to 29.75%; this implies that most of the top-5 recalls are in fact top-scoring results.


(a) (b)

Feature              Image A   Image B   Target
Age                  4 /       5* /      5
Arm Length           3*        3*        3
Arm Thickness        2*        2*        2
Chest                3*        3*        3
Ethnicity            3         3         4
Facial Hair Colour   1*        1*        1
Facial Hair Length   1*        1*        1
Figure               3*        3*        3
Hair Colour          2         2         1
Hair Length          3* /      4 /       3
Height               4* /      3 /       4
Hips                 2*        2*        2
Leg Length           3*        3*        3
Leg Shape            2*        2*        2
Leg Thickness        2 /       3* /      3
Muscle Build         2*        2*        2
Neck Length          3*        3*        3
Neck Thickness       3*        3*        3
Proportions          1*        1*        1
Sex                  2*        2*        2
Shoulder Shape       3         3         4
Skin Colour          3* /      2 /       3
Weight               3*        3*        3

(c) Per-feature categories for each of the above images, and the actual category targets. Asterisks mark correct estimations; slashes mark differences between the A and B estimations.

Figure 6.2: The effects of occluded body parts on categoric estimations


(a) Subject 053 at 6.25% resolution    (b) Subject 053 as a 20-colour GIF

Figure 6.3: Reduced quality images used for queries

(a) A freeze-frame of ’Jihadi John’

Feature            Category
age                MIDDLE AGED
armlength          LONG
armthickness       THICK
chest              VERY SLIM
ethnicity          OTHER
facialhaircolour   NONE
facialhairlength   NONE
figure             SMALL
haircolour         DYED
hairlength         MEDIUM
height             TALL
hips               NARROW
leglength          SHORT
legshape           VERY STRAIGHT
legthickness       AVERAGE
musclebuild        AVERAGE
necklength         VERY LONG
neckthickness      THICK
proportions        UNUSUAL
sex                MALE
shouldershape      ROUNDED
skincolour         WHITE
weight             THIN

(b) Estimated labels for 'Jihadi John'


Chapter 7

Summary and Conclusion

This project details a novel, effective, and noise-invariant way of estimating body demographics through computer vision. While the demographics themselves cannot be used as an identification process in their current form, the estimation process can greatly assist in a broad range of applications such as witness descriptions at crime scenes, targeted advertising for passers-by, and even keeping track of wildlife (there is no fathomable reason why the system cannot be trained with images of animals).

Further research in this area may allow for multiple subjects in an image, improved accuracy, improved speed, and greater uniqueness and differentiability in the demographics used (perhaps using comparative labels), in order to potentially use this as an identification mechanism to automatically find people across an array of CCTV cameras in real time.

Since the system produced respectable results despite the background being present in the training images, there is a possibility of utilising pedestrian detection techniques such as patterns of motion and appearance [23], given a suitably fast implementation (which may require native C code), to train on extracted, low-resolution subjects without requiring background removal. If training is successful across a wide range of camera angles, then in theory any CCTV camera could be used to estimate body demographics.
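
As an illustration of how such a pipeline might begin, the sketch below uses the stock HOG-based people detector shipped with OpenCV's Java bindings (a different detector from the motion-and-appearance approach of [23], and not part of this project's source); each detected window would then be passed to the trained categoriser. The input path is a placeholder.

    import org.opencv.core.Core;
    import org.opencv.core.Mat;
    import org.opencv.core.MatOfDouble;
    import org.opencv.core.MatOfRect;
    import org.opencv.core.Rect;
    import org.opencv.imgcodecs.Imgcodecs;
    import org.opencv.objdetect.HOGDescriptor;

    // Sketch: locate pedestrians in a CCTV frame so that each detection
    // window can be fed to the categoriser without background removal.
    public class PedestrianLocator {
        static { System.loadLibrary(Core.NATIVE_LIBRARY_NAME); }

        public static void main(String[] args) {
            Mat frame = Imgcodecs.imread("cctv-frame.png"); // placeholder input

            // OpenCV's built-in people detector (HOG features + linear SVM)
            HOGDescriptor hog = new HOGDescriptor();
            hog.setSVMDetector(HOGDescriptor.getDefaultPeopleDetector());

            MatOfRect detections = new MatOfRect();
            MatOfDouble weights = new MatOfDouble();
            hog.detectMultiScale(frame, detections, weights);

            for (Rect r : detections.toArray()) {
                Mat subject = frame.submat(r); // candidate region for estimation
                System.out.printf("Candidate subject at (%d, %d) %dx%d%n",
                        r.x, r.y, r.width, r.height);
            }
        }
    }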

7.1 Future Work

7.1.1 Migrating to C++ and OpenCV

While Java and OpenIMAJ have produced a clear and coherent object-oriented solution, a combination of bugs in OpenIMAJ and speed issues with Java indicates that migration to C or C++ with OpenCV would be beneficial.

7.1.2 Dataset Improvements

As mentioned in section 6.1.2 on page 20, there are some issues with the dataset used. A more comprehensive dataset of more than 1,000 images across indoor and outdoor environments, with some subjects wearing masks or other items of clothing that would inhibit other methods of recognition, would be ideal for training a system that works in the real world.

7.1.3 Use of Comparative Labels

While categoric labels have been useful in determining categories for individual body features, the comparative labels described in section 3.3.2 on page 4 could increase recall rates, as sketched below.
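
A minimal sketch of the idea, deriving comparative labels for a single feature from the categoric labels already stored per subject (the class and method names are hypothetical, not part of this project's source):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: derive comparative labels ("less", "same",
    // "greater") for one feature by comparing a probe subject's category
    // against every gallery subject's category.
    public class ComparativeLabeller {
        enum Comparison { LESS, SAME, GREATER }

        static List<Comparison> compareAgainstGallery(int probeCategory,
                                                      int[] galleryCategories) {
            List<Comparison> labels = new ArrayList<>();
            for (int g : galleryCategories) {
                if (probeCategory < g)      labels.add(Comparison.LESS);
                else if (probeCategory > g) labels.add(Comparison.GREATER);
                else                        labels.add(Comparison.SAME);
            }
            return labels;
        }

        public static void main(String[] args) {
            // e.g. a probe in height category 3 against four gallery subjects
            System.out.println(compareAgainstGallery(3, new int[]{2, 3, 5, 4}));
            // -> [GREATER, SAME, LESS, LESS]
        }
    }

Because each subject would then be described by its relations to every other subject rather than by a handful of shared category values, two subjects become far less likely to share identical descriptions.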

7.1.4 Feature Extraction

There exists research on choosing the correct number of principal components for optimal compression [2], [12], including trial and improvement, to adequately represent the human body for these experiments. This was not included in the research presented here because the Radial Basis Function requires at least as many samples as the number of clusters, which limits the number of principal components to the quantity of images in the database. This project uses a constant 100 principal components; improvements may be observed when using more.
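
A minimal sketch of one such stopping rule, cumulative explained variance in the spirit of [2], [12] (the class name and eigenvalue spectrum are illustrative; the eigenvalues would come from the existing PCA step):

    import java.util.Arrays;

    // Keep the smallest k whose cumulative explained variance reaches a
    // threshold; eigenvalues are assumed sorted in descending order.
    public class ComponentSelector {
        static int componentsFor(double[] eigenvalues, double threshold) {
            double total = Arrays.stream(eigenvalues).sum();
            double running = 0;
            for (int k = 0; k < eigenvalues.length; k++) {
                running += eigenvalues[k];
                if (running / total >= threshold) {
                    return k + 1;
                }
            }
            return eigenvalues.length;
        }

        public static void main(String[] args) {
            double[] eig = {9.1, 4.2, 1.9, 0.6, 0.2}; // illustrative spectrum
            System.out.println(componentsFor(eig, 0.95)); // -> 3
        }
    }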

7.1.5 Weighting Functions

As discussed in section 4.1.5 on page 8, neural networks were not implemented due to the complexity of setting them up. There is a realistic probability that neural networks would outperform RBF at increasing noise, lighting, and background invariance, given the expressive power of feed-forward networks with a single hidden layer [9].
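
For illustration only, the forward pass of such a single-hidden-layer network is very small; the weights below are placeholders and would in practice be learnt by backpropagation on the principal-component vectors:

    // Minimal forward pass of a single-hidden-layer feed-forward network
    // mapping an input vector to per-category scores.
    public class FeedForwardSketch {
        final double[][] w1; // hidden-layer weights [hidden][input]
        final double[][] w2; // output-layer weights [output][hidden]

        FeedForwardSketch(double[][] w1, double[][] w2) {
            this.w1 = w1;
            this.w2 = w2;
        }

        double[] forward(double[] x) {
            double[] hidden = new double[w1.length];
            for (int j = 0; j < w1.length; j++) {
                double sum = 0;
                for (int i = 0; i < x.length; i++) sum += w1[j][i] * x[i];
                hidden[j] = 1.0 / (1.0 + Math.exp(-sum)); // sigmoid activation
            }
            double[] out = new double[w2.length];
            for (int k = 0; k < w2.length; k++) {
                for (int j = 0; j < hidden.length; j++) out[k] += w2[k][j] * hidden[j];
            }
            return out; // arg-max over the scores gives the predicted category
        }

        public static void main(String[] args) {
            double[][] w1 = {{0.5, -0.2}, {0.1, 0.4}}; // 2 hidden units, 2 inputs
            double[][] w2 = {{1.0, -1.0}};             // 1 category score
            System.out.println(new FeedForwardSketch(w1, w2)
                    .forward(new double[]{0.3, 0.7})[0]);
        }
    }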


Bibliography

[1] Niyati Chhaya and Tim Oates. Joint inference of soft biometric features. In Biometrics (ICB), 2012 5th IAPR International Conference on, pages 466–471. IEEE, 2012.

[2] Ralph B. D'Agostino and Heidy K. Russell. Scree Test. John Wiley & Sons, Ltd, 2005.

[3] S. Denman, C. Fookes, A. Bialkowski, and S. Sridharan. Soft-biometrics: Unconstrained authentication in a surveillance environment. In Digital Image Computing: Techniques and Applications, 2009. DICTA '09., pages 196–203, Dec 2009.

[4] Xin Geng, Zhi-Hua Zhou, and Kate Smith-Miles. Automatic age estimation based on facial aging patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(12):2234–2240, 2007.

[5] Michela Goffredo, Imed Bouchrika, John N Carter, and Mark S Nixon. Performance analysis for gait in camera networks. In Proceedings of the 1st ACM workshop on Analysis and retrieval of events/actions and workflows in video streams, pages 73–80. ACM, 2008.

[6] Guodong Guo and Guowang Mu. Joint estimation of age, gender and ethnicity: CCA vs. PLS. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–6. IEEE, 2013.

[7] H. Han, C. Otto, X. Liu, and A. Jain. Demographic estimation from face images: Human vs. machine performance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PP(99):1–1, 2014.

[8] Jonathon S. Hare, Sina Samangooei, Paul H. Lewis, and Mark S. Nixon. Semantic spaces revisited: Investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, CIVR '08, pages 359–368, New York, NY, USA, 2008. ACM.

[9] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[10] T. Horprasert, D. Harwood, and L. S. Davis. A statistical approach for real-time robust background subtraction and shadow detection. In Proc. IEEE ICCV, volume 99, pages 1–19, 1999.

[11] Mao-Hsiung Hung, Jeng-Shyang Pan, and Chaur-Heh Hsieh. Speed up temporal median filter for background subtraction. In Pervasive Computing Signal Processing and Applications (PCSPA), 2010 First International Conference on, pages 297–300, Sept 2010.

[12] Donald A Jackson. Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology, pages 2204–2214, 1993.

[13] Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Tomoji Toriyama, and Kiyoshi Kogure. Robust foreground extraction technique using Gaussian family model and multiple thresholds. In Yasushi Yagi, SingBing Kang, InSo Kweon, and Hongbin Zha, editors, Computer Vision – ACCV 2007, volume 4843 of Lecture Notes in Computer Science, pages 758–768. Springer Berlin Heidelberg, 2007.

[14] B.F. Klare, M.J. Burge, J.C. Klontz, R.W. Vorder Bruegge, and A.K. Jain. Face recognition performance: Role of demographic information. Information Forensics and Security, IEEE Transactions on, 7(6):1789–1801, Dec 2012.


[15] C. Neil Macrae and Galen V. Bodenhausen. Social cognition: Thinking categorically about others. Annual Review of Psychology, 51(1):93–120, 2000.

[16] Mark Nixon and Alberto S. Aguado. Feature Extraction & Image Processing for Computer Vision, Third Edition. Academic Press, 3rd edition, 2012.

[17] D.A. Reid, M.S. Nixon, and S.V. Stevenage. Soft biometrics; human identification using comparative descriptions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(6):1216–1228, June 2014.

[18] Daniel A. Reid. Human Identification Using Soft Biometrics. PhD thesis, University of Southampton, Apr 2013.

[19] S. Samangooei, Baofeng Guo, and M.S. Nixon. The use of semantic human description as a soft biometric. In Biometrics: Theory, Applications and Systems, 2008. BTAS 2008. 2nd IEEE International Conference on, pages 1–7, Sept 2008.

[20] Sina Samangooei and Mark S. Nixon. Performing content-based retrieval of humans using gait biometrics. In David Duke, Lynda Hardman, Alex Hauptmann, Dietrich Paulus, and Steffen Staab, editors, Semantic Multimedia, volume 5392 of Lecture Notes in Computer Science, pages 105–120. Springer Berlin Heidelberg, 2008.

[21] Ryan J Tibshirani. Fast computation of the median by successive binning. Unpublished manuscript, http://stat.stanford.edu/ryantibs/median, 2008.

[22] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[23] Paul Viola, Michael J Jones, and Daniel Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision, 63(2):153–161, 2005.

[24] Jing Wu, William AP Smith, and Edwin R Hancock. Facial gender classification using shape-from-shading. Image and Vision Computing, 28(6):1039–1048, 2010.

[25] Zhiguang Yang and Haizhou Ai. Demographic classification with local binary patterns. In Advances in Biometrics, pages 464–473. Springer, 2007.


Appendices


Appendix A

Project Management


Figure A.1: Task list of proposed work


Figure A.2: Task list of proposed work


Figure A.3: Gantt chart of proposed work


Figure A.4: Task list of actual work


Figure A.5: Task list of actual work


Figure A.6: Gantt chart of actual work


Appendix B

Results

Average Accuracy Per Algorithm

Feature              Num. Categories   Expectation   RBF      Random   Median
Age                  7                 14.25%        70.75%   12%      35%
Arm Length           5                 20%           79.25%   21%      55%
Arm Thickness        5                 20%           62%      23.75%   55%
Chest                5                 20%           65%      22%      55%
Ethnicity            6                 16.5%         10.5%    26%      10%
Facial Hair Colour   6                 16.5%         87%      60.25%   90%
Facial Hair Length   5                 20%           82.5%    36%      85%
Figure               5                 20%           86.25%   26%      75%
Hair Colour          6                 16.5%         61.75%   23.5%    60%
Hair Length          5                 20%           67.25%   20.75%   60%
Height               5                 20%           75.75%   15%      45%
Hips                 5                 20%           66%      24.25%   55%
Leg Length           5                 20%           66.75%   20.75%   45%
Leg Shape            5                 20%           59.5%    24.75%   50%
Leg Thickness        5                 20%           68.5%    25.75%   60%
Muscle Build         5                 20%           72.25%   22.25%   50%
Neck Length          5                 20%           75.5%    19.5%    45%
Neck Thickness       5                 20%           78.5%    22.5%    65%
Proportions          2                 50%           85%      95%      95%
Sex                  2                 50%           80%      15%      85%
Shoulder Shape       5                 20%           72.5%    20%      70%
Skin Colour          4                 25%           55.25%   47%      65%
Weight               5                 20%           75%      23.75%   59.25%

Table B.1: Per-algorithm accuracy for correctly estimating each feature on a human body. Expectation is the expected accuracy for random guessing.


[Chart: Feature Recognition Rates for RBF Training; axis: Percentage Correct (0-100)]

(a) Accuracy of RBF on 116 subjects

[Chart: Feature Recognition Rates for Random Guessing; axis: Percentage Correct (0-100)]

(b) Accuracy of Random Guessing on 116 subjects


[Chart: Feature Recognition Rates for Median Guessing; axis: Percentage Correct (0-100)]

(c) Accuracy of Median Guessing on 116 subjects

Figure B.1: Correct classification percentages for each biometric feature for the preferred method (RBF), and two guessing algorithms for comparison.


(a) (b) (c) (d)

Feature              Image A         Image B         Image C        Image D     Target
Age                  Middle Aged     Middle Aged     Young Adult*   Adult       Young Adult
Arm Length           Long            Long            Long           Long        Average
Arm Thickness        Thick           Thick           Average*       Thick       Average
Chest                Very Slim       Slim*           Slim*          Slim*       Slim
Ethnicity            Other           European*       European*      European*   European
Facial Hair Colour   None            None            None           None        Brown
Facial Hair Length   Stubble*        None            None           None        Stubble
Figure               Average         Average         Small*         Small*      Small
Hair Colour          Grey            Grey            Blond*         Grey        Blond
Hair Length          Medium*         Medium*         Short          Short       Medium
Height               Tall            Tall            Tall           Tall        Average
Hips                 Average*        Average*        Average*       Average*    Average
Leg Length           Average         Short*          Average        Average     Short
Leg Shape            Very Straight   Very Straight   Straight       Straight    Average
Leg Thickness        Average*        Average*        Average*       Average*    Average
Muscle Build         Muscly          Muscly          Average*       Average*    Average
Neck Length          Long*           Long*           Long*          Long*       Long
Neck Thickness       Thick           Thick           Average*       Average*    Average
Proportions          Average*        Average*        Average*       Average*    Average
Sex                  Male*           Male*           Male*          Male*       Male
Shoulder Shape       Average         Average         Average        Average     Rounded
Skin Colour          Tanned          Tanned          Tanned         Tanned      White
Weight               Average*        Fat             Average*       Average*    Average

(e) Per-feature estimations for each of the above images, and a self-estimated target. Asterisks mark correct estimations.

Figure B.2: Estimating demographics with images taken outdoors on an unseen subject


Appendix C

Design Archive

Table of Contents

cache

Pre-built cache files for the PrincipleComponentExtractor class and TrainingSet class.

db

Populated GaitAnnotate database.

heuristics*

An exemplary set of training weights for the RadialBasisFunctionTrainer. Heuristics marked with numerical ranges indicate the particular queries they were used for in the results presented in this document, for the purpose of repeatability.

queries

Images used to test the robustness of the solution.

scripts

Database table generation code.

src

Full source code in Java.

tests

Windows Batch files for invoking the QueryEngine class and other testing classes for quick performance analysis.

trainingdata

Images used for training and testing.


Appendix D

Project Brief


Wally - A System to Identify Criminal Suspects by Generating Labels from Video Footage and Still Images

Christopher Watts - cw17g12
Supervised by Mark Nixon

October 10, 2014

1 Problem Overview

Traditionally in law enforcement, an image of a criminal suspect is cross-referenced against databases such as the Passport database for information. However, it is becoming increasingly common for well-organised criminals to use fake identification or, for foreign criminals, no identification at all.

The proposition of this project is to create a system that, given an image or a video of a criminal suspect, will identify metrics unique to the person from sets of comparative and categorical labels, for example: height, length of forearms, width of shoulders. Comparative labels will be used over absolute labels (e.g. 'taller than' rather than roughly '5'9"') because of observed accuracy benefits.

From this information, it should be possible to scan CCTV footage for pedestrians whose labels match (within a certain error margin) those of the suspect, so law enforcement can track the movements of the suspect and potentially reveal who they really are.

2 Goals

The aim of the project is to assist law enforcement in finding fugitives and criminals in video footage who are not initially identifiable by traditional techniques such as face and voice recognition.

Some current examples include finding "Jihadi John," responsible for the murders of, at the time of writing, four British and American nationals in Syria, and Op Trebia, wanted for terrorism by the Metropolitan Police since 2012.

3 Scope

The application domain of this project is rather large: ideally, it would work on any CCTV footage from any angle. However, due to time constraints on the project, some limitations will be imposed:

• The system will use computer vision to generate comparative labels for subjects whose full body is visible in either still or video imagery.

• Initially, all footage must be front body view, front facial view or side body view at a constant elevation and angle with the same lighting.

• When analysing footage of the full body, the subject may be masked.

• The system will use machine learning to train the label generator on a limited set of subjects from the Soton HiD gait database with known comparative and categorical labels.

• The system will match subsequent footage of a suspect to the most likely subjects known by the system, ranked by certainty.
