Handbook of Pattern Recognition and Computer Vision


  • Handbook of Pattern Recognition and Computer Vision

    3rd Edition

    editors

    C. H. Chen, University of Massachusetts Dartmouth, USA

    P. S. P. Wang, Northeastern University, USA

    World Scientific
    NEW JERSEY LONDON SINGAPORE BEIJING SHANGHAI HONG KONG TAIPEI CHENNAI

  • Published by

World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

    British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

    First published 2005 Reprinted 2006

    HANDBOOK OF PATTERN RECOGNITION & COMPUTER VISION (3rd Edition) Copyright 2005 by World Scientific Publishing Co. Pte. Ltd.

    All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

    For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

    ISBN 981-256-105-6

    Printed in Singapore by Mainland Press

  • Preface to the Third Edition

Dedicated to the memory of the late Professor King Sun Fu (1930-1985), the handbook series, with its first edition (1993), second edition (1999) and third edition (2005), provides a comprehensive, concise and balanced coverage of the progress and achievements in the field of pattern recognition and computer vision over the last twenty years. This is a highly dynamic field which has been expanding greatly over the last thirty years. No handbook can cover the essence of all aspects of the field, and we have not attempted to do that. The carefully selected 33 chapters in the current edition were written by leaders in the field, and we believe that the book and its sister volumes, the first and second editions, will provide the growing pattern recognition and computer vision community with a set of valuable resource books that can last for a long time. Each chapter speaks for itself on the importance of the subject area covered.

The book continues to contain five parts. Part 1 is on the basic methods of pattern recognition. Though there are only five chapters, readers may find other coverage of basic methods in the first and second editions. Part 2 is on basic methods in computer vision. Again, readers may find that Part 2 complements well what was offered in the first and second editions. Part 3, on recognition applications, continues to emphasize character recognition and document processing. It also presents new applications in digital mammograms, remote sensing images and functional magnetic resonance imaging data. One intensively explored area of pattern recognition applications is currently the personal identification problem, also called biometrics, though the problem has been around for a number of years. Part 4 is especially devoted to this topic area. Indeed, the chapters in both Part 3 and Part 4 represent the growing importance of applications in pattern recognition. In fact, Prof. Fu had envisioned the growth of pattern recognition applications as early as the 1960's. He and his group at Purdue worked on character recognition, speech recognition, fingerprint recognition, seismic pattern recognition, and biomedical and remote sensing recognition problems, among others. Part 5, on system and technology, presents other important aspects of pattern recognition and computer vision.

Our sincere thanks go to all contributors of this volume for their outstanding technical contributions. We would like to mention specially Dr. Quang-Tuan Luong, Dr. Giovanni Garibotto and Prof. Ching Y. Suen for their original contributions to all three volumes. Other authors who have contributed to all three volumes are: Prof. Thomas S. Huang, Prof. J.K. Aggarwal, Prof. Yun Y. Tang, Prof. C.C. Li, Prof. R. Chellappa and Prof. P.S.P. Wang. We are pleased to mention that Prof. Thomas Huang and Prof. Jake Aggarwal are the recipients, in 2002 and 2004 respectively, of the prestigious K.S. Fu Prize sponsored by the International Association for Pattern Recognition (IAPR). Among Prof. Fu's Ph.D. graduates at Purdue who have contributed to the handbook series are: C.H. Chen (1965), M.H. Loew (1972), S.M. Hsu (1975), S.Y. Lu (1977), K.Y. Huang (1983) and H.D. Cheng (1985). Finally, we would like to pay tribute to the late Prof. Azriel Rosenfeld (1931-2004) who, as one IAPR member put it, was a true scientist and a great giant in the field. He was awarded the K.S. Fu Prize by IAPR in 1988. Readers are reminded to read Prof. Rosenfeld's inspirational article "Vision - Some Speculations", which appeared as the Foreword of the second edition of the handbook series. Prof. Rosenfeld's profound influence on the field will be felt for many years to come.


The camera-ready manuscript production requires a certain amount of additional effort, as compared to typeset printing, on the part of editors and authors. We would like to thank all contributors for their patience in making the necessary revisions to comply with the format requirements during this long process of manuscript preparation. Our special thanks go to Steven Patt, in-house editor of World Scientific Publishing, for his efficient efforts in making timely publication of the book possible.

    September, 2004 The Editors

  • Contents

    Preface to the Third Edition v

    Contents vii

    Part 1. Basic Methods in Pattern Recognition 1

    Chapter 1.1 Statistical Pattern Recognition 3 R.P.W. Duin and D.M.J. Tax

    Chapter 1.2 Hidden Markov Models for Spatio-Temporal Pattern Recognition 25

    Brian C. Lovell and Terry Caelli

    Chapter 1.3 A New Kernel-Based Formalization of Minimum Error Pattern Recognition 41

Erik McDermott and Shigeru Katagiri

    Chapter 1.4 Parallel Contextual Array Grammars with Trajectories 55 P. Helen Chandra, C. Martin-Vide, K.G. Subramanian, D.L. Van and P. S. P. Wang

    Chapter 1.5 Pattern Recognition with Local Invariant Features 71 C. Schmid, G. Dorko, S. Lazebnik, K. Mikolajczyk and J. Ponce

    Part 2. Basic Methods in Computer Vision 93

    Chapter 2.1 Case-Based Reasoning for Image Analysis and Interpretation 95 Petra Perner

    Chapter 2.2 Multiple Image Geometry - A Projective Viewpoint 115 Quang-Tuan Luong

Chapter 2.3 Skeletonization in 3D Discrete Binary Images 137 Gabriella Sanniti di Baja and Ingela Nyström

    Chapter 2.4 Digital Distance Transforms in 2D, 3D, and 4D 157 Gunilla Borgefors

    Chapter 2.5 Computing Global Shape Measures 177 Paul L. Rosin


Chapter 2.6 Texture Analysis with Local Binary Patterns 197 Topi Mäenpää and Matti Pietikäinen

    Part 3. Recognition Applications 217

    Chapter 3.1 Document Analysis and Understanding 219 Yuan Yan Tang

    Chapter 3.2 Chinese Character Recognition 241 Xiaoqing Ding

    Chapter 3.3 Extraction of Words from Handwritten Legal Amounts on Bank Cheques 259

    In Cheol Kim and Ching Y. Suen

    Chapter 3.4 OCR Assessment of Printed-Fonts for Enhancing Human Vision 273

    Ching Y. Suen, Qizhi Xu and Cedric Devoghelaere

    Chapter 3.5 Clustering and Classification of Web Documents Using a Graph Model 287

    Adam Schenker, Horst Bunke, Mark Last and Abraham Kandel

    Chapter 3.6 Automated Detection of Masses in Mammograms 303 H.D. Cheng, X.J. Shi, R. Min, X.P. Cai and H.N. Du

    Chapter 3.7 Wavelet-Based Kalman Filtering in Scale Space for Image Fusion 325

    Hsi-Chin Hsin and Ching-Chung Li

    Chapter 3.8 Multisensor Fusion with Hyperspectral Imaging Data: Detection and Classification 347

    Su May Hsu and Hsiao-hua Burke

    Chapter 3.9 Independent Component Analysis of Functional Magnetic Resonance Imaging Data 365

V.D. Calhoun and B. Hong

    Part 4. Human Identification 385

    Chapter 4.1 Multimodal Emotion Recognition 387 Nicu Sebe, Ira Cohen and Thomas S. Huang

    Chapter 4.2 Gait-Based Human Identification from a Monocular Video Sequence 411

Amit Kale, Aravind Sundaresan, Amit K. Roy-Chowdhury and Rama Chellappa


    Chapter 4.3 Palmprint Authentication System 431 David Zhang

    Chapter 4.4 Reconstruction of High-Resolution Facial Images for Visual Surveillance 445

Jeong-Seon Park and Seong-Whan Lee

    Chapter 4.5 Object Recognition with Deformable Feature Graphs: Faces, Hands, and Cluttered Scenes 461

    Jochen Triesch and Christian Eckes

    Chapter 4.6 Hierarchical Classification and Feature Reduction for Fast Face Detection 481

    Bernd Heisele, Thomas Serre, Sam Prentice and Tomaso Poggio

    Part 5. System and Technology 497

    Chapter 5.1 Tracking and Classifying Moving Objects Using Single or Multiple Cameras 499

Quming Zhou and J.K. Aggarwal

    Chapter 5.2 Performance Evaluation of Image Segmentation Algorithms 525 Xiaoyi Jiang

Chapter 5.3 Contents-Based Video Analysis for Knowledge Discovery 543 Chia-Hung Yeh, Shih-Hung Lee and C.-C. Jay Kuo

    Chapter 5.4 Object-Process Methodology and Its Applications to Image Processing and Pattern Recognition 559

    Dov Dori

Chapter 5.5 Musical Style Recognition: A Quantitative Approach 583 Peter van Kranenburg and Eric Backer

Chapter 5.6 Auto-Detector: Mobile Automatic Number Plate Recognition 601 Giovanni B. Garibotto

    Chapter 5.7 Omnidirectional Vision 619 Hiroshi Ishiguro

    Index 629

  • CHAPTER 1.1

    STATISTICAL PATTERN RECOGNITION

    R.P.W. Duin, D.M.J. Tax
    Information and Communication Theory Group
    Faculty of Electrical Engineering, Mathematics and Computer Science
    Delft University of Technology
    P.O. Box 5031, 2600 GA, Delft, The Netherlands
    E-mail: {R.P.W.Duin,D.M.J.Tax}@ewi.tudelft.nl

A review is given of the area of statistical pattern recognition: the representation of objects and the design and evaluation of trainable systems for generalization. Traditional as well as more recently studied procedures are reviewed, such as the classical Bayes classifiers, neural networks, support vector machines, one-class classifiers and combining classifiers. Further, methods for feature reduction and error evaluation are introduced. New developments in statistical pattern recognition are briefly discussed.

1. Introduction

Statistical pattern recognition is the research area that studies statistical tools for the generalization of sets of real world objects or phenomena. It thereby aims to find procedures that answer questions like: does this new object fit into the pattern of a given set of objects, or: to which of the patterns defined in a given set does it fit best? The first question is related to cluster analysis, but is also discussed from some perspective in this chapter. The second question is on pattern classification and that is what will be the main concern here.

The overall structure of a pattern recognition system may be summarized as in Figure 1. Objects first have to be appropriately represented before a generalization can be derived. Depending on the demands of the procedures used for this, the representation may have to be adapted, e.g. transformed, scaled or simplified.

The procedures discussed in this chapter are partially also studied in areas like statistical learning theory32, machine learning25 and neural networks14. As the emphasis in pattern recognition is close to application areas, questions related to the representation of the objects are important here: how are objects described (e.g. features, distances to prototypes), how extensive may this description be, and what are the ways to incorporate knowledge from the application domain? Representations have to be adapted to fit the tools that are used later. Simplifications of representations


    like feature reduction and prototype selection should thereby be considered. In order to derive, from a training set, a classifier that is valid for new objects

(i.e. that is able to generalize), the representation should fulfill an important condition: representations of similar real world objects have to be similar as well. The representations should be close. This is the so-called compactness hypothesis2 on which the generalization from examples to new, unseen objects is built. It enables the estimation of their class labels on the basis of distances to examples or on class densities derived from examples.

Objects are traditionally represented by vectors in a feature space. An important recent development for incorporating domain knowledge is the representation of objects by their relation to other objects. This may be done by a so-called kernel method29, derived from features, or directly on dissimilarities computed from the raw data26.

We will assume that, after processing the raw measurements, objects are given in a p-dimensional vector space Ω. Traditionally this space is spanned by p features, but the dissimilarities with p prototype objects may also be used. To simplify the discussion we will use the term feature space for both. If K is the number of classes to be distinguished, a pattern classification system, or shortly a classifier C(x), is a function or a procedure that assigns to each object x in Ω a class ω_c, with c = 1, ..., K. Such a classifier has to be derived from a set of examples X_tr = {x_i, i = 1, ..., N} of known classes y_i. X_tr will be called the training set and y_i = ω_c, c = 1, ..., K, a label. Unless otherwise stated it is assumed that y_i is unique (objects belong to just a single class) and is known for all objects in X_tr.
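As a concrete illustration of this notation, a training set X_tr with labels y_i can be held in NumPy arrays. This is a minimal sketch; the toy values below are assumptions for illustration only:

```python
import numpy as np

# Hypothetical toy training set: N = 6 objects in a p = 2 dimensional
# feature space; y_tr holds the labels y_i (class omega_1 or omega_2,
# encoded here as 1 and 2).
X_tr = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],   # class omega_1
                 [1.0, 1.1], [0.9, 1.2], [1.2, 0.8]])  # class omega_2
y_tr = np.array([1, 1, 1, 2, 2, 2])

N, p = X_tr.shape         # number of training objects, dimensionality
K = len(np.unique(y_tr))  # number of classes to distinguish
```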

In section 2 training procedures will be discussed to derive classifiers C(x) from training sets. The performance of these classifiers is usually not just related to the quality of the features (their ability to show class differences) but also to their number, i.e. the dimensionality of the feature space. A growing number of features

    [Figure 1: the overall structure of the pattern recognition system, with stages for representation, adaptation, generalization and evaluation, update loops between them, feature extraction and prototype selection, training sets of objects, and outputs such as class labels, confidences, classifiers and class models.]

    Fig. 1. The pattern recognition system.


may increase the class separability, but may also decrease the statistical accuracy of the training procedure. It is thereby important to have a small number of good features. In section 3 a review is given of ways to reduce the number of features, by selection or by combination (so-called feature extraction). The evaluation of classifiers, discussed in section 4, is an important topic. As the characteristics of new applications are often unknown beforehand, the best algorithms for feature reduction and classification have to be found iteratively on the basis of unbiased and accurate testing procedures.

This chapter builds further on earlier reviews of the area of statistical pattern recognition by Fukunaga12 and by Jain et al.16 It is inevitable to partly repeat and summarize them. We will, however, also discuss some new directions such as one-class classifiers, combining classifiers, dissimilarity representations, and techniques for building good classifiers while simultaneously reducing the feature space. In the last section of this chapter, the discussion, we will return to these new developments.

    2. Classifiers

    For the development of classifiers, we have to consider two main aspects: the basic assumptions that the classifier makes about the data (which results in a functional form of the classifier), and the optimization procedure to fit the model to the training data. It is possible to consider very complex classifiers, but without efficient methods to fit these classifiers to the data, they are not useful. Therefore, in many cases the functional form of the classifier is restricted by the available optimization routines.

We will start by discussing the two-class classification problem. In the first three sections, 2.1, 2.2 and 2.3, the three basic approaches with their assumptions are given: first, modeling the class posteriors; second, modeling class conditional probabilities; and finally, modeling the classification boundary. In section 2.4 we discuss how these approaches can be extended to work for more than two classes. In the next section, the special case is considered where just one of the classes is reliably sampled. The last section, 2.6, discusses the possibilities to combine several (non-optimal) classifiers.

2.1. Bayes classifiers and approximations

    A classifier should assign a new object x to the most likely class. In a probabilistic setting this means that the label of the class with the highest posterior probability should be chosen. This class can be found when p(ω1|x) and p(ω2|x) (for a two-class classification problem) are known. The classifier becomes:

    if p(ω1|x) > p(ω2|x) assign object x to ω1, otherwise to ω2. (1)

    When we assume that p(ω1|x) and p(ω2|x) are known, and further assume that misclassifying an object originating from ω1 as ω2 is as costly as vice versa, classifier (1) is the theoretically optimal classifier and will make the minimum error. This classifier is called the Bayes optimal classifier.


In practice p(ω1|x) and p(ω2|x) are not known; only samples x_i are available, and the misclassification costs might be known only in approximation. Therefore approximations to the Bayes optimal classifier have to be made. This classifier can be approximated in several different ways, depending on knowledge of the classification problem.

    The first way is to approximate the class posterior probabilities p(ω_c|x). The logistic classifier assumes a particular model for the class posterior probabilities:

    p(ω1|x) = 1 / (1 + exp(w^T x)),   p(ω2|x) = 1 - p(ω1|x), (2)

    where w is a p-dimensional weight vector. This basically implements a linear classifier in the feature space.

An approach to fit this logistic classifier (2) to training data X_tr is to maximize the data likelihood L:

    L = ∏_i p(ω1|x_i)^(n1(x_i)) p(ω2|x_i)^(n2(x_i)), (3)

    where n_c(x) is 1 if object x belongs to class ω_c, and 0 otherwise. This can be done by, for instance, an iterative gradient ascent method. Weights are iteratively updated using:

    w_new = w_old + η ∂L/∂w, (4)

    where η is a suitably chosen learning rate parameter. In Ref. 1 the first (and second) derivative of L with respect to w are derived and can be plugged into (4).
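A minimal sketch of the logistic classifier (2) fitted by gradient ascent as in (4). The appended constant input (to absorb a bias term), the learning rate, the iteration count and the toy data are all illustrative assumptions, not part of the chapter's formulation:

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, n_iter=500):
    """Maximize the likelihood (3) of the logistic model (2) via the
    gradient-ascent update (4). y holds 1 for omega_1 and 2 for omega_2.
    A constant 1 is appended to each x to absorb a bias term (an added
    assumption; eq. (2) itself has no explicit bias)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    t = (y == 1).astype(float)      # the indicator n_1(x_i) of (3)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p1 = 1.0 / (1.0 + np.exp(Xb @ w))   # p(omega_1|x) as in (2)
        # With the sign convention of (2), the log-likelihood gradient
        # is sum_i (p1_i - t_i) x_i; step along it (ascent).
        w = w + eta * Xb.T @ (p1 - t)
    return w

def predict_logistic(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    p1 = 1.0 / (1.0 + np.exp(Xb @ w))
    return np.where(p1 > 0.5, 1, 2)         # decision rule (1)

# Hypothetical linearly separable toy data.
X = np.array([[-2.0, 0.5], [-1.0, -0.5], [-1.5, 1.0],
              [2.0, -0.5], [1.0, 0.5], [1.5, -1.0]])
y = np.array([1, 1, 1, 2, 2, 2])
w = fit_logistic(X, y)
```

On this separable toy set the fitted weights reproduce the training labels exactly.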

    2.2. Class densities and Bayes rule

Assumptions on p(ω|x) are often difficult to make. Sometimes it is more convenient to make assumptions on the class conditional probability densities p(x|ω): they indicate the distribution of the objects drawn from one of the classes. When assumptions on these distributions can be made, classifier (1) can be derived using Bayes' decision rule:

    p(ω|x) = p(x|ω) p(ω) / p(x). (5)

    This rule basically rewrites the class posterior probabilities in terms of the class conditional probabilities and the class priors p(ω). This result can be substituted into (1), resulting in the following form:

    if p(x|ω1) p(ω1) > p(x|ω2) p(ω2) assign x to ω1, otherwise to ω2. (6)

    The term p(x) is ignored because it is constant for a given x. Any monotonically increasing function can be applied to both sides without changing the final decision. In some cases a suitable choice will simplify the notation significantly. In particular,


    using a logarithmic transformation can simplify the classifier when functions from the exponential family are used.

For the special case of a two-class problem the classifiers can be rewritten in terms of a single discriminant function f(x) which is the difference between the left hand side and the right hand side. A few possibilities are:

    f(x) = p(ω1|x) - p(ω2|x), (7)
    f(x) = p(x|ω1) p(ω1) - p(x|ω2) p(ω2), (8)
    f(x) = ln(p(x|ω1)/p(x|ω2)) + ln(p(ω1)/p(ω2)). (9)

    The classifier becomes:

    if f(x) > 0 assign x to ω1, otherwise to ω2. (10)

    In many cases fitting p(x|ω) on training data is relatively straightforward. It is the standard density estimation problem: fit a density on a data sample. To estimate each p(x|ω) only the objects from the single class ω are used.

Depending on the functional form of the class densities, different classifiers are constructed. One of the most common approaches is to assume a Gaussian density for each of the classes:

    p(x|ω) = N(x; μ, Σ) = (2π)^(-p/2) |Σ|^(-1/2) exp(-½ (x - μ)^T Σ^(-1) (x - μ)), (11)

    where μ is the (p-dimensional) mean of the class ω, and Σ is the covariance matrix. Further, |Σ| indicates the determinant of Σ and Σ^(-1) its inverse. For the explicit values of the parameters μ and Σ usually the maximum likelihood estimates are plugged in; therefore this classifier is called the plug-in Bayes classifier. Extra complications occur when the sample size N is insufficient to (in particular) compute Σ^(-1). In these cases a standard solution is to regularize the covariance matrix such that the inverse can be computed:

    Σ_λ = Σ + λI, (12)

    where I is the p × p identity matrix, and λ is the regularization parameter that sets the trade-off between the estimated covariance matrix and the regularizer I.

Substituting (11) for each of the classes ω1 and ω2 (with their estimated μ1, μ2 and Σ1, Σ2) into (9) results in:

    f(x) = ½ x^T (Σ2^(-1) - Σ1^(-1)) x + (μ1^T Σ1^(-1) - μ2^T Σ2^(-1)) x - ½ μ1^T Σ1^(-1) μ1 + ½ μ2^T Σ2^(-1) μ2 - ½ ln|Σ1| + ½ ln|Σ2| + ln(p(ω1)/p(ω2)). (13)

    This classifier rule is quadratic in terms of x, and it is therefore called the normal-based quadratic classifier.

    For the quadratic classifier a full covariance matrix has to be estimated for each of the classes. In high dimensional feature spaces it can happen that insufficient


data is available to estimate these covariance matrices reliably. By restricting the covariance matrices to have fewer free variables, the estimates can become more reliable. One approach to reduce the number of parameters is to assume that both classes have an identical covariance structure: Σ1 = Σ2 = Σ. The classifier then simplifies to:

f(x) = (μ1 - μ2)^T Σ^(-1) x - ½ μ1^T Σ^(-1) μ1 + ½ μ2^T Σ^(-1) μ2 + ln(p(ω1)/p(ω2)). (14)

    Because this classifier is linear in terms of x, it is called the normal-based linear classifier.
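The plug-in estimates for (11), the regularization (12) and the discriminant comparison behind (13) can be sketched as below. The toy data and the value of λ are assumptions, and the quadratic rule is evaluated through the log-density difference of (9) rather than the expanded form (13), which is mathematically equivalent:

```python
import numpy as np

def fit_gaussian(Xc, lam=1e-3):
    """Maximum-likelihood plug-in estimates of mu and Sigma for one class,
    with the regularization Sigma + lambda*I of eq. (12)."""
    mu = Xc.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False) + lam * np.eye(Xc.shape[1])
    return mu, Sigma

def log_density(x, mu, Sigma):
    """log N(x; mu, Sigma), the logarithm of eq. (11)."""
    p = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * p * np.log(2 * np.pi) - 0.5 * logdet
            - 0.5 * diff @ np.linalg.solve(Sigma, diff))

def quadratic_classifier(x, cls1, cls2, prior1=0.5, prior2=0.5):
    """Evaluate f(x) as in (9)/(13) and apply decision rule (10)."""
    f = (log_density(x, *cls1) + np.log(prior1)
         - log_density(x, *cls2) - np.log(prior2))
    return 1 if f > 0 else 2

# Hypothetical toy classes.
X1 = np.array([[0.0, 0.0], [0.4, 0.2], [0.2, 0.5], [-0.3, 0.1]])
X2 = np.array([[3.0, 3.0], [3.4, 2.8], [2.8, 3.3], [3.1, 2.6]])
cls1, cls2 = fit_gaussian(X1), fit_gaussian(X2)
```

The linear classifier (14) corresponds to the special case where one pooled covariance estimate is shared by both classes.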

For the linear and the quadratic classifier strong class distributional assumptions are made: each class has a Gaussian distribution. In many applications this cannot be assumed, and more flexible class models have to be used. One possibility is to use a 'non-parametric' model. An example is the Parzen density model. Here the density is estimated by summing local kernels with a fixed size h which are centered on each of the training objects:

    p(x|ω) = (1/N) Σ_i N(x; x_i, hI), (15)

    where I is the identity matrix and h is the width parameter which has to be optimized. By substituting (15) into (6), the Parzen classifier is defined. The only free parameter in this classifier is the size (or width) h of the kernel. Optimizing this parameter by maximizing the likelihood on the training data will result in the solution h = 0. To avoid this, a leave-one-out procedure can be used9.

    2.3. Boundary methods

Density estimation in high dimensional spaces is difficult. In order to have a reliable estimate, large amounts of training data should be available. Unfortunately, in many cases the number of training objects is limited. Therefore it is not always wise to estimate the class distributions completely. Looking at (1), (6) and (10), it is only of interest which class is to be preferred over the other. This problem is simpler than estimating p(x|ω). For a two-class problem, just a function f(x) is needed which is positive for objects of ω1 and negative otherwise. In this section we will list some classifiers which avoid estimating p(x|ω) but try to obtain a suitable f(x) directly.

The Fisher classifier searches for a direction w in the feature space such that the two classes are separated as well as possible. The degree to which the two classes are separated is measured by the so-called Fisher ratio, or Fisher criterion:

    J = (m1 - m2)² / (s1² + s2²). (16)

    Here m1 and m2 are the means of the two classes, projected onto the direction w: m1 = w^T μ1 and m2 = w^T μ2. The s1² and s2² are the variances of the two classes projected onto w. The criterion therefore favors directions in which the means are far apart and the variances are small.


This Fisher ratio can be explicitly rewritten in terms of w. First we rewrite s_c² = Σ_{x∈ω_c} (w^T x - w^T μ_c)² = Σ_{x∈ω_c} w^T (x - μ_c)(x - μ_c)^T w = w^T S_c w. Second we write (m1 - m2)² = (w^T μ1 - w^T μ2)² = w^T (μ1 - μ2)(μ1 - μ2)^T w = w^T S_B w. The term S_B is also called the between scatter matrix. J becomes:

    J = (m1 - m2)² / (s1² + s2²) = w^T S_B w / (w^T S_1 w + w^T S_2 w) = w^T S_B w / w^T S_W w, (17)

    where S_W = S_1 + S_2 is also called the within scatter matrix. In order to optimize (17), we set the derivative of (17) to zero and obtain:

    (w^T S_B w) S_W w = (w^T S_W w) S_B w. (18)

    We are interested in the direction of w and not in the length, so we drop the scalar terms between brackets. Further, from the definition of S_B it follows that S_B w is always in the direction μ1 - μ2. Multiplying both sides of (18) by S_W^(-1) gives:

    w ∝ S_W^(-1) (μ1 - μ2). (19)

    This classifier is known as the Fisher classifier. Note that the threshold b is not defined for this classifier. It is also linear and requires the inversion of the within scatter matrix S_W. This formulation yields an identical shape of w as the expression in (14), although the classifiers use very different starting assumptions!
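A sketch of computing the Fisher direction (19) on toy data (the data are assumed; since the threshold b is not defined by (19), the check below only concerns the projection, and a midpoint between the projected means would be one common ad hoc threshold choice):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w proportional to S_W^(-1) (mu_1 - mu_2), eq. (19)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)   # scatter matrix of class omega_1
    S2 = (X2 - mu2).T @ (X2 - mu2)   # scatter matrix of class omega_2
    Sw = S1 + S2                     # within scatter matrix
    return np.linalg.solve(Sw, mu1 - mu2)

# Hypothetical toy classes.
X1 = np.array([[0.0, 0.2], [0.5, -0.1], [-0.2, 0.4], [0.3, 0.1]])
X2 = np.array([[2.0, 2.2], [2.5, 1.9], [1.8, 2.4], [2.3, 2.1]])
w = fisher_direction(X1, X2)
m1, m2 = w @ X1.mean(axis=0), w @ X2.mean(axis=0)  # projected means
```

Because S_W is positive definite here, the projected mean of ω1 always ends up on the positive side of the projected mean of ω2 along w.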

Most classifiers which have been discussed so far have a very restricted form of their decision boundary. In many cases these boundaries are not flexible enough to follow the true decision boundaries. A flexible method is the k-nearest neighbor rule. This classifier looks locally at which labels are most dominant in the training set. First it finds the k nearest objects in the training set, N_k(x), and then counts how many of these neighbors, n1 and n2, come from class ω1 and class ω2 respectively:

    if n1 > n2 assign x to ω1, otherwise to ω2. (20)

    Although the training of the k-nearest neighbor classifier is trivial (it only has to store all training objects; k can simply be optimized by a leave-one-out estimation), it may become expensive to classify a new object x. For this the distances to all training objects have to be computed, which may be prohibitive for large training sets and high dimensional feature spaces.
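A sketch of rule (20); the Euclidean distance, the value of k and the toy data are assumptions made for illustration:

```python
import numpy as np

def knn_classify(x, X_tr, y_tr, k=3):
    """Find the k nearest training objects and apply rule (20):
    assign x to omega_1 if n_1 > n_2, otherwise to omega_2."""
    dist = np.linalg.norm(X_tr - x, axis=1)      # distances to all objects
    nearest_labels = y_tr[np.argsort(dist)[:k]]  # labels in N_k(x)
    n1 = np.sum(nearest_labels == 1)
    n2 = k - n1
    return 1 if n1 > n2 else 2

# Hypothetical toy training set.
X_tr = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                 [2.0, 2.0], [2.2, 1.9], [1.9, 2.1]])
y_tr = np.array([1, 1, 1, 2, 2, 2])
```

Note that every classification recomputes all N distances, which illustrates the cost remark above.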

Another classifier which is flexible but does not require the storage of the full training set is the multi-layered feed-forward neural network4. A neural network is a collection of small processing units, called neurons, which are interconnected by weights w and v to form a network. A schematic picture is shown in Figure 2. An input object x is processed through different layers of neurons, from the hidden layer to the output layer. With the sigmoid transfer function g(a) = 1/(1 + exp(-a)), the hidden layer output is h, with h_k = g(w_k^T x), and the output of the j-th output neuron becomes:

    o_j(x) = g(v_j^T h), (21)

    (see Figure 2 for the meaning of the variables). The object x is now assigned to the class j for which the corresponding output neuron has the highest output o_j.

    [Figure 2: a feed-forward network in which the inputs x1, x2, ..., xp are connected through a hidden layer to the output neurons.]

    Fig. 2. Schematic picture of a neural network.

To optimize this neural network, the squared error between the network output and the desired class label is defined:

    E = Σ_{i=1}^{N} Σ_{j=1}^{K} (n_j(x_i) - o_j(x_i))², (22)

    where n_j(x) is 1 if object x belongs to class ω_j, and 0 otherwise. To simplify the notation, we will combine all the weights w and v into one weight vector w.

This error E is a continuous function of the weights w, and the derivative of E with respect to these weights can easily be calculated. The weights of the neural network can therefore be optimized to minimize the error by gradient descent, analogous to (4):

    w_new = w_old - η ∂E/∂w, (23)

    where η is the learning parameter. After expanding this learning rule (23), it appears that the weight updates for each layer of neurons can be computed by back-propagating the error which is computed at the output of the network, (n_j(x_i) - o_j(x_i)). This is therefore called the back-propagation update rule.

The advantage of this type of neural network is that it is flexible, and that it can be trained using these update rules. The disadvantages are that there are many important parameters to be chosen beforehand (the number of layers, the number of neurons per layer, the learning rate, the number of training updates, etc.), and that the optimization can be extremely slow. To increase the training speed, several additions and extensions have been proposed, for instance the inclusion of momentum terms in (23), or the use of second order methods.

Neural networks can easily be overtrained. Many heuristic techniques have been developed to decrease the chance of overtraining. One of the methods is to use weight decay, in which an extra regularization term is added to equation (22). This regularization term, often something of the form

    E' = E + λ ||w||², (24)

    tries to reduce the size of the individual weights in the network. By restricting the size of the weights, the network will adjust less to the noise in the data sample


and become less complex. The regularization parameter λ regulates the trade-off between the classification error E and the classifier complexity. When the size of the network (in terms of the number of neurons) is also chosen carefully, good performance can be achieved by the neural network.
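The network (21), the error (22), the gradient update (23) and weight decay (24) can be sketched together as below. The architecture (one hidden layer, a single output, no bias terms), the constants, the seeded random initialization and the toy data are all illustrative assumptions; constant factors of 2 from the derivatives are absorbed into the learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W, V):
    """Eq. (21) with sigmoid transfer: hidden h = g(W x), output o = g(v^T h)."""
    H = sigmoid(X @ W.T)        # (N, n_hidden) hidden activations
    return H, sigmoid(H @ V)    # (N,) network outputs o(x_i)

def squared_error(X, t, W, V):
    """The squared error of eq. (22), for a single output neuron."""
    _, o = forward(X, W, V)
    return np.sum((t - o) ** 2)

def train_mlp(X, t, n_hidden=4, eta=0.1, lam=1e-4, n_iter=5000):
    """Back-propagation, eq. (23), on error (22) plus weight decay (24)."""
    W = rng.normal(0.0, 0.5, (n_hidden, X.shape[1]))  # input -> hidden (w)
    V = rng.normal(0.0, 0.5, n_hidden)                # hidden -> output (v)
    err_before = squared_error(X, t, W, V)
    for _ in range(n_iter):
        H, o = forward(X, W, V)
        delta_o = (o - t) * o * (1 - o)               # output error signal
        grad_V = H.T @ delta_o + lam * V              # decay term from (24)
        delta_h = np.outer(delta_o, V) * H * (1 - H)  # back-propagated signal
        grad_W = delta_h.T @ X + lam * W
        V -= eta * grad_V                             # update (23)
        W -= eta * grad_W
    return W, V, err_before, squared_error(X, t, W, V)

# Hypothetical toy problem; t holds the desired outputs n(x_i).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [2.0, 2.0], [2.2, 1.9], [1.9, 2.1]])
t = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
W, V, err_before, err_after = train_mlp(X, t)
```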

A similar approach is chosen for the support vector classifier32. The most basic version is just a linear classifier as in Eq. (10) with

    f(x) = w^T x + b. (25)

    The minimum distance from the training objects to the classifier is thereby maximized. This gives the classifier some robustness against noise in the data, such that it will generalize well for new data. It appears that this maximum margin ρ is inversely related to ||w||², such that maximizing this margin means minimizing ||w||² (taking into account the constraints that all objects are correctly classified).

    [Figure 3: a linear decision boundary w^T x + b = 0 with margin ρ; objects on the margin are the support vectors, while objects classified well outside it have α_i = 0.]

    Fig. 3. Schematic picture of a support vector classifier.

Given linearly separable data, the linear classifier is found which has the largest margin ρ to each of the classes. To allow for some errors in the classification, slack variables ξ_i are introduced to weaken the hard constraints. The error to minimize for the support vector classifier therefore consists of two parts: the complexity of the classifier in terms of w^T w, and the number of classification errors, measured by Σ_i ξ_i. The optimization can be stated in the following mathematical formulation:

    min_w w^T w + C Σ_i ξ_i, (26)

    such that

    w^T x_i + b ≥ 1 - ξ_i if x_i ∈ ω1,
    w^T x_i + b ≤ -1 + ξ_i otherwise. (27)


Parameter C determines the trade-off between the complexity of the classifier, as measured by w^T w, and the number of classification errors.

Although the basic version of the support vector classifier is a linear classifier, it can be made much more powerful by the introduction of kernels. When the constraints (27) are incorporated into (26) by the use of Lagrange multipliers α, this error can be rewritten in the so-called dual form. For this, we define the labels y_i, where y_i = 1 when x_i ∈ ω_1 and y_i = −1 otherwise. The optimization becomes:

max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j,   s.t. y^T α = 0,  0 ≤ α_i ≤ C, ∀i  (28)

with w = Σ_i α_i y_i x_i. Due to the constraints in (28) the optimization is not trivial, but standard software packages exist which can solve this quadratic programming problem. It appears that in the optimal solution of (28) many of the α_i become 0. Therefore only a few α_i ≠ 0 determine w. The corresponding objects x_i are called the support vectors. All other objects in the training set can be ignored.

The special feature of this formulation is that both the classifier f(x) and the error (28) are completely stated in terms of inner products between objects x_i^T x_j. This means that the classifier does not explicitly depend on the features of the objects. It depends on the similarity between the object x and the support vectors x_i, measured by the inner product x^T x_i. By replacing the inner product by another similarity, defined by the kernel function K(x, x_i), other non-linear classifiers are obtained. One of the most popular kernel functions is the Gaussian kernel:

K(x, x_i) = exp(−||x − x_i||^2 / σ^2),  (29)

where σ is still a free parameter. The drawback of the support vector classifier is that it requires the solution of a large quadratic programming problem (28), and that suitable settings for the parameters C and σ have to be found. On the other hand, when C and σ are optimized, the performance of this classifier is often very competitive. Another advantage of this classifier is that it offers the possibility to encode problem-specific knowledge in the kernel function K. In particular for problems where a good feature representation is hard to derive (for instance in the classification of shapes or text documents) this can be important.
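To illustrate the kernelized classifier, once the multipliers α_i, the support vectors and the bias b are known (assumed given here, e.g. from solving (28)), evaluating f(x) is just a weighted sum of kernel values over the support vectors:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / sigma^2), Eq. (29)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def decision_value(x, support_vectors, alpha, y, b, sigma=1.0):
    """f(x) = sum_i alpha_i y_i K(x, x_i) + b; the sum runs over the support
    vectors only, since all other alpha_i are zero."""
    return sum(a * yi * gaussian_kernel(x, sv, sigma)
               for a, yi, sv in zip(alpha, y, support_vectors)) + b
```

Note that only kernel evaluations appear: the features of x are never used directly, only its similarities to the support vectors.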

2.4. Multi-class classifiers

In the previous section we focused on two-class classification problems. This simplifies the formulation and notation of the classifiers. Many classifiers can trivially be extended to multi-class problems. For instance the Bayes classifier (1) becomes:

assign x to ω_{c*},  with c* = argmax_c p(ω_c | x)  (30)

Most of the classifiers directly follow from this. Only the boundary methods which were constructed to explicitly distinguish between two classes, for instance the Fisher classifier or the support vector classifier, cannot be trivially extended. For these classifiers several combining techniques are available. The two main approaches to decompose a multi-class problem into a set of two-class problems are:

(1) one-against-all: train K classifiers between one of the classes and all others,
(2) one-against-one: train K(K−1)/2 classifiers to distinguish all pairwise classes.

Afterwards the classifiers have to be combined, using classification confidences (posterior probabilities) or by majority voting. A more advanced approach is to use Error-Correcting Output Codes (ECOC), where classifiers are trained to distinguish specific combinations of classes, but are allowed to ignore others7. The classes are chosen such that a redundant output labeling appears, and possible classification errors can be fixed.
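The one-against-all decomposition can be sketched as follows; least-squares linear discriminants are used here as stand-in base classifiers (an illustrative choice — any two-class classifier would do), and the combination takes the largest output:

```python
import numpy as np

def train_one_vs_rest(X, labels, n_classes):
    """One-against-all: one binary linear discriminant per class, fitted here
    by least squares on +1/-1 targets."""
    Xa = np.hstack([X, np.ones((len(X), 1))])     # append a bias term
    W = np.zeros((n_classes, Xa.shape[1]))
    for c in range(n_classes):
        t = np.where(labels == c, 1.0, -1.0)      # class c against all others
        W[c] = np.linalg.lstsq(Xa, t, rcond=None)[0]
    return W

def predict(W, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(Xa @ W.T, axis=1)            # highest confidence wins
```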

2.5. One-class classifiers

A fundamental assumption in all previous discussions is that a representative training set X^tr is available. That means that examples from both classes are present, sampled according to their class priors. In some applications one of the classes might contain very diverse objects, or its objects are difficult or expensive to measure. This happens for instance in machine diagnostics or in medical applications. A sufficient number of representative examples from the class of ill patients or the class of faulty machines is sometimes hard to collect. In these cases one cannot rely on a representative dataset to train a classifier, and a so-called one-class classifier30 may be used.

    Fig. 4. One-class classifier example.

In one-class classifiers, it is assumed that we have examples from just one of the classes, called the target class. From all other possible objects, by definition the outlier objects, no examples are available during training. When it is assumed that the outliers are uniformly distributed around the target class, the classifier should circumscribe the target class as tightly as possible in order to minimize the chance of accepting outlier objects.

In general, the problem of one-class classification is harder than the problem of conventional two-class classification. In conventional classification problems the decision boundary is supported from both sides by examples of both classes. Because in the case of one-class classification only one set of data is available, only one side of the boundary is supported. It is therefore hard to decide, on the basis of just one class, how strictly the boundary should fit around the data in each of the feature directions. In order to have a good distinction between the target objects and the outliers, a good representation of the data is essential.

Approaches similar to standard two-class classification can be used here. Using the uniform outlier distribution assumption, the class posteriors can be estimated, and class conditional distributions or direct boundary methods can be constructed. For high-dimensional spaces the density estimators suffer, and often the boundary methods are to be preferred.
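A minimal density-based sketch of a one-class classifier: describe the target class by a single Gaussian and accept an object when its Mahalanobis distance to the target mean stays below a threshold set on the target data itself. The Gaussian model and the 95% quantile are illustrative choices, not prescriptions from the chapter:

```python
import numpy as np

def fit_one_class(X_target, quantile=0.95):
    """Fit mean/covariance of the target class and a distance threshold."""
    mu = X_target.mean(axis=0)
    cov = np.cov(X_target.T) + 1e-6 * np.eye(X_target.shape[1])
    inv = np.linalg.inv(cov)
    D = X_target - mu
    d = np.einsum('ij,jk,ik->i', D, inv, D)   # squared Mahalanobis distances
    return mu, inv, np.quantile(d, quantile)

def accept(model, x):
    """True if x is accepted as a target object, False if it is an outlier."""
    mu, inv, thr = model
    return (x - mu) @ inv @ (x - mu) <= thr
```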

2.6. Combining classifiers

In practice it is hard to find (and train) a classifier which fits the data distribution sufficiently well. The model can be difficult to construct (by the user), too hard to optimize, or insufficient training data is available. In these cases it can be very beneficial to combine several "weak" classifiers in order to boost the classification performance21. It is hoped that each individual classifier will focus on different aspects of the data and err on different objects. Combining the set of so-called base classifiers will then complement their weak areas.

[Figure: the feature space is fed to several base classifiers; their outputs (e.g. confidences) are the input of a combining classifier, which produces the final classification.]

Fig. 5. Combining classifier.

The most basic combining approach is to train several different types of classifiers on the same dataset and combine their outputs. One has to realize that classifiers can only correct each other when their outputs vary, i.e. the set of classifiers is diverse22. It therefore appears to be more advantageous to combine classifiers which were trained on objects represented by different features. Another approach to force classifiers to become diverse is to artificially change the training set by resampling (resulting in a bagging6 or a boosting8 approach).
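The resampling idea behind bagging can be sketched as follows: each base classifier (here a 1-nearest-neighbour rule, an illustrative stand-in) is trained on a bootstrap resample of the training set, and the crisp outputs are combined by majority voting:

```python
import numpy as np

def nn_predict(Xtr, ytr, X):
    """1-nearest-neighbour base classifier."""
    d = ((X[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[np.argmin(d, axis=1)]

def bagging_predict(Xtr, ytr, X, n_bags=11, seed=0):
    """Bagging: base classifiers trained on bootstrap resamples of the
    training set, combined by majority voting over their crisp labels."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(Xtr), len(Xtr))       # bootstrap resample
        votes.append(nn_predict(Xtr[idx], ytr[idx], X))
    votes = np.stack(votes)
    return np.array([np.bincount(v).argmax() for v in votes.T])
```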


The outputs of the classifiers can be combined using several combining rules18, depending on the type of classifier outputs. If the classifiers provide crisp output labels, a voting combining rule has to be used. When real-valued outputs are available, they can be averaged, weighted-averaged or multiplied, the maximum or minimum output can be taken, or even an output classifier can be trained. If fixed (i.e. not trained) rules are used, it is important that the output of a classifier is properly scaled. Using a trainable combining rule, this constraint can be alleviated, but clearly training data is required to optimize this combining rule10.
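The fixed rules mentioned above can be sketched in a few lines; the input is assumed to be a stack of properly scaled confidence outputs, one row per base classifier (names and the set of rules shown are illustrative):

```python
import numpy as np

def combine(posteriors, rule='mean'):
    """Fixed combining rules applied to a stack of base-classifier outputs.
    posteriors: (n_classifiers, n_classes) array of scaled confidences;
    returns the index of the winning class."""
    P = np.asarray(posteriors)
    if rule == 'mean':
        scores = P.mean(axis=0)
    elif rule == 'product':
        scores = P.prod(axis=0)
    elif rule == 'max':
        scores = P.max(axis=0)
    elif rule == 'vote':                 # for crisp outputs: majority voting
        scores = np.bincount(P.argmax(axis=1), minlength=P.shape[1])
    else:
        raise ValueError(rule)
    return int(scores.argmax())
```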

    3. Feature reduction

In many classification problems it is unclear which features have to be taken into account. Often a large set of K potentially useful features is collected, and by feature reduction the k most suitable features are chosen. Often the distinction between feature selection and feature extraction is made. In selection, only a subset of the original features is chosen. The advantage is that in the final application just a few features have to be measured. The disadvantage is that the selection of the appropriate subset is an expensive search. In extraction, new features are derived from the original features. Often all original features are used, and no reduction is obtained in the number of measurements, but in many cases the optimization is easier. In Section 3.1 we will discuss several evaluation criteria, then in Section 3.2 feature selection and finally in Section 3.3 feature extraction.

    3.1. Feature set evaluation criteria

In order to evaluate a feature set, a criterion J has to be defined. Because feature reduction is often applied in classification, the most obvious criterion is the performance of the classifier. Unfortunately, the optimization of a classifier is often hard, and other evaluation criteria might be a cheaper approximation of this classification performance. Therefore approximate criteria are used, measuring the distance or dissimilarity between distributions, or even ignoring the class labels and just focusing on unsupervised characteristics.

Some typical evaluation criteria are listed in Table 1. The most simple ones use the scatter matrices characterizing the scatter within classes (showing how samples scatter around their class mean vector, called S_W, the within-scatter matrix) and the scatter between the classes (showing how the means of the classes scatter, S_B, the between-scatter matrix; see also the discussion of the Fisher ratio in Section 2.3). These scatter matrices can be combined using several functions, listed in the first part of Table 1. Often S_1 = S_B is used, and S_2 = S_W or S_2 = S_W + S_B.
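As an illustration, the first scatter-based criterion can be computed directly from a labeled dataset (a minimal sketch, using the S_1 = S_B, S_2 = S_W choice):

```python
import numpy as np

def scatter_criterion(X, labels):
    """J = tr(S2^-1 S1) with S1 = S_B (between scatter) and S2 = S_W
    (within scatter), the first criterion of Table 1."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)                   # scatter around class mean
        S_B += len(Xc) * np.outer(mc - mean, mc - mean)  # scatter of class means
    return float(np.trace(np.linalg.inv(S_W) @ S_B))
```

Well-separated classes give a larger J than overlapping ones, which is what makes the criterion usable for feature evaluation.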

The measures between distributions involve the class distributions p(x|ω_i), and in practice often single Gaussian distributions for each of the classes are chosen. The reconstruction errors still contain free parameters in the form of a matrix of basis vectors W or a set of prototypes μ_k. These are optimized in their respective procedures, like the Principal Component Analysis or Self-Organizing Maps. The scatter criteria and the supervised measures between distributions are mainly used in feature selection, Section 3.2. The unsupervised reconstruction errors are used in feature extraction, Section 3.3.

Table 1. Feature selection criteria for measuring the difference between two distributions or for measuring a reconstruction error.

Measures using scatter matrices (for explanation of S_1 and S_2 see text):
  J = tr(S_2^{-1} S_1)
  J = ln |S_2^{-1} S_1|
  J = tr S_1 / tr S_2

Measures between distributions:
  Kolmogorov          J = ∫ |p(ω_1|x) − p(ω_2|x)| p(x) dx
  Average separation  J = (1/(n_1 n_2)) Σ_{x_i ∈ ω_1} Σ_{x_j ∈ ω_2} ||x_i − x_j||
  Divergence          J = ∫ (p(x|ω_1) − p(x|ω_2)) ln( p(x|ω_1) / p(x|ω_2) ) dx
  Chernoff            J = −log ∫ p^s(x|ω_1) p^{1−s}(x|ω_2) dx
  Fisher              J = √( ∫ (p(x|ω_1)p(ω_1) − p(x|ω_2)p(ω_2))^2 dx )

Reconstruction errors:
  PCA  E = ||x − W(W^T W)^{-1} W^T x||^2
  SOM  E = min_k ||x − μ_k||^2

3.2. Feature selection

In feature selection a subset of the original features is chosen. A feature reduction procedure consists of two ingredients: the first is the evaluation criterion to evaluate a given set of features, the second is a search strategy to search over all possible feature subsets16. Exhaustive search is in many applications not feasible. When we start with K = 250 features, and we want to select k = 10, we have to consider in principle (250 choose 10) ≈ 2·10^17 different subsets, which is clearly too much.

Instead of exhaustive search, a forward selection can be applied. It starts with the single best feature (according to the evaluation criterion) and adds the feature which gives the biggest improvement in performance. This is repeated till the requested number of features k is reached. Instead of forward selection, the opposite approach can be used: backward selection. This starts with the complete set of features and removes that feature such that the performance increase is the largest. These approaches have the significant drawback that they might miss the optimal subsets: those for which the individual features have poor discriminability, but which combined give a very good performance. In order to find these subsets, a more advanced search strategy is required. It can be a floating search, where adding and removing features is alternated. Another approach is the branch-and-bound algorithm12, where all the subsets of features are arranged in a search tree. This tree is traversed in such an order that large sub-branches can be disregarded as early as possible, and the search process is shortened significantly. This strategy will yield the optimal subset when the evaluation criterion J is monotone, that means that when for a certain feature set a value J_k is obtained, a subset of these features cannot have a higher value of J. Criteria like the Bayes error, the Chernoff distance or the functions on the scatter matrices fulfill this.
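The greedy forward search can be sketched in a few lines; the evaluation criterion is passed in by the user (e.g. one of the measures of Table 1), and higher values are assumed to be better:

```python
import numpy as np

def forward_selection(X, labels, k, criterion):
    """Greedy forward selection: start with the single best feature and keep
    adding the feature that most improves the criterion J (higher is better).
    criterion(X_subset, labels) is user-supplied."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best = max(remaining,
                   key=lambda f: criterion(X[:, selected + [f]], labels))
        selected.append(best)
        remaining.remove(best)
    return selected
```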

Currently, other approaches appear which combine the traditional feature selection and the subsequent training of a classifier. One example is a linear classifier (with the functional form of (25)) called LASSO, the Least Absolute Shrinkage and Selection Operator31. The classification problem is approached as a regression problem with an additional regularization. A linear function is fitted to the data by minimizing the following error:

min_w Σ_{i=1}^{n} (y_i − w^T x_i − b)^2 + C ||w||_1.  (31)

The first part defines the deviation of the linear function w^T x_i + b from the expected label y_i. The second part shrinks the weights w, such that many of them become exactly zero. By choosing a suitable value for C, the number of retained features can be changed. This kind of regularization appears to be very effective when the number of features is huge (in the thousands) and the training set size is small (in the tens). A similar solution can be obtained when the term w^T w in (26) is replaced by |w|3.
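A minimal sketch of (31), with the bias term omitted, solved by iterative soft-thresholding (ISTA). This is one simple way to handle the non-differentiable L1 term, not the solver of Ref. 31; the parameter values are illustrative:

```python
import numpy as np

def lasso_ista(X, y, C=0.1, steps=3000):
    """Minimize sum_i (y_i - w^T x_i)^2 + C * ||w||_1 by soft-thresholding."""
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()  # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = 2 * X.T @ (X @ w - y)              # gradient of the squared error
        u = w - g / L
        w = np.sign(u) * np.maximum(np.abs(u) - C / L, 0.0)  # shrink the weights
    return w
```

The soft-thresholding step sets small weights exactly to zero, which is what performs the feature selection.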

    3.3. Feature extraction

Instead of using a subset of the given features, a smaller set of new features may be derived from the old ones. This can be done by linear or nonlinear feature extraction. For the computation of the new features usually all original features are used; feature extraction will therefore almost never reduce the amount of measurements. The optimization criteria are often based on reconstruction errors, as in Table 1.

The most well-known linear extraction method is Principal Component Analysis (PCA)17. Each new feature x'_i is a linear combination of the original features: x'_i = w_i^T x. The new features are optimized to minimize the PCA mean squared reconstruction error, Table 1. It basically extracts the directions w_i in which the data set shows the highest variance. These directions appear to be equivalent to the eigenvectors of the (estimated) covariance matrix Σ with the largest eigenvalues. For the i-th principal component w_i therefore holds:

Σ w_i = λ_i w_i,  with λ_i ≥ λ_j if i < j.  (32)

An extension of the (linear) PCA is the kernelized version, kernel-PCA24. Here the standard covariance matrix Σ is replaced by a covariance matrix in a feature space. After rewriting, the eigenvalue problem in the feature space reduces to the following eigenvalue problem: K α_i = N λ_i α_i. Here K is an N × N kernel matrix (like for instance (29)). An object x is mapped onto the i-th principal component by:

x'_i = Σ_j α_i^j K(x, x_j).  (33)

Although this feature extraction is linear in the kernel space, in the original feature space it yields non-linear combinations of features.
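The linear PCA described above can be sketched directly via an eigendecomposition of the estimated covariance matrix (a minimal sketch; for large N one would use an SVD instead):

```python
import numpy as np

def pca(X, n_components):
    """Linear PCA: the principal directions are the eigenvectors of the
    estimated covariance matrix with the largest eigenvalues, Eq. (32)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc.T))   # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    W = vecs[:, order]
    return W, Xc @ W                            # directions and projected data
```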

    There are many other methods for extracting nonlinear features, for instance the Self-Organizing Map (SOM) 20. The SOM is an unsupervised clustering and feature extraction method in which the cluster centers are constrained in their placing. The construction of the SOM is such that all objects in the input space retain as much as possible their distance and neighborhood relations in the mapped space. In other words, the topology is preserved in the mapped space.

The mapping is performed by a specific type of neural network, equipped with a special learning rule. Assume that we want to map a k-dimensional measurement space to a k'-dimensional feature space, where k' < k. In fact, often k' = 1 or k' = 2. In the feature space, we define a finite orthogonal grid with grid points. At each grid point we place a neuron or prototype. Each neuron stores a k-dimensional vector μ_k that serves as a cluster center. By defining a grid for the neurons, each neuron does not only have neighboring neurons in the measurement space, it also has neighboring neurons in the grid. During the learning phase, neighboring neurons in the grid are enforced to also be neighbors in the measurement space. By doing so, the local topology will be preserved. Unfortunately, training a SOM involves the setting of many unintuitive parameters and heuristics (similar to many neural network approaches).
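The learning rule can be sketched for a 1-D grid (k' = 1): the winning neuron and its grid neighbours are pulled toward each sample, with a decaying learning rate and a shrinking neighbourhood. The schedules used below are illustrative heuristics, exactly the kind of settings the text warns about:

```python
import numpy as np

def train_som(X, n_units=10, epochs=20, seed=0):
    """1-D SOM sketch: prototypes mu_k sit on a line grid; the winner and its
    grid neighbours move toward each sample, so grid neighbours end up as
    neighbours in the measurement space as well."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), n_units)].copy()    # initialize on data points
    grid = np.arange(n_units)
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                       # decaying rate
        width = max(n_units / 2 * (1 - epoch / epochs), 1.0)  # shrinking radius
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(((mu - x) ** 2).sum(1))             # winning neuron
            h = np.exp(-((grid - k) ** 2) / (2 * width ** 2)) # grid neighbourhood
            mu += lr * h[:, None] * (x - mu)
    return mu
```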

A more principled approach to the SOM is the Generative Topographic Mapping, GTM5. The idea is to find a representation of the original k-dimensional data x in terms of lower-dimensional latent variables z. For this a mapping function y(z|W) has to be defined. In the GTM it is assumed that the distribution of z in the latent variable space is a grid of K delta functions z_k:

p(z) = (1/K) Σ_{k=1}^{K} δ(z − z_k).

The distribution p(x) can then be obtained by integration over the z distribution:

p(x|W, σ) = ∫ p(x|z, W, σ) p(z) dz.

By maximizing the likelihood of the training data under this model, the parameters W and σ can then be optimized.

An even simpler model to optimize is the Local Linear Embedding, LLE28. Here also the goal is to find a low-dimensional representation of the training set X^tr. But unlike the GTM, where an explicit manifold is fitted, here the low-dimensional representation is optimized such that the objects can be reconstructed from their neighbors in the training set in the same manner in the low-dimensional representation as in the high-dimensional one. First, the weights w_ij for reconstructing each object x_i from its neighbors x_j are optimized (minimizing the LLE reconstruction error, Table 1, under the constraint that Σ_j w_ij = 1). Given the weights, the locations of the low-dimensional feature vectors z_i, i = 1, ..., N are optimized, using the same LLE reconstruction error, but where x_i is replaced by z_i. This can be minimized by solving an eigenvalue problem (similar to finding the principal components).

The feature extraction methods presented above are all unsupervised, i.e. other information like class labels is not used. This can be a significant drawback when the feature reduction is applied as preprocessing for solving a classification problem. It might actually happen that all informative features are removed. To avoid this, supervised feature extraction has to be used. Very well known is Linear Discriminant Analysis (LDA)27, which uses the weight vector w of the Fisher classifier (see Section 2.3) as feature direction. A multi-class extension is presented in27, but it assumes equal covariance matrices for all classes, and the number of features is restricted to K − 1. The LDA can be extended to include the differences in covariance matrices by using the Chernoff criterion instead of the Fisher criterion23.

    4. Error estimation

At various stages in the design of a pattern classification system an estimate of the performance of a procedure, or of the separability of a set of classes, is needed. Examples are the selection of the 'best' feature during feature selection, the choice of the feature subspace when several feature extraction schemes are investigated, the performance of the base classifiers in order to find a good set of classifiers to be combined, the optimization of various parameters in classification schemes, like the smoothing parameter in the Parzen classifier and the number of hidden units in a neural network classifier, and the final selection of the overall classification procedure if various competing schemes are followed consecutively. Moreover, at the end an estimate of the performance of the selected classifier is desired.


In order to find an unbiased error estimate, a set of test objects with known labels is needed. This set should be representative of the circumstances expected during the practical use of the procedures under study. Usually this implies that the test set has to be randomly drawn from the future objects to be classified. As their labels should be known for proper testing, these objects are suitable for training as well. Once an object is used for training, however, the resulting classifier is expected to perform well on this object. Consequently, if it is also used for testing, it generates an optimistic bias in the error estimate. Below, two techniques are discussed to solve this problem. The first is cross-validation, which aims at circumventing the bias. The second is a bootstrap technique, by which the bias is estimated and corrected.

4.1. Cross-validation

Assume that a design set X^d is available for the development of a pattern recognition system, or one of its subsystems, and that in addition to the classifier itself an unbiased estimate of its performance is needed. If X^d is split (e.g. at random) into a training set X^tr and a test set X^te, then we want X^tr to be as large as possible to train a good classifier, but simultaneously X^te has to be sufficiently large for an accurate error estimate. The standard deviation of this estimate is sqrt(e(1 − e)/N_te) (e.g. 0.003 for e = 0.01, N_te = 1000, and 0.03 for e = 0.1 and N_te = 100). When the design set is not sufficiently large to split it into a test set and a training set of appropriate sizes, a cross-validation procedure may be used, in which the design set is split into B (B ≥ 2) subsets of about the same size. In total B different classifiers are trained, each by a different group of B − 1 of these subsets. Each classifier is tested by the single subset not used for its training. Finally, the B test results are averaged. Consequently, all objects are used for testing once. The classifiers they are testing are all based on a (B − 1)/B part of the design set. For larger B these classifiers are expected to be similar, and they will be just slightly worse than the classifier based on all objects. A good choice seems to be a 10-fold stratified cross-validation19, i.e. B = 10, where objects are selected evenly from the classes, i.e. in agreement with their prior probabilities.
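The B-fold procedure can be sketched as follows (unstratified for brevity; the `train_and_test` callback stands in for any classifier training and testing routine):

```python
import numpy as np

def cross_validate(X, y, train_and_test, B=10, seed=0):
    """B-fold cross-validation: split the design set into B parts, train on
    B-1 of them, test on the held-out part, and average the B test errors.
    train_and_test(Xtr, ytr, Xte, yte) must return the test error fraction."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), B)
    errors = []
    for b in range(B):
        te = folds[b]
        tr = np.concatenate([folds[i] for i in range(B) if i != b])
        errors.append(train_and_test(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(errors))
```

Each object is tested exactly once, by the one classifier that was not trained on it.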

    4.2. Bootstrap procedures

Instead of following a procedure that tries to minimize the bias in the error estimate, one may try to estimate the bias13,15. A general procedure (independent of the classifier used) can be based on a comparison of the expected apparent error E_b^app of a classifier trained on a bootstrap sample of the design set with its error E_b estimated on the entire design set. The difference can be used as an estimate of the bias in the apparent error: E_bias = E_b − E_b^app, which can be used as a correction of the apparent error E_d^app of the classifier based on the design set: E_boot = E_d^app + E_b − E_b^app.

A second estimator based on bootstrapping is the so-called E_632 error11,13,15. It is based on a weighted average of the apparent error E_d^app of the classifier based on the design set, and an error estimate E_0 for the bootstrap classifier based on the out-of-bootstrap part of the design set. The first is optimistically biased (an apparent error), and the second is an unbiased error estimate (tested by independent samples) of a classifier that is somewhat worse (based on just a bootstrap sample) than the target classifier based on the design set. The weights are given by the asymptotic probability that a sample will be included in a bootstrap sample: 0.632. The .632 error estimate is thereby given by: E_632 = 0.368 · E_d^app + 0.632 · E_0.
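The .632 estimator can be sketched as follows; as before, `train_and_test` stands in for any classifier, and the number of bootstrap repetitions is an illustrative choice:

```python
import numpy as np

def e632_estimate(X, y, train_and_test, n_boot=50, seed=0):
    """E_632 = 0.368 * E_app + 0.632 * E_0: the apparent error on the design
    set combined with the out-of-bootstrap error of bootstrap-trained
    classifiers. train_and_test(Xtr, ytr, Xte, yte) returns an error fraction."""
    rng = np.random.default_rng(seed)
    e_app = train_and_test(X, y, X, y)              # train and test on the same set
    e0 = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), len(X))       # bootstrap sample
        oob = np.setdiff1d(np.arange(len(X)), idx)  # out-of-bootstrap objects
        if len(oob):
            e0.append(train_and_test(X[idx], y[idx], X[oob], y[oob]))
    return 0.368 * e_app + 0.632 * float(np.mean(e0))
```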

    4.3. Error curves

    The graphical representation of the classification error is an important tool to study, compare and understand the behavior of classification systems. Some examples of such error curves are:

Learning curve : the error as a function of the number of training samples. The error of simple classifiers decreases faster, but often has a higher asymptotic value than that of more complex ones.

Complexity curve : the error as a function of the complexity of the classifier, e.g. the feature set size or the number of hidden units. Such a curve often shows an increasing error beyond an optimal feature size or complexity.

Parameter curve : the error as a function of a parameter in the training procedure, e.g. the smoothing parameter in the Parzen classifier. The optimum that may be observed in such curves is related to the best fit of the underlying model in the classification system w.r.t. the data.

Error-reject trade-off : the error as a function of the reject probability. If a classifier output (e.g. a confidence estimate) is thresholded to reject unreliably classified objects, then this curve shows the gain in error reduction.

ROC curves : the trade-off between two types of errors, e.g. the two types of error in a 2-class problem. These Receiver Operating Characteristic curves were first studied in communication theory and are useful to select a classifier if the point of operation may vary, e.g. due to unknown classification costs or prior probabilities.
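An ROC curve can be traced out directly by sweeping a threshold over the classifier outputs (a minimal sketch; a practical implementation would threshold at the sorted unique scores instead of an even grid):

```python
import numpy as np

def roc_points(scores, labels, n_thresholds=50):
    """Sweep a threshold on the classifier output and record the two-error
    trade-off as (false positive rate, true positive rate) points."""
    pts = []
    for t in np.linspace(scores.min(), scores.max(), n_thresholds):
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        fpr = (pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        pts.append((float(fpr), float(tpr)))
    return pts
```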

    5. Discussion

In the previous sections an overview was given of well-established techniques for statistical pattern recognition, with a few excursions to more recent developments. Modern scientific and industrial developments, the use of computers and the internet in daily life, and the fast-growing sensor technology raise new problems as well as enable new solutions. We will summarize some new developments in statistical pattern recognition, partially introduced above, partially not yet discussed.

Other types of representation than the traditional features enable other ways to incorporate expert knowledge. The dissimilarity representation is an example of this, as it offers the possibility to express knowledge in the definition of the dissimilarity measure, but it opens other options as well. Instead of being based on the raw data like spectra, images, or time signals, it may be defined on models of objects, like graphs. In such cases structural knowledge is used for the object descriptions. In addition to the nearest neighbor rule, dissimilarity-based classifiers offer a richer set of tools with more possibilities to learn from examples, thereby bridging the gap between structural and statistical pattern recognition. Several problems, however, still have to be solved, like the selection of a representation set, optimal modifications of a given dissimilarity measure, and the construction of dedicated classifiers.

More complicated pattern recognition problems may not be solvable by a single off-the-shelf classifier. By the combining-classifier technique a number of partial solutions can be combined. Several questions are still open here, like the selection or generation of the base classifiers, the choice of the combiner, and the use of a finite training set. Moreover, an overall mathematical foundation is still not available.

One-class classifiers are a good way to handle ill-sampled problems, or to build classifiers when some of the classes are undersampled. This is important for applications like man or machine monitoring, where one of the classes, e.g. normal behavior, is very well defined. Such classifiers may also be used when it is not possible to select a representative training set by an appropriate sampling of the domain of objects. In such cases a domain-based class description may be found, locating the class boundary in the representation without building a probability density function.

The widespread availability of computers and sensors, and the costs of labeling objects by human experts, may sometimes result in large databases in which just a small fraction of the objects is labeled. Techniques for training classifiers by partially labeled datasets are still in their early years. This may also be considered as combining clustering and classification.

For such problems, in which the costs of expert labeling are high, one may also try to optimize the set of objects to be labeled. This technique is called active learning. Several competing strategies exist, e.g. sampling close to an initial decision boundary, or retrieving objects in the modes of the class density distributions. Another variant is online learning, in which the order of the objects to be presented to a decision function is determined by the application, e.g. by a production line in a factory. It then has to be decided whether objects can be safely classified, or whether a human expert has to be consulted, not only to reduce the risk of misclassification, but also to optimally improve the available classification function.

A recurring question in dynamic environments is whether a trained classification function is still valid, or whether it should be retrained due to new circumstances. In such problems 'learning' and 'forgetting' are directly related. If a new situation demands retraining, old objects may not be representative anymore and should be forgotten. (They may still be stored in case the old situation appears to return after some time.)

Many techniques have been proposed, and many more are to come, for solving problems such as the above. A difficulty that cannot be easily handled is that these problems are often ill-defined. Consequently, generally valid benchmarks are not available, by which it is not straightforward to detect the good procedures that may work well over a series of applications. As good and bad procedures cannot easily be distinguished, it is to be expected that the set of tools used in statistical pattern recognition will significantly grow in the near future.

    References

1. J.A. Anderson. Logistic discrimination. In P.R. Krishnaiah and L.N. Kanal, editors, Classification, Pattern Recognition and Reduction of Dimensionality, volume 2 of Handbook of Statistics, pages 169-191. North Holland, Amsterdam, 1982.

2. A.G. Arkadev and E.M. Braverman. Computers and Pattern Recognition. Thompson, Washington, D.C., 1966.

3. C. Bhattacharyya, L.R. Grate, A. Rizki, D. Radisky, F.J. Molina, M.I. Jordan, M.J. Bissell, and I.S. Mian. Simultaneous relevant feature identification and classification in high-dimensional spaces: Application to molecular profiling data. Signal Processing, 83:729-743, 2003.

4. C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.

5. C.M. Bishop, M. Svensen, and C.K.I. Williams. The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.

6. L. Breiman. Bagging predictors. Machine Learning, 26(2):123-140, 1996.

7. T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.

8. H. Drucker, C. Cortes, L.D. Jackel, Y. LeCun, and V. Vapnik. Boosting and other ensemble methods. Neural Computation, 6, 1994.

9. R.P.W. Duin. On the choice of the smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, C-25(11):1175-1179, 1976.

10. R.P.W. Duin. The combining classifier: To train or not to train? In International Conference on Pattern Recognition, volume II, pages 765-770, Quebec, Canada, 2002.

11. B. Efron and R.J. Tibshirani. Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association, 92:548-560, 1997.

12. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, 2nd edition, 1990.

13. D.J. Hand. Recent advances in error rate estimation. Pattern Recognition Letters, 4(5):335-346, 1986.

14. S.S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall, 1999.

15. A.K. Jain, R.C. Dubes, and C.-C. Chen. Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):628-633, September 1987.

16. A.K. Jain, R.P.W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4-37, 2000.

17. I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.

18. J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):226-239, 1998.

19. R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 15th International Joint Conference on Artificial Intelligence, pages 1137-1143, 1995.

20. T. Kohonen. Self-Organizing Maps. Springer-Verlag, Heidelberg, Germany, 1995.

21. L.I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.

22. L.I. Kuncheva and C.J. Whitaker. Measures of diversity in classifier ensembles. Machine Learning, 51:181-207, 2003.

23. M. Loog, R.P.W. Duin, and R. Haeb-Umbach. Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(7):762-766, 2001.

24. S. Mika, B. Schölkopf, A.J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 536-542. MIT Press, 1999.

25. T.M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

26. E. Pekalska and R.P.W. Duin. Dissimilarity representations allow for building good classifiers. Pattern Recognition Letters, 23(8):943-956, 2002.

27. C.R. Rao. The utilization of multiple measurements in problems of biological classification (with discussion). Journal of the Royal Statistical Society, Series B, 10:159-203, 1948.

28. S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

29. J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

30. D.M.J. Tax. One-class classification. PhD thesis, Delft University of Technology, http://www.ph.tn.tudelft.nl/~davidt/thesis.pdf, June 2001.

31. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

32. V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

CHAPTER 1.2

    HIDDEN MARKOV MODELS FOR SPATIO-TEMPORAL PATTERN RECOGNITION

Brian C. Lovell^a and Terry Caelli^b

^a The Intelligent Real-Time Imaging and Sensing (IRIS) Group
The School of Information Technology and Electrical Engineering
The University of Queensland, Australia QLD 4072
E-mail: lovell@itee.uq.edu.au

^b National Information and Communications Technology Australia (NICTA)
Research School of Information Sciences and Engineering
Australian National University, Australia
E-mail: [email protected]

The success of many real-world applications demonstrates that hidden Markov models (HMMs) are highly effective in one-dimensional pattern recognition problems such as speech recognition. Research is now focussed on extending HMMs to 2-D and possibly 3-D applications which arise in gesture, face, and handwriting recognition. Although the HMM has become a major workhorse of the pattern recognition community, there are few analytical results which can explain its remarkably good pattern recognition performance. There are also only a few theoretical principles for guiding researchers in selecting topologies or understanding how the model parameters contribute to performance. In this chapter, we deal with these issues and use simulated data to evaluate the performance of a number of alternatives to the traditional Baum-Welch algorithm for learning HMM parameters. We then compare the best of these strategies to Baum-Welch on a real hand gesture recognition system in an attempt to develop insights into these fundamental aspects of learning.

    1. Introduction

There is an enormous volume of literature on the application of hidden Markov models (HMMs) to a broad range of pattern recognition tasks. In the case of speech recognition, the patterns we wish to recognise are spoken words, which are audio signals against time. Indeed, the value of Markov models for modelling speech was recognised by Shannon [26] as early as 1948. In the case of hand gesture recognition, the patterns are hand movements in both space and time; we call this a spatio-temporal pattern recognition problem. The suitability and efficacy of HMMs for such problems is undeniable, and they are now established as one of the major tools of the pattern recognition community. Yet when one looks for research which addresses fundamental problems, such as efficient learning strategies for HMMs or analytically determining the most suitable architectures for a given problem, the number of papers is greatly diminished. So despite the enormous uptake of HMMs since



their introduction in the 1960s, we believe that there is still a great deal of unexplored territory.

Much of the application of HMMs in the literature is based firmly on the methodology popularised by Rabiner et al. (1983) [25,16,24] for speech recognition, and these studies are the primary reference for many HMM researchers, resulting in two common practices. One is to use the forward algorithm to determine the MAP (maximum a posteriori probability) of the model, given an observation sequence, as a classification metric. The other is to use Baum-Welch as a model estimation/update procedure. We will see how these are not ideal strategies to use as, in the former case, classification is reduced to a single number without directly using the model (data summary) parameters, or attributes, per se. As for the latter, the Baum-Welch algorithm [4] (a version of the famous Expectation-Maximisation algorithm [14,1,21]) is, in the words of Stolcke and Omohundro [28], "... far from foolproof since it uses what amounts to a hill-climbing procedure that is only guaranteed to find a local likelihood maximum." Moreover, as observed by Rabiner [24], results can be very dependent on the initial values chosen for the HMM parameters.

The problem of finding local rather than global maxima is encountered in many other areas of learning theory and optimisation. These problems are familiar territory to researchers in the artificial neural network community, and many techniques have been proposed to counter them. Moreover, genetic and evolutionary algorithmic techniques specialise in solving such problems, albeit often very slowly, especially in the case of biological evolution [11]. With this in mind, we use simulated data to investigate other approaches to learning HMMs from observation sequences in an attempt to find superior alternatives to the traditional Baum-Welch algorithm. Then we compare and test the best of the alternative strategies on real data from a hand gesture recognition system to see if the real data trials corroborate the conclusions drawn from simulated trials.

1.1. Background and Notation

In this study, we focus on the discrete HMM as popularised by Rabiner [24]. Using the familiar notation from his tutorial paper, a hidden Markov model consists of a set of N nodes, each of which is associated with a set of M possible observations. The parameters of the model include an initial state vector

$$\pi = [p_1, p_2, p_3, \ldots, p_N]^T$$

with elements $p_n$, $n \in [1, N]$, which describes the distribution over the initial node set, a transition matrix

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}$$

with elements $a_{ij}$, $i, j \in [1, N]$, for the transition probability from node i to node j


Fig. 1. Cyclic and Left-Right structures. Bold arrows indicate high probability transitions. No arrow between vertices indicates a forbidden (zero-probability) transition.

conditional on node i, and an observation matrix

$$B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1M} \\ b_{21} & b_{22} & \cdots & b_{2M} \\ \vdots & & \ddots & \vdots \\ b_{N1} & b_{N2} & \cdots & b_{NM} \end{pmatrix}$$

with elements $b_{im}$ for the probability of observing symbol $m \in [1, M]$ given that the system is in state $i \in [1, N]$. We denote the HMM model parameter set by $\lambda = (A, B, \pi)$.

The model order pair (N, M), together with additional restrictions on allowed transitions and emissions, defines the topology or structure of the model (see figure 1 for an illustration of two different transition structures). One commonly used topology is called Fully-Connected (FC) or Ergodic. In the FC HMM there is not necessarily a defined starting state and all state transitions are possible, such that $a_{ij} \neq 0 \;\; \forall\, i, j \in [1, N]$. Another topology, especially popular in speech recognition applications, is called Left-Right (LR). In an LR HMM there is a defined starting state (usually state 1) and only state transitions to higher-index states are allowed, such that $a_{ij} = 0 \;\; \forall\, i > j$, where $i, j \in [1, N]$.
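These two transition structures are easy to sketch as row-stochastic matrices. The following minimal illustration is our own (the helper names `random_fc` and `random_lr` do not appear in the chapter); it simply builds random matrices that satisfy the FC and LR constraints above:

```python
import numpy as np

def random_fc(N, rng):
    """Fully-connected (ergodic) topology: a_ij != 0 for all i, j."""
    A = rng.random((N, N)) + 0.01            # strictly positive entries
    return A / A.sum(axis=1, keepdims=True)  # make each row a distribution

def random_lr(N, rng):
    """Left-right topology: a_ij = 0 for all i > j."""
    A = np.triu(rng.random((N, N)) + 0.01)   # zero out entries below the diagonal
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A_fc = random_fc(4, rng)
A_lr = random_lr(4, rng)
```

Note that row normalisation preserves the zero pattern, so the LR constraint survives the division.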

Rabiner [24] defines the three basic problems of HMMs as:

Problem 1: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O|\lambda)$, the probability of the observation sequence given the model?

Problem 2: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and the model $\lambda$, how do we choose a corresponding state sequence $Q = q_1 q_2 \ldots q_T$ which is optimal in some meaningful sense (i.e., best "explains" the observations)?

Problem 3: How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O|\lambda)$?

Problems 1 and 2 are elegantly and efficiently solved by the forward and Viterbi [29,12] algorithms respectively, as described by Rabiner in his tutorial. The forward algorithm is used to recognise matching HMMs (i.e., highest probability models, MAP) from the observation sequences. Note, again, that this is not a typical approach to pattern classification as


it does not involve matching model with observation attributes. That would involve comparing the model parameters and estimated observation model parameters. MAP does not perform this, and so it cannot be as sensitive a measure as exact parameter comparisons. Indeed, a number of reports have already shown that quite different HMMs can have identical emissions (observation sequences) [18,3]. The Viterbi algorithm is used less frequently, as we are normally more interested in finding the matching model than in finding the state sequence. However, this algorithm is critical in evaluating the precision of the HMM; in other words, how well the model can reconstruct (predict) the observations.
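The forward recursion that solves Problem 1 is short enough to sketch. This is the standard textbook form of the algorithm, written as our own illustration (not the authors' code); the toy parameter values are arbitrary:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(O | lambda) for a discrete HMM.

    pi:  (N,)  initial state distribution
    A:   (N,N) transition matrix, A[i, j] = P(next state j | state i)
    B:   (N,M) observation matrix, B[i, m] = P(symbol m | state i)
    obs: list of symbol indices O_1 ... O_T
    """
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction step of the recursion
    return alpha.sum()                 # P(O | lambda) = sum_i alpha_T(i)

# Sanity check: the probabilities of all length-2 sequences must sum to 1.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
total = sum(forward(pi, A, B, [o1, o2]) for o1 in (0, 1) for o2 in (0, 1))
```

For classification by MAP, one simply evaluates `forward` under each candidate model and picks the model with the highest score.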

Rabiner proposes solving Problem 3 via the Baum-Welch algorithm, which is, in essence, a gradient ascent algorithm, a method which is guaranteed to find local maxima only. Solving Problem 3 is effectively the problem of learning to recognise new patterns, so it is really the fundamental problem of HMM learning theory; a significant improvement here could boost the performance of all HMM-based pattern recognition systems. Therefore it is somewhat surprising that there appear to be relatively few papers devoted to this topic; the vast majority are devoted to applications of the HMM. In the next section we compare a number of alternatives to, and variations of, Baum-Welch in an attempt to find superior learning strategies.

2. Comparison of Methods for Robust HMM Parameter Estimation

We focus on the problem of reliably learning HMMs from a small set of short observation sequences. The need to learn rapidly from small sets arises quite often in practice. In our case, we are interested in learning hand gestures which are limited to just 25 observations. The limitation arises because we record each video at 25 frames per second and each of our gestures takes less than one second to complete. Moreover, we wish to obtain good recognition performance from small training sets to ensure that new gestures can be rapidly recognised by the system.

Four HMM parameter estimation methods are evaluated and compared by using a train-and-test classification methodology. For these binary classification tests we create two random HMMs and then use each of these to generate test and training data sequences. For normalization, we ensure that each test sequence can be correctly recognized by its true model; thus the true models obtain 100% classification accuracy on the test data by construction. The various learning methods are then used to estimate the two HMMs from their respective training sets, and then the recognition performance of the pair of estimated HMMs is evaluated on the unseen test data sets. This random model generation and evaluation process is repeated 16 times for each data sample to provide meaningful statistical results.

    Before parameter re-estimation, we initialize with two random HMMs which should yield 50% recognition performance on average. So an average recognition performance above 50% after re-estimation shows that some degree of learning must have taken place. Clearly if the learning strategy can perfectly determine both of the HMMs which generated the training data sets, we would have 100% recognition performance on the test sets.
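The evaluation protocol above relies on drawing observation sequences from a known HMM. A minimal sampler might look like the following; this is our own sketch (the function name and parameter values are ours, not the chapter's), using the notation $\lambda = (A, B, \pi)$ defined earlier:

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng):
    """Draw one observation sequence of length T from a discrete HMM."""
    N, M = B.shape
    s = rng.choice(N, p=pi)                # initial state drawn from pi
    obs = []
    for _ in range(T):
        obs.append(rng.choice(M, p=B[s]))  # emit a symbol from state s
        s = rng.choice(N, p=A[s])          # then take a state transition
    return obs

rng = np.random.default_rng(1)
pi = np.array([1.0, 0.0])
A = np.array([[0.8, 0.2], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
seqs = [sample_hmm(pi, A, B, 25, rng) for _ in range(10)]  # a small training set
```

Repeating this for two random models yields the paired train/test sets used in the binary classification trials described above.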

We compare four learning methods: 1) traditional Baum-Welch, 2) ensemble averaging




In earlier related work, Starner and Pentland [27] developed an HMM-based system to recognise gesture phrases in American Sign Language. Later, Lee and Kim [15] used HMM-based hand gesture recognition to control viewgraph presentation in data-projected seminars. Our system recognizes gestures based on the letters of the alphabet traced in space in front of a video camera. The motivation for this application is to produce a way of typing messages into a camera-equipped mobile phone or PDA using video gestures instead of the keypad or pen interface. We use single-stroke letter gestures similar to those already widely used for pen data entry in PDAs. For example, figure 3 shows the hand gestures for the letters "Z" and "W." The complete gesture set is shown in figure 6.

    Fig. 3. "Fingerwriting:" Single stroke video gesture for letters "W" and "Z."

Each video sequence comprises 25 frames corresponding to one second of video. Skin colour segmentation in YUV colour space is applied to locate the hand. Pre-processing (morphological) operations smooth the image and remove noise before tracking the hand with a modified Camshift algorithm [5]. After segmenting the hand, we calculate image moments to find the centroid in each frame. Along the trajectory, the direction (angle) of motion of each of the 25 hand movements is calculated and quantized to one of 18 discrete symbols. The resultant discrete angular observation sequence is input to the HMM classification module for training and recognition.
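The quantization step can be sketched directly: with 18 symbols, each direction bin spans 20 degrees. The helper below is our own reconstruction of that step (the function name and example centroids are ours; the chapter does not give code), mapping successive centroid positions to direction symbols:

```python
import math

def angle_symbol(prev, cur, n_bins=18):
    """Map the movement between two centroids to one of n_bins direction symbols."""
    dx, dy = cur[0] - prev[0], cur[1] - prev[1]
    angle = math.atan2(dy, dx) % (2 * math.pi)   # direction of motion in [0, 2*pi)
    return int(angle / (2 * math.pi / n_bins)) % n_bins

# A trajectory of centroids becomes a discrete observation sequence.
centroids = [(0, 0), (1, 0), (1, 1), (0, 1)]     # move right, then up, then left
obs = [angle_symbol(p, q) for p, q in zip(centroids, centroids[1:])]
```

The resulting integer sequence `obs` is exactly the kind of discrete observation sequence the HMM classification module consumes.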

We compare traditional Baum-Welch with the most promising alternative from the simulated study, VPC. We evaluate recognition performance over all 26 character gestures using fully-connected (FC), left-right (LR), and left-right banded (LRB) model topologies with the number of states ranging from 1 to 14. An LRB model is an LR model which has a transition structure containing self-transitions and next-state transitions only (i.e., states cannot be skipped), as shown in figure 5. More formally, $a_{ij} \neq 0$ for $j = i$ or $j = i + 1$, and 0 otherwise, $i, j \in [1, N]$.

Our video gesture database contains 780 video gestures with 30 examples of each gesture. Recognition accuracy is evaluated using threefold cross-validation, where 20 gestures are used for training and 10 for testing in each partition. These HMMs are initialized with random HMM parameters before using either Baum-Welch or VPC for learning.

From figure 4 the best average recognition accuracy achieved is 97.31%, when VPC is used for training, the topology is LRB, and the number of states is 13. Although this corroborates the stronger VPC performance exhibited in our simulated data performance trials, a closer investigation of figure 4 raises some doubts about this conjecture through the following observations.

    States    Baum-Welch               VPC
              FC      LR      LRB      FC      LR      LRB
    1         80.00   80.00   80.00    80.38   80.38   80.38
    2         72.69   94.23   93.85    71.15   91.92   90.77
    3         66.54   92.31   96.15    63.85   91.15   93.08
    4         80.00   84.80   85.38    53.20   91.20   90.38
    5         75.20   81.20   90.77    59.60   91.20   95.00
    6         75.60   84.80   85.77    55.20   90.40   93.85
    7         77.60   86.40   89.62    45.60   91.20   94.23
    8         76.80   86.00   89.62    44.40   90.40   94.23
    9         77.60   85.60   90.00    49.20   90.40   94.62
    10        76.00   81.60   88.46    43.20   90.00   95.00
    11        65.20   86.80   89.23    42.80   90.00   95.00
    12        74.80   86.80   88.08    40.80   90.00   95.77
    13        84.80   84.00   90.00    39.60   90.00   97.31
    14        72.80   81.60   88.46    38.80   90.40   93.46
    Mean      75.40   85.44   88.96    51.98   89.90   93.08
    Max       84.80   92.31   96.15    63.85   91.20   97.31

Fig. 4. Average percent correct recognition for all 26 video letter gestures against topology and training method.

Fig. 5. Left-Right banded topology.

• The Baum-Welch algorithm did almost as well as VPC, with a best performance of 96.15% correct recognition with only 3 states. Moreover, we achieve a very surprising 80% correct recognition with just a single state.

• Topology (i.e., constraints on the initial value of the A matrix) has more impact on performance than the choice of learning algorithm.

• Good recognition performance can be obtained over a very broad range of N, the number of states.

3.1. Comments on Learning Algorithm Performance

We do not suggest that the above observations can be generalized to other real-world application domains, but anecdotal evidence from other researchers suggests that similar behaviour is often encountered. When we designed this gesture system, we thought that this pattern recognition problem was quite challenging and would significantly differentiate learning strategies. Yet the surprisingly good performance over a number of learning algorithms, topologies, and a broad range of N suggests that the problem is significantly easier than we suspected.

Fig. 6. The alphabet of single-stroke letter hand gestures.

Our intuition suggests that 3 states is far too small a number to adequately model all of these complex letter gestures, but the results show that it is indeed possible to find a three-state HMM which yields very good recognition performance. We conjecture that the observation matrix B provides most of the recognition performance and that recognition may be only weakly affected by good estimation of the transition matrix A.

In support of this idea, we may consider the following interpretation of the HMM. Consider each row of the B matrix as the probability mass function of the observation symbols emitted in a given state. In the limiting case of a single-state HMM, the B matrix becomes a vector of source symbol probabilities, and application of the forward algorithm for recognition is thus equivalent to the well-known and powerful MAP classifier. Indeed, from figure 4, we see that this single-state degenerate HMM can achieve 80% recognition performance. So sometimes, even if the state transitions are poorly modelled, it is quite possible to find good classifiers based on source statistics.
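This degenerate single-state case can be checked directly: with N = 1 the forward probability collapses to a product of per-symbol emission probabilities, i.e., a memoryless classifier on source statistics. The check below is our own illustration (the function and variable names are ours):

```python
import numpy as np

def forward_prob(pi, A, B, obs):
    """Standard forward algorithm for P(O | lambda)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Single-state HMM: pi = [1], A = [[1]], and B is one row of symbol probabilities.
b = np.array([0.5, 0.3, 0.2])
pi, A, B = np.array([1.0]), np.array([[1.0]]), b.reshape(1, -1)

obs = [0, 2, 1, 0]
p_forward = forward_prob(pi, A, B, obs)
p_product = np.prod(b[obs])   # product of per-symbol source probabilities
```

The two quantities agree, which is why the single-state HMM behaves exactly like a classifier built on symbol frequencies alone.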

Now clearly, if three states can yield strong performance, good HMMs with more than three states must also exist; a simple way to prove this is to note that we can always add additional states which are unreachable (i.e., have a transition probability of zero) without affecting recognition performance. This may help explain why performance stays much the same over a broad range of N as we increase N beyond three.

The question that arises is, "Why does the Baum-Welch algorithm perform so well on real-world data despite its theoretical flaws and rather poor performance on the simulated HMM data?" Once again, a possible explanation is that this particular spatio-temporal recognition task is relatively easy, so all methods can do quite well. This conjecture may be


    true for other common HMM applications and is a focus of current research. Unfortunately, unlike simulated data, the effort of gathering very large and diverse databases of real-world pattern recognition problems to evaluate the performance of different training algorithms is immense, so progress is slow.

3.2. Comments on Topology

The FC topology allows transitions from any state to any other state and does not have a defined starting state. Being the most general topology, it is hardly surprising that its performance is relatively poor. In this case we are required to search for a good solution in a parameter space of much higher dimensionality than for LR, so it would be much harder to locate globally optimal solutions. Gestures have a natural start and finish, and thus it is reasonable to adopt the LR model, as used in speech recognition to great effect.

An even simpler topology is the LRB HMM, where only self-transitions and next-state transitions are allowed. Thus the A matrix is of the form:

$$A = \begin{pmatrix} a_{11} & 1 - a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & 1 - a_{22} & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & a_{NN} \end{pmatrix} \qquad (1)$$

In this case the expected number of observations, $\bar{n}_i$ (i.e., duration), in state i is simply given by [24]

$$\bar{n}_i = \frac{1}{1 - a_{ii}}. \qquad (2)$$
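Equation (2) follows because the time spent in state i under self-transition probability $a_{ii}$ is geometrically distributed; a one-line derivation:

```latex
\bar{n}_i
  = \sum_{n=1}^{\infty} n \,(1 - a_{ii})\, a_{ii}^{\,n-1}
  = (1 - a_{ii}) \sum_{n=1}^{\infty} n\, a_{ii}^{\,n-1}
  = \frac{1 - a_{ii}}{(1 - a_{ii})^{2}}
  = \frac{1}{1 - a_{ii}},
```

using the standard series $\sum_{n \ge 1} n x^{n-1} = 1/(1-x)^2$ for $|x| < 1$.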

One can interpret the LRB HMM as being an adjustable clock that ideally synchronizes the changes in observation statistics with the changes in state to produce a time-variant MAP classifier. In many ways this HMM topology may be considered a form of dynamic time-warping [22], an earlier technique used in speech recognition that has fallen out of favour since the advent of HMMs. The good performance of the LR and LRB topologies on the gesture data set helps make the point that simple HMM topologies often work best on real data.

4. Direct Calculation of HMM Parameters from Video Gestures

A major dissatisfaction with the foregoing treatment of HMM learning strategies on real data is that the learning procedure is treated like a black box, with little real insight into the learning process. In contrast to simulated data, we are unlikely to know the true HMM that originally generated the data; indeed, real data is very rarely generated by a process that even remotely resembles an HMM. Thus we usually do not really know the best topology or number of states a priori; practitioners just try ranges of values and pick the one that yields good classification performance. Sometimes, as in the above example, this utilitarian method can find HMMs that are far too simple to accurately characterize the


    state transitions of the underlying pattern. Alternatively, we may find HMMs that are far more complicated than is necessary. In either case this may lead to poorer generalization ability than might otherwise have been the case.

To investigate this topic further, we devised two simple video gestures, as shown in figure 7, where it is possible to approximately determine the form of the true HMM directly from the gestures themselves. In the case of the triangle gesture, this can be modelled by a 3-state HMM, each state corresponding to one side of the triangle. There is no need for skipped states to model this gesture, so the LRB topology is the most appropriate.

4.1. Analytic Calculation of A and B

The triangle gesture is intended to be an equilateral triangle, so the expected number of observations in each state is equal and corresponds to $\bar{n} = 24/3 = 8$. Thus the expected values of the transition matrix (A) parameters can be calculated from the duration equation (2).
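Concretely, inverting equation (2) with $\bar{n} = 8$ gives $a_{ii} = 1 - 1/8 = 0.875$. The check below is our own sketch of the resulting 3-state banded matrix; treating the final state as absorbing ($a_{NN} = 1$) is our assumption, one common choice for the last state of an LRB model:

```python
import numpy as np

T, N = 24, 3             # 24 direction observations, one state per triangle side
n_bar = T / N            # expected duration per state: 8 observations
a_ii = 1 - 1 / n_bar     # invert equation (2): a_ii = 1 - 1/n_bar

# 3-state left-right banded transition matrix for the triangle gesture.
A = np.array([[a_ii, 1 - a_ii, 0.0],
              [0.0,  a_ii,     1 - a_ii],
              [0.0,  0.0,      1.0]])   # final state absorbs (our assumption)
```

Each row sums to one, and the expected dwell time in states 1 and 2 is exactly the 8 observations required by the equilateral-triangle argument above.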

    The triangle gesture consists of a horizon