Statistical Pattern Recognition: A Review
Anil K. Jain, Fellow, IEEE, Robert P.W. Duin, and Jianchang Mao, Senior Member, IEEE
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000

Abstract-The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.

Index Terms-Statistical pattern recognition, classification, clustering, feature extraction, feature selection, error estimation, classifier combination, neural networks.

    1 INTRODUCTION

By the time they are five years old, most children can recognize digits and letters. Small characters, large characters, handwritten, machine printed, or rotated - all are easily recognized by the young. The characters may be written on a cluttered background, on crumpled paper, or may even be partially occluded. We take this ability for granted until we face the task of teaching a machine how to do the same. Pattern recognition is the study of how machines can observe the environment, learn to distinguish patterns of interest from their background, and make sound and reasonable decisions about the categories of the patterns. In spite of almost 50 years of research, design of a general purpose machine pattern recognizer remains an elusive goal.

The best pattern recognizers in most instances are humans, yet we do not understand how humans recognize patterns. Ross [140] emphasizes the work of Nobel Laureate Herbert Simon whose central finding was that pattern recognition is critical in most human decision making tasks: "The more relevant patterns at your disposal, the better your decisions will be. This is hopeful news to proponents of artificial intelligence, since computers can surely be taught to recognize patterns. Indeed, successful computer programs that help banks score credit applicants, help doctors diagnose disease, and help pilots land airplanes

depend in some way on pattern recognition ... We need to pay much more explicit attention to teaching pattern recognition." Our goal here is to introduce pattern recognition as the best possible way of utilizing available sensors, processors, and domain knowledge to make decisions automatically.

1.1 What is Pattern Recognition?
Automatic (machine) recognition, description, classification, and grouping of patterns are important problems in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, artificial intelligence, and remote sensing. But what is a pattern? Watanabe [163] defines a pattern "as opposite of a chaos: it is an entity, vaguely defined, that could be given a name." For example, a pattern could be a fingerprint image, a handwritten cursive word, a human face, or a speech signal. Given a pattern, its recognition/classification may consist of one of the following two tasks [163]: 1) supervised classification (e.g., discriminant analysis) in which the input pattern is identified as a member of a predefined class, 2) unsupervised classification (e.g., clustering) in which the pattern is assigned to a hitherto unknown class. Note that the recognition problem here is being posed as a classification or categorization task, where the classes are either defined by the system designer (in supervised classification) or are learned based on the similarity of patterns (in unsupervised classification).
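As a concrete contrast between the two tasks, the following minimal Python sketch (not from the paper; the synthetic two-dimensional Gaussian data, the nearest-mean rule, and the k-means procedure are illustrative assumptions) treats the same data once with labels (supervised) and once without (unsupervised).

```python
# Sketch: supervised classification with a nearest-mean rule versus
# unsupervised classification (clustering) with k-means on synthetic 2-D data.
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical Gaussian classes in a 2-D feature space.
X0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Supervised: labels are given, so estimate one prototype (mean) per class.
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
pred_sup = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)

# Unsupervised: labels are unknown; discover two clusters with Lloyd's k-means.
centers = X[rng.choice(len(X), size=2, replace=False)]
for _ in range(20):
    assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    centers = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])

print("supervised training accuracy:", (pred_sup == y).mean())
print("cluster sizes found without labels:", np.bincount(assign))
```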

Interest in the area of pattern recognition has been renewed recently due to emerging applications which are not only challenging but also computationally more demanding (see Table 1). These applications include data mining (identifying a "pattern," e.g., correlation, or an outlier in millions of multidimensional patterns), document classification (efficiently searching text documents), financial forecasting, organization and retrieval of multimedia databases, and biometrics (personal identification based on various physical attributes such as face and fingerprints). Picard [125] has identified a novel application of pattern recognition, called affective computing, which will give a computer the ability to recognize and express emotions, to respond intelligently to human emotion, and to employ mechanisms of emotion that contribute to rational decision making. A common characteristic of a number of these applications is that the available features (typically, in the thousands) are not usually suggested by domain experts, but must be extracted and optimized by data-driven procedures.

TABLE 1
Examples of Pattern Recognition Applications

Problem Domain | Application | Input Pattern | Pattern Classes
Bioinformatics | Sequence analysis | DNA/protein sequence | Known types of genes/patterns
Data mining | Searching for meaningful patterns | Points in multidimensional space | Compact and well-separated clusters
Document classification | Internet search | Text document | Semantic categories (e.g., business, sports, etc.)
Document image analysis | Reading machine for the blind | Document image | Alphanumeric characters, words
Industrial automation | Printed circuit board inspection | Intensity or range image | Defective/nondefective nature of product
Multimedia database retrieval | Internet search | Video clip | Video genres (e.g., action, dialogue, etc.)
Biometric recognition | Personal identification | Face, iris, fingerprint | Authorized users for access control
Remote sensing | Forecasting crop yield | Multispectral image | Land use categories, growth pattern of crops
Speech recognition | Telephone directory enquiry without operator assistance | Speech waveform | Spoken words

The rapidly growing and available computing power, while enabling faster processing of huge data sets, has also facilitated the use of elaborate and diverse methods for data analysis and classification. At the same time, demands on automatic pattern recognition systems are rising enormously due to the availability of large databases and stringent performance requirements (speed, accuracy, and cost). In many of the emerging applications, it is clear that no single approach for classification is "optimal" and that multiple methods and approaches have to be used. Consequently, combining several sensing modalities and classifiers is now a commonly used practice in pattern recognition.

The design of a pattern recognition system essentially involves the following three aspects: 1) data acquisition and preprocessing, 2) data representation, and 3) decision making. The problem domain dictates the choice of sensor(s), preprocessing technique, representation scheme, and the decision making model. It is generally agreed that a well-defined and sufficiently constrained recognition problem (small intraclass variations and large interclass variations) will lead to a compact pattern representation and a simple decision making strategy. Learning from a set of examples (training set) is an important and desired attribute of most pattern recognition systems. The four best known approaches for pattern recognition are: 1) template matching, 2) statistical classification, 3) syntactic or structural matching, and 4) neural networks. These models are not necessarily independent and sometimes the same pattern recognition method exists with different interpretations. Attempts have been made to design hybrid systems involving multiple models [57]. A brief description and comparison of these approaches is given below and summarized in Table 2.

1.2 Template Matching
One of the simplest and earliest approaches to pattern recognition is based on template matching. Matching is a generic operation in pattern recognition which is used to determine the similarity between two entities (points, curves, or shapes) of the same type. In template matching, a template (typically, a 2D shape) or a prototype of the pattern to be recognized is available. The pattern to be recognized is matched against the stored template while taking into account all allowable pose (translation and rotation) and scale changes. The similarity measure, often a correlation, may be optimized based on the available training set. Often, the template itself is learned from the training set. Template matching is computationally demanding, but the availability of faster processors has now


TABLE 2
Pattern Recognition Models

Approach | Representation | Recognition Function | Typical Criterion
Template matching | Samples, pixels, curves | Correlation, distance measure | Classification error
Statistical | Features | Discriminant function | Classification error
Syntactic or structural | Primitives | Rules, grammar | Acceptance error
Neural networks | Samples, pixels, features | Network function | Mean square error

made this approach more feasible. The rigid template matching mentioned above, while effective in some application domains, has a number of disadvantages. For instance, it would fail if the patterns are distorted due to the imaging process, viewpoint change, or large intraclass variations among the patterns. Deformable template models [68] or rubber sheet deformations [9] can be used to match patterns when the deformation cannot be easily explained or modeled directly.
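As a rough illustration of rigid template matching (a sketch, not the paper's procedure; the image, the template, and the normalized correlation score are illustrative assumptions), the following snippet slides a stored template over a test image and reports the best-matching position.

```python
# Sketch: rigid template matching by normalized cross-correlation.
import numpy as np

def normalized_correlation(a, b):
    # Correlation of two equally sized patches after removing their means.
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.ravel() @ b.ravel()) / denom if denom > 0 else 0.0

def match_template(image, template):
    th, tw = template.shape
    best_score, best_pos = -1.0, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            score = normalized_correlation(image[r:r + th, c:c + tw], template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Hypothetical example: a small cross-shaped pattern embedded in noise.
rng = np.random.default_rng(1)
image = rng.normal(size=(20, 20))
template = np.array([[0.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 0.0]])
image[10:13, 5:8] += 4.0 * template       # plant the pattern at row 10, column 5
print(match_template(image, template))    # should report a position at or near (10, 5)
```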

1.3 Statistical Approach
In the statistical approach, each pattern is represented in terms of d features or measurements and is viewed as a point in a d-dimensional space. The goal is to choose those features that allow pattern vectors belonging to different categories to occupy compact and disjoint regions in this d-dimensional feature space. The effectiveness of the representation space (feature set) is determined by how well patterns from different classes can be separated. Given a set of training patterns from each class, the objective is to establish decision boundaries in the feature space which separate patterns belonging to different classes. In the statistical decision theoretic approach, the decision boundaries are determined by the probability distributions of the patterns belonging to each class, which must either be specified or learned [41], [44].

One can also take a discriminant analysis-based approach to classification: First a parametric form of the decision boundary (e.g., linear or quadratic) is specified; then the "best" decision boundary of the specified form is found based on the classification of training patterns. Such boundaries can be constructed using, for example, a mean squared error criterion. The direct boundary construction approaches are supported by Vapnik's philosophy [162]: "If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem."
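The discriminant analysis-based route can be sketched in a few lines of Python (an illustration under assumed synthetic data, not the paper's experiments): a linear boundary w.x + b = 0 is posited and its parameters are fitted with a mean squared error criterion on +1/-1 class targets.

```python
# Sketch: fit a linear decision boundary directly by least squares.
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-1, -1], 0.7, (60, 2)),
               rng.normal([+1, +1], 0.7, (60, 2))])
t = np.array([-1.0] * 60 + [+1.0] * 60)           # class targets

A = np.hstack([X, np.ones((len(X), 1))])          # augment with a bias term
w, *_ = np.linalg.lstsq(A, t, rcond=None)         # minimize ||A w - t||^2

pred = np.sign(A @ w)
print("training error:", float((pred != t).mean()))
print("decision boundary: %.2f x1 + %.2f x2 + %.2f = 0" % tuple(w))
```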


1.4 Syntactic Approach
In many recognition problems involving complex patterns, it is more appropriate to adopt a hierarchical perspective where a pattern is viewed as being composed of simple subpatterns which are themselves built from yet simpler subpatterns [56], [121]. The simplest/elementary subpatterns to be recognized are called primitives and the given complex pattern is represented in terms of the interrelationships between these primitives. In syntactic pattern recognition, a formal analogy is drawn between the structure of patterns and the syntax of a language. The patterns are viewed as sentences belonging to a language, primitives are viewed as the alphabet of the language, and the sentences are generated according to a grammar. Thus, a large collection of complex patterns can be described by a small number of primitives and grammatical rules. The grammar for each pattern class must be inferred from the available training samples.

Structural pattern recognition is intuitively appealing because, in addition to classification, this approach also provides a description of how the given pattern is constructed from the primitives. This paradigm has been used in situations where the patterns have a definite structure which can be captured in terms of a set of rules.
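To make the language analogy concrete, here is a toy Python sketch (entirely illustrative, not from the paper): the primitives u, d, r act as the alphabet, a hand-written regular grammar defines one hypothetical "square wave" pattern class, and classification amounts to testing whether a sentence of primitives belongs to the language.

```python
# Toy syntactic recognizer: primitives u (up), d (down), r (right); the class
# "square wave" is generated by the regular grammar  S -> u r d r S | u r d r,
# which is equivalent to the regular expression (urdr)+.
import re

square_wave = re.compile(r"(urdr)+")

for sentence in ["urdr", "urdrurdr", "urdru", "rdru"]:
    label = "square wave" if square_wave.fullmatch(sentence) else "reject"
    print(sentence, "->", label)
```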


1.5 Neural Networks
Neural networks can be viewed as massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections. Neural network models attempt to use some organizational principles (such as learning, generalization, adaptivity, fault tolerance, and distributed representation and computation) in a network of weighted directed graphs in which the nodes are artificial neurons and directed edges (with weights) are connections between neuron outputs and neuron inputs. The main characteristics of neural networks are that they have the ability to learn complex nonlinear input-output relationships, use sequential training procedures, and adapt themselves to the data.

The most commonly used family of neural networks for pattern classification tasks [83] is the feed-forward network, which includes multilayer perceptron and Radial-Basis Function (RBF) networks. These networks are organized into layers and have unidirectional connections between the layers. Another popular network is the Self-Organizing Map (SOM), or Kohonen-Network [92], which is mainly used for data clustering and feature mapping. The learning process involves updating network architecture and connection weights so that a network can efficiently perform a specific classification/clustering task. The increasing popularity of neural network models to solve pattern recognition problems has been primarily due to their seemingly low dependence on domain-specific knowledge (relative to model-based and rule-based approaches) and due to the availability of efficient learning algorithms for practitioners to use.

Neural networks provide a new suite of nonlinear algorithms for feature extraction (using hidden layers) and classification (e.g., multilayer perceptrons). In addition, existing feature extraction and classification algorithms can also be mapped on neural network architectures for efficient (hardware) implementation. In spite of the seemingly different underlying principles, most of the well-known neural network models are implicitly equivalent or similar to classical statistical pattern recognition methods (see Table 3). Ripley [136] and Anderson et al. [5] also discuss this relationship between neural networks and statistical pattern recognition. Anderson et al. point out that "neural networks are statistics for amateurs ... Most NNs conceal the statistics from the user." Despite these similarities, neural networks do offer several advantages such as, unified approaches for feature extraction and classification and flexible procedures for finding good, moderately nonlinear solutions.
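A usage sketch of the feed-forward family discussed above (the Iris data, a single hidden layer of 10 units, and scikit-learn's MLPClassifier are illustrative choices, not the paper's experiments):

```python
# Sketch: a small multilayer perceptron classifier trained on the Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One hidden layer of 10 units, trained by gradient-based optimization.
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
print("test accuracy:", net.score(X_te, y_te))
```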

1.6 Scope and Organization
In the remainder of this paper we will primarily review statistical methods for pattern representation and classification, emphasizing recent developments. Whenever appropriate, we will also discuss closely related algorithms from the neural networks literature. We omit the whole body of literature on fuzzy classification and fuzzy clustering which are in our opinion beyond the scope of this paper. Interested readers can refer to the well-written books on fuzzy pattern recognition by Bezdek [15] and [16]. In most of the sections, the various approaches and methods are summarized in tables as an easy and quick reference for the reader. Due to space constraints, we are not able to provide many details and we have to omit some of the approaches and the associated references. Our goal is to emphasize those approaches which have been extensively evaluated and demonstrated to be useful in practical applications, along with the new trends and ideas.

The literature on pattern recognition is vast and scattered in numerous journals in several disciplines (e.g., applied statistics, machine learning, neural networks, and signal and image processing). A quick scan of the table of contents of all the issues of the IEEE Transactions on Pattern Analysis and Machine Intelligence, since its first publication in January 1979, reveals that approximately 350 papers deal with pattern recognition. Approximately 300 of these papers covered the statistical approach and can be broadly categorized into the following subtopics: curse of dimensionality (15), dimensionality reduction (50), classifier design (175), classifier combination (10), error estimation (25) and unsupervised classification (50). In addition to the excellent textbooks by Duda and Hart [44], Fukunaga [58], Devijver and Kittler [39], Devroye et al.


TABLE 3
Links Between Statistical and Neural Network Methods

Statistical Pattern Recognition | Artificial Neural Networks
Linear Discriminant Function | Perceptron
Principal Component Analysis | Auto-Associative Network, and various PCA networks
A Posteriori Probability Estimation | Multilayer Perceptron
Nonlinear Discriminant Analysis | Multilayer Perceptron
Parzen Window Density-based Classifier | Radial Basis Function Network
Edited K-NN Rule | Kohonen's LVQ

[Figure: block diagram of a statistical pattern recognition system, with separate preprocessing and feature measurement/extraction/selection stages for the test and training patterns.]

The topic of probabilistic distance measures is currently not as important as 20 years ago, since it is very difficult to estimate density functions in high dimensional feature spaces. Instead, the complexity of classification procedures and the resulting accuracy have


The decision making process in statistical pattern recognition can be summarized as follows: A given pattern is to be assigned to one of $c$ categories $\omega_1, \omega_2, \ldots, \omega_c$ based on a vector of $d$ feature values $x = (x_1, x_2, \ldots, x_d)$. The features are assumed to have a probability density or mass (depending on whether the features are continuous or discrete) function conditioned on the pattern class. Thus, a pattern vector $x$ belonging to class $\omega_i$ is viewed as an observation drawn randomly from the class-conditional probability function $p(x \mid \omega_i)$. A number of well-known decision rules, including the Bayes decision rule, the maximum likelihood rule (which can be viewed as a particular case of the Bayes rule), and the Neyman-Pearson rule are available to define the decision boundary. The "optimal" Bayes decision rule for minimizing the risk (expected value of the loss function) can be stated as follows: Assign input pattern $x$ to class $\omega_i$ for which the conditional risk

$$R(\omega_i \mid x) = \sum_{j=1}^{c} L(\omega_i, \omega_j) \, P(\omega_j \mid x) \qquad (1)$$

is minimum, where $L(\omega_i, \omega_j)$ is the loss incurred in deciding $\omega_i$ when the true class is $\omega_j$ and $P(\omega_j \mid x)$ is the posterior probability [44]. In the case of the 0/1 loss function, as defined in (2), the conditional risk becomes the conditional probability of misclassification,

$$L(\omega_i, \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j. \end{cases} \qquad (2)$$

For this choice of loss function, the Bayes decision rule can be simplified as follows (also called the maximum a posteriori (MAP) rule): Assign input pattern $x$ to class $\omega_i$ if

$$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \text{for all } j \neq i.$$

Various strategies are utilized to design a classifier in statistical pattern recognition, depending on the kind of information available about the class-conditional densities. If all of the class-conditional densities are completely specified, then the optimal Bayes decision rule can be used to design a classifier. However, the class-conditional densities are usually not known in practice and must be learned from the available training patterns. If the form of the class-conditional densities is known (e.g., multivariate Gaussian), but some of the parameters of the densities (e.g., mean vectors and covariance matrices) are unknown, then we have a parametric decision problem. A common strategy for this kind of problem is to replace the unknown parameters in the density functions by their estimated values, resulting in the so-called Bayes plug-in classifier. The optimal Bayesian strategy in this situation requires additional information in the form of a prior distribution on the unknown parameters. If the form of the class-conditional densities is not known, then we operate in a nonparametric mode. In this case, we must either estimate the density function (e.g., Parzen window approach) or directly construct the decision boundary based on the training data (e.g., k-nearest neighbor rule). In fact, the multilayer perceptron can also be viewed as a


supervised nonparametric method which constructs a decision boundary.
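The plug-in version of the MAP rule in (1)-(2) can be sketched as follows (a minimal illustration assuming Gaussian class-conditional densities, synthetic data, and equal priors; none of these choices come from the paper):

```python
# Sketch: Bayes plug-in (MAP) classifier with estimated Gaussian densities.
import numpy as np

def gaussian_logpdf(x, mean, cov):
    # log of a multivariate Gaussian density evaluated at x
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(3)
X0 = rng.multivariate_normal([0, 0], np.eye(2), 200)
X1 = rng.multivariate_normal([2, 1], [[1.0, 0.3], [0.3, 1.0]], 200)

# Plug-in estimates of the unknown parameters, and equal priors P(w_i) = 1/2.
params = [(X.mean(axis=0), np.cov(X, rowvar=False)) for X in (X0, X1)]
log_priors = np.log([0.5, 0.5])

def map_decision(x):
    # argmax_i  log p(x | w_i) + log P(w_i)   (0/1 loss => MAP rule)
    scores = [gaussian_logpdf(x, m, C) + lp for (m, C), lp in zip(params, log_priors)]
    return int(np.argmax(scores))

print(map_decision(np.array([0.1, -0.2])))   # expected: class 0
print(map_decision(np.array([2.2, 1.1])))    # expected: class 1
```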

Another dichotomy in statistical pattern recognition is that of supervised learning (labeled training samples) versus unsupervised learning (unlabeled training samples). The label of a training pattern represents the category to which that pattern belongs. In an unsupervised learning problem, sometimes the number of classes must be learned along with the structure of each class. The various dichotomies that appear in statistical pattern recognition are shown in the tree structure of Fig. 2. As we traverse the tree from top to bottom and left to right, less information is available to the system designer and as a result, the difficulty of classification problems increases. In some sense, most of the approaches in statistical pattern recognition (leaf nodes in the tree of Fig. 2) are attempting to implement the Bayes decision rule. The field of cluster analysis essentially deals with decision making problems in the nonparametric and unsupervised learning mode [81]. Further, in cluster analysis the number of categories or clusters may not even be specified; the task is to discover a reasonable categorization of the data (if one exists). Cluster analysis algorithms along with various techniques for visualizing and projecting multidimensional data are also referred to as exploratory data analysis methods.

Yet another dichotomy in statistical pattern recognition can be based on whether the decision boundaries are obtained directly (geometric approach) or indirectly (probabilistic density-based approach) as shown in Fig. 2. The probabilistic approach requires one to estimate density functions first, and then construct the discriminant functions which specify the decision boundaries. On the other hand, the geometric approach often constructs the decision boundaries directly from optimizing certain cost functions. We should point out that under certain assumptions on the density functions, the two approaches are equivalent. We will see examples of each category in Section 5.

No matter which classification or decision rule is used, it must be trained using the available training samples. As a result, the performance of a classifier depends on both the number of available training samples as well as the specific values of the samples. At the same time, the goal of designing a recognition system is to classify future test samples which are likely to be different from the training samples. Therefore, optimizing a classifier to maximize its performance on the training set may not always result in the desired performance on a test set. The generalization ability of a classifier refers to its performance in classifying test patterns which were not used during the training stage. A poor generalization ability of a classifier can be attributed to any one of the following factors: 1) the number of features is too large relative to the number of training samples (curse of dimensionality [80]), 2) the number of unknown parameters associated with the classifier is large (e.g., polynomial classifiers or a large neural network), and 3) a classifier is too intensively optimized on the training set (overtrained); this is analogous to the



[Fig. 2. Various approaches in statistical pattern recognition, shown as a tree of dichotomies (including Bayes decision theory).]


Details of this dataset are available in [160]. In our experiments we always used the same subset of 1,000 patterns for testing and various subsets of the remaining 1,000 patterns for training. Throughout this paper, when we refer to "the digit dataset," just the Karhunen-Loeve features (in item 3) are meant, unless stated otherwise.

3 THE CURSE OF DIMENSIONALITY AND PEAKING PHENOMENA

The performance of a classifier depends on the interrelationship between sample sizes, number of features, and classifier complexity. A naive table-lookup technique (partitioning the feature space into cells and associating a class label with each cell) requires the number of training data points to be an exponential function of the feature dimension [18]. This phenomenon is termed the "curse of dimensionality," which leads to the "peaking phenomenon" (see discussion below) in classifier design. It is well-known that the probability of misclassification of a decision rule does not increase as the number of features increases, as long as the class-conditional densities are completely known (or, equivalently, the number of training samples is arbitrarily large and representative of the underlying densities). However, it has been often observed in practice that the added features may actually degrade the performance of a classifier if the number of training samples that are used to design the classifier is small relative to the number of features. This paradoxical behavior is referred to as the peaking phenomenon [80], [131], [132]. A simple explanation for this phenomenon is as follows: The most commonly used parametric classifiers estimate the unknown parameters and plug them in for the true parameters in the class-conditional densities. For a fixed sample size, as the number of features is increased (with a corresponding increase in the number of unknown parameters), the reliability of the parameter estimates decreases. Consequently, the performance of the resulting plug-in classifiers, for a fixed sample size, may degrade with an increase in the number of features.

Trunk [157] provided a simple example to illustrate the curse of dimensionality which we reproduce below. Consider the two-class classification problem with equal prior probabilities, and a d-dimensional multivariate Gaussian distribution with the identity covariance matrix for each class. The mean vectors for the two classes have the following components:

$$m = \left(1, \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{3}}, \ldots, \frac{1}{\sqrt{d}}\right), \quad m_1 = m, \; m_2 = -m. \qquad (3)$$

Note that the features are statistically independent and the discriminating power of the successive features decreases monotonically, with the first feature providing the maximum discrimination between the two classes. The only parameter in the densities is the mean vector, $m = m_1 = -m_2$.

2. The dataset is available through the University of California, Irvine Machine Learning Repository (www.ics.uci.edu/~mlearn/MLRepository.html).

3. In the rest of this paper, we do not make a distinction between the curse of dimensionality and the peaking phenomenon.

    Trunk considered the following two cases:

1. The mean vector m is known. In this situation, we can use the optimal Bayes decision rule (with a 0/1 loss function) to construct the decision boundary. The probability of error as a function of d can be expressed as:

$$P_e(d) = \int_{\sqrt{\sum_{i=1}^{d} 1/i}}^{\infty} \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2} \, dz. \qquad (4)$$

It is easy to verify that $\lim_{d \to \infty} P_e(d) = 0$. In other words, we can perfectly discriminate the two classes by arbitrarily increasing the number of features, d.

2. The mean vector m is unknown and n labeled training samples are available. Trunk found the maximum likelihood estimate $\hat{m}$ of $m$ and used the plug-in decision rule (substitute $\hat{m}$ for $m$ in the optimal Bayes decision rule). Now the probability of error, which is a function of both $n$ and $d$, can be written as $P_e(n, d)$ (equations (5) and (6)).

Trunk showed that $\lim_{d \to \infty} P_e(n, d) = \frac{1}{2}$, which implies that the probability of error approaches the maximum possible value of 0.5 for this two-class problem. This demonstrates that, unlike case 1), we cannot arbitrarily

increase the number of features when the parameters of class-conditional densities are estimated from a finite number of training samples. The practical implication of the curse of dimensionality is that a system designer should try to select only a small number of salient features when confronted with a limited training set.

All of the commonly used classifiers, including multilayer feed-forward networks, can suffer from the curse of dimensionality. While an exact relationship between the probability of misclassification, the number of training samples, the number of features and the true parameters of the class-conditional densities is very difficult to establish, some guidelines have been suggested regarding the ratio of the sample size to dimensionality. It is generally accepted that using at least ten times as many training samples per class as the number of features ($n/d > 10$) is a good practice to follow in classifier design [80]. The more complex the classifier, the larger should the ratio of sample size to dimensionality be to avoid the curse of dimensionality.
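A small Monte Carlo sketch of Trunk's example (a synthetic reconstruction under the stated Gaussian model, not the paper's code; the sample sizes and dimensionalities are illustrative) exhibits the peaking behavior: for a fixed number of training samples, the estimated error of the plug-in rule tends to dip at a moderate number of features and then climb back toward 0.5 as more features are added.

```python
# Sketch: peaking phenomenon for the plug-in rule with estimated class means.
import numpy as np

rng = np.random.default_rng(4)
n, n_test, dims = 10, 500, [1, 5, 20, 100, 500, 2000, 5000]

for d in dims:
    m = 1.0 / np.sqrt(np.arange(1, d + 1))            # true mean (+m vs -m)
    # Estimate the mean from n labeled samples of each class.
    m_hat = 0.5 * (rng.normal(m, 1.0, (n, d)).mean(0)
                   - rng.normal(-m, 1.0, (n, d)).mean(0))
    # Test on fresh samples from class +m; the plug-in rule thresholds m_hat.x at 0.
    X_test = rng.normal(m, 1.0, (n_test, d))
    err = float((X_test @ m_hat < 0).mean())
    # The error typically decreases for small d and rises again for large d.
    print(f"d = {d:5d}   estimated error = {err:.3f}")
```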

4 DIMENSIONALITY REDUCTION

There are two main reasons to keep the dimensionality of the pattern representation (i.e., the number of features) as small as possible: measurement cost and classification


accuracy. A limited yet salient feature set simplifies both the pattern representation and the classifiers that are built on the selected representation. Consequently, the resulting classifier will be faster and will use less memory. Moreover, as stated earlier, a small number of features can alleviate the curse of dimensionality when the number of training samples is limited. On the other hand, a reduction in the number of features may lead to a loss in the discrimination power and thereby lower the accuracy of the resulting recognition system. Watanabe's ugly duckling theorem [163] also supports the need for a careful choice of the features, since it is possible to make two arbitrary patterns similar by encoding them with a sufficiently large number of redundant features.

It is important to make a distinction between feature selection and feature extraction. The term feature selection refers to algorithms that select the (hopefully) best subset of the input feature set. Methods that create new features based on transformations or combinations of the original feature set are called feature extraction algorithms. However, the terms feature selection and feature extraction are used interchangeably in the literature. Note that often feature extraction precedes feature selection; first, features are extracted from the sensed data (e.g., using principal component or discriminant analysis) and then some of the extracted features with low discrimination ability are discarded. The choice between feature selection and feature extraction depends on the application domain and the specific training data which is available. Feature selection leads to savings in measurement cost (since some of the features are discarded) and the selected features retain their original physical interpretation. In addition, the retained features may be important for understanding the physical process that generates the patterns. On the other hand, transformed features generated by feature extraction may provide a better discriminative ability than the best subset of given features, but these new features (a linear or a nonlinear combination of given features) may not have a clear physical meaning.

In many situations, it is useful to obtain a two- or three-dimensional projection of the given multivariate data (n x d pattern matrix) to permit a visual examination of the data. Several graphical techniques also exist for visually observing multivariate data, in which the objective is to exactly depict each pattern as a picture with d degrees of freedom, where d is the given number of features. For example, Chernoff [29] represents each pattern as a cartoon face whose facial characteristics, such as nose length, mouth curvature, and eye size, are made to correspond to individual features. Fig. 3 shows three faces corresponding to the mean vectors of Iris Setosa, Iris Versicolor, and Iris Virginica classes in the Iris data (150 four-dimensional patterns; 50 patterns per class). Note that the face associated with Iris Setosa looks quite different from the other two faces which implies that the Setosa category can be well separated from the remaining two categories in the four-dimensional feature space (this is also evident in the two-dimensional plots of this data in Fig. 5).

The main issue in dimensionality reduction is the choice of a criterion function. A commonly used criterion is the classification error of a feature subset. But the classification error itself cannot be reliably estimated when the ratio of sample size to the number of features is small. In addition to the choice of a criterion function, we also need to determine the appropriate dimensionality of the reduced feature space. The answer to this question is embedded in the notion of the intrinsic dimensionality of data. Intrinsic dimensionality essentially determines whether the given d-dimensional patterns can be described adequately in a subspace of dimensionality less than d. For example, d-dimensional patterns along a reasonably smooth curve have an intrinsic dimensionality of one, irrespective of the value of d. Note that the intrinsic dimensionality is not the same as the linear dimensionality which is a global property of the data involving the number of significant eigenvalues of the covariance matrix of the data. While several algorithms are available to estimate the intrinsic dimensionality [81], they do not indicate how a subspace of the identified dimensionality can be easily identified.

We now briefly discuss some of the commonly used methods for feature extraction and feature selection.

4.1 Feature Extraction
Feature extraction methods determine an appropriate subspace of dimensionality m (either in a linear or a nonlinear way) in the original feature space of dimensionality d ($m \le d$). Linear transforms, such as principal component analysis, factor analysis, linear discriminant analysis, and projection pursuit have been widely used in pattern recognition for feature extraction and dimensionality reduction. The best known linear feature extractor is the principal component analysis (PCA) or Karhunen-Loeve expansion, that computes the m largest eigenvectors of the d x d covariance matrix of the n d-dimensional patterns. The linear transformation is defined as

$$Y = X H, \qquad (7)$$

where $X$ is the given $n \times d$ pattern matrix, $Y$ is the derived $n \times m$ pattern matrix, and $H$ is the $d \times m$ matrix of linear transformation whose columns are the eigenvectors. Since PCA uses the most expressive features (eigenvectors with the largest eigenvalues), it effectively approximates the data by a linear subspace using the mean squared error criterion. Other methods, like projection pursuit [53] and independent component analysis (ICA) [31], [11], [24], [96] are more appropriate for non-Gaussian distributions since they do not rely on the second-order property of the data. ICA has been successfully used for blind-source separation [78]; extracting linear feature combinations that define independent sources. This demixing is possible if at most one of the sources has a Gaussian distribution.
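The transformation in (7) can be sketched directly with NumPy (an illustration on synthetic data; the choice m = 2 is an assumption, not a value from the paper):

```python
# Sketch: PCA as in (7) -- H holds the m leading eigenvectors of the d x d
# covariance matrix, and Y = X H is the projected n x m pattern matrix.
import numpy as np

rng = np.random.default_rng(5)
n, d, m = 200, 5, 2
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d))   # rank-3 data in d dimensions

Xc = X - X.mean(axis=0)                 # PCA is defined on centered data
cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues for a symmetric matrix
H = eigvecs[:, ::-1][:, :m]             # m eigenvectors with the largest eigenvalues

Y = Xc @ H                              # equation (7): Y = X H
print("explained variance fraction:",
      float(eigvals[::-1][:m].sum() / eigvals.sum()))
print("projected pattern matrix shape:", Y.shape)   # (n, m)
```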

Whereas PCA is an unsupervised linear feature extraction method, discriminant analysis uses the category information associated with each pattern for (linearly) extracting the most discriminatory features. In discriminant analysis, interclass separation is emphasized by replacing the total covariance matrix in PCA by a general separability measure like the Fisher criterion, which results in finding the eigenvectors of $S_w^{-1} S_b$ (the product of the inverse of the within-class scatter matrix, $S_w$, and the between-class scatter matrix, $S_b$).


[Fig. 4. Autoassociative networks for finding a three-dimensional subspace. (a) Linear and (b) nonlinear (not all the connections are shown).]

presented to the network in a random order. At each presentation, the winner whose weight vector is the closest to the input vector is first identified. Then, all the neurons in the neighborhood (defined on the grid) of the winner are updated such that their weight vectors move towards the input vector. Consequently, after training is done, the weight vectors of neighboring neurons in the grid are likely to represent input patterns which are close in the original feature space. Thus, a "topology-preserving" map is formed. When the grid is plotted in the original space, the grid connections are more or less stressed according to the density of the training data. Thus, SOM offers an m-dimensional map with a spatial connectivity, which can be interpreted as feature extraction. SOM is different from learning vector quantization (LVQ) because no neighborhood is defined in LVQ.

Table 4 summarizes the feature extraction and projection methods discussed above. Note that the adjective nonlinear may be used both for the mapping (being a nonlinear function of the original features) as well as for the criterion function (for non-Gaussian data). Fig. 5 shows an example of four different two-dimensional projections of the four-dimensional Iris dataset. Fig. 5a and Fig. 5b show two linear mappings, while Fig. 5c and Fig. 5d depict two nonlinear mappings. Only the Fisher mapping (Fig. 5b) makes use of the category information, this being the main reason why this mapping exhibits the best separation between the three categories.

4.2 Feature Selection
The problem of feature selection is defined as follows: given a set of d features, select a subset of size m that leads to the smallest classification error. There has been a resurgence of interest in applying feature selection methods due to the large number of features encountered in the following situations: 1) multisensor fusion: features, computed from different sensor modalities, are concatenated to form a feature vector with a large number of components; 2) integration of multiple data models: sensor data can be modeled using different approaches, where the model parameters serve as features, and the parameters from different models can be pooled to yield a high-dimensional feature vector.

Let Y be the given set of features, with cardinality d, and let m represent the desired number of features in the selected subset X, $X \subseteq Y$. Let the feature selection criterion function for the set X be represented by J(X). Let us assume that a higher value of J indicates a better feature subset; a natural choice for the criterion function is $J = (1 - P_e)$, where $P_e$ denotes the classification error. The use of $P_e$ in the criterion function makes feature selection procedures dependent on the specific classifier that is used and the sizes of the training and test sets. The most straightforward approach to the feature selection problem would require 1) examining all $\binom{d}{m}$ possible subsets of size m, and 2) selecting the subset with the largest value of J(.). However, the number of possible subsets grows combinatorially, making this exhaustive search impractical for even moderate values of m and d. Cover and Van Campenhout [35] showed that no nonexhaustive sequential feature selection procedure can be guaranteed to produce the optimal subset. They further showed that any ordering of the classification errors of each of the $2^d$ feature subsets is possible. Therefore, in order to guarantee the optimality of, say, a 12-dimensional feature subset out of 24 available features, approximately 2.7 million possible subsets must be evaluated. The only "optimal" (in terms of a class of monotonic criterion functions) feature selection method which avoids the exhaustive search is based on the branch and bound algorithm. This procedure avoids an exhaustive search by using intermediate results for obtaining bounds on the final criterion value. The key to this algorithm is the


    ". '

    .r .

    (a)

    .:,),

    ~,:' """

    ..!; ~.~.

    ; ~' ".-, : .

    (e )

    15

    ~ -1

    . , . " I~~

    (b)

    ,! ~:, "

    ";9

    ,", .Jc

    I ":\;

    ,-

    t~~: ! : - '

    (d)

    Fig. 5. Two-dimensional mappings of the Iris dataset (+: Iris 8etosa: ' : Iris Versicolor; 0: Iris Virginica). (a) PCA, (b) Fisher Mapping, (c) Sammon

    Mapping, and (d) Kernel PCA with second order polynomial kernel.

monotonicity property of the criterion function J(.); given two feature subsets X1 and X2, if $X1 \subset X2$, then $J(X1) < J(X2)$. In other words, the performance of a feature subset should improve whenever a feature is added to it. Most commonly used criterion functions do not satisfy this monotonicity property.

It has been argued that since feature selection is typically done in an off-line manner, the execution time of a particular algorithm is not as critical as the optimality of the feature subset it generates. While this is true for feature sets of moderate size, several recent applications, particularly those in data mining and document classification, involve thousands of features. In such cases, the computational requirement of a feature selection algorithm is extremely important. As the number of feature subset evaluations may easily become prohibitive for large feature sizes, a number of suboptimal selection techniques have been proposed which essentially tradeoff the optimality of the selected subset for computational efficiency.

Table 5 lists most of the well-known feature selection methods which have been proposed in the literature [85]. Only the first two methods in this table guarantee an optimal subset. All other strategies are suboptimal due to the fact that the best pair of features need not contain the best single feature [34]. In general, good, larger feature sets do not necessarily include the good, small sets. As a result, the simple method of selecting just the best individual features may fail dramatically. It might still be useful, however, as a first step to select some individually good features in decreasing very large feature sets (e.g., hundreds of features). Further selection has to be done by more advanced methods that take feature dependencies into account. These operate either by evaluating growing feature sets (forward selection) or by evaluating shrinking feature sets (backward selection). A simple sequential method like SFS (SBS) adds (deletes) one feature at a time. More sophisticated techniques are the "Plus l - take away r" strategy and the Sequential Floating Search methods, SFFS and SBFS [126]. These methods backtrack as long as they find improvements compared to previous feature sets of the same size. In almost any large feature selection problem, these methods perform better than the straight sequential searches, SFS and SBS. SFFS and SBFS methods find "nested" sets of features that remain hidden otherwise, but the number of feature set evaluations, however, may easily increase by a factor of 2 to 10.
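Sequential forward selection can be sketched as follows (an illustration, not the paper's experimental protocol; the nearest-mean wrapper criterion, the synthetic data, and the held-out split are assumptions):

```python
# Sketch: sequential forward selection (SFS). Start from the empty set and
# repeatedly add the single feature that most improves the criterion J(X);
# here J is (1 - error) of a nearest-mean classifier on a held-out split.
import numpy as np

def nearest_mean_accuracy(Xtr, ytr, Xte, yte):
    classes = np.unique(ytr)
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
    d2 = ((Xte[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return float((classes[d2.argmin(axis=1)] == yte).mean())

def sfs(Xtr, ytr, Xte, yte, m):
    selected, remaining = [], list(range(Xtr.shape[1]))
    while len(selected) < m:
        scores = [(nearest_mean_accuracy(Xtr[:, selected + [f]], ytr,
                                         Xte[:, selected + [f]], yte), f)
                  for f in remaining]
        best_score, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
        print(f"added feature {best_f}, J = {best_score:.3f}")
    return selected

# Synthetic data: only features 0 and 1 carry class information.
rng = np.random.default_rng(6)
y = rng.integers(0, 2, 300)
X = rng.normal(size=(300, 8))
X[:, 0] += 2.0 * y
X[:, 1] -= 1.5 * y
print("selected subset:", sfs(X[:200], y[:200], X[200:], y[200:], m=3))
```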


TABLE 4
Feature Extraction and Projection Methods

Method | Property | Comments
Principal Component Analysis (PCA) | Linear map; fast; eigenvector-based | Traditional eigenvector-based method, also known as Karhunen-Loeve expansion; good for Gaussian data.
Linear Discriminant Analysis | Supervised linear map; fast; eigenvector-based | Better than PCA for classification; limited to (c - 1) components with nonzero eigenvalues.
Projection Pursuit | Linear map; iterative; non-Gaussian criterion | Mainly used for interactive exploratory data analysis.
Independent Component Analysis (ICA) | Linear map; iterative; non-Gaussian criterion | Blind source separation; used for de-mixing non-Gaussian distributed sources (features).
Nonlinear PCA | Nonlinear map; usually iterative; non-Gaussian criterion | Nonlinear variant of PCA.
Kernel PCA | Nonlinear map; eigenvector-based | PCA computed in a kernel-induced feature space.
PCA Network | Linear map; iterative | Neural network implementation of PCA.
Nonlinear auto-associative network | Nonlinear map; iterative | The nonlinear map is optimized by a nonlinear reconstruction; the input is used as the target.
Multidimensional Scaling (MDS) and Sammon's projection | Nonlinear map; iterative | Often poor generalization; sample size limited; noise sensitive; mainly used for two-dimensional visualization.
Self-Organizing Map (SOM) | Nonlinear map; iterative | Based on a grid of neurons; mainly used for two- or three-dimensional visualization and feature mapping.

TABLE 5
Feature Selection Methods

Method | Property | Comments
Exhaustive search | Evaluate all possible subsets of size m | Guaranteed to find the optimal subset; not feasible for even moderately large values of m and d.
Branch-and-bound search | Uses the branch-and-bound principle with a monotonic criterion | Finds the optimal subset without exhaustive search; the criterion function must be monotone.
Best individual features | Rank the d features individually and select the m best | Computationally simple; not likely to lead to an optimal subset.
Sequential Forward Selection (SFS) | Add, one at a time, the feature that most improves the criterion | Once a feature is retained, it cannot be discarded; computationally attractive.
Sequential Backward Selection (SBS) | Start with all d features and delete one feature at a time | Once a feature is deleted, it cannot be brought back; requires more computation than SFS.
"Plus l - take away r" selection | First enlarge the feature subset by l features using forward selection and then delete r features using backward selection | Avoids the problem of feature subset "nesting" encountered in SFS and SBS methods; need to select values of l and r (l > r).
Sequential Floating Search (SFFS, SBFS) | A generalization of the "plus l - take away r" method; the values of l and r are determined automatically and updated dynamically | Provides close to optimal solution at an affordable computational cost.


selection algorithms in terms of classification error and run time. The general conclusion is that the sequential forward floating search (SFFS) method performs almost as well as the branch-and-bound algorithm and demands lower computational resources. Somol et al. [154] have proposed an adaptive version of the SFFS algorithm which has been shown to have superior performance.

The feature selection methods in Table 5 can be used with any of the well-known classifiers. But, if a multilayer feed-forward network is used for pattern classification, then the node-pruning method simultaneously determines both the optimal feature subset and the optimal network classifier [26], [103]. First train a network and then remove the least salient node (in input or hidden layers). The reduced network is trained again, followed by a removal of yet another least salient node. This procedure is repeated until the desired trade-off between classification error and size of the network is achieved. The pruning of an input node is equivalent to removing the corresponding feature.

How reliable are the feature selection results when the ratio of the available number of training samples to the number of features is small? Suppose the Mahalanobis distance [58] is used as the feature selection criterion. It depends on the inverse of the average class covariance matrix. The imprecision in its estimate in small sample size situations can result in an optimal feature subset which is quite different from the optimal subset that would be obtained when the covariance matrix is known. Jain and Zongker [85] illustrate this phenomenon for a two-class classification problem involving 20-dimensional Gaussian class-conditional densities (the same data was also used by Trunk [157] to demonstrate the curse of dimensionality phenomenon). As expected, the quality of the selected feature subset for small training sets is poor, but improves as the training set size increases. For example, with 20 patterns in the training set, the branch-and-bound algorithm selected a subset of 10 features which included only five features in common with the ideal subset of 10 features (when densities were known). With 2,500 patterns in the training set, the branch-and-bound procedure selected a 10-feature subset with only one wrong feature.

Fig. 6 shows an example of the feature selection procedure using the floating search technique on the PCA features in the digit dataset for two different training set sizes. The test set size is fixed at 1,000 patterns. In each of the selected feature spaces with dimensionalities ranging from 1 to 64, the Bayes plug-in classifier is designed assuming Gaussian densities with equal covariance matrices and evaluated on the test set. The feature selection criterion is the minimum pairwise Mahalanobis distance. In the small sample size case (total of 100 training patterns), the curse of dimensionality phenomenon can be clearly observed. In this case, the optimal number of features is about 20 which equals n/5 (n = 100), where n is the number of training patterns. The rule-of-thumb of having less than n/10 features is on the safe side in general.

    5 CLASSIFIERS

Once a feature selection or classification procedure finds a proper representation, a classifier can be designed using a number of possible approaches. In practice, the choice of a classifier is a difficult problem and it is often based on which classifier(s) happen to be available, or best known, to the user.

[Fig. 6. Classification error vs. the number of features using the floating search feature selection technique (see text); curves for 100 and 1,000 training patterns.]

We identify three different approaches to designing a classifier. The simplest and the most intuitive approach to classifier design is based on the concept of similarity: patterns that are similar should be assigned to the same class. So, once a good metric has been established to define similarity, patterns can be classified by template matching or the minimum distance classifier using a few prototypes per class. The choice of the metric and the prototypes is crucial to the success of this approach. In the nearest mean classifier, selecting prototypes is very simple and robust: each pattern class is represented by a single prototype which is the mean vector of all the training patterns in that class. More advanced techniques for computing prototypes are vector quantization [115], [171] and learning vector quantization [92], and the data reduction methods associated with the one-nearest neighbor decision rule (1-NN), such as editing and condensing [39]. The most straightforward 1-NN rule can be conveniently used as a benchmark for all the other classifiers since it appears to always provide a reasonable classification performance in most applications. Further, as the 1-NN classifier does not require any user-specified parameters (except perhaps the distance metric used to find the nearest neighbor, but Euclidean distance is commonly used), its classification results are implementation independent.
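A minimal sketch of the 1-NN benchmark discussed above (the Iris data and the particular train/test split are assumptions, not the paper's setup):

```python
# Sketch: 1-NN classifier -- each test pattern receives the label of its
# nearest training pattern under Euclidean distance.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Pairwise squared Euclidean distances between test and training patterns.
d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
pred = y_tr[d2.argmin(axis=1)]            # label of the single nearest neighbor
print("1-NN test accuracy:", float((pred == y_te).mean()))
```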

In many classification problems, the classifier is expected to have some desired invariant properties. An example is the shift invariance of characters in character recognition; a change in a character's location should not affect its classification. If the preprocessing or the representation scheme does not normalize the input pattern for this invariance, then the same character may be represented at multiple positions in the feature space. These positions define a one-dimensional subspace. As more invariants are considered, the dimensionality of this subspace correspondingly increases. Template matching


or the nearest mean classifier can be viewed as finding the nearest subspace [116].

The second main concept used for designing pattern classifiers is based on the probabilistic approach. The optimal Bayes decision rule (with the 0/1 loss function) assigns a pattern to the class with the maximum posterior probability. This rule can be modified to take into account costs associated with different types of misclassifications. For known class conditional densities, the Bayes decision rule gives the optimum classifier, in the sense that, for given prior probabilities, loss function and class-conditional densities, no other decision rule will have a lower risk (i.e., expected value of the loss function, for example, probability of error). If the prior class probabilities are equal and a 0/1 loss function is adopted, the Bayes decision rule and the maximum likelihood decision rule exactly coincide. In practice, the empirical Bayes decision rule, or "plug-in" rule, is used: the estimates of the densities


One of the interesting characteristics of multilayer perceptrons is that in addition to classifying an input pattern, they also provide a confidence in the classification, which is an approximation of the posterior probabilities. These confidence values may be used for rejecting a test pattern in case of doubt. The radial basis function (about a Gaussian kernel) is better suited than the sigmoid transfer function for handling outliers. A radial basis network, however, is usually trained differently than a multilayer perceptron. Instead of a gradient search on the weights, hidden neurons are added until some preset performance is reached. The classification result is comparable to situations where each class conditional density is represented by a weighted sum of Gaussians (a so-called Gaussian mixture; see Section 8.2).

A special type of classifier is the decision tree [22], [129], which is trained by an iterative selection of individual features that are most salient at each node of the tree. The criteria for feature selection and tree generation include the information content, the node purity, or Fisher's criterion. During classification, only those features that are needed for the test pattern under consideration are used, so feature selection is implicitly built in. The most commonly used decision tree classifiers are binary in nature and use a single feature at each node, resulting in decision boundaries that are parallel to the feature axes [149]. Consequently, such decision trees are intrinsically suboptimal for most applications. However, the main advantage of the tree classifier, besides its speed, is the possibility to interpret the decision rule in terms of individual features. This makes decision trees attractive for interactive use by experts. Like neural networks, decision trees can be easily overtrained, which can be avoided by using a pruning stage [63], [106], [121]. Decision tree classification systems such as CART [22] and C4.5 [129] are available in the public domain and, therefore, are often used as a benchmark.
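A brief sketch of an axis-parallel decision tree of this type is given below (assuming scikit-learn; the toy data, the depth limit used here as a crude surrogate for pruning, and the printed rule listing are illustrative choices, not the CART or C4.5 systems cited above).

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Each node tests a single feature, so the boundaries are axis-parallel;
# limiting the depth is a simple way to reduce overtraining.
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=['f1', 'f2']))   # interpretable rules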

One of the most interesting recent developments in classifier design is the introduction of the support vector classifier by Vapnik [162], which has also been studied by other authors [23], [144], [146]. It is primarily a two-class classifier. The optimization criterion here is the width of the margin between the classes, i.e., the empty area around the decision boundary defined by the distance to the nearest training patterns. These patterns, called support vectors, finally define the classification function. Their number is minimized by maximizing the margin.

The decision function for a two-class problem derived by the support vector classifier can be written as follows using a kernel function K(x_i, x) of a new pattern x (to be classified) and a training pattern x_i:

D(x) = \sum_{x_i \in S} \alpha_i \lambda_i K(x_i, x) + \alpha_0,

where S is the support vector set (a subset of the training set) and λ_i = ±1 is the label of object x_i. The parameters α_i ≥ 0 are optimized during training by

\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \lambda_i \lambda_j K(x_i, x_j) - \sum_i \alpha_i, \qquad \text{subject to } \sum_i \alpha_i \lambda_i = 0, \; 0 \le \alpha_i \le C,

where the constant C bounds the contribution of any single training pattern and thereby the amount of class overlap that is tolerated; for C = ∞, no overlap is allowed. This constrained quadratic program is the dual form of maximizing the margin (plus the penalty term). During optimization, the values of all α_i become 0, except for the support vectors. So the support vectors are the only ones that are finally needed. The ad hoc character of the penalty term (error penalty) and the computational complexity of the training procedure (a quadratic minimization problem) are the drawbacks of this method. Various training algorithms have been proposed in the literature [23], including chunking [161], Osuna's decomposition method [119], and sequential minimal optimization [124]. An appropriate kernel function K (as in kernel PCA, Section 4.1) needs to be selected. In its most simple form, it is just a dot product between the input pattern x and a member of

the support set: K(x_i, x) = x_i \cdot x, resulting in a linear classifier. Nonlinear kernels, such as

K(x_i, x) = (x_i \cdot x + 1)^p,

result in a pth-order polynomial classifier. Gaussian radial basis functions can also be used. The important advantage of the support vector classifier is that it offers the possibility to train generalizable, nonlinear classifiers in high-dimensional spaces using a small training set. Moreover, for large training sets, it typically selects a small support set which is necessary for designing the classifier, thereby minimizing the computational requirements during testing.
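The kernel formulation above can be exercised with the sketch below (assuming scikit-learn; the value of C, the polynomial degree p, and the toy data are illustrative, not the settings of the experiments in this review). The support vectors reported by the trained model are the training patterns with nonzero α_i.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# K(x_i, x) = (x_i . x + 1)^p gives a pth-order polynomial classifier.
svc = SVC(kernel='poly', degree=2, coef0=1.0, gamma=1.0, C=10.0)
svc.fit(X, y)
print('support vectors per class:', svc.n_support_)
print(svc.predict(np.array([[0.5, 0.5], [2.5, 2.0]])))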

The support vector classifier can also be understood in terms of the traditional template matching techniques. The support vectors replace the prototypes, with the main difference being that they characterize the classes by a decision boundary. Moreover, this decision boundary is not just defined by the minimum distance function, but by a more general, possibly nonlinear, combination of these distances.

    "Ve summarize the most commonly used classifiers inTable 6. Many of them represent, in fact, an entire family ofclassifiers and allow the user to modify several associatedparameters and criterion functions. All (01' almost all) ofthese classifiers arc admissible, in the sense that there existsome classification problems for which they are the bestchoice. An extensive comparison of a large set of classifiersover many different problems is the StatLog project [109]which showed a large variability over their relativeperformances, illustrating that there is no such thing as anoverall optimal classif ication rule.

The differences between the decision boundaries obtained by different classifiers are illustrated in Fig. 7 using dataset 1 (2-dimensional, two-class problem with Gaussian densities). Note the two small isolated areas in Fig. 7c for the 1-NN rule. The neural network classifier in Fig. 7d even shows a "ghost" region that seemingly has nothing to do with the data. Such regions are less probable for a small number of hidden layers at the cost of poorer class separation.


TABLE 6
Classification Methods

Method | Property | Comments
Nearest mean classifier | Assign patterns to the nearest class mean. | Almost no training needed; fast testing; scale (metric) dependent.
Subspace method | Assign patterns to the nearest class subspace. | Instead of normalizing on invariants, the subspace of the invariants is used; scale (metric) dependent.
1-NN rule | Assign patterns to the class of the nearest training pattern. | No training needed; robust performance; slow testing; scale (metric) dependent.
Template matching | Assign patterns to the most similar template. | The templates and the metric have to be supplied by the user; the procedure may include nonlinear normalizations; scale (metric) dependent.
k-NN rule | Assign patterns to the majority class among the k nearest neighbors, using a performance-optimized value for k. | Asymptotically optimal; scale (metric) dependent; slow testing.
Bayes plug-in | Assign patterns to the class with the maximum estimated posterior probability. | Yields simple classifiers (linear or quadratic) for Gaussian distributions; sensitive to density estimation errors.
Logistic classifier | Maximum likelihood rule for logistic (sigmoidal) posterior probabilities. | Linear classifier; iterative procedure; optimal for a family of different distributions.



Fig. 7. Decision boundaries for two bivariate Gaussian distributed classes, using 30 patterns per class. The following classifiers are used: (a) Bayes-normal-quadratic, (b) Bayes-normal-linear, (c) 1-NN, and (d) ANN-5 (a feed-forward neural network with one hidden layer containing 5 neurons). The regions R_1 and R_2 for classes ω_1 and ω_2, respectively, are found by classifying all the points in the two-dimensional feature space.

thereby taking advantage of all the attempts to learn from the data.

In summary, we may have different feature sets, different training sets, different classification methods, or different training sessions, all resulting in a set of classifiers whose outputs may be combined, with the hope of improving the overall classification accuracy. If this set of classifiers is fixed, the problem focuses on the combination function. It is also possible to use a fixed combiner and optimize the set of input classifiers (see Section 6.1).

A large number of combination schemes have been proposed in the literature [172]. A typical combination scheme consists of a set of individual classifiers and a combiner which combines the results of the individual classifiers to make the final decision. When the individual classifiers should be invoked and how they should interact with each other is determined by the architecture of the combination scheme. Thus, various combination schemes may differ from each other in their architectures, the characteristics of the combiner, and the selection of the individual classifiers.

Various schemes for combining multiple classifiers can be grouped into three main categories according to their architecture: 1) parallel, 2) cascading (or serial combination), and 3) hierarchical (tree-like). In the parallel architecture, all the individual classifiers are invoked independently, and their results are then combined by a combiner. Most combination schemes in the literature belong to this category. In the gated parallel variant, the outputs of individual classifiers are selected or weighted by a gating device before they are combined. In the cascading architecture, individual classifiers are invoked in a linear sequence. The number of possible classes for a given pattern is gradually reduced as more classifiers in the sequence are invoked. For the sake of efficiency, inaccurate but cheap classifiers (low computational and measurement demands) are considered first, followed by more accurate and expensive classifiers. In the hierarchical architecture, individual classifiers are combined into a structure which is similar to that of a decision tree classifier. The tree nodes, however, may now be associated with complex classifiers demanding a large number of features. The advantage of this architecture is the high efficiency and flexibility in



exploiting the discriminant power of different types of features. Using these three basic architectures, we can build even more complicated classifier combination systems.

Fig. 8. Classification error of a neural network classifier using 10 hidden units trained by the Levenberg-Marquardt rule for 50 epochs from two classes with 30 patterns each (Dataset 1). The test set error is based on an independent set of 1,000 patterns.
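The parallel architecture described above can be sketched as follows (assuming scikit-learn and NumPy; the three component classifiers, the toy data, and the plain majority vote are illustrative assumptions): each classifier is invoked independently on the same test patterns and a fixed combiner merges their decisions.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_test = rng.normal([1, 1], 1.5, (10, 2))

# Parallel architecture: every classifier votes independently.
members = [GaussianNB(), KNeighborsClassifier(n_neighbors=3),
           DecisionTreeClassifier(max_depth=3, random_state=0)]
votes = np.array([m.fit(X, y).predict(X_test) for m in members])

# Fixed (untrained) combiner: majority vote over the class labels.
combined = np.array([np.bincount(votes[:, i]).argmax()
                     for i in range(votes.shape[1])])
print(combined)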

6.1 Selection and Training of Individual Classifiers

A classifier combination is especially useful if the individual classifiers are largely independent. If this is not already guaranteed by the use of different training sets, various resampling techniques like rotation and bootstrapping may be used to artificially create such differences. Examples are stacking [168], bagging [21], and boosting (or ARCing) [142]. In stacking, the outputs of the individual classifiers are used to train the "stacked" classifier. The final decision is made based on the outputs of the stacked classifier in conjunction with the outputs of the individual classifiers.

In bagging, different data sets are created by bootstrapped versions of the original dataset and combined using a fixed rule like averaging. Boosting [52] is another resampling technique for generating a sequence of training data sets. The distribution of a particular training set in the sequence is overrepresented by patterns which were misclassified by the earlier classifiers in the sequence. In boosting, the individual classifiers are trained hierarchically to learn to discriminate more complex regions in the feature space. The original algorithm was proposed by Schapire [142], who showed that, in principle, it is possible for a combination of weak classifiers (whose performances are only slightly better than random guessing) to achieve an error rate which is arbitrarily small on the training data.

Sometimes cluster analysis may be used to separate the individual classes in the training set into subclasses. Consequently, simpler classifiers (e.g., linear) may be used and combined later to generate, for instance, a piecewise linear result [120].
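The bagging procedure described above can be sketched as follows (NumPy and scikit-learn; the unstable base classifier, the number of bootstrap rounds, and the vote-based averaging are illustrative assumptions, not a reimplementation of [21]).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([1.5, 1.5], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_test = rng.normal([0.8, 0.8], 1.5, (20, 2))

n_rounds, n = 25, len(y)
votes = np.zeros((n_rounds, len(X_test)), dtype=int)
for r in range(n_rounds):
    # Bootstrap: sample the training set with replacement.
    idx = rng.integers(0, n, n)
    tree = DecisionTreeClassifier(random_state=r).fit(X[idx], y[idx])
    votes[r] = tree.predict(X_test)

# Fixed combination rule: majority vote over the bootstrap classifiers.
bagged = (votes.mean(axis=0) > 0.5).astype(int)
print(bagged)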

    Instead of building different classifiers on different setsof training patterns, different feature sets may be used. Thiseven more explicitly forces the individual classifiers tocontain independent information. An example is therandom subspace method [75].


6.2 Combiner

After individual classifiers have been selected, they need to be combined together by a module, called the combiner. Various combiners can be distinguished from each other by their trainability, adaptivity, and requirements on the output of the individual classifiers. Combiners, such as voting, averaging (or sum), and Borda count [74], are static, with no training required, while others are trainable. The trainable combiners may lead to a larger improvement than static combiners at the cost of additional training as well as the requirement of additional training data.

Some combination schemes are adaptive in the sense that the combiner evaluates (or weighs) the decisions of individual classifiers depending on the input pattern. In contrast, nonadaptive combiners treat all the input patterns the same. Adaptive combination schemes can further exploit the detailed error characteristics and expertise of individual classifiers. Examples of adaptive combiners include adaptive weighting [156], associative switch, mixture of local experts (MLE) [79], and hierarchical MLE [87].

Different combiners expect different types of output from the individual classifiers. Xu et al. [172] grouped these expectations into three levels: 1) measurement (or confidence), 2) rank, and 3) abstract. At the confidence level, a classifier outputs a numerical value for each class indicating the belief or probability that the given input pattern belongs to that class. At the rank level, a classifier assigns a rank to each class, with the highest rank being the first choice. Rank value cannot be used in isolation because the highest rank does not necessarily mean a high confidence in the classification. At the abstract level, a classifier only outputs a unique class label or several class labels (in which case the classes are equally good). The confidence level conveys the richest information, while the abstract level contains the least amount of information about the decision being made.
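The three output levels can be illustrated with a small sketch (NumPy only; the confidence matrix is made up of hypothetical numbers for three classifiers and three classes): confidences are averaged directly (sum rule), converted to ranks for a Borda count, and reduced to abstract labels for a majority vote.

import numpy as np

# Hypothetical confidences of three classifiers for one test pattern
# over three classes (rows: classifiers, columns: classes).
conf = np.array([[0.60, 0.30, 0.10],
                 [0.20, 0.50, 0.30],
                 [0.55, 0.35, 0.10]])

# Confidence level: average (sum rule) of the posterior estimates.
sum_rule = conf.mean(axis=0).argmax()

# Rank level: Borda count, i.e., sum of the ranks given to each class.
ranks = conf.argsort(axis=1).argsort(axis=1)   # 0 = lowest confidence
borda = ranks.sum(axis=0).argmax()

# Abstract level: each classifier contributes only its top label.
labels = conf.argmax(axis=1)
majority = np.bincount(labels, minlength=3).argmax()

print(sum_rule, borda, majority)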

    Table 7 lists a number of representative combinationschemes and their characteristics. This is by no means anexhaustive list.

6.3 Theoretical Analysis of Combination Schemes

A large number of experimental studies have shown that classifier combination can improve the recognition accuracy. However, there exist only a few theoretical explanations for these experimental results. Moreover, most explanations apply to only the simplest combination schemes under rather restrictive assumptions. One of the most rigorous theories on classifier combination is presented by Kleinberg [91].

A popular analysis of combination schemes is based on the well-known bias-variance dilemma [64], [93]. Regression or classification error can be decomposed into a bias term and a variance term. Unstable classifiers or classifiers with a high complexity (or capacity), such as decision trees, nearest neighbor classifiers, and large-size neural networks, can have universally low bias but a large variance. On the other hand, stable classifiers or classifiers with a low capacity can have a low variance but a large bias.

Tumer and Ghosh [158] provided a quantitative analysis of the improvements in classification accuracy by combining multiple neural networks. They showed that combining


the combined classifier. They demonstrated that the boosting algorithm can effectively improve the margin distribution. This finding is similar to the property of the support vector classifier, which shows the importance of training patterns near the margin, where the margin is defined as the area of overlap between the class-conditional densities.

6.4 An Example

We will illustrate the characteristics of a number of different classifiers and combination rules on a digit classification problem (Dataset 3, see Section 2). The classifiers used in the experiment were designed using Matlab and were not optimized for the data set. All the six different feature sets for the digit dataset discussed in Section 2 will be used, enabling us to illustrate the performance of various classifier combining rules over different classifiers as well as over different feature sets. Confidence values in the outputs of all the classifiers are computed, either directly based on the posterior probabilities or on the logistic output function as discussed in Section 5. These outputs are also used to obtain multiclass versions for intrinsically two-class discriminants such as the Fisher Linear Discriminant and the Support Vector Classifier (SVC). For these two classifiers, a total of 10 discriminants are computed between each of the 10 classes and the combined set of the remaining classes. A test pattern is classified by selecting the class for which the discriminant has the highest confidence.

The following 12 classifiers are used (also see Table 8): the Bayes plug-in rule assuming normal distributions with different (Bayes-normal-quadratic) or equal covariance matrices (Bayes-normal-linear), the Nearest Mean (NM) rule, 1-NN, k-NN, Parzen, Fisher, a binary decision tree using the maximum purity criterion [21] and early pruning, two feed-forward neural networks (based on the Matlab Neural Network Toolbox) with a hidden layer consisting of 20 (ANN-20) and 50 (ANN-50) neurons, and the linear (SVC-linear) and quadratic (SVC-quadratic) Support Vector classifiers. The number of neighbors in the k-NN rule and the smoothing parameter in the Parzen classifier are both optimized over the classification result using the leave-one-out error estimate on the training set. For combining classifiers, the median, product, and voting rules are used, as well as two trained classifiers (NM and 1-NN). The training set used for the individual classifiers is also used in classifier combination.

The 12 classifiers listed in Table 8 were trained on the same 500 (10 x 50) training patterns from each of the six feature sets and tested on the same 1,000 (10 x 100) test patterns. The resulting classification errors (in percentage) are reported; for each feature set, the best result over the classifiers is printed in bold. Next, the 12 individual classifiers for a single feature set were combined using the five combining rules (median, product, voting, nearest mean, and 1-NN). For example, the voting rule (row) over the classifiers using feature set Number 3 (column) yields an error of 3.2 percent. It is underlined to indicate that this combination result is better than the performance of individual classifiers for this feature set. Finally, the outputs of each classifier and each classifier combination scheme over all the six feature sets are combined using the five combination rules (last five columns). For example, the voting rule (column) over the six decision tree classifiers (row) yields an error of 21.5 percent. Again, it is underlined to indicate that this combination result is better than each of the six individual results of the decision tree. The 5 x 5 block in the bottom right part of Table 8 presents the combination results, over the six feature sets, for the classifier combination schemes for each of the separate feature sets.

Some of the classifiers, for example, the decision tree, do not perform well on this data. Also, the neural network classifiers provide rather poor optimal solutions, probably due to nonconverging training sessions. Some of the simple classifiers such as the 1-NN, Bayes plug-in, and Parzen give good results; the performances of different classifiers vary substantially over different feature sets. Due to the relatively small training set for some of the large feature sets, the Bayes-normal-quadratic classifier is outperformed by the linear one, but the SVC-quadratic generally performs better than the SVC-linear. This shows that the SVC classifier can find nonlinear solutions without increasing the overtraining risk.

Considering the classifier combination results, it appears that the trained classifier combination rules are not always better than the fixed rules. Still, the best overall result (1.5 percent error) is obtained by a trained combination rule, the nearest mean method. The combination of different classifiers for the same feature set (columns in the table) only slightly improves the best individual classification results. The best combination rule for this dataset is voting. The product rule behaves poorly, as can be expected, because different classifiers on the same feature set do not provide independent confidence values. The combination of results obtained by the same classifier over different feature sets (rows in the table) frequently outperforms the best individual classifier result. Sometimes, the improvements are substantial, as is the case for the decision tree. Here, the product rule does much better, but occasionally it performs surprisingly badly, similar to the combination of neural network classifiers. This combination rule (like the minimum and maximum rules, not used in this experiment) is sensitive to poorly trained individual classifiers. Finally, it is worthwhile to observe that in combining the neural network results, the trained combination rules do very well (classification errors between 2.1 percent and 5.6 percent) in comparison with the fixed rules (classification errors between 16.3 percent and 90 percent).

    7 ERROR ESTIMATION

The classification error, or simply the error rate P_e, is the ultimate measure of the performance of a classifier. Competing classifiers can also be evaluated based on their error probabilities. Other performance measures include the cost of measuring features and the computational requirements of the decision rule. While it is easy to define the probability of error in terms of the class-conditional densities, it is very difficult to obtain a closed-form expression for P_e. Even in the relatively simple case of multivariate Gaussian densities with unequal covariance matrices, it is not possible to write a simple


TABLE 8
Error Rates (in Percentage) of Different Classifiers and Classifier Combination Schemes

(The table reports, for each of the six feature sets of Section 2, the test error rates of the 12 individual classifiers and of the five combination rules, together with the results of combining each classifier and each combination scheme over all six feature sets.)


TABLE 9
Error Estimation Methods

Method | Property | Comments
Resubstitution method | All the available data is used for training as well as testing; training and test sets are the same. | Optimistically biased estimate, especially when the ratio of sample size to dimensionality is small.
Holdout method | Half the data is used for training and the remaining data is used for testing; training and test sets are independent. | Pessimistically biased estimate; different partitionings give different estimates.
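The optimistic bias of the resubstitution estimate relative to a holdout estimate can be seen in the sketch below (assuming scikit-learn; the 1-NN classifier, the sample size, and the 50/50 split are illustrative choices). For 1-NN the resubstitution error is trivially zero, which makes the bias obvious.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([1.5, 1.5], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

resub_error = 1.0 - clf.score(X_tr, y_tr)    # training and test sets coincide
holdout_error = 1.0 - clf.score(X_te, y_te)  # independent test set
print(resub_error, holdout_error)            # resubstitution is optimistic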



    Fig. 9. Classif ication error 01 the Bayes plug-in llnear classifier (equalcovariance matrices) as a function of the number of training samples

    (learning curve) for the test set and the training set on the digit dataset.

made with a low confidence. A better alternative would be to reject these doubtful patterns instead of assigning them to one of the categories under consideration. How do we decide when to reject a test pattern? For the Bayes decision rule, a well-known reject option is to reject a pattern if its maximum a posteriori probability is below a threshold; the larger the threshold, the higher the reject rate. Invoking the reject option reduces the error rate; the larger the reject rate, the smaller the error rate. This relationship is known as the error-reject trade-off.
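A minimal sketch of this reject option follows (NumPy and scikit-learn; the classifier, the data, and the threshold values are illustrative assumptions): a test pattern is rejected whenever its maximum estimated posterior probability falls below a threshold, and raising the threshold lowers the error on the accepted patterns at the cost of a higher reject rate.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(8)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([1.5, 1.5], 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
X_te = np.vstack([rng.normal([0, 0], 1.0, (500, 2)),
                  rng.normal([1.5, 1.5], 1.0, (500, 2))])
y_te = np.array([0] * 500 + [1] * 500)

clf = GaussianNB().fit(X, y)
post = clf.predict_proba(X_te)
pred = post.argmax(axis=1)

for t in (0.5, 0.7, 0.9):
    accepted = post.max(axis=1) >= t           # reject if max posterior < t
    reject_rate = 1.0 - accepted.mean()
    error_rate = (pred[accepted] != y_te[accepted]).mean()
    print(t, round(reject_rate, 3), round(error_rate, 3))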


TABLE 10
Clustering Algorithms

Algorithm | Property | Comments
K-means | Identifies hyperspherical clusters; could be modified to find hyperellipsoidal clusters using the Mahalanobis distance; computationally efficient. | Need to specify K and the initial cluster centers. Additional parameters for creating new clusters, merging existing clusters, and outlier detection can be provided.
Fuzzy K-means | Similar to K-means except that every pattern has a degree of membership in the K clusters (fuzzy partition). | Need to specify K, the initial cluster centers, and the cluster membership function.
Minimum Spanning Tree (MST) | Clusters are formed by deleting inconsistent edges in the MST of the data. | Need to provide the definition of an inconsistent edge.
Mutual Neighborhood | Compute the mutual neighborhood value (MNV) for every pair of patterns: if x_i is the pth nearest neighbor of x_j and x_j is the qth nearest neighbor of x_i, then MNV(x_i, x_j) = p + q, p, q = 1, ..., K. | Need to specify the neighborhood depth, K.
Single-Link (SL) | A hierarchical clustering algorithm which accepts an n x n proximity matrix; the output is a dendrogram or a nested structure; a single-link cluster is a maximally connected subgraph on the patterns. | Single-link clusters easily chain together and are often "straggly"; need a heuristic to cut the tree to form clusters (a partition).
Complete-Link (CL) | A hierarchical clustering algorithm which accepts an n x n proximity matrix. | ...
Mixture decomposition | Each pattern is assumed to be drawn from one of K underlying populations, or clusters. | The form and the number (K) of the underlying population densities are assumed to be known; K can be ...


optimize the resulting partition [110], [119], and 3) mapping it onto a neural network [103] for a possibly efficient implementation. However, many of these so-called enhancements to the K-means algorithm are computationally demanding and require additional user-specified parameters for which no general guidelines are available. Judd et al. [88] show that a combination of algorithmic enhancements to a square-error clustering algorithm and distribution of the computations over a network of workstations can be used to cluster hundreds of thousands of multidimensional patterns in just a few minutes.

It is interesting to note how seemingly different concepts for partitional clustering can lead to essentially the same algorithm. It is easy to verify that the generalized Lloyd vector quantization algorithm used in the communication and compression domain is equivalent to the K-means algorithm. A vector quantizer (VQ) is described as a combination of an encoder and a decoder. A d-dimensional VQ consists of two mappings: an encoder γ which maps the input alphabet (A) to the channel symbol set (M), and a decoder β which maps the channel symbol set (M) to the output alphabet (Â), i.e., γ(y): A → M and β(v): M → Â. A distortion measure D(y, ŷ) specifies the cost associated with quantization, where ŷ = β(γ(y)). Usually, an optimal quantizer minimizes the average distortion under a size constraint on M. Thus, the problem of vector quantization can be posed as a clustering problem, where the number of clusters K is now the size of the output alphabet, Â = {ŷ_i, i = 1, ..., K}, and the goal is to find a quantization (referred to as a partition in the K-means algorithm) of the d-dimensional feature space which minimizes the average distortion (mean square error) of the input patterns. Vector quantization has been widely used in a number of compression and coding applications, such as speech waveform coding, image coding, etc., where only the symbols for the output alphabet or the cluster centers are transmitted instead of the entire signal [67], [32]. Vector quantization also provides an efficient tool for density estimation [68]. A kernel-based approach (e.g., a mixture of Gaussian kernels, where each kernel is placed at a cluster center) can be used to estimate the probability density of the training samples. A major issue in VQ is the selection of the output alphabet size. A number of techniques, such as the minimum description length (MDL) principle [138], can be used to select this parameter (see Section 8.2). The supervised version of VQ is called learning vector quantization (LVQ) [92].
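The equivalence sketched above can be made concrete as follows (assuming scikit-learn and NumPy; the codebook size K and the data are illustrative assumptions): the K-means cluster centers serve as the output alphabet of a vector quantizer, the encoder maps each pattern to the index of its nearest center, the decoder returns that center, and the resulting mean squared quantization distortion is exactly the K-means criterion.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)
Y = rng.normal(0.0, 1.0, (1000, 2))          # patterns to be quantized

K = 8                                        # size of the output alphabet
vq = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Y)

codebook = vq.cluster_centers_               # output alphabet / cluster centers
symbols = vq.predict(Y)                      # encoder: pattern -> channel symbol
reconstruction = codebook[symbols]           # decoder: symbol -> codevector

distortion = np.mean(np.sum((Y - reconstruction) ** 2, axis=1))
print(distortion)                            # average (squared-error) distortion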

8.2 Mixture Decomposition

Finite mixtures are a flexible and powerful probabilistic modeling tool. In statistical pattern recognition, the main use of mixtures is in defining formal (i.e., model-based) approaches to unsupervised classification [81]. The reason behind this is that mixtures adequately model situations where each pattern has been produced by one of a set of alternative (probabilistically modeled) sources [155]. Nevertheless, it should be kept in mind that strict adherence to this interpretation is not required: mixtures can also be seen as a class of models that are able to represent arbitrarily complex probability density functions. This makes mixtures also well suited for representing complex class-conditional densities in supervised learning scenarios (see [137] and references therein). Finite mixtures can also be used as a feature selection tool [127].

8.2.1 Basic Definitions

Consider the following scheme for generating random samples. There are K random sources, each characterized by a probability (mass or density) function p_m(y|θ_m), parameterized by θ_m, for m = 1, ..., K. Each time a sample is to be generated, we randomly choose one of these sources, with probabilities {α_1, ..., α_K}, and then sample from the chosen source. The random variable defined by this (two-stage) compound generating mechanism is characterized by a finite mixture distribution; formally, its probability (mass or density) function is the finite mixture p(y|θ) = \sum_{m=1}^{K} α_m p_m(y|θ_m), with α_m ≥ 0 and \sum_m α_m = 1.
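The two-stage generating scheme just described can be sketched as follows (NumPy only; the number of sources K, the mixing probabilities α_m, and the Gaussian form of each source are illustrative assumptions): a source is first drawn according to the mixing probabilities, and a sample is then drawn from that source.

import numpy as np

rng = np.random.default_rng(10)

alphas = np.array([0.5, 0.3, 0.2])           # mixing probabilities, sum to 1
means = np.array([-3.0, 0.0, 4.0])           # parameters of the K sources
scales = np.array([1.0, 0.5, 1.5])

n = 1000
# Stage 1: pick a source m with probability alpha_m for every sample.
m = rng.choice(len(alphas), size=n, p=alphas)
# Stage 2: draw from the chosen (here Gaussian) source.
y = rng.normal(means[m], scales[m])

print(np.bincount(m) / n)                    # empirical mixing proportions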