
A class-dependent weighted dissimilarity measure for nearest neighbor classification problems



Roberto Paredes *, Enrique Vidal

Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Camino de Vera S/N, 46071 Valencia, Spain

Received 13 July 1999; received in revised form 18 July 2000

* Corresponding author. Tel.: +34-96-3877-069; fax: +34-96-3877-239. E-mail addresses: [email protected] (R. Paredes), [email protected] (E. Vidal).

Abstract

A class-dependent weighted (CDW) dissimilarity measure in vector spaces is proposed to improve the performance of the nearest neighbor (NN) classifier. In order to optimize the required weights, an approach based on Fractional Programming is presented. Experiments with several standard benchmark data sets show the effectiveness of the proposed technique. © 2000 Published by Elsevier Science B.V.

Keywords: Nearest neighbour classification; Weighted dissimilarity measures; Iterative optimization; Fractional programming

1. Introduction

Let P be a finite set of prototypes, which are class-labelled points in a vector space E, and let d(·, ·) be a dissimilarity measure defined in E. For any given point x ∈ E, the nearest neighbor (NN) classification rule assigns to x the label of a prototype p ∈ P such that d(p, x) is minimum. The NN rule can be extended to the k-NN rule by classifying x in the class most heavily represented among the labels of its k nearest neighbours. The great effectiveness of these rules as the number of prototypes grows to infinity is well known (Cover and Hart, 1967). However, in most real situations the number of available prototypes is usually very small, which often leads to dramatic degradations of (k-)NN classification accuracy.

Consider the following general statistical statement of a two-class Pattern Recognition classification problem. Let $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be a training data set of independent, identically distributed random variable pairs, where $Y_i \in \{0, 1\}$, $1 \le i \le n$, are classification labels, and let X be an observation from the same distribution. Let Y be the true label of X and $g_n(\cdot)$ a classification rule based on $D_n$. The probability of error is $R_n = P\{Y \ne g_n(X)\}$. Devroye et al. (1996) show that, for any integer n and classification rule $g_n$, there exists a distribution of (X, Y) with Bayes risk $R^* = 0$ such that the expectation of $R_n$ is $E[R_n] \ge 1/2 - \varepsilon$, where $\varepsilon > 0$ is an arbitrarily small number. This theorem states that even though we have rules, such as the k-NN rule, that are universally consistent (that is, they asymptotically provide optimal performance for any distribution), their finite sample performance can be extremely bad for some distributions.

This explains the increasing interest in finding variants of the NN rule and adequate distance measures that help improve NN classification performance in small data set situations (Tomek, 1976; Fukunaga and Flick, 1985; Luk and Macleod, 1986; Urahama and Furukawa, 1995; Short and Fukunaga, 1980, 1981; Fukunaga and Flick, 1982; Fukunaga and Flick, 1984; Myles and Hand, 1990).

Here we propose a weighted measure which can be seen as a generalization of the simple weighted L2 dissimilarity in a d-dimensional space:

$$d(y, x) = \sqrt{\sum_{j=1}^{d} \sigma_j^2\,(x_j - y_j)^2},\qquad (1)$$

where $\sigma_j$ is the weight of the jth dimension. Assuming an m-class classification problem, our proposed generalization is just a natural extension of (1):

$$d(y, x) = \sqrt{\sum_{j=1}^{d} \sigma_{cj}^2\,(x_j - y_j)^2},\qquad (2)$$

where c = class(x). We will refer to this extension as the class-dependent weighted (CDW) measure. If $\sigma_{ij} = 1$, $1 \le i \le m$, $1 \le j \le d$, the weighted measure is just the L2 metric. On the other hand, if the weights are the inverse of the variances in each dimension, the Mahalanobis distance (MD) is obtained. Weights can also be computed as class-dependent inverse variances, leading to a measure that will be referred to as the class-dependent Mahalanobis (CDM) dissimilarity.

In the general case, (2) is not a metric, since d(x, y) can be different from d(y, x) if class(x) ≠ class(y), which would violate the symmetry property.
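For concreteness, the following is a minimal NumPy sketch of the CDW dissimilarity of Eq. (2); the function name, the row-per-class layout of the weight matrix M and the example values are our own illustrative choices, not part of the paper.

```python
import numpy as np

def cdw_distance(y, x, class_of_x, M):
    """CDW dissimilarity of Eq. (2): the weights are the row of M that
    corresponds to c = class(x), i.e. the class of the prototype x."""
    w = M[class_of_x]                                   # sigma_{c1}, ..., sigma_{cd}
    return float(np.sqrt(np.sum((w * (np.asarray(x) - np.asarray(y))) ** 2)))

# With all weights equal to 1 the measure reduces to the L2 metric:
M = np.ones((3, 2))                                     # 3 classes, 2 features
print(cdw_distance([0.0, 0.0], [3.0, 4.0], class_of_x=1, M=M))   # prints 5.0
```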

In this most general setting, we are interested in finding an m × d weight matrix M which optimizes the CDW-based NN classification performance:

$$M = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1d}\\ \vdots & & \vdots\\ \sigma_{m1} & \cdots & \sigma_{md} \end{pmatrix}.\qquad (3)$$

2. Approach

In order to find a matrix M that results in a low error rate of the NN classifier with the CDW dissimilarity measure, we propose the minimization of a specific criterion index.

Under the proposed framework, we expect NN accuracy to improve by using a dissimilarity measure such that distances between points belonging to the same class are small while inter-class distances are large. This simple idea suggests the following criterion index:

$$J(M) = \frac{\sum_{x\in S} d(x, x_{nn}^{=})}{\sum_{x\in S} d(x, x_{nn}^{\neq})},\qquad (4)$$

where $x_{nn}^{=}$ is the nearest neighbor of x in the same class (class(x) = class($x_{nn}^{=}$)) and $x_{nn}^{\neq}$ is the nearest neighbor of x in a different class (class(x) ≠ class($x_{nn}^{\neq}$)). In the sequel, $\sum_{x\in S} d(x, x_{nn}^{=})$ will be denoted as f(M), and $\sum_{x\in S} d(x, x_{nn}^{\neq})$ as g(M); that is,

$$J(M) = \frac{f(M)}{g(M)}.$$
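The criterion index can be evaluated directly from this definition. The following sketch (our own, with hypothetical names S, labels and M) computes f(M), g(M) and J(M) by brute-force search for the same-class and different-class nearest neighbours, using the CDW measure of Eq. (2).

```python
import numpy as np

def criterion_index(S, labels, M):
    """J(M) = f(M) / g(M) of Eq. (4) under the CDW measure of Eq. (2).
    S is an (n, d) array, labels an (n,) integer array, M an (m, d) weight
    matrix; every class is assumed to contain at least two samples."""
    n = len(S)
    f = g = 0.0
    for a in range(n):
        d_same = d_diff = np.inf
        for b in range(n):
            if a == b:
                continue
            w = M[labels[b]]                  # weights of the neighbour's class
            d = np.sqrt(np.sum((w * (S[a] - S[b])) ** 2))
            if labels[b] == labels[a]:
                d_same = min(d_same, d)       # candidate for the same-class NN
            else:
                d_diff = min(d_diff, d)       # candidate for the different-class NN
        f += d_same
        g += d_diff
    return f / g
```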

Minimizing this index amounts to minimizing a ratio between sums of distances, a problem which is difficult to solve by conventional gradient descent. In fact, the gradient with respect to a weight $\sigma_{ij}$ takes the form

$$\frac{\partial J(M)}{\partial \sigma_{ij}} = \frac{(\partial f(M)/\partial\sigma_{ij})\, g(M) - f(M)\,(\partial g(M)/\partial\sigma_{ij})}{g(M)^2}.$$

Taking into account that $f(M) = \sum_{x\in S} d(x, x_{nn}^{=})$ and $g(M) = \sum_{x\in S} d(x, x_{nn}^{\neq})$, this leads to an exceedingly complex expression. Clearly, an alternative technique for minimizing (4) is needed.

2.1. Fractional programming

In order to find a matrix M that minimizes (4), a Fractional Programming procedure (Sniedovich, 1992) is proposed. Fractional Programming aims at solving problems of the following type:¹

¹ As in Vidal et al. (1995), where another application of Fractional Programming in Pattern Recognition is described, here we consider minimization problems rather than maximization problems as in Sniedovich (1992). It can be easily verified that the same results of Sniedovich (1992) also hold in our formulation.


$$\text{Problem } Q:\quad q = \min_{z\in Z} \frac{v(z)}{w(z)},$$

where v and w are real-valued functions on some set Z, and w(z) > 0 for all z ∈ Z. Let Z* denote the set of optimal solutions to this problem. An optimal solution can be obtained via the solution of a parametric problem of the following type:

$$\text{Problem } Q(\lambda):\quad q(\lambda) = \min_{z\in Z}\,\bigl(v(z) - \lambda\, w(z)\bigr),\qquad \lambda\in\mathbb{R}.$$

Let Z*(λ) denote the set of optimal solutions to the problem Q(λ). The justification for seeking the solution of Problem Q via Problem Q(λ) is that there exists a λ ∈ ℝ such that every optimal solution to Problem Q(λ) is also an optimal solution to Problem Q. The algorithm for finding this λ is known as Dinkelbach's Algorithm (Sniedovich, 1992).

Dinkelbach's Algorithm
Step 1: Select z ∈ Z, and set k = 1 and λ(k) = v(z)/w(z).
Step 2: Set λ₀ = λ(k); solve the problem Q(λ(k)) and select z ∈ Z*(λ(k)).
Step 3: Set k = k + 1 and λ(k) = v(z)/w(z); if λ₀ = λ(k), stop, else go to Step 2.

Step 2 requires an optimal solution to the problem Q(λ): q(λ) = min_{z∈Z} (v(z) − λ w(z)). If this optimal solution can be found,² then the algorithm finds, in a finite number of iterations, a λ for which every optimal solution to Q(λ) is an optimal solution to Q as well. Unfortunately, if Q(λ) cannot be solved optimally (only local solutions can be found), then the algorithm does not guarantee that the globally optimal solution to the original Problem Q can be found. Since we will use gradient descent techniques to solve Q(λ), which do not guarantee a globally optimal solution, in general we will not find the optimal solution to Problem Q, but we expect to find a good local optimum.
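A generic sketch of Dinkelbach's Algorithm in this minimization form is given below; the function names, the convergence tolerance and the toy usage are our own assumptions, and solve_q_lambda stands for whatever (possibly only locally optimal) solver of Q(λ) is available.

```python
def dinkelbach(solve_q_lambda, v, w, z0, tol=1e-9, max_iter=100):
    """Dinkelbach's Algorithm for minimising v(z)/w(z), assuming w(z) > 0.

    solve_q_lambda(lam) must return a (possibly only locally optimal)
    minimiser of v(z) - lam * w(z).  All names here are illustrative."""
    z = z0
    lam = v(z) / w(z)                        # Step 1
    for _ in range(max_iter):
        lam_prev = lam
        z = solve_q_lambda(lam)              # Step 2: solve Q(lambda)
        lam = v(z) / w(z)                    # Step 3: update lambda
        if abs(lam - lam_prev) <= tol:       # lambda has stopped changing
            break
    return z, lam


# Toy usage: minimise (z**2 + 1) / z over z > 0; the optimum is z = 1, ratio 2.
v = lambda z: z * z + 1.0
w = lambda z: z
z_opt, ratio = dinkelbach(lambda lam: lam / 2.0, v, w, z0=5.0)
```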

In our case, Z is a set ℳ of matrices of size m × d as in (3), and z is one of these matrices, M ∈ ℳ. Thus, using gradient descent to obtain a locally optimal solution to the problem Q(λ) = min_{M∈ℳ} (f(M) − λ g(M)) leads to the following equations:

$$\sigma'_{ij} = \sigma_{ij} - \mu_{ij}\,\frac{\partial\,\bigl(f(M) - \lambda\, g(M)\bigr)}{\partial \sigma_{ij}},\qquad 1\le i\le m,\;\; 1\le j\le d,\qquad (5)$$

where $\sigma_{ij}$ is a component of M at a certain iteration of the descent algorithm, $\sigma'_{ij}$ is the value of this component at the next iteration, and $\mu_{ij}$ is a step factor (or "learning rate") for dimension j and class i (typically $\mu_{ij} = \mu$ for all i, j). By developing the partial derivatives in (5) for our m-class classification problem and defining $S_i = \{x \in S : \text{class}(x) = i\}$, $1 \le i \le m$, the following update equations are obtained:

$$\sigma'_{ij} = \sigma_{ij} - \sum_{x\in S_i} \mu_{ij}\,\sigma_{ij}\,\frac{(x^{=}_{nn,j} - x_j)^2}{d(x, x^{=}_{nn})},\qquad (6)$$

$$\sigma'_{ij} = \sigma_{ij} + \sum_{x\notin S_i \,\wedge\, x^{\neq}_{nn}\in S_i} \lambda\,\mu_{ij}\,\sigma_{ij}\,\frac{(x^{\neq}_{nn,j} - x_j)^2}{d(x, x^{\neq}_{nn})}.\qquad (7)$$

Finally, by embedding this gradient descent procedure into Dinkelbach's Algorithm, we obtain the Fractional Programming Gradient Descent (FPGD) algorithm to find a (local) minimum of the index (4). In this algorithm, shown in Fig. 1, two precision parameters are used to control the accuracy of the minimum required to assess convergence; they are typically set to adequately small fixed values. On the other hand, the learning rates $\mu_{ij}$ are generally set to a single constant value or to values that depend on the variances observed in the training data (cf. Section 3).
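The following sketch shows how one batch gradient step on Q(λ), i.e. the updates (6) and (7), might be implemented. It is our own illustration, assuming a single constant learning rate and the brute-force nearest-neighbour search used above; repeatedly calling it inside the Dinkelbach skeleton gives an FPGD-style procedure in the spirit of Fig. 1.

```python
import numpy as np

def fpgd_step(S, labels, M, lam, lr=1e-3):
    """One batch gradient step on f(M) - lam * g(M), following the update
    equations (6) and (7); lr plays the role of the step factors mu_ij
    (here a single constant, one of the options mentioned in the paper)."""
    M_new = M.copy()
    n = len(S)
    for a in range(n):                     # a indexes the training samples
        d_same = d_diff = np.inf
        nn_same = nn_diff = None
        for b in range(n):                 # brute-force nearest-neighbour search
            if a == b:
                continue
            w = M[labels[b]]               # weights of the neighbour's class
            dist = np.sqrt(np.sum((w * (S[a] - S[b])) ** 2))
            if labels[b] == labels[a]:
                if dist < d_same:
                    d_same, nn_same = dist, b
            elif dist < d_diff:
                d_diff, nn_diff = dist, b
        c = labels[a]
        if nn_same is not None and d_same > 0:
            # Eq. (6): shrink the weights of class(x) along each dimension
            M_new[c] -= lr * M[c] * (S[nn_same] - S[a]) ** 2 / d_same
        if nn_diff is not None and d_diff > 0:
            # Eq. (7): grow the weights of the class of the different-class NN
            cd = labels[nn_diff]
            M_new[cd] += lam * lr * M[cd] * (S[nn_diff] - S[a]) ** 2 / d_diff
    return M_new
```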

It is interesting to note that the computations involved in (6) and (7) implicitly entail computing the NN of each x ∈ S, according to the CDW dissimilarity corresponding to the current values of the weights $\sigma_{ij}$ and the prototypes S − {x}. Therefore, as a byproduct, a leave-one-out (LOO) estimation of the error rate of the NN classifier with the weighted measure can readily be obtained. This issue will be further explored in the next section.

Fig. 2 shows a typical evolution of this algorithm, as applied to the so-called "Monkey Problem" data set, which is described in Section 3.

² And other basic conditions are met (Sniedovich, 1992).


2.2. Finding adequate solutions in adverse situations

A negative side effect of the fact that only locally optimal solutions can be obtained in each step of the Fractional Programming procedure is that, if the additive factor in (7) is not sufficiently large, the algorithm may tend to drive the σ-values to zero.

As an example of this kind of divergent behaviour, consider the following two-class problem, with each class having 500 two-dimensional points (Fig. 3). Class A is a mixture of two Gaussian distributions, both centered at (0, 0). The first distribution has a standard deviation of √10 in the x₁ dimension and a unit standard deviation in the x₂ dimension, while the second distribution has a unit standard deviation in the x₁ dimension and a standard deviation of √10 in the x₂ dimension. Class B is a Gaussian distribution centered at (6, 0), with unit standard deviation in the x₁ dimension and a standard deviation of √10 in the x₂ dimension. Note the relatively large interclass overlapping in the x₁ dimension.
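A small NumPy sketch that generates data with the same distributions as Fig. 3 follows; the random seed and the even split of class A between its two mixture components are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)                 # seed chosen arbitrarily
n = 500

# Class A: mixture of two Gaussians centred at (0, 0), 250 points each
a1 = rng.normal([0, 0], [np.sqrt(10), 1.0], size=(n // 2, 2))
a2 = rng.normal([0, 0], [1.0, np.sqrt(10)], size=(n // 2, 2))
class_a = np.vstack([a1, a2])

# Class B: one Gaussian centred at (6, 0), std 1 in x1 and sqrt(10) in x2
class_b = rng.normal([6, 0], [1.0, np.sqrt(10)], size=(n, 2))

S = np.vstack([class_a, class_b])
labels = np.array([0] * len(class_a) + [1] * len(class_b))
```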

As shown in Fig. 4, with this data set (and using just unit initialization weights and a constant value for the step factor μ), the estimated error rate tends to worsen, while the proposed criterion index (4) effectively decreases through successive iterations.

This undesirable effect is actually due to the fact that all $\sigma_{ij}$ tend to zero until the algorithm stops. It is interesting to note that, despite this "divergent" behaviour, a minimum error estimate is achieved at a certain step of the procedure, as can be seen in Fig. 4.

Fig. 1. Fractional Programming Gradient Descent algorithm.

Fig. 2. Behaviour of the FPGD algorithm as applied to the "Monkey Problem" data set (criterion index and error estimate vs. iterations). Classification error is estimated through Leave One Out.


In other words, a low value of J(M) does not always imply a low value of the NN classifier error rate; this was only an assumption, as mentioned in Section 2. Nevertheless, it is possible to find a minimum of the estimated error somewhere along the path that leads towards the minimum index value. This suggests that, rather than supplying the weight values obtained at the end of the FPGD procedure, a better choice for M in general would be to supply the weights that led to the minimum estimated error rate. In typical cases, such as that shown in Fig. 2, this minimum is achieved at the convergence point of the FPGD procedure, while in adverse situations, such as that in Fig. 4, the minimum-error weights will hopefully be a better choice than the standard (L2 or Mahalanobis) distance.

It is worth noting that this simple heuristic guarantees a LOO error estimation for the resulting weights which is never larger than the one obtained with the initial weights. Consequently, if the weights are initialized with values corresponding to a certain conventional (adequate) metric, the final weights are expected to behave at least as well as this metric would.
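The LOO estimate used for this selection can be sketched as follows (again with brute-force search and our own naming); a training loop would call it after every weight update and keep the matrix M with the smallest estimate.

```python
import numpy as np

def loo_error(S, labels, M):
    """Leave-one-out NN error estimate under the CDW measure (sketch)."""
    n = len(S)
    errors = 0
    for a in range(n):
        best_d, best_b = np.inf, None
        for b in range(n):
            if a == b:
                continue
            w = M[labels[b]]                          # weights of the prototype's class
            d = np.sqrt(np.sum((w * (S[a] - S[b])) ** 2))
            if d < best_d:
                best_d, best_b = d, b
        errors += labels[best_b] != labels[a]         # count NN misclassifications
    return errors / n
```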

2.3. Asymptotic behaviour

The previous section introduces an essential feature of our approach, namely the estimation of the error rate of the classifier by LOO using the weights at each step of the process. At the end of the process, the weights with the best estimation are selected.

Fig. 3. Two-class problem with the Gaussian mixture distributions and interclass overlapping (classes A and B).

Fig. 4. "Divergent" evolution of the FPGD algorithm with the "adverse" synthetic data shown in Fig. 3. The CDW index converges as expected but the error rate tends to increase. Nevertheless, there is a step at which the error is minimum.


Let n be the size of the training set. If M is initialized to the unit matrix, in the first step of the process a LOO error estimation, $\hat\varepsilon^{\,n}_{nn}$, of the standard Nearest Neighbor classifier is obtained. At the end of the process the weight matrix with the best error estimation, $\hat\varepsilon^{\,n}_{w}$, is selected. Therefore $\hat\varepsilon^{\,n}_{w} \le \hat\varepsilon^{\,n}_{nn}$.

It is well known that, under suitable conditions (Devroye et al., 1996), when n tends to infinity the LOO error estimation of a NN classifier tends to the error rate of this classifier. Therefore:

$$\left.\begin{array}{l} \hat\varepsilon^{\,n}_{w} \le \hat\varepsilon^{\,n}_{nn}\\ \lim_{n\to\infty}\hat\varepsilon^{\,n}_{nn} = \varepsilon_{nn}\\ \lim_{n\to\infty}\hat\varepsilon^{\,n}_{w} = \varepsilon_{w} \end{array}\right\}\;\Longrightarrow\; \varepsilon_{w} \le \varepsilon_{nn}.\qquad (8)$$

In conclusion, in this asymptotic case the classifier using the optimal weight matrix is guaranteed to produce an error rate less than or equal to that of the standard Nearest Neighbor classifier.

3. Experiments

Several standard benchmark corpora from the UCI Repository of Machine Learning Databases and Domain Theories (UCI) and the Statlog Project (Statlog) have been used. A short description of these corpora is given below:

· Statlog Australian Credit Approval (Australian): 690 prototypes, 14 features, 2 classes. Divided into 10 sets for cross-validation.
· UCI Balance (Balance): 625 prototypes, 4 features, 3 classes. Divided into 10 sets for cross-validation. A different design of the experiment was made in (Shultz et al., 1994).
· Statlog Pima Indians Diabetes (Diabetes): 768 prototypes, 8 features, 2 classes. Divided into 11 sets for cross-validation.
· Statlog DNA (DNA): Training set of 2000 prototypes, test set of 1186 vectors, 180 features, 3 classes.
· Statlog German Credit Data (German): 1000 prototypes, 20 features, 2 classes. Divided into 10 sets for cross-validation.
· Statlog Heart (Heart): 270 prototypes, 13 features, 2 classes. Divided into 9 sets for cross-validation.
· UCI Ionosphere (Ionosphere): Training set of 200 prototypes (the first 200, as in (Sigilito et al., 1989)), test set of 151 vectors, 34 features, 2 classes.
· Statlog Letter Image Recognition (Letter): Training set of 15,000 prototypes, test set of 5000 vectors, 16 features, 26 classes.
· UCI Monkey-Problem-1 (Monkey): Training set of 124 prototypes, test set of 432 vectors, 6 features, 2 classes.
· Statlog Satellite Image (Satimage): Training set of 4435 prototypes, test set of 2000 prototypes, 36 features, 6 classes.
· Statlog Image Segmentation (Segmen): 2310 prototypes, 19 features, 7 classes. Divided into 10 sets for cross-validation.
· Statlog Shuttle (Shuttle): Training set of 43,500 prototypes, test set of 14,500 vectors, 9 features, 7 classes.
· Statlog Vehicle (Vehicle): 846 prototypes, 18 features, 4 classes. Divided into 9 sets for cross-validation.

Most of these data sets involve both numeric and categorical features. In our experiments, each categorical feature has been replaced by n binary features, where n is the number of different values allowed for the categorical feature. For example, in a hypothetical set of data with two features, Age (continuous) and Sex (categorical: M, F), the categorical feature would be replaced by two binary features; i.e., Sex = M is represented as (1, 0) and Sex = F as (0, 1). The continuous feature does not undergo any change, leading to an overall three-dimensional representation.
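A small sketch of this categorical-to-binary expansion (the helper name and the Age value are made-up examples):

```python
def expand_categorical(value, categories):
    """Replace one categorical value by len(categories) binary features."""
    return [1 if value == c else 0 for c in categories]

# Age (continuous) is kept as-is; Sex in {M, F} becomes two binary features.
sample = [37.0] + expand_categorical('M', ('M', 'F'))   # -> [37.0, 1, 0]
```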

Many UCI and Statlog data sets are small. In these cases, N-fold cross-validation (Raudys and Jain, 1991) has been applied to obtain the classification results. Each corpus is divided into N blocks, using N − 1 blocks as a training set and the remaining block as a test set; therefore, each block is used exactly once as a test set. The number of cross-validation blocks, N, is specified for each corpus in the UCI and Statlog documentation. For DNA, Letter, Monkey, Satimage and Shuttle, which are relatively larger corpora, a single specific partition into training and test sets was provided by Statlog and, in these cases, no cross-validation was carried out.


It should finally be mentioned that, although classification-cost penalties are available in a few cases, for the sake of presentation homogeneity we have decided not to make use of them, neither for training nor for classification.

4. Results

Experiments with both the NN and the k-NN rules were carried out using the L2 metric, the MD, the CDM, and our CDW dissimilarity measures. As mentioned in Section 1, CDM consists in weighting each dimension by the inverse of the variance of this dimension in each class.

In the case of the CDM dissimilarity, computation singularities can appear when dealing with categorical features, which often exhibit null class-dependent variances. This problem was solved by using the overall variance as a "back-off" for smoothing the null values.
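A sketch of such CDM-style weights with the back-off follows. It assumes that the weights enter Eq. (2) as the inverse of the per-class standard deviations (so that their squares are inverse variances) and that every feature varies over the whole training set; the names are our own.

```python
import numpy as np

def cdm_weights(S, labels, m):
    """Class-dependent Mahalanobis-style weights (1 / std of feature j in
    class c), using the overall standard deviation as a back-off when the
    class-dependent one is null, e.g. for categorical features.
    S: (n, d) array, labels: (n,) integer array, m: number of classes."""
    overall_std = S.std(axis=0)
    W = np.empty((m, S.shape[1]))
    for c in range(m):
        std_c = S[labels == c].std(axis=0)
        std_c = np.where(std_c > 0, std_c, overall_std)   # back-off smoothing
        W[c] = 1.0 / std_c
    return W
```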

Initialization values for training the CDW weights were selected according to the following simple rule, which is based on the LOO NN performance of conventional methods on the training data: if raw L2 outperforms CDM, then all initial $\sigma_{ij}$ are set to 1; otherwise, they are set to the inverse of the corresponding training data standard deviations. Similarly, the step factors $\mu_{ij}$ are set to a small constant (0.001) in the former case and to the inverse of the standard deviation in the latter. Tables 1 and 2 summarize the results for NN and k-NN classification, respectively. In the case of k-NN, only the results for the optimal value of k, 1 < k < 21, observed for each method are reported.

For the NN classification rule (Table 1), CDW outperforms the conventional methods on most of the corpora. The greatest improvement (+13%) was obtained on the Monkey Problem, a categorical corpus with a small number of features and only two classes. Similarly, a good improvement (+9.2%) was obtained for the DNA corpus, which also contains categorical data but has far more features (180) and three classes. CDW has only been slightly outperformed (by less than 1.6%) by other methods in a few cases: Australian, Ionosphere and Shuttle.

For the k-NN classification rule (Table 2), CDW outperforms the conventional methods on many corpora: DNA, Ionosphere, Letter, Monkey, Segmen and Vehicle; again, Monkey and DNA yielded the most significant improvements (+12.7% and +7.7%, respectively). Also in this k-NN case, on the corpora where CDW is outperformed by some other method, the difference in accuracy was generally small.

Error estimation 95% confidence intervals³ (Duda and Hart, 1973) for the best method are also shown in Tables 1 and 2.

Table 1
Classification accuracy (in %) of different methods, using the NN rule on several data sets (a)

              L2      MD      CDM     CDW     CI
  Australian  65.73   81.03   82.94   81.37   +2.7, -3.0
  Balance     78.83   80.16   68.0    82.63   +2.9, -3.2
  Diabetes    69.94   70.62   68.3    71.72   +3.2, -3.3
  DNA         76.55   74.28   84.99   94.18   +1.3, -1.5
  German      66.3    66.9    67.6    70.7    +2.8, -2.9
  Heart       59.72   76.21   76.14   77.31   +4.8, -5.5
  Ionosphere  92.05   85.22   82.95   91.39   +3.8, -5.5
  Letter      95.8    95.26   92.98   96.6    +0.5, -0.5
  Monkey      78.7    86.34   87.04   100     +0.0, -0.8
  Satimage    89.45   89.35   85.3    90.15   +1.3, -1.4
  Segmen      96.32   96.27   95.97   96.92   +0.7, -0.8
  Shuttle     99.88   99.91   99.93   99.86   +0.04, -0.05
  Vehicle     65.3    68.51   66.79   69.5    +3.1, -3.2

(a) Results in boldface correspond to the best accuracy. The last column is the 95% confidence interval of the best method.

Table 2
Classification accuracy (in %) of different methods, using the k-NN rule on several data sets (a)

              L2      MD      CDM     CDW     CI
  Australian  69.26   85.44   85.29   84.8    +2.5, -2.8
  Balance     91.16   91.66   91.16   90.83   +2.0, -2.4
  Diabetes    76.5    77.32   73.77   75.13   +2.9, -3.1
  DNA         86.76   83.64   85.16   94.43   +1.2, -1.4
  German      71.2    73.2    74.5    71.8    +2.7, -2.8
  Heart       67.89   85.13   82.14   80.6    +4.0, -4.8
  Ionosphere  94.7    85.22   90.34   97.35   +1.9, -4.0
  Letter      96.1    95.56   92.98   96.6    +0.5, -0.5
  Monkey      83.33   86.34   87.33   100     +0.0, -0.8
  Satimage    90.75   90.65   87.25   90.75   +1.2, -1.3
  Segmen      96.32   96.27   95.97   96.92   +0.7, -0.8
  Shuttle     99.88   99.92   99.93   99.86   +0.04, -0.05
  Vehicle     66.54   71.72   70.25   71.85   +3.0, -3.2

(a) Results in boldface correspond to the best accuracy. The last column is the 95% confidence interval of the best method.

³ Computed by numerically solving the equations $\sum_{k \le K} P(k, n, p_1) = (1 - A)/2$ and $\sum_{k \ge K} P(k, n, p_0) = (1 - A)/2$, where P(k, n, p) is the binomial distribution, A = 0.95 is the confidence value and $(p_0, p_1)$ the confidence interval.


It is interesting to note that, in the few cases where CDW is outperformed by other methods, the difference is generally well within the corresponding confidence intervals. On the other hand, in many cases where CDW was the best method, the confidence intervals were small (notably DNA, Monkey and Letter), thus indicating a statistically significant advantage of CDW.

Comparisons with the best method known for each corpus (UCI; Statlog; Sigilito et al., 1989) are summarized in Table 3, while Table 4 shows the results achieved by several methods on a few corpora.⁴ From these comparisons and the previously discussed results (Tables 1 and 2), it can be seen that CDW exhibits uniformly good behaviour across all the corpora, while other procedures may work very well for some corpora (usually only one corpus) but typically tend to worsen, dramatically in many cases, for the rest.

Table 3
Comparing CDW classification accuracy (in %) with the best accuracy achieved by other methods

              CDW      Other (method)
  Australian  84.80    86.9 (Cal5)
  Diabetes    75.13    77.7 (LogDisc)
  DNA         94.43    95.9 (Radial)
  Ionosphere  97.35    96.7 (IB3)
  Letter      96.60    93.6 (Alloc80)
  Monkey      100.00   100.0 (AQ17-DCI) (a)
  Satimage    90.75    90.75 (KNN)
  Segmen      96.92    97.0 (Alloc80)
  Shuttle     99.86    99.0 (NewId)
  Vehicle     71.85    85.0 (QuaDisc)

(a) Many other algorithms also achieve 100% accuracy.

Table 4
Comparing classification error rate (in %) achieved by several methods (a)

              Alloc80  CART   C4.5   Discrim  NBayes  QDisc  Cal5   Radial  CDW
  Australian  20.1     14.5   15.5   14.1     15.1    20.7   13.1   14.5    15.2
  DNA         5.7      8.5    7.6    5.9      6.8     5.9    13.1   4.1     5.5
  Letter      6.4      -      13.2   30.2     52.9    11.3   25.3   23.3    3.4
  Satimage    13.2     13.8   15     17.1     -       15.5   15.1   12.1    9.2
  Segmen      3        4      4      11.6     26.5    15.7   6.2    6.9     3.1
  Vehicle     17.3     23.5   26.6   21.6     55.8    15     27.9   30.7    28.1

(a) Results in boldface correspond to the best method for each corpus.

⁴ Corpora that make use of classification-cost penalties (Section 3), such as Heart and German, and other corpora which are not comparable because of differences in experiment design, are excluded. Only methods with results on many corpora, and corpora for which results with many methods are available, have been chosen for the comparisons in Table 4.

5. Concluding remarks

A weighted dissimilarity measure for NN classification has been presented. The required matrix of weights is obtained through Fractional-Programming-based minimization of an appropriate criterion index. Results obtained for several standard benchmark data sets are promising.

Current results using the CDW index and the FPGD algorithm are uniformly better than those achieved by other, more traditional methods. This also applies to comparing FPGD with the direct Gradient Descent technique previously proposed in (Paredes and Vidal, 1998) to minimize a simpler criterion index.

Other, more sophisticated optimization methods can be devised to minimize the proposed index (4), and new indexes can be proposed which would probably lead to improved performance.





In this sense, an index which computes the relation between the k-NN distances to the prototypes of the same class and the k-NN distances to the prototypes of the nearest class (rather than the plain NN distances as in (4)) would be expected to improve the current CDW k-NN results.

Another new weighting scheme that deserves to be studied is one in which weights are assigned to each prototype, rather than (or in addition to) each class. This "Prototype-Dependent Weighted (PDW)" measure would involve a more "local" configuration of the dissimilarity function and is expected to lead to an overall behaviour of the corresponding k-NN classifiers which is even more data-independent.

Local prototype weighting can also be made feature-independent; i.e., a single scalar weight is assigned to each prototype. The weight of each prototype is intended to measure the value of this prototype for improving classification accuracy. Such a prototype weighting scheme can be seen from the viewpoint of prototype editing. This kind of weight can be learned using techniques similar to those introduced in this paper, leading to a recently studied and very successful editing-oriented weighting method which we call WP-Edit (Paredes and Vidal, 2000).

References

Blake, C., Keogh, E., Merz, C.J. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine, Department of Information and Computer Sciences.

Cover, T.M., Hart, P.E., 1967. Nearest neighbor pattern classification. IEEE Trans. Information Theory 13 (1), 21–27.

Devroye, L., Györfi, L., Lugosi, G., 1996. A Probabilistic Theory of Pattern Recognition. Springer, New York.

Duda, R., Hart, P., 1973. Pattern Classification and Scene Analysis. Wiley, New York.

Fukunaga, K., Flick, T.E., 1982. A parametrically defined nearest neighbour measure. Pattern Recognition Letters 1, 3–5.

Fukunaga, K., Flick, T.E., 1984. An optimal global nearest neighbour metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 314–318.

Fukunaga, K., Flick, T.E., 1985. The 2-NN rule for more accurate NN risk estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 7 (1), 107–112.

Luk, A., Macleod, J.E., 1986. An alternative nearest neighbour classification scheme. Pattern Recognition Letters 4, 375–381.

Myles, J.P., Hand, D.J., 1990. The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognition 23 (11), 1291–1297.

Paredes, R., Vidal, E., 1998. A nearest neighbor weighted measure in classification problems. In: Proceedings of the VIII Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, Bilbao, Spain, July.

Paredes, R., Vidal, E., 2000. Weighting prototypes. A new editing approach. In: Proceedings of the 15th International Conference on Pattern Recognition, ICPR2000, Barcelona, Spain, September.

Raudys, S.J., Jain, A.K., 1991. Small sample effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans. PAMI 13 (3), 252–264.

Short, R.D., Fukunaga, K., 1980. A new nearest neighbor distance measure. In: Proceedings of the Fifth IEEE International Conference on Pattern Recognition, Miami Beach, FL.

Short, R.D., Fukunaga, K., 1981. An optimal distance measure for nearest neighbour classification. IEEE Trans. Information Theory 27, 622–627.

Shultz, T.R., Mareschal, D., Schmidt, W.C., 1994. Modeling cognitive development on balance scale phenomena. Machine Learning 16, 57–86.

Sigilito, V.G., Wing, S.P., Hutton, L.V., Baker, K.B., 1989. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest 10, 262–266.

Sniedovich, M., 1992. Dynamic Programming. Marcel Dekker, New York.



Statlog Corpora. Department of Statistics and Modelling Science (Stams), Strathclyde University. ftp.strath.ac.uk.

Tomek, I., 1976. A generalization of the k-NN rule. IEEE Transactions on Systems, Man, and Cybernetics 6 (2), 121–126.

Urahama, K., Furukawa, Y., 1995. Gradient descent learning of nearest neighbor classifiers with outlier rejection. Pattern Recognition 28 (5), 761–768.

Vidal, E., Marzal, A., Aibar, P., 1995. Fast computation of normalized edit distances. IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (9), 899–902.
