A class-dependent weighted dissimilarity measure for nearest neighbor classification problems

Roberto Paredes *, Enrique Vidal

Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Camino de Vera S/N, 46071 Valencia, Spain

Received 13 July 1999; received in revised form 18 July 2000
Abstract

A class-dependent weighted (CDW) dissimilarity measure in vector spaces is proposed to improve the performance of the nearest neighbor (NN) classifier. In order to optimize the required weights, an approach based on Fractional Programming is presented. Experiments with several standard benchmark data sets show the effectiveness of the proposed technique. © 2000 Published by Elsevier Science B.V.

Keywords: Nearest neighbour classification; Weighted dissimilarity measures; Iterative optimization; Fractional programming
1. Introduction
Let P be a finite set of prototypes, which are class-labelled points in a vector space E, and let $d(\cdot,\cdot)$ be a dissimilarity measure defined in E. For any given point $x \in E$, the nearest neighbor (NN) classification rule assigns to $x$ the label of a prototype $p \in P$ such that $d(p, x)$ is minimum. The NN rule can be extended to the k-NN rule by classifying $x$ in the class most heavily represented by the labels of its k nearest neighbours. The great effectiveness of these rules as the number of prototypes grows to infinity is well known (Cover and Hart, 1967). However, in most real situations the number of available prototypes is usually very small, which often leads to dramatic degradations of (k-)NN classification accuracy.
Consider the following general statistical statement of a two-class Pattern Recognition classification problem: let $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be a training data set of independent, identically distributed random variable pairs, where $Y_i \in \{0, 1\}$, $1 \le i \le n$, are classification labels, and let $X$ be an observation from the same distribution. Let $Y$ be the true label of $X$ and $g_n(\cdot)$ a classification rule based on $D_n$. The probability of error is $R_n = P\{Y \neq g_n(X)\}$. Devroye et al. (1996) show that, for any integer $n$ and classification rule $g_n$, there exists a distribution of $(X, Y)$ with Bayes risk $R^* = 0$ such that the expectation of $R_n$ is $E[R_n] \ge 1/2 - \epsilon$, where $\epsilon > 0$ is an arbitrarily small number (Devroye et al., 1996). This theorem states that even though we have rules, such as the k-NN rule, that are universally consistent (that is, they asymptotically provide optimal performance for any distribution), their finite sample performance can be extremely bad for some distributions.
Pattern Recognition Letters 21 (2000) 1027-1036. www.elsevier.nl/locate/patrec

* Corresponding author. Tel.: +34-96-3877-069; fax: +34-96-3877-239. E-mail addresses: [email protected] (R. Paredes), [email protected] (E. Vidal).

This explains the increasing interest in finding variants of the NN rule and adequate distance measures that help improve NN classification performance in small data set situations (Tomek, 1976; Fukunaga and Flick, 1985; Luk and Macleod, 1986; Urahama and Furukawa, 1995; Short and Fukunaga, 1980, 1981; Fukunaga and Flick, 1982, 1984; Myles and Hand, 1990).
Here we propose a weighted measure which can be seen as a generalization of the simple weighted L2 dissimilarity in a d-dimensional space

$$d(y, x) = \sqrt{\sum_{j=1}^{d} r_j^2 \, (x_j - y_j)^2},\qquad(1)$$
where $r_j$ is the weight of the jth dimension. Assuming an m-class classification problem, our proposed generalization is just a natural extension of (1):

$$d(y, x) = \sqrt{\sum_{j=1}^{d} r_{cj}^2 \, (x_j - y_j)^2},\qquad(2)$$
where $c = \mathrm{class}(x)$. We will refer to this extension as the class-dependent weighted (CDW) measure. If $r_{ij} = 1$, $1 \le i \le m$, $1 \le j \le d$, the weighted measure is just the L2 metric. On the other hand, if the weights are the inverse of the variances in each dimension, the Mahalanobis distance (MD) is obtained. Weights can also be computed as class-dependent inverse variances, leading to a measure that will be referred to as the class-dependent Mahalanobis (CDM) dissimilarity.
In the general case, (2) is not a metric, since $d(x, y)$ can differ from $d(y, x)$ if $\mathrm{class}(x) \neq \mathrm{class}(y)$, violating the symmetry property.
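Under the definitions above, (2) is straightforward to compute. The following sketch is illustrative only (the function name, argument names, and the m×d NumPy array layout are my own assumptions, not from the paper); the row of weights is selected by the class of the second argument:

```python
import numpy as np

def cdw_dissimilarity(y, x, class_of_x, sigma):
    """Class-dependent weighted (CDW) dissimilarity of Eq. (2).

    y, x       : 1-D feature vectors of equal length d
    class_of_x : index c = class(x), selecting row c of the weights
    sigma      : (m, d) array of per-class, per-feature weights r_cj
    """
    r = sigma[class_of_x]                       # weights for class(x)
    return np.sqrt(np.sum((r * (x - y)) ** 2))  # sqrt(sum r_cj^2 (x_j - y_j)^2)

# With all weights equal to 1 the measure reduces to the plain L2 metric.
sigma = np.ones((2, 3))                         # m = 2 classes, d = 3 features
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 0.0, 3.0])
print(cdw_dissimilarity(y, x, 0, sigma))        # 2.0, same as ||x - y||_2
```

Note that with class-dependent rows the asymmetry mentioned above appears immediately: swapping the arguments changes which row of `sigma` is used.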
In this most general setting, we are interested in finding an $m \times d$ weight matrix $M$ which optimizes the CDW-based NN classification performance:

$$M = \begin{pmatrix} r_{11} & \cdots & r_{1d} \\ \vdots & & \vdots \\ r_{m1} & \cdots & r_{md} \end{pmatrix}.\qquad(3)$$
2. Approach
In order to find a matrix M that results in a low error rate of the NN classifier with the CDW dissimilarity measure, we propose the minimization of a specific criterion index.
Under the proposed framework, we expect NN accuracy to improve by using a dissimilarity measure such that distances between points belonging to the same class are small while inter-class distances are large. This simple idea suggests the following criterion index:

$$J(M) = \frac{\sum_{x \in S} d(x, x^{=}_{\mathrm{nn}})}{\sum_{x \in S} d(x, x^{\neq}_{\mathrm{nn}})},\qquad(4)$$

where $x^{=}_{\mathrm{nn}}$ is the nearest neighbor of $x$ in the same class ($\mathrm{class}(x) = \mathrm{class}(x^{=}_{\mathrm{nn}})$) and $x^{\neq}_{\mathrm{nn}}$ is the nearest neighbor of $x$ in a different class ($\mathrm{class}(x) \neq \mathrm{class}(x^{\neq}_{\mathrm{nn}})$). In the sequel, $\sum_{x \in S} d(x, x^{=}_{\mathrm{nn}})$ will be denoted as $f(M)$, and $\sum_{x \in S} d(x, x^{\neq}_{\mathrm{nn}})$ as $g(M)$. That is,

$$J(M) = \frac{f(M)}{g(M)}.$$
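As a concrete reading of (4), the index can be evaluated by a brute-force leave-one-out neighbor search. In this sketch the function name and the n×d array layout are illustrative assumptions; following (2), each distance $d(x, p)$ is computed with the weights of $\mathrm{class}(p)$, the second argument:

```python
import numpy as np

def criterion_index(X, y, sigma):
    """J(M) = f(M) / g(M) of Eq. (4).
    X: (n, d) training points, y: (n,) integer class labels,
    sigma: (m, d) weight matrix M with rows r_i."""
    f = g = 0.0
    for i in range(len(X)):
        # d(x_i, p) of Eq. (2) for every prototype p: weights of class(p)
        dist = np.sqrt(((sigma[y] * (X - X[i])) ** 2).sum(axis=1))
        dist[i] = np.inf                  # leave x_i itself out
        same = (y == y[i])
        f += dist[same].min()             # distance to x=nn  (same class)
        g += dist[~same].min()            # distance to x!=nn (other class)
    return f / g

# Two tight clusters far apart: small f, large g, so J is small.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
y = np.array([0, 0, 1, 1])
J = criterion_index(X, y, np.ones((2, 2)))
```

A well-separated data set with compact classes yields J much smaller than 1, which is the regime the minimization below tries to reach.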
Minimizing this index amounts to minimizing a ratio between sums of distances, a problem which is difficult to solve by conventional gradient descent. In fact, the gradient with respect to an $r_{ij}$ takes the form:

$$\frac{\partial J(M)}{\partial r_{ij}} = \frac{(\partial f(M)/\partial r_{ij})\, g(M) - f(M)\, (\partial g(M)/\partial r_{ij})}{g(M)^2}.$$

Taking into account that $f(M) = \sum_{x \in S} d(x, x^{=}_{\mathrm{nn}})$ and $g(M) = \sum_{x \in S} d(x, x^{\neq}_{\mathrm{nn}})$, this leads to an exceedingly complex expression. Clearly, an alternative technique for minimizing (4) is needed.
2.1. Fractional programming
In order to find a matrix M that minimizes (4), a Fractional Programming procedure (Sniedovich, 1992) is proposed. Fractional Programming aims at solving problems of the following type:¹

¹ As in (Vidal et al., 1995), where another application of Fractional Programming in Pattern Recognition is described, here we consider minimization problems rather than maximization problems as in (Sniedovich, 1992). It can be easily verified that the same results of Sniedovich (1992) also hold in our formulation.
Problem Q: $q = \min_{z \in Z} \dfrac{v(z)}{w(z)}$,

where $v$ and $w$ are real-valued functions on some set $Z$, and $w(z) > 0 \ \forall z \in Z$. Let $Z^*$ denote the set of optimal solutions to this problem. An optimal solution can be obtained via the solution of a parametric problem of the following type:

Problem Q(λ): $q(\lambda) = \min_{z \in Z} \left( v(z) - \lambda w(z) \right),\quad \lambda \in \mathbb{R}$.

Let $Z^*(\lambda)$ denote the set of optimal solutions to the problem Q(λ). The justification for seeking the solution of Problem Q via Problem Q(λ) is that a $\lambda \in \mathbb{R}$ exists such that every optimal solution to Problem Q(λ) is also an optimal solution to Problem Q. The algorithm for finding this $\lambda$ is known as Dinkelbach's Algorithm (Sniedovich, 1992).
Dinkelbach's Algorithm
Step 1: Select $z \in Z$ and set $k = 1$ and $\lambda(k) = v(z)/w(z)$.
Step 2: Set $\lambda' = \lambda(k)$; solve the problem Q(λ(k)) and select $z \in Z^*(\lambda(k))$.
Step 3: Set $k = k + 1$ and $\lambda(k) = v(z)/w(z)$; if $\lambda' = \lambda(k)$ stop, else go to Step 2.

Step 2 requires an optimal solution to the problem $Q(\lambda): q(\lambda) = \min_{z \in Z} (v(z) - \lambda w(z))$. If this optimal solution can be found,² then the algorithm finds a $\lambda$ (in a finite number of iterations) for which every optimal solution to Q(λ) is an optimal solution to Q as well. Unfortunately, however, if Q(λ) cannot be solved optimally (only local solutions can be found), then the algorithm does not guarantee that the globally optimal solution of the original problem Q can be found. Since we will use gradient descent techniques to solve Q(λ), which do not guarantee a globally optimal solution, in general we will not find the optimal solution to problem Q, but we expect to find a good local optimum.
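To make the scheme concrete, here is a minimal Dinkelbach iteration over a finite set Z, where each Q(λ) can be solved exactly by enumeration. The function names and the toy objective are my own illustrations, not from the paper:

```python
def dinkelbach(Z, v, w, max_iter=100):
    """Minimize v(z)/w(z) over a finite set Z (with w(z) > 0 for all z)
    by repeatedly solving the parametric problem
    Q(lam): min_z v(z) - lam * w(z)."""
    z = Z[0]
    lam = v(z) / w(z)                    # Step 1
    for _ in range(max_iter):
        # Step 2: solve Q(lam) exactly by enumeration (Z is finite)
        z = min(Z, key=lambda c: v(c) - lam * w(c))
        new_lam = v(z) / w(z)            # Step 3
        if new_lam == lam:               # lambda' == lambda(k): converged
            return z, lam
        lam = new_lam
    return z, lam

# Toy check: minimize (z^2 + 1) / (z + 2) over z in {0, 1, ..., 10}
Z = list(range(11))
z_star, q = dinkelbach(Z, v=lambda z: z * z + 1, w=lambda z: z + 2)
```

Because each parametric subproblem is solved exactly here, the global optimum of the ratio is recovered; with an inexact inner solver, as in the paper's case, only a local optimum is guaranteed.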
In our case, $Z$ is the set $\mathcal{M}$ of matrices of size $m \times d$ as in (3), and $z$ is one of these matrices, $M \in \mathcal{M}$. Thus, using gradient descent to obtain a locally optimal solution to the problem $Q(\lambda): \min_{M \in \mathcal{M}} (f(M) - \lambda g(M))$ leads to the following equations:
$$r'_{ij} = r_{ij} - \mu_{ij}\, \frac{\partial (f(M) - \lambda g(M))}{\partial r_{ij}},\quad 1 \le i \le m,\ 1 \le j \le d,\qquad(5)$$

where $r_{ij}$ is a component of M at a certain iteration of the descent algorithm, $r'_{ij}$ is the value of this component at the next iteration, and $\mu_{ij}$ is a step factor (or "learning rate") for dimension $j$ and class $i$ (typically $\mu_{ij} = \mu\ \forall i, j$). By developing the partial derivatives in (5) for our m-class classification problem and defining $S_i = \{x \in S : \mathrm{class}(x) = i\}$, $1 \le i \le m$, the following update equations are obtained:

$$r'_{ij} = r_{ij} - \sum_{x \in S_i} \mu_{ij}\, \frac{r_{ij}\, (x^{=}_{\mathrm{nn}\,j} - x_j)^2}{d(x, x^{=}_{\mathrm{nn}})},\qquad(6)$$

$$r'_{ij} = r_{ij} + \sum_{x \notin S_i \,\wedge\, x^{\neq}_{\mathrm{nn}} \in S_i} \lambda\, \mu_{ij}\, \frac{r_{ij}\, (x^{\neq}_{\mathrm{nn}\,j} - x_j)^2}{d(x, x^{\neq}_{\mathrm{nn}})}.\qquad(7)$$
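One batch pass of the updates (6) and (7) might be sketched as follows. The constant step factor, the accumulation order, and all names are my assumptions; the neighbor of each point is found leave-one-out, and each distance uses the weights of the neighbor's class, as in (2):

```python
import numpy as np

def gradient_step(X, y, sigma, lam, mu=0.001):
    """One update of the weights by Eqs. (6)-(7) for f(M) - lam*g(M).
    X: (n, d) training points, y: (n,) labels, sigma: (m, d) weights."""
    new_sigma = sigma.copy()
    for i in range(len(X)):
        # CDW distance from x_i to every prototype (neighbor's class weights)
        dist = np.sqrt(((sigma[y] * (X - X[i])) ** 2).sum(axis=1))
        dist[i] = np.inf                                   # leave one out
        same = (y == y[i])
        j_eq = np.where(same)[0][np.argmin(dist[same])]    # x=nn
        j_ne = np.where(~same)[0][np.argmin(dist[~same])]  # x!=nn
        c = y[i]
        # Eq. (6): shrink the weights of class(x=nn) = class(x_i)
        new_sigma[c] -= mu * sigma[c] * (X[j_eq] - X[i]) ** 2 / dist[j_eq]
        # Eq. (7): grow the weights of class(x!=nn)
        cn = y[j_ne]
        new_sigma[cn] += lam * mu * sigma[cn] * (X[j_ne] - X[i]) ** 2 / dist[j_ne]
    return new_sigma

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
y = np.array([0, 0, 1, 1])
sigma_next = gradient_step(X, y, np.ones((2, 2)), lam=0.02)
```

Note the tension the text describes: (6) always pushes weights down, and only the λ-scaled term (7) pushes them up, which is why a too-small additive factor can drive all weights toward zero.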
Finally, by embedding this gradient descent procedure into Dinkelbach's Algorithm, we obtain the Fractional Programming Gradient Descent (FPGD) algorithm to find a (local) minimum of the index (4). In this algorithm, shown in Fig. 1, the parameters ε_g and ε_λ are used to control the precision of the minimum required to assess convergence. They are typically set to adequately small fixed values. On the other hand, the learning rates μ_ij are generally set to a single constant value or to values that depend on the variances observed in the training data (cf. Section 3).
It is interesting to note that the computations involved in (6) and (7) implicitly entail computing the NN of each $x \in S$, according to the CDW dissimilarity corresponding to the current values of the weights $r_{ij}$ and the prototypes $S - \{x\}$. Therefore, as a byproduct, a leave-one-out (LOO) estimation of the error rate of the NN classifier with the weighted measure can readily be obtained. This issue will be further explored in the next section.
Fig. 2 shows a typical evolution of this algorithm, as applied to the so-called "Monkey Problem" data set, which is described in Section 3.

² And other basic conditions are met (Sniedovich, 1992).
2.2. Finding adequate solutions in adverse situations
A negative side effect of the fact that only locally optimal solutions can be obtained in each step of the Fractional Programming procedure is that, if the additive factor in (7) is not sufficiently large, the algorithm may tend to set the r-values to zero.
As an example of this kind of divergent behaviour, consider the following two-class problem, with each class having 500 two-dimensional points (Fig. 3). Class A is a mixture of two Gaussian distributions. The first distribution has a standard deviation of √10 in the x₁ dimension and a unit standard deviation in the x₂ dimension, while the second distribution has a unit standard deviation in the x₁ dimension and a standard deviation of √10 in the x₂ dimension, with both distributions centered at (0, 0). Class B is a Gaussian distribution with unit standard deviation in the x₁ dimension and a standard deviation of √10 in the x₂ dimension, centered at (6, 0). Note the relatively large interclass overlap in the x₁ dimension.
As shown in Fig. 4, with this data set (and using just unit initialization weights and a constant value for the step factor μ), the estimated error rate tends to worsen, while the proposed criterion index (4) effectively decreases through successive iterations.
This undesirable effect is actually due to the fact that all $r_{ij}$ tend to zero until the algorithm stops. It is interesting to note that, despite this "divergent" behaviour, a minimum error estimate is achieved at a certain step of the procedure, as can be seen in Fig. 4. In other words, a low value of J(M) does not necessarily mean a low value of the NN classifier error rate; this was only an assumption, as mentioned in Section 2. Nevertheless, it is possible to find a minimum of the estimated error somewhere on the path towards the minimum index value. This suggests that, rather than supplying the weight values obtained at the end of the FPGD procedure, a better choice for M in general would be to supply the weights that led to the minimum estimated error rate. In typical cases, such as that shown in Fig. 2, this minimum is achieved at the convergence point of the FPGD procedure, while in adverse situations, such as that in Fig. 4, the minimum-error weights will hopefully be a better choice than the standard (L2 or Mahalanobis) distance.

Fig. 1. Fractional Programming Gradient Descent algorithm.

Fig. 2. Behaviour of the FPGD algorithm as applied to the "Monkey Problem" data set. Classification error is estimated through leave one out.
It is worth noting that this simple heuristic guarantees a LOO error estimation for the resulting weights which is never larger than the one obtained with the initial weights. Consequently, if the weights are initialized with values corresponding to a certain conventional (adequate) metric, the final weights are expected to behave at least as well as this metric would.
2.3. Asymptotic behaviour
The previous section introduced an essential feature of our approach, namely, the estimation of the error rate of the classifier by LOO using the weights at each step of the process. At the end of the process, the weights with the best estimation are selected.
Let $n$ be the size of the training set. If M is initialized to the unit matrix, in the first step of the process a LOO error estimation of the standard nearest neighbor classifier, $\hat\varepsilon^{\,n}_{\mathrm{nn}}$, is obtained. At the end of the process the weight matrix with the best error estimation, $\hat\varepsilon^{\,n}_{w}$, is selected. Therefore $\hat\varepsilon^{\,n}_{w} \le \hat\varepsilon^{\,n}_{\mathrm{nn}}$.

It is well known that, under suitable conditions (Devroye et al., 1996), when $n$ tends to infinity the LOO error estimation of a NN classifier tends to the error rate of this classifier. Therefore:

$$\left.\begin{array}{l} \hat\varepsilon^{\,n}_{w} \le \hat\varepsilon^{\,n}_{\mathrm{nn}} \\ \lim_{n\to\infty} \hat\varepsilon^{\,n}_{\mathrm{nn}} = \varepsilon_{\mathrm{nn}} \\ \lim_{n\to\infty} \hat\varepsilon^{\,n}_{w} = \varepsilon_{w} \end{array}\right\}\ \Rightarrow\ \varepsilon_{w} \le \varepsilon_{\mathrm{nn}}.\qquad(8)$$

Fig. 3. Two-class problem with the Gaussian mixture distributions and interclass overlapping.

Fig. 4. "Divergent" evolution of the FPGD algorithm with the "adverse" synthetic data shown in Fig. 3. The CDW index converges as expected but the error rate tends to increase. Nevertheless there is a step in which the error is minimum.
In conclusion, in this asymptotic case the classifier using the optimal weight matrix is guaranteed to yield an error rate less than or equal to that of the standard nearest neighbor classifier.
3. Experiments
Several standard benchmark corpora from the UCI Repository of Machine Learning Databases and Domain Theories (UCI) and the Statlog Project (Statlog) have been used. A short description of these corpora is given below:

· Statlog Australian Credit Approval (Australian): 690 prototypes, 14 features, 2 classes. Divided into 10 sets for cross-validation.
· UCI Balance (Balance): 625 prototypes, 4 features, 3 classes. Divided into 10 sets for cross-validation. A different design of the experiment was used in (Shultz et al., 1994).
· Statlog Pima Indians Diabetes (Diabetes): 768 prototypes, 8 features, 2 classes. Divided into 11 sets for cross-validation.
· Statlog DNA (DNA): Training set of 2000 prototypes, test set of 1186 vectors, 180 features, 3 classes.
· Statlog German Credit Data (German): 1000 prototypes, 20 features, 2 classes. Divided into 10 sets for cross-validation.
· Statlog Heart (Heart): 270 prototypes, 13 features, 2 classes. Divided into 9 sets for cross-validation.
· UCI Ionosphere (Ionosphere): Training set of 200 prototypes (the first 200 as in (Sigilito et al., 1989)), test set of 151 vectors, 34 features, 2 classes.
· Statlog Letter Image Recognition (Letter): Training set of 15,000 prototypes, test set of 5000 vectors, 16 features, 26 classes.
· UCI Monkey-Problem-1 (Monkey): Training set of 124 prototypes, test set of 432 vectors, 6 features, 2 classes.
· Statlog Satellite Image (Satimage): Training set of 4435 prototypes, test set of 2000 vectors, 36 features, 6 classes.
· Statlog Image Segmentation (Segmen): 2310 prototypes, 19 features, 7 classes. Divided into 10 sets for cross-validation.
· Statlog Shuttle (Shuttle): Training set of 43,500 prototypes, test set of 14,500 vectors, 9 features, 7 classes.
· Statlog Vehicle (Vehicle): 846 prototypes, 18 features, 4 classes. Divided into 9 sets for cross-validation.

Most of these data sets involve both numeric and categorical features. In our experiments, each categorical feature has been replaced by n binary features, where n is the number of different values allowed for the categorical feature. For example, in a hypothetical data set with two features, Age (continuous) and Sex (categorical: M, F), the categorical feature would be replaced by two binary features; i.e., Sex=M is represented as (1, 0) and Sex=F as (0, 1). The continuous feature does not undergo any change, leading to an overall three-dimensional representation.
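The categorical expansion just described can be written directly; the helper name below is illustrative:

```python
def one_hot_expand(value, categories):
    """Replace a categorical value by n binary features, where n is
    the number of distinct values allowed for the feature."""
    return [1.0 if value == c else 0.0 for c in categories]

# Age stays numeric; Sex (M/F) expands to two binary features,
# giving the three-dimensional representation of the example above.
record = [37.0] + one_hot_expand("M", ["M", "F"])
print(record)   # [37.0, 1.0, 0.0]
```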
Many UCI and Statlog data sets are small. In these cases, N-fold cross-validation (Raudys and Jain, 1991) has been applied to obtain the classification results: each corpus is divided into N blocks, using N − 1 blocks as a training set and the remaining block as a test set, so that each block is used exactly once as a test set. The number of cross-validation blocks, N, is specified for each corpus in the UCI and Statlog documentation. For DNA, Letter, Monkey, Satimage and Shuttle, which are relatively larger corpora, a single specific partition into training and test sets was provided by Statlog and, in these cases, no cross-validation was carried out. It should finally be mentioned that, although classification-cost penalties are available in a few cases, for the sake of presentation homogeneity we have decided not to make use of them, neither for training nor for classification.
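The block partition can be sketched as a plain index split; the function name and the interleaved assignment of indices to blocks are illustrative assumptions:

```python
def n_fold_splits(n_samples, n_blocks):
    """Yield (train, test) index lists: the corpus is divided into
    n_blocks blocks and each block is used exactly once as a test set."""
    idx = list(range(n_samples))
    blocks = [idx[b::n_blocks] for b in range(n_blocks)]
    for b in range(n_blocks):
        test = set(blocks[b])
        train = [i for i in idx if i not in test]
        yield train, sorted(test)

splits = list(n_fold_splits(10, 5))   # 5-fold split of 10 samples
```

Every sample index lands in exactly one test block, so the N test-set error rates can be averaged into a single cross-validated estimate.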
4. Results
Experiments with both the NN and the k-NN rules were carried out using the L2 metric, the MD, the CDM, and our CDW dissimilarity measures. As mentioned in Section 1, CDM consists in weighting each dimension by the inverse of the variance of this dimension in each class.
In the case of the CDM dissimilarity, computation singularities can appear when dealing with categorical features, which often exhibit null class-dependent variances. This problem was solved by using the overall variance as a "back-off" for smoothing the null values.
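A sketch of this back-off follows. The function name and the exact substitution rule are my assumptions (the paper only states that the overall variance replaces null class-dependent variances), and the returned values play the role of the squared weights $r_{ij}^2$ in (2):

```python
import numpy as np

def cdm_weights(X, y, m):
    """Class-dependent Mahalanobis (CDM) squared weights: inverse
    per-class variances, with the overall variance as a back-off for
    the null variances that categorical features often produce."""
    overall = X.var(axis=0)               # assumed nonzero per feature
    W = np.empty((m, X.shape[1]))
    for c in range(m):
        v = X[y == c].var(axis=0)
        v = np.where(v > 0, v, overall)   # back-off smoothing of zeros
        W[c] = 1.0 / v
    return W

# Feature 0 is constant within class 0, so its variance backs off
# to the overall variance instead of producing a division by zero.
X = np.array([[0.0, 1.0], [0.0, 3.0], [1.0, 1.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])
W = cdm_weights(X, y, 2)
```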
Initialization values for training the CDW weights were selected according to the following simple rule, which is based on the LOO NN performance of conventional methods on the training data: if raw L2 outperforms CDM, then all initial $r_{ij}$ are set to 1; otherwise, they are set to the inverse of the corresponding training data standard deviations. Similarly, the step factors $\mu_{ij}$ are set to a small constant (0.001) in the former case and to the inverse of the standard deviation in the latter. Tables 1 and 2 summarize the results for NN and k-NN classification, respectively. In the case of k-NN, only the results for the optimal value of k, 1 < k < 21, observed for each method are reported.
For the NN classification rule (Table 1), CDW outperforms conventional methods in most of the corpora. The greatest improvement (+13%) was obtained in the Monkey-Problem, a categorical corpus with a small number of features and only two classes. Similarly, a good improvement (+9.2%) was obtained for the DNA corpus, which is also a corpus with categorical data, but with far more features (180) and three classes. CDW has only been slightly outperformed (by less than 1.6%) by other methods in a few cases: Australian, Ionosphere and Shuttle.
For the k-NN classification rule (Table 2), CDW outperforms conventional methods in many corpora: DNA, Ionosphere, Letter, Monkey, Segmen and Vehicle; again Monkey and DNA yielded the most significant improvements (+12.7% and +7.7%, respectively). Also in this k-NN case, in the corpora where CDW is outperformed by some other method, the difference in accuracy was generally small.
Error estimation 95% confidence intervals³ (Duda and Hart, 1973) for the best method are also shown in Tables 1 and 2. It is interesting to note that in the few cases where CDW is
Table 1
Classification accuracy (in %) of different methods, using the NN rule on several data sets(a)

             L2      MD      CDM     CDW     CI
Australian   65.73   81.03   82.94   81.37   +2.7, -3.0
Balance      78.83   80.16   68.0    82.63   +2.9, -3.2
Diabetes     69.94   70.62   68.3    71.72   +3.2, -3.3
DNA          76.55   74.28   84.99   94.18   +1.3, -1.5
German       66.3    66.9    67.6    70.7    +2.8, -2.9
Heart        59.72   76.21   76.14   77.31   +4.8, -5.5
Ionosphere   92.05   85.22   82.95   91.39   +3.8, -5.5
Letter       95.8    95.26   92.98   96.6    +0.5, -0.5
Monkey       78.7    86.34   87.04   100     +0.0, -0.8
Satimage     89.45   89.35   85.3    90.15   +1.3, -1.4
Segmen       96.32   96.27   95.97   96.92   +0.7, -0.8
Shuttle      99.88   99.91   99.93   99.86   +0.04, -0.05
Vehicle      65.3    68.51   66.79   69.5    +3.1, -3.2

(a) Results in boldface correspond to the best accuracy. The last column is the 95% confidence interval of the best method.
³ Computed by numerically solving the equations $\sum_{k \le K} P(k; n; p_1) = (1 - A)/2$ and $\sum_{k \ge K} P(k; n; p_0) = (1 - A)/2$, where $P(k; n; p)$ is the binomial distribution, $A = 0.95$ is the confidence value and $(p_0, p_1)$ the confidence interval.
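The footnote's equations can be solved numerically by bisection, exploiting the monotonicity of the binomial tails in p; this yields an exact (Clopper-Pearson-style) interval. All names below are of my choosing:

```python
from math import comb

def binom_cdf_leq(K, n, p):
    """P(X <= K) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(K + 1))

def confidence_interval(K, n, A=0.95, tol=1e-10):
    """Solve sum_{k<=K} P(k;n;p1) = (1-A)/2 and
    sum_{k>=K} P(k;n;p0) = (1-A)/2 for (p0, p1) by bisection."""
    alpha = (1 - A) / 2

    def bisect(f):                      # f(0) > 0 > f(1), f monotone
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Upper end: the left tail P(X <= K) decreases as p grows.
    p1 = bisect(lambda p: binom_cdf_leq(K, n, p) - alpha)
    # Lower end: the right tail P(X >= K) increases as p grows.
    p0 = bisect(lambda p: alpha - (1 - binom_cdf_leq(K - 1, n, p))) if K > 0 else 0.0
    return p0, p1

p0, p1 = confidence_interval(5, 10, A=0.95)   # e.g. 5 errors in 10 trials
```

For K = 5, n = 10 the interval is symmetric around 0.5, which is a convenient sanity check of the two tail equations.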
outperformed by other methods, the difference is generally well within the corresponding confidence intervals. On the other hand, in many cases where CDW was the best method, the confidence intervals were small (notably DNA, Monkey and Letter), thus indicating a statistically significant advantage of CDW.
Comparisons with the best method known for each corpus (UCI; Statlog; Sigilito et al., 1989) are summarized in Table 3, while Table 4 shows the results achieved by several methods on a few corpora.⁴ From these comparisons and the previously discussed results (Tables 1 and 2) it can be seen that CDW exhibits a uniformly good behaviour across all the corpora, while other procedures may work very well for some corpora (usually only one corpus) but typically tend to worsen (dramatically in many cases) for the rest.
5. Concluding remarks
A weighted dissimilarity measure for NN classification has been presented. The required matrix of weights is obtained through Fractional-Programming-based minimization of an appropriate criterion index. Results obtained for several standard benchmark data sets are promising.
Current results using the CDW index and the FPGD algorithm are uniformly better than those achieved by other, more traditional methods. This also applies to the comparison of FPGD with the direct Gradient Descent technique previously proposed in (Paredes and Vidal, 1998) to minimize a simpler criterion index.
Other more sophisticated optimization methods can be devised to minimize the proposed index (4), and new indexes can be proposed which would probably lead to improved performance. In this sense, an index which computes the relation
Table 3
Comparing CDW classification accuracy (in %) with the best accuracy achieved by other methods

             CDW      Other (method)
Australian   84.80    86.9 (Cal5)
Diabetes     75.13    77.7 (LogDisc)
DNA          94.43    95.9 (Radial)
Ionosphere   97.35    96.7 (IB3)
Letter       96.60    93.6 (Alloc80)
Monkey       100.00   100.0 (AQ17-DCI)(a)
Satimage     90.75    90.75 (KNN)
Segmen       96.92    97.0 (Alloc80)
Shuttle      99.86    99.0 (NewId)
Vehicle      71.85    85.0 (QuaDisc)

(a) Many other algorithms also achieve 100% accuracy.
Table 2
Classification accuracy (in %) of different methods using the k-NN rule on several data sets(a)

             L2      MD      CDM     CDW     CI
Australian   69.26   85.44   85.29   84.8    +2.5, -2.8
Balance      91.16   91.66   91.16   90.83   +2.0, -2.4
Diabetes     76.5    77.32   73.77   75.13   +2.9, -3.1
DNA          86.76   83.64   85.16   94.43   +1.2, -1.4
German       71.2    73.2    74.5    71.8    +2.7, -2.8
Heart        67.89   85.13   82.14   80.6    +4.0, -4.8
Ionosphere   94.7    85.22   90.34   97.35   +1.9, -4.0
Letter       96.1    95.56   92.98   96.6    +0.5, -0.5
Monkey       83.33   86.34   87.33   100     +0.0, -0.8
Satimage     90.75   90.65   87.25   90.75   +1.2, -1.3
Segmen       96.32   96.27   95.97   96.92   +0.7, -0.8
Shuttle      99.88   99.92   99.93   99.86   +0.04, -0.05
Vehicle      66.54   71.72   70.25   71.85   +3.0, -3.2

(a) Results in boldface correspond to the best accuracy. The last column is the 95% confidence interval of the best method.
⁴ Corpora that make use of classification-cost penalties (Section 3), such as Heart and German, and other corpora which are not comparable because of other differences in experiment design, are excluded. Only methods for which results are available on many corpora, and corpora for which results with many methods are available, have been chosen for the comparisons in Table 4.
between the k-NN distances to the prototypes of the same class and the k-NN distances to the prototypes in the nearest class (rather than the plain NN as in (4)) would be expected to improve the current CDW k-NN results.
Another new weighting scheme that deserves study is one in which weights are assigned to each prototype, rather than (or in addition to) each class. This "Prototype-Dependent Weighted (PDW)" measure would involve a more "local" configuration of the dissimilarity function and is expected to lead to an overall behaviour of the corresponding k-NN classifiers which is even more data-independent.
Local prototype weighting can also be made feature-independent; i.e., a single scalar weight is assigned to each prototype. The weight of each prototype is intended to measure the value of this prototype for improving classification accuracy. Such a prototype weighting scheme can be seen from the viewpoint of prototype editing. This kind of weight can be learned using techniques similar to those introduced in this paper, leading to a recently studied and very successful editing-oriented weighting method which we call WP-Edit (Paredes and Vidal, 2000).
References
Blake, C., Keogh, E., Merz, C.J. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine, Department of Information and Computer Sciences.
Cover, T.M., Hart, P.E., 1967. Nearest neighbor pattern classification. IEEE Trans. Information Theory 13 (1), 21-27.
Devroye, L., Györfi, L., Lugosi, G., 1996. A Probabilistic Theory of Pattern Recognition. Springer, New York.
Duda, R., Hart, P., 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Fukunaga, K., Flick, T.E., 1982. A parametrically defined nearest neighbour measure. Pattern Recognition Letters 1, 3-5.
Fukunaga, K., Flick, T.E., 1984. An optimal global nearest neighbour metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 314-318.
Fukunaga, K., Flick, T.E., 1985. The 2-NN rule for more accurate NN risk estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 7 (1), 107-112.
Luk, A., Macleod, J.E., 1986. An alternative nearest neighbour classification scheme. Pattern Recognition Letters 4, 375-381.
Myles, J.P., Hand, D.J., 1990. The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognition 23 (11), 1291-1297.
Paredes, R., Vidal, E., 1998. A nearest neighbor weighted measure in classification problems. In: Proceedings of the VIII Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, Bilbao, Spain, July.
Paredes, R., Vidal, E., 2000. Weighting prototypes. A new editing approach. In: Proceedings of the 15th International Conference on Pattern Recognition, ICPR2000, Barcelona, Spain, September.
Raudys, S.J., Jain, A.K., 1991. Small sample effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans. PAMI 13 (3), 252-264.
Short, R.D., Fukunaga, K., 1980. A new nearest neighbor distance measure. In: Proceedings of the Fifth IEEE International Conference on Pattern Recognition, Miami Beach, FL.
Short, R.D., Fukunaga, K., 1981. An optimal distance measure for nearest neighbour classification. IEEE Trans. Information Theory 27, 622-627.
Shultz, T.R., Mareschal, D., Schmidt, W.C., 1994. Modeling cognitive development on balance scale phenomena. Machine Learning 16, 57-86.
Sigilito, V.G., Wing, S.P., Hutton, L.V., Baker, K.B., 1989. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest 10, 262-266.
Sniedovich, M., 1992. Dynamic Programming. Marcel Dekker, New York.
Table 4
Comparing classification error rate (in %) achieved by several methods(a)

             Alloc80  CART   C4.5   Discrim  NBayes  QDisc  Cal5   Radial  CDW
Australian   20.1     14.5   15.5   14.1     15.1    20.7   13.1   14.5    15.2
DNA          5.7      8.5    7.6    5.9      6.8     5.9    13.1   4.1     5.5
Letter       6.4      -      13.2   30.2     52.9    11.3   25.3   23.3    3.4
Satimage     13.2     13.8   15     17.1     -       15.5   15.1   12.1    9.2
Segmen       3        4      4      11.6     26.5    15.7   6.2    6.9     3.1
Vehicle      17.3     23.5   26.6   21.6     55.8    15     27.9   30.7    28.1

(a) Results in boldface correspond to the best method for each corpus.
Statlog Corpora. Department of Statistics and Modelling Science (Stams), Strathclyde University. ftp.strath.ac.uk.
Tomek, I., 1976. A generalization of the k-NN rule. IEEE Transactions on Systems, Man, and Cybernetics 6 (2), 121-126.
Urahama, K., Furukawa, Y., 1995. Gradient descent learning of nearest neighbor classifiers with outlier rejection. Pattern Recognition 28 (5), 761-768.
Vidal, E., Marzal, A., Aibar, P., 1995. Fast computation of normalized edit distances. IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (9), 899-902.