Fuzzy feature selection based on min–max learning rule and extension matrix

Pattern Recognition 41 (2008) 217–226www.elsevier.com/locate/pr

Fuzzy feature selection based on min–max learning rule and extension matrix

Yun Lia,∗, Zhong-Fu Wub

aDepartment of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Rd, Shanghai 200240, PR ChinabCollege of Computer, ChongQing University, 174 Shazheng Road, Chongqing 400044, PR China

Received 22 September 2006; received in revised form 15 June 2007; accepted 20 June 2007

Abstract

In many systems, such as fuzzy neural network, we often adopt the language labels (such as large, medium, small, etc.) to split the originalfeature into several fuzzy features. In order to reduce the computation complexity of the system after the fuzzification of features, the optimalfuzzy feature subset should be selected. In this paper, we propose a new heuristic algorithm, where the criterion is based on min–max learningrule and fuzzy extension matrix is designed as the search strategy. The algorithm is proved in theory and has shown its high performance overseveral real-world benchmark data sets.� 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Fuzzy set theory; Feature selection; Min–max rule; Extension matrix

1. Introduction

In a lot of real-world systems, such as fuzzy neural network,an original feature is often split into several fuzzy features us-ing some fuzzy linguistic terms such as small and large, etc.,which is called as 1-of-N code, and then the original featurespace is mapped to fuzzy feature space. Each fuzzy feature cor-responds to a fuzzy set and denoted by a membership function,and it represents some local information of the original featureand reduce the information granularity. The ambitions of thesplit (fuzzification) are to characterize the problem more accu-rately and simplify the classifier design in the system. In manycases, the relationship between the result of classification andthe original feature is nonlinear and complicated. However, ifthe original feature is fuzzified, the relationship may be linearand the classifier may be simple. Fuzzification of the originalfeatures is based on specified rules, which are obtained usingspecial expert knowledge or continuous experimentation. Theconcrete fuzzification process is described in Ref. [1]. How-ever, the gotten fuzzy feature space has higher dimensionalitythan the original feature space. In order to reduce the computa-tion complexity of the system after the fuzzification of features,

∗ Corresponding author. Tel.: +86 21 34204421; fax: +86 21 34205422.E-mail address: [email protected] (Y. Li).

0031-3203/$30.00 � 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.doi:10.1016/j.patcog.2007.06.007

it is urgent to select features from fuzzy feature space with-out decreasing the efficiency of classifier. In many real-worldproblems, such as data mining and pattern recognition, featureselection is very important, which not only reduces trainingand inference time but also obtains better accuracy of the pre-dictor. In general, feature selection including two key points:the proper evaluation criterion and the efficient search algo-rithm. For evaluation criterion, the best one is the minimumerror probability, but it is infeasible for the poor quality of theclass-conditional density estimation. Then it is replaced withthe distance criterion or other criteria in terms of the classifica-tion performance, such as information measures, dependencymeasures and consistency measures [2,3]. For the search strat-egy, see Refs. [2,3] for a list of these techniques such as se-quence forward selection (SFS), sequence backward selection(SBS), simulated annealing (SA), etc.

In this paper, we propose a heuristic algorithm with newevaluation criterion and search strategy for feature selectionfrom fuzzy feature space. The paper is organized as follows:fuzzy feature selection algorithm is presented in Section 2. InSection 3, a number of real data sets are described and usedin the evaluation study. At the same time, the experimentalresults are also shown in Section 3. The paper ends with thediscussion of the utility and the limitations of our algorithm inSection 4.

http://www.elsevier.com/locate/pr

mailto:liyunprotect LY1extunderscore [email protected]

218 Y. Li, Z.-F. Wu / Pattern Recognition 41 (2008) 217–226

2. Fuzzy feature selection

2.1. Problem analysis

Up to now, most of fuzzy feature selection algorithms areonly interested in applying fuzzy set theories into the regularfeature selection, such as, a method was proposed in Ref. [4]to select features when the feature values are discrete fuzzy-value, and a fuzzy feature evaluation index-FFEI was presentedin Refs. [5–7] for regular feature selection. However, this paperfocuses on the feature selection from the fuzzy feature spacementioned in Section 1, where the value of fuzzy feature ismembership degree to the corresponding fuzzy set, and thedegree is numerical and takes value [0, 1]. For this specialfeature selection, the common algorithms listed in Section 1 canbe used, for example, exhaustive search and SBS are used inRef. [1] with large computational complexity. Here, we presenta new algorithm to solve this special feature selection with newevaluation criterion and search strategy. The proposed algorithmadequately considers the characteristics of value range of fuzzyfeature. Before giving a rigorous algorithm flow, we reviewsome notations and concepts used in this paper.

As we have known, if the similarity between classes withrespect to a selected feature subset is low, then the classificationperformance of the selected feature subset is high. Therefore,we can transfer the measure of classification accuracy to thecalculation of similarity between classes.

Definition 1. Similarity between samples. Suppose theoriginal feature set is fuzzified to a fuzzy feature setffs=[ff1, ff2, . . . , ffn], then every sample s is represented by themembership degree of every fuzzy feature ffi (i = 1, 2, . . . , n)

to corresponding fuzzy set, and degree is got through mem-bership function. Namely, s = (md1, md2, . . . , mdn), wheremdi = �ffi

(s) ∈ [0, 1], i = 1, 2, . . . , n and � is the member-ship function of fuzzy feature ffi . Suppose the data set hastwo classes, we name them as positive class P and nega-tive class N. Then the similarity between a positive samples+ = (md+1 , md+2 , . . . , md+n ) ∈ P and a negative samples− = (md−1 , md−2 , . . . , md−n ) ∈ N with respect to ffs is definedas

Sim(s+, s−|ffs)= n

mini=1

(sim(md+i , md−i )), (1)

where sim(md+i , md−i ) denotes the similarity between a posi-tive sample and a negative sample with respect to fuzzy featureffi , and defined as follows:

sim(md+i , md−i )= 1− |md+i − md−i |, (2)

its properties are summarized as

• 0�sim(md+i , md−i )�1;• sim(md+i , md−i )= 1 if and only if md+i equals to md−i ;• sim(md+i , md−i )= sim(md−i , md+i ).

These properties are enough to characterize the similarity offuzzy features.

Definition 2. Similarity between a positive sample and N withrespect to ffs can be defined as follows, which is the maximumvalue of similarity between a positive sample and all negativesamples:

Sim(s+, N |ffs)= maxs−∈N

(Sim(s+, s−)). (3)

Similarly, the similarity between classes can be defined as themaximum value of similarity between all positive samples andN:

Sim(P, N |ffs)= maxs+∈P

(Sim(s+, N)). (4)

On the one hand, the definitions above can be considered as anapplication of min–max rule, which is widely used in ensem-ble learning to combine classifiers like neural network, SVM,etc. [8–10]. These methods have been successfully applied intoclassification of EEG [11], text [12], etc. Here, we extend themin–max rule to combine the samples and get the similarity be-tween classes. On the other hand, to make definitions clear, wegive explanation from crisp case. For given examples a and b, agiven feature subset fs, and a given similarity measure sim, onecan consider the similarity degree between a and b with respectto fs (denoted by a|fs and b|fs) is equal to 1 if and only if a|fs=b|fs, i.e.,

∧fi∈fs(sim(a, b|fi))=1, and is equal to 0 if and only if

a|f s �= b|fs, i.e.,∧

fi∈fs(sim(a, b|fi))=0. This consequent im-plies the overlapping degree between P and N OV(P, N |fs)=0if and only if (P ∩N |fs)=� and OV(P, N |fs) > 0 if and only if(P ∩N |fs) �= � where � denotes empty set, and OV(P, N |fs)can be defined as

∨s+∈P

∨s−∈N(

∧fi∈fs(sim(s+, s−|fi))) [4].

The OV(P, N |fs) can be considered as the similarity betweenP and N, and it shows whether the intersection of two sets isempty for crisp case. Therefore, when fuzzy case is consideredand Zadeh operator (

∨=max,∧=min) is used, OV(P, N |fs)

is changed to Definition 2 and regarded as maximal degree ofsimilarity between two sets.

2.2. Evaluation criterion

The optimal feature selection generally attempts to select theminimally sized subset of features without sacrificing or evenget higher classification accuracy than the full feature set, be-cause it can diminish volume of storage, reduce the noise gen-erated by irrelevant or redundant features and eliminate uselessknowledge [2,13]. Also as pointed out in [14], generally, theobjective function for feature selection consists of two termsthat compete with each other: (1) goodness-of-fit (to be maxi-mized), and (2) the number of features (to be minimized). Here,goodness-of-fit is denoted by classification performance andit also can be inversely represented by the similarity betweenclasses with respect to selected subset, i.e., the smaller simi-larity, the higher goodness-of-fit. Then the criterion for fuzzyfeature selection can be defined as follows.

Suppose the similarity between P and N with respect to fuzzyfeature set ffs is THR= Sim(P, N |ffs). Then the similarity be-tween P and N with respect to the optimal fuzzy feature subset

Y. Li, Z.-F. Wu / Pattern Recognition 41 (2008) 217–226 219

offs should be smaller than or equal to THR, at the same time,the size of offs is minimum.

• offs ⊂ ffs and Sim(P, N |offs)=minffsub⊂ffsSim(P, N |ffsub)�THR, ffsub denotes a fuzzy feature subset.• The number of fuzzy features in offs is minimum, namely

the cardinality is minimum, in other words, the highest di-mensionality reduction rate is got.

This criterion can be considered as distance criterion, and it isalso a consistent one for its attempt to find a minimum numberof features that separate classes as consistency as the full setof features can.

2.3. Search strategy—fuzzy extension matrix

Extension matrix was first developed by Hong [15]. Here,we extend it to fuzzy extension matrix, where the element isthe similarity value between a positive sample and a negativesample on a fuzzy feature. The process of fuzzy extension ma-trix construction is briefly described as follows.

Set P = {s+1 , s+2 , . . . , s+l }, where s+p = (md+p1, md+p2, . . . ,

md+pn), p = 1, 2, . . . , l, l is the number of positive sam-

ples and md+pi denotes the membership degree of fuzzyfeature ffi (i = 1, 2, . . . , n) for the pth positive sample. SetN = {s−1 , s−2 , . . . , s−g }, where s−q = (md−q1, md−q2, . . . , md−qn),q = 1, 2, . . . , g, g is the number of negative samples andmd−qi denotes the membership degree of fuzzy feature ffi(i = 1, 2, . . . , n) for the qth negative sample.

Firstly the fuzzy extension matrix of a positive sample s+pagainst a negative sample s−q is defined as below:

FEM(s+p , s−q )= [e1, e2, . . . , en],ei = sim(md+pi, md−qi), i = 1, 2, . . . , n. (5)

Secondly, the fuzzy extension matrix of a positive sample s+pagainst N is defined as follows, T represents transpose:

FEM(s+p , N)= [r1, r2, . . . , rg]T,

rq = FEM(s+p , s−q ), q = 1, 2, . . . , g. (6)

Finally, the fuzzy extension matrix of P against N is defined asbelow, which is designed as search strategy for fuzzy featureselection and has the dimensionality (l ∗ g)× n:

FEM(P, N)= [A1, A2, . . . , Al]T,

Ap = FEM(s+p , N), p = 1, 2, . . . , l. (7)

The number of rows in fuzzy extension matrix FEM(P, N) isl ∗ g, the number of columns is n and equals to the number offuzzy features. Each fuzzy feature is corresponding to a columnin fuzzy extension matrix FEM(P, N), and they have one-to-one relationship. The fuzzy extension matrix integrates the su-periority of traditional extension matrix with the characteristicsof fuzzy feature.

2.4. Theoretical proof

In order to throw light on the proposed algorithm, we giveother three concepts:

Definition 3. THR consistency, given a threshold THR, ifSim(P, N |ffsub)�THR, then P and N are THR consistent withrespect to fuzzy feature subset ffsub, and ffsub is also called asa consistent feature subset with respect to P and N.

Definition 4. THR element, given a threshold THR, if the el-ement value in fuzzy matrix extension is not more than THR,then the element is called as a THR element.

Definition 5. A path of an extension matrix refers to a con-nection of its THR elements, which is obtained by selectingone THR element from each row of the extension matrixFEM(P, N).

Lemma. Let ffsub be a fuzzy feature subset, P and N are THRconsistent with respect to ffsub if and only if there exists atleast one THR element in the ffsub place of each row of fuzzyextension matrix FEM(P, N).

Proof. According to Definition 3, if P and N are THR con-sistent with respect to ffsub, then Sim(P, N |ffsub)�THR.According to Definition 2, the similarity between any positivesample and any negative sample is not more than THR, i.e.,Sim(s+, s−|ffsub)�THR. From Definition 1, there exists atleast one feature ffi ∈ ffsub such that sim(md+i , md−i )�THR,namely the element ei in the fuzzy extension matrixFEM(s+, s−) is not more than THR. Based on Definition 2,we know the searching involves all positive and negative sam-ples, then all rows of fuzzy extension matrix FEM(P, N) aretraversed according to the definition of fuzzy extension matrix.So there exists at least one THR element in the ffsub place ofeach row of FEM(P, N).

Conversely, if there exists a THR element in each rowof fuzzy extension matrix FEM(P, N), it implies that thereexists at least a fuzzy feature ffi ∈ ffs (i ∈ [1, n]) suchthat sim(md+i , md−i )�THR for each row. According to Def-inition 1, the similarity between a positive sample and anegative sample with respect to ffi is not more than THR.All these fuzzy features ffi got from THR element in eachrow are combined to obtain the fuzzy feature subset ff-sub, which satisfies Sim(P, N |ffsub)�THR, then P and Nis THR consistent with respect to ffsub. This completes theproof. �

Theorem. Let THR be a given threshold and FEM(P, N) bethe fuzzy extension matrix of P against N. Then finding anoptimal fuzzy feature subset is equivalent of searching for apath in the FEM(P, N), which involves the minimum numberof columns (features).

Proof. According to the presented evaluation criterion, Defini-tions 3 and 4, one knows that offs is an optimal fuzzy feature


subset if and only if

• P and N are THR consistent with respect to offs.• The cardinality of offs reaches a minimum.

According to Lemma, if there exists at least one THR elementin the offs place of each row of FEM(P, N), P and N are THRconsistent with respect to offs. Then this path can be obtainedby selecting one THR element from each row of FEM(P, N).Each column, which is involved in the process of selecting THRelement, corresponds to a fuzzy feature. Hence, the number ofinvolved columns is the number of selected features. Further-more, if we make the path involve minimum columns, then itsatisfies the cardinality of feature subset reaching a minimum.This completes the proof.

Noting that, the correctness of Lemma is independent of theconcrete similarity measure sim(md+i , md−i ), then other mea-sures also can be adopted instead of formula (2).

2.5. Algorithm design

According to Theorem, the optimal fuzzy feature subset se-lection problem can be transformed into a search for a path thatinvolves the least columns in FEM(P, N). If we use the greedyalgorithm, then it equals to select the columns with maximalnumber of THR elements.

Since the discussion above only related to two-class problem,we need to transfer more than two classes’ problems to severaltwo-class problems. We firstly consider one native class as Nand the remaining native classes as P, and get the current op-timal fuzzy feature subset. Subsequently, we select one nativeclass from P and consider it as N, the remaining native classesin P as new P to get optimal fuzzy feature subset. The pro-cess is repeated until only two native classes remained and getthe selection result. We unite the optimal fuzzy feature subsetsof all iterations to obtain the final result. If the data set has cclasses, then the number of established P is c−1. We illustratethis transformation with a virtual data set, which is classifiedinto three classes, i.e., C1, C2 and C3. There are three casesfor the establishing of P:

(a) In first step, C1 ∪ C2 ⇒ P , C3⇒ N , and in the secondstep C1⇒ P , C2⇒ N .

(b) In first step, C1 ∪ C3⇒ P , C2 ⇒ N , and in the secondstep C1⇒ P , C3⇒ N .

(c) In first step, C2 ∪ C3⇒ P , C1⇒ N , and in the secondstep C2⇒ P , C3⇒ N .

In each step we obtain the corresponding optimal feature subset,then unite them to get the final result. Based on the definitionof fuzzy extension matrix, similarity between all samples fromdifferent classes are compared in three cases, then the finalresults have similar classification performance and are robustto the exchanging of P and N.

From Definition 2, one can see that the “maximum” degreeof similarity is used. The maximum operation may result ininflexibility of the heuristic algorithm to some extent. It may

produce more than one column with the maximal number ofTHR elements. If so, we select one column such that the sumof its THR element values is minimum. Moreover, in orderto remove the noisy samples, we consider that if the numberof THR elements in the remaining fuzzy extension matrix isless than a given small number (threshold value), then termi-nate and output the final result. The proposed heuristic algo-rithm for fuzzy feature selection based on min–max rule andextension matrix is named as FFSEM, which is described asfollows.

Let � be the threshold for the number of remaining THRelements, fssup be the selected fuzzy feature subset in oneiteration, offs be the final result and |P | denote the number ofnative classes in positive class P.

Step 1: Initialization. The number of fuzzy features for eachoriginal feature, membership function � and similarity measureof fuzzy feature are determined.

Step 2: The original feature set is mapped to fuzzy featurespace ffs, and get the new representation for every sample indata set based on selected membership function, offs← �.

Step 3: Specify P, N, construct fuzzy extension matrixFEM(P, N) and get THR= Sim(P, N |ffs), fssup← �.

Step 4: Find ith column with maximal number of THR el-ements, which is corresponding to fuzzy feature ffi , fssup =fssup ∪ ffi .

(If there is more than one column with the maximal numberof THR elements, select one column such that the sum of itsTHR element values is minimum.)

Step 5: Remove the rows that include an THR element in theselected ith column of FEM(P, N), and then get updated fuzzyextension matrix FEM(P, N).

Step 6: If FEM(P, N)= � or the number of THR elementsin FEM(P, N) is less than �,

offs= offs ∪ fssupElse

Go to step 4End

Step 7: If |P |�2Go to step 3

Elseoutput offs

EndNoting that, in the implementation of the heuristic algorithmFFSEM, the fuzzy extension matrix FEM(P, N) does not bereally generated in memory. The time consumption exists inthe number of THR elements and removal rows need to beaggregated. So the algorithm has no much computational effortand space consumption. Furthermore, our algorithm adequatelyconsiders the value range of fuzzy feature, so it belongs to aprofessional algorithm for feature selection from fuzzy featurespace mentioned in Section 1.

3. Experiments

In Section 2.4, we have given theoretic proof to the proposedalgorithm FFSEM. In this section, we will compare FFSEM


with other three feature selection methods on classification per-formance and dimensionality reduction via experiments. Theselected methods are SFS [2], ReliefF [16,17] and the newlypresented algorithm in Ref. [18], we call it as Mitra’s. We brieflyintroduce these algorithms as follows.

SFS is widely used in feature selection, features are itera-tively added to an initial subset, in such a way that improvesa given criterion J taking into account those features alreadypresented in the solution. Here, we consider the accuracy ofK-nearest neighbors (K-NN) classifier as the evaluation crite-rion J and tend to find feature subsets better suited to the k-NNclassifier for SFS.

Mitra’s is to find the feature subset that is highly correlatedbased on the k nearest neighbors principle. First compute knearest features for each feature, among them the feature havingthe most compact subset, i.e., having the largest similarity to thefarthest neighbor, is selected, and its k neighbors are discarded.The process is repeated for the remaining features until all ofthem are considered. For determining the k nearest neighbors offeatures, assigning a constant error threshold (�), which is equalto the distance of kth nearest neighbor of selected feature in thefirst iteration. In the subsequent iterations, checking the distancebetween features whether is greater than � or not. If yes, thendecreases the value of k. So k may be changing over iterations.In the algorithm k controls the size of selected feature subset,since k determines the error threshold (�). Here, we designatethe k value to get the approximate or a little more number offeatures in selected feature subset than FFSEM and SFS.

Relief assigns different weights to every feature indicates thecorrelation of feature-class using a particular rule, and sortsthem in descending order. Then threshold the weight value todivide the feature set into relevant and irrelevant features. Re-liefF is the extension of Relief, which use several nearest hitsand misses and averages their contributions to weight of fea-ture. The ReliefF is more robust, can tolerate incomplete andnoisy data and can manage multi-class problems.

3.1. Data sets

Six real-world data sets are downloaded from UCI MachineLearning Repository [19] and used in the experiment.

Iris: This is a well-known benchmark data set which has 150examples, and they are classified into three classes, i.e., Setosa,Versicolor and Virginical. Each example is characterized byfour numerical features which are sepal length, sepal width,petal length, and petal width.

Pima Indian diabetes (DIAB): The data set contains 768 sam-ples from two classes, where 500 samples are from class 1 andthe remaining 268 samples are from class 2. Each sample isrepresented by eight features. The problem posed is to predictwhether a patient would test positive for diabetes according toWorld Health Organization criteria.

Wisconsin diagnostic breast cancer (WDBC): The data setconsists of 357 benign samples and 212 malignant samples,with 30 real-valued features. The task here is to predict diag-nosis results (benign or malignant).

Fig. 1. Three membership functions.

Fig. 2. Five membership functions.

Wine: The data set contains 178 samples from 3 classes. Eachsample is represented by 13 original features. These data arethe results of a chemical analysis of wines grown in the sameregion in Italy but derived from three different cultivars.

Ionosphere: The data set represents autocorrelation functionsof radar measurements. The task is to classify them into twoclasses denoting passage or obstruction in ionosphere. Thereare 351 instance and 34 attributes.

Sonar: The data set contains signals obtained from a varietyof different aspect angles, spanning 90◦ for the cylinder and180◦ for the rock. There are 208 samples, 60 features and 2classes.

3.2. Experimental setup

Here, the proposed algorithm FFSEM relies on generalcharacteristics of the data to evaluate and select feature sub-sets without involving any classification algorithm, and then itbelongs to filter model and its performance does not explicitlydepend on the adopted classifier. So we only use K-NN clas-sifier to evaluate the effectiveness of selected fuzzy featuresubset for classification, where K is assigned as 1, 3, 5 and7 and 5-fold cross-validation is used in the experiment. Tovalidate the performance of feature selection algorithm using


Table 1Details of benchmark data sets after fuzzification by three/five membershipfunctions

Data sets No. fuzzy features No. classes No. samples

Iris 12/20 3 150DIAB 24/40 2 768Wine 39/65 3 178WDBC 90/150 2 569Ionosphere 102/170 2 351Sonar 180/300 2 208

Table 2Experimental results of different cases for establishing P using different K-NN classifiers

Cases K = 1 K = 3 K = 5 K = 7Mean/DR% Mean/DR% Mean/DR% Mean/DR%

(a) 92.36/75.00 94.59/73.33 95.40/73.33 95.34/73.33(b) 92.67/66.67 94.83/66.67 95.40/66.67 95.22/66.67(c) 92.36/75.00 94.59/73.33 95.40/73.33 95.34/73.33

Mean: mean accuracy, DR: dimensionality reduction.

Table 3Experimental results for K-NN classifier with K = 1 for three membershipfunctions

Data sets Algorithms (parameter value) Mean%/Var. DR%

Iris SFS 90.70/0.08 78.33Mitra’s (k = 6) 91.18/0.08 68.33ReliefF 90.83/0.08 75.00FFSEM 92.36/0.08 75.00

DIAB SFS 64.66/0.22 83.33Mitra’s (k = 18) 61.09/0.23 80.83ReliefF 60.10/0.24 79.17FFSEM 67.05/0.22 90.00

WDBC SFS 93.46/0.06 95.11Mitra’s (k = 70) 90.65/0.09 90.67ReliefF 93.49/0.06 90.00FFSEM 93.92/0.06 96.67

Wine SFS 88.25/0.09 88.21Mitra’s (k = 30) 88.77/0.09 80.00ReliefF 91.36/0.08 82.05FFSEM 93.09/0.06 83.08

Ionosphere SFS 85.34/0.12 95.49Mitra’s (k = 80) 85.14/0.12 89.41ReliefF 82.18/0.14 91.18FFSEM 85.29/0.13 96.67

Sonar SFS 67.19/0.21 96.89Mitra’s (k = 160) 65.41/0.22 92.56ReliefF 69.26/0.21 95.00FFSEM 74.16/0.17 97.00

K-NN classifiers with different K values is also adopted inRefs. [5,20,21].

We use three membership functions and five membershipfunctions described in Figs. 1 and 2 to fuzzify the originalfeatures. Then, each original feature is fuzzified into three andfive fuzzy features, respectively. Such as, for Iris, the originalfeature sepal length is fuzzified as small sepal length (ssl),


























Table 7Experimental results for K-NN classifier with K = 1 for five membershipfunctions

Data sets Algorithms Mean%/Var. DR%

Iris SFS 88.86/0.10 68.00ReliefF 89.82/0.09 75.00FFSEM 90.48/0.07 80.00

DIAB SFS 65.87/0.22 87.00ReliefF 65.07/0.23 85.00FFSEM 64.67/0.23 90.00

WDBC SFS 92.28/0.07 95.47ReliefF 92.45/0.07 94.77FFSEM 92.86/0.06 97.47

Wine SFS 82.56/0.14 98.03ReliefF 77.25/0.17 89.85FFSEM 83.96/0.14 96.92

Ionosphere SFS 78.87/0.22 96.71ReliefF 81.69/0.15 91.76FFSEM 80.76/0.15 94.59

Sonar SFS 67.14/0.21 96.96ReliefF 68.33/0.20 95.00FFSEM 69.39/0.19 98.87

medium sepal length (msl) and large sepal length (lsl) usingthree membership functions; sepal width is fuzzified as smallsepal width (ssw), medium sepal width (msw) and large sepalwidth (lsw); petal length is fuzzified as small petal length(spl), medium petal length (mpl) and large petal length (lpl);

















petal width is fuzzified as small petal width (spw), mediumpetal width (mpw) and large petal width (lpw). For each sam-ple, the values of these fuzzy features are membership degreegot from corresponding membership functions. So do otherdata sets. For five membership functions, the process is thesame except for using different linguistic terms, such as verysmall, small, medium, large and very large. This paper doesnot investigate in detail the used linguistic terms. The detailedinformation about the data sets after fuzzification is listedin Table 1.


3.3. Experimental results

All the algorithms are implemented using Matlab compilerand the termination for FFSEM is the number of remainingTHR elements less than the 1% of original number of THRelements.









Table 11The best classification results among different K-NN classifiers for three and five membership functions (3-MF and 5-MF)

Data sets Algorithms 3-MF Mean%/Var. 5-MF Mean%/Var. 3-MF DR% 5-MF DR%

Iris FFSEM 95.40/0.04 (5-NN) 91.18/0.06 (3-NN) 73.33 80.00SFS 94.62/0.05 (7-NN) 88.86/0.10 (1-NN) 76.67 68.00ReliefF 94.89/0.05 (5-NN) 89.92/0.09 (1-NN) 75.00 75.00Mitra’s 94.20/0.05 (3-NN) – 65.00 –

DIAB FFSEM 72.27/0.20 (7-NN) 68.58/0.21 (5-NN) 90.93 90.00SFS 73.47/0.19 (7-NN) 72.23/0.20 (5-NN) 84.17 84.00ReliefF 66.64/0.22 (5-NN) 70.71/0.23 (5-NN) 79.17 85.00Mitra’s 65.28/0.22 (7-NN) – 80.00 –

WDBC FFSEM 96.04/0.04 (7-NN) 94.55/0.05 (5-NN) 96.67 97.60SFS 93.46/0.07 (1-NN) 92.29/0.07 (7-NN) 95.11 95.18ReliefF 95.09/0.05 (7-NN) 93.49/0.06 (3-NN) 90.00 94.61Mitra’s 92.78/0.07 (5-NN) – 90.56 –

Wine FFSEM 93.09/0.06 (1-NN) 83.96/0.14 (1-NN) 83.08 96.92SFS 89.98/0.09 (3-NN) 82.56/0.14 (1-NN) 85.64 98.03ReliefF 91.36/0.08 (1-NN) 77.71/0.17 (7-NN) 82.05 89.85Mitra’s 88.77/0.09 (1-NN) – 80.00 –

Ionosphere FFSEM 85.29/0.13 (1-NN) 82.55/0.14 (7-NN) 96.67 94.59SFS 86.94/0.11 (3-NN) 81.60/0.15 (5-NN) 95.69 94.35ReliefF 83.59/0.13 (5-NN) 81.69/0.15 (1-NN) 91.18 91.76Mitra’s 85.14/0.12 (1-NN) – 89.41 –

Sonar FFSEM 74.16/0.20 (1-NN) 69.39/0.19 (1-NN) 97.00 98.89SFS 67.19/0.21 (1-NN) 67.14/0.20 (1-NN) 96.89 96.96ReliefF 73.06/0.20 (5-NN) 69.05/0.19 (3-NN) 95.00 95.00Mitra’s 66.82/0.22 (5-NN) – 92.22 –

We first compare the experimental results of the establishingof P in three cases for Iris. The selected feature subsets in mostcross-validations are shown as follows:

(a) In first step, Setosa ∪ Versicolor ⇒ P , Virginical ⇒ N .Get the feature subset fssup = {mpl, lpw}.In the second step, Setosa⇒ P , Versicolor⇒ N . Get thefeature subset fssup = {spl}.Final results offs= {spl, mpl, lpw}.

(b) In first step, Setosa ∪ Virginical ⇒ P , Versicolor ⇒ N .Get the feature subset fssup = {lpl,mpw}.In the second step, Setosa⇒ P , Virginical⇒ N . Get thefeature subset fssup = {spl, spw}.Final results offs= {spl, lpl, spw, mpw}.

(c) In first step, Versicolor ∪ Virginical ⇒ P , Setosa ⇒ N .Get the feature subset fssup = {spl}.In the second step, Versicolor⇒ P , Virginical⇒ N . Getthe feature subset fssup = {mpl, lpw}.Final results offs= {spl, mpl, lpw}.

The classification accuracy of final results in three cases arelisted in Table 2. The experimental results have shown, althoughthe final results are different in three cases, however, given aK-NN classifier, they achieve the similar classification perfor-mance. Then we pay more attention to the algorithm design offuzzy feature selection and ignore the different assignments ofP and N, and we only consider one case for multi-class dataset in the following experiments.

We also conduct experiments to compare the performance ofFFSEM with SFS, ReliefF and Mitra’s. The results are shown


in Tables 3–6 for three membership functions and inTables 7–10 for five membership functions, which are corre-sponding to K = 1, 3, 5 and 7 for K-NN classifier. For eachsample, there are many fuzzy features having zero value afterfuzzification via five membership functions, then Mitra’s can-not effectively calculate the feature similarity with warning ofdivided by zero. So we only list the results of FFSEM, SFS andReliefF for five membership functions. The average dimen-sionality reduction rate is listed in the last column (from left toright). In order to demonstrate the effects of the preprocessingof feature selection via FFSEM, we systematically analyzethe experimental results above and give the summary shownin Table 11, which presents the highest classification accuracyamong K-NN (K = 1, 3, 5, 7) classifiers and correspondingdimensionality reduction for different numbers of membershipfunctions. All these results have shown that, although the K-NNclassifiers with different K values have changing classificationaccuracy for the specific feature selection algorithm, however,FFSEM always can get better trade-off between the classifica-tion accuracy and dimensionality reduction, and in most casesit can get highest classification accuracy or highest dimension-ality reduction without sacrificing dimensionality reduction orclassification performance compared with SFS, Mitra’s andReliefF for different number of membership functions anddifferent K values. In addition, we can observe that the morenumber of membership functions cannot lead to higher per-formance in most cases, which is caused by the restriction ofthe number of training samples. In other words, the numberof fuzzy features is increasing, while the number of trainingsamples does not grow, then leads to over-fitting for featureselection. On the other hand, it is well-known that K-NN clas-sifier belongs to distance-based classifier, which classifies eachunlabeled example by the majority label among its K-NNs inthe training sets. Its performance thus depends crucially on theidentification of nearest neighbors, which is influenced by thedistance between classes. According to the evaluation criteriondefined in Section 2.2, we know that the selected features candecrease the similarity between classes and then increase thedistance between the classes, i.e., they enlarge the “margin”between classes. So our algorithm implicitly improves theperformance of K-NN classifier.

4. Discussions and conclusions

This paper only discusses the fuzzy feature selection fromfuzzy feature space under the condition shown in Figs. 1 and2. However, the error rate of the classifier may not decreasewith a larger number of fuzzy features for each original fea-ture, and surely increases the computational overhead. The au-tomatic determination of the number of fuzzy features for eachoriginal feature may be performed using the fuzzy clustering.Of course, the number of fuzzy features are also restricted bythe number of samples in the training set. Furthermore, one canargue about the applicability of nonlinear membership function,e.g. exponential or sigmoid function. The choice, however, canbe specified by expert knowledge. A nonlinear portioning ofthe original feature space may help to select optimal fuzzy sets

and improve the accuracy of classifier. However, we focus onfuzzy feature selection algorithm and not discuss in detail howto choose the membership function and determine the numberof fuzzy feature for each original feature.

The paper presents an algorithm to perform feature selec-tion after the fuzzy projection of original feature space, wherethe evaluation criterion is both the overlapping (similarity) be-tween classes and the cardinality of optimal fuzzy feature sub-set are minimum, and fuzzy extension matrix is constructed assearch strategy. The algorithm is proved by theory and shownby experiments to get better classification accuracy and dimen-sionality reduction.

Acknowledgments

We gratefully thank Dr. P. Mitra, who is the author of Ref.[18], for his code of Mitra’s algorithm, and Professor B.L.Lu for his instruction. This work was done in part while thefirst author was a Ph.D candidate at Department of Computer,ChongQing University, PR China.

References

[1] M.R. Rezaee, B. Goedhart, et al., Fuzzy feature selection, PatternRecognition 32 (1999) 2011–2019.

[2] H. Liu, H. Motoda, Feature selection for Knowledge Discovery and DataMining, Kluwer Academic Publishers, Boston, 1998.

[3] H. Liu, L. Yu, Toward integrating feature selection algorithms forclassification and clustering, IEEE Trans. Knowl. Data Eng. 17 (4) (2005)491–502.

[4] E.C.C. Tsang, D.S. Yesung, X.Y. Wang, OFFSS: optimal fuzzy-valuedfeature subset selection, IEEE Trans. Fuzzy Syst. 11 (2) (2003) 202–213.

[5] S.K. Pal, R.K. De, J. Basak, Unsupervised feature evaluation: a neuro-fuzzy approach, IEEE Trans. Neural Network 11 (3) (2000) 366–376.

[6] J. Basak, R.K. De, S.K. Pal, Unsupervised feature selection using aneuro-fuzzy approach, Pattern Recognition Lett. 19 (1998) 997–1006.

[7] J. Basak, R.K. De, S.K. Pal, Unsupervised neuro-fuzzy feature selection,in: Proceedings of the IEEE International Joint Conference on NeuralNetwork, vol. 1, 1998, pp. 18–23.

[8] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combing classifiers,IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 226–239.

[9] B.L. Lu, K.A. Wang, M. Utiyama, H. Isahara, A part-versus-partmethod for massively parallel training of support vector machines, in:Proceedings of the International Joint Conference Neural Networks’04,Budapest, July 25–29, 2004, pp. 735–740.

[10] B.L. Lu, M. Ito, Task decomposition and module combination basedon class relations: a modular neural network for pattern classification,IEEE Trans. Neural Networks 10 (1999) 1244–1256.

[11] B.L. Lu, J. Shin, M. Ichikawa, Massively parallel classification of single-trial EEG signals using a min–max modular neural network, IEEE Trans.Biomed. Eng. 51 (3) (2004) 551–558.

[12] F.Y. Liu, K. Wu, H. Zhao, B.L. Lu, Fast text categorization with min–maxmodular support vector machines, in: Proceedings of the InternationalJoint Conference on Neural Networks’05, Montreal, Quebec, Canada,July 31–August 4, 2005, pp. 570–575.

[13] L.C. Molina, L. Belanche, A. Nebot, Feature selection algorithm: a surveyand experimental evaluation, in: IEEE Conference on Data Mining, 2002,pp. 306–313.

[14] I. Guyon, A. Elisseeff, An introduction to variable and feature selection,J. Mach. Learn. Res. 3 (2003) 1158–1172.

[15] J.R. Hong, AE1: an extension matrix approximate method for the generalcovering problem, Int. J. Comput. Inf. Sci. 12 (1985) 421–437.

[16] I. Kononerko, Estimating attributes analysis and extension of RELIEF,in: Proceedings of the European Conference on Machine Learning, 1994,pp. 171–182.


[17] M.K. Sikonja, L. Kononenko, An adaptation of relief for attributeestimation in regression, in: Proceedings of the 14th InternationalConference on Machine Learning, 1997, pp. 296–304.

[18] P. Mitra, C.A. Murthy, S.K. Pal, Unsupervised feature selection usingfeature similarity, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002)301–312.

[19] C.J. Merz, P.M. Murphy, UCI repository of machine learning database,1996, 〈http://www.ics.uci.edu/mlearn/MLRepository.html〉.

[20] K.Z. Mao, Identifying critical variables of principal components forunsupervised feature selection, IEEE Trans. Syst. Man and Cybern. BCybern. 2 (35) (2005) 339–344.

[21] H.L. Wei, S.A. Billings, Feature subset selection and ranking for datadimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1)(2007) 162–166.

About the Author—YUN LI received his Ph.D. degree in Software and Theory of Computer from ChongQing University (CQU), China in 2005, and nowhe is a Postdoctor at Department of Computer Science and Engineering, Shanghai Jiao Tong University (SJTU), China. His research interests are in the areaof pattern recognition, data mining and content-based image retrieval.

About the Author—ZHONG-FU WU is a professor in ChongQing University, China. His fields of interest include pattern recognition, data mining and gridcomputing.

http://www.ics.uci.edu/mlearn/MLRepository.html

Documents

Fuzzy feature selection based on min–max learning rule and extension matrix