Research Article
A Hybrid Feature Selection Method Based on Rough Conditional Mutual Information and Naive Bayesian Classifier

Zilin Zeng,1,2 Hongjun Zhang,1 Rui Zhang,1 and Youliang Zhang1

1 PLA University of Science & Technology, Nanjing 210007, China
2 Nanchang Military Academy, Nanchang 330103, China

Correspondence should be addressed to Zilin Zeng; beauty1981@sohu.com and Hongjun Zhang; jsnjzhj@263.net

Received 17 December 2013; Accepted 12 February 2014; Published 30 March 2014

Academic Editors: A. Bellouquid, S. Biringen, H. C. So, and E. Yee

Copyright © 2014 Zilin Zeng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We introduce a novel hybrid feature selection method based on rough conditional mutual information and the naive Bayesian classifier. Conditional mutual information is an important metric in feature selection, but it is hard to compute. We introduce a new measure called rough conditional mutual information, which is based on rough sets; it is shown that the new measure can substitute for Shannon's conditional mutual information. Thus rough conditional mutual information can also be used to filter out irrelevant and redundant features. Subsequently, to reduce the number of features and improve classification accuracy, a wrapper approach based on the naive Bayesian classifier is used to search for the optimal feature subset within the space of the candidate feature subset selected by the filter model. Finally, the proposed algorithms are tested on several UCI datasets and compared with other classical feature selection methods. The results show that our approach obtains not only high classification accuracy but also the smallest number of selected features.

1. Introduction

With the increase of data dimensionality in many domains such as bioinformatics, text categorization, and image recognition, feature selection has become one of the most important data mining preprocessing methods. The aim of feature selection is to find a minimal feature subset of the original dataset that is the most characterizing. Since feature selection brings many advantages, such as avoiding overfitting, facilitating data visualization, reducing storage requirements, and reducing training times, it has attracted considerable attention in various areas [1].

In the past two decades, different techniques have been proposed to address these challenging tasks. Dash and Liu [2] point out that there are four basic steps in a typical feature selection method, that is, subset generation, subset evaluation, stopping criterion, and validation. Most studies focus on the two major steps of feature selection: subset generation and subset evaluation. According to the subset evaluation function, feature selection methods can be divided into two categories: filter methods and wrapper methods [3]. Filter methods are independent of the predictor, whereas wrapper methods use its predictive power as the evaluation function. The merits of filter methods are their high computational efficiency and generality. However, the result of a filter method is not always satisfactory. This is because the filter model separates feature selection from classifier learning and selects feature subsets that are independent of the learning algorithm. On the other hand, wrapper methods generally yield good results, but they are very slow when applied to large datasets.

In this paper, we propose a new algorithm which combines rough conditional mutual information and the naive Bayesian classifier to select features. First, in order to decrease the computational cost of the wrapper search, a candidate feature set is selected by using rough conditional mutual information. Second, the candidate feature subset is further refined by a wrapper procedure. We thus take advantage of both the filter and the wrapper. The main goal of our research is to obtain a small number of features while the classification accuracy remains very high. This approach makes it possible to apply the filter-wrapper model efficiently on several datasets from the UCI repository [4], obtaining better results than other classical feature selection approaches.

Hindawi Publishing Corporation, ISRN Applied Mathematics, Volume 2014, Article ID 382738, 11 pages, http://dx.doi.org/10.1155/2014/382738


In the remainder of the paper, related work is first discussed in the next section. Section 3 presents the preliminaries on Shannon's entropy and rough sets. Section 4 introduces the definitions of the rough uncertainty measures and discusses their properties and interpretation. The proposed hybrid feature selection method is delineated in Section 5. The experimental results are presented in Section 6. Finally, a brief conclusion is given in Section 7.

2. Related Work

In filter-based feature selection techniques, a number of relevance measures have been applied to measure the usefulness of features for predicting decisions. These relevance measures can be divided into four categories: distance, dependency, consistency, and information. The most prominent distance-based method is Relief [5]. This method uses Euclidean distance to select relevant features. Since Relief works only for binary classes, Kononenko generalized it to multiple classes, calling it Relief-F [6, 7]. However, Relief and Relief-F are unable to detect redundant features. Dependence measures or correlation measures quantify the ability to predict the value of one variable from the value of another variable. Hall's correlation-based feature selection (CFS) algorithm [8] is a typical representative of this category. Consistency measures try to preserve the discriminative power of the data in the original feature space; rough set theory is a popular technique of this sort [9]. Among these measures, mutual information (MI) is the most widely used one for computing relevance. MI is a well-known concept from information theory and has been used to capture the relevance and redundancy among features. In this paper, we focus on the information-based measure in the filter model.

The main advantages of MI are its robustness to noise and to transformations. In contrast to other measures, MI is not limited to linear dependencies but captures nonlinear ones as well. Since Battiti proposed the mutual information feature selector (MIFS) [10], more and more researchers have begun to study information-based feature selection. MIFS selects the feature that maximizes the information about the class, corrected by subtracting a quantity proportional to the average MI with the previously selected features. Battiti demonstrated that MI can be very useful in feature selection problems and that MIFS, owing to its simplicity, can be used with any classifying system, whatever the learning algorithm may be. Kwak and Choi [11] analyzed the limitations of MIFS and proposed a method called MIFS-U, which in general makes a better estimation of the MI between input attributes and output classes than MIFS. They showed that MIFS does not work well in nonlinear problems and proposed MIFS-U to improve MIFS for such problems. Another variant of MIFS is the min-redundancy max-relevance (mRMR) criterion [12]. That work presented a theoretical analysis of the relationships among max-dependency, max-relevance, and min-redundancy and proved that mRMR is equivalent to max-dependency for first-order incremental search.
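For concreteness, the greedy scores that MIFS and mRMR maximize at each step can be written as follows; these are the standard forms of the criteria in [10, 12], restated here only to make the redundancy term and the coefficient β explicit.

$$J_{\mathrm{MIFS}}(f) = I(f; C) - \beta \sum_{s \in S} I(f; s), \qquad
J_{\mathrm{mRMR}}(f) = I(f; C) - \frac{1}{|S|}\sum_{s \in S} I(f; s),$$

where $C$ is the class variable, $S$ is the set of already-selected features, and $\beta$ is the configurable redundancy coefficient discussed below.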

The limitations of the MIFS, MIFS-U, and mRMR algorithms are as follows. Firstly, they are all incremental search schemes that select one feature at a time. At each pass these methods select the single feature with the maximum criterion value, without considering the interaction between groups of features. In many classification problems, a group of several features occurring together is relevant even though no individual feature is, as in the XOR problem. Secondly, the coefficient β is a configurable parameter which must be set experimentally. Thirdly, they are not accurate enough to quantify the dependency among features with respect to a given decision.

Assume $X = \{X_1, X_2, \ldots, X_n\}$ is an input feature set and $Y$ is a target; our task is to select $m$ ($m < n$) features from the pool such that their joint mutual information $I(X_1, X_2, \ldots, X_m; Y)$ is maximized. However, the estimation of mutual information from the available data is a great challenge, especially for multivariate mutual information. Martínez Sotoca and Pla [13] and Guo et al. [14] proposed different methods to approximate multivariate conditional mutual information. Nevertheless, their proofs are all based on the same inequality, that is, $I(X; Y \mid Z) \le I(X; Y)$. This inequality does not hold in general; it holds only if the random variables $X$, $Y$, and $Z$ satisfy the Markov property. Many researchers have tried various methods to estimate mutual information. The most common methods are the histogram [15], kernel density estimation (KDE) [16], and k-nearest neighbor estimation (K-NN) [17]. The standard histogram partitions the axes into distinct bins of fixed width and then counts the number of observations in each bin; therefore this estimation method is highly dependent on the choice of the bin width. Although KDE is better than the histogram, the bandwidth and kernel function are difficult to choose. The K-NN approach uses a fixed number of nearest neighbors to estimate the MI, but it is mainly suitable for continuous random variables.
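To make the bin-width sensitivity of the histogram estimator concrete, here is a minimal plug-in MI estimator in Python/NumPy. It is only an illustrative sketch under arbitrary sample data and bin counts, not the estimator used in any of the cited works.

```python
import numpy as np

def mi_histogram(x, y, bins=10):
    """Plug-in estimate of I(X;Y) in bits from a 2-D histogram of the samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                      # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)        # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)        # marginal p(y)
    nz = pxy > 0                               # skip empty bins to avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + 0.5 * rng.normal(size=2000)
# The estimate shifts noticeably with the number of bins, which is the
# drawback of the histogram approach noted above.
print(mi_histogram(x, y, bins=8), mi_histogram(x, y, bins=64))
```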

This paper computes multivariate mutual information and multivariate conditional mutual information from a new perspective. Our method is based on a rough-entropy uncertainty measure. Several authors [18–21] have used Shannon's entropy and its variants to measure uncertainty in rough set theory. In this work we propose several rough entropy-based metrics and derive some important properties of, and relationships among, these uncertainty measures. We then find a candidate feature subset by using rough conditional mutual information to filter out the irrelevant and redundant features in the first stage. To overcome the limitations of the filter model, in the second stage we use the wrapper model with a sequential backward elimination scheme to search for an optimal feature subset within the candidate feature subset.

3. Preliminaries

In this section we briefly introduce some basic concepts and notation of information theory and rough set theory.

3.1. Entropy, Mutual Information, and Conditional Mutual Information. Shannon's information theory, first introduced in 1948 [22], provides a way to measure the information of random variables. The entropy is a measure of the uncertainty of a random variable [23]. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a discrete random variable and let $p(x_i)$ be the probability of $x_i$; the entropy of $X$ is defined by

$$H(X) = -\sum_{i=1}^{n} p(x_i)\log p(x_i). \tag{1}$$

Here the base of the logarithm is 2 and the unit of entropy is the bit. If $X$ and $Y = \{y_1, y_2, \ldots, y_m\}$ are two discrete random variables, the joint probability is $p(x_i, y_j)$, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. The joint entropy of $X$ and $Y$ is

$$H(X, Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log p(x_i, y_j). \tag{2}$$

When certain variables are known and others are not, the remaining uncertainty is measured by the conditional entropy:

$$H(X \mid Y) = H(X, Y) - H(Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log p(x_i \mid y_j). \tag{3}$$

The information shared by two random variables is of importance and is defined as the mutual information between the two variables:

$$I(X; Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log\frac{p(x_i \mid y_j)}{p(x_i)}. \tag{4}$$

If the mutual information between two random variables is large (small), the two variables are closely (not closely) related. If the mutual information is zero, the two random variables are totally unrelated, that is, independent. The mutual information and the entropy are related as follows:

$$\begin{aligned}
I(X; Y) &= H(X) - H(X \mid Y),\\
I(X; Y) &= H(Y) - H(Y \mid X),\\
I(X; Y) &= H(X) + H(Y) - H(X, Y),\\
I(X; X) &= H(X).
\end{aligned} \tag{5}$$

For continuous random variables, the entropy and mutual information are defined as

$$H(X) = -\int p(x)\log p(x)\,dx, \qquad
I(X; Y) = \int p(x, y)\log\frac{p(x, y)}{p(x)p(y)}\,dx\,dy. \tag{6}$$

Conditional mutual information is the reduction in the uncertainty of $X$ due to knowledge of $Y$ when $Z$ is given. The conditional mutual information of random variables $X$ and $Y$ given $Z$ is defined by

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z). \tag{7}$$

Mutual information satisfies a chain rule; that is,

$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1). \tag{8}$$
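As a quick illustration of definitions (1), (4), and (7), the following minimal Python sketch estimates entropy, mutual information, and conditional mutual information from plug-in frequency counts over discrete samples. The toy XOR data at the end are ours and simply show a variable that carries no marginal information about a target but a full bit of conditional information.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = -sum p(x) log2 p(x), with probabilities taken from sample frequencies."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))

if __name__ == "__main__":
    x = [0, 0, 1, 1, 0, 1, 1, 0]
    y = [0, 1, 0, 1, 0, 1, 0, 1]
    z = [(a + b) % 2 for a, b in zip(x, y)]          # z = XOR(x, y)
    print(mutual_information(x, z))                  # 0.0 bits: x alone says nothing about z
    print(conditional_mutual_information(x, z, y))   # 1.0 bit once y is known
```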

3.2. Rough Sets. Rough set theory, introduced by Pawlak [24], is a mathematical tool to handle imprecision, uncertainty, and vagueness. It has been applied in many fields [25], such as machine learning, data mining, and pattern recognition.

The notion of an information system provides a convenient basis for the representation of objects in terms of their attributes. An information system is a pair $(U, A)$, where $U$ is a nonempty finite set of objects called the universe and $A$ is a nonempty finite set of attributes; that is, $a : U \to V_a$ for $a \in A$, where $V_a$ is called the domain of $a$. A decision table is a special case of an information system $S = (U, A \cup \{d\})$, where the attributes in $A$ are called condition attributes and $d$ is a designated attribute called the decision attribute.

For every set of attributes $B \subseteq A$, an indiscernibility relation $\mathrm{IND}(B)$ is defined in the following way. Two objects $x_i$ and $x_j$ are indiscernible by the set of attributes $B$ in $A$ if $b(x_i) = b(x_j)$ for every $b \in B$. An equivalence class of $\mathrm{IND}(B)$ is called an elementary set in $B$ because it represents the smallest discernible group of objects. For any element $x_i$ of $U$, the equivalence class of $x_i$ in the relation $\mathrm{IND}(B)$ is denoted $[x_i]_B$. For $B \subseteq A$, the indiscernibility relation $\mathrm{IND}(B)$ constitutes a partition of $U$, which is denoted by $U/\mathrm{IND}(B)$.

Given an information system $S = (U, A)$, for any subset $X \subseteq U$ and equivalence relation $\mathrm{IND}(B)$, the $B$-lower and $B$-upper approximations of $X$ are defined, respectively, as follows:

$$\underline{B}(X) = \{x \in U : [x]_B \subseteq X\}, \qquad
\overline{B}(X) = \{x \in U : [x]_B \cap X \ne \emptyset\}. \tag{9}$$
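The partition $U/\mathrm{IND}(B)$ and the lower and upper approximations of equation (9) are easy to compute directly from a data table; the following short Python sketch (with a made-up toy table) does exactly that, representing objects by their row indices.

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """U/IND(B): group row indices that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())

def lower_upper(table, attrs, target):
    """B-lower and B-upper approximations of a target set of row indices."""
    lower, upper = set(), set()
    for block in equivalence_classes(table, attrs):
        if block <= target:        # block entirely inside the target set
            lower |= block
        if block & target:         # block overlaps the target set
            upper |= block
    return lower, upper

if __name__ == "__main__":
    U = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1)]   # toy table, two condition attributes
    X = {0, 1, 2}                                   # a subset of objects
    print(equivalence_classes(U, [0, 1]))           # [{0, 1}, {2}, {3, 4}]
    print(lower_upper(U, [0, 1], X))                # ({0, 1, 2}, {0, 1, 2})
```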

4. Rough Entropy-Based Metrics

In this section the concept of rough entropy is introduced to measure the uncertainty of knowledge in an information system, and then some rough entropy-based uncertainty measures are presented. Some important properties of these uncertainty measures are deduced, and the relationships among them are discussed as well.

Definition 1. Given a set of samples $U = \{x_1, x_2, \ldots, x_n\}$ described by the feature set $F$, and a subset of attributes $S \subseteq F$, the rough entropy of a sample $x_i$ is defined by

$$\mathrm{RH}_{x_i}(S) = -\log\frac{|[x_i]_S|}{n}, \tag{10}$$

and the average entropy of the set of samples is computed as

$$\mathrm{RH}(S) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n}, \tag{11}$$

where $|[x_i]_S|$ is the cardinality of $[x_i]_S$.

Since for all $x_i$, $\{x_i\} \subseteq [x_i]_S \subseteq U$, we have $1/n \le |[x_i]_S|/n \le 1$ and therefore $0 \le \mathrm{RH}(S) \le \log n$. $\mathrm{RH}(S) = \log n$ if and only if $|[x_i]_S| = 1$ for all $x_i$, that is, $U/\mathrm{IND}(S) = \{\{x_1\}, \{x_2\}, \ldots, \{x_n\}\}$. $\mathrm{RH}(S) = 0$ if and only if $|[x_i]_S| = n$ for all $x_i$, that is, $U/\mathrm{IND}(S) = \{U\}$. Obviously, when the knowledge $S$ can distinguish any two objects, the rough entropy is largest; when the knowledge $S$ cannot distinguish any two objects, the rough entropy is zero.

Theorem 2. $\mathrm{RH}(S) = H(S)$, where $H(S)$ is Shannon's entropy.

Proof. Suppose $U = \{x_1, x_2, \ldots, x_n\}$ and $U/\mathrm{IND}(S) = \{X_1, X_2, \ldots, X_m\}$, where $X_i = \{x_{i1}, x_{i2}, \ldots, x_{i|X_i|}\}$; then $H(S) = -\sum_{i=1}^{m}(|X_i|/n)\log(|X_i|/n)$. Because $X_i \cap X_j = \emptyset$ for $i \ne j$ and $X_i = [x]_S$ for any $x \in X_i$, we have

$$\begin{aligned}
\mathrm{RH}(S) &= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n}
= \sum_{x\in X_1}\Bigl(-\frac{1}{n}\log\frac{|[x]_S|}{n}\Bigr) + \cdots + \sum_{x\in X_m}\Bigl(-\frac{1}{n}\log\frac{|[x]_S|}{n}\Bigr)\\
&= \Bigl(-\frac{|X_1|}{n}\log\frac{|X_1|}{n}\Bigr) + \cdots + \Bigl(-\frac{|X_m|}{n}\log\frac{|X_m|}{n}\Bigr)
= -\sum_{i=1}^{m}\frac{|X_i|}{n}\log\frac{|X_i|}{n} = H(S).
\end{aligned} \tag{12}$$

Theorem 2 shows that the rough entropy equals Shannon's entropy.
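Theorem 2 is easy to check numerically: the per-object form of RH(S) in (11) regroups into the per-block form of Shannon's entropy over the partition $U/\mathrm{IND}(S)$. The sketch below (the toy table and helper names are ours) computes both quantities and prints identical values.

```python
from collections import defaultdict
from math import log2

def equivalence_classes(table, attrs):
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def rough_entropy(table, attrs):
    """RH(S) = -(1/n) * sum_i log2(|[x_i]_S| / n); each of the |b| objects in a
    block b contributes the same term log2(|b|/n)."""
    n = len(table)
    blocks = equivalence_classes(table, attrs)
    return -sum(len(b) * log2(len(b) / n) for b in blocks) / n

def shannon_entropy(table, attrs):
    """H(S) = -sum_k (|X_k|/n) log2(|X_k|/n), summed block by block."""
    n = len(table)
    return -sum((len(b) / n) * log2(len(b) / n) for b in equivalence_classes(table, attrs))

if __name__ == "__main__":
    U = [(0, 'a'), (0, 'a'), (0, 'b'), (1, 'b'), (1, 'b'), (1, 'a')]
    print(rough_entropy(U, [0, 1]), shannon_entropy(U, [0, 1]))  # the two values coincide
```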

Definition 3. Suppose $R, S \subseteq F$ are two subsets of attributes; the joint rough entropy is defined as

$$\mathrm{RH}(R, S) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{n}. \tag{13}$$

Since $[x_i]_{R\cup S} = [x_i]_R \cap [x_i]_S$, we have $\mathrm{RH}(R, S) = -(1/n)\sum_{i=1}^{n}\log(|[x_i]_R \cap [x_i]_S|/n)$. From Definition 3 we can observe that $\mathrm{RH}(R, S) = \mathrm{RH}(R \cup S)$.

Theorem 4. $\mathrm{RH}(R, S) \ge \mathrm{RH}(R)$ and $\mathrm{RH}(R, S) \ge \mathrm{RH}(S)$.

Proof. For all $x_i \in U$ we have $[x_i]_{S\cup R} \subseteq [x_i]_S$ and $[x_i]_{S\cup R} \subseteq [x_i]_R$, and hence $|[x_i]_{S\cup R}| \le |[x_i]_S|$ and $|[x_i]_{S\cup R}| \le |[x_i]_R|$. Therefore $\mathrm{RH}(R, S) \ge \mathrm{RH}(R)$ and $\mathrm{RH}(R, S) \ge \mathrm{RH}(S)$.

Definition 5. Suppose $R, S \subseteq F$ are two subsets of attributes; the conditional rough entropy of $R$ given $S$ is defined as

$$\mathrm{RH}(R \mid S) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{|[x_i]_S|}. \tag{14}$$

Theorem 6 (chain rule). $\mathrm{RH}(R \mid S) = \mathrm{RH}(R, S) - \mathrm{RH}(S)$.

Proof.

$$\mathrm{RH}(R, S) - \mathrm{RH}(S)
= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{n} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n}
= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{|[x_i]_S|}
= \mathrm{RH}(R \mid S). \tag{15}$$

Definition 7. Suppose $R, S \subseteq F$ are two subsets of attributes; the rough mutual information of $R$ and $S$ is defined as

$$\mathrm{RI}(R; S) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R\cup S}|}{|[x_i]_R|\cdot|[x_i]_S|}. \tag{16}$$

Theorem 8 (the relation between rough mutual information and rough entropy). The following hold:
(1) $\mathrm{RI}(R; S) = \mathrm{RI}(S; R)$;
(2) $\mathrm{RI}(R; S) = \mathrm{RH}(R) + \mathrm{RH}(S) - \mathrm{RH}(R, S)$;
(3) $\mathrm{RI}(R; S) = \mathrm{RH}(R) - \mathrm{RH}(R \mid S) = \mathrm{RH}(S) - \mathrm{RH}(S \mid R)$.

Proof. Conclusions (1) and (3) are straightforward; here we give the proof of property (2):

$$\begin{aligned}
\mathrm{RH}(R) + \mathrm{RH}(S) - \mathrm{RH}(R, S)
&= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_R|}{n} - \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{n}\\
&= \frac{1}{n}\sum_{i=1}^{n}\Bigl(-\log\frac{|[x_i]_R|}{n} - \log\frac{|[x_i]_S|}{n} + \log\frac{|[x_i]_{R\cup S}|}{n}\Bigr)\\
&= \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R\cup S}|}{|[x_i]_R|\cdot|[x_i]_S|} = \mathrm{RI}(R; S).
\end{aligned} \tag{17}$$

Definition 9. The rough conditional mutual information of $R$ and $S$ given $T$ is defined by

$$\mathrm{RI}(R; S \mid T) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S\cup T}|\cdot|[x_i]_T|}{|[x_i]_{R\cup T}|\cdot|[x_i]_{S\cup T}|}. \tag{18}$$

Theorem 10. The following equations hold:
(1) $\mathrm{RI}(R; S \mid T) = \mathrm{RH}(R \mid T) - \mathrm{RH}(R \mid S, T)$;
(2) $\mathrm{RI}(R; S \mid T) = \mathrm{RI}(R; S, T) - \mathrm{RI}(R; T)$.

Proof. (1)

$$\begin{aligned}
\mathrm{RH}(R \mid T) - \mathrm{RH}(R \mid S, T)
&= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup T}|}{|[x_i]_T|} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S\cup T}|}{|[x_i]_{S\cup T}|}\\
&= \frac{1}{n}\sum_{i=1}^{n}\Bigl(-\log\frac{|[x_i]_{R\cup T}|}{|[x_i]_T|} + \log\frac{|[x_i]_{R\cup S\cup T}|}{|[x_i]_{S\cup T}|}\Bigr)\\
&= \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S\cup T}|\cdot|[x_i]_T|}{|[x_i]_{R\cup T}|\cdot|[x_i]_{S\cup T}|} = \mathrm{RI}(R; S \mid T).
\end{aligned} \tag{19}$$

(2)

$$\begin{aligned}
\mathrm{RI}(R; S, T) - \mathrm{RI}(R; T)
&= \mathrm{RI}(R; S \cup T) - \mathrm{RI}(R; T)\\
&= \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R\cup S\cup T}|}{|[x_i]_R|\cdot|[x_i]_{S\cup T}|}
- \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R\cup T}|}{|[x_i]_R|\cdot|[x_i]_T|}\\
&= \frac{1}{n}\sum_{i=1}^{n}\Bigl(\log\frac{n\,|[x_i]_{R\cup S\cup T}|}{|[x_i]_R|\cdot|[x_i]_{S\cup T}|}
- \log\frac{n\,|[x_i]_{R\cup T}|}{|[x_i]_R|\cdot|[x_i]_T|}\Bigr)\\
&= \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S\cup T}|\cdot|[x_i]_T|}{|[x_i]_{R\cup T}|\cdot|[x_i]_{S\cup T}|} = \mathrm{RI}(R; S \mid T).
\end{aligned} \tag{20}$$
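The rough measures of Definitions 1, 7, and 9 reduce to counting equivalence-class sizes, so they are straightforward to compute on a discrete data table. The following Python sketch is ours (the helper names, base-2 logarithms, and the tiny XOR-style table are assumptions for illustration); it evaluates RI via Theorem 8(2) and the conditional form via Theorem 10(1).

```python
from math import log2

def block_sizes(table, attrs):
    """|[x_i]_A| for every object: the size of its equivalence class under attrs."""
    keys = [tuple(row[a] for a in attrs) for row in table]
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return [counts[k] for k in keys]

def rough_entropy(table, attrs):
    """RH(A) = -(1/n) sum_i log2(|[x_i]_A| / n)   (Definition 1)."""
    n = len(table)
    return -sum(log2(s / n) for s in block_sizes(table, attrs)) / n

def rough_mutual_info(table, r, s):
    """RI(R;S) = RH(R) + RH(S) - RH(R,S)   (Theorem 8(2))."""
    return (rough_entropy(table, r) + rough_entropy(table, s)
            - rough_entropy(table, list(r) + list(s)))

def rough_cond_mutual_info(table, r, s, t):
    """RI(R;S|T) = RH(R|T) - RH(R|S,T) = RH(R,T) - RH(T) - RH(R,S,T) + RH(S,T)."""
    return (rough_entropy(table, list(r) + list(t)) - rough_entropy(table, list(t))
            - rough_entropy(table, list(r) + list(s) + list(t))
            + rough_entropy(table, list(s) + list(t)))

if __name__ == "__main__":
    # columns 0 and 1 are features, column 2 is the class = XOR of the features
    U = [(0, 0, 'c1'), (0, 1, 'c2'), (1, 0, 'c2'), (1, 1, 'c1')]
    print(rough_mutual_info(U, [0], [2]))              # 0.0: feature 0 alone is useless
    print(rough_cond_mutual_info(U, [0], [2], [1]))    # 1.0: useful once feature 1 is known
```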

5. A Hybrid Feature Selection Method

In this section we propose a novel hybrid feature selection method based on rough conditional mutual information and the naive Bayesian classifier.

5.1. Feature Selection by Rough Conditional Mutual Information. Given a set of samples $U$ described by the attribute set $F$, in terms of mutual information the purpose of feature selection is to find a feature set $S$ ($S \subseteq F$) with $m$ features which jointly have the largest dependency on the target class $y$. This criterion, called max-dependency, has the following form:

$$\max D(S, y), \qquad D = I(f_1, f_2, \ldots, f_m; y). \tag{21}$$

According to the chain rule for information,

$$I(f_1, f_2, \ldots, f_m; y) = \sum_{i=1}^{m} I(f_i; y \mid f_{i-1}, f_{i-2}, \ldots, f_1); \tag{22}$$

that is to say, at each step we can select the feature which produces the maximum conditional mutual information, formally written as

$$\max_{f \in F - S} D(f, S, y), \qquad D = I(f; y \mid S), \tag{23}$$

where $S$ represents the selected feature set.

Figure 1 illustrates the validity of this criterion. Here $f_i$ represents a feature highly correlated with $f_j$, and $f_k$ is much less correlated with $f_i$. The mutual information between the vector $(f_i, f_j)$ and $y$ is indicated by a shadowed area consisting of three different patterns of patches, that is, $I((f_i, f_j); y) = A + B + C$, where $A$, $B$, and $C$ are defined by different cases of overlap. In detail:

(1) $(A + B)$ is the mutual information between $f_i$ and $y$, that is, $I(f_i; y)$;
(2) $(B + C)$ is the mutual information between $f_j$ and $y$, that is, $I(f_j; y)$;
(3) $(B + D)$ is the mutual information between $f_i$ and $f_j$, that is, $I(f_i; f_j)$;
(4) $C$ is the conditional mutual information between $f_j$ and $y$ given $f_i$, that is, $I(f_j; y \mid f_i)$;
(5) $E$ is the mutual information between $f_k$ and $y$, that is, $I(f_k; y)$.

This illustration clearly shows that features maximizing the mutual information not only depend on their individual predictive information, for example $(A + B) + (B + C)$, but also need to take account of the redundancy between them. In this example, feature $f_i$ should be selected first, since the mutual information between $f_i$ and $y$ is the largest, and feature $f_k$ should then have priority for selection over $f_j$, in spite of the latter having larger individual mutual information with $y$. This is because $f_k$ provides more complementary information to feature $f_i$ for predicting $y$ than $f_j$ does (as $E > C$ in Figure 1). That is to say, in each round we should select the feature which maximizes the conditional mutual information. From Theorem 2 we know that rough entropy equals Shannon's entropy; therefore we can select the feature which produces the maximum rough conditional mutual information.

(Figure 1: Illustration of mutual information and conditional mutual information for different scenarios, showing features $f_i$, $f_j$, $f_k$, the class $y$, and the overlap regions A–E.)

We adopt a forward selection algorithm to select features. A single input feature is added to the selected feature set at each step based on maximizing the rough conditional mutual information; that is, given the selected feature set $S$, we maximize the rough conditional mutual information of $f_i$ and the target class $y$, where $f_i$ belongs to the remaining feature set. In order to apply the rough conditional mutual information measure well in the filter model, a numerical threshold value β (β > 0) is set on $\mathrm{RI}(f_i; y \mid S)$. This helps the algorithm to be resistant to noisy data and to overcome the overfitting problem to a certain extent [26]. The procedure is repeated as long as $\mathrm{RI}(f_i; y \mid S) > \beta$ is satisfied. The filter algorithm can be described by the following procedure; a code sketch is given after the list.

(1) Initialization: set $F \leftarrow$ "initial set of all features", $S \leftarrow$ "empty set", and $y \leftarrow$ "class outputs".
(2) Computation of the rough mutual information of the features with the class outputs: for each feature $f_i \in F$, compute $\mathrm{RI}(f_i; y)$.
(3) Selection of the first feature: find the feature $f_i$ that maximizes $\mathrm{RI}(f_i; y)$; set $F \leftarrow F - \{f_i\}$ and $S \leftarrow \{f_i\}$.
(4) Greedy selection: repeat until the termination condition is satisfied:
  (a) computation of the rough conditional mutual information $\mathrm{RI}(f_i; y \mid S)$ for each feature $f_i \in F$;
  (b) selection of the next feature: choose the feature $f_i$ that maximizes $\mathrm{RI}(f_i; y \mid S)$; set $F \leftarrow F - \{f_i\}$ and $S \leftarrow S \cup \{f_i\}$.
(5) Output the set containing the selected features, $S$.
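A compact sketch of this filter stage is given below. It assumes the data are already discretized, takes features and the class as column indices of a row-oriented table, and re-implements RI(f; y | S) through rough entropies as in Section 4; the function names, the default threshold β, and the toy table are ours.

```python
from math import log2

def rough_entropy(rows, cols):
    """RH over the projection of the table onto the given column indices (Definition 1)."""
    n = len(rows)
    keys = [tuple(r[c] for c in cols) for r in rows]
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return -sum(log2(counts[k] / n) for k in keys) / n

def rough_cmi(rows, f, y, selected):
    """RI(f; y | S) = RH(f,S) - RH(S) - RH(f,y,S) + RH(y,S); equals RI(f; y) when S is empty."""
    s = list(selected)
    return (rough_entropy(rows, [f] + s) - rough_entropy(rows, s)
            - rough_entropy(rows, [f, y] + s) + rough_entropy(rows, [y] + s))

def rcmi_filter(rows, features, y, beta=0.01):
    """Greedy forward selection (steps (1)-(5) above): repeatedly add the feature
    with the largest RI(f; y | S) while that value stays above the threshold beta."""
    remaining, selected = list(features), []
    while remaining:
        best = max(remaining, key=lambda f: rough_cmi(rows, f, y, selected))
        if rough_cmi(rows, best, y, selected) <= beta:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    # column 3 is the class; column 2 alone predicts it, columns 0 and 1 only jointly
    U = [(0, 0, 1, 'c1'), (0, 1, 0, 'c2'), (1, 0, 0, 'c2'), (1, 1, 1, 'c1')]
    print(rcmi_filter(U, features=[0, 1, 2], y=3))   # -> [2]
```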

5.2. Selecting the Best Feature Subset with a Wrapper Approach. The wrapper model uses the classification accuracy of a predetermined learning algorithm to determine the goodness of the selected subset. It searches for features that are better suited to the learning algorithm, aiming at improving its performance; therefore the wrapper approach generally outperforms the filter approach in terms of the final predictive accuracy of a learning machine. However, it is more computationally expensive than the filter model. Although many wrapper methods are not exhaustive searches, most of them still incur time complexity $O(N^2)$ [27, 28], where $N$ is the number of features of the dataset. Hence it is worth reducing the search space before applying a wrapper feature selection approach. Passing through the filter model first reduces the computational cost and helps avoid the local-maximum problem, so that the final feature subset contains few features while the predictive accuracy remains high.

In this work we reduce the search space of the original feature set to the best candidate subset, which cuts the computational cost of the wrapper search effectively. Our method uses the sequential backward elimination technique to search through the possible subsets of features in the candidate space.

The features are ranked according to the average accuracy of the classifier, and features are then removed one by one from the candidate feature subset only if such exclusion improves or does not change the classifier accuracy. Different kinds of learning models can be applied in wrappers; however, different kinds of learning machines have different discrimination abilities. The naive Bayesian classifier is widely used in machine learning because it is fast and easy to implement. Rennie et al. [29] show that its performance is competitive with state-of-the-art models like SVM, while the latter has many parameters to tune. Therefore we choose the naive Bayesian classifier as the core of the fine-tuning stage. The decrement selection procedure for selecting an optimal feature subset based on the wrapper approach is shown in Algorithm 1.

There are two phases in the wrapper algorithm, as shown in Algorithm 1. In the first stage we compute the classification accuracy Acc_C of the candidate feature set C produced by the filter model (step 1), where Classperf(D, C) represents the average classification accuracy on dataset D with candidate features C; the results are obtained by 10-fold cross-validation. For each f_i ∈ C we compute the average accuracy Score, and the features are then ranked according to their Score values (steps 3–7). In the second stage we process the list of ordered features, from the first to the last ranked feature (steps 8–26). In this stage, for each feature in the list we consider the average accuracy of the naive Bayesian classifier when that feature is excluded. If a feature is found whose removal improves the average accuracy and the relative accuracy [30] improves by more than δ1 (steps 11–14), that feature is removed immediately. Otherwise, every possible feature is considered and the feature whose removal leads to the largest average accuracy is chosen (step 15); it is removed if the average accuracy improves or is unchanged (steps 17–20), or if the relative accuracy degrades by no more than δ2 (steps 21–24). In general δ1 should take a value in [0, 0.1] and δ2 a value in [0, 0.02]. In the following, if not specified, δ1 = 0.05 and δ2 = 0.01.

This decrement selection procedure is repeated until the termination condition is satisfied. Usually sequential backward elimination is more computationally expensive than incremental sequential forward search; however, it can yield a better result with respect to local maxima. In addition, a sequential forward search that adds one feature at each pass does not take the interaction between groups of features into account [31]. In many classification problems the class variable may be affected by a group of several features but not by any individual feature alone. Therefore the sequential forward search is unable to find dependencies between groups of features, and its performance can sometimes be degraded.
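The sketch below shows a simplified version of this backward-elimination wrapper in Python with scikit-learn, using GaussianNB and 10-fold cross_val_score as the accuracy oracle. It is our own simplification of Algorithm 1 (the δ1 early-exit shortcut is omitted and the best removal is picked directly), so it should be read as an illustration of the search strategy rather than the authors' exact procedure.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def cv_accuracy(X, y, cols, cv=10):
    """Classperf(D, cols): mean 10-fold cross-validated accuracy of naive Bayes."""
    return cross_val_score(GaussianNB(), X[:, cols], y, cv=cv).mean()

def wrapper_backward(X, y, candidate, delta2=0.01):
    """Sequential backward elimination: drop a feature whenever removing it improves
    accuracy, leaves it unchanged, or degrades the relative accuracy by at most delta2."""
    B = sorted(candidate, key=lambda f: cv_accuracy(X, y, [f]))   # rank by single-feature accuracy
    acc_c = cv_accuracy(X, y, B)
    while len(B) > 1:
        scores = {f: cv_accuracy(X, y, [g for g in B if g != f]) for f in B}
        f_best = max(scores, key=scores.get)
        acc_best = scores[f_best]
        if acc_best >= acc_c or (acc_c - acc_best) / acc_c <= delta2:
            B.remove(f_best)
            acc_c = acc_best
        else:
            break
    return B, acc_c

if __name__ == "__main__":
    data = load_wine()                                   # stands in for a candidate subset
    subset, acc = wrapper_backward(data.data, data.target, candidate=list(range(5)))
    print(subset, round(acc, 4))
```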

6. Experimental Results

This section evaluates our method in terms of classification accuracy and the number of selected features, in order to see how well the filter wrapper performs on datasets with medium and large numbers of features. In addition, the performance of the rough conditional mutual information algorithm is compared with three typical feature selection methods based on three different evaluation criteria: correlation-based feature selection (CFS), the consistency-based algorithm, and min-redundancy max-relevance (mRMR). The results illustrate the efficiency and effectiveness of our method.

Algorithm 1: Wrapper algorithm.
Input: dataset D, candidate feature set C.
Output: an optimal feature set B.
(1) Acc_C = Classperf(D, C)
(2) set B = ∅
(3) for all f_i ∈ C do
(4)   Score = Classperf(D, f_i)
(5)   append f_i to B
(6) end for
(7) sort B in ascending order according to Score value
(8) while |B| > 1 do
(9)   for all f_i ∈ B, according to the order, do
(10)    Acc_fi = Classperf(D, B − f_i)
(11)    if (Acc_fi − Acc_C)/Acc_C > δ1 then
(12)      B = B − f_i; Acc_C = Acc_fi
(13)      go to Step 8
(14)    end if
(15)    select f_i with the maximum Acc_fi
(16)  end for
(17)  if Acc_fi ≥ Acc_C then
(18)    B = B − f_i; Acc_C = Acc_fi
(19)    go to Step 8
(20)  end if
(21)  if (Acc_C − Acc_fi)/Acc_C ≤ δ2 then
(22)    B = B − f_i; Acc_C = Acc_fi
(23)    go to Step 8
(24)  end if
(25)  go to Step 27
(26) end while
(27) return the optimal feature subset B

Table 1: Experimental data description.

Number | Dataset     | Instances | Features | Classes
1      | Arrhythmia  | 452       | 279      | 16
2      | Hepatitis   | 155       | 19       | 2
3      | Ionosphere  | 351       | 34       | 2
4      | Segment     | 2310      | 19       | 7
5      | Sonar       | 208       | 60       | 2
6      | Soybean     | 683       | 35       | 19
7      | Vote        | 435       | 16       | 2
8      | WDBC        | 569       | 16       | 2
9      | Wine        | 178       | 12       | 3
10     | Zoo         | 101       | 16       | 7

In order to compare our hybrid method with some classical techniques, 10 databases were downloaded from the UCI repository of machine learning databases. All these datasets are widely used by the data mining community for evaluating learning algorithms. The details of the 10 UCI experimental datasets are listed in Table 1. The sizes of the databases vary from 101 to 2310, the numbers of original features vary from 12 to 279, and the numbers of classes vary from 2 to 19.

6.1. Unselect versus CFS, the Consistency-Based Algorithm, mRMR, and RCMI. In Section 5, rough conditional mutual information is used to filter out the redundant and irrelevant features. In order to compute the rough mutual information, we employ Fayyad and Irani's MDL discretization algorithm [32] to transform continuous features into discrete ones.

We use the naive Bayesian and CART classifiers to test the classification accuracy of the features selected by the different feature selection methods. The results in Tables 2 and 3 show the classification accuracies and the numbers of features obtained with the original feature set (unselect), RCMI, and the other feature selectors. According to Tables 2 and 3, the features selected by RCMI have the highest average accuracy for both naive Bayes and CART. It can also be observed that RCMI achieves the smallest average number of selected features, the same as mRMR. This shows that RCMI is better than CFS and the consistency-based algorithm and is comparable to mRMR.

Table 2: Number and accuracy (%) of selected features with different algorithms, tested by naive Bayes.

Number  | Unselect Acc. | CFS Acc. / Features | Consistency Acc. / Features | mRMR Acc. / Features | RCMI Acc. / Features
1       | 75.00 | 78.54 / 22 | 75.66 / 24 | 75.00 / 21 | 77.43 / 16
2       | 84.52 | 89.03 / 8  | 85.81 / 13 | 87.74 / 7  | 85.13 / 8
3       | 90.60 | 92.59 / 13 | 90.88 / 7  | 90.60 / 7  | 94.87 / 6
4       | 91.52 | 93.51 / 8  | 93.20 / 9  | 93.98 / 5  | 93.07 / 3
5       | 85.58 | 77.88 / 19 | 82.21 / 14 | 84.13 / 10 | 86.06 / 15
6       | 92.09 | 91.80 / 24 | 84.63 / 13 | 90.48 / 21 | 91.22 / 24
7       | 90.11 | 94.71 / 5  | 91.95 / 12 | 95.63 / 1  | 95.63 / 2
8       | 95.78 | 96.66 / 11 | 96.84 / 8  | 95.78 / 9  | 95.43 / 4
9       | 98.31 | 98.88 / 10 | 99.44 / 5  | 98.88 / 8  | 98.88 / 5
10      | 95.05 | 95.05 / 10 | 93.07 / 5  | 94.06 / 4  | 95.05 / 10
Average | 89.86 | 90.87 / 13 | 89.37 / 11 | 90.63 / 9.3 | 91.28 / 9.3

Table 3: Number and accuracy (%) of selected features with different algorithms, tested by CART.

Number  | Unselect Acc. | CFS Acc. / Features | Consistency Acc. / Features | mRMR Acc. / Features | RCMI Acc. / Features
1       | 72.35 | 72.57 / 22 | 71.90 / 24 | 73.45 / 21 | 75.00 / 16
2       | 79.35 | 81.94 / 8  | 80.00 / 13 | 83.23 / 7  | 85.16 / 8
3       | 89.74 | 90.31 / 13 | 89.46 / 7  | 86.89 / 7  | 91.74 / 6
4       | 96.15 | 96.10 / 8  | 95.28 / 9  | 95.76 / 5  | 95.54 / 3
5       | 74.52 | 74.04 / 19 | 77.88 / 14 | 76.44 / 10 | 77.40 / 15
6       | 92.53 | 91.65 / 24 | 85.94 / 13 | 92.24 / 21 | 91.36 / 24
7       | 95.63 | 95.63 / 5  | 95.63 / 12 | 95.63 / 1  | 96.09 / 2
8       | 93.50 | 94.55 / 11 | 94.73 / 8  | 95.43 / 9  | 94.90 / 4
9       | 94.94 | 94.94 / 10 | 96.07 / 5  | 93.26 / 8  | 94.94 / 5
10      | 92.08 | 93.07 / 10 | 92.08 / 5  | 92.08 / 4  | 93.07 / 10
Average | 88.08 | 88.48 / 13 | 87.90 / 11 | 88.44 / 9.3 | 89.52 / 9.3
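For reproducibility, the following is a hedged sketch of the evaluation protocol as we read it: mean 10-fold cross-validated accuracy of naive Bayes and a CART-style decision tree on a chosen feature subset. scikit-learn's DecisionTreeClassifier stands in for CART, the Wine data ship with scikit-learn rather than being the exact UCI file, the column indices are arbitrary placeholders rather than the subset RCMI actually selects, and the Fayyad–Irani MDL discretization step is not shown because scikit-learn has no built-in implementation of it.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def evaluate_subset(X, y, cols, cv=10):
    """Mean 10-fold CV accuracy of the two test classifiers on a feature subset."""
    Xs = X[:, cols]
    nb = cross_val_score(GaussianNB(), Xs, y, cv=cv).mean()
    cart = cross_val_score(DecisionTreeClassifier(random_state=0), Xs, y, cv=cv).mean()
    return round(nb, 4), round(cart, 4)

if __name__ == "__main__":
    data = load_wine()
    print(evaluate_subset(data.data, data.target, cols=[0, 6, 9, 11, 12]))  # illustrative columns
```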

In addition, to illustrate the efficiency of RCMI, we experiment on the Ionosphere, Sonar, and Wine datasets. Different numbers of features selected by RCMI and mRMR are tested with the naive Bayesian classifier, as presented in Figures 2, 3, and 4. In Figures 2–4, the classification accuracies are the results of 10-fold cross-validation tested by naive Bayes, and the number k on the x-axis refers to the first k features in the order selected by each method. The results in Figures 2–4 show that the average accuracy of the classifier with RCMI is comparable to mRMR in the majority of cases. We can see that the maximum value of the plot for each dataset with the RCMI method is higher than with mRMR. For example, the highest accuracy on Ionosphere achieved by RCMI is 94.87%, while the highest accuracy achieved by mRMR is 90.60%. At the same time, we can also notice that the RCMI method reaches its maximum value at more points than mRMR. This shows that RCMI is an effective measure for feature selection.

However, the number of features selected by the RCMI method is still large for some datasets. Therefore, to improve performance and further reduce the number of selected features, these problems are addressed by using the wrapper method. With the redundant and irrelevant features removed, the wrapper core used for fine tuning can run much faster.

6.2. Filter Wrapper versus RCMI and Unselect. Similarly, we also use naive Bayes and CART to test the classification accuracy of the features selected by the filter wrapper, RCMI, and unselect. The results in Tables 4 and 5 show the classification accuracies and the numbers of selected features.

Table 4: Number and accuracy (%) of selected features based on naive Bayes.

Number  | Unselect Acc. / Features | RCMI Acc. / Features | Filter wrapper Acc. / Features
1       | 75.00 / 279 | 77.43 / 16 | 78.76 / 9
2       | 84.52 / 19  | 85.13 / 8  | 85.81 / 3
3       | 90.60 / 34  | 94.87 / 6  | 94.87 / 6
4       | 91.52 / 19  | 93.07 / 3  | 94.33 / 3
5       | 85.58 / 60  | 86.06 / 15 | 86.54 / 6
6       | 92.09 / 35  | 91.22 / 24 | 92.24 / 10
7       | 90.11 / 16  | 95.63 / 2  | 95.63 / 1
8       | 95.78 / 16  | 95.43 / 4  | 95.43 / 4
9       | 98.31 / 12  | 98.88 / 5  | 96.07 / 3
10      | 95.05 / 16  | 95.05 / 10 | 95.05 / 8
Average | 89.86 / 50.6 | 91.28 / 9.3 | 91.47 / 5.3

Table 5: Number and accuracy (%) of selected features based on CART.

Number  | Unselect Acc. / Features | RCMI Acc. / Features | Filter wrapper Acc. / Features
1       | 72.35 / 279 | 75.00 / 16 | 75.89 / 9
2       | 79.35 / 19  | 85.16 / 8  | 85.81 / 3
3       | 89.74 / 34  | 91.74 / 6  | 91.74 / 6
4       | 96.15 / 19  | 95.54 / 3  | 95.80 / 3
5       | 74.52 / 60  | 77.40 / 15 | 80.77 / 6
6       | 92.53 / 35  | 91.36 / 24 | 91.95 / 10
7       | 95.63 / 16  | 96.09 / 2  | 95.63 / 1
8       | 93.50 / 16  | 94.90 / 4  | 94.90 / 4
9       | 94.94 / 12  | 94.94 / 5  | 93.82 / 3
10      | 92.08 / 16  | 93.07 / 10 | 93.07 / 8
Average | 88.08 / 50.6 | 89.52 / 9.3 | 89.94 / 5.3

(Figure 2: Classification accuracy for different numbers of selected features on the Ionosphere dataset (naive Bayes); x-axis: number of features, y-axis: classification accuracy, curves: RCMI and mRMR.)

(Figure 3: Classification accuracy for different numbers of selected features on the Sonar dataset (naive Bayes); x-axis: number of features, y-axis: classification accuracy, curves: RCMI and mRMR.)

(Figure 4: Classification accuracy for different numbers of selected features on the Wine dataset (naive Bayes); x-axis: number of features, y-axis: classification accuracy, curves: RCMI and mRMR.)

Now we analyze the performance of these selected features. First, we can conclude that although most of the features are removed from the raw data, the classification accuracies do not decrease; on the contrary, they increase for the majority of datasets. The average accuracies obtained with RCMI and with the filter-wrapper method are both higher than those of the unselected datasets, with respect to both naive Bayes and CART. With respect to the naive Bayesian learning algorithm, the average accuracy is 91.47% for the filter wrapper versus 89.86% for unselect, a relative increase of 1.8%. With respect to the CART learning algorithm, the average accuracy is 89.94% for the filter wrapper versus 88.08% for unselect, a relative increase of 2.1%. The average number of selected features is 5.3 for the filter wrapper, versus 9.3 for RCMI and 50.6 for unselect; the average number of selected features is thus reduced by 43% and 89.5%, respectively. Therefore, the average classification accuracy and the number of features obtained by the filter-wrapper method are better than those obtained by RCMI and by unselect. In other words, using RCMI and the wrapper method as a hybrid improves the classification efficiency and accuracy compared with using the RCMI method alone.

7. Conclusion

The main goal of feature selection is to find a feature subset that is as small as possible while retaining high prediction accuracy. A hybrid feature selection approach which takes advantage of both the filter model and the wrapper model has been presented in this paper. In the filter model, measuring the relevance between features plays an important role, and a number of measures have been proposed. Mutual information is widely used for its robustness; however, it is difficult to compute, especially multivariate mutual information. We proposed a set of rough-set-based metrics to measure the relevance between features and analyzed some important properties of these uncertainty measures. We have shown that RCMI can substitute for Shannon's conditional mutual information; thus RCMI can be used as an effective measure to filter out the irrelevant and redundant features. Based on the candidate feature subset produced by RCMI, the naive Bayesian classifier is applied in the wrapper model. The accuracies of the naive Bayesian and CART classifiers were used to evaluate the goodness of feature subsets. The performance of the proposed method was evaluated on ten UCI datasets. The experimental results show that the filter-wrapper method outperformed CFS, the consistency-based algorithm, and mRMR in most cases. Our technique not only chooses a small subset of features from the candidate subset but also provides good predictive accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (70971137).

References

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[2] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1–4, pp. 131–156, 1997.
[3] M. Dash and H. Liu, "Consistency-based search in feature selection," Artificial Intelligence, vol. 151, no. 1-2, pp. 155–176, 2003.
[4] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1996, http://mlearn.ics.uci.edu/MLRepository.html.
[5] K. Kira and L. A. Rendell, "Feature selection problem: traditional methods and a new algorithm," in Proceedings of the 9th National Conference on Artificial Intelligence (AAAI '92), pp. 129–134, July 1992.
[6] M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1-2, pp. 23–69, 2003.
[7] I. Kononenko, "Estimating attributes: analysis and extension of RELIEF," in Proceedings of the European Conference on Machine Learning (ECML '94), pp. 171–182, 1994.
[8] M. A. Hall, Correlation-based feature subset selection for machine learning [Ph.D. thesis], Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.
[9] J. G. Bazan, "A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision table," in Rough Sets in Knowledge Discovery, L. Polkowski and A. Skowron, Eds., pp. 321–365, Physica, Heidelberg, Germany, 1998.
[10] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.
[11] N. Kwak and C. H. Choi, "Input feature selection for classification problems," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143–159, 2002.
[12] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[13] J. Martínez Sotoca and F. Pla, "Supervised feature selection by clustering using conditional mutual information-based distances," Pattern Recognition, vol. 43, no. 6, pp. 2068–2081, 2010.
[14] B. Guo, R. I. Damper, S. R. Gunn, and J. D. B. Nelson, "A fast separability-based feature-selection method for high-dimensional remotely sensed image classification," Pattern Recognition, vol. 41, no. 5, pp. 1670–1679, 2008.
[15] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, New York, NY, USA, 1992.
[16] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, UK, 1986.
[17] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, Article ID 066138, 16 pages, 2004.
[18] T. Beaubouef, F. E. Petry, and G. Arora, "Information-theoretic measures of uncertainty for rough sets and rough relational databases," Information Sciences, vol. 109, no. 1–4, pp. 185–195, 1998.
[19] I. Düntsch and G. Gediga, "Uncertainty measures of rough set prediction," Artificial Intelligence, vol. 106, no. 1, pp. 109–137, 1998.
[20] G. J. Klir and M. J. Wierman, Uncertainty Based Information: Elements of Generalized Information Theory, Physica, New York, NY, USA, 1999.
[21] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set theory," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 12, no. 1, pp. 37–46, 2004.
[22] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[24] Z. Pawlak, "Rough sets," International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.
[25] X. Hu and N. Cercone, "Learning in relational databases: a rough set approach," Computational Intelligence, vol. 11, no. 2, pp. 323–338, 1995.
[26] J. Huang, Y. Cai, and X. Xu, "A hybrid genetic algorithm for feature selection wrapper based on mutual information," Pattern Recognition Letters, vol. 28, no. 13, pp. 1825–1844, 2007.
[27] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491–502, 2005.
[28] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205–1224, 2004.
[29] J. D. M. Rennie, L. Shih, J. Teevan, and D. Karger, "Tackling the poor assumptions of naive Bayes text classifiers," in Proceedings of the 20th International Conference on Machine Learning, pp. 616–623, Washington, DC, USA, August 2003.
[30] R. Setiono and H. Liu, "Neural-network feature selector," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 654–662, 1997.
[31] S. Foithong, O. Pinngern, and B. Attachoo, "Feature subset selection wrapper based on mutual information and rough sets," Expert Systems with Applications, vol. 39, no. 1, pp. 574–584, 2012.
[32] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027, Morgan Kaufmann, San Mateo, Calif, USA, 1993.


In the remainder of the paper, related work is first discussed in the next section. Section 3 presents the preliminaries on Shannon's entropy and rough sets. Section 4 introduces the definitions of the rough uncertainty measures and discusses their properties and interpretation. The proposed hybrid feature selection method is delineated in Section 5. The experimental results are presented in Section 6. Finally, a brief conclusion is given in Section 7.

2. Related Work

In filter-based feature selection techniques, a number of relevance measures have been applied to measure the performance of features for predicting decisions. These relevance measures can be divided into four categories: distance, dependency, consistency, and information. The most prominent distance-based method is Relief [5]. This method uses Euclidean distance to select the relevant features. Since Relief works only for binary classes, Kononenko generalized it to multiple classes, called Relief-F [6, 7]. However, Relief and Relief-F are unable to detect redundant features. Dependence measures or correlation measures quantify the ability to predict the value of one variable from the value of another variable. Hall's correlation-based feature selection (CFS) algorithm [8] is a typical representative of this category. Consistency measures try to preserve the discriminative power of the data in the original feature space; rough set theory is a popular technique of this sort [9]. Among these measures, mutual information (MI) is the most widely used in computing relevance. MI is a well-known concept from information theory and has been used to capture the relevance and redundancy among features. In this paper, we focus on the information-based measure in the filter model.

The main advantages of MI are its robustness to noise and to transformations. In contrast to other measures, MI is not limited to linear dependencies but captures nonlinear ones as well. Since Battiti proposed the mutual information feature selector (MIFS) [10], more and more researchers have begun to study information-based feature selection. MIFS selects the feature that maximizes the information about the class, corrected by subtracting a quantity proportional to the average MI with the previously selected features. Battiti demonstrated that MI can be very useful in feature selection problems and that MIFS, owing to its simplicity, can be used in any classifying system, whatever the learning algorithm may be. Kwak and Choi [11] analyzed the limitations of MIFS and proposed a method called MIFS-U, which in general makes a better estimation of the MI between input attributes and output classes than MIFS. They showed that MIFS does not work in nonlinear problems and proposed MIFS-U to improve MIFS for solving nonlinear problems. Another variant of MIFS is the min-redundancy max-relevance (mRMR) criterion [12]. That work presented a theoretical analysis of the relationships of max-dependency, max-relevance, and min-redundancy and proved that mRMR is equivalent to max-dependency for the first-order incremental search.

The limitations of the MIFS, MIFS-U, and mRMR algorithms are as follows. Firstly, they are all incremental search schemes that select one feature at a time. At each pass, these methods select the single feature with the maximum criterion value without considering the interaction between groups of features. In many classification problems, a group of several features occurring simultaneously is relevant even though no individual feature alone is, for example, in the XOR problem. Secondly, the coefficient $\beta$ is a configurable parameter which must be set experimentally. Thirdly, they are not accurate enough to quantify the dependency among features with respect to a given decision.

Assume that $X = \{X_1, X_2, \ldots, X_n\}$ is an input feature set and $Y$ is the target; our task is to select $m$ ($m < n$) features from the pool such that their joint mutual information $I(X_1, X_2, \ldots, X_m; Y)$ is maximized. However, the estimation of mutual information from the available data is a great challenge, especially for multivariate mutual information. Martínez Sotoca and Pla [13] and Guo et al. [14] proposed different methods to approximate multivariate conditional mutual information, respectively. Nevertheless, their proofs are all based on the same inequality, that is, $I(X; Y \mid Z) \le I(X; Y)$. This inequality does not hold under all conditions; it holds only if the random variables $X$, $Y$, and $Z$ satisfy the Markov property. Many researchers have tried various methods to estimate mutual information. The most common methods are the histogram [15], kernel density estimation (KDE) [16], and k-nearest neighbor (k-NN) estimation [17]. The standard histogram partitions the axes into distinct bins of fixed width and then counts the number of observations, so this estimation method is highly dependent on the choice of the bin width. Although KDE is better than the histogram, the bandwidth and kernel function are difficult to decide. The k-NN approach uses a fixed number of nearest neighbors to estimate the MI, but it is more suitable for continuous random variables.
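To make the bin-width issue concrete, the following minimal Python sketch (ours, not part of the original paper; it assumes only numpy) computes a plug-in histogram estimate of $I(X;Y)$; rerunning it with different values of the `bins` argument gives noticeably different estimates on the same data.

import numpy as np

def histogram_mi(x, y, bins=10):
    """Plug-in estimate of I(X;Y) in bits from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal of X
    py = pxy.sum(axis=0, keepdims=True)    # marginal of Y
    nz = pxy > 0                           # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + rng.normal(scale=0.5, size=2000)
print(histogram_mi(x, y, bins=8), histogram_mi(x, y, bins=64))   # the two estimates differ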

This paper will compute multivariate mutual information and multivariate conditional mutual information from a new perspective. Our method is based on a rough entropy uncertainty measure. Several authors [18-21] have used Shannon's entropy and its variants to measure uncertainty in rough set theory. In this work we will propose several rough entropy-based metrics, and some important properties and relationships of these uncertainty measures will be derived. Then we will find a candidate feature subset by using rough conditional mutual information to filter the irrelevant and redundant features in the first stage. To overcome the limitations of the filter model, in the second stage we will use the wrapper model with a sequential backward elimination scheme to search for an optimal feature subset within the candidate feature subset.

3. Preliminaries

In this section we briefly introduce some basic concepts and notations of information theory and rough set theory.

3.1. Entropy, Mutual Information, and Conditional Mutual Information. Shannon's information theory, first introduced in 1948 [22], provides a way to measure the information of random variables. The entropy is a measure of the uncertainty of random variables [23]. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a discrete random variable and let $p(x_i)$ be the probability of $x_i$; the entropy of $X$ is defined by

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i). \qquad (1)$$

Here the base of the log is 2 and the unit of entropy is the bit. If $X$ and $Y = \{y_1, y_2, \ldots, y_m\}$ are two discrete random variables, the joint probability is $p(x_i, y_j)$, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. The joint entropy of $X$ and $Y$ is

$$H(X, Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log p(x_i, y_j). \qquad (2)$$

When certain variables are known and others are not, the remaining uncertainty is measured by the conditional entropy:

$$H(X \mid Y) = H(X, Y) - H(Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log p(x_i \mid y_j). \qquad (3)$$

The information found commonly in two random variables is of importance, and this is defined as the mutual information between the two variables:

$$I(X; Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{p(x_i \mid y_j)}{p(x_i)}. \qquad (4)$$

If the mutual information between two random variables is large (small), the two variables are closely (not closely) related. If the mutual information is zero, the two random variables are totally unrelated, that is, independent. The mutual information and the entropy have the following relations:

$$I(X; Y) = H(X) - H(X \mid Y),$$
$$I(X; Y) = H(Y) - H(Y \mid X),$$
$$I(X; Y) = H(X) + H(Y) - H(X, Y),$$
$$I(X; X) = H(X). \qquad (5)$$

For continuous random variables, the entropy and mutual information are defined as

$$H(X) = -\int p(x) \log p(x)\, dx, \qquad I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy. \qquad (6)$$

Conditional mutual information is the reduction in the uncertainty of $X$ due to knowledge of $Y$ when $Z$ is given. The conditional mutual information of random variables $X$ and $Y$ given $Z$ is defined by

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z). \qquad (7)$$

Mutual information satisfies a chain rule, that is,

$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1). \qquad (8)$$
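For discrete variables, the quantities in (1)-(8) can be computed directly from empirical frequencies. The sketch below is an illustrative Python implementation (ours, not from the paper); it uses base-2 logarithms and obtains $I(X;Y \mid Z)$ as $H(X,Z)+H(Y,Z)-H(X,Y,Z)-H(Z)$, which is equivalent to (7).

from collections import Counter
from math import log2

def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more aligned discrete sequences."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())

def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)                          # I(X;Y), cf. (5)

def conditional_mutual_information(x, y, z):
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)    # I(X;Y|Z), cf. (7)

x = [0, 0, 1, 1, 0, 1]
y = [0, 1, 1, 1, 0, 1]
z = [0, 0, 0, 1, 1, 1]
print(mutual_information(x, y), conditional_mutual_information(x, y, z))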

3.2. Rough Sets. Rough set theory, introduced by Pawlak [24], is a mathematical tool to handle imprecision, uncertainty, and vagueness. It has been applied in many fields [25], such as machine learning, data mining, and pattern recognition.

The notion of an information system provides a convenient basis for the representation of objects in terms of their attributes. An information system is a pair $(U, A)$, where $U$ is a nonempty finite set of objects called the universe and $A$ is a nonempty finite set of attributes, that is, $a: U \to V_a$ for $a \in A$, where $V_a$ is called the domain of $a$. A decision table is a special case of an information system, $S = (U, A \cup \{d\})$, where the attributes in $A$ are called condition attributes and $d$ is a designated attribute called the decision attribute.

For every set of attributes $B \subseteq A$, an indiscernibility relation $IND(B)$ is defined in the following way. Two objects $x_i$ and $x_j$ are indiscernible by the set of attributes $B$ in $A$ if $b(x_i) = b(x_j)$ for every $b \in B$. An equivalence class of $IND(B)$ is called an elementary set in $B$ because it represents the smallest discernible group of objects. For any element $x_i$ of $U$, the equivalence class of $x_i$ in the relation $IND(B)$ is denoted by $[x_i]_B$. For $B \subseteq A$, the indiscernibility relation $IND(B)$ constitutes a partition of $U$, which is denoted by $U/IND(B)$.

Given an information system $S = (U, A)$, for any subset $X \subseteq U$ and equivalence relation $IND(B)$, the $B$-lower and $B$-upper approximations of $X$ are defined, respectively, as

$$\underline{B}(X) = \{x \in U : [x]_B \subseteq X\}, \qquad \overline{B}(X) = \{x \in U : [x]_B \cap X \neq \emptyset\}. \qquad (9)$$
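A small illustrative sketch of these constructions (the toy table and attribute names are ours) computes the partition $U/IND(B)$ and the $B$-lower and $B$-upper approximations of an object set $X$ from a table of discrete attribute values.

from collections import defaultdict

def partition(table, attrs):
    """U/IND(B): group object indices by their value tuple on the attributes in attrs."""
    blocks = defaultdict(set)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())

def approximations(table, attrs, X):
    """Return the B-lower and B-upper approximations of the object set X, cf. (9)."""
    lower, upper = set(), set()
    for block in partition(table, attrs):
        if block <= X:
            lower |= block          # [x]_B contained in X
        if block & X:
            upper |= block          # [x]_B intersects X
    return lower, upper

# toy decision table: each row maps an attribute name to a discrete value
table = [{"a": 0, "b": 1}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
print(approximations(table, ["a"], X={0, 2}))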

4. Rough Entropy-Based Metrics

In this section, the concept of rough entropy is introduced to measure the uncertainty of knowledge in an information system, and then some rough entropy-based uncertainty measures are presented. Some important properties of these uncertainty measures are deduced, and the relationships among them are discussed as well.

Definition 1. Given a set of samples $U = \{x_1, x_2, \ldots, x_n\}$ described by features $F$, let $S \subseteq F$ be a subset of attributes. Then the rough entropy of a sample $x_i$ is defined by

$$\mathrm{RH}_{x_i}(S) = -\log \frac{|[x_i]_S|}{n}, \qquad (10)$$

and the average rough entropy of the set of samples is computed as

$$\mathrm{RH}(S) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{|[x_i]_S|}{n}, \qquad (11)$$

where $|[x_i]_S|$ is the cardinality of $[x_i]_S$.

Since for all $x_i$ we have $\{x_i\} \subseteq [x_i]_S \subseteq U$, it follows that $1/n \le |[x_i]_S|/n \le 1$, and hence $0 \le \mathrm{RH}(S) \le \log n$. $\mathrm{RH}(S) = \log n$ if and only if $|[x_i]_S| = 1$ for all $x_i$, that is, $U/IND(S) = \{\{x_1\}, \{x_2\}, \ldots, \{x_n\}\}$; $\mathrm{RH}(S) = 0$ if and only if $|[x_i]_S| = n$ for all $x_i$, that is, $U/IND(S) = \{U\}$. Obviously, when the knowledge $S$ can distinguish any two objects, the rough entropy is largest; when the knowledge $S$ cannot distinguish any two objects, the rough entropy is zero.

Theorem 2. $\mathrm{RH}(S) = H(S)$, where $H(S)$ is Shannon's entropy.

Proof. Suppose $U = \{x_1, x_2, \ldots, x_n\}$ and $U/IND(S) = \{X_1, X_2, \ldots, X_m\}$, where $X_i = \{x_{i1}, x_{i2}, \ldots, x_{i|X_i|}\}$; then $H(S) = -\sum_{i=1}^{m} (|X_i|/n) \log(|X_i|/n)$. Because $X_i \cap X_j = \emptyset$ for $i \neq j$ and $X_i = [x]_S$ for any $x \in X_i$, we have

$$\mathrm{RH}(S) = -\frac{1}{n}\sum_{i=1}^{n} \log\frac{|[x_i]_S|}{n}
= \sum_{x \in X_1} \left(-\frac{1}{n}\log\frac{|[x]_S|}{n}\right) + \cdots + \sum_{x \in X_m} \left(-\frac{1}{n}\log\frac{|[x]_S|}{n}\right)
= \left(-\frac{|X_1|}{n}\log\frac{|X_1|}{n}\right) + \cdots + \left(-\frac{|X_m|}{n}\log\frac{|X_m|}{n}\right)
= -\sum_{i=1}^{m}\frac{|X_i|}{n}\log\frac{|X_i|}{n} = H(S). \qquad (12)$$

Theorem 2 shows that the rough entropy equals Shannon's entropy.
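The equality stated in Theorem 2 is easy to check numerically. The following sketch (ours, for illustration only) computes $\mathrm{RH}(S)$ from the equivalence classes as in (11) and the partition-based Shannon entropy $H(S)$; for any discrete table the two values coincide.

from collections import defaultdict
from math import log2

def equivalence_classes(table, attrs):
    """U/IND(S) as a list of index sets."""
    blocks = defaultdict(set)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())

def rough_entropy(table, attrs):
    """RH(S) = -(1/n) * sum_i log(|[x_i]_S| / n), cf. (11)."""
    n = len(table)
    return -sum(len(b) * log2(len(b) / n) for b in equivalence_classes(table, attrs)) / n

def shannon_entropy(table, attrs):
    """H(S) = -sum_k (|X_k|/n) log(|X_k|/n) over the partition U/IND(S)."""
    n = len(table)
    return -sum((len(b) / n) * log2(len(b) / n) for b in equivalence_classes(table, attrs))

table = [{"a": 0, "b": 1}, {"a": 0, "b": 0}, {"a": 1, "b": 0}, {"a": 1, "b": 0}]
print(rough_entropy(table, ["a", "b"]), shannon_entropy(table, ["a", "b"]))   # identical values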

Definition 3. Suppose $R, S \subseteq F$ are two subsets of attributes; the joint rough entropy is defined as

$$\mathrm{RH}(R, S) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S}|}{n}. \qquad (13)$$

Since $[x_i]_{R \cup S} = [x_i]_R \cap [x_i]_S$, we have $\mathrm{RH}(R, S) = -(1/n)\sum_{i=1}^{n}\log(|[x_i]_R \cap [x_i]_S|/n)$. According to Definition 3, we can observe that $\mathrm{RH}(R, S) = \mathrm{RH}(R \cup S)$.

Theorem 4. $\mathrm{RH}(R, S) \ge \mathrm{RH}(R)$ and $\mathrm{RH}(R, S) \ge \mathrm{RH}(S)$.

Proof. For all $x_i \in U$ we have $[x_i]_{S \cup R} \subseteq [x_i]_S$ and $[x_i]_{S \cup R} \subseteq [x_i]_R$, and then $|[x_i]_{S \cup R}| \le |[x_i]_S|$ and $|[x_i]_{S \cup R}| \le |[x_i]_R|$. Therefore $\mathrm{RH}(R, S) \ge \mathrm{RH}(R)$ and $\mathrm{RH}(R, S) \ge \mathrm{RH}(S)$.

Definition 5. Suppose $R, S \subseteq F$ are two subsets of attributes; the conditional rough entropy of $R$ given $S$ is defined as

$$\mathrm{RH}(R \mid S) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S}|}{|[x_i]_S|}. \qquad (14)$$

Theorem 6 (chain rule). $\mathrm{RH}(R \mid S) = \mathrm{RH}(R, S) - \mathrm{RH}(S)$.

Proof. Consider

$$\mathrm{RH}(R, S) - \mathrm{RH}(S)
= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S}|}{n} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n}
= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S}|}{|[x_i]_S|}
= \mathrm{RH}(R \mid S). \qquad (15)$$

Definition 7. Suppose $R, S \subseteq F$ are two subsets of attributes; the rough mutual information of $R$ and $S$ is defined as

$$\mathrm{RI}(R; S) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R \cup S}|}{|[x_i]_R| \cdot |[x_i]_S|}. \qquad (16)$$

Theorem 8 (relation between rough mutual information and rough entropy). The following hold:
(1) $\mathrm{RI}(R; S) = \mathrm{RI}(S; R)$;
(2) $\mathrm{RI}(R; S) = \mathrm{RH}(R) + \mathrm{RH}(S) - \mathrm{RH}(R, S)$;
(3) $\mathrm{RI}(R; S) = \mathrm{RH}(R) - \mathrm{RH}(R \mid S) = \mathrm{RH}(S) - \mathrm{RH}(S \mid R)$.

Proof. The conclusions of (1) and (3) are straightforward; here we give the proof of property (2). Consider

$$\mathrm{RH}(R) + \mathrm{RH}(S) - \mathrm{RH}(R, S)
= \frac{1}{n}\sum_{i=1}^{n}\left(-\log\frac{|[x_i]_R|}{n} - \log\frac{|[x_i]_S|}{n} + \log\frac{|[x_i]_{R \cup S}|}{n}\right)
= \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R \cup S}|}{|[x_i]_R| \cdot |[x_i]_S|}
= \mathrm{RI}(R; S). \qquad (17)$$

Definition 9. The rough conditional mutual information of $R$ and $S$ given $T$ is defined by

$$\mathrm{RI}(R; S \mid T) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S \cup T}| \cdot |[x_i]_S|}{|[x_i]_{R \cup S}| \cdot |[x_i]_{S \cup T}|}. \qquad (18)$$

Theorem 10. The following equations hold:
(1) $\mathrm{RI}(R; S \mid T) = \mathrm{RH}(R \mid S) - \mathrm{RH}(R \mid S, T)$;
(2) $\mathrm{RI}(R; S \mid T) = \mathrm{RI}(R, T; S) - \mathrm{RI}(R; S)$.

Proof. (1) Consider

$$\mathrm{RH}(R \mid S) - \mathrm{RH}(R \mid S, T)
= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S}|}{|[x_i]_S|} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S \cup T}|}{|[x_i]_{S \cup T}|}
= \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S \cup T}| \cdot |[x_i]_S|}{|[x_i]_{R \cup S}| \cdot |[x_i]_{S \cup T}|}
= \mathrm{RI}(R; S \mid T). \qquad (19)$$

(2) Consider

$$\mathrm{RI}(R, T; S) - \mathrm{RI}(R; S)
= \mathrm{RI}(R \cup T; S) - \mathrm{RI}(R; S)
= \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R \cup T \cup S}|}{|[x_i]_{R \cup T}| \cdot |[x_i]_S|} - \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R \cup S}|}{|[x_i]_R| \cdot |[x_i]_S|}
= \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R \cup S \cup T}| \cdot |[x_i]_S|}{|[x_i]_{R \cup S}| \cdot |[x_i]_{S \cup T}|}
= \mathrm{RI}(R; S \mid T). \qquad (20)$$
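For reference, the measures of Definitions 5-9 can be computed directly from equivalence-class sizes. The sketch below is our illustration (not the authors' code); $R$, $S$, and $T$ are lists of attribute names of a discrete table, and the formulas follow (14), (16), and (18).

from math import log2

def eq_class_sizes(table, attrs):
    """|[x_i]_attrs| for every object x_i (a quadratic but simple implementation)."""
    keys = [tuple(row[a] for a in attrs) for row in table]
    return [keys.count(k) for k in keys]

def rough_cond_entropy(table, R, S):
    """RH(R | S), cf. (14)."""
    n = len(table)
    rs, s = eq_class_sizes(table, R + S), eq_class_sizes(table, S)
    return -sum(log2(a / b) for a, b in zip(rs, s)) / n

def rough_mi(table, R, S):
    """RI(R; S), cf. (16)."""
    n = len(table)
    rs, r, s = eq_class_sizes(table, R + S), eq_class_sizes(table, R), eq_class_sizes(table, S)
    return sum(log2(n * a / (b * c)) for a, b, c in zip(rs, r, s)) / n

def rough_cond_mi(table, R, S, T):
    """RI(R; S | T), cf. (18)."""
    n = len(table)
    rst, rs = eq_class_sizes(table, R + S + T), eq_class_sizes(table, R + S)
    s, st = eq_class_sizes(table, S), eq_class_sizes(table, S + T)
    return sum(log2(a * b / (c * d)) for a, b, c, d in zip(rst, s, rs, st)) / n

table = [{"f1": 0, "f2": 1, "y": 0}, {"f1": 0, "f2": 0, "y": 0},
         {"f1": 1, "f2": 0, "y": 1}, {"f1": 1, "f2": 1, "y": 1}]
print(rough_mi(table, ["f1"], ["y"]), rough_cond_mi(table, ["f2"], ["y"], ["f1"]))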

5. A Hybrid Feature Selection Method

In this section, we propose a novel hybrid feature selection method based on rough conditional mutual information and the naive Bayesian classifier.

5.1. Feature Selection by Rough Conditional Mutual Information. Given a set of samples $U$ described by the attribute set $F$, in terms of mutual information the purpose of feature selection is to find a feature set $S$ ($S \subseteq F$) with $m$ features which jointly have the largest dependency on the target class $y$. This criterion, called max-dependency, has the following form:

$$\max D(S, y), \quad D = I(f_1, f_2, \ldots, f_m; y). \qquad (21)$$

According to the chain rule for information,

$$I(f_1, f_2, \ldots, f_m; y) = \sum_{i=1}^{m} I(f_i; y \mid f_{i-1}, f_{i-2}, \ldots, f_1); \qquad (22)$$

that is to say, at each step we can select the feature which produces the maximum conditional mutual information, formally written as

$$\max_{f \in F - S} D(f, S, y), \quad D = I(f; y \mid S), \qquad (23)$$

where $S$ represents the selected feature set.

Figure 1 illustrates the validity of this criterion. Here $f_i$ represents a feature highly correlated with $f_j$, and $f_k$ is much less correlated with $f_i$. The mutual information between the vector $(f_i, f_j)$ and $y$ is indicated by a shadowed area consisting of three different patterns of patches, that is, $I((f_i, f_j); y) = A + B + C$, where $A$, $B$, and $C$ are defined by different cases of overlap. In detail:
(1) $(A + B)$ is the mutual information between $f_i$ and $y$, that is, $I(f_i; y)$;
(2) $(B + C)$ is the mutual information between $f_j$ and $y$, that is, $I(f_j; y)$;
(3) $(B + D)$ is the mutual information between $f_i$ and $f_j$, that is, $I(f_i; f_j)$;
(4) $C$ is the conditional mutual information between $f_j$ and $y$ given $f_i$, that is, $I(f_j; y \mid f_i)$;
(5) $E$ is the mutual information between $f_k$ and $y$, that is, $I(f_k; y)$.
This illustration clearly shows that the features maximizing the mutual information not only depend on their individual predictive information, for example $(A + B) + (B + C)$, but also need to take account of the redundancy between them. In this example, feature $f_i$ should be selected first, since the mutual information between $f_i$ and $y$ is the largest, and feature $f_k$ should have priority for selection over $f_j$ in spite of the latter having larger individual mutual information with $y$. This is because $f_k$ provides more complementary information than $f_j$ does for predicting $y$ together with $f_i$ (as $E > C$ in Figure 1); that is to say, in each round we should select the feature which maximizes the conditional mutual information. From Theorem 2 we know that rough entropy equals Shannon's entropy; therefore we can select the feature which produces the maximum rough conditional mutual information.

We adopt a forward feature selection algorithm. Each single input feature is added to the selected feature set based on maximizing the rough conditional mutual information; that is, given the selected feature set $S$, we maximize the rough mutual information between $f_i$ and the target class $y$ given $S$, where $f_i$ belongs to the remaining feature set. In order to apply the rough conditional mutual information measure well in the filter model, a numerical threshold value $\beta$ ($\beta > 0$) is set on $\mathrm{RI}(f_i; y \mid S)$. This helps the algorithm to be resistant to noisy data and to overcome the overfitting problem to a certain extent [26]. The procedure is repeated as long as $\mathrm{RI}(f_i; y \mid S) > \beta$ is satisfied. The filter algorithm can be described by the following procedure.

(1) Initialization: set $F \leftarrow$ initial set of all features, $S \leftarrow$ empty set, and $y \leftarrow$ class outputs.

(2) Computation of the rough mutual information of the features with the class outputs: for each feature $f_i \in F$, compute $\mathrm{RI}(f_i; y)$.

Figure 1: Illustration of mutual information and conditional mutual information for different scenarios (overlap regions A, B, C, D, and E among the features f_i, f_j, f_k and the class y).

(3) Selection of the first feature: find the feature $f_i$ that maximizes $\mathrm{RI}(f_i; y)$; set $F \leftarrow F - \{f_i\}$ and $S \leftarrow \{f_i\}$.

(4) Greedy selection: repeat until the termination condition is satisfied:
    (a) computation of the rough conditional mutual information $\mathrm{RI}(f_i; y \mid S)$ for each feature $f_i \in F$;
    (b) selection of the next feature: choose the feature $f_i$ that maximizes $\mathrm{RI}(f_i; y \mid S)$; set $F \leftarrow F - \{f_i\}$ and $S \leftarrow S \cup \{f_i\}$.

(5) Output the set containing the selected features, $S$.
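A compact sketch of this filter procedure is given below (our illustration, not the authors' code; the rough measures are computed from equivalence-class sizes as in (16) and (18), the default threshold value of beta is an arbitrary choice, and the helper names are ours).

from math import log2

def eq_class_sizes(table, attrs):
    """|[x_i]_attrs| for every object x_i of the table."""
    keys = [tuple(row[a] for a in attrs) for row in table]
    return [keys.count(k) for k in keys]

def rough_mi(table, R, S):
    """RI(R; S), cf. (16)."""
    n = len(table)
    rs, r, s = eq_class_sizes(table, R + S), eq_class_sizes(table, R), eq_class_sizes(table, S)
    return sum(log2(n * a / (b * c)) for a, b, c in zip(rs, r, s)) / n

def rough_cond_mi(table, R, S, T):
    """RI(R; S | T), cf. (18)."""
    n = len(table)
    rst, rs = eq_class_sizes(table, R + S + T), eq_class_sizes(table, R + S)
    s, st = eq_class_sizes(table, S), eq_class_sizes(table, S + T)
    return sum(log2(a * b / (c * d)) for a, b, c, d in zip(rst, s, rs, st)) / n

def rcmi_filter(table, features, target, beta=0.01):
    """Forward filter: greedily add the feature with the largest rough (conditional) MI.
    The default threshold beta = 0.01 is illustrative; the paper sets beta experimentally."""
    remaining = list(features)
    if not remaining:
        return []
    # step (3): the first feature maximizes RI(f; y)
    best = max(remaining, key=lambda f: rough_mi(table, [f], [target]))
    selected = [best]
    remaining.remove(best)
    # step (4): greedy selection while the gain exceeds the threshold beta
    while remaining:
        gains = {f: rough_cond_mi(table, [f], [target], selected) for f in remaining}
        best = max(gains, key=gains.get)
        if gains[best] <= beta:
            break
        selected.append(best)
        remaining.remove(best)
    return selected  # step (5)

# example: rcmi_filter(table, ["f1", "f2", "f3"], "y", beta=0.0),
# where table is a list of dicts mapping attribute names (and "y") to discrete values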

5.2. Selecting the Best Feature Subset with a Wrapper Approach. The wrapper model uses the classification accuracy of a predetermined learning algorithm to determine the goodness of the selected subset. It searches for features that are better suited to the learning algorithm, aiming to improve its performance; therefore the wrapper approach generally outperforms the filter approach in terms of the final predictive accuracy of a learning machine. However, it is more computationally expensive than the filter model. Although many wrapper methods do not perform an exhaustive search, most of them still incur time complexity $O(N^2)$ [27, 28], where $N$ is the number of features of the dataset. Hence, it is worth reducing the search space before applying a wrapper feature selection approach. Starting from the filter model reduces the high computational cost and helps avoid local maxima; therefore the final feature subset obtained contains few features while the predictive accuracy remains high.

In this work we propose reducing the search space from the original feature set to the best candidate subset, which effectively reduces the computational cost of the wrapper search. Our method uses the sequential backward elimination technique to search through possible subsets of features within the candidate space.

The features are ranked according to the average accuracy of the classifier, and then features are removed one by one from the candidate feature subset only if such exclusion improves or does not change the classifier accuracy. Different kinds of learning models can be applied in wrappers; however, different learning machines have different discrimination abilities. The naive Bayesian classifier is widely used in machine learning because it is fast and easy to implement. Rennie et al. [29] show that its performance is competitive with state-of-the-art models such as SVM, while the latter has many parameters to tune. Therefore, we choose the naive Bayesian classifier as the core of the fine-tuning stage. The decremental selection procedure for selecting an optimal feature subset based on the wrapper approach is shown in Algorithm 1.

There are two phases in the wrapper algorithm, as shown in Algorithm 1. In the first stage, we compute the classification accuracy $\mathrm{Acc}_C$ of the candidate feature set $C$, which is the result of the filter model (step 1), where $\mathrm{Classperf}(D, C)$ represents the average classification accuracy on dataset $D$ with candidate features $C$; the results are obtained by 10-fold cross-validation. For each $f_i \in C$ we compute the average accuracy $Score$, and the features are then ranked according to the $Score$ value (steps 3-6). In the second stage, we process the list of ordered features from the first to the last ranked feature (steps 8-26). In this stage, for each feature in the list we consider the average accuracy of the naive Bayesian classifier when that feature is excluded. If any feature is found to lead to the most improved average accuracy and the relative accuracy [30] improves by more than $\delta_1$ (steps 11-14), the feature is removed. Otherwise, every possible feature is considered and the feature that leads to the largest average accuracy is chosen and removed (step 15): the one that leads to an improvement or no change of the average accuracy (steps 17-20), or a degradation of the relative accuracy not worse than $\delta_2$ (steps 21-24), is removed. In general, $\delta_1$ should take a value in $[0, 0.1]$ and $\delta_2$ should take a value in $[0, 0.02]$. In the following, if not specified, $\delta_1 = 0.05$ and $\delta_2 = 0.01$.

This decremental selection procedure is repeated until the termination condition is satisfied. Usually, sequential backward elimination is more computationally expensive than the incremental sequential forward search; however, it can yield a better result with respect to local maxima. In addition, the sequential forward search, adding one feature at each pass, does not take the interaction between groups of features into account [31]. In many classification problems the class variable may be affected by a group of several features rather than by any individual feature alone; therefore, the sequential forward search is unable to find the dependencies between groups of features, and performance can sometimes be degraded.

6. Experimental Results

This section illustrates the evaluation of our method in terms of the classification accuracy and the number of selected features, in order to see how well the filter-wrapper approach performs in the situation of large and middle-sized feature sets.

Input: data set D, candidate feature set C
Output: an optimal feature set B

(1) Acc_C = Classperf(D, C)
(2) set B = {}
(3) for all f_i in C do
(4)     Score = Classperf(D, {f_i})
(5)     append f_i to B
(6) end for
(7) sort B in ascending order according to the Score value
(8) while |B| > 1 do
(9)     for all f_i in B, according to the order, do
(10)        Acc_{f_i} = Classperf(D, B - {f_i})
(11)        if (Acc_{f_i} - Acc_C) / Acc_C > delta_1 then
(12)            B = B - {f_i}; Acc_C = Acc_{f_i}
(13)            go to Step 8
(14)        end if
(15)        select f_i with the maximum Acc_{f_i}
(16)    end for
(17)    if Acc_{f_i} >= Acc_C then
(18)        B = B - {f_i}; Acc_C = Acc_{f_i}
(19)        go to Step 8
(20)    end if
(21)    if (Acc_C - Acc_{f_i}) / Acc_C <= delta_2 then
(22)        B = B - {f_i}; Acc_C = Acc_{f_i}
(23)        go to Step 8
(24)    end if
(25)    go to Step 27
(26) end while
(27) return the optimal feature subset B

Algorithm 1: Wrapper algorithm.
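Algorithm 1 can be sketched in Python as follows (our paraphrase, not the authors' code; `classperf` stands for the 10-fold cross-validated accuracy of a naive Bayesian classifier and is passed in as a callback, since the paper does not fix an implementation; `delta1` and `delta2` default to the values 0.05 and 0.01 given in the text).

def wrapper_select(data, candidate, classperf, delta1=0.05, delta2=0.01):
    """Sequential backward elimination over the candidate feature list (cf. Algorithm 1).
    classperf(data, feature_list) must return an average classification accuracy."""
    acc_c = classperf(data, candidate)                                 # step 1
    ranked = sorted(candidate, key=lambda f: classperf(data, [f]))     # steps 2-7
    B = list(ranked)
    while len(B) > 1:                                                  # step 8
        removed = False
        best_f, best_acc = None, -1.0
        for f in B:                                                    # steps 9-16
            acc_f = classperf(data, [g for g in B if g != f])
            if (acc_f - acc_c) / acc_c > delta1:                       # clear improvement: drop f at once
                B.remove(f); acc_c = acc_f; removed = True
                break
            if acc_f > best_acc:
                best_f, best_acc = f, acc_f
        if removed:
            continue
        if best_acc >= acc_c or (acc_c - best_acc) / acc_c <= delta2:  # steps 17-24
            B.remove(best_f); acc_c = best_acc
        else:
            break                                                      # step 25: stop
    return B                                                           # step 27

# usage: B = wrapper_select(dataset, candidate_features, classperf=my_cv_naive_bayes_accuracy)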

Table 1: Experimental data description.

Number | Dataset    | Instances | Features | Classes
1      | Arrhythmia | 452       | 279      | 16
2      | Hepatitis  | 155       | 19       | 2
3      | Ionosphere | 351       | 34       | 2
4      | Segment    | 2310      | 19       | 7
5      | Sonar      | 208       | 60       | 2
6      | Soybean    | 683       | 35       | 19
7      | Vote       | 435       | 16       | 2
8      | WDBC       | 569       | 16       | 2
9      | Wine       | 178       | 12       | 3
10     | Zoo        | 101       | 16       | 7

In addition, the performance of the rough conditional mutual information algorithm is compared with three typical feature selection methods, which are based on three different evaluation criteria, respectively. These methods include correlation-based feature selection (CFS), the consistency-based algorithm, and min-redundancy max-relevance (mRMR). The results illustrate the efficiency and effectiveness of our method.

In order to compare our hybrid method with some classical techniques, 10 databases were downloaded from the UCI repository of machine learning databases. All these datasets are widely used by the data mining community for evaluating learning algorithms. The details of the 10 UCI experimental datasets are listed in Table 1. The sizes of the databases vary from 101 to 2310, the numbers of original features vary from 12 to 279, and the numbers of classes vary from 2 to 19.

6.1. Unselect versus CFS, Consistency-Based Algorithm, mRMR, and RCMI. In Section 5, rough conditional mutual information is used to filter the redundant and irrelevant features.

Table 2: Number and accuracy of selected features with different algorithms, tested by naive Bayes.

Number  | Unselect Acc (%) | CFS Acc (%) / Features | Consistency Acc (%) / Features | mRMR Acc (%) / Features | RCMI Acc (%) / Features
1       | 75.00            | 78.54 / 22             | 75.66 / 24                     | 75.00 / 21              | 77.43 / 16
2       | 84.52            | 89.03 / 8              | 85.81 / 13                     | 87.74 / 7               | 85.13 / 8
3       | 90.60            | 92.59 / 13             | 90.88 / 7                      | 90.60 / 7               | 94.87 / 6
4       | 91.52            | 93.51 / 8              | 93.20 / 9                      | 93.98 / 5               | 93.07 / 3
5       | 85.58            | 77.88 / 19             | 82.21 / 14                     | 84.13 / 10              | 86.06 / 15
6       | 92.09            | 91.80 / 24             | 84.63 / 13                     | 90.48 / 21              | 91.22 / 24
7       | 90.11            | 94.71 / 5              | 91.95 / 12                     | 95.63 / 1               | 95.63 / 2
8       | 95.78            | 96.66 / 11             | 96.84 / 8                      | 95.78 / 9               | 95.43 / 4
9       | 98.31            | 98.88 / 10             | 99.44 / 5                      | 98.88 / 8               | 98.88 / 5
10      | 95.05            | 95.05 / 10             | 93.07 / 5                      | 94.06 / 4               | 95.05 / 10
Average | 89.86            | 90.87 / 13             | 89.37 / 11                     | 90.63 / 9.3             | 91.28 / 9.3

Table 3: Number and accuracy of selected features with different algorithms, tested by CART.

Number  | Unselect Acc (%) | CFS Acc (%) / Features | Consistency Acc (%) / Features | mRMR Acc (%) / Features | RCMI Acc (%) / Features
1       | 72.35            | 72.57 / 22             | 71.90 / 24                     | 73.45 / 21              | 75.00 / 16
2       | 79.35            | 81.94 / 8              | 80.00 / 13                     | 83.23 / 7               | 85.16 / 8
3       | 89.74            | 90.31 / 13             | 89.46 / 7                      | 86.89 / 7               | 91.74 / 6
4       | 96.15            | 96.10 / 8              | 95.28 / 9                      | 95.76 / 5               | 95.54 / 3
5       | 74.52            | 74.04 / 19             | 77.88 / 14                     | 76.44 / 10              | 77.40 / 15
6       | 92.53            | 91.65 / 24             | 85.94 / 13                     | 92.24 / 21              | 91.36 / 24
7       | 95.63            | 95.63 / 5              | 95.63 / 12                     | 95.63 / 1               | 96.09 / 2
8       | 93.50            | 94.55 / 11             | 94.73 / 8                      | 95.43 / 9               | 94.90 / 4
9       | 94.94            | 94.94 / 10             | 96.07 / 5                      | 93.26 / 8               | 94.94 / 5
10      | 92.08            | 93.07 / 10             | 92.08 / 5                      | 92.08 / 4               | 93.07 / 10
Average | 88.08            | 88.48 / 13             | 87.90 / 11                     | 88.44 / 9.3             | 89.52 / 9.3

In order to compute the rough mutual information, we employ Fayyad and Irani's MDL discretization algorithm [32] to transform continuous features into discrete ones.

We use the naive Bayesian and CART classifiers to test the classification accuracy of the features selected by the different feature selection methods. The results in Tables 2 and 3 show the classification accuracies and the number of selected features obtained with the original features (unselect), RCMI, and the other feature selectors. According to Tables 2 and 3, the features selected by RCMI have the highest average accuracy in terms of both naive Bayes and CART. It can also be observed that RCMI achieves the smallest average number of selected features, the same as mRMR. This shows that RCMI is better than CFS and the consistency-based algorithm and is comparable to mRMR.

In addition, to illustrate the efficiency of RCMI, we experiment on the Ionosphere, Sonar, and Wine datasets, respectively. Different numbers of selected features obtained by RCMI and mRMR are tested on the naive Bayesian classifier, as presented in Figures 2, 3, and 4. In Figures 2-4 the classification accuracies are the results of 10-fold cross-validation tested by naive Bayes; the number $k$ on the x-axis refers to the first $k$ features in the order selected by the different methods. The results in Figures 2-4 show that the average accuracy of the classifier with RCMI is comparable to mRMR in the majority of cases. We can see that the maximum value of the plot for each dataset with the RCMI method is higher than with mRMR. For example, the highest accuracy on Ionosphere achieved by RCMI is 94.87%, while the highest accuracy achieved by mRMR is 90.60%. At the same time, we can also notice that the RCMI method attains the maximum value at more feature-subset sizes than mRMR. This shows that RCMI is an effective measure for feature selection.

However, the number of features selected by the RCMI method is still relatively large for some datasets. Therefore, to improve performance and further reduce the number of selected features, the wrapper method is applied. With the redundant and irrelevant features already removed, the classifier at the core of the wrapper fine-tuning can run much faster.

6.2. Filter Wrapper versus RCMI and Unselect. Similarly, we also use naive Bayes and CART to test the classification accuracy of the features selected by the filter wrapper, RCMI, and unselect. The results in Tables 4 and 5 show the classification accuracies and the number of selected features.

Now we analyze the performance of these selected features. First, we can conclude that although most of the features are removed from the raw data, the classification accuracies do not decrease.

Table 4: Number and accuracy of selected features based on naive Bayes.

Number  | Unselect Acc (%) / Features | RCMI Acc (%) / Features | Filter wrapper Acc (%) / Features
1       | 75.00 / 279                 | 77.43 / 16              | 78.76 / 9
2       | 84.52 / 19                  | 85.13 / 8               | 85.81 / 3
3       | 90.60 / 34                  | 94.87 / 6               | 94.87 / 6
4       | 91.52 / 19                  | 93.07 / 3               | 94.33 / 3
5       | 85.58 / 60                  | 86.06 / 15              | 86.54 / 6
6       | 92.09 / 35                  | 91.22 / 24              | 92.24 / 10
7       | 90.11 / 16                  | 95.63 / 2               | 95.63 / 1
8       | 95.78 / 16                  | 95.43 / 4               | 95.43 / 4
9       | 98.31 / 12                  | 98.88 / 5               | 96.07 / 3
10      | 95.05 / 16                  | 95.05 / 10              | 95.05 / 8
Average | 89.86 / 50.6                | 91.28 / 9.3             | 91.47 / 5.3

Table 5: Number and accuracy of selected features based on CART.

Number  | Unselect Acc (%) / Features | RCMI Acc (%) / Features | Filter wrapper Acc (%) / Features
1       | 72.35 / 279                 | 75.00 / 16              | 75.89 / 9
2       | 79.35 / 19                  | 85.16 / 8               | 85.81 / 3
3       | 89.74 / 34                  | 91.74 / 6               | 91.74 / 6
4       | 96.15 / 19                  | 95.54 / 3               | 95.80 / 3
5       | 74.52 / 60                  | 77.40 / 15              | 80.77 / 6
6       | 92.53 / 35                  | 91.36 / 24              | 91.95 / 10
7       | 95.63 / 16                  | 96.09 / 2               | 95.63 / 1
8       | 93.50 / 16                  | 94.90 / 4               | 94.90 / 4
9       | 94.94 / 12                  | 94.94 / 5               | 93.82 / 3
10      | 92.08 / 16                  | 93.07 / 10              | 93.07 / 8
Average | 88.08 / 50.6                | 89.52 / 9.3             | 89.94 / 5.3

Figure 2: Classification accuracy for different numbers of selected features on the Ionosphere dataset (naive Bayes); x-axis: number of features, y-axis: classification accuracy; curves: RCMI and mRMR.

Figure 3: Classification accuracy for different numbers of selected features on the Sonar dataset (naive Bayes); x-axis: number of features, y-axis: classification accuracy; curves: RCMI and mRMR.

Figure 4: Classification accuracy for different numbers of selected features on the Wine dataset (naive Bayes); x-axis: number of features, y-axis: classification accuracy; curves: RCMI and mRMR.

On the contrary, the classification accuracies increase for the majority of datasets. The average accuracies obtained with RCMI and with the filter-wrapper method are both higher than those of the unselected datasets with respect to naive Bayes and CART. With respect to the naive Bayesian learning algorithm, the average accuracy is 91.47% for the filter wrapper and 89.86% for unselect; the average classification accuracy increased by 1.8%. With respect to the CART learning algorithm, the average accuracy is 89.94% for the filter wrapper and 88.08% for unselect; the average classification accuracy increased by 2.1%. The average number of selected features is 5.3 for the filter wrapper, compared with 9.3 for RCMI and 50.6 for unselect; the average number of selected features was reduced by 43% and 89.5%, respectively. Therefore, the average classification accuracy and number of features obtained with the filter-wrapper method are better than those obtained with RCMI alone and with unselect. In other words, using RCMI and the wrapper method as a hybrid improves the classification efficiency and accuracy compared with using the RCMI method individually.

7. Conclusion

The main goal of feature selection is to find a feature subset that is as small as possible while retaining high prediction accuracy. A hybrid feature selection approach which takes advantage of both the filter model and the wrapper model has been presented in this paper. In the filter model, measuring the relevance between features plays an important role, and a number of measures have been proposed. Mutual information is widely used for its robustness; however, it is difficult to compute, especially multivariate mutual information. We proposed a set of rough-set-based metrics to measure the relevance between features and analyzed some important properties of these uncertainty measures. We have proved that RCMI can substitute Shannon's conditional mutual information; thus RCMI can be used as an effective measure to filter the irrelevant and redundant features. Based on the candidate feature subset selected by RCMI, the naive Bayesian classifier is applied in the wrapper model. The accuracy of the naive Bayesian and CART classifiers was used to evaluate the goodness of feature subsets. The performance of the proposed method was evaluated on ten UCI datasets. The experimental results show that the filter-wrapper method outperformed CFS, the consistency-based algorithm, and mRMR in most cases. Our technique not only chooses a small subset of features from the candidate subset but also provides good predictive accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (70971137).

References

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[2] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1-4, pp. 131-156, 1997.
[3] M. Dash and H. Liu, "Consistency-based search in feature selection," Artificial Intelligence, vol. 151, no. 1-2, pp. 155-176, 2003.
[4] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1996, http://mlearn.ics.uci.edu/MLRepository.html.
[5] K. Kira and L. A. Rendell, "Feature selection problem: traditional methods and a new algorithm," in Proceedings of the 9th National Conference on Artificial Intelligence (AAAI '92), pp. 129-134, July 1992.
[6] M. Robnik-Sikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1-2, pp. 23-69, 2003.
[7] I. Kononenko, "Estimating attributes: analysis and extension of RELIEF," in Proceedings of the European Conference on Machine Learning (ECML '94), pp. 171-182, 1994.
[8] M. A. Hall, Correlation-based feature subset selection for machine learning [Ph.D. thesis], Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.
[9] J. G. Bazan, "A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision table," in Rough Sets in Knowledge Discovery, L. Polkowski and A. Skowron, Eds., pp. 321-365, Physica, Heidelberg, Germany, 1998.
[10] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537-550, 1994.
[11] N. Kwak and C. H. Choi, "Input feature selection for classification problems," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143-159, 2002.
[12] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[13] J. Martínez Sotoca and F. Pla, "Supervised feature selection by clustering using conditional mutual information-based distances," Pattern Recognition, vol. 43, no. 6, pp. 2068-2081, 2010.
[14] B. Guo, R. I. Damper, S. R. Gunn, and J. D. B. Nelson, "A fast separability-based feature-selection method for high-dimensional remotely sensed image classification," Pattern Recognition, vol. 41, no. 5, pp. 1670-1679, 2008.
[15] D. W. Scott, Multivariate Density Estimation: Theory, Practice and Visualization, John Wiley & Sons, New York, NY, USA, 1992.
[16] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, UK, 1986.
[17] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, Article ID 066138, 16 pages, 2004.
[18] T. Beaubouef, F. E. Petry, and G. Arora, "Information-theoretic measures of uncertainty for rough sets and rough relational databases," Information Sciences, vol. 109, no. 1-4, pp. 185-195, 1998.
[19] I. Düntsch and G. Gediga, "Uncertainty measures of rough set prediction," Artificial Intelligence, vol. 106, no. 1, pp. 109-137, 1998.
[20] G. J. Klir and M. J. Wierman, Uncertainty Based Information: Elements of Generalized Information Theory, Physica, New York, NY, USA, 1999.
[21] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set theory," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 12, no. 1, pp. 37-46, 2004.
[22] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423, 1948.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[24] Z. Pawlak, "Rough sets," International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341-356, 1982.
[25] X. Hu and N. Cercone, "Learning in relational databases: a rough set approach," Computational Intelligence, vol. 11, no. 2, pp. 323-338, 1995.
[26] J. Huang, Y. Cai, and X. Xu, "A hybrid genetic algorithm for feature selection wrapper based on mutual information," Pattern Recognition Letters, vol. 28, no. 13, pp. 1825-1844, 2007.
[27] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491-502, 2005.
[28] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205-1224, 2004.
[29] J. D. M. Rennie, L. Shih, J. Teevan, and D. Karger, "Tackling the poor assumptions of naive Bayes text classifiers," in Proceedings of the Twentieth International Conference on Machine Learning, pp. 616-623, Washington, DC, USA, August 2003.
[30] R. Setiono and H. Liu, "Neural-network feature selector," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 654-662, 1997.
[31] S. Foithong, O. Pinngern, and B. Attachoo, "Feature subset selection wrapper based on mutual information and rough sets," Expert Systems with Applications, vol. 39, no. 1, pp. 574-584, 2012.
[32] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022-1027, Morgan Kaufmann, San Mateo, Calif, USA, 1993.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 3: Research Article A Hybrid Feature Selection Method Based ...downloads.hindawi.com/archive/2014/382738.pdf · out that there are four basic steps in a typical feature selec-tion method,

ISRN Applied Mathematics 3

random variables The entropy is a measure of uncertainty ofrandom variables [23] Let 119883 = 119909

1 1199092 119909

119899 be a discrete

random variable and let 119901(119909119894) be the probability of 119909

119894 the

entropy of119883 is defined by the following

119867(119883) = minus119899

sum119894=1

119901 (119909119894) log119901 (119909

119894) (1)

Here the base of log is 2 and the unit of entropy is the bit If119883and 119884 = 119910

1 1199102 119910

119898 are two discrete random variables

the joint probability is 119901(119909119894 119910119895) where 119894 = 1 2 119899 and 119895 =

1 2 119898 The joint entropy of119883 and 119884 is as follows

119867(119883 119884) = minus119899

sum119894=1

119898

sum119895=1

119901 (119909119894 119910119895) log119901 (119909

119894 119910119895) (2)

When certain variables are known and others are notknown the remaining uncertainty is measured by the con-ditional entropy as follows

119867(119883 | 119884) = 119867 (119883 119884) minus 119867 (119884)

= minus119899

sum119894=1

119898

sum119895=1

119901 (119909119894 119910119895) log119901 (119909

119894| 119910119895)

(3)

The information found commonly in two random variables isof importance and this is defined as the mutual informationbetween two variables as follows

119868 (119883 119884) =119899

sum119894=1

119898

sum119895=1

119901 (119909119894 119910119895) log

119901 (119909119894| 119910119895)

119901 (119909119894)

(4)

If the mutual information between two random variables islarge (small) it means two variables are closely (not closely)related If the mutual information becomes zero the tworandomvariables are totally unrelated or the two variables areindependent The mutual information and the entropy havethe following relation

119868 (119883 119884) = 119867 (119883) minus 119867 (119883 | 119884)

119868 (119883 119884) = 119867 (119884) minus 119867 (119883 | 119884)

119868 (119883 119884) = 119867 (119883) + 119867 (119884) minus 119867 (119883 119884)

119868 (119883119883) = 119867 (119883)

(5)

For continuous random variables the entropy and mutualinformation are defined as follows

119867(119883) = minus int119901 (119909) log119901 (119909) 119889119909

119868 (119883 119884) =int119901 (119909 119910) log119901 (119909 119910)

119901 (119909) 119901 (119910)119889119909 119889119910

(6)

Conditional mutual information is the reduction in theuncertainty of119883 due to knowledge of 119884 when119885 is givenTheconditionalmutual information of random variables119883 and119884given 119885 is defined by the following

119868 (119883 119884 | 119885) = 119867 (119883 | 119885) minus 119867 (119883 | 119884 119885) (7)

Mutual information satisfies a chain rule that is

119868 (1198831 1198832 119883

119899 119884) =

119899

sum119894=1

119868 (119883119894 119884 | 119883

119894minus1 119883119894minus2 119883

1)

(8)

32 Rough Sets Rough sets theory introduced by Pawlak[24] is a mathematical tool to handle imprecision uncer-tainty and vagueness It has been applied in many fields[25] such as machine learning data mining and patternrecognition

The notion of an information system provides a conve-nient basis for the representation of objects in terms of theirattributes An information system is a pair of (119880 119860) where119880 is a nonempty finite set of objects called the universe and119860 is a nonempty finite set of attributes that is 119886 119880 rarr 119881

119886

for 119886 isin 119860 where 119881119886is called the domain of 119886 A decision

table is a special case of information system 119878 = (119880 119860 cup 119889)where attributes in 119860 are called condition attributes and 119889 isa designated attribute called the decision attribute

For every set of attributes $B \subseteq A$, an indiscernibility relation $IND(B)$ is defined in the following way: two objects $x_i$ and $x_j$ are indiscernible by the set of attributes $B$ in $A$ if $b(x_i) = b(x_j)$ for every $b \in B$. The equivalence classes of $IND(B)$ are called elementary sets in $B$ because they represent the smallest discernible groups of objects. For any element $x_i$ of $U$, the equivalence class of $x_i$ in the relation $IND(B)$ is represented as $[x_i]_B$. For $B \subseteq A$, the indiscernibility relation $IND(B)$ constitutes a partition of $U$, which is denoted by $U/IND(B)$.

Given an information system $S = (U, A)$, for any subset $X \subseteq U$ and equivalence relation $IND(B)$, the $B$-lower and $B$-upper approximations of $X$ are defined, respectively, as follows:
$$ \underline{B}(X) = \{x \in U : [x]_B \subseteq X\}, \qquad \overline{B}(X) = \{x \in U : [x]_B \cap X \neq \emptyset\}. \quad (9) $$
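For illustration, the partition $U/IND(B)$ and the approximations in (9) can be computed directly from a table of attribute values. The following Python sketch is ours (not the authors' code) and assumes each object is represented as a dictionary mapping attribute names to values.

from collections import defaultdict

def partition(objects, attrs):
    # U/IND(B): group object indices by their values on the attribute subset B
    blocks = defaultdict(list)
    for idx, obj in enumerate(objects):
        blocks[tuple(obj[a] for a in attrs)].append(idx)
    return list(blocks.values())

def approximations(objects, attrs, target):
    # B-lower and B-upper approximations of the object set `target` (indices), Eq. (9)
    target = set(target)
    lower, upper = set(), set()
    for block in partition(objects, attrs):
        block_set = set(block)
        if block_set <= target:      # [x]_B is contained in X
            lower |= block_set
        if block_set & target:       # [x]_B intersects X
            upper |= block_set
    return lower, upper

U = [{"a": 1, "b": 0}, {"a": 1, "b": 0}, {"a": 0, "b": 1}, {"a": 0, "b": 0}]
print(approximations(U, ["a"], {0, 1, 2}))   # -> ({0, 1}, {0, 1, 2, 3})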

4. Rough Entropy-Based Metrics

In this section, the concept of rough entropy is introduced to measure the uncertainty of knowledge in an information system, and then some rough entropy-based uncertainty measures are presented. Some important properties of these uncertainty measures are deduced, and the relationships among them are discussed as well.

Definition 1. Given a set of samples $U = \{x_1, x_2, \ldots, x_n\}$ described by features $F$, and a subset of attributes $S \subseteq F$, the rough entropy of the sample $x_i$ is defined by
$$ \mathrm{RH}_{x_i}(S) = -\log\frac{|[x_i]_S|}{n}, \quad (10) $$
and the average entropy of the set of samples is computed as
$$ \mathrm{RH}(S) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n}, \quad (11) $$
where $|[x_i]_S|$ is the cardinality of $[x_i]_S$.


Since $\{x_i\} \subseteq [x_i]_S \subseteq U$ for all $x_i$, we have $1/n \le |[x_i]_S|/n \le 1$, and so $0 \le \mathrm{RH}(S) \le \log n$. $\mathrm{RH}(S) = \log n$ if and only if $|[x_i]_S| = 1$ for all $x_i$, that is, $U/IND(S) = \{\{x_1\}, \{x_2\}, \ldots, \{x_n\}\}$; $\mathrm{RH}(S) = 0$ if and only if $|[x_i]_S| = n$ for all $x_i$, that is, $U/IND(S) = \{U\}$. Obviously, when the knowledge $S$ can distinguish any two objects, the rough entropy is largest; when the knowledge $S$ cannot distinguish any two objects, the rough entropy is zero.

Theorem 2. $\mathrm{RH}(S) = H(S)$, where $H(S)$ is Shannon's entropy.

Proof. Suppose $U = \{x_1, x_2, \ldots, x_n\}$ and $U/IND(S) = \{X_1, X_2, \ldots, X_m\}$, where $X_i = \{x_{i1}, x_{i2}, \ldots, x_{i|X_i|}\}$; then $H(S) = -\sum_{i=1}^{m}(|X_i|/n)\log(|X_i|/n)$. Because $X_i \cap X_j = \emptyset$ for $i \neq j$ and $X_i = [x]_S$ for any $x \in X_i$, we have
$$ \begin{aligned} \mathrm{RH}(S) &= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n} = \sum_{x \in X_1}\left(-\frac{1}{n}\log\frac{|[x]_S|}{n}\right) + \cdots + \sum_{x \in X_m}\left(-\frac{1}{n}\log\frac{|[x]_S|}{n}\right)\\ &= \left(-\frac{|X_1|}{n}\log\frac{|X_1|}{n}\right) + \cdots + \left(-\frac{|X_m|}{n}\log\frac{|X_m|}{n}\right) = -\sum_{i=1}^{m}\frac{|X_i|}{n}\log\frac{|X_i|}{n} = H(S). \end{aligned} \quad (12) $$

Theorem 2 shows that the rough entropy equals Shannon's entropy.
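Theorem 2 is also easy to check numerically: the per-sample form (11) and the per-block Shannon form coincide on any decision table. A small Python sketch of this check, using the same dictionary-based representation as above (our own illustration), is given below.

import math
from collections import Counter

def rough_entropy(objects, attrs):
    # RH(S) = -(1/n) * sum_i log2(|[x_i]_S| / n), Eq. (11)
    n = len(objects)
    block_size = Counter(tuple(obj[a] for a in attrs) for obj in objects)
    return -sum(math.log2(block_size[tuple(obj[a] for a in attrs)] / n) for obj in objects) / n

def shannon_entropy_of_partition(objects, attrs):
    # H(S) = -sum_k (|X_k|/n) * log2(|X_k|/n) over the blocks of U/IND(S)
    n = len(objects)
    block_size = Counter(tuple(obj[a] for a in attrs) for obj in objects)
    return -sum((c / n) * math.log2(c / n) for c in block_size.values())

U = [{"a": 1}, {"a": 1}, {"a": 0}, {"a": 2}]
assert abs(rough_entropy(U, ["a"]) - shannon_entropy_of_partition(U, ["a"])) < 1e-12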

Definition 3. Suppose $R, S \subseteq F$ are two subsets of attributes; the joint rough entropy is defined as
$$ \mathrm{RH}(R, S) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{n}. \quad (13) $$
Since $[x_i]_{R\cup S} = [x_i]_R \cap [x_i]_S$, we have $\mathrm{RH}(R, S) = -(1/n)\sum_{i=1}^{n}\log(|[x_i]_R \cap [x_i]_S|/n)$. According to Definition 3, we can observe that $\mathrm{RH}(R, S) = \mathrm{RH}(R \cup S)$.

Theorem 4. $\mathrm{RH}(R, S) \ge \mathrm{RH}(R)$ and $\mathrm{RH}(R, S) \ge \mathrm{RH}(S)$.

Proof. For all $x_i \in U$ we have $[x_i]_{S\cup R} \subseteq [x_i]_S$ and $[x_i]_{S\cup R} \subseteq [x_i]_R$, and hence $|[x_i]_{S\cup R}| \le |[x_i]_S|$ and $|[x_i]_{S\cup R}| \le |[x_i]_R|$. Therefore $\mathrm{RH}(R, S) \ge \mathrm{RH}(R)$ and $\mathrm{RH}(R, S) \ge \mathrm{RH}(S)$.

Definition 5. Suppose $R, S \subseteq F$ are two subsets of attributes; the conditional rough entropy of $R$ given $S$ is defined as
$$ \mathrm{RH}(R \mid S) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{|[x_i]_S|}. \quad (14) $$

Theorem 6 (chain rule). $\mathrm{RH}(R \mid S) = \mathrm{RH}(R, S) - \mathrm{RH}(S)$.

Proof.
$$ \begin{aligned} \mathrm{RH}(R, S) - \mathrm{RH}(S) &= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{n} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n}\\ &= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{|[x_i]_S|} = \mathrm{RH}(R \mid S). \end{aligned} \quad (15) $$

Definition 7. Suppose $R, S \subseteq F$ are two subsets of attributes; the rough mutual information of $R$ and $S$ is defined as
$$ \mathrm{RI}(R; S) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R\cup S}|}{|[x_i]_R|\cdot|[x_i]_S|}. \quad (16) $$

Theorem 8 (relation between rough mutual information and rough entropy). The following hold:
(1) $\mathrm{RI}(R; S) = \mathrm{RI}(S; R)$;
(2) $\mathrm{RI}(R; S) = \mathrm{RH}(R) + \mathrm{RH}(S) - \mathrm{RH}(R, S)$;
(3) $\mathrm{RI}(R; S) = \mathrm{RH}(R) - \mathrm{RH}(R \mid S) = \mathrm{RH}(S) - \mathrm{RH}(S \mid R)$.

Proof. The conclusions of (1) and (3) are straightforward; here we give the proof of property (2):
$$ \begin{aligned} \mathrm{RH}(R) + \mathrm{RH}(S) - \mathrm{RH}(R, S) &= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_R|}{n} - \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_S|}{n} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S}|}{n}\\ &= \frac{1}{n}\sum_{i=1}^{n}\left(-\log\frac{|[x_i]_R|}{n} - \log\frac{|[x_i]_S|}{n} + \log\frac{|[x_i]_{R\cup S}|}{n}\right)\\ &= \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R\cup S}|}{|[x_i]_R|\cdot|[x_i]_S|} = \mathrm{RI}(R; S). \end{aligned} \quad (17) $$

Definition 9. The rough conditional mutual information of $R$ and $S$ given $T$ is defined by
$$ \mathrm{RI}(R; S \mid T) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S\cup T}|\cdot|[x_i]_T|}{|[x_i]_{R\cup T}|\cdot|[x_i]_{S\cup T}|}. \quad (18) $$

Theorem 10. The following equations hold:
(1) $\mathrm{RI}(R; S \mid T) = \mathrm{RH}(R \mid T) - \mathrm{RH}(R \mid T, S)$;
(2) $\mathrm{RI}(R; S \mid T) = \mathrm{RI}(R; S, T) - \mathrm{RI}(R; T)$.


Proof. (1) Consider
$$ \begin{aligned} \mathrm{RH}(R \mid T) - \mathrm{RH}(R \mid T, S) &= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup T}|}{|[x_i]_T|} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S\cup T}|}{|[x_i]_{S\cup T}|}\\ &= \frac{1}{n}\sum_{i=1}^{n}\left(-\log\frac{|[x_i]_{R\cup T}|}{|[x_i]_T|} + \log\frac{|[x_i]_{R\cup S\cup T}|}{|[x_i]_{S\cup T}|}\right)\\ &= \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S\cup T}|\cdot|[x_i]_T|}{|[x_i]_{R\cup T}|\cdot|[x_i]_{S\cup T}|} = \mathrm{RI}(R; S \mid T). \end{aligned} \quad (19) $$

(2) Consider
$$ \begin{aligned} \mathrm{RI}(R; S, T) - \mathrm{RI}(R; T) &= \mathrm{RI}(R; S\cup T) - \mathrm{RI}(R; T)\\ &= \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R\cup S\cup T}|}{|[x_i]_R|\cdot|[x_i]_{S\cup T}|} - \frac{1}{n}\sum_{i=1}^{n}\log\frac{n\,|[x_i]_{R\cup T}|}{|[x_i]_R|\cdot|[x_i]_T|}\\ &= \frac{1}{n}\sum_{i=1}^{n}\left(\log\frac{n\,|[x_i]_{R\cup S\cup T}|}{|[x_i]_R|\cdot|[x_i]_{S\cup T}|} - \log\frac{n\,|[x_i]_{R\cup T}|}{|[x_i]_R|\cdot|[x_i]_T|}\right)\\ &= \frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_{R\cup S\cup T}|\cdot|[x_i]_T|}{|[x_i]_{R\cup T}|\cdot|[x_i]_{S\cup T}|} = \mathrm{RI}(R; S \mid T). \end{aligned} \quad (20) $$

5. A Hybrid Feature Selection Method

In this section we propose a novel hybrid feature selection method based on rough conditional mutual information and the naive Bayesian classifier.

5.1. Feature Selection by Rough Conditional Mutual Information. Given a set of samples $U$ described by the attribute set $F$, in terms of mutual information the purpose of feature selection is to find a feature set $S$ ($S \subseteq F$) with $m$ features which jointly have the largest dependency on the target class $y$. This criterion, called max-dependency, has the following form:
$$ \max D(S, y), \quad D = I(f_1, f_2, \ldots, f_m; y). \quad (21) $$

According to the chain rule for information,
$$ I(f_1, f_2, \ldots, f_m; y) = \sum_{i=1}^{m} I(f_i; y \mid f_{i-1}, f_{i-2}, \ldots, f_1); \quad (22) $$

that is to say, we can select the feature which produces the maximum conditional mutual information, formally written as
$$ \max_{f \in F - S} D(f, S, y), \quad D = I(f; y \mid S), \quad (23) $$

where $S$ represents the selected feature set.

Figure 1 illustrates the validity of this criterion. Here $f_i$ represents a feature highly correlated with $f_j$, and $f_k$ is much less correlated with $f_i$. The mutual information between the vector $(f_i, f_j)$ and $y$ is indicated by a shadowed area consisting of three different patterns of patches; that is, $I((f_i, f_j); y) = A + B + C$, where $A$, $B$, and $C$ are defined by different cases of overlap. In detail:
(1) $(A + B)$ is the mutual information between $f_i$ and $y$, that is, $I(f_i; y)$;
(2) $(B + C)$ is the mutual information between $f_j$ and $y$, that is, $I(f_j; y)$;
(3) $(B + D)$ is the mutual information between $f_i$ and $f_j$, that is, $I(f_i; f_j)$;
(4) $C$ is the conditional mutual information between $f_j$ and $y$ given $f_i$, that is, $I(f_j; y \mid f_i)$;
(5) $E$ is the mutual information between $f_k$ and $y$, that is, $I(f_k; y)$.

This illustration clearly shows that the features maximizing the mutual information not only depend on their individual predictive information, for example, $(A + B) + (B + C)$, but also need to take account of the redundancy between them. In this example, feature $f_i$ should be selected first, since the mutual information between $f_i$ and $y$ is the largest, and feature $f_k$ should have priority for selection over $f_j$, in spite of the latter having larger individual mutual information with $y$. This is because $f_k$ provides more complementary information to feature $f_i$ for predicting $y$ than $f_j$ does (as $E > C$ in Figure 1);

that is to say, in each round we should select the feature which maximizes the conditional mutual information. From Theorem 2 we know that rough entropy equals Shannon's entropy; therefore, we can select the feature which produces the maximum rough conditional mutual information.

We adopt a forward selection algorithm to select features. Each single input feature is added to the selected feature set based on maximizing rough conditional mutual information; that is, given the selected feature set $S$, we maximize the rough conditional mutual information of $f_i$ and the target class $y$, where $f_i$ belongs to the remaining feature set. In order to apply the rough conditional mutual information measure well in the filter model, a numerical threshold value $\beta$ ($\beta > 0$) is set on $\mathrm{RI}(f_i; y \mid S)$. This helps the algorithm to be resistant to noisy data and to overcome the overfitting problem to a certain extent [26]. The procedure is repeated as long as $\mathrm{RI}(f_i; y \mid S) > \beta$ is satisfied. The filter algorithm can be described by the following procedure; an illustrative Python sketch is given after step (5).

(1) Initialization: set $F \leftarrow$ "initial set of all features", $S \leftarrow$ "empty set", and $y \leftarrow$ "class outputs".

(2) Computation of the rough mutual information of the features with the class outputs: for each feature $f_i \in F$, compute $\mathrm{RI}(f_i; y)$.


[Figure 1: Illustration of mutual information and conditional mutual information for different scenarios (regions $A$, $B$, $C$, $D$, and $E$ over the variables $f_i$, $f_j$, $f_k$, and $y$).]

(3) Selection of the first feature: find the feature $f_i$ that maximizes $\mathrm{RI}(f_i; y)$; set $F \leftarrow F - \{f_i\}$ and $S \leftarrow \{f_i\}$.

(4) Greedy selection: repeat until the termination condition is satisfied:
(a) computation of the rough conditional mutual information $\mathrm{RI}(f_i; y \mid S)$ for each feature $f_i \in F$;
(b) selection of the next feature: choose the feature $f_i$ that maximizes $\mathrm{RI}(f_i; y \mid S)$; set $F \leftarrow F - \{f_i\}$ and $S \leftarrow S \cup \{f_i\}$.

(5) Output the set containing the selected features, $S$.
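As an illustration of the filter procedure above (a sketch under our own naming, not the authors' implementation), the following Python code greedily adds the feature with the largest rough conditional mutual information with the class until no remaining feature exceeds the threshold $\beta$; the default $\beta = 0.01$ is a placeholder, since the paper only requires $\beta > 0$.

import math
from collections import Counter

def RH(data, attrs):
    # rough joint entropy over a list of attribute names, Eqs. (11)/(13)
    n = len(data)
    keys = [tuple(row[a] for a in attrs) for row in data]
    counts = Counter(keys)
    return -sum(math.log2(counts[k] / n) for k in keys) / n

def rcmi(data, feat, target, selected):
    # RI(f; y | S) via Theorem 10(1): RH(f | S) - RH(f | S, y)
    S = list(selected)
    return (RH(data, [feat] + S) - RH(data, S)) - (RH(data, [feat] + S + [target]) - RH(data, S + [target]))

def rcmi_filter(data, features, target, beta=0.01):
    # greedy forward selection: add the feature maximizing RI(f; y | S) while the gain exceeds beta
    remaining, selected = list(features), []
    while remaining:
        scores = {f: rcmi(data, f, target, selected) for f in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= beta:        # termination condition of the filter
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# usage on a toy decision table with class attribute "d"
data = [{"a": 0, "b": 1, "c": 0, "d": 0}, {"a": 1, "b": 1, "c": 0, "d": 1},
        {"a": 0, "b": 0, "c": 1, "d": 0}, {"a": 1, "b": 0, "c": 1, "d": 1}]
print(rcmi_filter(data, ["a", "b", "c"], "d"))   # -> ['a']

Note that when the selected set is empty, rcmi reduces to RH(f) − RH(f | y) = RI(f; y), so the first pick coincides with steps (2) and (3) of the procedure.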

5.2. Selecting the Best Feature Subset with the Wrapper Approach. The wrapper model uses the classification accuracy of a predetermined learning algorithm to determine the goodness of the selected subset. It searches for features that are better suited to the learning algorithm, aiming at improving its performance; therefore, the wrapper approach generally outperforms the filter approach in terms of the final predictive accuracy of a learning machine. However, it is more computationally expensive than the filter models. Although many wrapper methods do not perform an exhaustive search, most of them still incur a time complexity of $O(N^2)$ [27, 28], where $N$ is the number of features of the dataset. Hence, it is worth reducing the search space before using a wrapper feature selection approach. Working on the output of the filter model reduces the computational cost and helps avoid local maxima. Therefore, the final subset of features obtained contains few features while the predictive accuracy is still high.

In this work, we reduce the search space of the original feature set to the best candidate subset, which effectively lowers the computational cost of the wrapper search. Our method uses the sequential backward elimination technique to search for every possible subset of features through the candidate space.

The features are ranked according to the average accuracy of the classifier, and then features are removed one by one from the candidate feature subset only if such exclusion improves or does not change the classifier accuracy. Different kinds of learning models can be applied in wrappers; however, different kinds of learning machines have different discrimination abilities. The naive Bayesian classifier is widely used in machine learning because it is fast and easy to implement. Rennie et al. [29] show that its performance is competitive with state-of-the-art models such as SVM, while the latter has too many parameters to tune. Therefore, we choose the naive Bayesian classifier as the core of fine tuning. The decrement selection procedure for selecting an optimal feature subset based on the wrapper approach is shown in Algorithm 1.

There are two phases in the wrapper algorithm, as shown in Algorithm 1. In the first stage, we compute the classification accuracy $\mathrm{Acc}_C$ of the candidate feature set, which is the result of the filter model (step 1), where Classperf$(D, C)$ represents the average classification accuracy of dataset $D$ with candidate features $C$. The results are obtained by 10-fold cross-validation. For each $f_i \in C$, we compute the average accuracy Score; then the features are ranked according to their Score values (steps 3–6). In the second stage, we process the list of ordered features, from the first to the last ranked feature (steps 8–26). In this stage, for each feature in the list we consider the average accuracy of the naive Bayesian classifier when that feature is excluded. If a feature is found whose exclusion improves the relative accuracy [30] by more than $\delta_1$ (steps 11–14), that feature is removed immediately. Otherwise, every possible feature is considered and the feature that leads to the largest average accuracy is chosen (step 15): it is removed if its exclusion improves or leaves unchanged the average accuracy (steps 17–20), or degrades the relative accuracy by no more than $\delta_2$ (steps 21–24). In general, $\delta_1$ should take a value in $[0, 0.1]$ and $\delta_2$ should take a value in $[0, 0.02]$. In the following, if not specified, $\delta_1 = 0.05$ and $\delta_2 = 0.01$.

This decrement selection procedure is repeated until the termination condition is satisfied. Usually, sequential backward elimination is more computationally expensive than the incremental sequential forward search; however, it can yield a better result with respect to local maxima. In addition, the sequential forward search, adding one feature at each pass, does not take the interaction between groups of features into account [31]. In many classification problems the class variable may be affected by several features grouped together, but not by any individual feature alone. Therefore, the sequential forward search is unable to find the dependencies between groups of features, and its performance can sometimes be degraded.

6. Experimental Results

This section illustrates the evaluation of our method in terms of the classification accuracy and the number of selected features, in order to see how good the filter wrapper is in the situation of large and middle-sized feature sets.


Input: data set $D$, candidate feature set $C$
Output: an optimal feature set $B$

(1) $\mathrm{Acc}_C$ = Classperf($D$, $C$)
(2) set $B = \emptyset$
(3) for all $f_i \in C$ do
(4)   Score = Classperf($D$, $f_i$)
(5)   append $f_i$ to $B$
(6) end for
(7) sort $B$ in ascending order according to the Score value
(8) while $|B| > 1$ do
(9)   for all $f_i \in B$ according to order do
(10)    $\mathrm{Acc}_{f_i}$ = Classperf($D$, $B - \{f_i\}$)
(11)    if $(\mathrm{Acc}_{f_i} - \mathrm{Acc}_C)/\mathrm{Acc}_C > \delta_1$ then
(12)      $B = B - \{f_i\}$; $\mathrm{Acc}_C = \mathrm{Acc}_{f_i}$
(13)      go to Step 8
(14)    end if
(15)    select $f_i$ with the maximum $\mathrm{Acc}_{f_i}$
(16)  end for
(17)  if $\mathrm{Acc}_{f_i} \ge \mathrm{Acc}_C$ then
(18)    $B = B - \{f_i\}$; $\mathrm{Acc}_C = \mathrm{Acc}_{f_i}$
(19)    go to Step 8
(20)  end if
(21)  if $(\mathrm{Acc}_C - \mathrm{Acc}_{f_i})/\mathrm{Acc}_C \le \delta_2$ then
(22)    $B = B - \{f_i\}$; $\mathrm{Acc}_C = \mathrm{Acc}_{f_i}$
(23)    go to Step 8
(24)  end if
(25)  go to Step 27
(26) end while
(27) return the optimal feature subset $B$

Algorithm 1: Wrapper algorithm.
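A rough scikit-learn rendering of Algorithm 1 is sketched below; it is an approximation rather than the authors' code. Classperf is realized with 10-fold cross-validation (cross_val_score), GaussianNB stands in for the naive Bayesian classifier, features are referred to by column index, and $\delta_1$, $\delta_2$ follow the defaults suggested in the text.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def classperf(X, y, cols):
    # average 10-fold cross-validation accuracy of naive Bayes on the given feature columns
    return cross_val_score(GaussianNB(), X[:, cols], y, cv=10).mean()

def wrapper_backward(X, y, candidate, delta1=0.05, delta2=0.01):
    acc_c = classperf(X, y, candidate)
    # rank the candidate features by their individual accuracy (steps 3-7)
    B = sorted(candidate, key=lambda f: classperf(X, y, [f]))
    while len(B) > 1:
        removed, accs = False, {}
        for f in B:
            rest = [g for g in B if g != f]
            accs[f] = classperf(X, y, rest)
            if (accs[f] - acc_c) / acc_c > delta1:      # steps 11-14: clear improvement, drop f at once
                B, acc_c, removed = rest, accs[f], True
                break
        if removed:
            continue
        best = max(accs, key=accs.get)                  # step 15: best exclusion found in this pass
        if accs[best] >= acc_c or (acc_c - accs[best]) / acc_c <= delta2:   # steps 17-24
            acc_c = accs[best]
            B = [g for g in B if g != best]
            continue
        break                                           # step 25: no acceptable removal, stop
    return B

Here `candidate` would be the list of column indices produced by the RCMI filter.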

Table 1: Experimental data description.

Number | Datasets   | Instances | Features | Classes
1      | Arrhythmia | 452  | 279 | 16
2      | Hepatitis  | 155  | 19  | 2
3      | Ionosphere | 351  | 34  | 2
4      | Segment    | 2310 | 19  | 7
5      | Sonar      | 208  | 60  | 2
6      | Soybean    | 683  | 35  | 19
7      | Vote       | 435  | 16  | 2
8      | WDBC       | 569  | 16  | 2
9      | Wine       | 178  | 12  | 3
10     | Zoo        | 101  | 16  | 7

In addition, the performance of the rough conditional mutual information algorithm is compared with three typical feature selection methods, which are based on three different evaluation criteria, respectively. These methods are correlation-based feature selection (CFS), the consistency-based algorithm, and min-redundancy max-relevance (mRMR). The results illustrate the efficiency and effectiveness of our method.

In order to compare our hybrid method with these classical techniques, 10 databases are downloaded from the UCI repository of machine learning databases. All these datasets are widely used by the data mining community for evaluating learning algorithms. The details of the 10 UCI experimental datasets are listed in Table 1. The sizes of the databases vary from 101 to 2310, the numbers of original features vary from 12 to 279, and the numbers of classes vary from 2 to 19.

6.1. Unselect versus CFS, Consistency-Based Algorithm, mRMR, and RCMI. In Section 5, rough conditional mutual information is used to filter the redundant and irrelevant features.


Table 2: Number and accuracy of selected features with different algorithms, tested by naive Bayes.

Number  | Unselect Acc. (%) | CFS Acc. (%) / No. | Consistency Acc. (%) / No. | mRMR Acc. (%) / No. | RCMI Acc. (%) / No.
1       | 75.00 | 78.54 / 22 | 75.66 / 24 | 75.00 / 21 | 77.43 / 16
2       | 84.52 | 89.03 / 8  | 85.81 / 13 | 87.74 / 7  | 85.13 / 8
3       | 90.60 | 92.59 / 13 | 90.88 / 7  | 90.60 / 7  | 94.87 / 6
4       | 91.52 | 93.51 / 8  | 93.20 / 9  | 93.98 / 5  | 93.07 / 3
5       | 85.58 | 77.88 / 19 | 82.21 / 14 | 84.13 / 10 | 86.06 / 15
6       | 92.09 | 91.80 / 24 | 84.63 / 13 | 90.48 / 21 | 91.22 / 24
7       | 90.11 | 94.71 / 5  | 91.95 / 12 | 95.63 / 1  | 95.63 / 2
8       | 95.78 | 96.66 / 11 | 96.84 / 8  | 95.78 / 9  | 95.43 / 4
9       | 98.31 | 98.88 / 10 | 99.44 / 5  | 98.88 / 8  | 98.88 / 5
10      | 95.05 | 95.05 / 10 | 93.07 / 5  | 94.06 / 4  | 95.05 / 10
Average | 89.86 | 90.87 / 13 | 89.37 / 11 | 90.63 / 9.3 | 91.28 / 9.3

Table 3: Number and accuracy of selected features with different algorithms, tested by CART.

Number  | Unselect Acc. (%) | CFS Acc. (%) / No. | Consistency Acc. (%) / No. | mRMR Acc. (%) / No. | RCMI Acc. (%) / No.
1       | 72.35 | 72.57 / 22 | 71.90 / 24 | 73.45 / 21 | 75.00 / 16
2       | 79.35 | 81.94 / 8  | 80.00 / 13 | 83.23 / 7  | 85.16 / 8
3       | 89.74 | 90.31 / 13 | 89.46 / 7  | 86.89 / 7  | 91.74 / 6
4       | 96.15 | 96.10 / 8  | 95.28 / 9  | 95.76 / 5  | 95.54 / 3
5       | 74.52 | 74.04 / 19 | 77.88 / 14 | 76.44 / 10 | 77.40 / 15
6       | 92.53 | 91.65 / 24 | 85.94 / 13 | 92.24 / 21 | 91.36 / 24
7       | 95.63 | 95.63 / 5  | 95.63 / 12 | 95.63 / 1  | 96.09 / 2
8       | 93.50 | 94.55 / 11 | 94.73 / 8  | 95.43 / 9  | 94.90 / 4
9       | 94.94 | 94.94 / 10 | 96.07 / 5  | 93.26 / 8  | 94.94 / 5
10      | 92.08 | 93.07 / 10 | 92.08 / 5  | 92.08 / 4  | 93.07 / 10
Average | 88.08 | 88.48 / 13 | 87.90 / 11 | 88.44 / 9.3 | 89.52 / 9.3

In order to compute the rough mutual information, we employ Fayyad and Irani's MDL discretization algorithm [32] to transform continuous features into discrete ones.
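For readers reproducing this step, note that scikit-learn does not ship the Fayyad–Irani MDL method; an unsupervised quantile binning such as KBinsDiscretizer can serve as a simple stand-in when only an approximate discretization is needed. The snippet below is ours and purely illustrative, using hypothetical random data.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.random((100, 5))                                        # hypothetical continuous feature matrix
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_discrete = disc.fit_transform(X)                              # integer bin codes usable by the rough-set measures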

We use the naive Bayesian and CART classifiers to test the classification accuracy of the features selected by the different feature selection methods. The results in Tables 2 and 3 show the classification accuracies and the numbers of selected features obtained with the original features (unselect), RCMI, and the other feature selectors. According to Tables 2 and 3, we can find that the features selected by RCMI have the highest average accuracy for both naive Bayes and CART. It can also be observed that RCMI achieves the smallest average number of selected features, the same as mRMR. This shows that RCMI is better than CFS and the consistency-based algorithm and is comparable to mRMR.

In addition, to illustrate the efficiency of RCMI, we experiment on the Ionosphere, Sonar, and Wine datasets, respectively. Different numbers of features selected by RCMI and mRMR are tested on the naive Bayesian classifier, as presented in Figures 2, 3, and 4. In Figures 2–4, the classification accuracies are the results of 10-fold cross-validation tested by naive Bayes. The number $k$ on the $x$-axis refers to the first $k$ features in the order selected by the different methods. The results in Figures 2–4 show that the average accuracy of the classifier with RCMI is comparable to mRMR in the majority of cases. We can see that the maximum value of the plot for each dataset with the RCMI method is higher than with mRMR. For example, the highest accuracy on Ionosphere achieved by RCMI is 94.87%, while the highest accuracy achieved by mRMR is 90.60%. At the same time, we can also notice that the RCMI method attains its maximum value at more points than mRMR. This shows that RCMI is an effective measure for feature selection.

However, the number of features selected by the RCMI method is still large on some datasets. Therefore, to improve performance and further reduce the number of selected features, the wrapper method is applied. With the redundant and irrelevant features already removed, the wrapper core used for fine tuning can perform much faster.

6.2. Filter Wrapper versus RCMI and Unselect. Similarly, we also use naive Bayes and CART to test the classification accuracy of the features selected by the filter wrapper, RCMI, and unselect. The results in Tables 4 and 5 show the classification accuracies and the numbers of selected features.

Now we analyze the performance of these selected features.


Table 4: Number and accuracy of selected features based on naive Bayes.

Number  | Unselect Acc. (%) / No. | RCMI Acc. (%) / No. | Filter wrapper Acc. (%) / No.
1       | 75.00 / 279 | 77.43 / 16 | 78.76 / 9
2       | 84.52 / 19  | 85.13 / 8  | 85.81 / 3
3       | 90.60 / 34  | 94.87 / 6  | 94.87 / 6
4       | 91.52 / 19  | 93.07 / 3  | 94.33 / 3
5       | 85.58 / 60  | 86.06 / 15 | 86.54 / 6
6       | 92.09 / 35  | 91.22 / 24 | 92.24 / 10
7       | 90.11 / 16  | 95.63 / 2  | 95.63 / 1
8       | 95.78 / 16  | 95.43 / 4  | 95.43 / 4
9       | 98.31 / 12  | 98.88 / 5  | 96.07 / 3
10      | 95.05 / 16  | 95.05 / 10 | 95.05 / 8
Average | 89.86 / 50.6 | 91.28 / 9.3 | 91.47 / 5.3

Table 5: Number and accuracy of selected features based on CART.

Number  | Unselect Acc. (%) / No. | RCMI Acc. (%) / No. | Filter wrapper Acc. (%) / No.
1       | 72.35 / 279 | 75.00 / 16 | 75.89 / 9
2       | 79.35 / 19  | 85.16 / 8  | 85.81 / 3
3       | 89.74 / 34  | 91.74 / 6  | 91.74 / 6
4       | 96.15 / 19  | 95.54 / 3  | 95.80 / 3
5       | 74.52 / 60  | 77.40 / 15 | 80.77 / 6
6       | 92.53 / 35  | 91.36 / 24 | 91.95 / 10
7       | 95.63 / 16  | 96.09 / 2  | 95.63 / 1
8       | 93.50 / 16  | 94.90 / 4  | 94.90 / 4
9       | 94.94 / 12  | 94.94 / 5  | 93.82 / 3
10      | 92.08 / 16  | 93.07 / 10 | 93.07 / 8
Average | 88.08 / 50.6 | 89.52 / 9.3 | 89.94 / 5.3

[Figure 2: Classification accuracy for different numbers of selected features on the Ionosphere dataset (naive Bayes). Curves: RCMI and mRMR; x-axis: number of features; y-axis: classification accuracy.]

[Figure 3: Classification accuracy for different numbers of selected features on the Sonar dataset (naive Bayes). Curves: RCMI and mRMR; x-axis: number of features; y-axis: classification accuracy.]


[Figure 4: Classification accuracy for different numbers of selected features on the Wine dataset (naive Bayes). Curves: RCMI and mRMR; x-axis: number of features; y-axis: classification accuracy.]

First, we can conclude that although most of the features are removed from the raw data, the classification accuracies do not decrease; on the contrary, they increase on the majority of datasets. The average accuracies obtained by RCMI and by the filter-wrapper method are both higher than those on the unselected datasets with respect to naive Bayes and CART. With respect to the naive Bayesian learning algorithm, the average accuracy is 91.47% for the filter wrapper versus 89.86% for unselect; the average classification accuracy increased by 1.8%. With respect to the CART learning algorithm, the average accuracy is 89.94% for the filter wrapper versus 88.08% for unselect; the average classification accuracy increased by 2.1%. The average number of selected features is 5.3 for the filter wrapper, versus 9.3 for RCMI and 50.6 for unselect; the average number of selected features is thus reduced by 43% and 89.5%, respectively. Therefore, the average classification accuracy and number of features obtained by the filter-wrapper method are better than those obtained by RCMI and unselect. In other words, using the RCMI and wrapper methods as a hybrid improves the classification efficiency and accuracy compared with using the RCMI method alone.

7. Conclusion

The main goal of feature selection is to find a feature subset that is as small as possible while having high prediction accuracy. A hybrid feature selection approach which takes advantage of both the filter model and the wrapper model has been presented in this paper. In the filter model, measuring the relevance between features plays an important role, and a number of measures have been proposed. Mutual information is widely used for its robustness; however, it is difficult to compute, especially multivariate mutual information. We proposed a set of rough-set-based metrics to measure the relevance between features and analyzed some important properties of these uncertainty measures. We have proved that RCMI can substitute Shannon's conditional mutual information; thus RCMI can be used as an effective measure to filter the irrelevant and redundant features. Based on the candidate feature subset produced by RCMI, a naive Bayesian classifier is applied in the wrapper model. The accuracy of the naive Bayesian and CART classifiers was used to evaluate the goodness of feature subsets. The performance of the proposed method was evaluated on ten UCI datasets. Experimental results show that the filter-wrapper method outperformed CFS, the consistency-based algorithm, and mRMR in most cases. Our technique not only chooses a small subset of features from a candidate subset but also provides good performance in predictive accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (70971137).

References

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[2] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1–4, pp. 131–156, 1997.
[3] M. Dash and H. Liu, "Consistency-based search in feature selection," Artificial Intelligence, vol. 151, no. 1-2, pp. 155–176, 2003.
[4] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1996, http://mlearn.ics.uci.edu/MLRepository.html.
[5] K. Kira and L. A. Rendell, "Feature selection problem: traditional methods and a new algorithm," in Proceedings of the 9th National Conference on Artificial Intelligence (AAAI '92), pp. 129–134, July 1992.
[6] M. Robnik-Sikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1-2, pp. 23–69, 2003.
[7] I. Kononenko, "Estimating attributes: analysis and extension of RELIEF," in Proceedings of the European Conference on Machine Learning (ECML '94), pp. 171–182, 1994.
[8] M. A. Hall, Correlation-based feature subset selection for machine learning [Ph.D. thesis], Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.
[9] J. G. Bazan, "A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision table," in Rough Sets in Knowledge Discovery, L. Polkowski and A. Skowron, Eds., pp. 321–365, Physica, Heidelberg, Germany, 1998.
[10] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.
[11] N. Kwak and C. H. Choi, "Input feature selection for classification problems," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143–159, 2002.
[12] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[13] J. Martínez Sotoca and F. Pla, "Supervised feature selection by clustering using conditional mutual information-based distances," Pattern Recognition, vol. 43, no. 6, pp. 2068–2081, 2010.
[14] B. Guo, R. I. Damper, S. R. Gunn, and J. D. B. Nelson, "A fast separability-based feature-selection method for high-dimensional remotely sensed image classification," Pattern Recognition, vol. 41, no. 5, pp. 1670–1679, 2008.
[15] D. W. Scott, Multivariate Density Estimation: Theory, Practice and Visualization, John Wiley & Sons, New York, NY, USA, 1992.
[16] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, UK, 1986.
[17] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, Article ID 066138, 16 pages, 2004.
[18] T. Beaubouef, F. E. Petry, and G. Arora, "Information-theoretic measures of uncertainty for rough sets and rough relational databases," Information Sciences, vol. 109, no. 1–4, pp. 185–195, 1998.
[19] I. Düntsch and G. Gediga, "Uncertainty measures of rough set prediction," Artificial Intelligence, vol. 106, no. 1, pp. 109–137, 1998.
[20] G. J. Klir and M. J. Wierman, Uncertainty Based Information: Elements of Generalized Information Theory, Physica, New York, NY, USA, 1999.
[21] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set theory," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 12, no. 1, pp. 37–46, 2004.
[22] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[24] Z. Pawlak, "Rough sets," International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.
[25] X. Hu and N. Cercone, "Learning in relational databases: a rough set approach," Computational Intelligence, vol. 11, no. 2, pp. 323–338, 1995.
[26] J. Huang, Y. Cai, and X. Xu, "A hybrid genetic algorithm for feature selection wrapper based on mutual information," Pattern Recognition Letters, vol. 28, no. 13, pp. 1825–1844, 2007.
[27] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491–502, 2005.
[28] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205–1224, 2004.
[29] J. D. M. Rennie, L. Shih, J. Teevan, and D. Karger, "Tackling the poor assumptions of naive Bayes text classifiers," in Proceedings of the Twentieth International Conference on Machine Learning, pp. 616–623, Washington, DC, USA, August 2003.
[30] R. Setiono and H. Liu, "Neural-network feature selector," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 654–662, 1997.
[31] S. Foithong, O. Pinngern, and B. Attachoo, "Feature subset selection wrapper based on mutual information and rough sets," Expert Systems with Applications, vol. 39, no. 1, pp. 574–584, 2012.
[32] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027, Morgan Kaufmann, San Mateo, Calif, USA, 1993.


Page 4: Research Article A Hybrid Feature Selection Method Based ...downloads.hindawi.com/archive/2014/382738.pdf · out that there are four basic steps in a typical feature selec-tion method,

4 ISRN Applied Mathematics

Since for all 119909119894 119909119894 sube [119909

119894]119878sube 119880 1119899 le |[119909

119894]119878|119899 le 1 so

we have 0 le RH(119878) le log 119899 RH(119878) = log 119899 if and only if forall 119909119894 |[119909119894]|119878= 1 that is 119880119868119873119863(119878) = 119909

1 1199092 119909

119899

RH(119878) = 0 if and only if for for all 119909119894 |[119909119894]|119878= 119899 that

is 119880119868119873119863(119878) = 119880 Obviously when knowledge 119878 candistinguish any two objects the rough entropy is the largestwhen knowledge 119878 can not distinguish any two objects therough entropy is zero

Theorem2 Consider RH(119878) = 119867(119878) where119867(119878) is Shannonrsquosentropy

Proof Suppose 119880 = 1199091 1199092 119909

119899 and 119880119868119873119863(119878) =

1198831 1198832 119883

119898 where 119883

119894= 119909

1198941 1199091198942 119909

119894|119883119894| then

119867(119878) = minussum119898

119894=1(|119883119894|119899) log(|119883

119894|119899) Because 119883

119894cap 119883119895= for

119894 = 119895 and119883119894= [119909]119878for any 119909 isin 119883

119894 we have

RH (119878) = minus1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119878

1003816100381610038161003816119899

= sum119909isin1198831

minus1

119899log

1003816100381610038161003816[119909]1198781003816100381610038161003816

119899+ sdot sdot sdot + sum

119909isin119883119898

minus1

119899log

1003816100381610038161003816[119909]1198781003816100381610038161003816

119899

= (minus

100381610038161003816100381611988311003816100381610038161003816

119899log

100381610038161003816100381611988311003816100381610038161003816

119899) + sdot sdot sdot + (minus

10038161003816100381610038161198831198981003816100381610038161003816

119899log

10038161003816100381610038161198831198981003816100381610038161003816

119899)

= minus119898

sum119894=1

10038161003816100381610038161198831198941003816100381610038161003816

119899log

10038161003816100381610038161198831198941003816100381610038161003816

119899= 119867 (119878)

(12)

Theorem 2 shows that the rough entropy equals Shan-nonrsquos entropy

Definition 3 Suppose 119877 119878 sube 119865 are two subsets of attributesthe joint rough entropy is defined as

RH (119877 119878) = minus1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119877cup119878

1003816100381610038161003816119899

(13)

Due to [119909119894]119877cup119878

= [119909119894]119877cap [119909119894]119878 therefore RH(119877 119878) = minus(1

119899)sum119899

119894=1log(|[119909

119894]119877cap [119909119894]119878|119899) According to Definition 3 we

can observe that RH(119877 119878) = RH(119877 cup 119878)

Theorem 4 Consider RH(119877 119878) ge RH(119877) and RH(119877 119878) geRH(119878)

Proof Consider for all 119909119894isin 119880 we have [119909

119894]119878cup119877

sube [119909119894]119878and

[119909119894]119878cup119877

sube [119909119894]119877 and then |[119909

119894]119878cup119877

| le |[119909119894]119878| and |[119909

119894]119878cup119877

| le|[119909119894]119877| Therefore RH(119877 119878) ge RH(119877) and RH(119877 119878) ge RH(119878)

Definition 5 Suppose 119877 119878 sube 119865 are two subsets of attributesthe conditional rough entropy of 119877 to 119878 is defined as

RH (119877 | 119878) = minus1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119877cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119878

1003816100381610038161003816 (14)

Theorem 6 (chain rule) Consider RH(119877 | 119878) = RH(119877 119878) minusRH(119878)

Proof Consider

RH (119877 119878) minus RH (119878)

= minus1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119877cup119878

1003816100381610038161003816119899

+1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119878

1003816100381610038161003816119899

= minus1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119877cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119878

1003816100381610038161003816= RH (119877 | 119878)

(15)

Definition 7 Suppose 119877 119878 sube 119865 are two subsets of attributesthe rough mutual information of 119877 and 119878 is defined as

RI (119877 119878) = 1

119899

119899

sum119894=1

log1198991003816100381610038161003816[119909119894]119877cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

1003816100381610038161003816 (16)

Theorem 8 (the relation between rough mutual informationand rough entropy) Consider

(1) RI(119877 119878) = RI(119878 119877)(2) RI(119877 119878) = RH(119877) + RH(119878) minus RH(119877 119878)(3) RI(119877 119878) = RH(119877) minus RH(119877 | 119878) = RH(119878) minus RH(119878 | 119877)

Proof Theconclusions of (1) and (3) are straightforward herewe give the proof of property (2)

(2) Consider

RH (119877) + RH (119878) minus RH (119877 119878)

= minus1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119877

1003816100381610038161003816119899

minus1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119878

1003816100381610038161003816119899

+1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119877cup119878

1003816100381610038161003816119899

=1

119899

119899

sum119894=1

(minus log1003816100381610038161003816[119909119894]119877

1003816100381610038161003816119899

minus log1003816100381610038161003816[119909119894]119878

1003816100381610038161003816119899

+ log1003816100381610038161003816[119909119894]119877cup119878

1003816100381610038161003816119899

)

=1

119899

119899

sum119894=1

log1198991003816100381610038161003816[119909119894]119877cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

1003816100381610038161003816= RI (119877 119878)

(17)

Definition 9 The rough conditional mutual information of 119877and 119878 given 119879 is defined by

RI (119877 119878 | 119879) = 1

119899

119899

sum119894=1

1003816100381610038161003816[119909119894]119877cup119878cup1198791003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877cup119878

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878cup119879

1003816100381610038161003816 (18)

Theorem 10 The following equations hold(1) RI(119877 119878 | 119879) = RH(119877 | 119878) minus RH(119877 | 119878 119879)(2) RI(119877 119878 | 119879) = RI(119877 119879 119878) minus RI(119877 119878)

ISRN Applied Mathematics 5

Proof (1) Consider

RH (119877 | 119878) minus RH (119877 | 119878 119879)

= minus1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119877cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119878

1003816100381610038161003816+1

119899

119899

sum119894=1

log1003816100381610038161003816[119909119894]119877cup119878cup119879

10038161003816100381610038161003816100381610038161003816[119909119894]119878cup119879

1003816100381610038161003816

=1

119899

119899

sum119894=1

(minus log1003816100381610038161003816[119909119894]119877cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119878

1003816100381610038161003816+ log

1003816100381610038161003816[119909119894]119877cup119878cup1198791003816100381610038161003816

1003816100381610038161003816[119909119894]119878cup1198791003816100381610038161003816)

=1

119899

119899

sum119894=1

1003816100381610038161003816[119909119894]119877cup119878cup1198791003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877cup119878

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878cup119879

1003816100381610038161003816= RI (119877 119878 | 119879)

(19)

(2) Consider

RI (119877 119879 119878) minus RI (119877 119878)

= RI (119877 cup 119879 119878) minus RI (119877 119878)

=1

119899

119899

sum119894=1

log1198991003816100381610038161003816[119909119894]119877cup119879cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877cup119879

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

1003816100381610038161003816

minus1

119899

119899

sum119894=1

log1198991003816100381610038161003816[119909119894]119877cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

1003816100381610038161003816

=1

119899

119899

sum119894=1

(log1198991003816100381610038161003816[119909119894]119877cup119879cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877cup119879

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

1003816100381610038161003816

minus log1198991003816100381610038161003816[119909119894]119877cup119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

1003816100381610038161003816)

=1

119899

119899

sum119894=1

1003816100381610038161003816[119909119894]119877cup119878cup1198791003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878

10038161003816100381610038161003816100381610038161003816[119909119894]119877cup119878

1003816100381610038161003816 sdot1003816100381610038161003816[119909119894]119878cup119879

1003816100381610038161003816= RI (119877 119878 | 119879)

(20)

5 A Hybrid Feature Selection Method

In this section we propose a novel hybrid feature selectionmethod based on rough conditional mutual information andnaive Bayesian classifier

51 Feature Selection by Rough Conditional Mutual Informa-tion Given a set of sample 119880 described by the attribute set119865 in terms of mutual information the purpose of featureselection is to find a feature set 119878 (119878 sube 119865) with 119898 featureswhich jointly have the largest dependency on the target class119910 This criterion called max-dependency has the followingform

max119863(119878 119910) 119863 = 119868 (1198911 1198912 119891

119898 119910) (21)

According to the chain rule for information

119868 (1198911 1198912 119891

119898 119910) =

119898

sum119894=1

119868 (119891119894 119910119891119894minus1 119891119894minus2 119891

1) (22)

that is to say we can select a feature which produces themaximum conditional mutual information formally writtenas

max119891isin119865minus119878

119863(119891 119878 119910) 119863 = 119868 (119891 119910 | 119878) (23)

where 119878 represents the selected feature setFigure 1 illustrates the validity of this criterion Here

119891119894represents a feature highly correlated with 119891

119895 and 119891

119896

is much less correlated with 119891119894 The mutual information

between vectors (119891119894 119891119895) and 119910 is indicated by a shadowed

area consisting of three different patterns of patches that is119868((119891119894 119891119895) 119910) = 119860 + 119861 + 119862 where 119860 119861 and 119862 are defined by

different cases of overlap In detail(1) (119860 + 119861) is the mutual information between 119891

119894and 119910

that is 119868(119891119894 119910)

(2) (119861 + 119862) is the mutual information between 119891119895and 119910

that is 119868(119891119895 119910)

(3) (119861 + 119863) is the mutual information between 119891119894and 119891

119895

that is 119868(119891119894 119891119895)

(4) 119862 is the conditional mutual information between 119891119895

and 119910 given 119891119894 that is 119868(119891

119895 119910 | 119891

119894)

(5) 119864 is the mutual information between 119891119896and 119910 that is

119868(119891119896 119910)

This illustration clearly shows that the features maximizingthe mutual information not only depend on their predictiveinformation individually for example (119860+119861+119861+119862) but alsoneed to take account of redundancy between them In thisexample feature 119891

119894should be selected first since the mutual

information between 119891119894and 119910 is the largest and feature 119891

119896

should have priority for selection over 119891119895in spite of the latter

having larger individual mutual information with 119910 This isbecause 119891

119896provides more complementary information to

feature 119891119894to predict 119910 than does 119891

119895(as 119864 gt 119862 in Figure 1)

that is to say for each round we should select a featurewhich maximizes conditional mutual information FromTheorem 2 we know that rough entropy equals Shannonrsquosentropy therefore we can select a feature which produces themaximum rough conditional mutual information

We adopt the forward feature algorithm to select featuresEach single input feature is added to selected features setbased onmaximizing rough conditional mutual informationthat is given selected feature set 119878 maximizing the roughmutual information of 119891

119894and target class 119910 where 119891

119894belongs

to the remain feature set In order to apply the roughconditional mutual information measure to the filter modelwell a numerical threshold value 120573 (120573 gt 0) is set to RI(119891

119894 119910 |

119878) This can help the algorithm to be resistant to noise dataand to overcome the overfitting problem to a certain extent[26] The procedure can be performed until RI(119891

119894 119910 | 119878) gt

120573 is satisfied The filter algorithm can be described by thefollowing procedure

(1) Initialization set 119865 larr ldquoinitial set of all featuresrdquo 119878 larrldquoempty setrdquo and 119910 larr ldquoclass outputsrdquo

(2) Computation of the rough mutual information of thefeatures with the class outputs for each feature (119891

119894isin

119865) compute RI(119891119894 119910)

6 ISRN Applied Mathematics

fj

C

E

B

y

fi

A

D

fk

Figure 1 Illustration ofmutual information and conditionalmutualinformation for different scenarios

(3) Selection of the first feature find the feature 119891119894that

maximizes RI(119891119894 119910) set 119865 larr 119865 minus 119891

119894 and 119878 larr 119891

119894

(4) Greedy selection repeat until the termination condi-tion is satisfied

(a) computation of the rough mutual informationRI(119891119894 119910 | 119878) for each feature 119891

119894isin 119865

(b) selection of the next feature choose the feature119891119894as the one that maximizes RI(119891

119894 119910 | 119878) set

119865 larr 119865 minus 119891119894 and 119878 larr 119878 cup 119891

119894

(5) Output the set containing the selected features 119878

52 Selecting the Best Feature Subset on Wrapper ApproachThe wrapper model uses the classification accuracy of apredetermined learning algorithm to determine the goodnessof the selected subset It searches for features that arebetter suited to the learning algorithm aiming at improvingthe performance of the learning algorithm therefore thewrapper approach generally outperforms the filter approachin the aspect of the final predictive accuracy of a learningmachineHowever it ismore computationally expensive thanthe filter models Although many wrapper methods are notexhaustive search most of them still incur time complexity119874(1198732) [27 28] where 119873 is the number of features of thedataset Hence it is worth reducing the search space beforeusing wrapper feature selection approach Through the filtermodel it can reduce high computational cost and avoidencountering the local maximal problemTherefore the finalsubset of the features obtained contains a few features whilethe predictive accuracy is still high

In this work we propose the reducing of the search spaceof the original feature set to the best candidate which canreduce the computational cost of the wrapper search effec-tively Our method uses the sequential backward eliminationtechnique to search for every possible subset of featuresthrough the candidate space

The features are first ranked according to the average accuracy of the classifier; features are then removed one by one from the candidate feature subset, but only if the exclusion improves, or at least does not noticeably degrade, the classifier accuracy. Different kinds of learning models can be applied in wrappers, and different learning machines have different discrimination abilities. The naive Bayesian classifier is widely used in machine learning because it is fast and easy to implement; Rennie et al. [29] show that its performance is competitive with state-of-the-art models such as SVM, while the latter has many parameters to tune. Therefore, we choose the naive Bayesian classifier as the core of the fine-tuning stage. The decremental selection procedure for choosing an optimal feature subset with the wrapper approach is shown in Algorithm 1.

There are two phases in the wrapper algorithm (Algorithm 1). In the first phase, we compute the classification accuracy Acc_C of the candidate feature set produced by the filter model (step 1), where Classperf(D, C) denotes the average classification accuracy on dataset D using the candidate features C; all accuracies are obtained by 10-fold cross-validation. For each f_i ∈ C we compute its individual average accuracy Score, and the features are then ranked according to their Score values (steps 3-6). In the second phase, we process the ordered list of features, from the first to the last ranked feature (steps 8-26). For each feature in the list, we evaluate the average accuracy of the naive Bayesian classifier when that feature is excluded. If excluding a feature improves the relative accuracy [30] by more than δ1, that feature is removed immediately (steps 11-14). Otherwise, every feature is considered and the one whose exclusion yields the largest average accuracy is chosen (step 15); it is removed if its exclusion improves or leaves unchanged the average accuracy (steps 17-20), or if it degrades the relative accuracy by no more than δ2 (steps 21-24). In general, δ1 should take a value in [0, 0.1] and δ2 a value in [0, 0.02]; in the following, unless otherwise specified, δ1 = 0.05 and δ2 = 0.01.
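For instance, with the default thresholds and purely illustrative numbers: if Acc_C = 0.90, a feature whose exclusion raises the accuracy to 0.95 gives a relative gain of (0.95 − 0.90)/0.90 ≈ 0.056 > δ1 = 0.05 and is removed immediately; one whose exclusion lowers the accuracy to 0.893 gives a relative loss of (0.90 − 0.893)/0.90 ≈ 0.008 ≤ δ2 = 0.01 and may still be removed; a drop to 0.88 (relative loss ≈ 0.022) is not tolerated and stops the elimination.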

This decremental selection procedure is repeated until the termination condition is satisfied. Sequential backward elimination is usually more computationally expensive than incremental sequential forward search, but it can yield a better result with respect to local maxima. In addition, sequential forward search, which adds one feature at each pass, does not take the interactions between groups of features into account [31]. In many classification problems the class variable is affected by a group of several features rather than by any individual feature alone; sequential forward search is therefore unable to find the dependencies between groups of features, and its performance can sometimes degrade.

6. Experimental Results

This section evaluates our method in terms of classification accuracy and the number of selected features, in order to see how well the filter wrapper performs in situations with middle- and large-sized feature sets.


Input: data set D; candidate feature set C.
Output: an optimal feature set B.

(1)  Acc_C = Classperf(D, C)
(2)  set B = ∅
(3)  for all f_i ∈ C do
(4)    Score = Classperf(D, f_i)
(5)    append f_i to B
(6)  end for
(7)  sort B in ascending order according to the Score values
(8)  while |B| > 1 do
(9)    for all f_i ∈ B, according to the order, do
(10)     Acc_fi = Classperf(D, B − f_i)
(11)     if (Acc_fi − Acc_C)/Acc_C > δ1 then
(12)       B = B − f_i; Acc_C = Acc_fi
(13)       go to step 8
(14)     end if
(15)     select f_i with the maximum Acc_fi
(16)   end for
(17)   if Acc_fi ≥ Acc_C then
(18)     B = B − f_i; Acc_C = Acc_fi
(19)     go to step 8
(20)   end if
(21)   if (Acc_C − Acc_fi)/Acc_C ≤ δ2 then
(22)     B = B − f_i; Acc_C = Acc_fi
(23)     go to step 8
(24)   end if
(25)   go to step 27
(26) end while
(27) return the optimal feature subset B

Algorithm 1: Wrapper algorithm.
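As an illustration of Algorithm 1 (not the authors' code), the condensed Python sketch below uses scikit-learn's GaussianNB as a stand-in for the naive Bayesian classifier and 10-fold cross-validation for Classperf. For brevity it folds the δ1 early exit of steps 11-14 into a single best-exclusion test, so it may remove features in a slightly different order than the listing above.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def classperf(X, y, feats):
    # Classperf(D, feats): average 10-fold cross-validation accuracy of naive Bayes.
    return cross_val_score(GaussianNB(), X[:, list(feats)], y, cv=10).mean()

def wrapper_backward(X, y, candidates, delta2=0.01):
    # Rank the candidate features by individual accuracy (steps 3-7), then repeatedly
    # drop the feature whose exclusion yields the best accuracy (steps 8-26).
    B = sorted(candidates, key=lambda f: classperf(X, y, [f]))
    acc_C = classperf(X, y, B)
    while len(B) > 1:
        acc_without = {f: classperf(X, y, [g for g in B if g != f]) for f in B}
        f_best = max(acc_without, key=acc_without.get)
        improved = acc_without[f_best] >= acc_C                      # steps 17-20
        tolerable = (acc_C - acc_without[f_best]) / acc_C <= delta2  # steps 21-24
        if improved or tolerable:
            B.remove(f_best)
            acc_C = acc_without[f_best]
        else:
            break                                                    # step 25: stop
    return B
```

GaussianNB assumes continuous inputs; with the MDL-discretized features used in the paper, a categorical or multinomial naive Bayes variant would be the closer match.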

Table 1: Experimental data description.

No.  Dataset      Instances  Features  Classes
1    Arrhythmia   452        279       16
2    Hepatitis    155        19        2
3    Ionosphere   351        34        2
4    Segment      2310       19        7
5    Sonar        208        60        2
6    Soybean      683        35        19
7    Vote         435        16        2
8    WDBC         569        16        2
9    Wine         178        12        3
10   Zoo          101        16        7

In addition, the performance of the rough conditional mutual information algorithm is compared with three typical feature selection methods based on three different evaluation criteria: correlation-based feature selection (CFS), the consistency-based algorithm, and min-redundancy max-relevance (mRMR). The results illustrate the efficiency and effectiveness of our method.

In order to compare our hybrid method with these classical techniques, 10 databases were downloaded from the UCI repository of machine learning databases. All these datasets are widely used by the data mining community for evaluating learning algorithms. The details of the 10 UCI experimental datasets are listed in Table 1. The sizes of the databases vary from 101 to 2310 instances, the numbers of original features vary from 12 to 279, and the numbers of classes vary from 2 to 19.

6.1. Unselect versus CFS, Consistency-Based Algorithm, mRMR, and RCMI. In Section 5, rough conditional mutual information is used to filter out the redundant and irrelevant features. In order to compute the rough mutual information, we employ Fayyad and Irani's MDL discretization algorithm [32] to transform continuous features into discrete ones.


Table 2: Number and accuracy (%) of selected features with different algorithms, tested by naive Bayes.

No.   Unselect   CFS              Consistency      mRMR             RCMI
      Acc.       Acc.   #Feat.    Acc.   #Feat.    Acc.   #Feat.    Acc.   #Feat.
1     75.00      78.54  22        75.66  24        75.00  21        77.43  16
2     84.52      89.03  8         85.81  13        87.74  7         85.13  8
3     90.60      92.59  13        90.88  7         90.60  7         94.87  6
4     91.52      93.51  8         93.20  9         93.98  5         93.07  3
5     85.58      77.88  19        82.21  14        84.13  10        86.06  15
6     92.09      91.80  24        84.63  13        90.48  21        91.22  24
7     90.11      94.71  5         91.95  12        95.63  1         95.63  2
8     95.78      96.66  11        96.84  8         95.78  9         95.43  4
9     98.31      98.88  10        99.44  5         98.88  8         98.88  5
10    95.05      95.05  10        93.07  5         94.06  4         95.05  10
Avg.  89.86      90.87  13        89.37  11        90.63  9.3       91.28  9.3

Table 3: Number and accuracy (%) of selected features with different algorithms, tested by CART.

No.   Unselect   CFS              Consistency      mRMR             RCMI
      Acc.       Acc.   #Feat.    Acc.   #Feat.    Acc.   #Feat.    Acc.   #Feat.
1     72.35      72.57  22        71.90  24        73.45  21        75.00  16
2     79.35      81.94  8         80.00  13        83.23  7         85.16  8
3     89.74      90.31  13        89.46  7         86.89  7         91.74  6
4     96.15      96.10  8         95.28  9         95.76  5         95.54  3
5     74.52      74.04  19        77.88  14        76.44  10        77.40  15
6     92.53      91.65  24        85.94  13        92.24  21        91.36  24
7     95.63      95.63  5         95.63  12        95.63  1         96.09  2
8     93.50      94.55  11        94.73  8         95.43  9         94.90  4
9     94.94      94.94  10        96.07  5         93.26  8         94.94  5
10    92.08      93.07  10        92.08  5         92.08  4         93.07  10
Avg.  88.08      88.48  13        87.90  11        88.44  9.3       89.52  9.3

We use the naive Bayesian and CART classifiers to test the classification accuracy of the features selected by the different feature selection methods. Tables 2 and 3 show the classification accuracies and the numbers of selected features for the original feature set (unselect), for RCMI, and for the other feature selectors. According to Tables 2 and 3, the features selected by RCMI give the highest average accuracy for both naive Bayes and CART. It can also be observed that RCMI achieves the smallest average number of selected features, the same as mRMR. This shows that RCMI is better than CFS and the consistency-based algorithm and is comparable to mRMR.
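For concreteness, an end-to-end sketch of this protocol could look as follows. It reuses the rcmi_filter and wrapper_backward sketches given earlier, takes the Wine data bundled with scikit-learn as a stand-in for the UCI copy, and substitutes simple equal-frequency binning for Fayyad and Irani's MDL discretization, so its output will not reproduce Tables 2-5 exactly; the value of beta is chosen ad hoc.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X_raw, y = load_wine(return_X_y=True)

# Discretize for the filter stage (a stand-in for Fayyad-Irani MDL discretization).
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_disc = disc.fit_transform(X_raw).astype(int)

# Filter stage: candidate subset from the RCMI forward search (beta chosen ad hoc).
candidates = rcmi_filter(X_disc, y, beta=0.01)

# Wrapper stage: refine the candidate subset by backward elimination with naive Bayes.
selected = wrapper_backward(X_raw, y, candidates)

# Evaluate the final subset with naive Bayes and CART, 10-fold cross-validation.
for name, clf in [("naive Bayes", GaussianNB()), ("CART", DecisionTreeClassifier(random_state=0))]:
    acc = cross_val_score(clf, X_raw[:, selected], y, cv=10).mean()
    print(f"{name}: {len(selected)} features, accuracy = {acc:.4f}")
```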

In addition, to illustrate the efficiency of RCMI, we experiment on the Ionosphere, Sonar, and Wine datasets. Different numbers of features selected by RCMI and by mRMR are tested with the naive Bayesian classifier, as presented in Figures 2, 3, and 4. In Figures 2-4 the classification accuracies are the results of 10-fold cross-validation with naive Bayes, and the number k on the x-axis refers to the first k features in the order selected by each method. The results in Figures 2-4 show that the average accuracy of the classifier with RCMI is comparable to that with mRMR in the majority of cases. Moreover, the maximum of the curve for each dataset is higher with RCMI than with mRMR; for example, the highest accuracy achieved on Ionosphere by RCMI is 94.87%, while the highest accuracy achieved by mRMR is 90.60%. At the same time, we can notice that the RCMI method attains the maximum value at more points than mRMR does. This indicates that RCMI is an effective measure for feature selection.
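Curves like those in Figures 2-4 can be reproduced by scoring the first k features of each ranking with naive Bayes under 10-fold cross-validation, for example with the illustrative helper below (ours, not the authors' code; the RCMI order can come from the rcmi_filter sketch above, while the mRMR order would come from an external mRMR implementation).

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def accuracy_curve(X, y, ranked):
    # 10-fold CV accuracy of naive Bayes on the first k ranked features, k = 1..len(ranked).
    return [cross_val_score(GaussianNB(), X[:, ranked[:k]], y, cv=10).mean()
            for k in range(1, len(ranked) + 1)]

def plot_accuracy_curves(X, y, rankings):
    # rankings: dict mapping a method name (e.g. "RCMI", "mRMR") to its feature order.
    for name, ranked in rankings.items():
        ks = range(1, len(ranked) + 1)
        plt.plot(ks, accuracy_curve(X, y, ranked), marker="o", label=name)
    plt.xlabel("Number of features")
    plt.ylabel("Classification accuracy")
    plt.legend()
    plt.show()

# Example (feature orders obtained elsewhere, e.g. rcmi_filter above and an mRMR tool):
# plot_accuracy_curves(X_raw, y, {"RCMI": rcmi_order, "mRMR": mrmr_order})
```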

However, the number of features selected by the RCMI method is still relatively large on some datasets. Therefore, to improve performance and further reduce the number of selected features, the wrapper method is applied. With the redundant and irrelevant features already removed, the classifier at the core of the wrapper fine-tuning can run much faster.

6.2. Filter Wrapper versus RCMI and Unselect. Similarly, we use naive Bayes and CART to test the classification accuracy of the features selected by the filter wrapper, by RCMI, and with no selection (unselect). Tables 4 and 5 show the classification accuracies and the numbers of selected features.

We now analyze the performance of these selected feature subsets.


Table 4: Number and accuracy (%) of selected features, evaluated with naive Bayes.

No.   Unselect          RCMI              Filter wrapper
      Acc.    #Feat.    Acc.    #Feat.    Acc.    #Feat.
1     75.00   279       77.43   16        78.76   9
2     84.52   19        85.13   8         85.81   3
3     90.60   34        94.87   6         94.87   6
4     91.52   19        93.07   3         94.33   3
5     85.58   60        86.06   15        86.54   6
6     92.09   35        91.22   24        92.24   10
7     90.11   16        95.63   2         95.63   1
8     95.78   16        95.43   4         95.43   4
9     98.31   12        98.88   5         96.07   3
10    95.05   16        95.05   10        95.05   8
Avg.  89.86   50.6      91.28   9.3       91.47   5.3

Table 5: Number and accuracy (%) of selected features, evaluated with CART.

No.   Unselect          RCMI              Filter wrapper
      Acc.    #Feat.    Acc.    #Feat.    Acc.    #Feat.
1     72.35   279       75.00   16        75.89   9
2     79.35   19        85.16   8         85.81   3
3     89.74   34        91.74   6         91.74   6
4     96.15   19        95.54   3         95.80   3
5     74.52   60        77.40   15        80.77   6
6     92.53   35        91.36   24        91.95   10
7     95.63   16        96.09   2         95.63   1
8     93.50   16        94.90   4         94.90   4
9     94.94   12        94.94   5         93.82   3
10    92.08   16        93.07   10        93.07   8
Avg.  88.08   50.6      89.52   9.3       89.94   5.3

Figure 2: Classification accuracy for different numbers of selected features on the Ionosphere dataset (naive Bayes); one curve each for RCMI and mRMR.

Figure 3: Classification accuracy for different numbers of selected features on the Sonar dataset (naive Bayes); one curve each for RCMI and mRMR.

Figure 4: Classification accuracy for different numbers of selected features on the Wine dataset (naive Bayes); one curve each for RCMI and mRMR.

First, we can conclude that although most of the features are removed from the raw data, the classification accuracies do not decrease; on the contrary, they increase on the majority of datasets. The average accuracies obtained with RCMI and with the filter-wrapper method are both higher than those on the unselected datasets, for naive Bayes as well as for CART. With the naive Bayesian learning algorithm, the average accuracy is 91.47% for the filter wrapper versus 89.86% for unselect, so the average classification accuracy increased by 1.8%. With the CART learning algorithm, the average accuracy is 89.94% for the filter wrapper versus 88.08% for unselect, an increase of 2.1%. The average number of selected features is 5.3 for the filter wrapper, compared with 9.3 for RCMI and 50.6 for unselect; the average number of selected features is thus reduced by 43% and 89.5%, respectively. Therefore, both the average classification accuracy and the average number of features obtained with the filter-wrapper method are better than those obtained with RCMI alone or without selection. In other words, using RCMI and the wrapper method as a hybrid improves the classification efficiency and accuracy compared with using the RCMI method individually.
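(These figures are relative changes computed from the table averages: (91.47 − 89.86)/89.86 ≈ 1.8% and (89.94 − 88.08)/88.08 ≈ 2.1% for the accuracy gains, and (9.3 − 5.3)/9.3 ≈ 43% and (50.6 − 5.3)/50.6 ≈ 89.5% for the reductions in the number of selected features.)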

7. Conclusion

The main goal of feature selection is to find a feature subset that is as small as possible while retaining high prediction accuracy. A hybrid feature selection approach that takes advantage of both the filter model and the wrapper model has been presented in this paper. In the filter model, measuring the relevance between features plays an important role, and a number of measures have been proposed; mutual information is widely used for its robustness, but it is difficult to compute, especially multivariate mutual information. We proposed a set of rough-set-based metrics to measure the relevance between features and analyzed some important properties of these uncertainty measures. We have proved that RCMI can substitute for Shannon's conditional mutual information; thus RCMI can be used as an effective measure to filter out the irrelevant and redundant features. Based on the candidate feature subset obtained by RCMI, a naive Bayesian classifier is applied in the wrapper model, and the accuracies of the naive Bayesian and CART classifiers are used to evaluate the goodness of feature subsets. The performance of the proposed method was evaluated on ten UCI datasets. The experimental results show that the filter-wrapper method outperformed CFS, the consistency-based algorithm, and mRMR in most cases. Our technique not only chooses a small subset of features from the candidate subset but also provides good predictive accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (70971137).

References

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[2] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1–4, pp. 131–156, 1997.

[3] M. Dash and H. Liu, "Consistency-based search in feature selection," Artificial Intelligence, vol. 151, no. 1-2, pp. 155–176, 2003.

[4] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1996, http://mlearn.ics.uci.edu/MLRepository.html.

[5] K. Kira and L. A. Rendell, "Feature selection problem: traditional methods and a new algorithm," in Proceedings of the 9th National Conference on Artificial Intelligence (AAAI '92), pp. 129–134, July 1992.

[6] M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1-2, pp. 23–69, 2003.

[7] I. Kononenko, "Estimating attributes: analysis and extension of RELIEF," in Proceedings of the European Conference on Machine Learning (ECML '94), pp. 171–182, 1994.

[8] M. A. Hall, Correlation-based feature subset selection for machine learning [Ph.D. thesis], Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.

[9] J. G. Bazan, "A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision table," in Rough Sets in Knowledge Discovery, L. Polkowski and A. Skowron, Eds., pp. 321–365, Physica, Heidelberg, Germany, 1998.

[10] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.

[11] N. Kwak and C. H. Choi, "Input feature selection for classification problems," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143–159, 2002.

[12] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[13] J. Martínez Sotoca and F. Pla, "Supervised feature selection by clustering using conditional mutual information-based distances," Pattern Recognition, vol. 43, no. 6, pp. 2068–2081, 2010.

[14] B. Guo, R. I. Damper, S. R. Gunn, and J. D. B. Nelson, "A fast separability-based feature-selection method for high-dimensional remotely sensed image classification," Pattern Recognition, vol. 41, no. 5, pp. 1670–1679, 2008.

[15] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, New York, NY, USA, 1992.

[16] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, UK, 1986.

[17] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, Article ID 066138, 16 pages, 2004.

[18] T. Beaubouef, F. E. Petry, and G. Arora, "Information-theoretic measures of uncertainty for rough sets and rough relational databases," Information Sciences, vol. 109, no. 1–4, pp. 185–195, 1998.

[19] I. Düntsch and G. Gediga, "Uncertainty measures of rough set prediction," Artificial Intelligence, vol. 106, no. 1, pp. 109–137, 1998.

[20] G. J. Klir and M. J. Wierman, Uncertainty Based Information: Elements of Generalized Information Theory, Physica, New York, NY, USA, 1999.

[21] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set theory," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 12, no. 1, pp. 37–46, 2004.

[22] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[23] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.

[24] Z. Pawlak, "Rough sets," International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.

[25] X. Hu and N. Cercone, "Learning in relational databases: a rough set approach," Computational Intelligence, vol. 11, no. 2, pp. 323–338, 1995.

[26] J. Huang, Y. Cai, and X. Xu, "A hybrid genetic algorithm for feature selection wrapper based on mutual information," Pattern Recognition Letters, vol. 28, no. 13, pp. 1825–1844, 2007.

[27] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491–502, 2005.

[28] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205–1224, 2004.

[29] J. D. M. Rennie, L. Shih, J. Teevan, and D. Karger, "Tackling the poor assumptions of naive Bayes text classifiers," in Proceedings of the Twentieth International Conference on Machine Learning, pp. 616–623, Washington, DC, USA, August 2003.

[30] R. Setiono and H. Liu, "Neural-network feature selector," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 654–662, 1997.

[31] S. Foithong, O. Pinngern, and B. Attachoo, "Feature subset selection wrapper based on mutual information and rough sets," Expert Systems with Applications, vol. 39, no. 1, pp. 574–584, 2012.

[32] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027, Morgan Kaufmann, San Mateo, Calif, USA, 1993.



119895(as 119864 gt 119862 in Figure 1)

that is to say for each round we should select a featurewhich maximizes conditional mutual information FromTheorem 2 we know that rough entropy equals Shannonrsquosentropy therefore we can select a feature which produces themaximum rough conditional mutual information

We adopt the forward feature algorithm to select featuresEach single input feature is added to selected features setbased onmaximizing rough conditional mutual informationthat is given selected feature set 119878 maximizing the roughmutual information of 119891

119894and target class 119910 where 119891

119894belongs

to the remain feature set In order to apply the roughconditional mutual information measure to the filter modelwell a numerical threshold value 120573 (120573 gt 0) is set to RI(119891

119894 119910 |

119878) This can help the algorithm to be resistant to noise dataand to overcome the overfitting problem to a certain extent[26] The procedure can be performed until RI(119891

119894 119910 | 119878) gt

120573 is satisfied The filter algorithm can be described by thefollowing procedure

(1) Initialization set 119865 larr ldquoinitial set of all featuresrdquo 119878 larrldquoempty setrdquo and 119910 larr ldquoclass outputsrdquo

(2) Computation of the rough mutual information of thefeatures with the class outputs for each feature (119891

119894isin

119865) compute RI(119891119894 119910)

6 ISRN Applied Mathematics

fj

C

E

B

y

fi

A

D

fk

Figure 1 Illustration ofmutual information and conditionalmutualinformation for different scenarios

(3) Selection of the first feature find the feature 119891119894that

maximizes RI(119891119894 119910) set 119865 larr 119865 minus 119891

119894 and 119878 larr 119891

119894

(4) Greedy selection repeat until the termination condi-tion is satisfied

(a) computation of the rough mutual informationRI(119891119894 119910 | 119878) for each feature 119891

119894isin 119865

(b) selection of the next feature choose the feature119891119894as the one that maximizes RI(119891

119894 119910 | 119878) set

119865 larr 119865 minus 119891119894 and 119878 larr 119878 cup 119891

119894

(5) Output the set containing the selected features 119878

52 Selecting the Best Feature Subset on Wrapper ApproachThe wrapper model uses the classification accuracy of apredetermined learning algorithm to determine the goodnessof the selected subset It searches for features that arebetter suited to the learning algorithm aiming at improvingthe performance of the learning algorithm therefore thewrapper approach generally outperforms the filter approachin the aspect of the final predictive accuracy of a learningmachineHowever it ismore computationally expensive thanthe filter models Although many wrapper methods are notexhaustive search most of them still incur time complexity119874(1198732) [27 28] where 119873 is the number of features of thedataset Hence it is worth reducing the search space beforeusing wrapper feature selection approach Through the filtermodel it can reduce high computational cost and avoidencountering the local maximal problemTherefore the finalsubset of the features obtained contains a few features whilethe predictive accuracy is still high

In this work we propose the reducing of the search spaceof the original feature set to the best candidate which canreduce the computational cost of the wrapper search effec-tively Our method uses the sequential backward eliminationtechnique to search for every possible subset of featuresthrough the candidate space

The features are ranked according to the average accuracyof the classifier and then features will be removed one byone from the candidate feature subset only if such exclusionimproves or does not change the classifier accuracy Differ-ent kinds of learning models can be applied to wrappersHowever different kinds of learning machines have differentdiscrimination abilities Naive Bayesian classifier is widelyused in machine learning because it is fast and easy to beimplemented Rennie et al [29] show that its performance iscompetitive with the state-of-the-art models like SVM whilethe latter has too many parameters to decide Therefore wechoose the naive Bayesian classifier as the core of fine tuningThe decrement selection procedure for selecting an optimalfeature subset based on the wrapper approach can be seen asshown in Algorithm 1

There are two phases in the wrapper algorithm as shownin wrapper algorithm In the first stage we compute the clas-sification accuracy Acc

119862of the candidate feature set which

is the results of filter model (step 1) where Classperf (119863119862)represents the average classification accuracy of dataset Dwith candidate features CThe results are obtained by 10-foldcross-validation For each 119891

119894isin 119862 we compute the average

accuracy 119878119888119900119903119890 Then features are ranked according to 119878119888119900119903119890value (steps 3ndash6) In the second stage we deal with the list ofthe ordered features once each feature in the list determinesthe first till the last ranked feature (steps 8ndash26) In this stageeach feature in the list considers the average accuracy of thenaive Bayesian classifier only if the feature is excluded Ifany feature is found to lead to the most improved averageaccuracy and the relative accuracy [30] is more than 120575

1(steps

11ndash14) the feature then will be removed Otherwise everypossible feature is considered and the feature that leads tothe largest average accuracy will be chosen and removed (step15)The one that leads to the improvement or the unchangingof the average accuracy (steps 17ndash20) or the degrading ofthe relative accuracy not worse than 120575

2(steps 21ndash24) will be

removed In general 1205751should take value in [0 01] and 120575

2

should take value in [0 002] In the following if not specified1205751= 005 and 120575

2= 001

This decrement selection procedure is repeated until thetermination condition is satisfied Usually the sequentialbackward elimination is more computationally expensivethan the incremental sequential forward search Howeverit could yield a better result when considering the localmaximal In addition the sequential forward search addingone feature at each pass does not take the interactionbetween the groups of the features into account [31] In manyclassification problems the class variable may be affected bygrouping several features but not the individual feature aloneTherefore the sequential forward search is unable to find thedependencies between the groups of the features while theperformance can be degraded sometimes

6 Experimental Results

This section illustrates the evaluation of our method in termsof the classification accuracy and the number of selectedfeatures in order to see how good the filter wrapper is in the

ISRN Applied Mathematics 7

Input data set119863 candidate feature set 119862Output an optimal feature set 119861

(1) Acc119862= Classperf (119863119862)

(2) set 119861 = (3) for all 119891

119894isin 119862 do

(4) Score = Classperf (119863119891119894)

(5) append 119891119894to 119861

(6) end for(7) sort 119861 in an ascending order according to Score value(8) while |119861| gt 1 do(9) for all 119891

119894isin 119861 according to order do

(10) Acc119891119894= Classperf (119863 119861 minus 119891

119894)

(11) if (Acc119891119894minus Acc

119862)Acc

119865gt 1205751then

(12) 119861 = 119861 minus 119891119894 Acc

119862= Acc

119891119894

(13) go to Step 8(14) end if(15) Select 119891

119894with the maximum Acc

119891119894

(16) end for(17) if Acc

119891119894ge Acc

119862then

(18) 119861 = 119861 minus 119891119894 Acc

119862= Acc

119891119894

(19) go to Step 8(20) end if(21) if (Acc

119862minus Acc

119891119894)Acc

119862le 1205752then

(22) 119861 = 119861 minus 119891119894 Acc

119862= Acc

119891119894

(23) go to Step 8(24) end if(25) go to Step 27(26) end while(27) Return an optimal feature subset 119861

Algorithm 1 Wrapper algorithm

Table 1 Experimental data description

Number Datasets Instances Features Classes1 Arrhythmia 452 279 162 Hepatitis 155 19 23 Ionosphere 351 34 24 Segment 2310 19 75 Sonar 208 60 26 Soybean 683 35 197 Vote 435 16 28 WDBC 569 16 29 Wine 178 12 310 Zoo 101 16 7

situation of large and middle-sized features In addition theperformance of the rough conditional mutual informationalgorithm is compared with three typical feature selectionmethods which are based on three different evaluationcriterions respectively These methods include correlationbased feature selection (CFS) consistency based algorithmand min-redundancy max-relevance (mRMR) The resultsillustrate the efficiency and effectiveness of our method

In order to compare our hybrid method with someclassical techniques 10 databases are downloaded from UCIrepository of machine learning databases All these datasets

are widely used by the data mining community for evaluatinglearning algorithms The details of the 10 UCI experimentaldatasets are listed in Table 1 The sizes of databases vary from101 to 2310 the numbers of original features vary from 12 to279 and the numbers of classes vary from 2 to 19

61 Unselect versus CFS Consistency Based AlgorithmmRMR and RCMI In Section 5 rough conditional mutualinformation is used to filter the redundant and irrelevantfeatures In order to compute the rough mutual information

8 ISRN Applied Mathematics

Table 2 Number and accuracy of selected features with different algorithms tested by naive Bayes

Number Unselect CFS Consistency mRMR RCMIAccuracy Accuracy Feature number Accuracy Feature number Accuracy Feature number Accuracy Feature number

1 7500 7854 22 7566 24 7500 21 7743 162 8452 8903 8 8581 13 8774 7 8513 83 9060 9259 13 9088 7 9060 7 9487 64 9152 9351 8 9320 9 9398 5 9307 35 8558 7788 19 8221 14 8413 10 8606 156 9209 9180 24 8463 13 9048 21 9122 247 9011 9471 5 9195 12 9563 1 9563 28 9578 9666 11 9684 8 9578 9 9543 49 9831 9888 10 9944 5 9888 8 9888 510 9505 9505 10 9307 5 9406 4 9505 10Average 8986 9087 13 8937 11 9063 93 9128 93

Table 3 Number and accuracy of selected feature with different algorithms tested by CART

Number Unselect CFS Consistency mRMR RCMIAccuracy Accuracy Feature number Accuracy Feature number Accuracy Feature number Accuracy Feature number

1 7235 7257 22 7190 24 7345 21 7500 162 7935 8194 8 8000 13 8323 7 8516 83 8974 9031 13 8946 7 8689 7 9174 64 9615 9610 8 9528 9 9576 5 9554 35 7452 7404 19 7788 14 7644 10 7740 156 9253 9165 24 8594 13 9224 21 9136 247 9563 9563 5 9563 12 9563 1 9609 28 9350 9455 11 9473 8 9543 9 9490 49 9494 9494 10 9607 5 9326 8 9494 510 9208 9307 10 9208 5 9208 4 9307 10Average 8808 8848 13 8790 11 8844 93 8952 93

we employ Fayyad and Iranirsquos MDL discretization algorithm[32] to transform continuous features into discrete ones

We use naive Bayesian and CART classifier to test theclassification accuracy of selected features with differentfeature selection methods The results in Tables 2 and 3show the classification accuracies and the number of selectedfeatures obtained by the original feature (unselect) RCMIand other feature selectors According to Tables 2 and 3 wecan find that the selected feature by RCMI has the highestaverage accuracy in terms of naive Bayes and CART It canalso be observed that RCMI can achieve the least averagenumber of selected features which is the same asmRMRThisshows that RCMI is better than CFS and consistency basedalgorithm and is comparable to mRMR

In addition to illustrate the efficiency of RCMI we exper-iment on Ionosphere Sonar and Wine datasets respectivelyA different number of the selected features obtained by RCMIand mRMR are tested on naive Bayesian classifier as pre-sented in Figures 2 3 and 4 In Figures 2ndash4 the classificationaccuracies are the results of 10-fold cross-validation testedby naive Bayes The number 119896 in 119909-axis refers to the first 119896features with the selected order by different methods Theresults in Figures 2ndash4 show that the average accuracy ofclassifier with RCMI is comparable to mRMR in the majority

of cases We can see that the maximum value of the plotsfor each dataset with RCMI method is higher than mRMRFor example the highest accuracy of Ionosphere achievedby RCMI is 9487 while the highest accuracy achieved bymRMR is 9060 At the same time we can also notice thattheRCMImethodhas the number ofmaximumvalues higherthan mRMR It shows that RCMI is an effective measure forfeature selection

However the number of the features selected by theRCMI method is still more in some datasets Thereforeto improve performance and reduce the number of theselected features these problems were conducted by usingthe wrapper method With removal of the redundant andirreverent features the core of wrappers for fine tuning canperform much faster

62 Filter Wrapper versus RCMI and Unselect Similarly wealso use naive Bayesian and CART to test the classificationaccuracy of selected features with filter wrapper RCMI andunselect The results in Tables 4 and 5 show the classificationaccuracies and the number of selected features

Now we analyze the performance of these selected fea-tures First we can conclude that although most of featuresare removed from the raw data the classification accuracies

ISRN Applied Mathematics 9

Table 4 Number and accuracy of selected features based on naive Bayes

Number Unselect RCMI Filter wrapperAccuracy Feature number Accuracy Feature number Accuracy Feature number

1 7500 279 7743 16 7876 92 8452 19 8513 8 8581 33 9060 34 9487 6 9487 64 9152 19 9307 3 9433 35 8558 60 8606 15 8654 66 9209 35 9122 24 9224 107 9011 16 9563 2 9563 18 9578 16 9543 4 9543 49 9831 12 9888 5 9607 310 9505 16 9505 10 9505 8Average 8986 506 9128 93 9147 53

Table 5 Number and accuracy of selected features based on CART

Number Unselect RCMI Filter wrapperAccuracy Feature number Accuracy Feature number Accuracy Feature number

1 7235 279 7500 16 7589 92 7935 19 8516 8 8581 33 8974 34 9174 6 9174 64 9615 19 9554 3 9580 35 7452 60 7740 15 8077 66 9253 35 9136 24 9195 107 9563 16 9609 2 9563 18 9350 16 9490 4 9490 49 9494 12 9494 5 9382 310 9208 16 9307 10 9307 8Average 8808 506 8952 93 8994 53

1 2 3 4 5 6 7 8 9 10 11082

084

086

088

09

092

094

096

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 2 Classification accuracy of different number of selectedfeatures on Ionosphere dataset (naive Bayes)

0 2 4 6 8 10 12 14 16072

074

076

078

08

082

084

086

088

09

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 3 Classification accuracy of different number of selectedfeatures on Sonar dataset (naive Bayes)

10 ISRN Applied Mathematics

1 2 3 4 5 6 7 8 9 10082

084

086

088

09

092

094

096

098

1

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 4 Classification accuracy of different number of selectedfeatures on Wine dataset (naive Bayes)

do not decrease on the contrary the classification accuraciesincrease in the majority of datasets The average accuraciesderived from RCMI and filter-wrapper method are all higherthan the unselect datasets with respect to naive Bayes andCARTWith respect to naive Bayesian learning algorithm theaverage accuracy is 9147 for filter wrapper while 8986for unselect The average classification accuracy increased18 With respect to CART learning algorithm the averageaccuracy is 8994 for filter wrapper while 8808 for uns-elect The average classification accuracy increased 21 Theaverage number of selected features is 53 for filter wrapperwhile 93 for RCMI and 506 for unselectThe average numberof selected features reduced 43 and 895 respectivelyTherefore the average value of classification accuracy andnumber of features obtained from the filter-wrapper methodare better than those obtained from the RCMI and unselectIn other words using the RCMI and wrapper methods asa hybrid improves the classification efficiency and accuracycompared with using the RCMI method individually

7 Conclusion

The main goal of feature selection is to find a feature subsetas small as possible while the feature subset has highly pre-diction accuracy A hybrid feature selection approach whichtakes advantages of filter model and wrapper model hasbeen presented in this paper In the filter model measuringthe relevance between features plays an important role Anumber of measures were proposed Mutual information iswidely used for its robustness However it is difficult tocompute mutual information especially multivariate mutualinformation We proposed a set of rough based metrics tomeasure the relevance between features and analyzed someimportant properties of these uncertainty measures We haveproved that the RCMI can substitute Shannonrsquos conditionalmutual information thus RCMI can be used as an effective

measure to filter the irrelevant and redundant features Basedon the candidate feature subset by RCMI naive Bayesianclassifier is applied to the wrapper model The accuracy ofnaive Bayesian and CART classifier was used to evaluate thegoodness of feature subsetsThe performance of the proposedmethod is evaluated based on tenUCI datasets Experimentalresults on ten UCI datasets show that the filter-wrappermethodoutperformedCFS consistency based algorithm andmRMR atmost cases Our technique not only chooses a smallsubset of features from a candidate subset but also providesgood performance in predictive accuracy

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Natural ScienceFoundation of China (70971137)

References

[1] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003

[2] M Dash and H Liu ldquoFeature selection for classificationrdquoIntelligent Data Analysis vol 1 no 1ndash4 pp 131ndash156 1997

[3] M Dash and H Liu ldquoConsistency-based search in featureselectionrdquo Artificial Intelligence vol 151 no 1-2 pp 155ndash1762003

[4] C J Merz and P M Murphy UCI Repository of MachineLearning Databases Department of Information and ComputerScience University of California Irvine Calif USA 1996httpmlearnicsucieduMLRepositoryhtml

[5] K Kira and L A Rendell ldquoFeature selection problem tradi-tional methods and a new algorithmrdquo in Proceedings of the 9thNational Conference on Artificial Intelligence (AAAI rsquo92) pp129ndash134 July 1992

[6] M Robnik-Sikonja and I Kononenko ldquoTheoretical and empir-ical analysis of ReliefF and RReliefFrdquoMachine Learning vol 53no 1-2 pp 23ndash69 2003

[7] I Kononenko ldquoEstimating attributes analysis and extensionof RELIEFrdquo in Proceedings of European Conference on MachineLearning (ECML rsquo94) pp 171ndash182 1994

[8] MAHallCorrelation-based feature subset selection formachinelearning [PhD thesis] Department of Computer Science Uni-versity of Waikato Hamilton New Zealand 1999

[9] J G Bazan ldquoA comparison of dynamic and non-dynamic roughset methods for extracting laws from decision tablerdquo in RoughSets in KnowledgeDiscovery L Polkowski andA Skowron Edspp 321ndash365 Physica Heidelberg Germany 1998

[10] R Battiti ldquoUsing mutual information for selecting features insupervised neural net learningrdquo IEEE Transactions on NeuralNetworks vol 5 no 4 pp 537ndash550 1994

[11] N Kwak and C H Choi ldquoInput feature selection for classifica-tion problemsrdquo IEEE Transactions on Neural Networks vol 13no 1 pp 143ndash159 2002

ISRN Applied Mathematics 11

[12] H Peng F Long and C Ding ldquoFeature selection basedon mutual information criteria of max-dependency max-relevance and min-redundancyrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 27 no 8 pp 1226ndash12382005

[13] J Martınez Sotoca and F Pla ldquoSupervised feature selectionby clustering using conditional mutual information-based dis-tancesrdquo Pattern Recognition vol 43 no 6 pp 2068ndash2081 2010

[14] B Guo R I Damper S R Gunn and J D B NelsonldquoA fast separability-based feature-selection method for high-dimensional remotely sensed image classificationrdquo PatternRecognition vol 41 no 5 pp 1670ndash1679 2008

[15] D W Scott Multivariate Density Estimation Theory Practiceand Visualization John Wiley amp Sons New York NY USA1992

[16] B W Silverman Density Estimation for Statistics and DataAnalysis Chapman amp Hall London UK 1986

[17] A Kraskov H Stogbauer and P Grassberger ldquoEstimatingmutual informationrdquo Physical Review E vol 69 no 6 ArticleID 066138 16 pages 2004

[18] T Beaubouef F E Petry and G Arora ldquoInformation-theoreticmeasures of uncertainty for rough sets and rough relationaldatabasesrdquo Information Sciences vol 109 no 1ndash4 pp 185ndash1951998

[19] I Duntsch and G Gediga ldquoUncertainty measures of rough setpredictionrdquo Artificial Intelligence vol 106 no 1 pp 109ndash1371998

[20] G J Klir and M J Wierman Uncertainty Based InformationElements of Generalized InformationTheory Physica NewYorkNY USA 1999

[21] J Liang and Z Shi ldquoThe information entropy rough entropyand knowledge granulation in rough set theoryrdquo InternationalJournal of Uncertainty Fuzziness and Knowledge-Based Systemsvol 12 no 1 pp 37ndash46 2004

[22] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 pp 379ndash423 1948

[23] T M Cover and J A Thomas Elements of InformationTheoryJohn Wiley amp Sons New York NY USA 1991

[24] Z Pawlak ldquoRough setsrdquo International Journal of Computer andInformation Sciences vol 11 no 5 pp 341ndash356 1982

[25] X Hu and N Cercone ldquoLearning in relational databases arough set approachrdquo Computational Intelligence vol 11 no 2pp 323ndash338 1995

[26] J Huang Y Cai and X Xu ldquoA hybrid genetic algorithmfor feature selection wrapper based on mutual informationrdquoPattern Recognition Letters vol 28 no 13 pp 1825ndash1844 2007

[27] H Liu and L Yu ldquoToward integrating feature selection algo-rithms for classification and clusteringrdquo IEEE Transactions onKnowledge and Data Engineering vol 17 no 4 pp 491ndash5022005

[28] L Yu and H Liu ldquoEfficient feature selection via analysisof relevance and redundancyrdquo Journal of Machine LearningResearch vol 5 pp 1205ndash1224 200304

[29] J D M Rennie L Shih J Teevan and D Karger ldquoTackling thepoor assumptions of naive bayes text classifiersrdquo in ProceedingsTwentieth International Conference on Machine Learning pp616ndash623 Washington DC USA August 2003

[30] R Setiono and H Liu ldquoNeural-network feature selectorrdquo IEEETransactions onNeural Networks vol 8 no 3 pp 654ndash662 1997

[31] S Foithong O Pinngern and B Attachoo ldquoFeature subsetselectionwrapper based onmutual information and rough setsrdquo

Expert Systems with Applications vol 39 no 1 pp 574ndash5842012

[32] U Fayyad and K Irani ldquoMulti-interval discretization ofcontinuous-valued attributes for classification learningrdquo in Pro-ceedings of the 13th International Joint Conference on ArtificialIntelligence pp 1022ndash1027MorganKaufmann SanMateo CalifUSA 1993

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article A Hybrid Feature Selection Method Based ...downloads.hindawi.com/archive/2014/382738.pdf · out that there are four basic steps in a typical feature selec-tion method,

6 ISRN Applied Mathematics

fj

C

E

B

y

fi

A

D

fk

Figure 1 Illustration ofmutual information and conditionalmutualinformation for different scenarios

(3) Selection of the first feature find the feature 119891119894that

maximizes RI(119891119894 119910) set 119865 larr 119865 minus 119891

119894 and 119878 larr 119891

119894

(4) Greedy selection repeat until the termination condi-tion is satisfied

(a) computation of the rough mutual informationRI(119891119894 119910 | 119878) for each feature 119891

119894isin 119865

(b) selection of the next feature choose the feature119891119894as the one that maximizes RI(119891

119894 119910 | 119878) set

119865 larr 119865 minus 119891119894 and 119878 larr 119878 cup 119891

119894

(5) Output the set containing the selected features 119878

52 Selecting the Best Feature Subset on Wrapper ApproachThe wrapper model uses the classification accuracy of apredetermined learning algorithm to determine the goodnessof the selected subset It searches for features that arebetter suited to the learning algorithm aiming at improvingthe performance of the learning algorithm therefore thewrapper approach generally outperforms the filter approachin the aspect of the final predictive accuracy of a learningmachineHowever it ismore computationally expensive thanthe filter models Although many wrapper methods are notexhaustive search most of them still incur time complexity119874(1198732) [27 28] where 119873 is the number of features of thedataset Hence it is worth reducing the search space beforeusing wrapper feature selection approach Through the filtermodel it can reduce high computational cost and avoidencountering the local maximal problemTherefore the finalsubset of the features obtained contains a few features whilethe predictive accuracy is still high

In this work we propose the reducing of the search spaceof the original feature set to the best candidate which canreduce the computational cost of the wrapper search effec-tively Our method uses the sequential backward eliminationtechnique to search for every possible subset of featuresthrough the candidate space

The features are ranked according to the average accuracyof the classifier and then features will be removed one byone from the candidate feature subset only if such exclusionimproves or does not change the classifier accuracy Differ-ent kinds of learning models can be applied to wrappersHowever different kinds of learning machines have differentdiscrimination abilities Naive Bayesian classifier is widelyused in machine learning because it is fast and easy to beimplemented Rennie et al [29] show that its performance iscompetitive with the state-of-the-art models like SVM whilethe latter has too many parameters to decide Therefore wechoose the naive Bayesian classifier as the core of fine tuningThe decrement selection procedure for selecting an optimalfeature subset based on the wrapper approach can be seen asshown in Algorithm 1

There are two phases in the wrapper algorithm, as shown in Algorithm 1. In the first stage, we compute the classification accuracy Acc_C of the candidate feature set produced by the filter model (step 1), where Classperf(D, C) denotes the average classification accuracy on dataset D with the candidate features C; the results are obtained by 10-fold cross-validation. For each f_i ∈ C, we compute the average accuracy Score, and the features are then ranked according to their Score values (steps 3-6). In the second stage, we process the ordered feature list from the first to the last ranked feature (steps 8-26). For each feature in the list, the average accuracy of the naive Bayesian classifier is evaluated with that feature excluded. If a feature is found whose exclusion gives the most improved average accuracy and the relative accuracy [30] improves by more than δ1 (steps 11-14), that feature is removed. Otherwise, every possible feature is considered and the feature that leads to the largest average accuracy is chosen (step 15); it is removed if its exclusion improves or leaves unchanged the average accuracy (steps 17-20) or degrades the relative accuracy by no more than δ2 (steps 21-24). In general, δ1 should take a value in [0, 0.1] and δ2 a value in [0, 0.02]. In the following, unless otherwise specified, δ1 = 0.05 and δ2 = 0.01.

This decremental selection procedure is repeated until the termination condition is satisfied. Usually, sequential backward elimination is more computationally expensive than the incremental sequential forward search; however, it can yield a better result with respect to local maxima. In addition, the sequential forward search, adding one feature at each pass, does not take the interaction between groups of features into account [31]. In many classification problems the class variable may be affected by several features acting together but not by any individual feature alone; therefore, the sequential forward search is unable to find dependencies between groups of features, and its performance can sometimes degrade.

6. Experimental Results

This section presents the evaluation of our method in terms of classification accuracy and the number of selected features.


Input: data set D, candidate feature set C
Output: an optimal feature set B

(1)  Acc_C = Classperf(D, C)
(2)  set B = {}
(3)  for all f_i ∈ C do
(4)      Score = Classperf(D, f_i)
(5)      append f_i to B
(6)  end for
(7)  sort B in ascending order according to Score value
(8)  while |B| > 1 do
(9)      for all f_i ∈ B according to order do
(10)         Acc_fi = Classperf(D, B − f_i)
(11)         if (Acc_fi − Acc_C)/Acc_C > δ1 then
(12)             B = B − f_i; Acc_C = Acc_fi
(13)             go to Step 8
(14)         end if
(15)         Select f_i with the maximum Acc_fi
(16)     end for
(17)     if Acc_fi ≥ Acc_C then
(18)         B = B − f_i; Acc_C = Acc_fi
(19)         go to Step 8
(20)     end if
(21)     if (Acc_C − Acc_fi)/Acc_C ≤ δ2 then
(22)         B = B − f_i; Acc_C = Acc_fi
(23)         go to Step 8
(24)     end if
(25)     go to Step 27
(26) end while
(27) Return an optimal feature subset B

Algorithm 1: Wrapper algorithm.
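The following Python sketch restates Algorithm 1 using the classperf helper sketched in Section 5.2. The function name wrapper_search and its argument names are introduced here only for illustration, features are assumed to be column indices, and the default thresholds are those given in the text (δ1 = 0.05, δ2 = 0.01).

def wrapper_search(X, y, candidate, classperf, delta1=0.05, delta2=0.01):
    """Backward elimination over the candidate subset produced by the filter (Algorithm 1)."""
    acc_c = classperf(X, y, candidate)                        # step 1
    # steps 2-7: rank candidate features by individual accuracy, ascending
    B = sorted(candidate, key=lambda f: classperf(X, y, [f]))
    while len(B) > 1:                                         # step 8
        removed, best_f, best_acc = False, None, -1.0
        for f in list(B):                                     # step 9
            trial = [g for g in B if g != f]
            acc_f = classperf(X, y, trial)                    # step 10
            if (acc_f - acc_c) / acc_c > delta1:              # steps 11-14
                B, acc_c, removed = trial, acc_f, True
                break
            if acc_f > best_acc:                              # step 15
                best_f, best_acc = f, acc_f
        if removed:
            continue
        # steps 17-24: remove the best feature if accuracy does not drop,
        # or drops by no more than delta2 in relative terms
        if best_acc >= acc_c or (acc_c - best_acc) / acc_c <= delta2:
            B = [g for g in B if g != best_f]
            acc_c = best_acc
            continue
        break                                                 # step 25: no removal helps
    return B                                                  # step 27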

Table 1: Experimental data description.

Number   Dataset      Instances   Features   Classes
1        Arrhythmia   452         279        16
2        Hepatitis    155         19         2
3        Ionosphere   351         34         2
4        Segment      2310        19         7
5        Sonar        208         60         2
6        Soybean      683         35         19
7        Vote         435         16         2
8        WDBC         569         16         2
9        Wine         178         12         3
10       Zoo          101         16         7

The aim is to see how well the filter wrapper performs in settings with medium to large numbers of features. In addition, the performance of the rough conditional mutual information algorithm is compared with three typical feature selection methods based on three different evaluation criteria: correlation-based feature selection (CFS), the consistency-based algorithm, and min-redundancy max-relevance (mRMR). The results illustrate the efficiency and effectiveness of our method.

In order to compare our hybrid method with some classical techniques, 10 datasets were downloaded from the UCI repository of machine learning databases. All these datasets are widely used by the data mining community for evaluating learning algorithms. The details of the 10 UCI experimental datasets are listed in Table 1: the dataset sizes vary from 101 to 2310 instances, the numbers of original features vary from 12 to 279, and the numbers of classes vary from 2 to 19.

6.1. Unselect versus CFS, Consistency-Based Algorithm, mRMR, and RCMI. In Section 5, rough conditional mutual information is used to filter the redundant and irrelevant features.


Table 2: Number and accuracy of selected features with different algorithms, tested by naive Bayes.

Number   Unselect    CFS                  Consistency          mRMR                 RCMI
         Acc. (%)    Acc. (%)  Features   Acc. (%)  Features   Acc. (%)  Features   Acc. (%)  Features
1        75.00       78.54     22         75.66     24         75.00     21         77.43     16
2        84.52       89.03     8          85.81     13         87.74     7          85.13     8
3        90.60       92.59     13         90.88     7          90.60     7          94.87     6
4        91.52       93.51     8          93.20     9          93.98     5          93.07     3
5        85.58       77.88     19         82.21     14         84.13     10         86.06     15
6        92.09       91.80     24         84.63     13         90.48     21         91.22     24
7        90.11       94.71     5          91.95     12         95.63     1          95.63     2
8        95.78       96.66     11         96.84     8          95.78     9          95.43     4
9        98.31       98.88     10         99.44     5          98.88     8          98.88     5
10       95.05       95.05     10         93.07     5          94.06     4          95.05     10
Average  89.86       90.87     13         89.37     11         90.63     9.3        91.28     9.3

Table 3: Number and accuracy of selected features with different algorithms, tested by CART.

Number   Unselect    CFS                  Consistency          mRMR                 RCMI
         Acc. (%)    Acc. (%)  Features   Acc. (%)  Features   Acc. (%)  Features   Acc. (%)  Features
1        72.35       72.57     22         71.90     24         73.45     21         75.00     16
2        79.35       81.94     8          80.00     13         83.23     7          85.16     8
3        89.74       90.31     13         89.46     7          86.89     7          91.74     6
4        96.15       96.10     8          95.28     9          95.76     5          95.54     3
5        74.52       74.04     19         77.88     14         76.44     10         77.40     15
6        92.53       91.65     24         85.94     13         92.24     21         91.36     24
7        95.63       95.63     5          95.63     12         95.63     1          96.09     2
8        93.50       94.55     11         94.73     8          95.43     9          94.90     4
9        94.94       94.94     10         96.07     5          93.26     8          94.94     5
10       92.08       93.07     10         92.08     5          92.08     4          93.07     10
Average  88.08       88.48     13         87.90     11         88.44     9.3        89.52     9.3

In order to compute the rough mutual information, we employ Fayyad and Irani's MDL discretization algorithm [32] to transform continuous features into discrete ones.
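A compact sketch of the Fayyad-Irani binary-split recursion with the MDL stopping criterion is given below for reference; it is an illustrative reimplementation of the published criterion, not the code used in the experiments.

import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdlp_cut_points(x, y):
    """Accepted cut points for one continuous feature (Fayyad and Irani, 1993)."""
    order = np.argsort(x)
    return sorted(_split(x[order], y[order]))

def _split(x, y):
    n = len(y)
    if n < 2:
        return []
    ent_s = entropy(y)
    best = None
    for i in range(1, n):                      # candidate boundaries between distinct values
        if x[i] == x[i - 1]:
            continue
        e1, e2 = entropy(y[:i]), entropy(y[i:])
        class_info_ent = (i / n) * e1 + ((n - i) / n) * e2
        if best is None or class_info_ent < best[1]:
            best = (i, class_info_ent, e1, e2)
    if best is None:
        return []
    i, cie, e1, e2 = best
    gain = ent_s - cie
    k = len(np.unique(y))
    k1, k2 = len(np.unique(y[:i])), len(np.unique(y[i:]))
    delta = np.log2(3 ** k - 2) - (k * ent_s - k1 * e1 - k2 * e2)
    if gain <= (np.log2(n - 1) + delta) / n:   # MDL criterion: reject the split
        return []
    cut = (x[i - 1] + x[i]) / 2
    return [cut] + _split(x[:i], y[:i]) + _split(x[i:], y[i:])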

We use the naive Bayesian and CART classifiers to test the classification accuracy of the features selected by the different feature selection methods. The results in Tables 2 and 3 show the classification accuracies and the numbers of selected features obtained with the original features (unselect), RCMI, and the other feature selectors. According to Tables 2 and 3, the features selected by RCMI yield the highest average accuracy for both naive Bayes and CART. It can also be observed that RCMI achieves the smallest average number of selected features, the same as mRMR. This shows that RCMI is better than CFS and the consistency-based algorithm and is comparable to mRMR.

In addition, to illustrate the efficiency of RCMI, we experiment on the Ionosphere, Sonar, and Wine datasets. Different numbers of features selected by RCMI and mRMR are tested on the naive Bayesian classifier, as presented in Figures 2, 3, and 4. In Figures 2-4, the classification accuracies are the results of 10-fold cross-validation tested by naive Bayes, and the number k on the x-axis refers to the first k features in the order selected by each method. The results in Figures 2-4 show that the average accuracy of the classifier with RCMI is comparable to that with mRMR in the majority of cases.

We can see that the maximum value of the plot for each dataset with the RCMI method is higher than that with mRMR. For example, the highest accuracy on Ionosphere achieved by RCMI is 94.87%, while the highest accuracy achieved by mRMR is 90.60%. At the same time, we can also notice that the RCMI curves attain their maximum values more often than the mRMR curves. This shows that RCMI is an effective measure for feature selection.
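Under the same assumptions as the classperf sketch above, curves of this kind can be produced by evaluating the classifier on the first k ranked features for increasing k:

def accuracy_curve(X, y, ranked_features, classperf):
    """Average 10-fold CV accuracy using the top-k ranked features, for k = 1, ..., n."""
    return [classperf(X, y, ranked_features[:k])
            for k in range(1, len(ranked_features) + 1)]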

However, the number of features selected by the RCMI method is still large for some datasets. Therefore, to improve performance and further reduce the number of selected features, we address these issues with the wrapper method. With the redundant and irrelevant features already removed, the fine-tuning core of the wrapper can run much faster.

6.2. Filter Wrapper versus RCMI and Unselect. Similarly, we also use the naive Bayesian and CART classifiers to test the classification accuracy of the features selected by the filter wrapper, RCMI, and unselect. The results in Tables 4 and 5 show the classification accuracies and the numbers of selected features.

Now we analyze the performance of these selected features.


Table 4: Number and accuracy of selected features based on naive Bayes.

Number   Unselect              RCMI                  Filter wrapper
         Acc. (%)  Features    Acc. (%)  Features    Acc. (%)  Features
1        75.00     279         77.43     16          78.76     9
2        84.52     19          85.13     8           85.81     3
3        90.60     34          94.87     6           94.87     6
4        91.52     19          93.07     3           94.33     3
5        85.58     60          86.06     15          86.54     6
6        92.09     35          91.22     24          92.24     10
7        90.11     16          95.63     2           95.63     1
8        95.78     16          95.43     4           95.43     4
9        98.31     12          98.88     5           96.07     3
10       95.05     16          95.05     10          95.05     8
Average  89.86     50.6        91.28     9.3         91.47     5.3

Table 5: Number and accuracy of selected features based on CART.

Number   Unselect              RCMI                  Filter wrapper
         Acc. (%)  Features    Acc. (%)  Features    Acc. (%)  Features
1        72.35     279         75.00     16          75.89     9
2        79.35     19          85.16     8           85.81     3
3        89.74     34          91.74     6           91.74     6
4        96.15     19          95.54     3           95.80     3
5        74.52     60          77.40     15          80.77     6
6        92.53     35          91.36     24          91.95     10
7        95.63     16          96.09     2           95.63     1
8        93.50     16          94.90     4           94.90     4
9        94.94     12          94.94     5           93.82     3
10       92.08     16          93.07     10          93.07     8
Average  88.08     50.6        89.52     9.3         89.94     5.3

Figure 2: Classification accuracy for different numbers of selected features on the Ionosphere dataset (naive Bayes). [Plot omitted: x-axis, number of features; y-axis, classification accuracy; curves for RCMI and mRMR.]

Figure 3: Classification accuracy for different numbers of selected features on the Sonar dataset (naive Bayes). [Plot omitted: x-axis, number of features; y-axis, classification accuracy; curves for RCMI and mRMR.]

Figure 4: Classification accuracy for different numbers of selected features on the Wine dataset (naive Bayes). [Plot omitted: x-axis, number of features; y-axis, classification accuracy; curves for RCMI and mRMR.]

First, we can conclude that although most of the features are removed from the raw data, the classification accuracies do not decrease; on the contrary, they increase for the majority of datasets. The average accuracies obtained with RCMI and with the filter-wrapper method are both higher than those on the unselected data for naive Bayes and CART. With respect to the naive Bayesian learning algorithm, the average accuracy is 91.47% for the filter wrapper versus 89.86% for unselect, an increase of 1.8%. With respect to the CART learning algorithm, the average accuracy is 89.94% for the filter wrapper versus 88.08% for unselect, an increase of 2.1%. The average number of selected features is 5.3 for the filter wrapper, compared with 9.3 for RCMI and 50.6 for unselect, a reduction of 43% and 89.5%, respectively. Therefore, both the average classification accuracy and the average number of features obtained by the filter-wrapper method are better than those obtained by RCMI alone or by using all features. In other words, using RCMI and the wrapper method as a hybrid improves classification efficiency and accuracy compared with using the RCMI method individually.
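For clarity, the relative changes quoted above follow directly from the table averages:

\[
\frac{91.47 - 89.86}{89.86} \approx 1.8\%, \qquad
\frac{89.94 - 88.08}{88.08} \approx 2.1\%, \qquad
\frac{9.3 - 5.3}{9.3} \approx 43\%, \qquad
\frac{50.6 - 5.3}{50.6} \approx 89.5\%.
\]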

7. Conclusion

The main goal of feature selection is to find a feature subset that is as small as possible while retaining high prediction accuracy. A hybrid feature selection approach that takes advantage of both the filter model and the wrapper model has been presented in this paper. In the filter model, measuring the relevance between features plays an important role, and a number of measures have been proposed. Mutual information is widely used for its robustness; however, it is difficult to compute, especially multivariate mutual information. We proposed a set of rough-set-based metrics to measure the relevance between features and analyzed some important properties of these uncertainty measures. We have proved that RCMI can substitute Shannon's conditional mutual information; thus RCMI can be used as an effective measure to filter the irrelevant and redundant features. Based on the candidate feature subset obtained by RCMI, the naive Bayesian classifier is applied in the wrapper model. The accuracy of the naive Bayesian and CART classifiers was used to evaluate the goodness of feature subsets. The performance of the proposed method was evaluated on ten UCI datasets. Experimental results show that the filter-wrapper method outperformed CFS, the consistency-based algorithm, and mRMR in most cases. Our technique not only chooses a small subset of features from a candidate subset but also provides good predictive accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (70971137).

References

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[2] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1-4, pp. 131-156, 1997.

[3] M. Dash and H. Liu, "Consistency-based search in feature selection," Artificial Intelligence, vol. 151, no. 1-2, pp. 155-176, 2003.

[4] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1996, http://mlearn.ics.uci.edu/MLRepository.html.

[5] K. Kira and L. A. Rendell, "Feature selection problem: traditional methods and a new algorithm," in Proceedings of the 9th National Conference on Artificial Intelligence (AAAI '92), pp. 129-134, July 1992.

[6] M. Robnik-Sikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1-2, pp. 23-69, 2003.

[7] I. Kononenko, "Estimating attributes: analysis and extension of RELIEF," in Proceedings of the European Conference on Machine Learning (ECML '94), pp. 171-182, 1994.

[8] M. A. Hall, Correlation-Based Feature Subset Selection for Machine Learning [Ph.D. thesis], Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.

[9] J. G. Bazan, "A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision table," in Rough Sets in Knowledge Discovery, L. Polkowski and A. Skowron, Eds., pp. 321-365, Physica, Heidelberg, Germany, 1998.

[10] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537-550, 1994.

[11] N. Kwak and C. H. Choi, "Input feature selection for classification problems," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143-159, 2002.

[12] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.

[13] J. Martínez Sotoca and F. Pla, "Supervised feature selection by clustering using conditional mutual information-based distances," Pattern Recognition, vol. 43, no. 6, pp. 2068-2081, 2010.

[14] B. Guo, R. I. Damper, S. R. Gunn, and J. D. B. Nelson, "A fast separability-based feature-selection method for high-dimensional remotely sensed image classification," Pattern Recognition, vol. 41, no. 5, pp. 1670-1679, 2008.

[15] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, New York, NY, USA, 1992.

[16] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, UK, 1986.

[17] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, Article ID 066138, 16 pages, 2004.

[18] T. Beaubouef, F. E. Petry, and G. Arora, "Information-theoretic measures of uncertainty for rough sets and rough relational databases," Information Sciences, vol. 109, no. 1-4, pp. 185-195, 1998.

[19] I. Düntsch and G. Gediga, "Uncertainty measures of rough set prediction," Artificial Intelligence, vol. 106, no. 1, pp. 109-137, 1998.

[20] G. J. Klir and M. J. Wierman, Uncertainty Based Information: Elements of Generalized Information Theory, Physica, New York, NY, USA, 1999.

[21] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set theory," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 12, no. 1, pp. 37-46, 2004.

[22] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423, 1948.

[23] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.

[24] Z. Pawlak, "Rough sets," International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341-356, 1982.

[25] X. Hu and N. Cercone, "Learning in relational databases: a rough set approach," Computational Intelligence, vol. 11, no. 2, pp. 323-338, 1995.

[26] J. Huang, Y. Cai, and X. Xu, "A hybrid genetic algorithm for feature selection wrapper based on mutual information," Pattern Recognition Letters, vol. 28, no. 13, pp. 1825-1844, 2007.

[27] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491-502, 2005.

[28] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205-1224, 2004.

[29] J. D. M. Rennie, L. Shih, J. Teevan, and D. Karger, "Tackling the poor assumptions of naive Bayes text classifiers," in Proceedings of the Twentieth International Conference on Machine Learning, pp. 616-623, Washington, DC, USA, August 2003.

[30] R. Setiono and H. Liu, "Neural-network feature selector," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 654-662, 1997.

[31] S. Foithong, O. Pinngern, and B. Attachoo, "Feature subset selection wrapper based on mutual information and rough sets," Expert Systems with Applications, vol. 39, no. 1, pp. 574-584, 2012.

[32] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022-1027, Morgan Kaufmann, San Mateo, Calif, USA, 1993.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article A Hybrid Feature Selection Method Based ...downloads.hindawi.com/archive/2014/382738.pdf · out that there are four basic steps in a typical feature selec-tion method,

ISRN Applied Mathematics 7

Input data set119863 candidate feature set 119862Output an optimal feature set 119861

(1) Acc119862= Classperf (119863119862)

(2) set 119861 = (3) for all 119891

119894isin 119862 do

(4) Score = Classperf (119863119891119894)

(5) append 119891119894to 119861

(6) end for(7) sort 119861 in an ascending order according to Score value(8) while |119861| gt 1 do(9) for all 119891

119894isin 119861 according to order do

(10) Acc119891119894= Classperf (119863 119861 minus 119891

119894)

(11) if (Acc119891119894minus Acc

119862)Acc

119865gt 1205751then

(12) 119861 = 119861 minus 119891119894 Acc

119862= Acc

119891119894

(13) go to Step 8(14) end if(15) Select 119891

119894with the maximum Acc

119891119894

(16) end for(17) if Acc

119891119894ge Acc

119862then

(18) 119861 = 119861 minus 119891119894 Acc

119862= Acc

119891119894

(19) go to Step 8(20) end if(21) if (Acc

119862minus Acc

119891119894)Acc

119862le 1205752then

(22) 119861 = 119861 minus 119891119894 Acc

119862= Acc

119891119894

(23) go to Step 8(24) end if(25) go to Step 27(26) end while(27) Return an optimal feature subset 119861

Algorithm 1 Wrapper algorithm

Table 1 Experimental data description

Number Datasets Instances Features Classes1 Arrhythmia 452 279 162 Hepatitis 155 19 23 Ionosphere 351 34 24 Segment 2310 19 75 Sonar 208 60 26 Soybean 683 35 197 Vote 435 16 28 WDBC 569 16 29 Wine 178 12 310 Zoo 101 16 7

situation of large and middle-sized features In addition theperformance of the rough conditional mutual informationalgorithm is compared with three typical feature selectionmethods which are based on three different evaluationcriterions respectively These methods include correlationbased feature selection (CFS) consistency based algorithmand min-redundancy max-relevance (mRMR) The resultsillustrate the efficiency and effectiveness of our method

In order to compare our hybrid method with someclassical techniques 10 databases are downloaded from UCIrepository of machine learning databases All these datasets

are widely used by the data mining community for evaluatinglearning algorithms The details of the 10 UCI experimentaldatasets are listed in Table 1 The sizes of databases vary from101 to 2310 the numbers of original features vary from 12 to279 and the numbers of classes vary from 2 to 19

61 Unselect versus CFS Consistency Based AlgorithmmRMR and RCMI In Section 5 rough conditional mutualinformation is used to filter the redundant and irrelevantfeatures In order to compute the rough mutual information

8 ISRN Applied Mathematics

Table 2 Number and accuracy of selected features with different algorithms tested by naive Bayes

Number Unselect CFS Consistency mRMR RCMIAccuracy Accuracy Feature number Accuracy Feature number Accuracy Feature number Accuracy Feature number

1 7500 7854 22 7566 24 7500 21 7743 162 8452 8903 8 8581 13 8774 7 8513 83 9060 9259 13 9088 7 9060 7 9487 64 9152 9351 8 9320 9 9398 5 9307 35 8558 7788 19 8221 14 8413 10 8606 156 9209 9180 24 8463 13 9048 21 9122 247 9011 9471 5 9195 12 9563 1 9563 28 9578 9666 11 9684 8 9578 9 9543 49 9831 9888 10 9944 5 9888 8 9888 510 9505 9505 10 9307 5 9406 4 9505 10Average 8986 9087 13 8937 11 9063 93 9128 93

Table 3 Number and accuracy of selected feature with different algorithms tested by CART

Number Unselect CFS Consistency mRMR RCMIAccuracy Accuracy Feature number Accuracy Feature number Accuracy Feature number Accuracy Feature number

1 7235 7257 22 7190 24 7345 21 7500 162 7935 8194 8 8000 13 8323 7 8516 83 8974 9031 13 8946 7 8689 7 9174 64 9615 9610 8 9528 9 9576 5 9554 35 7452 7404 19 7788 14 7644 10 7740 156 9253 9165 24 8594 13 9224 21 9136 247 9563 9563 5 9563 12 9563 1 9609 28 9350 9455 11 9473 8 9543 9 9490 49 9494 9494 10 9607 5 9326 8 9494 510 9208 9307 10 9208 5 9208 4 9307 10Average 8808 8848 13 8790 11 8844 93 8952 93

we employ Fayyad and Iranirsquos MDL discretization algorithm[32] to transform continuous features into discrete ones

We use naive Bayesian and CART classifier to test theclassification accuracy of selected features with differentfeature selection methods The results in Tables 2 and 3show the classification accuracies and the number of selectedfeatures obtained by the original feature (unselect) RCMIand other feature selectors According to Tables 2 and 3 wecan find that the selected feature by RCMI has the highestaverage accuracy in terms of naive Bayes and CART It canalso be observed that RCMI can achieve the least averagenumber of selected features which is the same asmRMRThisshows that RCMI is better than CFS and consistency basedalgorithm and is comparable to mRMR

In addition to illustrate the efficiency of RCMI we exper-iment on Ionosphere Sonar and Wine datasets respectivelyA different number of the selected features obtained by RCMIand mRMR are tested on naive Bayesian classifier as pre-sented in Figures 2 3 and 4 In Figures 2ndash4 the classificationaccuracies are the results of 10-fold cross-validation testedby naive Bayes The number 119896 in 119909-axis refers to the first 119896features with the selected order by different methods Theresults in Figures 2ndash4 show that the average accuracy ofclassifier with RCMI is comparable to mRMR in the majority

of cases We can see that the maximum value of the plotsfor each dataset with RCMI method is higher than mRMRFor example the highest accuracy of Ionosphere achievedby RCMI is 9487 while the highest accuracy achieved bymRMR is 9060 At the same time we can also notice thattheRCMImethodhas the number ofmaximumvalues higherthan mRMR It shows that RCMI is an effective measure forfeature selection

However the number of the features selected by theRCMI method is still more in some datasets Thereforeto improve performance and reduce the number of theselected features these problems were conducted by usingthe wrapper method With removal of the redundant andirreverent features the core of wrappers for fine tuning canperform much faster

62 Filter Wrapper versus RCMI and Unselect Similarly wealso use naive Bayesian and CART to test the classificationaccuracy of selected features with filter wrapper RCMI andunselect The results in Tables 4 and 5 show the classificationaccuracies and the number of selected features

Now we analyze the performance of these selected fea-tures First we can conclude that although most of featuresare removed from the raw data the classification accuracies

ISRN Applied Mathematics 9

Table 4 Number and accuracy of selected features based on naive Bayes

Number Unselect RCMI Filter wrapperAccuracy Feature number Accuracy Feature number Accuracy Feature number

1 7500 279 7743 16 7876 92 8452 19 8513 8 8581 33 9060 34 9487 6 9487 64 9152 19 9307 3 9433 35 8558 60 8606 15 8654 66 9209 35 9122 24 9224 107 9011 16 9563 2 9563 18 9578 16 9543 4 9543 49 9831 12 9888 5 9607 310 9505 16 9505 10 9505 8Average 8986 506 9128 93 9147 53

Table 5 Number and accuracy of selected features based on CART

Number Unselect RCMI Filter wrapperAccuracy Feature number Accuracy Feature number Accuracy Feature number

1 7235 279 7500 16 7589 92 7935 19 8516 8 8581 33 8974 34 9174 6 9174 64 9615 19 9554 3 9580 35 7452 60 7740 15 8077 66 9253 35 9136 24 9195 107 9563 16 9609 2 9563 18 9350 16 9490 4 9490 49 9494 12 9494 5 9382 310 9208 16 9307 10 9307 8Average 8808 506 8952 93 8994 53

1 2 3 4 5 6 7 8 9 10 11082

084

086

088

09

092

094

096

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 2 Classification accuracy of different number of selectedfeatures on Ionosphere dataset (naive Bayes)

0 2 4 6 8 10 12 14 16072

074

076

078

08

082

084

086

088

09

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 3 Classification accuracy of different number of selectedfeatures on Sonar dataset (naive Bayes)

10 ISRN Applied Mathematics

1 2 3 4 5 6 7 8 9 10082

084

086

088

09

092

094

096

098

1

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 4 Classification accuracy of different number of selectedfeatures on Wine dataset (naive Bayes)

do not decrease on the contrary the classification accuraciesincrease in the majority of datasets The average accuraciesderived from RCMI and filter-wrapper method are all higherthan the unselect datasets with respect to naive Bayes andCARTWith respect to naive Bayesian learning algorithm theaverage accuracy is 9147 for filter wrapper while 8986for unselect The average classification accuracy increased18 With respect to CART learning algorithm the averageaccuracy is 8994 for filter wrapper while 8808 for uns-elect The average classification accuracy increased 21 Theaverage number of selected features is 53 for filter wrapperwhile 93 for RCMI and 506 for unselectThe average numberof selected features reduced 43 and 895 respectivelyTherefore the average value of classification accuracy andnumber of features obtained from the filter-wrapper methodare better than those obtained from the RCMI and unselectIn other words using the RCMI and wrapper methods asa hybrid improves the classification efficiency and accuracycompared with using the RCMI method individually

7 Conclusion

The main goal of feature selection is to find a feature subsetas small as possible while the feature subset has highly pre-diction accuracy A hybrid feature selection approach whichtakes advantages of filter model and wrapper model hasbeen presented in this paper In the filter model measuringthe relevance between features plays an important role Anumber of measures were proposed Mutual information iswidely used for its robustness However it is difficult tocompute mutual information especially multivariate mutualinformation We proposed a set of rough based metrics tomeasure the relevance between features and analyzed someimportant properties of these uncertainty measures We haveproved that the RCMI can substitute Shannonrsquos conditionalmutual information thus RCMI can be used as an effective

measure to filter the irrelevant and redundant features Basedon the candidate feature subset by RCMI naive Bayesianclassifier is applied to the wrapper model The accuracy ofnaive Bayesian and CART classifier was used to evaluate thegoodness of feature subsetsThe performance of the proposedmethod is evaluated based on tenUCI datasets Experimentalresults on ten UCI datasets show that the filter-wrappermethodoutperformedCFS consistency based algorithm andmRMR atmost cases Our technique not only chooses a smallsubset of features from a candidate subset but also providesgood performance in predictive accuracy

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Natural ScienceFoundation of China (70971137)

References

[1] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003

[2] M Dash and H Liu ldquoFeature selection for classificationrdquoIntelligent Data Analysis vol 1 no 1ndash4 pp 131ndash156 1997

[3] M Dash and H Liu ldquoConsistency-based search in featureselectionrdquo Artificial Intelligence vol 151 no 1-2 pp 155ndash1762003

[4] C J Merz and P M Murphy UCI Repository of MachineLearning Databases Department of Information and ComputerScience University of California Irvine Calif USA 1996httpmlearnicsucieduMLRepositoryhtml

[5] K Kira and L A Rendell ldquoFeature selection problem tradi-tional methods and a new algorithmrdquo in Proceedings of the 9thNational Conference on Artificial Intelligence (AAAI rsquo92) pp129ndash134 July 1992

[6] M Robnik-Sikonja and I Kononenko ldquoTheoretical and empir-ical analysis of ReliefF and RReliefFrdquoMachine Learning vol 53no 1-2 pp 23ndash69 2003

[7] I Kononenko ldquoEstimating attributes analysis and extensionof RELIEFrdquo in Proceedings of European Conference on MachineLearning (ECML rsquo94) pp 171ndash182 1994

[8] MAHallCorrelation-based feature subset selection formachinelearning [PhD thesis] Department of Computer Science Uni-versity of Waikato Hamilton New Zealand 1999

[9] J G Bazan ldquoA comparison of dynamic and non-dynamic roughset methods for extracting laws from decision tablerdquo in RoughSets in KnowledgeDiscovery L Polkowski andA Skowron Edspp 321ndash365 Physica Heidelberg Germany 1998

[10] R Battiti ldquoUsing mutual information for selecting features insupervised neural net learningrdquo IEEE Transactions on NeuralNetworks vol 5 no 4 pp 537ndash550 1994

[11] N Kwak and C H Choi ldquoInput feature selection for classifica-tion problemsrdquo IEEE Transactions on Neural Networks vol 13no 1 pp 143ndash159 2002

ISRN Applied Mathematics 11

[12] H Peng F Long and C Ding ldquoFeature selection basedon mutual information criteria of max-dependency max-relevance and min-redundancyrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 27 no 8 pp 1226ndash12382005

[13] J Martınez Sotoca and F Pla ldquoSupervised feature selectionby clustering using conditional mutual information-based dis-tancesrdquo Pattern Recognition vol 43 no 6 pp 2068ndash2081 2010

[14] B Guo R I Damper S R Gunn and J D B NelsonldquoA fast separability-based feature-selection method for high-dimensional remotely sensed image classificationrdquo PatternRecognition vol 41 no 5 pp 1670ndash1679 2008

[15] D W Scott Multivariate Density Estimation Theory Practiceand Visualization John Wiley amp Sons New York NY USA1992

[16] B W Silverman Density Estimation for Statistics and DataAnalysis Chapman amp Hall London UK 1986

[17] A Kraskov H Stogbauer and P Grassberger ldquoEstimatingmutual informationrdquo Physical Review E vol 69 no 6 ArticleID 066138 16 pages 2004

[18] T Beaubouef F E Petry and G Arora ldquoInformation-theoreticmeasures of uncertainty for rough sets and rough relationaldatabasesrdquo Information Sciences vol 109 no 1ndash4 pp 185ndash1951998

[19] I Duntsch and G Gediga ldquoUncertainty measures of rough setpredictionrdquo Artificial Intelligence vol 106 no 1 pp 109ndash1371998

[20] G J Klir and M J Wierman Uncertainty Based InformationElements of Generalized InformationTheory Physica NewYorkNY USA 1999

[21] J Liang and Z Shi ldquoThe information entropy rough entropyand knowledge granulation in rough set theoryrdquo InternationalJournal of Uncertainty Fuzziness and Knowledge-Based Systemsvol 12 no 1 pp 37ndash46 2004

[22] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 pp 379ndash423 1948

[23] T M Cover and J A Thomas Elements of InformationTheoryJohn Wiley amp Sons New York NY USA 1991

[24] Z Pawlak ldquoRough setsrdquo International Journal of Computer andInformation Sciences vol 11 no 5 pp 341ndash356 1982

[25] X Hu and N Cercone ldquoLearning in relational databases arough set approachrdquo Computational Intelligence vol 11 no 2pp 323ndash338 1995

[26] J Huang Y Cai and X Xu ldquoA hybrid genetic algorithmfor feature selection wrapper based on mutual informationrdquoPattern Recognition Letters vol 28 no 13 pp 1825ndash1844 2007

[27] H Liu and L Yu ldquoToward integrating feature selection algo-rithms for classification and clusteringrdquo IEEE Transactions onKnowledge and Data Engineering vol 17 no 4 pp 491ndash5022005

[28] L Yu and H Liu ldquoEfficient feature selection via analysisof relevance and redundancyrdquo Journal of Machine LearningResearch vol 5 pp 1205ndash1224 200304

[29] J D M Rennie L Shih J Teevan and D Karger ldquoTackling thepoor assumptions of naive bayes text classifiersrdquo in ProceedingsTwentieth International Conference on Machine Learning pp616ndash623 Washington DC USA August 2003

[30] R Setiono and H Liu ldquoNeural-network feature selectorrdquo IEEETransactions onNeural Networks vol 8 no 3 pp 654ndash662 1997

[31] S Foithong O Pinngern and B Attachoo ldquoFeature subsetselectionwrapper based onmutual information and rough setsrdquo

Expert Systems with Applications vol 39 no 1 pp 574ndash5842012

[32] U Fayyad and K Irani ldquoMulti-interval discretization ofcontinuous-valued attributes for classification learningrdquo in Pro-ceedings of the 13th International Joint Conference on ArtificialIntelligence pp 1022ndash1027MorganKaufmann SanMateo CalifUSA 1993

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article A Hybrid Feature Selection Method Based ...downloads.hindawi.com/archive/2014/382738.pdf · out that there are four basic steps in a typical feature selec-tion method,

8 ISRN Applied Mathematics

Table 2 Number and accuracy of selected features with different algorithms tested by naive Bayes

Number Unselect CFS Consistency mRMR RCMIAccuracy Accuracy Feature number Accuracy Feature number Accuracy Feature number Accuracy Feature number

1 7500 7854 22 7566 24 7500 21 7743 162 8452 8903 8 8581 13 8774 7 8513 83 9060 9259 13 9088 7 9060 7 9487 64 9152 9351 8 9320 9 9398 5 9307 35 8558 7788 19 8221 14 8413 10 8606 156 9209 9180 24 8463 13 9048 21 9122 247 9011 9471 5 9195 12 9563 1 9563 28 9578 9666 11 9684 8 9578 9 9543 49 9831 9888 10 9944 5 9888 8 9888 510 9505 9505 10 9307 5 9406 4 9505 10Average 8986 9087 13 8937 11 9063 93 9128 93

Table 3 Number and accuracy of selected feature with different algorithms tested by CART

Number Unselect CFS Consistency mRMR RCMIAccuracy Accuracy Feature number Accuracy Feature number Accuracy Feature number Accuracy Feature number

1 7235 7257 22 7190 24 7345 21 7500 162 7935 8194 8 8000 13 8323 7 8516 83 8974 9031 13 8946 7 8689 7 9174 64 9615 9610 8 9528 9 9576 5 9554 35 7452 7404 19 7788 14 7644 10 7740 156 9253 9165 24 8594 13 9224 21 9136 247 9563 9563 5 9563 12 9563 1 9609 28 9350 9455 11 9473 8 9543 9 9490 49 9494 9494 10 9607 5 9326 8 9494 510 9208 9307 10 9208 5 9208 4 9307 10Average 8808 8848 13 8790 11 8844 93 8952 93

we employ Fayyad and Iranirsquos MDL discretization algorithm[32] to transform continuous features into discrete ones

We use naive Bayesian and CART classifier to test theclassification accuracy of selected features with differentfeature selection methods The results in Tables 2 and 3show the classification accuracies and the number of selectedfeatures obtained by the original feature (unselect) RCMIand other feature selectors According to Tables 2 and 3 wecan find that the selected feature by RCMI has the highestaverage accuracy in terms of naive Bayes and CART It canalso be observed that RCMI can achieve the least averagenumber of selected features which is the same asmRMRThisshows that RCMI is better than CFS and consistency basedalgorithm and is comparable to mRMR

In addition to illustrate the efficiency of RCMI we exper-iment on Ionosphere Sonar and Wine datasets respectivelyA different number of the selected features obtained by RCMIand mRMR are tested on naive Bayesian classifier as pre-sented in Figures 2 3 and 4 In Figures 2ndash4 the classificationaccuracies are the results of 10-fold cross-validation testedby naive Bayes The number 119896 in 119909-axis refers to the first 119896features with the selected order by different methods Theresults in Figures 2ndash4 show that the average accuracy ofclassifier with RCMI is comparable to mRMR in the majority

of cases We can see that the maximum value of the plotsfor each dataset with RCMI method is higher than mRMRFor example the highest accuracy of Ionosphere achievedby RCMI is 9487 while the highest accuracy achieved bymRMR is 9060 At the same time we can also notice thattheRCMImethodhas the number ofmaximumvalues higherthan mRMR It shows that RCMI is an effective measure forfeature selection

However the number of the features selected by theRCMI method is still more in some datasets Thereforeto improve performance and reduce the number of theselected features these problems were conducted by usingthe wrapper method With removal of the redundant andirreverent features the core of wrappers for fine tuning canperform much faster

62 Filter Wrapper versus RCMI and Unselect Similarly wealso use naive Bayesian and CART to test the classificationaccuracy of selected features with filter wrapper RCMI andunselect The results in Tables 4 and 5 show the classificationaccuracies and the number of selected features

Now we analyze the performance of these selected fea-tures First we can conclude that although most of featuresare removed from the raw data the classification accuracies

ISRN Applied Mathematics 9

Table 4 Number and accuracy of selected features based on naive Bayes

Number Unselect RCMI Filter wrapperAccuracy Feature number Accuracy Feature number Accuracy Feature number

1 7500 279 7743 16 7876 92 8452 19 8513 8 8581 33 9060 34 9487 6 9487 64 9152 19 9307 3 9433 35 8558 60 8606 15 8654 66 9209 35 9122 24 9224 107 9011 16 9563 2 9563 18 9578 16 9543 4 9543 49 9831 12 9888 5 9607 310 9505 16 9505 10 9505 8Average 8986 506 9128 93 9147 53

Table 5 Number and accuracy of selected features based on CART

Number Unselect RCMI Filter wrapperAccuracy Feature number Accuracy Feature number Accuracy Feature number

1 7235 279 7500 16 7589 92 7935 19 8516 8 8581 33 8974 34 9174 6 9174 64 9615 19 9554 3 9580 35 7452 60 7740 15 8077 66 9253 35 9136 24 9195 107 9563 16 9609 2 9563 18 9350 16 9490 4 9490 49 9494 12 9494 5 9382 310 9208 16 9307 10 9307 8Average 8808 506 8952 93 8994 53

1 2 3 4 5 6 7 8 9 10 11082

084

086

088

09

092

094

096

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 2 Classification accuracy of different number of selectedfeatures on Ionosphere dataset (naive Bayes)

0 2 4 6 8 10 12 14 16072

074

076

078

08

082

084

086

088

09

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 3 Classification accuracy of different number of selectedfeatures on Sonar dataset (naive Bayes)

10 ISRN Applied Mathematics

1 2 3 4 5 6 7 8 9 10082

084

086

088

09

092

094

096

098

1

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 4 Classification accuracy of different number of selectedfeatures on Wine dataset (naive Bayes)

do not decrease on the contrary the classification accuraciesincrease in the majority of datasets The average accuraciesderived from RCMI and filter-wrapper method are all higherthan the unselect datasets with respect to naive Bayes andCARTWith respect to naive Bayesian learning algorithm theaverage accuracy is 9147 for filter wrapper while 8986for unselect The average classification accuracy increased18 With respect to CART learning algorithm the averageaccuracy is 8994 for filter wrapper while 8808 for uns-elect The average classification accuracy increased 21 Theaverage number of selected features is 53 for filter wrapperwhile 93 for RCMI and 506 for unselectThe average numberof selected features reduced 43 and 895 respectivelyTherefore the average value of classification accuracy andnumber of features obtained from the filter-wrapper methodare better than those obtained from the RCMI and unselectIn other words using the RCMI and wrapper methods asa hybrid improves the classification efficiency and accuracycompared with using the RCMI method individually

7 Conclusion

The main goal of feature selection is to find a feature subsetas small as possible while the feature subset has highly pre-diction accuracy A hybrid feature selection approach whichtakes advantages of filter model and wrapper model hasbeen presented in this paper In the filter model measuringthe relevance between features plays an important role Anumber of measures were proposed Mutual information iswidely used for its robustness However it is difficult tocompute mutual information especially multivariate mutualinformation We proposed a set of rough based metrics tomeasure the relevance between features and analyzed someimportant properties of these uncertainty measures We haveproved that the RCMI can substitute Shannonrsquos conditionalmutual information thus RCMI can be used as an effective

measure to filter the irrelevant and redundant features Basedon the candidate feature subset by RCMI naive Bayesianclassifier is applied to the wrapper model The accuracy ofnaive Bayesian and CART classifier was used to evaluate thegoodness of feature subsetsThe performance of the proposedmethod is evaluated based on tenUCI datasets Experimentalresults on ten UCI datasets show that the filter-wrappermethodoutperformedCFS consistency based algorithm andmRMR atmost cases Our technique not only chooses a smallsubset of features from a candidate subset but also providesgood performance in predictive accuracy

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Natural ScienceFoundation of China (70971137)

References

[1] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003

[2] M Dash and H Liu ldquoFeature selection for classificationrdquoIntelligent Data Analysis vol 1 no 1ndash4 pp 131ndash156 1997

[3] M Dash and H Liu ldquoConsistency-based search in featureselectionrdquo Artificial Intelligence vol 151 no 1-2 pp 155ndash1762003

[4] C J Merz and P M Murphy UCI Repository of MachineLearning Databases Department of Information and ComputerScience University of California Irvine Calif USA 1996httpmlearnicsucieduMLRepositoryhtml

[5] K Kira and L A Rendell ldquoFeature selection problem tradi-tional methods and a new algorithmrdquo in Proceedings of the 9thNational Conference on Artificial Intelligence (AAAI rsquo92) pp129ndash134 July 1992

[6] M Robnik-Sikonja and I Kononenko ldquoTheoretical and empir-ical analysis of ReliefF and RReliefFrdquoMachine Learning vol 53no 1-2 pp 23ndash69 2003

[7] I Kononenko ldquoEstimating attributes analysis and extensionof RELIEFrdquo in Proceedings of European Conference on MachineLearning (ECML rsquo94) pp 171ndash182 1994

[8] MAHallCorrelation-based feature subset selection formachinelearning [PhD thesis] Department of Computer Science Uni-versity of Waikato Hamilton New Zealand 1999

[9] J G Bazan ldquoA comparison of dynamic and non-dynamic roughset methods for extracting laws from decision tablerdquo in RoughSets in KnowledgeDiscovery L Polkowski andA Skowron Edspp 321ndash365 Physica Heidelberg Germany 1998

[10] R Battiti ldquoUsing mutual information for selecting features insupervised neural net learningrdquo IEEE Transactions on NeuralNetworks vol 5 no 4 pp 537ndash550 1994

[11] N Kwak and C H Choi ldquoInput feature selection for classifica-tion problemsrdquo IEEE Transactions on Neural Networks vol 13no 1 pp 143ndash159 2002

ISRN Applied Mathematics 11

[12] H Peng F Long and C Ding ldquoFeature selection basedon mutual information criteria of max-dependency max-relevance and min-redundancyrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 27 no 8 pp 1226ndash12382005

[13] J Martınez Sotoca and F Pla ldquoSupervised feature selectionby clustering using conditional mutual information-based dis-tancesrdquo Pattern Recognition vol 43 no 6 pp 2068ndash2081 2010

[14] B Guo R I Damper S R Gunn and J D B NelsonldquoA fast separability-based feature-selection method for high-dimensional remotely sensed image classificationrdquo PatternRecognition vol 41 no 5 pp 1670ndash1679 2008

[15] D W Scott Multivariate Density Estimation Theory Practiceand Visualization John Wiley amp Sons New York NY USA1992

[16] B W Silverman Density Estimation for Statistics and DataAnalysis Chapman amp Hall London UK 1986

[17] A Kraskov H Stogbauer and P Grassberger ldquoEstimatingmutual informationrdquo Physical Review E vol 69 no 6 ArticleID 066138 16 pages 2004

[18] T Beaubouef F E Petry and G Arora ldquoInformation-theoreticmeasures of uncertainty for rough sets and rough relationaldatabasesrdquo Information Sciences vol 109 no 1ndash4 pp 185ndash1951998

[19] I Duntsch and G Gediga ldquoUncertainty measures of rough setpredictionrdquo Artificial Intelligence vol 106 no 1 pp 109ndash1371998

[20] G J Klir and M J Wierman Uncertainty Based InformationElements of Generalized InformationTheory Physica NewYorkNY USA 1999

[21] J Liang and Z Shi ldquoThe information entropy rough entropyand knowledge granulation in rough set theoryrdquo InternationalJournal of Uncertainty Fuzziness and Knowledge-Based Systemsvol 12 no 1 pp 37ndash46 2004

[22] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 pp 379ndash423 1948

[23] T M Cover and J A Thomas Elements of InformationTheoryJohn Wiley amp Sons New York NY USA 1991

[24] Z Pawlak ldquoRough setsrdquo International Journal of Computer andInformation Sciences vol 11 no 5 pp 341ndash356 1982

[25] X Hu and N Cercone ldquoLearning in relational databases arough set approachrdquo Computational Intelligence vol 11 no 2pp 323ndash338 1995

[26] J Huang Y Cai and X Xu ldquoA hybrid genetic algorithmfor feature selection wrapper based on mutual informationrdquoPattern Recognition Letters vol 28 no 13 pp 1825ndash1844 2007

[27] H Liu and L Yu ldquoToward integrating feature selection algo-rithms for classification and clusteringrdquo IEEE Transactions onKnowledge and Data Engineering vol 17 no 4 pp 491ndash5022005

[28] L Yu and H Liu ldquoEfficient feature selection via analysisof relevance and redundancyrdquo Journal of Machine LearningResearch vol 5 pp 1205ndash1224 200304

[29] J D M Rennie L Shih J Teevan and D Karger ldquoTackling thepoor assumptions of naive bayes text classifiersrdquo in ProceedingsTwentieth International Conference on Machine Learning pp616ndash623 Washington DC USA August 2003

[30] R Setiono and H Liu ldquoNeural-network feature selectorrdquo IEEETransactions onNeural Networks vol 8 no 3 pp 654ndash662 1997

[31] S Foithong O Pinngern and B Attachoo ldquoFeature subsetselectionwrapper based onmutual information and rough setsrdquo

Expert Systems with Applications vol 39 no 1 pp 574ndash5842012

[32] U Fayyad and K Irani ldquoMulti-interval discretization ofcontinuous-valued attributes for classification learningrdquo in Pro-ceedings of the 13th International Joint Conference on ArtificialIntelligence pp 1022ndash1027MorganKaufmann SanMateo CalifUSA 1993

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: Research Article A Hybrid Feature Selection Method Based ...downloads.hindawi.com/archive/2014/382738.pdf · out that there are four basic steps in a typical feature selec-tion method,

ISRN Applied Mathematics 9

Table 4 Number and accuracy of selected features based on naive Bayes

Number Unselect RCMI Filter wrapperAccuracy Feature number Accuracy Feature number Accuracy Feature number

1 7500 279 7743 16 7876 92 8452 19 8513 8 8581 33 9060 34 9487 6 9487 64 9152 19 9307 3 9433 35 8558 60 8606 15 8654 66 9209 35 9122 24 9224 107 9011 16 9563 2 9563 18 9578 16 9543 4 9543 49 9831 12 9888 5 9607 310 9505 16 9505 10 9505 8Average 8986 506 9128 93 9147 53

Table 5 Number and accuracy of selected features based on CART

Number Unselect RCMI Filter wrapperAccuracy Feature number Accuracy Feature number Accuracy Feature number

1 7235 279 7500 16 7589 92 7935 19 8516 8 8581 33 8974 34 9174 6 9174 64 9615 19 9554 3 9580 35 7452 60 7740 15 8077 66 9253 35 9136 24 9195 107 9563 16 9609 2 9563 18 9350 16 9490 4 9490 49 9494 12 9494 5 9382 310 9208 16 9307 10 9307 8Average 8808 506 8952 93 8994 53

1 2 3 4 5 6 7 8 9 10 11082

084

086

088

09

092

094

096

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 2 Classification accuracy of different number of selectedfeatures on Ionosphere dataset (naive Bayes)

0 2 4 6 8 10 12 14 16072

074

076

078

08

082

084

086

088

09

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 3 Classification accuracy of different number of selectedfeatures on Sonar dataset (naive Bayes)

10 ISRN Applied Mathematics

1 2 3 4 5 6 7 8 9 10082

084

086

088

09

092

094

096

098

1

Number of features

Clas

sifica

tion

accu

racy

RCMImRMR

Figure 4 Classification accuracy of different number of selectedfeatures on Wine dataset (naive Bayes)

do not decrease on the contrary the classification accuraciesincrease in the majority of datasets The average accuraciesderived from RCMI and filter-wrapper method are all higherthan the unselect datasets with respect to naive Bayes andCARTWith respect to naive Bayesian learning algorithm theaverage accuracy is 9147 for filter wrapper while 8986for unselect The average classification accuracy increased18 With respect to CART learning algorithm the averageaccuracy is 8994 for filter wrapper while 8808 for uns-elect The average classification accuracy increased 21 Theaverage number of selected features is 53 for filter wrapperwhile 93 for RCMI and 506 for unselectThe average numberof selected features reduced 43 and 895 respectivelyTherefore the average value of classification accuracy andnumber of features obtained from the filter-wrapper methodare better than those obtained from the RCMI and unselectIn other words using the RCMI and wrapper methods asa hybrid improves the classification efficiency and accuracycompared with using the RCMI method individually

7 Conclusion

The main goal of feature selection is to find a feature subset that is as small as possible while retaining high prediction accuracy. A hybrid feature selection approach that takes advantage of both the filter model and the wrapper model has been presented in this paper. In the filter model, measuring the relevance between features plays an important role, and a number of measures have been proposed. Mutual information is widely used because of its robustness; however, it is difficult to compute, especially multivariate mutual information. We proposed a set of rough-set-based metrics to measure the relevance between features and analyzed some important properties of these uncertainty measures. We have proved that RCMI can substitute Shannon's conditional mutual information; thus RCMI can be used as an effective measure to filter out irrelevant and redundant features. Based on the candidate feature subset produced by RCMI, a naive Bayesian classifier is applied in the wrapper model, and the accuracies of the naive Bayesian and CART classifiers are used to evaluate the goodness of feature subsets. The performance of the proposed method was evaluated on ten UCI datasets. The experimental results show that the filter-wrapper method outperformed CFS, the consistency-based algorithm, and mRMR in most cases. Our technique not only chooses a small subset of features from a candidate subset but also provides good performance in predictive accuracy.
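As a rough illustration of the filter-wrapper pipeline summarized here, the sketch below ranks candidate features with a filter score and then refines the candidate subset with a greedy forward wrapper search scored by cross-validated naive Bayes accuracy. The filter stand-in (mutual_info_classif) and the forward search strategy are assumptions made for illustration only; they are not the RCMI measure or the exact search procedure defined earlier in the paper.

# Schematic sketch of a filter-wrapper pipeline of the kind described above.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def candidate_subset(X, y, k):
    # Filter stage stand-in: rank features by estimated mutual information
    # with the class (the paper uses RCMI here instead).
    scores = mutual_info_classif(X, y, random_state=0)
    return list(np.argsort(scores)[::-1][:k])

def wrapper_refine(X, y, candidates, folds=10):
    # Wrapper stage: greedy forward search over the candidate subset,
    # scored by cross-validated naive Bayes accuracy.
    selected, best_acc = [], 0.0
    improved = True
    while improved:
        improved = False
        best_f = None
        for f in (c for c in candidates if c not in selected):
            acc = cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=folds).mean()
            if acc > best_acc:
                best_acc, best_f, improved = acc, f, True
        if best_f is not None:
            selected.append(best_f)
    return selected, best_acc

# Example use (hypothetical):
#   candidates = candidate_subset(X, y, k=10)
#   subset, accuracy = wrapper_refine(X, y, candidates)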

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (70971137).

References

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[2] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 1–4, pp. 131–156, 1997.

[3] M. Dash and H. Liu, "Consistency-based search in feature selection," Artificial Intelligence, vol. 151, no. 1-2, pp. 155–176, 2003.

[4] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1996, http://mlearn.ics.uci.edu/MLRepository.html.

[5] K. Kira and L. A. Rendell, "The feature selection problem: traditional methods and a new algorithm," in Proceedings of the 9th National Conference on Artificial Intelligence (AAAI '92), pp. 129–134, July 1992.

[6] M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1-2, pp. 23–69, 2003.

[7] I. Kononenko, "Estimating attributes: analysis and extension of RELIEF," in Proceedings of the European Conference on Machine Learning (ECML '94), pp. 171–182, 1994.

[8] M. A. Hall, Correlation-based feature subset selection for machine learning [Ph.D. thesis], Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.

[9] J. G. Bazan, "A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision table," in Rough Sets in Knowledge Discovery, L. Polkowski and A. Skowron, Eds., pp. 321–365, Physica, Heidelberg, Germany, 1998.

[10] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.

[11] N. Kwak and C. H. Choi, "Input feature selection for classification problems," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143–159, 2002.


[12] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[13] J. Martínez Sotoca and F. Pla, "Supervised feature selection by clustering using conditional mutual information-based distances," Pattern Recognition, vol. 43, no. 6, pp. 2068–2081, 2010.

[14] B. Guo, R. I. Damper, S. R. Gunn, and J. D. B. Nelson, "A fast separability-based feature-selection method for high-dimensional remotely sensed image classification," Pattern Recognition, vol. 41, no. 5, pp. 1670–1679, 2008.

[15] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, New York, NY, USA, 1992.

[16] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, UK, 1986.

[17] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, Article ID 066138, 16 pages, 2004.

[18] T. Beaubouef, F. E. Petry, and G. Arora, "Information-theoretic measures of uncertainty for rough sets and rough relational databases," Information Sciences, vol. 109, no. 1–4, pp. 185–195, 1998.

[19] I. Düntsch and G. Gediga, "Uncertainty measures of rough set prediction," Artificial Intelligence, vol. 106, no. 1, pp. 109–137, 1998.

[20] G. J. Klir and M. J. Wierman, Uncertainty Based Information: Elements of Generalized Information Theory, Physica, New York, NY, USA, 1999.

[21] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set theory," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 12, no. 1, pp. 37–46, 2004.

[22] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[23] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.

[24] Z. Pawlak, "Rough sets," International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.

[25] X. Hu and N. Cercone, "Learning in relational databases: a rough set approach," Computational Intelligence, vol. 11, no. 2, pp. 323–338, 1995.

[26] J. Huang, Y. Cai, and X. Xu, "A hybrid genetic algorithm for feature selection wrapper based on mutual information," Pattern Recognition Letters, vol. 28, no. 13, pp. 1825–1844, 2007.

[27] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491–502, 2005.

[28] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205–1224, 2004.

[29] J. D. M. Rennie, L. Shih, J. Teevan, and D. Karger, "Tackling the poor assumptions of naive Bayes text classifiers," in Proceedings of the Twentieth International Conference on Machine Learning, pp. 616–623, Washington, DC, USA, August 2003.

[30] R. Setiono and H. Liu, "Neural-network feature selector," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 654–662, 1997.

[31] S. Foithong, O. Pinngern, and B. Attachoo, "Feature subset selection wrapper based on mutual information and rough sets," Expert Systems with Applications, vol. 39, no. 1, pp. 574–584, 2012.

[32] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027, Morgan Kaufmann, San Mateo, Calif, USA, 1993.
