
Expert Systems with Applications 42 (2015) 573–583


Maximum relevancy maximum complementary feature selection for multi-sensor activity recognition

http://dx.doi.org/10.1016/j.eswa.2014.07.052
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

* Corresponding author.

Saisakul Chernbumroong a, Shuang Cang b, Hongnian Yu a,*
a Faculty of Science and Technology, Bournemouth University, Poole, Dorset BH12 5BB, UK
b School of Tourism, Bournemouth University, Poole, Dorset BH12 5BB, UK

Article info

Article history: Available online 23 August 2014

Keywords: Feature selection; Neural networks; Mutual information; Activity recognition

Abstract

In the multi-sensor activity recognition domain, the input space is often large and contains irrelevant and overlapped features. It is important to perform feature selection in order to select the smallest number of features which can describe the outputs. This paper proposes a new feature selection algorithm using maximal relevance and maximal complementary (MRMC) based on neural networks. Unlike other feature selection algorithms that are based on relevance and redundancy measurements, the idea of how a feature complements the already selected features is utilized. The proposed algorithm is evaluated on two well-defined problems and five real world data sets. The data sets cover different types of data, i.e. real, integer and categorical, and different sizes, i.e. small to large sets of features. The experimental results show that MRMC can select a smaller number of features while achieving good results. The proposed algorithm can be applied to any type of data, and demonstrates great potential for data sets with a large number of features.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

The aim of feature selection is to identify the smallest subset of input features which explains the output classes. This process is especially important for classification problems with a large number of input features. For example, a multi-sensor activity classification system normally contains a large number of input features generated from different sensors. Feature selection can help reduce the size of the feature space, which leads to a reduction in the computational cost and complexity of the classification system. In real world problems where the inputs contain irrelevant and redundant features, feature selection can help identify a relevant feature set, which leads to an improvement in classification performance.

There are three main approaches to feature selection found in wearable sensor-based activity recognition applications: intuition, filter, and wrapper. Intuition-based feature selection requires domain knowledge or an understanding of what is required to classify the activities of interest. This approach is often used in conjunction with visual inspection, statistical analysis of the features, e.g. histograms or distribution graphs, or observations made during activity occurrence (Parkka et al., 2006; Ward, Lukowicz, Troster, & Starner, 2006). Filter-based feature selection measures the relevance between features and the outputs using techniques such as information theory, distance, correlation, the receiver operating characteristic (ROC) curve, etc. Each feature is evaluated for its relevance and then given a ranking score. For example, features which have the best performance in discriminating the activities of interest were selected using ROC (Banos, Damas, Pomares, Prieto, & Rojas, 2012; Ermes, Parkka, Mantyjarvi, & Korhonen, 2008). Many statistical tests are used with this approach, e.g. chi-square, T-test, etc. The study in Banos et al. (2012) found that features selected by the ranking quality group technique based on discrimination and robustness, ROC, T-test or the Wilcoxon test, combined with a support vector machine, produced remarkable results. Mutual information (MI) is another popular measure of the relationship between two variables. Feature selection techniques which use MI include maximum relevance minimum redundancy (Peng, Long, & Ding, 2005), normalized mutual information feature selection-feature space (Cang & Yu, 2012), and feature selection based on cumulate conditional mutual information (Zhang & Zhang, 2012). Some techniques use neural networks to rank the features, e.g. the neural network feature selector (Setiono & Liu, 1997), the Clamping technique (Wang, Jones, & Partridge, 2000), and the constructive approach for feature selection (Kabir, Islam, & Murase, 2010). The main advantages of the filter approach are its simplicity, speed and independence of the classification algorithm (Saeys, Inza, & Larrañaga, 2007). However, most of the techniques in this approach usually consider


only two variables, i.e. a feature and the class output, thus ignoring dependencies among a set of features. This may lead to the selection of redundant features, resulting in low classification accuracy. Some techniques, such as MRMR (Peng et al., 2005) and NMIFS (Estevez, Tesmer, Perez, & Zurada, 2009), use another criterion, i.e. redundancy, to reduce the chance of selecting redundant features.

Wrapper-based feature selection is the most popular technique in wearable sensor-based activity recognition. In this technique, various feature subsets are generated and evaluated using a classification algorithm, and the best subset is selected using search techniques. Examples of this approach are forward selection (Dalton & Olaighin, 2013; Peng, Ferguson, Rafferty, & Kelly, 2011; Zhang & Sawchuk, 2013), backward selection, forward–backward selection (Khan, Lee, Lee, & Kim, 2010), and exhaustive search (Varkey, Pompili, & Walls, 2012). In forward selection, one feature is added to the feature subset at a time and the subset is evaluated for its performance. Conversely, backward selection removes one feature from the subset at a time and evaluates the subset's performance. Forward–backward selection employs both directions: forward selection is carried out first, and the subset is then refined using backward selection. This approach is computationally more expensive than the filter method; however, it can provide better results as it takes into account feature dependency and the interaction with the classification algorithm. Some studies combine both filter and wrapper methods. For example, the study in Hsu, Hsieh, and Lu (2011) combined the features filtered by information gain and F-score, then used the wrapper method to improve classification accuracy.

Most feature selection methods in the current literature are based on two criteria: relevancy, i.e. how relevant the feature is to the outputs, and redundancy, i.e. reducing the chance of selecting redundant features. However, feature selection using these two criteria does not consider how a feature will complement the already selected features, which may result in selecting a larger number of features than actually required. Moreover, feature selection techniques which only consider the relevancy criterion may select redundant features. Techniques which use the wrapper approach and consider all possible feature subsets suffer from high computational cost.

Considering the above limitations, we propose a new feature selection algorithm with a new criterion, complementarity: how a feature complements the already selected features. To the best of our knowledge, this criterion has not been considered in any other feature selection algorithm. The Clamping technique is employed to measure feature relevance, and we introduce a new measurement to calculate the complementary value of a feature with respect to the already selected feature set. Features are selected based on the criteria of maximum relevance and maximum complementarity. The main difference between the proposed technique and other algorithms is that a complementary measurement is used instead of a redundancy measurement; feature redundancy can still be detected through the complementary measurement, since a redundant feature receives a low complementary score.

The paper is organized as follows. Section 2 presents some popular feature selection algorithms which are used for comparison in this study. Section 3 presents the proposed feature selection technique in detail. We evaluate our algorithm using two well-defined problems, four benchmark data sets, and one multi-sensor activity recognition data set collected from a real home; the experimental results are presented in Section 4. Finally, the discussion and conclusion are presented in Section 5.

2. Related works

Many techniques have been proposed for feature selection, as discussed in the previous section. In this paper, we look at two different approaches used for feature ranking, i.e. mutual information (MI) and neural networks (NN).

2.1. Mutual information based feature selection

MI, which is based on information theory (Shannon, 2001), measures the dependency between two variables. The MI value is zero if and only if the variables are independent. Given continuous variables $f_i$ and $f_j$, the MI is:

$$MI(f_i, f_j) = \iint p(f_i, f_j) \log \frac{p(f_i, f_j)}{p(f_i)\,p(f_j)} \, df_i \, df_j$$

In practice, it is difficult to calculate the MI of continuous values, and the variables are often discretized using bins. The MI of discrete variables is:

$$MI(f_i, f_j) = \sum_{i} \sum_{j} p(f_i, f_j) \log \frac{p(f_i, f_j)}{p(f_i)\,p(f_j)}$$
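For concreteness, the discrete estimate above can be computed directly from a 2-D histogram. The following is a minimal Python sketch (illustrative, not the authors' code) of histogram-based MI estimation; `bins=10` mirrors the discretization used later in the experiments:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram-based MI estimate between two discretized variables."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                 # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```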

There are many feature ranking algorithms based on MI (Cang & Yu, 2012; Estevez et al., 2009; Peng et al., 2005; Zhang & Zhang, 2012). Maximal relevance minimal redundancy (MRMR) is one of the most popular feature selection algorithms, and many algorithms have been built on it. For example, normalized mutual information feature selection (NMIFS) enhances MRMR by using the entropy of the variables to normalize the MI values when calculating the redundancy between variables. MRMR has also been enhanced by using kernel canonical correlation analysis outputs as inputs rather than the actual features (Sakar, Kursun, & Gurgen, 2012).

In this study we investigate two commonly used MI-based feature selection algorithms: MRMR and NMIFS.

2.1.1. MRMR
The MRMR algorithm (Peng et al., 2005) ranks the features based on the minimal redundancy and maximal relevance criterion. It calculates the MI between two features to measure the redundancy, and the MI between a feature and the outputs to measure the relevance. Using the MRMR concept and greedy selection, a set of feature rankings $S$ can be obtained as follows:

(A) Given $S = \{\}$, where $S$ is the set of selected features, and $F = \{f_1, f_2, \ldots, f_N\}$, where $F$ is the set of $N$ features, select the feature $f_s$ in $F$ which has the maximum mutual information between itself and the output $C$, where $C = \{c_1, c_2, \ldots, c_K\}$ and $f_s = \max_{f_i \in F} MI(f_i, C)$, and update $S$ and $F$:

$$S = S \cup \{f_s\} \quad (1)$$
$$F = F \setminus \{f_s\} \quad (2)$$

(B) Select the feature $f_s$ in $F$ which satisfies the following condition:

$$f_s = \max_{f_i \in F} \left\{ MI(f_i, C) - \frac{1}{|S|} \sum_{f_j \in S} MI(f_i, f_j) \right\}$$

Update $S$ and $F$ using (1) and (2). Repeat Step (B) until the desired number of features is obtained.
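The greedy loop of Steps (A) and (B) can be summarized as in the sketch below. This is an illustrative implementation, not the authors' code; it assumes `features` is a dict mapping feature names to numeric arrays, and it reuses the `mutual_information` helper sketched above:

```python
def mrmr_rank(features, C, n_select):
    """Greedy MRMR ranking (Peng et al., 2005): relevance minus mean redundancy."""
    F = list(features)                                    # candidate feature names
    # Step (A): start with the feature most relevant to the output C.
    S = [max(F, key=lambda f: mutual_information(features[f], C))]
    F.remove(S[0])
    # Step (B): repeatedly add the feature maximizing relevance - redundancy.
    while F and len(S) < n_select:
        def score(f):
            redundancy = sum(mutual_information(features[f], features[s]) for s in S) / len(S)
            return mutual_information(features[f], C) - redundancy
        best = max(F, key=score)
        S.append(best)
        F.remove(best)
    return S
```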

2.1.2. NMIFS
The NMIFS algorithm (Estevez et al., 2009) is an enhancement of the MRMR algorithm. A normalized MI (NMI) between two features is used instead:

$$NMI(i, j) = \frac{MI(i, j)}{\min\{H(i), H(j)\}}$$


where $H(\cdot)$ is the entropy function. Similar steps as in MRMR are carried out; however, the condition in Step (B) is changed to:

$$f_s = \max_{f_i \in F} \left\{ MI(f_i, C) - \frac{1}{|S|} \sum_{f_j \in S} NMI(f_i, f_j) \right\}$$
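The normalization is easy to sketch as well (again illustrative, reusing the MI helper above together with a histogram entropy estimate):

```python
def entropy(x, bins=10):
    """Histogram-based entropy estimate H(x)."""
    p, _ = np.histogram(x, bins=bins)
    p = p / p.sum()
    p = p[p > 0]                          # drop empty bins before taking logs
    return float(-(p * np.log(p)).sum())

def normalized_mi(x, y, bins=10):
    """NMI(i, j) = MI(i, j) / min{H(i), H(j)}, as used by NMIFS."""
    return mutual_information(x, y, bins) / min(entropy(x, bins), entropy(y, bins))
```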


2.2. Neural network based feature selection

Some studies have proposed using NNs for feature selection (Kabir et al., 2010; Setiono & Liu, 1997; Wang et al., 2000). For example, the neural network feature selector (NNFS) (Setiono & Liu, 1997) selects features based on the weights associated with those features; the weights associated with unimportant features have values close to zero. NNFS adds a penalty term to the cross-entropy error function in order to distinguish redundant network connections. The technique proposed in Verikas and Bacauskiene (2002) trains the network by minimizing the cross-entropy error function augmented with additional terms to constrain the derivatives; features are selected based on how the validation classification error reacts to the removal of individual features. That algorithm was tested on three real-world problems, and the results indicate that it outperforms techniques such as NNFS, fuzzy entropy, discriminant analysis, the neural network output sensitivity based feature saliency measure, and the weights-based feature saliency measure. The Clamping technique proposed in Wang et al. (2000) determines the importance of a feature by observing the network performance when that feature is clamped. In this study we compare the performance of the proposed algorithm with the Clamping algorithm.

The idea of the Clamping algorithm is that if a feature is important, clamping it to a fixed value, e.g. its mean, will affect the generalized classification accuracy. First, the network is trained using all features on the training data set, and the generalized performance $g(F)$ is obtained by calculating the classification accuracy on the test data set. The generalized performance with a clamped feature, $g(F \mid f_i = \bar{f}_i)$, is obtained by calculating the classification accuracy on the test data set where feature $f_i$ is clamped to its mean value. The importance of feature $f_i$ is calculated as:

$$Im(f_i) = 1 - \frac{g(F \mid f_i = \bar{f}_i)}{g(F)} \quad (3)$$

The features are then ranked by their importance in descending order.
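The clamping loop is straightforward to sketch. The code below is illustrative only; `accuracy(model, X, y)` is a hypothetical helper returning the classification accuracy of an already trained network:

```python
import numpy as np

def clamping_importance(model, X_test, y_test):
    """Importance of each feature as the relative accuracy drop when it is
    clamped to its mean value, following Eq. (3)."""
    base = accuracy(model, X_test, y_test)        # g(F): baseline performance
    importance = []
    for i in range(X_test.shape[1]):
        Xc = X_test.copy()
        Xc[:, i] = X_test[:, i].mean()            # clamp feature i to its mean
        importance.append(1.0 - accuracy(model, Xc, y_test) / base)
    return importance                             # rank in descending order
```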

The Clamping technique provides a robust ranking even on noisy data. However, it only considers the relationship between one feature and the classes; it does not consider any relationship between the features. MRMR and NMIFS do consider the relationship between features, but only the relationship between two features at a time is measured. None of these three techniques considers how a feature would complement the already selected features. In this paper, we propose a new feature selection technique which considers the relationship between features and the class as well as the relationship among a group of features. The details of the proposed algorithm are presented in the next section.

Fig. 1. The architecture of the network with all features and the relevancy network.

3. Methodology

The proposed feature selection method is based on the criteria of maximum relevance and maximum complementarity (MRMC) of the feature. In our method, an NN is employed to calculate the relevance and complementary scores. The NN is based on the concept of connectionism, where several input nodes are connected with associated weights to several output nodes. We use a network with one hidden layer, i.e. a multi-layer perceptron (MLP). Given an input of $N$ features $F = \{f_1, f_2, \ldots, f_i, \ldots, f_N\}$, the network predicts an output of $K$ classes $C = \{c_1, c_2, \ldots, c_K\}$. Fig. 1 depicts the neural network architecture, where $b_1$ is the bias and the weights are $W = \{w_{11}, w_{12}, \ldots, w_{Nj}\}$, with $w_{11}$ representing the weight connecting $f_1$ to hidden node 1. The weights and bias are generated randomly from a univariate distribution. The network output node $y_i$ can be calculated from the summation function (Bishop, 1995):

$$y_i = g\left( \sum_{i=1}^{N} W^{T} f_i + b_1 \right)$$

where $g(z)$ is a sigmoid activation function; in this study, the logistic function $g(z) = \frac{1}{1 + e^{-z}}$ is used. The network tries to minimize the following cost function:

$$J(W) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \left\{ y_i^{(k)} \log(y_i)^{(k)} + \left(1 - y_i^{(k)}\right) \log(1 - y_i)^{(k)} \right\}$$

Two measurements, the relevancy score and the complementary score, are introduced below for calculating each feature's score.

3.1. Relevancy score

The relevancy score is designed to show how important each feature is to the overall network. By removing the feature node from the network and then calculating the network's performance, the relevancy of the feature can be obtained: if the clamped feature is important, the network performance will be significantly affected. First, the base network is constructed using all the features $F$ and its performance is used as the baseline. Next, the feature $f_i$ is removed from the network. In order to remove the feature without disrupting the whole network, a static value is used; in this study, the mean value of the feature is used ($f_i = \bar{f}_i$). This network is referred to as the relevancy network. After the feature is removed, the network performance is re-calculated and compared with the baseline performance. Fig. 1 shows the architecture and concept of the baseline network and the network with the removed feature.

Given a feature set $F$, the relevance of feature $f_i$, $Rel_{f_i}$, is:

$$Rel_{f_i} = 1 - \frac{P'(F \mid f_i = \bar{f}_i)}{P(F)} \quad (4)$$

where $P$ is the generalized performance of the neural network using feature set $F$, and $P'$ is the generalized performance of the neural network using feature set $F$ where the values of feature $f_i$ are substituted by its mean value. Note that the values of $P$ and $P'$ are between 0 and 1. A higher relevancy score means the feature is more important; the score reflects the efficacy of the feature, should it be removed from the network. For example, $Rel_{f_i} = 0.7$ means that the absence of feature $f_i$ lowers the network's performance by 70%.

The relevance measurement only considers the relationship between a single input and the class. It does not consider relationships between features, i.e. redundancy and complementarity. We enhance the Clamping method by introducing another measurement that measures the complementarity of a feature with respect to the already selected feature set. Also, unlike other techniques which use a redundancy measurement, MRMC considers feature complementarity, which to the best of our knowledge has not been used in other feature selection algorithms before.

3.2. Complementary score

The complementary score measures how much a feature complements the already selected feature set. It also takes feature redundancy into account: if the feature is redundant with respect to the already selected features, the score should be low, as it does not bring additional information to the classification. First, the baseline performance is obtained by constructing a network using all selected features $S$ and calculating its performance. Next, a new feature $f_i$ is added to the network; this network is referred to as the complementary network. The architecture and concept of the baseline network and the network with the new feature are shown in Fig. 2.

From Fig. 2, it can be seen that the weights for the feature $f_i$ need to be obtained, as they do not exist in the baseline network. In our algorithm, we modify the construction of the complementary network so that it partly reuses the weights and biases from the baseline network. We assume that the baseline network has already identified the correct weights for the already selected features; using the same weights and bias therefore helps the network converge faster, and also reduces the possibility of the complementary network obtaining poor performance due to random initial weights. As the input and hidden nodes of the baseline network and the complementary network differ, the numbers of weights and biases also differ. The missing weights and biases are generated randomly from the standard normal distribution with mean 0 and variance 1, scaled by the number of input nodes for the bias and weights in the first layer, and by the number of hidden nodes in the second layer.
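As a sketch of this warm start (our reading of the scaling described above, not the authors' code), the first-layer weight matrix of the complementary network can be assembled as follows, with only the new feature's row drawn at random:

```python
import numpy as np

def complementary_first_layer(W1_base, rng=None):
    """Reuse the baseline first-layer weights (shape: n_inputs x n_hidden) and
    draw only the new feature's weights from a scaled standard normal."""
    rng = rng or np.random.default_rng()
    n_in, n_hidden = W1_base.shape
    W1 = np.empty((n_in + 1, n_hidden))
    W1[:n_in] = W1_base                          # weights of already selected features
    # New feature's weights: standard normal scaled by the number of input
    # nodes (our interpretation of the scaling described in the text).
    W1[n_in] = rng.standard_normal(n_hidden) / (n_in + 1)
    return W1
```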

Given an already selected feature set $S$, the complementary score of feature $f_i$ with respect to $S$ can be calculated as:

$$Com_{f_i} = \frac{P(S \cup f_i)}{P(S)} - 1 \quad (5)$$

where $P(S \cup f_i)$ is the generalized performance of the complementary network and $P(S)$ is the generalized performance of the baseline network. The values of $P$ are between 0 and 1. The complementary score reflects how much the new feature $f_i$ contributes to the baseline network. For example, $Com_{f_i} = 0.1$ means that adding feature $f_i$ improves the performance of the network by 10%.

Fig. 2. The architecture of the network with selected features and the complementary network.

3.3. Maximum relevance and maximum complementary score

The proposed algorithm ranks features based on the maximum relevance and maximum complementary score. After the relevancy and complementary scores are obtained, the relevance–complementary (RC) score can be calculated as:

$$RC_{f_i} = Rel_{f_i} + Com_{f_i} \quad (6)$$

The feature with the maximum RC score is then selected. From the algorithm, it can be seen that the complementary measurement reduces the chance of selecting overlapping or redundant features. For example, given three features $f_1, f_2, f_3$ where $f_1 = f_3$ (representing an overlapped feature) and the relevance scores satisfy $f_1 = f_3 > f_2$: if the Clamping technique is used, the feature ranking will be $f_1, f_3, f_2$. However, by combining complementarity with relevancy, the ranking will be $f_1, f_2, f_3$. As the complementary score of $f_3$ should be zero, the RC score of $f_2$ will be higher than that of $f_3$.
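For a concrete (purely hypothetical) numeric illustration: suppose $Rel_{f_1} = Rel_{f_3} = 0.5$ and $Rel_{f_2} = 0.3$. After $f_1$ is selected, the duplicate $f_3$ adds no new information, so $Com_{f_3} = 0$ and $RC_{f_3} = 0.5$; if $f_2$ improves the baseline network by, say, 25%, then $RC_{f_2} = 0.3 + 0.25 = 0.55 > RC_{f_3}$, and $f_2$ is ranked ahead of $f_3$.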

The steps of the MRMC algorithm are summarized in Fig. 3 and explained in detail below:

Step 1: Normalize feature values to the [0, 1] range. This step makes sure that features with larger values do not overwhelm features with smaller values. Set $S = \{\}$ and let $F$ contain all features.
Step 2: Calculate the relevance score of every feature $f_i$ in $F$ using (4). Note that the network is constructed using training data, and the generalized performance is then calculated using validation data.
Step 3: Select the first feature, which has the maximum relevance score: $f_s = \max_{f_i \in F} Rel(f_i)$.
Step 4: Update $S$ and $F$ using Eqs. (1) and (2).
Step 5: Check if the size of feature set $F$ is greater than 1. If yes, go to Step 6. Otherwise, update $S$ using $S = S \cup F$ and terminate the algorithm.
Step 6: Calculate the complementary score of every feature $f_i$ in $F$ using (5).
Step 7: Calculate the RC score using (6).
Step 8: Select the feature $f_s$ which has the maximum RC score: $f_s = \max_{f_i \in F} RC(f_i)$. Go to Step 4.
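For clarity, the whole loop can be sketched as follows. This is an illustrative outline rather than the authors' implementation: `train_mlp` and `accuracy` are hypothetical helpers standing in for the MLP training and evaluation of Section 3, `clamping_importance` is the relevance sketch given earlier, and Step 1's normalization and the warm start of the complementary network are omitted for brevity:

```python
def mrmc_rank(X_train, y_train, X_val, y_val):
    """Greedy MRMC ranking following Steps 2-8."""
    F = list(range(X_train.shape[1]))                 # candidate feature indices
    # Steps 2-3: relevance via clamping; pick the most relevant feature first.
    base = train_mlp(X_train, y_train)
    rel = clamping_importance(base, X_val, y_val)     # Rel scores, Eq. (4)
    S = [max(F, key=lambda f: rel[f])]
    F.remove(S[0])
    # Steps 5-8: repeatedly add the feature with the maximum RC = Rel + Com.
    while len(F) > 1:
        net_S = train_mlp(X_train[:, S], y_train)     # baseline on selected set
        p_S = accuracy(net_S, X_val[:, S], y_val)
        def rc(f):
            cols = S + [f]
            net = train_mlp(X_train[:, cols], y_train)
            com = accuracy(net, X_val[:, cols], y_val) / p_S - 1.0   # Eq. (5)
            return rel[f] + com                                      # Eq. (6)
        best = max(F, key=rc)
        S.append(best)
        F.remove(best)
    return S + F                                      # Step 5: append the last feature
```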

4. Experimental results

This section presents the evaluation of MRMC against the other feature selection methods presented in Section 2. The experiments are carried out using two well-defined problems studied in Cang and Yu (2012) and four benchmark classification data sets, namely Iris, breast cancer, cardiotocography, and chess, obtained from the UCI Machine Learning Repository (Bache & Lichman, 2013), available at http://archive.ics.uci.edu/ml. The proposed algorithm is also evaluated on a real world data set which we collected from multiple sensors for predicting human activities.

The input features are discretized using 10 bins for calculating the MI in MRMR and NMIFS. For Clamping and MRMC, the number of hidden nodes is set to 2× the number of input nodes and the number of epochs is 300, whether or not the network converges. All experiments except the first and second are carried out using 5-fold cross-validation, where 3 folds are used for training, 1 fold for validation and 1 fold for testing. In this study, balanced sampling is used, where equal numbers of positive and negative classes are randomly selected using a uniform distribution. The sizes of the training, validation and testing data of each fold for the different data sets are shown in Table 1.

For the real world problems, the feature selection methods are evaluated using an NN. The number of hidden nodes is set to 2× the number of inputs and the number of epochs is set to 300. For each input size, 10 models are constructed and the best one is selected using the validation data; the test data is then applied to obtain the classification results.


Fig. 3. Flow chart of the MRMC feature selection algorithm.


The validation data is also used to determine the size of the feature set: the number of features is selected at the point where there is no significant improvement when more features are added. The data normality is tested and the appropriate test, e.g. the paired T-test or the Wilcoxon signed-rank test, is applied. In addition, the four algorithms are compared using statistical tests at the 95% confidence level. First, the data are tested for normality using the Shapiro–Wilk test. If the data are normally distributed, ANOVA is used; otherwise the Friedman test is used. For the ANOVA test, if the Mauchly result is significant, then the Greenhouse–Geisser result is reported.

4.1. Experiment 1: nonlinear AND problem

In this experiment, a well-defined problem in which the correct features are known is studied. We use a nonlinear AND problem previously studied in Estevez et al. (2009) and Cang and Yu (2012). There are 14 features in this problem. The first five features $f_1$ to $f_5$ are generated randomly from an exponential distribution with mean 10; these represent irrelevant features. The next six features $f_6$ to $f_{11}$ are relevant features generated randomly from a uniform distribution over $[-1, 1]$. The last three features $f_{12}$ to $f_{14}$ are redundant features (fully overlapped features), where $f_{12} = f_9$, $f_{13} = f_{10}$, $f_{14} = f_{11}$. The class label is determined by:

$$f(x) = \begin{cases} C_1 & \text{if } f_6 \times f_7 \times f_8 > 0 \text{ AND } f_9 + f_{10} + f_{11} > 0 \\ C_2 & \text{if } f_6 \times f_7 \times f_8 < 0 \text{ AND } f_9 + f_{10} + f_{11} < 0 \end{cases} \quad (7)$$

According to this problem, the optimal feature set is $\{f_6, f_7, f_8, [f_9 \text{ or } f_{12}], [f_{10} \text{ or } f_{13}], [f_{11} \text{ or } f_{14}]\}$. A set of 500 data samples is generated randomly from a uniform distribution, and the class label for each sample is determined using Eq. (7). The feature selection algorithms described in Sections 2 and 3 are applied to the data set. For Clamping and MRMC, which require a validation data set, the 500 training samples are used. Table 2 presents the ranking results.
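The data generation is easy to reproduce; the sketch below follows the description above (how samples satisfying neither condition of Eq. (7) are handled is not stated in the text, so discarding them here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.empty((n, 14))
X[:, 0:5] = rng.exponential(scale=10, size=(n, 5))   # f1-f5: irrelevant, mean-10 exponential
X[:, 5:11] = rng.uniform(-1, 1, size=(n, 6))         # f6-f11: relevant, uniform on [-1, 1]
X[:, 11:14] = X[:, 8:11]                             # f12-f14: fully overlapped copies of f9-f11

prod = X[:, 5] * X[:, 6] * X[:, 7]                   # f6 * f7 * f8
tot = X[:, 8] + X[:, 9] + X[:, 10]                   # f9 + f10 + f11
y = np.where((prod > 0) & (tot > 0), 1,              # class C1 of Eq. (7)
             np.where((prod < 0) & (tot < 0), 2, 0)) # class C2; 0 = neither case
X, y = X[y > 0], y[y > 0]                            # drop unlabeled samples (assumption)
```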

From Table 2, it can be seen that only NMIFS and MRMC identify the correct set of features. The first feature ranked by MRMR and NMIFS is $f_{11}$, and by Clamping and MRMC it is $f_8$. This is expected, as MRMR and NMIFS select the first feature using maximum MI, while Clamping and MRMC share the same measurement for selecting the first feature. MRMR cannot detect the irrelevant features, ranking $f_1$ as the third most important feature. Clamping correctly selects the first five features, but fails to detect that $f_{12}$ is redundant with $f_9$ and $f_{14}$ is redundant with $f_{11}$. These results show that NMIFS emphasizes detecting redundancy, placing the redundant features $f_{12}, f_{13}, f_{14}$ at the end of the ranking, whereas MRMC emphasizes complementarity, placing all irrelevant features at the end.

4.2. Experiment 2: a nonlinear AND problem with partly overlapped features

This experiment aims to show the ability of MRMC, compared with the other three algorithms, to select the correct feature set from a data set which contains irrelevant, completely overlapped and partly overlapped features.

Table 1
Characteristics and data partition of the different data sets used in the study.

Data set                                        # Features  # Classes  Data type    # Sample  # Training  # Validation  # Testing
Nonlinear AND                                   14          2          Real         500       500         –             –
Nonlinear AND with partly overlapped features   17          2          Real         500       500         –             –
Iris                                            4           3          Real         150       90          30            30
Cancer-1992                                     9           2          Integer      699       288         96            96
Cancer-1995                                     30          2          Real         569       252         84            84
Cardiotocography-fetal                          21          3          Real         2126      315         105           105
Cardiotocography-morp                           21          10         Real         2126      300         100           100
Chess                                           36          2          Categorical  3196      1830        610           610
Multi-sensor activity recognition               141         12         Real         39328     15120       5040          5040

We use the same data set as generated in experiment 1, but introduce another three features $f_{15}$ to $f_{17}$ which represent partly overlapped features. Feature $f_{15}$ is set to $f_{15} = f_6 \times f_7$, which overlaps features $f_6$ and $f_7$. Feature $f_{16}$ is set to $f_{16} = f_9 + f_{10}$, which overlaps features $f_9$ and $f_{10}$. Feature $f_{17}$ is set to $f_{17} = f_8 \times f_{11}$, which overlaps features $f_8$ and $f_{11}$ but has no relationship to the classes. From this construction, the relevant features are $f_6$ to $f_{16}$. $f_{15}$ is the overlap of features $f_6$ and $f_7$; however, it is better to select $f_{15}$ and treat $f_6$ and $f_7$ as redundant, since $f_{15}$ contains the information from $f_6$ and $f_7$, and selecting it keeps the feature space smaller. The same reasoning applies to selecting $f_{16}$ over $f_9$ and $f_{10}$. The optimal subset of this data set is $\{f_8, [f_{11} \text{ or } f_{14}], f_{15}, f_{16}\}$.

The results in Table 3 show that only MRMC produces the correct feature set. Only two features ($f_{16}, f_{11}$) are selected correctly by MRMR; the next five features selected by MRMR are irrelevant. Clamping selects the first three features ($f_{15}, f_8, f_{14}$) correctly, but the fourth feature ($f_{11}$) is redundant with the third ($f_{14}$), because Clamping cannot detect overlapped or redundant features. NMIFS identifies the first two features correctly; however, it selects $f_6$ and $f_7$ instead of $f_{15}$, which makes the feature set larger, and it fails to detect that $f_9$ is redundant with $f_{16}$.

4.3. Experiment 3: Iris data set

This data set, widely used in the classification literature (Dy, Brodley, & Wrobel, 2004; Zhong & Fukushima, 2007), contains three classes of Iris plant: Setosa, Versicolor, and Virginica, with 50 samples per class. One class is linearly separable from the other two; the other two classes are not linearly separable from each other. The data set has four features: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). The four feature selection algorithms are applied to the data set and the mean classification accuracy on the test set is presented in Fig. 4.

From Fig. 4, all algorithms select the first feature correctly. MRMC does not correctly select the second feature in all folds, and MRMR does not correctly select the third feature, which slightly affects classification accuracy. The number of features, test classification accuracy and standard deviation are shown in Table 4. The results show no statistically significant difference in classification accuracy among the feature selection algorithms (p = 1.00).

4.4. Experiment 4: Wisconsin diagnostic breast cancer data set

This data set has been used extensively in previous works (Antos, Kégl, Linder, & Lugosi, 2003; Kabir et al., 2010). The breast cancer data set was obtained from the University of Wisconsin Hospitals, Madison (Bennett & Mangasarian, 1992). It was collected in 1992, and we shall refer to it as Cancer-1992. It contains 9 integer-valued features such as clump thickness, uniformity of cell size, uniformity of cell shape, bland chromatin, etc.



Table 2
Feature rankings of four feature selection methods (nonlinear AND).

Algorithm  Feature rankings
MRMR       f11 f9 f1 f2 f4 f3 f10 f5 f6 f8 f7 f14 f12 f13
NMIFS      f11 f9 f10 f6 f8 f7 f3 f4 f2 f1 f5 f14 f12 f13
Clamping   f8 f7 f6 f9 f11 f12 f14 f10 f13 f4 f5 f1 f2 f3
MRMC       f8 f7 f6 f9 f11 f10 f14 f12 f13 f4 f5 f1 f3 f2

Table 3
Feature rankings of four feature selection methods (modified nonlinear AND).

Algorithm  Feature rankings
MRMR       f16 f11 f1 f2 f4 f5 f3 f15 f9 f8 f10 f6 f7 f17 f14 f12 f13
NMIFS      f16 f11 f6 f9 f8 f7 f10 f3 f4 f2 f1 f14 f5 f15 f12 f17 f13
Clamping   f15 f8 f14 f11 f9 f12 f4 f10 f13 f6 f16 f1 f17 f3 f5 f7 f2
MRMC       f8 f15 f16 f11 f14 f6 f10 f4 f12 f9 f13 f1 f7 f3 f17 f5 f2

Fig. 4. Mean classification accuracy of test data by four FS algorithms (Iris data set).

Table 4
Feature sets selected by four feature selection algorithms and mean test accuracy (Iris data set).

Algorithm  No. of features  Mean test accuracy (%)  Standard deviation
MRMR       1                95.333333               1.8257419
NMIFS      1                95.333333               1.8257419
Clamping   1                95.333333               2.9814240
MRMC       1                95.333333               2.9814240

Fig. 5. Mean classification accuracy of test data by four FS algorithms (breast cancer 1992 data set).


The values of each feature range between 1 and 10. There are 699 samples, with 65.5% benign and 34.5% malignant cases, and 16 samples have missing attribute values; in this study, 0 is used to replace any missing value. The mean classification accuracy on test data is shown in Fig. 5.

From Fig. 5, the accuracy of Clamping and MRMC starts lower than that of MRMR and NMIFS. MRMR, NMIFS and MRMC reach similar accuracy when 3 features are used, while Clamping reaches its highest accuracy when 5 features are used. The accuracy of MRMR and MRMC remains steady after 3 features. The number of features used by each algorithm is shown in Table 5.

Based on the mean test accuracy, the algorithms' performances can be expressed as Clamping < NMIFS < MRMC < MRMR. The statistical test indicates that there is no significant difference between the four algorithms (Chi-Square(3) = 1.826, p = 0.609). Looking at the number of features used by each algorithm, MRMC uses the smallest number of features. Hence, MRMC is the optimal algorithm for this data set.

We also evaluate the proposed algorithm on another breast cancer data set, collected in 1995. It is composed of 30 real-valued input features computed from a digitized image of a cell nucleus, such as radius, texture, smoothness, mean, standard error, etc., used to determine whether the cell is malignant or benign. The data set contains 357 benign and 212 malignant samples. The mean classification accuracy on the test data set for all four algorithms is shown in Fig. 6.

From Fig. 6, the first feature selected by Clamping and MRMC gives lower accuracy than the feature selected by MRMR and NMIFS. However, with two selected features, the accuracy of MRMC improves significantly. MRMR and NMIFS provide similar performances on this data set. The number of features and the performances of the four algorithms are shown in Table 6.

Based on the test accuracy shown in Table 6, the algorithms' performances can be expressed as Clamping < NMIFS < MRMC < MRMR. The normality test shows that the data are normally distributed. The results indicate no statistically significant difference between the algorithms (F(1.322, 5.288) = 2.273, p = 0.192). From Table 6, NMIFS uses the lowest number of features; it can therefore be concluded that NMIFS is the optimal method on this data set.

4.5. Experiment 5: cardiotocography data set

This data set, used in Ayres-de Campos, Bernardes, Garrido, Marques-de Sá, and Pereira-Leite (2000), contains measurements of fetal heart rate (FHR) and uterine contraction features, e.g. minimum FHR histogram, percentage of time with abnormal long-term variability, etc., on cardiotocograms classified by expert obstetricians.


Table 5
Feature sets selected by four FS algorithms and mean test accuracy (breast cancer 1992 data set).

Algorithm  # Selected features  Mean test accuracy (%)  Standard deviation
MRMR       3                    96.666667               2.8905077
NMIFS      4                    95.625000               2.0036858
Clamping   8                    95.625000               1.1410887
MRMC       2                    95.833333               1.6470196

Fig. 6. Mean classification accuracy of test data by four FS algorithms (breast cancer 1995 data set).

Fig. 7. Mean classification accuracy of test data by four FS algorithms (cardiotocography-fetal data set).

Fig. 8. Mean classification accuracy of test data by four FS algorithms (cardiotocography-morp data set).


The data set contains 21 input features, which are classified into either 10 types of morphologic patterns or 3 fetal states. The data set has an unbalanced class distribution.

The average classification accuracies for the 3-class fetal states and the 10-class morphologic patterns are shown in Figs. 7 and 8, respectively. From Fig. 7, the classification accuracy of MRMC starts at the lowest point but continues improving as the number of features increases. The classification accuracies of MRMR and NMIFS are high when only one feature is used, while that of NMIFS falls to its lowest point when 5 features are used. The performance of MRMC is better than the other three algorithms when 6 to 10 features are used. From Fig. 8, all feature selection algorithms produce a similar accuracy trend: classification accuracy improves as more features are used. The performance of MRMC is superior to the other three algorithms when 12 to 17 features are used.

Table 7 shows the number of features selected by each algorithm, the mean classification accuracy on test data and the standard deviation on the cardiotocography data set for classifying the 3 fetal states. Based on the test classification accuracy, the performance of each algorithm can be expressed as MRMC < Clamping = NMIFS < MRMR.

Table 6
Feature sets selected by four FS algorithms and mean test accuracy (breast cancer 1995 data set).

Algorithm  # Selected features  Mean test accuracy (%)  Standard deviation
MRMR       4                    95.000000               2.7147034
NMIFS      1                    85.000000               13.0573044
Clamping   2                    84.047619               8.7319623
MRMC       2                    91.904762               2.7147034

The results indicate no statistically significant difference between the accuracies obtained by the four algorithms (F(1.045, 4.18) = 6.711, p = 0.058). From Table 7, MRMC selects the lowest number of features; it can therefore be concluded that MRMC is the optimal method on this data set.

Table 8 shows the results of the four methods on classifying the 10 morphologic patterns of the cardiotocography data set. Based on the mean classification accuracy on test data, the performance of the algorithms is expressed as MRMC < NMIFS < Clamping < MRMR. The Shapiro–Wilk test is applied to test data normality. The results show no statistically significant difference in accuracy among the four algorithms (F(3, 12) = 0.278, p = 0.840).

Table 7
Feature sets selected by four FS algorithms and mean test accuracy (cardiotocography-fetal data set).

Algorithm  # Selected features  Mean test accuracy (%)  Standard deviation
MRMR       18                   90.666667               1.7036708
NMIFS      15                   90.476190               2.0203051
Clamping   21                   90.476190               1.5058465
MRMC       4                    87.619048               3.5634832


Fig. 9. Mean classification accuracy of test data by four FS algorithms (chess data set).

Table 9
Feature sets selected by four FS algorithms and mean test accuracy (chess data set).

Algorithm  # Selected features  Mean test accuracy (%)  Standard deviation
MRMR       27                   96.590164               1.2234825
NMIFS      20                   96.459016               1.2835134
Clamping   14                   97.377049               0.9628967
MRMC       10                   97.540984               0.8439041


Among the four algorithms, MRMC uses the lowest number of features. Therefore, it can be concluded that MRMC is the optimal feature selection method for this data set.

4.6. Experiment 6: chess data set

The chess data set, used in Kijsirikul, Sinthupinyo, and Chongkasemwongse (2001) and Cohen, Cozman, Sebe, Cirelo, and Huang (2004), contains sequences of board descriptions for chess endgames. The data set consists of 36 categorical input features used to classify whether White can win or cannot win. The class distribution is 52% win and 48% cannot win. An equal class distribution is used, and the number of training, validation, and testing samples is shown in Table 1. The data set uses strings to represent the board description, e.g. f, l, n, etc., which we convert into integer values, e.g. f = 1, l = 2, n = 3, etc. The mean classification accuracy on the test data set is shown in Fig. 9. The classification result of each algorithm, using the number of features determined on validation data, is presented in Table 9.

From Fig. 9, the performance of all algorithms increases as more features are used. When we inspect the features selected by each algorithm, we find that the first three features selected are the same. Generally, the performance of Clamping and MRMC is better than that of MRMR and NMIFS on this data set. MRMC performs better than Clamping when 8 to 21 features are used. All algorithms reach similar accuracy when 29 or more features are used.

Based on the mean classification accuracy of the test data, the algorithms' performance can be expressed as NMIFS < MRMR < Clamping < MRMC. The result reveals no statistically significant difference among the algorithms (p = 0.054). Based on the number of features used, MRMC uses the lowest number while MRMR uses the highest. Therefore, we can conclude that MRMC is the optimal algorithm for this data set.

4.7. Experiment 7: multi-sensor activity recognition data set

We collect raw sensor data from an accelerometer, gyroscope, heart rate monitor, light sensor, temperature sensor, altimeter, and barometer worn by 12 elderly people performing 12 activities of daily living, including walking, feeding, exercising, reading, watching TV, washing dishes, sleeping, ironing, feeding, scrubbing, wiping, and brushing teeth. The participants wear the sensors on their wrists and the heart rate monitor on their chests. The data set consists of 141 real-valued input features. The classification accuracies on the test data set for all algorithms are shown in Fig. 10. The performance of the algorithms is reported in Table 10.

From Fig. 10, MRMC has better accuracy than the other algorithms when four or more features are used; the difference in accuracy is particularly noticeable when 10 and 34 features are used. The performance of Clamping starts lower than the other algorithms, but its accuracy is similar to that of MRMR and NMIFS when more than 14 features are used. MRMR and NMIFS start at the same accuracy; however, the performance of NMIFS drops when 3 and 25 features are used.

Table 8
Feature sets selected by four FS algorithms and mean test accuracy (cardiotocography-morp data set).

Algorithm  # Selected features  Mean test accuracy (%)  Standard deviation
MRMR       21                   91.428571               1.5058465
NMIFS      16                   90.666667               1.2417528
Clamping   21                   87.428571               4.1184282
MRMC       15                   83.619048               6.8146834

Based on the mean classification accuracy on test data, the algorithms' performance can be expressed as MRMR < Clamping < NMIFS < MRMC. The results indicate no statistically significant difference among the four algorithms (F(1.474, 5.895) = 1.417, p = 0.301). Based on the number of features used, MRMC uses only 50 features while the others use over 60. Therefore, we can conclude that MRMC is the optimal algorithm for this data set.

5. Discussion and conclusion

A summary of the experiments is presented in Table 11. The optimal feature selection algorithm for each data set is determined based on the statistical results and the number of features.

Fig. 10. Mean classification accuracy of test data by four FS algorithms (multi-sensor activity recognition data set).


Table 10
Feature sets selected by four FS algorithms and mean test accuracy (multi-sensor activity recognition data set).

Algorithm  # Selected features  Mean test accuracy (%)  Standard deviation
MRMR       64                   93.313500               0.4858261
NMIFS      66                   93.662700               0.5337766
Clamping   62                   93.611120               0.8777444
MRMC       50                   94.027800               0.6026319

Table 11
Optimum FS algorithm on each data set.

Data set                                        Optimum feature selection algorithm
Nonlinear AND                                   NMIFS, MRMC
Nonlinear AND with partly overlapped features   MRMC
Iris                                            MRMR, NMIFS, Clamping, MRMC
Cancer-1992                                     MRMC
Cancer-1995                                     NMIFS
Cardiotocography-fetal                          MRMC
Cardiotocography-morp                           MRMC
Chess                                           MRMC
Multi-sensor activity recognition               MRMC


Based on the eight experiments, MRMC is in general the optimal feature selection algorithm: it is able to obtain high classification accuracy using the minimum number of features. NMIFS is the second best algorithm.

The results from experiments 1 and 2 show that MRMC is capable of detecting both completely and partially overlapped features. The other experiments show that MRMC can be used on various data types, i.e. categorical, real, and integer values. The performance of MRMC is not as good as that of NMIFS on the breast cancer 1995 data set. This is due to the fact that the first feature selected by MRMC normally has lower classification accuracy: the difference in accuracy between NMIFS and MRMC is about 10% when one feature is selected, whereas on the other data sets the differences are about 5% or less. This implies that if another algorithm obtains significantly higher accuracy than MRMC when using one feature, that algorithm may be more suitable for that data set, provided the number of input features is small.

The results on the cardiotocography data set indicate that MRMC is the optimal algorithm among the four: it achieves good accuracy while using the smallest number of features. Experiment 6 demonstrates that MRMC also works well with categorical data. In experiment 7, we evaluate the proposed algorithm on a data set with a large number of inputs. The result shows that MRMC is clearly superior, using fewer features than the other algorithms while achieving the highest accuracy. Comparing MRMC with Clamping, it can be seen that introducing the complementary measurement improves the performance of the algorithm; for example, on the breast cancer 1995 data set, MRMC obtains higher accuracy using the same number of features. Overall, the experimental results show that the main advantage of MRMC over other state-of-the-art techniques is the smaller number of selected features.

From this study, it can be seen that using the Clamping algorithm to detect the most important feature may not give the correct result. This affects the performance of MRMC, as it uses the same criterion to select the first feature. Since forward search is used, the performance of the feature selection algorithm depends on the first selected feature. Therefore, for feature selection on a small set of features, using MRMC may not guarantee good results. However, when the number of features increases, MRMC is superior to the other three algorithms. This is because, although the first feature selected by the Clamping algorithm may not always be the most important one, it is still an important feature (e.g. the second or third most important), and by using the complementary measurement the correct subset of features can later be identified.

In this paper, we propose a new feature selection algorithm based on maximum relevance and maximum complementarity using neural networks. We evaluate the proposed method on well-defined problems and real world data sets containing small to large sets of features (N = 4 to 100+). The study is carried out using 5-fold cross-validation, and the algorithms' performance is evaluated empirically using statistical tests at the 95% confidence level. We show that in general MRMC provides good results compared to the other three algorithms, in that it can select a smaller set of features. We also demonstrate that the complementary measure introduced here improves the performance of the Clamping algorithm. The main advantages of the proposed technique are its capability of selecting a smaller set of features and the fact that it works especially well with a large number of features. Nowadays, a large volume of data is available due to the development of sensors and technologies. It is essential to be able to select only the important features in order to reduce costs such as computation, storage, and battery, and to increase the performance of any expert and intelligent system. The proposed algorithm can be applied to any type of data, and has the most potential when used with data that has a large number of features.

Since we have found that the performance of the algorithm is affected by the selection of the first feature, future research will focus on the identification of the first feature in order to improve MRMC performance. According to the experiments, the MI-based techniques are good at selecting the first feature, so future work will integrate an MI-based technique with MRMC. Another interesting research direction is to study feature selection by considering a group of features rather than a single feature, i.e. the relevancy of a group of features. This idea arises from the case where more than one feature has equal importance; in order to correctly identify the feature, the next important feature needs to be considered. Finally, since we can see the potential of the complementary concept, it is interesting to investigate how to apply this concept in other feature selection approaches.

References

Antos, A., Kégl, B., Linder, T., & Lugosi, G. (2003). Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research, 3, 73–98.
Ayres-de Campos, D., Bernardes, J., Garrido, A., Marques-de Sá, J., & Pereira-Leite, L. (2000). SisPorto 2.0: A program for automated analysis of cardiotocograms. Journal of Maternal-Fetal and Neonatal Medicine, 9(5), 311–318.
Bache, K., & Lichman, M. (2013). UCI machine learning repository. <http://archive.ics.uci.edu/ml>.
Banos, O., Damas, M., Pomares, H., Prieto, A., & Rojas, I. (2012). Daily living activity recognition based on statistical feature quality group selection. Expert Systems with Applications, 39(9), 8013–8021.
Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York, NY, USA: Oxford University Press, Inc.
Cang, S., & Yu, H. (2012). Mutual information based input feature selection for classification problems. Decision Support Systems, 54(1), 691–698.
Cohen, I., Cozman, F., Sebe, N., Cirelo, M., & Huang, T. (2004). Semisupervised learning of classifiers: Theory, algorithms, and their application to human–computer interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(12), 1553–1566.
Dalton, A., & Olaighin, G. (2013). Comparing supervised learning techniques on the task of physical activity recognition. IEEE Journal of Biomedical and Health Informatics, 17(1), 46–52.
Dy, J. G., Brodley, C. E., & Wrobel, S. (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research, 5, 845–889.
Ermes, M., Parkka, J., Mantyjarvi, J., & Korhonen, I. (2008). Detection of daily activities and sports with wearable sensors in controlled and uncontrolled conditions. IEEE Transactions on Information Technology in Biomedicine, 12(1), 20–26.
Estevez, P., Tesmer, M., Perez, C., & Zurada, J. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2), 189–201.
Hsu, H.-H., Hsieh, C.-W., & Lu, M.-D. (2011). Hybrid feature selection by combining filters and wrappers. Expert Systems with Applications, 38(7), 8144–8150.
Kabir, M. M., Islam, M. M., & Murase, K. (2010). A new wrapper feature selection approach using neural network. Neurocomputing, 73(16–18), 3273–3283.
Khan, A., Lee, Y.-K., Lee, S., & Kim, T.-S. (2010). A triaxial accelerometer-based physical-activity recognition via augmented-signal features and a hierarchical recognizer. IEEE Transactions on Information Technology in Biomedicine, 14(5), 1166–1172.
Kijsirikul, B., Sinthupinyo, S., & Chongkasemwongse, K. (2001). Approximate match of rules using backpropagation neural networks. Machine Learning, 44(3), 273–299.
Parkka, J., Ermes, M., Korpipaa, P., Mantyjarvi, J., Peltola, J., & Korhonen, I. (2006). Activity classification using realistic data from wearable sensors. IEEE Transactions on Information Technology in Biomedicine, 10(1), 119–128.
Peng, J.-X., Ferguson, S., Rafferty, K., & Kelly, P. D. (2011). An efficient feature selection method for mobile devices with application to activity recognition. Neurocomputing, 74(17), 3543–3552.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226–1238.
Saeys, Y., Inza, I., & Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517.
Sakar, C. O., Kursun, O., & Gurgen, F. (2012). A feature selection method based on kernel canonical correlation analysis and the minimum redundancy–maximum relevance filter method. Expert Systems with Applications, 39(3), 3432–3437.
Setiono, R., & Liu, H. (1997). Neural-network feature selector. IEEE Transactions on Neural Networks, 8(3), 654–662.
Shannon, C. E. (2001). A mathematical theory of communication. SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55.
Varkey, J., Pompili, D., & Walls, T. (2012). Human motion recognition using a wireless sensor-based wearable system. Personal and Ubiquitous Computing, 16(7), 897–910.
Verikas, A., & Bacauskiene, M. (2002). Feature selection with neural networks. Pattern Recognition Letters, 23(11), 1323–1335.
Wang, W., Jones, P., & Partridge, D. (2000). Assessing the impact of input features in a feedforward neural network. Neural Computing & Applications, 9(2), 101–112.
Ward, J., Lukowicz, P., Troster, G., & Starner, T. (2006). Activity recognition of assembly tasks using body-worn microphones and accelerometers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1553–1567.
Zhang, M., & Sawchuk, A. (2013). Human daily activity recognition with sparse representation using wearable sensors. IEEE Journal of Biomedical and Health Informatics, 17(3), 553–560.
Zhang, Y., & Zhang, Z. (2012). Feature subset selection with cumulate conditional mutual information minimization. Expert Systems with Applications, 39(5), 6078–6088.
Zhong, P., & Fukushima, M. (2007). Regularized nonsmooth Newton method for multi-class support vector machines. Optimization Methods and Software, 22(1), 225–236.