

SSC: A Classifier Combination Method Based on Signal Strength

Haibo He, Senior Member, IEEE, and Yuan Cao, Student Member, IEEE

Abstract— We propose a new classifier combination method, the signal strength-based combining (SSC) approach, to combine the outputs of multiple classifiers to support the decision-making process in classification tasks. As ensemble learning methods have attracted growing attention from both academia and industry recently, it is critical to understand the fundamental issues of the combining rule. Motivated by the signal strength concept, our proposed SSC algorithm can effectively integrate the individual vote from different classifiers in an ensemble learning system. Comparative studies of our method with nine major existing combining rules, namely, the geometric average rule, arithmetic average rule, median value rule, majority voting rule, Borda count, max and min rules, weighted average, and weighted majority voting rules, are presented. Furthermore, we also discuss the relationship of the proposed method with respect to margin-based classifiers, including the boosting method (AdaBoost.M1 and AdaBoost.M2) and support vector machines, by margin analysis. Detailed analyses of margin distribution graphs are presented to discuss the characteristics of the proposed method. Simulation results for various real-world datasets illustrate the effectiveness of the proposed method.

Index Terms— Classification, classifier combination, combining rule, ensemble learning, signal strength.

I. INTRODUCTION

ENSEMBLE learning methods have become an active research topic within the computational intelligence community. Over the past decade, many theoretical analyses, practical algorithms, and empirical studies have been proposed in this field. Ensemble learning methods also have been widely applied in many real-world applications, including Web mining [1], financial engineering [2], geosciences and remote sensing [3], [4], biomedical data analysis [5], [6], decision-making and supporting systems [7]–[9], surveillance [10], homeland security and defense [11], and others.

Manuscript received November 23, 2010; revised April 23, 2012; accepted April 28, 2012. Date of publication May 23, 2012; date of current version June 8, 2012. This work was supported in part by the Defense Advanced Research Projects Agency Mathematics of Sensing, Exploitation, and Execution Program under Grant FA8650-11-1-7148 and the National Science Foundation under Grant ECCS 1053717.

H. He is with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]).

Y. Cao is with MathWorks, Inc., Natick, MA 01760 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2012.2198227

Generally speaking, ensemble learning approaches have the advantage of improved accuracy and robustness compared to the single-hypothesis-based learning methods [12]. In the ensemble learning scenario, multiple models/hypotheses are developed, and their decisions are combined by a combination method to predict the testing data. Because different hypotheses can provide different views of the target function, the combined decision will hopefully provide more robust and accurate prediction compared to the single-hypothesis-based learning methods. Ensemble learning can be applied to any task for which criteria (such as classification/regression accuracy, clusterability, among others) can be computed. In this paper, we focus on classifier combination.

Classifier combination has been an important research topic in ensemble learning [12]. In fact, no matter what kinds of mechanisms are used to obtain the multiple classifiers, a classifier combination method is needed to combine all the individual votes for the final decision. Furthermore, one of the important factors for a successful ensemble learning algorithm is classifier diversity [13]–[17], which in turn makes the final combining rule play a critical role in such learning scenarios. In order to obtain such diversified classifiers, various algorithms based on the instance space or feature space have been proposed.

For instance, bootstrap aggregating (bagging) is an ensemble learning method based on the idea of developing multiple hypotheses by bootstrap sampling (with replacement) of the available training instances [18]. In the bagging method, the probability sampling function is uniformly distributed across all the training instances. In order to dynamically adjust the weights for different data instances according to their distributions, various boosting algorithms have been developed. For example, adaptive boosting (AdaBoost) algorithms have been proposed to improve the learning accuracy [19], [20] for weak learning models. The key idea of the AdaBoost algorithm is to iteratively update the sampling distribution of the training instances: data instances that often tend to be misclassified ("difficult" instances) receive higher weights compared to those instances that often tend to be correctly classified ("easy" instances) [19], [20]. Theoretical error bounds of the final hypothesis have been proved [19], [20]. This means that a strong learner can be obtained from an ensemble of multiple weak learners that can merely do better than random guessing [19]–[21]. By integrating an instance selection technique and the boosting method, a novel ensemble learning method is presented in [22]. This approach uses a boosting method to obtain the distribution of weights associated with each training example, and then uses the instance selection technique to search for the subset of the training set that minimizes the training error to generate the next classifier. This algorithm can be applied to C4.5 decision trees and support vector machines (SVMs) as well as k-nearest neighbor classifiers. Empirical study illustrates the effectiveness and efficiency of the proposed method. In [23], a randomized algorithm is presented to obtain combinations of weak classifiers that can only do a little better than random guessing. Simulation results and theoretical analyses are provided to illustrate that this algorithm can achieve not only good generalization performance but efficiency in time and space as well. A novel boosting algorithm, namely, the dynamically adapted weighted emphasis real Adaboost algorithm, is proposed in [24]. In this method, a mixed emphasis function that focuses on both the sample errors and their proximity to the classification border is used.

The subspace method, which is another major category of ensemble learning, creates multiple classifiers from different feature spaces. Some representative work in this area includes the random subspace method, the random forest method, the ranked subspace method, and the rotation forest, among others. For instance, the random subspace method creates multiple classifiers in randomly selected feature spaces [25]. In this way, different classifiers will build their decision boundaries in different views of the feature space. In the random forest method, multiple decision trees are systematically generated by randomly selecting subsets of feature spaces [26], [27] or subsets of training instances [28]. Recently, the ranked subspace method [29] was proposed to develop multiple classifiers by using different sampling probability functions in the feature space via a feature ranking function. The rotation forest method [30] uses K-axis rotations to form new features to train multiple classifiers.

Other representative works in ensemble learning include stacked generalization and the mixture of experts. For instance, stacked generalization aims to learn a meta-classifier based on the output of the base-level classifiers, normally trained through cross-validation methods [31]. A gating network (usually trained by the expectation-maximization algorithm) can be incorporated into the base-level classifiers based on a weighted combining rule; in this way, a mixture of experts algorithm can be developed [32], [33].

Some recent developments in ensemble learning have also been proposed in the literature. For instance, ELITE, an ensemble learning algorithm based on the global optimization method TRUST-TECH, is proposed to construct high-quality ensembles of optimal input-pruned neural networks in [34]. In this method, TRUST-TECH is used to address two important issues in neural networks, namely, network architecture selection and optimal weight training. An evolutionary algorithm, the Bayesian artificial immune system (BAIS), is proposed to learn ensembles of neural networks for classification problems in [35]. In this algorithm, BAIS is used to generate a pool of high-quality networks with a degree of diversity and then further combine these classifiers. In [36], a layered clustering-based approach is employed to generate ensembles of classifiers. Training data are clustered first at multiple layers and are used to train a set of base classifiers on the patterns within each cluster. Finally, the classification decision of a testing sample is obtained by fusing the decisions from the corresponding base classifiers at each layer using the majority voting rule. The second-order Walsh coefficients are used in [37] to determine base classifier complexity for optimal ensemble performance in a multiple classifier system. In [38], ensemble multilayer perceptron (MLP) weights combined with a recursive feature elimination method are used to eliminate irrelevant features when constructing an ensemble of MLP classifiers. In addition, the ensemble out-of-bootstrap estimate is designed to determine when to stop eliminating features. In [39], a generalized multiple kernel learning model is designed by introducing a linear combination of the L1-norm and the squared L2-norm regularization on the kernel weights to search for the optimal kernel combination in multiple kernel learning.

In this paper, we focus our attention on how to effectively combine the output of each individual classifier to support the final classification decision. Our goal is not to compete with the best classification results reported in the literature; instead, the major objective is to propose a new classifier combination method based on signal strength (SSC) for ensemble learning, and to illustrate its effectiveness on various datasets. Specifically, we define the signal strength s_j and uncertainty degree n_j as criteria related to the posterior class probability. In this way, the signal strength can be calculated as the absolute difference between the posterior probability and a given threshold (e.g., 0.5 in this paper), and the uncertainty degree can be calculated as the difference between the given threshold and the signal strength (see Section III-A and Fig. 1 for details). In this way, we present a new way to analyze the combining rule of multiple classifiers in an ensemble learning system. To the best of our knowledge, this is the first time this idea is being presented in the community. We believe that this idea provides new insights into this fundamental issue and may motivate future theoretical and practical research developments in the community.

The rest of this paper is organized as follows. Section II briefly reviews the existing combining rules, which provide the foundation for the proposed research. Section III presents the proposed classifier combination method and the SSC algorithm. In Section IV, simulation results are presented to illustrate and compare the performance of the proposed method with the major existing combining rules over various datasets. Discussions of the margin analysis of the proposed method with respect to boosting methods and SVMs are presented in Section V. Finally, conclusions are drawn in Section VI.

II. RELATED WORKS

Since an appropriate choice of the combining rule is a fundamental issue for ensemble learning, there have been many efforts in the community to investigate this problem. In this paper, we only consider the combination methods that are based on estimates of posterior probabilities. Consider a training dataset D_{tr} with m instances, which can be represented as {x_q, y_q}, q = 1, ..., m, where x_q is an instance in the n-dimensional feature space X, and y_q ∈ Y = {1, ..., C} is the class identity label associated with x_q. Through a training procedure, such as bootstrap sampling or subspace methods, one can develop L classifiers, h_j, j = 1, ..., L. Therefore, for each testing instance x_t in the testing dataset D_{te}, each classifier can vote an estimate of the posterior probability across all the possible class labels, P_j(Y_i|x_t), j = 1, ..., L and Y_i = 1, ..., C. Based on Bayesian theory, given the measurements P_j(Y_i|x_t), where j = 1, ..., L and Y_i = 1, ..., C, the testing instance x_t is assigned to Y_i provided that the posterior probability is maximum [12]. As discussed in [12], the Bayesian decision rule illustrates that it is critical to compute the probabilities of various classifiers with the consideration of all measurements simultaneously in order to fully utilize all available information to reach a prediction. However, this might not be practical in real pattern recognition tasks due to the computational cost, which leads to the theoretical framework on various classifier combinations, as discussed in detail in [12]. The objective here is to find a combining rule for an improved estimation of the final posterior probability, P(Y_i|x_t), based on the individual P_j(Y_i|x_t) from each classifier h_j [40].

The most commonly adopted combining rules include the geometric average rule (GA rule), arithmetic average rule (AA rule), median value rule (MV rule), majority voting rule (MajV rule), Borda count rule (BC rule), max and min rules, weighted average rule (weighted AA rule), and weighted majority voting rule (weighted MajV rule) [40]–[42], which are summarized below. For the theoretical framework of different combining rules for classifiers, interested readers can refer to [12] for further details.

A. GA Rule

The GA rule finds P(Y_i|x_t) to minimize the average Kullback–Leibler (KL) divergence among probabilities

D_{av} = \frac{1}{L} \sum_{j=1}^{L} D_j    (1)

where

D_j = \sum_{i=1}^{C} P(Y_i|x_t) \ln \frac{P(Y_i|x_t)}{P_j(Y_i|x_t)}.    (2)

Taking Lagrange multipliers and considering \sum_{i=1}^{C} P(Y_i|x_t) = 1, the optimization of (1) with respect to P(Y_i|x_t) gives us

P(Y_i|x_t) = \frac{1}{A} \prod_{j=1}^{L} \left( P_j(Y_i|x_t) \right)^{1/L}    (3)

where A is a class-independent number.

Based on (3), the GA rule predicts the testing instance x_t to the class identity label that maximizes the product of P_j(Y_i|x_t).

GA Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \prod_{j=1}^{L} P_j(Y_i|x_t).    (4)

B. AA Rule

Instead of using (2), one can also define the probability distance by an alternative KL divergence as follows:

D_j = \sum_{i=1}^{C} P_j(Y_i|x_t) \ln \frac{P_j(Y_i|x_t)}{P(Y_i|x_t)}.    (5)

Substituting (5) into (1), one can get

P(Y_i|x_t) = \frac{1}{L} \sum_{j=1}^{L} P_j(Y_i|x_t).    (6)

Therefore, the AA rule can be defined as finding the maximal value of the arithmetic average of P_j(Y_i|x_t).

AA Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \frac{1}{L} \sum_{j=1}^{L} P_j(Y_i|x_t).    (7)
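To make the mechanics of these two soft combining rules concrete, the following minimal sketch (in Python with NumPy; it is our illustration, not code from the paper, and the names posteriors, ga_rule, and aa_rule are assumptions) applies the GA rule of (4) and the AA rule of (7) to an L x C matrix of estimated posteriors.

import numpy as np

def ga_rule(posteriors):
    # posteriors: L x C array, row j holds P_j(Y_i | x_t) from classifier h_j.
    # GA rule (4): pick the class maximizing the product of the posteriors.
    return int(np.argmax(np.prod(posteriors, axis=0)))

def aa_rule(posteriors):
    # AA rule (7): pick the class maximizing the arithmetic average of the posteriors.
    return int(np.argmax(np.mean(posteriors, axis=0)))

# Toy decision profile: three classifiers, two classes.
P = np.array([[0.70, 0.30],
              [0.60, 0.40],
              [0.45, 0.55]])
print(ga_rule(P), aa_rule(P))   # both rules select class 0 for this profile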

C. MV Rule

In the situation of probability outliers of P_j(Y_i|x_t), the AA rule may lead to poor combination performance since the outliers will dominate the voting procedure. In such a case, the MV rule will predict the final class label with the maximum median value.

MV Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \{ \mathrm{median}(P_j(Y_i|x_t)) \}.    (8)

D. MajV Rule

In addition to soft type rules such as GA and AA, the MajV rule is a hard type ensemble strategy. Briefly, each individual classifier directly predicts the class label of the testing sample, and then the MajV rule simply outputs the final predicted label as the one that receives most of the votes from the individual classifiers across all classes. When multiple class labels receive the same number of maximum counts, a random class label among them can be selected. Note that the predicted label from each individual classifier may be obtained from posterior class probabilities. For instance, each net in an ensemble of neural networks first outputs the posterior class probabilities, based on which it predicts the class label. Then the MajV rule counts the votes from these nets.

MajV Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \sum_{j=1}^{L} \Delta_j(Y_i|x_t)    (9)

where

\Delta_j(Y_i|x_t) = \begin{cases} 1, & \text{if } h_j(x_t) = Y_i \\ 0, & \text{otherwise.} \end{cases}
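As a complement to the soft rules above, the short sketch below (again Python/NumPy, an illustrative assumption rather than the authors' code) implements the hard MajV rule of (9), including the random tie-break among equally voted classes mentioned above.

import numpy as np

def majv_rule(posteriors, rng=np.random.default_rng()):
    # Each classifier casts a hard vote for its own arg-max class.
    votes = np.argmax(posteriors, axis=1)
    counts = np.bincount(votes, minlength=posteriors.shape[1])
    winners = np.flatnonzero(counts == counts.max())
    # Random tie-break when several classes receive the maximum count.
    return int(rng.choice(winners))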

E. Max Rule

Max rule is based on the information provided by the maximal value of P_j(Y_i|x_t) across all potential class labels. Unlike the AA rule, which is based on the mean value of P_j(Y_i|x_t), Max rule is more like a winner-take-all style of voting.

Max Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \{ \max_j (P_j(Y_i|x_t)) \}.    (10)

F. Min Rule

Similar to the Max rule, the Min rule votes for the final predicted class label based on the maximum of the minimal values of P_j(Y_i|x_t) across all potential class labels.


Similar to (10), the Min rule can be defined as follows.

Min Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \{ \min_j (P_j(Y_i|x_t)) \}.    (11)

G. BC Rule

The BC rule is based on the ranked order of class labels provided by the individual P_j(Y_i|x_t). Based on the classifier output, each classifier ranks all the potential class labels. For a C-class problem, the kth ranked candidate receives (C − k) votes in the final voting system. Finally, the class label that receives most of the votes will be the final predicted result. The BC rule can be defined as follows.

BC Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \sum_{j=1}^{L} \Delta_j(Y_i|x_t)    (12)

where \Delta_j(Y_i|x_t) = C − k if classifier h_j ranked class label Y_i in the kth position for the testing instance x_t, and C is the number of classes.
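The ranking-to-votes conversion in (12) can be written compactly; the sketch below (Python/NumPy, an illustrative assumption and not the authors' implementation) assigns C − k votes to the class ranked kth by each classifier and sums the votes across the ensemble.

import numpy as np

def bc_rule(posteriors):
    # posteriors: L x C array of estimated posteriors.
    L, C = posteriors.shape
    order = np.argsort(-posteriors, axis=1)        # order[j, k-1] = class ranked kth by h_j
    ranks = np.empty_like(order)
    rows = np.arange(L)[:, None]
    ranks[rows, order] = np.arange(1, C + 1)       # ranks[j, i] = rank k of class i under h_j
    votes = (C - ranks).sum(axis=0)                # each class collects C - k votes per classifier
    return int(np.argmax(votes))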

H. Weighted Rules

In order to reflect different contributions from different classifiers, a weight coefficient can be introduced for each individual classifier in several of the aforementioned methods. Here we define two commonly used methods.

Weighted AA Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \frac{1}{L} \sum_{j=1}^{L} \omega_j \cdot P_j(Y_i|x_t).    (13)

Weighted MajV Rule:

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} \sum_{j=1}^{L} \omega_j \cdot \Delta_j(Y_i|x_t)    (14)

where \omega_j is a weight coefficient for classifier h_j: \omega_j \ge 0 and \sum_{j=1}^{L} \omega_j = 1.

There are two critical issues in the design of such weighted voting methods: classifier diversity and combining weights. For instance, intuitively speaking, if all classifiers in an ensemble learning system provide the same vote, one cannot get any additional benefit by integrating all decisions from each individual classifier. Therefore, it is important to understand how to create diverse classifier ensembles and measure such diversity for ensemble learning. For instance, in [13]–[15], a novel method to create diverse ensembles, namely, diverse ensemble creation by oppositional relabeling of artificial training examples (DECORATE), is proposed. Some representative assessment metrics for diversity measurement developed in the community include Q-statistics, correlation coefficient, disagreement measure, double-fault measure, entropy, Kohavi–Wolpert variance, interrater agreement, measure of difficulty, generalized diversity, and coincident failure diversity [16], [17]. The second important issue is how to decide the combining weights for each individual classifier. While it is difficult to adaptively decide such weights, in many practical situations one can use cross-validation techniques to decide such combining weights [40].

Fig. 1. Ensemble learning system with multiple classifiers: h_j represents the individual classifier and \omega_j is its corresponding associated combining weight, and s_j and n_j represent the associated signal strength and uncertainty degree for each classifier h_j, respectively.

All of these aforementioned combining methods have attracted a significant amount of research effort in the computational intelligence community. For instance, a mathematical framework for analyzing the substantial improvements obtained by combining the outputs of multiple classifiers is discussed in [43]. In [44], a theoretical study of the classification error for six combination methods with the assumption that each classifier in the ensemble system commits independent and identically distributed errors is presented. In order to release the classifier independence assumption, the theoretical upper and lower bounds of the MajV rule for binary classification problems are developed in [45]. In [46], the classifier combination problem is addressed using a non-Bayesian probabilistic framework. Two linear combination rules are proposed to minimize misclassification rates when the learning databases satisfy certain conditions. A generalized local weighting voting method for multi-atlas image segmentation combination is proposed in [47]. The voting weights in this method are calculated based on the local estimation of the segmentation performance. Negative correlation learning (NCL) was analyzed in [48] and a regularized negative correlation learning algorithm is proposed to solve the overfitting problem of the NCL approach. In [49], a local classifier weighting method is proposed. In this method, the local classifier accuracy is estimated by solving a convex quadratic optimization problem, and the estimates are used to weight the classifier outputs. In this paper, we propose a new classifier combination method motivated by the signal strength concept, namely, the SSC approach, to integrate the output of each individual classifier to support the final decision-making process in classification tasks.

III. PROPOSED METHOD

A. Combining Rule Based on Signal Strength

In order to discuss the proposed voting strategy, we consider a general ensemble learning scenario in this section. Fig. 1 illustrates an ensemble system with L hypotheses, each associated with a signal strength s_j as a criterion related to the posterior probability P_j(Y_i|x_t). For clear presentation, we also introduce a related concept, namely, the uncertainty degree n_j, j = 1, ..., L. For instance, in a two-class classification problem, P_j = 0.5 represents the lowest certainty, meaning that out of the two classes, each one is equally likely. On the other hand, P_j = 0 or P_j = 1 represents full certainty, meaning that the hypothesis is certain about the class identity label. In multiclass classification problems, given a class label Y_i, the predicted label y_t of any testing instance x_t can be represented as a Boolean type: y_t = Y_i or y_t ∈ \bar{Y}_i, where \bar{Y}_i = {Y_k, k ≠ i}. In this way, the multiclass classification problem can also be transformed analogously to a two-class problem. To this end, the signal strength can be represented as |P_j − 0.5|, whereas the uncertainty degree is 0.5 − |P_j − 0.5|. In this way, we can define the aggregate signal strength and uncertainty degree in the ensemble voting system as

s = \omega_1 s_1 + \omega_2 s_2 + \cdots + \omega_L s_L = \sum_{k=1}^{L} \omega_k s_k    (15)

n = \omega_1 n_1 + \omega_2 n_2 + \cdots + \omega_L n_L = \sum_{k=1}^{L} \omega_k n_k    (16)

where we assume that each hypothesis is associated with a combination weight \omega_j \ge 0, and \omega_j is normalized, so \sum_{j=1}^{L} \omega_j = 1.

The signal strength s_j and the uncertainty degree n_j can be used to represent the knowledge level of the hypothesis j. In our algorithm, higher weights are assigned to the classifiers that have higher signal strengths and lower uncertainty degrees, i.e., they are more certain about their decisions, whereas lower weights are assigned to those classifiers that have lower signal strengths and higher uncertainty degrees, i.e., they are less certain about their decisions. To do so, the weights \omega_j should be proportional to the signal strength to uncertainty degree ratio \beta_j as

\omega_j \propto \beta_j = \frac{s_j}{n_j}    (17)

where s_j = |P_j − 0.5| and n_j = 0.5 − s_j. We then develop our classifier combining strategy based on (17).

Assume that each classifier ensemble system is associated with a decision profile P_d(Y_i|x_t), which is defined as the voting probability from each hypothesis h_j for each testing instance x_t across all possible class identity labels. For each hypothesis, the elements in P_d(Y_i|x_t) can be obtained either directly from the hypothesis output, or from the confusion matrix based on the training data or cross-validation method, depending on the different base learning algorithms or implementation details. We include a discussion about how to obtain the decision profiles in Appendix A.

Once the decision profile is obtained, we can define the signal strength s_j and uncertainty degree n_j for each element in the decision profile as

s_j = |p_j − 0.5|,  n_j = 0.5 − s_j    (18)

where p_j ∈ [0, 1], s_j ∈ [0, 0.5], and n_j ∈ [0, 0.5].

We introduce a new variable \tilde{s}_j to reflect the signal strength as well as its direction

\tilde{s}_j = p_j − 0.5    (19)

where \tilde{s}_j ∈ [−0.5, 0.5].

Therefore, the signal strength to uncertainty degree ratio with the consideration of its direction can be defined as

\tilde{\beta}_j = \frac{\tilde{s}_j}{n_j} = \frac{\tilde{s}_j}{0.5 − |\tilde{s}_j|}    (20)

where \tilde{\beta}_j ∈ (−∞, +∞). Note that if the posterior probability p_j is 0 or 1, which means n_j is 0, (17) and (20) will have division-by-zero problems. In order to handle this issue, in the algorithm implementation we can assign n_j to be an implementation-defined small positive constant value ε if p_j is detected to be 0 or 1 (in our current implementation we set ε = 1E−8). We would also like to note that, when p_j equals 0 or 1, this particular classifier provides full certainty (see Fig. 1) about the prediction result, and therefore it should carry a very high weight in the voting method. On the one hand, this characteristic has been captured by (20). On the other hand, it may not be desirable for this single classifier to overdominate the final decision in a multiple classifier system. Therefore, we further introduce a modified logistic function in (29) in the SSC algorithm to smooth the influence of such a classifier on the final decision. More detailed discussions can be found in Section III-B.

According to (15) and (16), we may combine the \tilde{\beta}_j from different hypotheses for each potential class label Y_i and compute the aggregate signal strength to uncertainty degree ratio with the consideration of its direction, \beta_{out}, as

\beta_{out} = \frac{\sum_{k=1}^{L} \omega_k \tilde{s}_k}{\sum_{k=1}^{L} \omega_k n_k}.    (21)

Since, according to (17), the weights \omega_j are proportional to \beta_j = s_j/n_j, we then have

\beta_{out} = \frac{\sum_{k=1}^{L} \beta_k \tilde{s}_k}{\sum_{k=1}^{L} \beta_k n_k}    (22)

where \beta_{out} ∈ (−∞, +∞).

We can rewrite (20) as

s_{out} = \frac{\beta_{out}}{2(1 + |\beta_{out}|)}    (23)

where s_{out} ∈ [−0.5, 0.5]. Substituting (22) into (23), we can obtain s_{out}. Therefore, the final voting probability P(Y_i|x_t) will be

p_{out} = s_{out} + 0.5    (24)

where p_{out} ∈ [0, 1]. In this way, p_{out} provides a final voting probability for each potential class label.
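As a quick numerical illustration of (18)–(24) (a toy example of our own, not one taken from the paper), the short Python check below combines two classifiers voting on one class with p_1 = 0.9 and p_2 = 0.4, using the unsmoothed weights \beta_j of (17).

# Toy check of the SSC combination of (18)-(24) for two classifiers and one class.
def combine(ps):
    s = [abs(p - 0.5) for p in ps]              # signal strength, (18)
    n = [0.5 - sj for sj in s]                  # uncertainty degree, (18); assumes n_j > 0
    s_dir = [p - 0.5 for p in ps]               # directional signal strength, (19)
    beta = [sj / nj for sj, nj in zip(s, n)]    # ratio of (17)
    b_out = sum(b * sd for b, sd in zip(beta, s_dir)) / sum(b * nj for b, nj in zip(beta, n))  # (22)
    s_out = b_out / (2 * (1 + abs(b_out)))      # (23)
    return s_out + 0.5                          # (24)

print(round(combine([0.9, 0.4]), 2))  # approx 0.88, versus 0.65 for the plain AA rule:
                                      # the more certain classifier dominates the combination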

B. Proposed SSC Algorithm

Based on the discussion in Section III-A, we present the proposed SSC algorithm as follows.


Algorithm 1 SSC Algorithm¹

Input:
1) A set of L classifiers h_j, j = 1, ..., L, each trained based on the training data D_{tr} by different means, such as bootstrap sampling or the subspace method.
2) A testing instance x_t ∈ D_{te}, where D_{te} is the entire testing dataset.

Procedure:
1) Apply the testing instance x_t to each classifier h_j, and return the decision profile P_d(Y_i|x_t), where Y_i = 1, ..., C is the potential class identity label.
2) Based on each column of P_d(Y_i|x_t) in the decision profile, calculate the signal strength S_{Y_i}, the signal strength with direction \tilde{S}_{Y_i}, and the uncertainty degree N_{Y_i} for each class identity label

S_{Y_i} = |P_d(Y_i|x_t) − 0.5|    (25)

\tilde{S}_{Y_i} = P_d(Y_i|x_t) − 0.5    (26)

N_{Y_i} = 0.5 − S_{Y_i}    (27)

where S_{Y_i} ∈ [0, 0.5]^L, \tilde{S}_{Y_i} ∈ [−0.5, 0.5]^L, and N_{Y_i} ∈ [0, 0.5]^L.
3) Calculate \beta_{Y_i} and \tilde{\beta}_{Y_i}

\beta_{Y_i} = \frac{S_{Y_i}}{N_{Y_i}}    (28)

\tilde{\beta}_{Y_i} = \frac{1}{1 + e^{-\alpha \beta_{Y_i}}} − 0.5    (29)

where \beta_{Y_i} ∈ [0, +∞)^L, \tilde{\beta}_{Y_i} ∈ [0, 0.5)^L, and \alpha is a parameter used to adjust the sensitivity level of each voting classifier for its contribution to the final decision.
4) Calculate \beta_{out}(Y_i) and S_{out}(Y_i)

\beta_{out}(Y_i) = \frac{\sum_{k=1}^{L} \tilde{\beta}_{Y_i}(k) \tilde{S}_{Y_i}(k)}{\sum_{k=1}^{L} \tilde{\beta}_{Y_i}(k) N_{Y_i}(k)}    (30)

S_{out}(Y_i) = \frac{\beta_{out}(Y_i)}{2(1 + |\beta_{out}(Y_i)|)}    (31)

where \beta_{out}(Y_i) ∈ ℝ and S_{out}(Y_i) ∈ [−0.5, 0.5].
5) Calculate the final voting probability P(Y_i|x_t)

P(Y_i|x_t) = S_{out}(Y_i) + 0.5    (32)

where P(Y_i|x_t) ∈ [0, 1].

Output: the predicted class identity label Y_i

x_t \rightarrow Y_i \ \text{satisfy} \ \max_{Y_i} P(Y_i|x_t).    (33)

¹The MATLAB implementation of the proposed algorithm is available from the authors.
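For readers who prefer code to pseudocode, the following compact Python/NumPy sketch mirrors steps 1)–5) of Algorithm 1. It is our own illustrative re-implementation (the authors distribute a MATLAB version), and the epsilon guard and the default alpha = 0.1 follow the choices stated in the text; all function and variable names are assumptions.

import numpy as np

def ssc_combine(decision_profile, alpha=0.1, eps=1e-8):
    # decision_profile: L x C array, row j holds P_d(Y_i | x_t) from classifier h_j.
    # Returns (predicted class index, final voting probabilities P(Y_i | x_t)).
    P = np.asarray(decision_profile, dtype=float)
    S = np.abs(P - 0.5)                                   # signal strength, (25)
    S_dir = P - 0.5                                       # signal strength with direction, (26)
    N = np.maximum(0.5 - S, eps)                          # uncertainty degree, (27), guarded against n_j = 0
    beta = S / N                                          # ratio, (28)
    beta_t = 1.0 / (1.0 + np.exp(-alpha * beta)) - 0.5    # smoothed ratio, (29)
    num = np.sum(beta_t * S_dir, axis=0)
    den = np.maximum(np.sum(beta_t * N, axis=0), eps)     # extra guard, beyond the paper, for all-0.5 profiles
    beta_out = num / den                                  # (30)
    S_out = beta_out / (2.0 * (1.0 + np.abs(beta_out)))   # (31)
    P_out = S_out + 0.5                                   # (32)
    return int(np.argmax(P_out)), P_out                   # (33)

# Example: the two-classifier profile from the toy check above, now with both classes.
profile = np.array([[0.9, 0.1],
                    [0.4, 0.6]])
label, probs = ssc_combine(profile)
print(label, np.round(probs, 3))   # class 0 wins with a final voting probability near 0.88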

Fig. 2. Sensitivity analyses of \beta and \tilde{\beta}. (a) Sensitivity of \beta with respect to the signal strength. (b) Sensitivity of \tilde{\beta} with respect to the signal strength with different \alpha coefficients.

One should note that in the procedure steps (1) and (2), the calculations of the signal strength and uncertainty degree can be based either on the training data or on the testing data, depending on the different base learning algorithms or implementation details (see Section III-A and Appendix A for more discussions). In the SSC algorithm, a modified logistic function is introduced in (29) to adjust the sensitivity level of \beta. To analyze this, let us define the normalized sensitivity functions of \beta and \tilde{\beta} with respect to signal s as [50]

\xi^{\beta}_s = \frac{s \cdot \partial\beta}{\beta \cdot \partial s}, \qquad \xi^{\tilde{\beta}}_s = \frac{s \cdot \partial\tilde{\beta}}{\tilde{\beta} \cdot \partial s}.    (34)

Theorem 1: For s ∈ [0, 0.5], \beta ∈ [0, +∞), and \tilde{\beta} ∈ [0, 0.5) that are defined in (25), (28), and (29), \xi^{\beta}_s and \xi^{\tilde{\beta}}_s satisfy

\xi^{\tilde{\beta}}_s \le \xi^{\beta}_s    (35)

and \xi^{\tilde{\beta}}_s = \xi^{\beta}_s if and only if s = 0.

The proof of Theorem 1 can be found in Appendix B. Fig. 2 illustrates the sensitivity of \beta and \tilde{\beta} with different \alpha coefficients. By adjusting the value of \alpha, one can shift the sensitivity curve to the lower signal strength region or the high signal strength region, which provides the flexibility in adjusting the sensitivity level from each voting unit for the final decision. The parameter \alpha is a positive number, i.e., \alpha ∈ (0, +∞). When \alpha approaches 0, one can prove that it is equivalent to replacing \tilde{\beta}_{Y_i} with \beta_{Y_i} in (30). The proof and discussion on this can be found in Appendix C. When \alpha approaches infinity, \tilde{\beta}_{Y_i} approaches the constant value 0.5. Then, from (30), one can see that the performance of the proposed method will be pushed toward the AA rule. In practical applications, the \alpha coefficient can be obtained by the cross-validation method. In this paper, we set \alpha to 0.1 for all datasets.
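The effect of \alpha described above is easy to probe numerically; the snippet below (an illustrative aid of ours, not from the paper) evaluates the modified logistic mapping of (29) for a few \alpha values.

import numpy as np

def beta_tilde(beta, alpha):
    # Modified logistic mapping of (29).
    return 1.0 / (1.0 + np.exp(-alpha * beta)) - 0.5

betas = np.array([0.1, 1.0, 10.0, 100.0])   # signal strength to uncertainty degree ratios
for alpha in (0.01, 0.1, 5.0):
    print(alpha, np.round(beta_tilde(betas, alpha), 4))
# For small alpha*beta the output is approximately alpha*beta/4, i.e., roughly proportional to beta.
# For large alpha the outputs saturate near 0.5, pushing the combination toward the AA rule.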

Fig. 3 gives an example of an ensemble system with four classifiers for different combining rules. Assume that we have obtained the weight coefficients (for the weighted AA rule and weighted MajV rule) and the decision profile for the testing example x_t. From Fig. 3 one can see, in this particular example, the MajV rule and weighted MajV rule vote this testing instance, x_t, as a class-2 label. For the MV rule, because the votes for classes 1 and 2 are the same, the final predicted label can be randomly selected from these two classes. For the BC rule, the final predicted label can be randomly selected from classes 1–3. All other methods vote this testing instance as a class-1 label. In Fig. 4, we also illustrate the major calculation steps for the proposed SSC algorithm for this case.

Fig. 3. Exemplary comparison of the SSC and other combining methods for a three-class problem using four classifiers.

IV. SIMULATION ANALYSIS

A. Experimental Conditions

In this section, we compare the classification performance of the proposed SSC algorithm with other combining methods as presented in Section II over real-world classification datasets. The datasets used in this paper include 20 datasets from the UCI Machine Learning Repository [51]. Table I illustrates the dataset characteristics.

A neural network model with MLP is used as the base classifier, in which the numbers of input neurons and output neurons are equal to the numbers of attributes and classes for each dataset, respectively. We note that, although here we set the number of output neurons to be the same as the number of classes, we can still directly use the proposed SSC algorithm with 0.5 as the threshold because, once each output neuron provides an estimation of the decision profile value, we can still use 0.5 as the reference point to calculate the final voting probability for each corresponding class. Of course, an alternative approach is to transform the multiclass classification problem into multiple two-class classification problems as discussed before. The number of hidden neurons for each dataset is shown in Table I, satisfying the condition N = O(W/ε) [52], where N is the size of the training set, O(·) denotes the order of the quantity enclosed within, W is the total number of free parameters of the model (i.e., synaptic weights), and ε denotes the expected testing error, which is set to 0.1 in this paper. The sigmoid function is used for the activation functions, and backpropagation is used to train the network. The weights of the network were initialized to small random values uniformly distributed between −0.5 and 0.5. The learning rate is 0.1, and the number of epochs in each training is 500.

TABLE I
DATASET CHARACTERISTICS USED IN THIS PAPER

Name              Instances  Classes  Cont. Attr.  Disc. Attr.  Hidden Neurons
Ecoli                   336        8            7            0               2
German                 1000        2            7           13               2
Glass                   214        7            9            0               2
Haberman                306        2            0            3              10
Ionosphere              351        2           34            0               2
iris                    150        3            4            0              10
letter-recognit       20000       26           16            0              10
musk1                   476        2          166            0               2
Pima-indians-di         786        2            8            0              10
Satimage               6435        6           36            0              10
Segmentation           2310        7           19            0              10
Shuttle               59000        2            9            0               2
Sonar                   208        2           60            0               2
Soybean-small            47        4            0           35               2
Spectf                  267        2           44            0               2
Vehicle                 846        4           18            0               2
vowel                   990       11           10            0               2
Wdbc                    569        2           30            0               2
Wine                    178        3           13            0               2
Yeast                  1484       10            8            0              10

Fig. 4. Calculation of the SSC algorithm.

Fig. 5. Final posterior probability for the SSC algorithm and AA rule on the Pima-indians-di dataset.

Fig. 6. Final posterior probability for the SSC algorithm and MV rule on the Pima-indians-di dataset.

All results presented in this paper are based on the average of 100 random runs. At each run, we randomly select half (50%) of the dataset as the training set, and use the remaining half (50%) for testing. The bagging method is adopted to create the classifier ensemble system [18]. Specifically, we first obtain bootstrap samples by randomly drawing instances, with replacement, from the original training set. Once the bootstrap samples are obtained, we then develop the classifier (i.e., the neural network MLP) on top of that with different initial weights. Opitz and Maclin suggested for the ensemble learning method that relatively large gains can be observed by using 25 ensemble classifiers [53]; therefore, in this paper we construct 25 classifiers through bootstrap sampling in each run. For the weighted AA rule and weighted MajV rule, we equally divide the training dataset into two subsets. One subset is used to build the ensemble of classifiers, and the other is used to test the performance of the generated learning system. The normalized accuracy rate of each classifier in the ensemble system over the second subset is used as the weight of each classifier.
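A minimal sketch of this experimental pipeline is given below, using scikit-learn's MLPClassifier as a stand-in for the backpropagation-trained MLP described above; the library choice, the 50/50 split helper, and all names are our own assumptions rather than the authors' setup.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def bagging_mlp_profiles(X, y, n_classifiers=25, hidden=10, seed=0):
    # Train a bagged ensemble of MLPs and return per-classifier posteriors for the
    # held-out half, i.e., one decision profile per test instance.
    rng = np.random.default_rng(seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=seed)
    profiles = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(X_tr), size=len(X_tr))   # bootstrap sample with replacement
        clf = MLPClassifier(hidden_layer_sizes=(hidden,), activation="logistic",
                            learning_rate_init=0.1, max_iter=500)
        clf.fit(X_tr[idx], y_tr[idx])
        profiles.append(clf.predict_proba(X_te))            # estimated P_j(Y_i | x_t)
    # Shape (L, n_test, C); profiles[:, t, :] is the decision profile of test instance t.
    return np.stack(profiles), y_te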

B. Results

In order to illustrate the performance of the proposed SSC algorithm, Figs. 5 and 6 show the voting values of the SSC algorithm with respect to the AA and MV rules, respectively. In these two figures, we use half of the Pima-indians-di dataset to develop the 25 classifiers based on the bagging method, and then plot the combined final posterior probability P(Y_i|x_t) for class 1 of the testing data (Pima-indians-di is a two-class dataset). For clear presentation, we only show the results of 50 randomly selected testing instances here. The squares represent the true probability P_true(1|x_t) of each instance, i.e., if the testing instance x_t belongs to class 1, then P_true(1|x_t) = 1; otherwise, P_true(1|x_t) = 0.


TABLE II
TESTING ERROR PERFORMANCE (AVERAGE ERROR RATES IN PERCENTAGE; EACH ROW REPRESENTS A DATASET UNDER INVESTIGATION, EACH COLUMN REPRESENTS A VOTING METHOD, AND THE HIGHLIGHTED NUMBERS REPRESENT THE BEST PERFORMANCE FOR THE DATASETS)

Dataset            GA     AA     MV     MajV   Max    Min    BC     WAA    WMajV   SSC
Ecoli             19.82  21.07  21.26  21.39  22.94  17.53  19.79  21.07  21.39   18.43
German            24.35  24.44  24.48  24.48  24.47  24.47  24.48  24.43  24.48   24.23
Glass             43.37  43.87  44.88  45.06  44.36  43.48  44.01  43.53  44.66   42.82
Haberman          25.50  25.61  25.82  25.82  25.44  25.44  25.82  25.65  25.82   25.33
Ionosphere        13.32  12.99  12.88  12.88  11.34  11.34  12.88  12.95  12.88   13.59
Iris               3.55   3.55   3.63   3.63   3.59   3.57   3.63   3.57   3.63    3.56
Letter            38.36  40.10  42.56  51.26  49.32  39.64  43.00  39.61  48.47   37.66
Musk1             23.90  24.17  25.24  25.24  25.82  25.82  25.24  23.97  25.10   24.36
Pima-indians-di   32.59  32.65  33.14  33.14  32.23  32.23  33.14  32.65  33.14   32.22
Satimage          23.12  25.12  31.68  44.65  25.72  22.09  37.78  23.44  41.22   21.73
Segmentation      12.29  13.06  14.24  20.05  16.15  12.32  14.17  12.92  17.65   11.50
Shuttle            7.35   7.36   8.60   8.63   6.06   4.46   8.77   7.34   8.57    7.32
Sonar             22.23  21.63  21.51  21.51  25.09  25.09  21.51  21.61  21.51   22.24
Soybean-small      1.54   1.42   1.29   1.17   7.25   3.71   1.46   1.33   1.13    2.17
Spectf            21.00  20.98  20.98  20.98  22.48  22.48  20.98  20.98  20.98   21.46
Vehicle           35.23  36.02  46.11  49.23  38.69  35.09  46.55  36.13  49.22   33.48
Vowel             44.34  45.21  46.11  46.37  46.74  44.28  44.96  45.31  46.32   42.97
Wdbc              13.61  20.42  37.20  37.20   8.57   8.57  37.20  13.92  34.86    8.79
Wine              29.47  34.26  47.88  50.04  28.56  17.66  48.04  30.89  45.42   18.69
Yeast             40.02  40.00  40.08  40.09  40.49  40.47  40.05  40.00  40.08   40.06
Winning times      2      2      2      2      2      5      2      2      3       9

TABLE III
STANDARD DEVIATION OF TESTING ERROR RATES (AVERAGE STANDARD DEVIATION IN PERCENTAGE; EACH ROW REPRESENTS A DATASET UNDER INVESTIGATION, AND EACH COLUMN REPRESENTS A VOTING METHOD)

Dataset            GA     AA     MV     MajV   Max    Min    BC     WAA    WMajV   SSC
Ecoli              2.87   2.92   2.71   2.69   3.00   2.93   2.76   2.91   2.72    2.77
German             1.55   1.53   1.63   1.63   1.68   1.68   1.63   1.55   1.63    1.51
Glass              6.48   6.49   6.46   6.55   5.88   5.65   6.64   6.45   6.24    6.29
Haberman           2.81   2.61   2.71   2.71   3.05   3.05   2.71   2.67   2.71    2.93
Ionosphere         2.76   2.78   2.78   2.78   2.48   2.48   2.78   2.78   2.78    2.72
Iris               1.74   1.72   1.67   1.67   1.65   1.66   1.67   1.76   1.67    1.69
Letter             1.65   1.84   2.05   3.12   3.39   3.26   2.30   1.93   2.86    1.78
Musk1              4.14   4.18   4.02   4.02   3.94   3.94   4.02   4.10   4.04    3.59
Pima-indians-di    2.18   2.27   2.16   2.16   2.02   2.02   2.16   2.24   2.16    2.09
Satimage           4.87   6.33   9.09   8.27   6.51   2.88   4.04   5.05   6.16    3.97
Segmentation       1.56   1.67   2.35   6.51   2.94   1.96   2.65   1.63   5.60    1.77
Shuttle            0.63   0.79   2.10   2.11   2.42   2.92   2.11   0.71   2.11    0.58
Sonar              3.93   4.02   3.72   3.72   4.19   4.19   3.72   3.99   3.72    3.88
Soybean-small      3.78   3.61   3.64   3.61   8.16   6.12   3.81   3.60   3.39    4.57
Spectf             2.54   2.55   2.55   2.55   2.75   2.75   2.55   2.55   2.55    2.69
Vehicle            4.96   5.48   8.35   7.54   6.79   4.13   9.05   5.68   7.96    3.94
Vowel              2.71   2.72   2.58   2.75   2.91   3.19   2.55   2.72   2.73    2.99
Wdbc              10.70  13.85   4.14   4.14   1.58   1.58   4.14  10.98   8.60    3.58
Wine              13.29  13.89  16.77  14.54  11.13  10.05  15.80  13.31  16.15   10.87
Yeast              1.51   1.51   1.51   1.51   1.46   1.56   1.52   1.52   1.51    1.48

The circles, x-marks, and plus-marks represent the results by the SSC algorithm, AA rule, and MV rule, respectively. From Figs. 5 and 6, one can see that the proposed method has the advantage of making the combined voting decision more deterministic (push toward either "1" or "0" values). Furthermore, the SSC algorithm can correctly classify some instances that are misclassified by the AA rule or MV rule (highlighted by ellipses in the figures), thus improving the prediction performance of the ensemble system.

Tables II and III show the testing error performance and standard deviation for the SSC rule with respect to the nine major combining rules as presented in Section II, respectively. For each dataset, the winning method is underlined. Furthermore, the total numbers of winning times for each method across all these datasets are also summarized at the bottom of Table II. From Tables II and III, we can see that the SSC rule can provide competitive results compared with all the existing combining rules.

TABLE IV
HYPOTHESIS TESTING RESULTS

Dataset            GA      AA      MV      MajV    Max     Min     BC      WAA     WMajV    Win  Tie  Loss
Ecoli             −3.47   −6.55   −7.30   −7.64  −11.02    2.25   −3.47   −6.56   −7.62     8    1    0
German            −0.53   −0.95   −1.09   −1.09   −1.04   −1.04   −1.09   −0.92   −1.09     0    9    0
Glass             −0.61   −1.16   −2.28   −2.46   −1.79   −0.77   −1.30   −0.79   −2.08     1    8    0
Haberman          −0.42   −0.72   −1.23   −1.23   −0.25   −0.25   −1.23   −0.79   −1.23     0    9    0
Ionosphere         0.69    1.53    1.83    1.83    6.12    6.12    1.83    1.64    1.83     0    7    2
Iris               0.06    0.06   −0.28   −0.28   −0.11   −0.06   −0.28   −0.05   −0.28     0    9    0
Letter            −2.91   −9.54  −18.05  −37.85  −30.45   −5.34  −18.37   −7.45  −32.07     9    0    0
Musk1              0.84    0.35   −1.62   −1.62   −2.74   −2.74   −1.62    0.71   −1.36     2    7    0
Pima-indians-di   −1.22   −1.39   −3.06   −3.06   −0.03   −0.03   −3.06   −1.37   −3.06     4    5    0
Satimage          −2.22   −4.54  −10.03  −24.98   −5.24   −0.73  −28.33   −2.66  −26.60     7    2    0
Segmentation      −3.34   −6.42   −9.32  −12.68  −13.57   −3.11   −8.39   −5.90  −10.47     9    0    0
Shuttle           −0.39   −0.46   −5.91   −6.03    5.05    9.60   −6.65   −0.25   −5.76     4    3    2
Sonar              0.02    1.10    1.36    1.36   −4.98   −4.98    1.36    1.14    1.36     2    7    0
Soybean-small      1.05    1.29    1.50    1.72   −5.44   −2.02    1.19    1.43    1.83     1    8    0
Spectf             1.25    1.31    1.31    1.31   −2.64   −2.64    1.31    1.31    1.31     2    7    0
Vehicle           −2.78   −3.78  −13.68  −18.52   −6.64   −2.83  −13.25   −3.85  −17.71     9    0    0
Vowel             −3.40   −5.55   −7.97   −8.38   −9.05   −3.02   −5.07   −5.79   −8.28     9    0    0
Wdbc              −4.27   −8.13  −51.89  −51.89    0.55    0.55  −51.89   −4.44  −27.99     7    2    0
Wine              −6.28   −8.83  −14.60  −17.27   −6.35    0.69  −15.31   −7.10  −13.73     8    1    0
Yeast              0.19    0.29   −0.10   −0.13   −2.04   −1.89    0.06    0.31   −0.09     0    9    0
Total                                                                                      82   94    4

To compare the statistical characteristics of the proposed SSC method with those of the traditional approaches, we use hypothesis testing of the mean error rates to assess the statistical significance.

We formulate three hypotheses as follows:

1) H1_0: \mu_1 = \mu_2 versus H1_1: \mu_1 \neq \mu_2;
2) H2_0: \mu_1 < \mu_2 versus H2_1: \mu_1 \ge \mu_2;
3) H3_0: \mu_1 > \mu_2 versus H3_1: \mu_1 \le \mu_2.

The independent two-sample t-statistic is calculated as follows:

Z = \frac{\mu_1 − \mu_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}.    (36)

We conduct the pairwise hypothesis testing between the mean error rate \mu_1 of SSC and those of the other combining rules, \mu_2. Table IV shows the hypothesis testing results between SSC and the specified combining method. For each hypothesis test, we accept H1_0 if |Z| < 2.345 (2.345 corresponds to a two-tailed test where the results are significant with p = 0.02 and a one-tailed test where the results are significant with p = 0.01), i.e., the two methods tie. We accept H2_0 if Z < −2.345, i.e., SSC wins, and we accept H3_0 if Z > 2.345, i.e., SSC loses. The rightmost three columns of Table IV summarize the Win-Tie-Loss ratio of SSC over the other nine combining rules for each dataset, and the total Win-Tie-Loss ratio is summarized at the bottom of Table IV. For instance, the number −3.47 in the first column and first row of Table IV is the hypothesis testing result between SSC and the GA rule over the Ecoli dataset. In this case, since Z = −3.47 < −2.345, the SSC rule can significantly outperform the GA rule, i.e., SSC wins. From Table IV, we can see that SSC wins 82 times, ties 94 times, and loses 4 times across all the hypothesis tests. In other words, SSC can achieve competitive performance compared with the other traditional combining rules.

Fig. 7. Visualization of the testing error of 10 voting methods on the Vehicle dataset.
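The decision rule above is simple enough to express directly; the helper below (Python, an illustrative sketch with assumed function and variable names) computes the statistic of (36) from the mean error rates, standard deviations, and number of runs, and maps it to the win/tie/loss outcome at the ±2.345 threshold.

from math import sqrt

def ssc_vs_rule(mu_ssc, sd_ssc, mu_other, sd_other, n1=100, n2=100, thr=2.345):
    # Independent two-sample statistic of (36); mu_1 is the SSC mean error rate.
    z = (mu_ssc - mu_other) / sqrt(sd_ssc**2 / n1 + sd_other**2 / n2)
    if z < -thr:
        return z, "win"    # SSC significantly better (accept H2_0)
    if z > thr:
        return z, "loss"   # SSC significantly worse (accept H3_0)
    return z, "tie"        # no significant difference (accept H1_0)

# Ecoli, SSC versus the GA rule, using the values reported in Tables II and III.
print(ssc_vs_rule(18.43, 2.77, 19.82, 2.87))   # Z close to -3.5 -> "win"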

Fig. 7 provides an example of the box plot results for the Vehicle dataset. The box plot method can illustrate groups of numerical data by presenting their five-number characteristics: the minimum and maximum range values, the upper and lower quartiles, and the median [54]. The horizontal dash-dotted line in the figure shows the median value of the error rates of the SSC rule compared to the other combining rules. Fig. 7 also confirms that the proposed method can provide competitive results on the Vehicle dataset.

In order to compare the numerical characteristics of different combining rules, Table V demonstrates another example of the numerical results of the box plot method on the Glass dataset across all these methods. For example, from this table one can see that the proposed method provides competitive performance in terms of all the five evaluation metrics (the largest and smallest nonoutliers, the upper and lower quartiles, and the median value). One should also note that, in terms of the "lower quartile" criterion, the GA rule has the same level of performance as that of our proposed method. For the "smallest nonoutlier" criterion, the GA rule, Min rule, and weighted AA rule all have the same level of performance as our proposed method. In summary, the hypothesis testing and box plot indicate that the proposed SSC rule can provide robust and competitive results.

TABLE V
NUMERICAL CHARACTERISTICS OF 10 VOTING METHODS ON THE GLASS DATASET

Voting method         GA     AA     MV     MajV   Max    Min    BC     WAA    WMajV   SSC
Largest nonoutlier   57.01  55.14  59.81  58.88  56.07  57.01  59.81  55.14  58.88   55.14
Upper quartile       46.26  46.26  48.13  47.66  46.73  47.66  47.66  45.79  47.66   45.79
Median               42.06  42.06  43.46  43.93  43.46  42.99  42.06  42.06  43.93   42.06
Lower quartile       38.79  39.25  40.19  40.19  40.19  39.25  39.25  39.25  40.19   38.32
Smallest nonoutlier  32.71  33.64  33.64  34.58  34.58  32.71  33.64  32.71  34.58   32.71

Fig. 8. Average classification error rates over 20 datasets versus \alpha coefficients.

To see the influence of different \alpha coefficients, we have conducted extensive experiments to test the performance of the proposed algorithm versus different \alpha values. For space considerations, Fig. 8 illustrates the average classification error rate over 20 datasets versus the \alpha coefficients. For clear presentation, here we use a base-10 logarithmic scale for the x-axis. As discussed earlier in Fig. 2, small values of the \alpha parameter will shift the sensitivity curve to the right side on the signal strength axis, which makes the system more sensitive to the large signal strength values. On the other hand, large values of \alpha will shift the sensitivity curve to the left side, which corresponds to the small signal strength value region, and make the system less sensitive to the variations of signal strength from different classifiers. For different application situations, the \alpha parameter value can be determined via the cross-validation method.

We also investigate the effect of noise in the posterior class probabilities of each individual classifier on the proposed combination rule using the Satimage dataset. In our analysis, we explicitly add random noise to the outputs from the neural networks on a relative basis and apply these decision profiles with noise to the voting methods. In Table VI, we summarize the error performance of SSC and the other nine voting methods when noise is present. For each noise percentage level, the left columns in Table VI are the average error rates of the voting methods when 1% to 10%, 15%, and 20% uniformly distributed noise is added to the decision profiles over 100 runs, and the right columns are the hypothesis testing results between SSC and the other nine voting methods. Win-Tie-Loss times are also calculated in Table VI. From Table VI, one can see that SSC can achieve stable performance when the decision profiles are perturbed by the addition of random noise.

From the above simulation results, one can see that SSC can provide superior performance when a very weak learner is used in training the ensemble system. In this paper, we also investigate the performance of the proposed voting method when a relatively strong base learner is used. In this experiment, we increase the number of hidden neurons in the neural network structure for each dataset, as well as the number of epochs in each training. For instance, for the datasets Musk1, Satimage, and Segmentation, 20 hidden neurons are used, and for the datasets Shuttle and Wine, 10 hidden neurons are used. The number of epochs is increased to 5000 in each training. For space considerations, in Table VII we present the error rates of SSC and the other nine voting methods for five datasets with relatively strong base learners, and in Table VIII the pairwise hypothesis testing results. From Table VIII, we can see that SSC wins 7 times, ties 36 times, and loses 2 times. Therefore, we can draw the conclusion that, with relatively strong individual classifiers, SSC can still achieve competitive performance compared with other traditional voting methods.

One question is: "is there a best combining method in all cases, and if so, which one is the best?" Generally speaking, there is no single universally optimal combining rule that can always achieve the best performance for ensemble learning, which has been discussed extensively in the literature [55]–[59]. Instead, the best approach for each particular problem may depend


TABLE VI

TESTING ERROR PERFORMANCE (MEAN AND STANDARD DEVIATION) WHEN NOISE IS ADDED OVER THE SATIMAGE DATASET (IN PERCENTAGE)

Method                 1%            2%            3%            4%            5%            6%
GA rule              23.12  2.15   23.13  2.16   23.13  2.18   23.14  2.19   23.16  2.22   23.18  2.24
AA rule              25.16  4.53   25.14  4.54   25.15  4.60   25.15  4.60   25.15  4.64   25.19  4.69
MV rule              31.83 10.10   31.86 10.30   31.78 10.52   31.66 10.69   31.55 10.85   31.41 10.92
MajV rule            44.46 25.83   44.28 26.45   44.08 27.28   43.87 27.76   43.67 28.17   43.46 28.59
Max rule             25.72  5.19   25.72  5.21   25.71  5.23   25.73  5.27   25.75  5.31   25.74  5.31
Min rule             22.09  0.66   22.08  0.63   22.07  0.63   22.08  0.62   22.08  0.63   22.07  0.60
BC rule              37.73 29.04   37.63 28.96   37.59 29.32   37.47 29.02   37.33 28.77   37.17 28.51
Weighted AA rule     23.40  2.57   23.41  2.57   23.42  2.61   23.44  2.62   23.44  2.63   23.46  2.65
Weighted MajV rule   41.25 25.80   41.21 25.45   41.08 25.66   40.92 25.73   40.69 25.81   40.47 25.94
SSC rule             21.77   −     21.77   −     21.77   −     21.78   −     21.78   −     21.79   −
Win                    −    7        −    7        −    7        −    7        −    7        −    7
Tie                    −    2        −    2        −    2        −    2        −    2        −    2
Loss                   −    0        −    0        −    0        −    0        −    0        −    0

Method                 7%            8%            9%            10%           15%           20%
GA rule              23.21  2.31   23.23  2.36   23.24  2.38   23.27  2.40   23.40  2.72   23.42  2.89
AA rule              25.19  4.71   25.19  4.74   25.19  4.78   25.18  4.77   25.21  5.01   25.17  5.22
MV rule              31.24 10.92   31.14 11.07   31.00 11.15   30.87 11.20   30.63 11.95   30.52 12.82
MajV rule            43.22 28.59   42.98 29.00   42.75 29.27   42.44 29.54   40.98 29.61   39.53 28.78
Max rule             25.76  5.31   25.75  5.38   25.76  5.40   25.76  5.36   25.81  5.56   25.91  5.86
Min rule             22.09  0.60   22.09  0.65   22.10  0.66   22.08  0.57   22.10  0.66   22.12  0.78
BC rule              37.17 28.51   36.70 26.87   36.48 26.45   36.21 25.69   34.82 22.28   33.26 19.26
Weighted AA rule     23.46  2.65   23.45  2.66   23.51  2.75   23.52  2.74   23.63  3.03   23.65  3.20
Weighted MajV rule   40.47 25.94   39.89 26.01   39.61 26.08   39.30 25.81   37.68 24.44   35.16 22.16
SSC rule             21.79   −     21.78   −     21.78   −     21.81   −     21.80   −     21.77   −
Win                    −    7        −    8        −    8        −    8        −    8        −    8
Tie                    −    2        −    1        −    1        −    1        −    1        −    1
Loss                   −    0        −    0        −    0        −    0        −    0        −    0

TABLE VII

TESTING ERROR PERFORMANCE WITH RELATIVELY STRONG BASE LEARNERS. AVERAGE ERROR RATES IN PERCENTAGE: EACH ROW REPRESENTS A DATASET UNDER INVESTIGATION, EACH COLUMN REPRESENTS A VOTING METHOD

Dataset         GA     AA     MV     MajV   Max    Min    BC     Weighted  Weighted    SSC
                rule   rule   rule   rule   rule   rule   rule   AA rule   MajV rule
Musk1           13.77  13.43  16.18  13.53  16.18  13.42  13.53  13.44     20.08       11.79
Satimage        11.04  11.04  13.02  11.03  12.78  11.04  11.05  11.04     11.00       11.06
Segmentation     3.43   3.36   4.67   3.32   4.75   3.35   3.38   3.36      3.39        3.33
Shuttle          0.12   0.13   0.19   0.13   0.14   0.13   0.13   0.13      0.11        0.14
Wine             2.33   2.26   2.55   2.22   2.67   2.26   2.24   2.24      2.22        2.26

on the particular domain knowledge and data characteristics. Our idea in this paper is to address the voting method from a new angle based on signal strength, and to develop a practical algorithm for the community. Our simulation results demonstrate that the proposed approach can achieve competitive performance across a wide range of datasets. We hope such a method can provide a useful technique and some new insights into addressing this challenging and important research topic in the community.

V. DISCUSSION

There is a connection between the proposed SSC method and SVMs and boosting methods. In this section, we provide a detailed discussion of this connection.

Consider a two-class classification problem. Suppose all instances in the training data D_tr can be represented by {x_q, y_q}, q = 1, . . . , m, where y_q ∈ {−1, +1} is the class identity label associated with x_q. We further assume that h(x) is some fixed nonlinear mapping of instances into the high-dimensional space R^n. Therefore, the maximal margin classifier can be defined by the vector σ which maximizes the following term [60]

\[
\min_{(x,y)\in D_{tr}} \frac{y\,(\sigma \cdot h(x))}{\|\sigma\|_{2}} \tag{37}
\]

where ‖σ‖_2 is the l_2 norm of the vector σ. Therefore, the objective of this approach is to find the optimal hyperplane that maximizes the minimal margin in a high-dimensional space. On the other hand, the key idea of the AdaBoost method is to iteratively update the distribution function over the training dataset D_tr. In this way, on each iteration t = 1, . . . , T, where T is a preset total number of iterations, a distribution function D_t (the initial distribution D_1 can be set to a uniform distribution) is updated sequentially and used to train a new hypothesis

\[
D_{t+1}(q) = \frac{D_t(q)\,\exp\!\big(-y_q\,\sigma_t\,h_t(x_q)\big)}{Z_t} \tag{38}
\]

where σ_t = (1/2) ln((1 − ε_t)/ε_t) and h_t(x_q) is the prediction output of hypothesis h_t on the instance x_q.


TABLE VIII

HYPOTHESIS TESTING RESULTS WITH RELATIVELY STRONG BASE LEARNERS

Dataset         GA      AA      MV       MajV    Max      Min     BC      Weighted  Weighted    Win  Tie  Loss
                rule    rule    rule     rule    rule     rule    rule    AA rule   MajV rule
Musk1           −1.00    0.02   −8.18    −0.26   −8.18     0.05   −0.26   −0.26     −0.26        2    7    0
Satimage         0.27    0.19  −26.08     0.43  −23.73     0.22    0.08    0.23      0.87        2    7    0
Segmentation    −1.20   −0.29  −14.79     0.08  −14.76    −0.25   −0.62   −0.32     −0.73        2    7    0
Shuttle          3.32    0.32   −9.50     1.18    0.58     0.34    0.35    0.74      5.59        1    6    2
Wine            −0.34    0.00   −1.45     0.18   −2.12     0.00    0.12    0.12      0.18        0    9    0
Total                                                                                            7   36    2

TABLE IX

TESTING ERROR PERFORMANCE OF MARGIN-BASED LEARNING METHODS (IN PERCENTAGE)

Dataset            SVM               SVM                   SVM            AdaBoost.M1   AdaBoost.M2   SSC
                   (linear kernel)   (polynomial kernel)   (RBF kernel)
Ecoli              22.11             19.60                 17.64          17.65         17.56         15.19
German             24.53             33.72                 28.08          24.64         23.80         24.72
Glass              40.56             36.93                 35.97          32.13         34.30         32.50
Haberman           26.21             26.60                 26.78          28.44         25.69         25.33
Ionosphere         13.43             14.15                 10.46          12.26         10.17         12.23
Iris                2.77              5.47                  2.96           4.44          4.93          4.52
Letter             15.27              6.29                  5.77          23.57         20.12         18.94
Musk1              20.36             11.87                 43.24          12.96         11.51         13.44
Pima-indians-di    23.40             28.30                 34.43          26.96         26.38         25.24
Satimage           14.05             12.50                  9.18          10.90         12.05         11.06
Segmentation        8.16              3.81                  7.38           2.84          4.79          3.33
Shuttle             2.40              0.36                  0.46           0.05          0.92          0.14
Sonar              25.63             19.95                 24.63          19.66         18.85         21.44
Soybean-small       0.79              1.04                  0.79           6.94          0.42          0.46
Spectf             24.18             23.63                 20.63          24.12         22.69         23.10
Vehicle            21.95             19.41                 30.97          19.21         19.57         17.88
Vowel              25.33              6.05                 20.75          17.25         26.67         22.95
Wdbc                4.86              4.56                  8.90           3.63          3.82          3.27
Wine                5.80              6.31                 26.00           2.71          2.58          2.26
Yeast              47.31             42.62                 44.71          42.96         40.71         40.55
Winning times       2                 1                     3              2             5             6

Here, ε_t is the error of hypothesis h_t over the training set, ε_t = Σ_{q:h_t(x_q)≠y_q} D_t(q), and Z_t = 2√(ε_t(1 − ε_t)) is a normalization factor so that D_{t+1} is a distribution function, i.e., Σ_{q=1}^m D_{t+1}(q) = 1. In this way, the final combined hypothesis is a weighted majority vote of all these sequentially developed hypotheses [60]

\[
f(x) = \frac{\sum_{t=1}^{T} \sigma_t\, h_t(x)}{\sum_{t=1}^{T} \sigma_t}. \tag{39}
\]

In [60], Schapire et al. illustrated that, if one considers the coefficients {σ_t}_{t=1}^T as the coordinates of the vector σ ∈ R^T and the hypothesis outputs {h_t(x)}_{t=1}^T as the coordinates of the vector h(x) ∈ {−1, +1}^T, then (39) can be rewritten as

\[
f(x) = \frac{\sigma \cdot h(x)}{\|\sigma\|_{1}} \tag{40}
\]

where ‖σ‖_1 is the l_1 norm of σ (‖σ‖_1 = Σ_{t=1}^T |σ_t|).
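To make the update (38) and the vote (39) concrete, here is a minimal sketch for labels in {−1, +1}. The `train_fn` callback, assumed to return a hypothesis trained under the current distribution, is a hypothetical placeholder rather than part of the paper's method.

```python
import numpy as np

def adaboost_vote(train_fn, X, y, T=25):
    """Sketch of the distribution update (38) and weighted vote (39).
    `train_fn(X, y, D)` is assumed to return h with h(X) in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # uniform initial distribution D_1
    hypotheses, sigmas = [], []
    for _ in range(T):
        h = train_fn(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()                 # weighted training error eps_t
        if eps == 0.0 or eps >= 0.5:             # stop if the learner is too strong/weak
            break
        sigma = 0.5 * np.log((1.0 - eps) / eps)  # sigma_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-y * sigma * pred)
        D /= D.sum()                             # divide by Z_t so D_{t+1} sums to one
        hypotheses.append(h)
        sigmas.append(sigma)
    if not hypotheses:
        raise ValueError("base learner never achieved error below 0.5")

    def f(X_new):                                # combined hypothesis, eq. (39)
        votes = sum(s * h(X_new) for s, h in zip(sigmas, hypotheses))
        return votes / sum(sigmas)
    return f
```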

Comparing (37) and (40), one can see the links between the maximal margin classifiers and the boosting methods. Both methods aim to find a linear combination with a large margin in high-dimensional spaces. There are also some differences in terms of specific implementation and computational details. For instance, SVMs aim to maximize the minimal margin based on support vectors, whereas boosting aims to minimize an exponential weighting of the instances. In terms of learning mechanisms, SVM relies on the kernel trick for efficient computation in the high-dimensional space, whereas boosting relies on a base learning classifier to explore the high-dimensional space one coordinate at a time. In [61] and [62], Rätsch et al. also provided some interesting insights into the connections between SVM and boosting-like algorithms. They showed that a support vector algorithm can be translated into a boosting-like algorithm by replacing the kernel with an appropriately designed hypothesis space, and vice versa, by translating a hypothesis set into an SV-kernel. Based on these connections and the barrier method, a new one-class leveraging algorithm was proposed for novelty detection.

Based on the analysis in Section III-A and the final voting posterior probability in Figs. 5 and 6, one can see that the proposed method can also increase the separation margin in support of final decision making by maximizing the


Fig. 9. Margin distribution graph for the SSC method, AA rule, and MV rule over the Satimage dataset.

signal strength in the ensemble voting systems. For detailed performance comparisons, Table IX illustrates the testing error performance of the proposed method in comparison to SVM [linear kernel, polynomial kernel with degree of 3, and radial basis function (RBF) kernel] and AdaBoost (AdaBoost.M1 and AdaBoost.M2) algorithms. The AdaBoost algorithms use the neural network model as the relatively strong base learner that is used in Section IV-A. Again, the sigmoid function is used for the activation functions, and backpropagation is used to train the network. We construct 25 boosting hypotheses in each run, and the classification error is calculated based on the average of 100 random runs. The winning times for each method across these 20 datasets are also given at the bottom of Table IX. From our observation, the proposed algorithm can provide competitive results compared with those of SVM and AdaBoost. One should note that the performance of AdaBoost depends on many parameters, such as the choice of base classifier model. By choosing different parameters, the performance of the AdaBoost method may be further improved. In Table X, we present the hypothesis testing results between SVM with a linear kernel and the SSC algorithm. These results show that for 14 out of 20 datasets SSC outperforms SVM with a linear kernel, which demonstrates the effectiveness of the SSC algorithm. There are some interesting discussions about the performance of AdaBoost in the community. For instance, an empirical comparison of SVM and AdaBoost is presented in [63]. Furthermore, Mease and Wyner have presented some interesting experiments and discussions in [64] to show inconsistencies in the performance of AdaBoost algorithms contrary to the statistical explanations and suggestions of boosting. Interested readers can refer to [63] and [64] for details.
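For readers reproducing this comparison with off-the-shelf tools, the sketch below sets up margin-based baselines analogous to those in Table IX. It is a rough stand-in only: the paper's AdaBoost.M1/M2 runs use the authors' own MLP base learner (which AdaBoostClassifier cannot wrap directly, since it requires sample-weight support), and `X_train`/`y_train` are assumed names.

```python
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

# Margin-based baselines analogous to Table IX (illustrative stand-ins).
baselines = {
    "SVM (linear kernel)": SVC(kernel="linear"),
    "SVM (polynomial kernel, degree 3)": SVC(kernel="poly", degree=3),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    # Default decision-tree base learner here, not the paper's MLP.
    "AdaBoost (25 hypotheses)": AdaBoostClassifier(n_estimators=25),
}
# for name, clf in baselines.items():
#     clf.fit(X_train, y_train)
#     print(name, 1.0 - clf.score(X_test, y_test))   # testing error rate
```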

Although Tables II and IX demonstrate that the proposed method can achieve competitive results when compared to the existing voting strategies and the margin-based classifiers, they cannot reflect the margin distribution information. To give a formal discussion of the margin analysis, we adopt the margin and margin distribution graph terminology used in [60].

Definition 1: Consider a classification problem. The classification margin on an instance is the difference between the weight assigned to the correct label and the maximal weight assigned to any single incorrect label, i.e., for an instance {x, y}

\[
\operatorname{margin}(x) = w_{h(x)=y} - \max\{w_{h(x)\neq y}\}. \tag{41}
\]

Definition 2: Given a data distribution D, the margin distribution graph is defined as the fraction of instances whose margin is at most λ, as a function of λ ∈ [−1, 1]

\[
F(\lambda) = \frac{|D_{\lambda}|}{|D|}, \quad \lambda \in [-1, 1] \tag{42}
\]

where D_λ = {x : margin(x) ≤ λ}, | · | stands for the size operation, and F(λ) ∈ [0, 1].
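A small sketch of how Definitions 1 and 2 translate into code is given below; the per-class voting weights are assumed to be available as a row-normalized array, and the plotting of the resulting curve is left out.

```python
import numpy as np

def margins(weights, y):
    """Definition 1: weight on the correct label minus the maximal weight
    on any single incorrect label; `weights` is (n_samples, n_classes)."""
    n = len(y)
    correct = weights[np.arange(n), y]
    others = weights.copy()
    others[np.arange(n), y] = -np.inf            # mask out the correct label
    return correct - others.max(axis=1)

def margin_distribution(margin_values, lambdas):
    """Definition 2: F(lambda) = fraction of instances with margin <= lambda."""
    m = np.asarray(margin_values)
    return np.array([(m <= lam).mean() for lam in lambdas])

# Example usage: F evaluated at 0 is read off the graph as the error rate.
# F = margin_distribution(margins(voting_weights, y_test), np.linspace(-1, 1, 201))
```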

According to these definitions, for the proposed method we can calculate the margin according to P(Y_i | x_t) as defined in (32). Fig. 9 shows an example of the margin distribution graphs for the proposed method on the testing data of the Satimage dataset with respect to the AA rule and MV rule. Here, the x-axis is the margin as defined in (41), and the y-axis is the cumulative distribution based on (42). It is clearly shown that the proposed SSC method can achieve a high margin in this case. For instance, as illustrated by the dash-dotted line, for the proposed method 72.06% of the testing data have a margin less than 0.5, whereas for the MV rule and AA rule, 97.47% and 99.64% of the testing data, respectively, have a margin less than 0.5. This observation is also consistent with our previous analysis. One should also note in the margin distribution graph that the cumulative distribution value (y-axis) corresponding to the margin value of 0 (the dotted line in Fig. 9) represents the classification error rate. In the case of Fig. 9, the classification error for the testing data is 21.73%, 25.12%, and 31.68% for the SSC method, AA rule, and MV rule, respectively. These results are also consistent with the reported results in Table II.

Based on this discussion, we have analyzed the margin distribution graphs for all the datasets for the voting strategies presented in Section II and the boosting methods (the AdaBoost.M1 and AdaBoost.M2 algorithms). Fig. 10(a)–(d) gives several snapshots of this analysis for the Glass and Vehicle datasets for both the training and testing data. In all these figures, the bold line represents the margin distribution of the proposed SSC method. From these figures, one can see that both AdaBoost.M1 and our proposed method can increase the margins to benefit the classification process. Note that AdaBoost.M1 appears to be particularly aggressive in increasing the margins of the data examples: it significantly pushes the margin distribution graph toward both the positive 1 and negative 1 regions. Our simulation results and observations presented here are also consistent with the discussions in [60] regarding this characteristic of the boosting method. As pointed out in [60], large margins (positive and negative) will provide high tolerance to the perturbations of the voting noise for the learning system. We hope the presented analyses in this section bring interesting observations and important insights into the proposed method with respect to other voting strategies in terms of margin analysis.


TABLE X

HYPOTHESIS TESTING RESULTS BETWEEN SSC AND LINEAR SVM

            Ecoli     German   Glass     Haberman  Ionosphere  Iris    Letter   Musk1     Pima    Satimage
t-Test     −17.53      0.85   −12.19     10.81     −3.85       7.75    69.60   −19.93     7.50   −41.08
Win          1         0        1         0         1          0       0        1         0        1
Tie          0         1        0         0         0          0       0        0         0        0
Lose         0         0        0         1         0          1       1        0         1        0

            Segmentation  Shuttle   Sonar   Soybean  Spectf  Vehicle  Vowel    Wdbc     Wine     Yeast    Total
t-Test     −45.24        −48.64    −4.90   −1.09    −2.52   −18.33   −9.08   −10.89   −10.72   −25.55
Win          1             1        1       1        0       1        1       1        1        1        14
Tie          0             0        0       0        1       0        0       0        0        0         2
Lose         0             0        0       0        0       0        0       0        0        0         4

Fig. 10. Margin distribution graphs (cumulative distribution versus margin, with the margin = 0.5 level marked) for SSC, AdaBoost.M1, AdaBoost.M2, and the GA, AA, MV, MajV, Max, Min, Med, BC, weighted AA, and weighted MajV rules. (a) Glass dataset (training data). (b) Glass dataset (testing data). (c) Vehicle dataset (training data). (d) Vehicle dataset (testing data).

VI. CONCLUSION

We proposed a novel classification voting strategy based on signal strength for ensemble learning systems. Theoretical analysis, followed by a voting algorithm, namely SSC, was presented in detail. Simulation results of this method compared with those of nine major voting strategies were used to show its effectiveness. Furthermore, we also discussed the relationship of the proposed method with respect to margin-based classifiers, including the boosting methods (AdaBoost.M1 and AdaBoost.M2) and SVMs, by margin analysis. The margin distribution graph analysis indicates that the proposed method can increase the separation margin in support of final decision-making processes. Based on various real-world machine learning datasets, the proposed method shows competitive results compared with the existing voting strategies and margin-based classifiers.

There are many interesting future research directions under this topic. For instance, we are currently conducting the convergence analysis of the proposed algorithm. Theoretical analysis of the computational complexity of the SSC algorithm is also under investigation. In addition, in this paper we assumed that the data distributions were almost balanced. However, in many real-world datasets, the data under investigation might include highly skewed distributions, i.e., the imbalanced learning problem [65]. In such biased class distributions, how to adjust the threshold to an appropriate value that reflects the imbalanced data characteristics will be a challenging


Fig. 11. Decision profile based on neural network outputs.

task. In our previous research, we presented a critical survey on such imbalanced learning issues, and interested readers can refer to [65] for details. Furthermore, in terms of performance assessment, we only adopted the classification accuracy in this paper. It would be useful to investigate receiver operating characteristic curves or precision-recall curves to obtain a detailed performance analysis under different threshold values. A detailed discussion of such curve-based assessment metrics is also given in [65]. Regarding the number of classifiers involved, it would be interesting to investigate how the proposed SSC method might depend on the number of classifiers in the ensemble learning system. Furthermore, in our current bagging approach, we only considered the neural network with MLP as the base classifier. It would be important to see how other types of base classifiers, such as SVMs, decision trees, and K-nearest neighbors, can contribute to this type of research. In addition to the classification tasks discussed in this paper, it is also possible to extend the SSC approach to regression problems with appropriate adjustment of the algorithm. Finally, the sizes of the datasets considered in this paper are moderate (the largest dataset has 59 000 instances), and the decision profile matrix, for example, might depend on the amount of data. It would also be interesting to study how such an approach performs on extremely large-scale data and, more generally, whether and how the amount of data used in the experiments might affect the results. All of these issues will not only introduce the challenge of how to develop an effective ensemble classifier system, but will also be critical for assessing the scalability and robustness of the proposed approach with different amounts of data.

APPENDIX A

In this appendix, we provide a discussion regarding the decision profile for the proposed SSC method.

For the base classifiers that can provide soft-type outputs (continuous values), such as neural network models, one can directly use a scaled output value to obtain the decision profile information needed to calculate the signal strength and uncertainty degree values. Fig. 11 illustrates this idea for a neural network model with C output neurons, each of which represents a class identity label. In this case, the decision profile element can be decided

by the normalized output value from each corresponding output neuron. In this paper, we use linear normalization to obtain the decision profile as shown in Fig. 11. Interested readers can refer to [66] for more normalization methods.
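As a minimal illustration of this step, the sketch below linearly rescales one network's output-neuron activations into a decision profile row. The particular min-shift-then-normalize form is an assumption, since the text only states that linear normalization is used (see [66] for alternatives).

```python
import numpy as np

def decision_profile_row(raw_outputs):
    """Linearly normalize one classifier's C output activations into a
    decision profile row (nonnegative values summing to one)."""
    o = np.asarray(raw_outputs, dtype=float)
    o = o - o.min()                        # shift so the smallest output is 0
    total = o.sum()
    return o / total if total > 0 else np.full_like(o, 1.0 / o.size)

print(decision_profile_row([0.9, 0.2, -0.1]))   # -> approx. [0.77, 0.23, 0.00]
```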

We would also like to point out that, for many of the standard off-the-shelf base learning algorithms, it is generally quite straightforward to transform hard-type classifiers into soft-type classifiers [19]. For instance, in [67], methods of generating the decision profiles from different types of classifiers, including crisp classifiers, fuzzy classifiers, and possibilistic classifiers, are summarized. Based on the decision profiles over the training dataset, one decision template can be constructed for each class label. Then, a multiclassifier fusion method is developed by calculating the similarity between the decision template and the decision profile of incoming testing samples. One can refer to [67] for further details.

APPENDIX B

In this appendix, we provide the proof of Theorem 1 in Section III-B.

According to the sensitivity definition in [50], we have

\[
\xi^{\beta}_{s} = \frac{s\,\partial\beta}{\beta\,\partial s} = \frac{0.5 + s}{0.5 - s} \tag{43}
\]

\[
\xi^{\bar\beta}_{s} = \frac{s\,\partial\bar\beta}{\bar\beta\,\partial s}
= \frac{s\,\partial\bar\beta}{\bar\beta\,\partial\beta}\cdot\frac{\partial\beta}{\partial s}
= \frac{\beta\,\partial\bar\beta}{\bar\beta\,\partial\beta}\cdot\xi^{\beta}_{s}
= \frac{\alpha\,\beta\,(0.25 - \bar\beta^{2})}{\bar\beta}\cdot\xi^{\beta}_{s}. \tag{44}
\]

In order to find the relationship between ξ^β_s and ξ^β̄_s, we consider the first factor on the right-hand side of (44) and define

\[
K(s) = \alpha\,\beta\,(0.25 - \bar\beta^{2}) - \bar\beta. \tag{45}
\]

By taking the gradient of K(s) with respect to s [the dependence of the right-hand side of the above equation on s can be found from (34)], we have

\[
\frac{dK}{ds} = \alpha\,(0.25 - \bar\beta^{2})\,\frac{\partial\beta}{\partial s} - \big(2\alpha\,\beta\,\bar\beta + 1\big)\,\frac{\partial\bar\beta}{\partial s}
= -2\alpha^{2}\,\beta\,\bar\beta\,(0.25 - \bar\beta^{2})\,\frac{\partial\beta}{\partial s}
= -2\alpha^{2}\,\beta\,\bar\beta\,(0.25 - \bar\beta^{2})\,\frac{0.5 + s}{(0.5 - s)^{3}}. \tag{46}
\]


Since s ∈ [0, 0.5], β ∈ [0, +∞), β̄ ∈ [0, 0.5), and α > 0, we have

\[
\frac{dK}{ds} \le 0, \quad \text{and} \quad \frac{dK}{ds} = 0 \ \text{if and only if}\ s = 0. \tag{47}
\]

This means that K(s) is a monotonically decreasing function for s ∈ [0, 0.5]. Since K(0) = 0, we have K(s) ≤ K(0) = 0. This means that α·β·(0.25 − β̄²)/β̄ ≤ 1. According to (44), we have

\[
\xi^{\bar\beta}_{s} \le \xi^{\beta}_{s}, \quad \text{and} \quad \xi^{\bar\beta}_{s} = \xi^{\beta}_{s}\ \text{if and only if}\ s = 0. \tag{48}
\]
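As a sanity check of (48), the snippet below evaluates both sensitivities numerically over random (α, β, s) triples. Since the closed form of β̄ is defined earlier in the paper (eq. (29)), the expression β̄ = 1/(1 + e^(−αβ)) − 0.5 used here is an assumption that is consistent with (51) and with the stated range β̄ ∈ [0, 0.5).

```python
import numpy as np

def sensitivities(alpha, beta, s):
    """Evaluate xi_s^beta from (43) and xi_s^beta_bar from (44), with
    beta_bar = 1/(1 + exp(-alpha*beta)) - 0.5 assumed (consistent with (51))."""
    beta_bar = 1.0 / (1.0 + np.exp(-alpha * beta)) - 0.5
    xi_beta = (0.5 + s) / (0.5 - s)                                         # eq. (43)
    xi_beta_bar = alpha * beta * (0.25 - beta_bar**2) / beta_bar * xi_beta  # eq. (44)
    return xi_beta, xi_beta_bar

rng = np.random.default_rng(0)
alphas = rng.uniform(0.1, 10.0, 1000)   # alpha > 0
betas = rng.uniform(0.1, 50.0, 1000)    # beta in (0, +inf)
ss = rng.uniform(0.0, 0.49, 1000)       # s in [0, 0.5)
xi_b, xi_bb = sensitivities(alphas, betas, ss)
assert np.all(xi_bb <= xi_b * (1.0 + 1e-9))   # matches inequality (48)
```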

APPENDIX C

In this appendix, we provide a proof and discussion of the relation between β_Yi and β̄_Yi when α → 0 in (30).

When α approaches zero, i.e., α → 0, one can see from (29) that β̄_Yi will also approach zero, i.e., β̄_Yi → 0. Then, (30) can be evaluated according to L'Hôpital's rule. Here we define

\[
Q_{\bar\beta} = \frac{\sum_{k=1}^{L} \bar\beta_{Y_i}(k)\, S_{Y_i}(k)}{\sum_{k=1}^{L} \bar\beta_{Y_i}(k)\, N_{Y_i}(k)}. \tag{49}
\]

Therefore, one can get

\[
\lim_{\alpha\to 0} Q_{\bar\beta}
= \lim_{\alpha\to 0} \frac{\frac{d}{d\alpha}\big(\sum_{k=1}^{L} \bar\beta_{Y_i}(k)\, S_{Y_i}(k)\big)}{\frac{d}{d\alpha}\big(\sum_{k=1}^{L} \bar\beta_{Y_i}(k)\, N_{Y_i}(k)\big)}
= \lim_{\alpha\to 0} \frac{\sum_{k=1}^{L} S_{Y_i}(k)\, \frac{d\bar\beta_{Y_i}(k)}{d\alpha}}{\sum_{k=1}^{L} N_{Y_i}(k)\, \frac{d\bar\beta_{Y_i}(k)}{d\alpha}}. \tag{50}
\]

To calculate dβ̄_Yi(k)/dα, we have

\[
\frac{d\bar\beta_{Y_i}(k)}{d\alpha} = \frac{\beta_{Y_i}(k)\, e^{-\alpha \beta_{Y_i}(k)}}{\big(1 + e^{-\alpha \beta_{Y_i}(k)}\big)^{2}}. \tag{51}
\]

Substituting (51) into (50), we get

\[
\lim_{\alpha\to 0} Q_{\bar\beta}
= \lim_{\alpha\to 0} \frac{\sum_{k=1}^{L} S_{Y_i}(k)\, \frac{\beta_{Y_i}(k)\, e^{-\alpha \beta_{Y_i}(k)}}{(1 + e^{-\alpha \beta_{Y_i}(k)})^{2}}}{\sum_{k=1}^{L} N_{Y_i}(k)\, \frac{\beta_{Y_i}(k)\, e^{-\alpha \beta_{Y_i}(k)}}{(1 + e^{-\alpha \beta_{Y_i}(k)})^{2}}}
= \frac{\sum_{k=1}^{L} \beta_{Y_i}(k)\, S_{Y_i}(k)}{\sum_{k=1}^{L} \beta_{Y_i}(k)\, N_{Y_i}(k)}. \tag{52}
\]

Comparing (52) and (49), one can see that, in this case, it is equivalent to using the original β_Yi in (30) for the proposed method.
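The limit in (50)–(52) is easy to confirm numerically, as sketched below. As in the previous appendix, the explicit form β̄_Yi(k) = 1/(1 + e^(−αβ_Yi(k))) − 0.5 is an assumed stand-in consistent with (51), and the β, S, and N values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 10                                   # number of classifiers in the ensemble
beta = rng.uniform(0.1, 5.0, L)          # beta_Yi(k) values
S = rng.uniform(0.0, 1.0, L)             # S_Yi(k) terms
N = rng.uniform(0.1, 1.0, L)             # N_Yi(k) terms

def Q_bar(alpha):
    """Ratio (49) computed with beta_bar weights (assumed sigmoid-shift form)."""
    beta_bar = 1.0 / (1.0 + np.exp(-alpha * beta)) - 0.5
    return (beta_bar * S).sum() / (beta_bar * N).sum()

target = (beta * S).sum() / (beta * N).sum()      # right-hand side of (52)
for alpha in (1.0, 1e-2, 1e-4, 1e-6):
    print(f"alpha={alpha:g}  Q_bar={Q_bar(alpha):.6f}  limit={target:.6f}")
```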

REFERENCES

[1] J. Fürnkranz, “Hyperlink ensembles: A case study in hypertext classification,” Inf. Fusion, vol. 3, no. 4, pp. 299–312, Dec. 2002.
[2] H. He and X. Shen, “Bootstrap methods for foreign currency exchange rates prediction,” in Proc. Int. Conf. Artif. Intell., 2007, pp. 1272–1277.
[3] M. Pal, “Ensemble learning with decision tree for remote sensing classification,” in Proc. World Acad. Sci., Eng. Technol., vol. 26, 2007, pp. 735–737.
[4] G. Giacinto and F. Roli, “Design of effective neural network ensembles for image classification purposes,” Image Vis. Comput., vol. 19, nos. 9–10, pp. 699–707, 2001.
[5] K. Tumer, N. Ramanujam, J. Ghosh, and R. Richards-Kortum, “Ensembles of radial basis function networks for spectroscopic detection of cervical precancer,” IEEE Trans. Neural Netw., vol. 45, no. 8, pp. 953–961, Aug. 1998.
[6] W. Y. Goh, C. P. Lim, and K. K. Peh, “Predicting drug dissolution profiles with an ensemble of boosted neural networks: A time series approach,” IEEE Trans. Neural Netw., vol. 14, no. 2, pp. 459–463, Mar. 2003.
[7] M. D. Muhlbaier, A. Topalis, and R. Polikar, “Learn++.NC: Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 152–168, Jan. 2009.
[8] R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517–1531, Oct. 2011.
[9] G. L. Grinblat, L. C. Uzal, H. A. Ceccatto, and P. M. Granitto, “Solving nonstationary classification problems with coupled support vector machines,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 37–51, Jan. 2011.
[10] M.-T. Pham and T.-J. Cham, “Online learning asymmetric boosted classifiers for object detection,” in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8.
[11] C. Kelly, D. Spears, C. Karlsson, and P. Polyakov, “An ensemble of anomaly classifiers for identifying cyber attacks,” in Proc. Int. SIAM Workshop Feature Sel. Data Mining, 2006, pp. 1–8.
[12] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[13] P. Melville and R. J. Mooney, “Constructing diverse classifier ensembles using artificial training examples,” in Proc. 18th Int. Joint Conf. Artif. Intell., 2003, pp. 505–510.
[14] P. Melville and R. J. Mooney, “Creating diversity in ensembles using artificial data,” J. Inf. Fusion, vol. 6, no. 1, pp. 99–111, 2004.
[15] P. Melville and R. J. Mooney, “Diverse ensembles for active learning,” in Proc. 21st Int. Conf. Mach. Learn., 2004, pp. 584–591.
[16] L. I. Kuncheva and C. J. Whitaker, “Measures of diversity in classifier ensembles,” Mach. Learn., vol. 51, no. 2, pp. 181–207, 2003.
[17] T. Windeatt, “Accuracy/diversity and ensemble MLP classifier design,” IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1194–1211, Sep. 2006.
[18] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, no. 2, pp. 123–140, 1996.
[19] Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Proc. 13th Int. Conf. Mach. Learn., 1996, pp. 148–156.
[20] Y. Freund and R. E. Schapire, “Decision-theoretic generalization of on-line learning and application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
[21] R. E. Schapire, “The strength of weak learnability,” Mach. Learn., vol. 5, no. 2, pp. 197–227, 1990.
[22] N. García-Pedrajas, “Constructing ensembles of classifiers by means of weighted instance selection,” IEEE Trans. Neural Netw., vol. 20, no. 2, pp. 258–277, Feb. 2009.
[23] C. Ji and S. Ma, “Combinations of weak classifiers,” IEEE Trans. Neural Netw., vol. 8, no. 1, pp. 32–42, Jan. 1997.
[24] V. Gómez-Verdejo, J. Arenas-García, and A. R. Figueiras-Vidal, “A dynamically adjusted mixed emphasis method for building boosting ensembles,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 3–17, Jan. 2008.
[25] T. K. Ho, “Nearest neighbors in random subspaces,” in Proc. Int. Workshop Stat. Tech. Pattern Recognit., 1998, pp. 640–648.
[26] T. K. Ho, “Random decision forests,” in Proc. 3rd Int. Conf. Document Anal. Recognit., Aug. 1995, pp. 278–282.
[27] T. K. Ho, “C4.5 decision forests,” in Proc. 14th Int. Conf. Pattern Recognit., vol. 1, Aug. 1998, pp. 545–549.
[28] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[29] H. He and X. Shen, “A ranked subspace learning method for gene expression data classification,” in Proc. Int. Conf. Artif. Intell., 2007, pp. 358–364.
[30] J. J. Rodríguez, L. I. Kuncheva, and C. J. Alonso, “Rotation forest: A new classifier ensemble method,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 10, pp. 1619–1630, Oct. 2006.
[31] D. H. Wolpert, “Stacked generalization,” Neural Netw., vol. 5, no. 2, pp. 241–259, 1992.
[32] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Comput., vol. 3, no. 1, pp. 79–87, 1991.
[33] M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,” Neural Comput., vol. 6, no. 2, pp. 181–214, 1994.
[34] B. Wang and H. D. Chiang, “ELITE: Ensemble of optimal input-pruned neural networks using TRUST-TECH,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 96–109, Jan. 2011.


[35] P. A. D. Castro and F. J. V. Zuben, “Learning ensembles of neural networks by means of a Bayesian artificial immune system,” IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 304–316, Feb. 2011.
[36] A. Rahman and B. Verma, “Novel layered clustering-based approach for generating ensemble of classifiers,” IEEE Trans. Neural Netw., vol. 22, no. 5, pp. 781–792, Apr. 2011.
[37] T. Windeatt and C. Zor, “Minimising added classification error using Walsh coefficients,” IEEE Trans. Neural Netw., vol. 22, no. 8, pp. 1334–1339, Aug. 2011.
[38] T. Windeatt, R. Duangsoithong, and R. Smith, “Embedded feature ranking for ensemble MLP classifiers,” IEEE Trans. Neural Netw., vol. 22, no. 6, pp. 988–994, Jun. 2011.
[39] H. Yang, Z. Xu, J. Ye, I. King, and M. R. Lyu, “Efficient sparse generalized multiple kernel learning,” IEEE Trans. Neural Netw., vol. 22, no. 3, pp. 433–446, Mar. 2011.
[40] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 3rd ed. New York: Elsevier, 2006, pp. 181–197.
[41] T. K. Ho, J. Hull, and S. Srihari, “Decision combination in multiple classifier systems,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 1, pp. 66–75, Jan. 1994.
[42] J. Kittler and F. M. Alkoot, “Sum versus vote fusion in multiple classifier systems,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 1, pp. 110–115, Jan. 2003.
[43] K. Tumer and J. Ghosh, “Theoretical foundations of linear and order statistics combiners for neural pattern classifiers,” Computer Vision Research Center, Univ. Texas, Austin, Tech. Rep. TR-95-02-98, 1995.
[44] L. I. Kuncheva, “A theoretical study on six classifier fusion strategies,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 281–286, Feb. 2002.
[45] A. Narasimhamurthy, “Theoretical bounds of majority voting performance for a binary classification problem,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 12, pp. 1988–1995, Dec. 2005.
[46] O. R. Terrades, E. Valveny, and S. Tabbone, “Optimal classifier fusion in a non-Bayesian probabilistic framework,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1630–1644, Sep. 2009.
[47] X. Artaechevarria, A. M. Barrutia, and C. O. Solórzano, “Combination strategies in multi-atlas image segmentation: Application to brain MR data,” IEEE Trans. Med. Imag., vol. 28, no. 8, pp. 1266–1277, Aug. 2009.
[48] H. Chen and X. Yao, “Regularized negative correlation learning for neural network ensembles,” IEEE Trans. Neural Netw., vol. 20, no. 12, pp. 1962–1979, Dec. 2009.
[49] H. Cevikalp and R. Polikar, “Local classifier weighting by quadratic programming,” IEEE Trans. Neural Netw., vol. 19, no. 10, pp. 1832–1838, Oct. 2008.
[50] J. Vlach and K. Singhal, Computer Methods for Circuit Analysis and Design, 2nd ed. Norwell, MA: Kluwer, 1994, pp. 185–214.
[51] UCI Machine Learning Repository [Online]. Available: http://mlearn.ics.uci.edu/MLRepository.html
[52] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999.
[53] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical study,” J. Artif. Intell. Res., vol. 11, no. 1, pp. 169–198, 1999.
[54] K. Potter, “Methods for presenting statistical information: The box plot,” in Visualization of Large and Unstructured Data Sets (LNI), vol. S-4, 2006, pp. 97–106.
[55] R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits Syst. Mag., vol. 6, no. 3, pp. 21–45, Jul. 2006.
[56] T. G. Dietterich, “An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization,” Inf. Softw. Technol., vol. 40, no. 2, pp. 139–157, 2000.
[57] L. I. Kuncheva, “Switching between selection and fusion in combining classifiers: An experiment,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 32, no. 2, pp. 146–156, Apr. 2002.
[58] D. Tax, M. van Breukelen, R. P. W. Duin, and J. Kittler, “Combining multiple classifiers by averaging or by multiplying?” Pattern Recognit., vol. 33, no. 9, pp. 1475–1485, 2000.
[59] D. H. Wolpert and W. Macready, “No free lunch theorems for optimization,” IEEE Trans. Evol. Comput., vol. 1, no. 1, pp. 67–82, Apr. 1997.
[60] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, “Boosting the margin: A new explanation for the effectiveness of voting methods,” Ann. Stat., vol. 26, no. 5, pp. 1624–1686, 1998.
[61] G. Rätsch, S. Mika, B. Schölkopf, and K. Müller, “Constructing boosting algorithms from SVMs: An application to one-class classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1184–1199, Sep. 2002.
[62] G. Rätsch, B. Schölkopf, S. Mika, and K.-R. Müller, “SVM and boosting: One class,” GMD FIRST, Berlin, Germany, Tech. Rep. 119, Nov. 2000.
[63] E. Romero, X. Carreras, and L. Marquez, “Exploiting diversity of margin-based classifiers,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 1, Jul. 2004, pp. 419–424.
[64] D. Mease and A. Wyner, “Evidence contrary to the statistical view of boosting,” J. Mach. Learn. Res., vol. 9, no. 2, pp. 131–156, 2008.
[65] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[66] S. Ribaric and I. Fratric, “A matching-score normalization technique for multimodal biometric systems,” in Proc. 3rd COST 275 Workshop: Biometrics Internet, Hatfield, U.K., Oct. 2005, pp. 55–58.
[67] L. I. Kuncheva, J. C. Bezdek, and R. P. W. Duin, “Decision templates for multiple classifier fusion: An experimental comparison,” Pattern Recognit., vol. 34, no. 2, pp. 299–314, 2001.

Haibo He (SM'11) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University, Athens, OH, in 2006.

He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, from 2006 to 2009. He is currently an Assistant Professor with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI. He has published one research book (Wiley) and edited six conference proceedings (Springer). He has authored or co-authored over 100 peer-reviewed papers published in journals and conference proceedings. His research has been covered by national and international media, such as the IEEE Smart Grid Newsletter, The Wall Street Journal, and Providence Business News. His current research interests include adaptive dynamic programming, machine learning, computational intelligence, very large scale integration and field-programmable gate array design, and various applications such as the smart grid.

Dr. He received the National Science Foundation CAREER Award in 2011 and the Providence Business News Rising Star Innovator of the Year Award in 2011. Currently, he is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID.

Yuan Cao (S'05) received the B.S. and M.S. degrees from Zhejiang University, Hangzhou, China, in 2001 and 2004, respectively, and the M.S. degree from Oklahoma State University, Stillwater, all in electrical engineering, and the Ph.D. degree in computer engineering from the Stevens Institute of Technology, Hoboken, NJ, in 2011.

He is currently with MathWorks, Inc., Natick, MA. His current research interests include computational intelligence, pattern recognition, machine learning and data mining, and their applications in parallel computing and multicore systems.