

Expert Systems with Applications 40 (2013) 2807–2816


Toward the scalability of neural networks through feature selection

D. Peteiro-Barral*, V. Bolón-Canedo, A. Alonso-Betanzos, B. Guijarro-Berdiñas, N. Sánchez-Maroño
Laboratory for Research and Development in Artificial Intelligence (LIDIA), Computer Science Dept., University of A Coruña, 15071 A Coruña, Spain

Abstract

In the past few years, the bottleneck for machine learning developers is no longer the limited data available but the algorithms' inability to use all the data in the available time. For this reason, researchers are now interested not only in the accuracy but also in the scalability of machine learning algorithms. To deal with large-scale databases, feature selection can be helpful to reduce their dimensionality, turning an impracticable algorithm into a practical one. In this research, the influence of several feature selection methods on the scalability of four of the most well-known training algorithms for feedforward artificial neural networks (ANNs) will be analyzed over both classification and regression tasks. The results demonstrate that feature selection is an effective tool to improve scalability.

Keywords: Neural networks; Machine learning; Feature selection; High dimensional datasets

© 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2012.11.016

* Corresponding author. E-mail address: [email protected] (D. Peteiro-Barral).

1. Introduction

Machine learning algorithms are increasingly being applied to data sets of high dimensionality, that is, data sets that present several of the following characteristics: the number of learning samples is very high, the number of input features is very high, or the number of groups or classes to be classified is very high. Most algorithms were developed when data set sizes were much smaller, but nowadays distinct compromises are required for the cases of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation trade-off. In the case of large-scale learning problems, the trade-off is more complex because it involves not only the accuracy but also the computational complexity of the learning algorithm. Moreover, the problem here is that the majority of algorithms were designed under the assumption that the data set would be represented as a single memory-resident table. So if the entire data set does not fit in main memory, these algorithms are useless.

For all these reasons, scaling up learning algorithms is a trending issue. The organization of the workshop "PASCAL Large Scale Learning Challenge" at the 25th International Conference on Machine Learning (ICML'08), and the workshop "Big Learning" at the conference of the Neural Information Processing Systems Foundation (NIPS 2011), are cases in point. Scaling up is desirable because increasing the size of the training set often increases the accuracy of algorithms (Catlett, 1991). Scalability is defined as the effect that an increase in the size of the training set has on the computational performance of an algorithm: accuracy, training time and allocated memory. Thus the challenge is to find a trade-off among them or, in other words, to get "good enough" solutions as "fast" as possible and as "efficiently" as possible. This issue becomes critical in situations in which there exist temporal or spatial constraints, such as real-time applications dealing with large data sets, unapproachable computational problems requiring learning, or initial prototyping requiring quickly-implemented solutions.

Data partitioning is probably one of the most promising lines of research for scaling up learning algorithms; it involves breaking the data set up into subsets and learning from one or more of the subsets, thus avoiding processing data sets that are too large for main memory. On the other hand, selecting a subset of features, also known as feature selection, is a straightforward method for reducing problem size that is often forgotten in discussions of scaling. Feature selection is usually employed to avoid over-fitting (especially with small-size data sets), train more reliable learners or provide more insight into the underlying causal relationships; but as the number of samples increases (and thereby feature selection becomes less necessary from a data-fitting perspective), feature selection becomes more necessary from both run-time and spatial-complexity perspectives. The process of selecting a subset of features from the data set and ignoring irrelevant features for training could be an effective way to improve the performance and decrease the training time and memory requirements of a learning algorithm, thus scaling it up.

In this paper, we study the influence of feature selection methods on the scalability of ANN training algorithms by using the measures defined during the PASCAL workshop (Sonnenburg, Franc, Yom-Tov, & Sebag, 2008). These measures evaluate the scalability of algorithms in terms of error, computational effort, allocated memory and training time. Previous results (Peteiro-Barral, Guijarro-Berdinas, Pérez-Sánchez, & Fontenla-Romero, 2011) showed that popular algorithms for ANNs are unable to deal with very large data sets, so preprocessing methods may be desirable for reducing the input space size and improving scalability. We plan to demonstrate that feature selection methods are an appropriate approach to improve scalability. By reducing the number of input features and, consequently, the dimensionality of the data set, we expect to reduce the computational time while maintaining the performance on the other measures mentioned above, as well as being able to apply certain algorithms that could not otherwise deal with large data sets. There are three main models that deal with feature selection: filters, wrappers and embedded methods (Guyon, 2006). Although wrappers and embedded methods tend to obtain better performances, they are very time consuming and cannot deal with high dimensional data sets without compromising the time and memory requirements of machine learning algorithms; therefore this work will be focused on the filter approach.

The rest of the paper is structured as follows: Section 2 presents the background, Section 3 describes the feature selection methods employed, Section 4 introduces the experimental study and Sections 5 and 6 show the results and discussion, respectively. Finally, some conclusions are included in Section 7.

2. Background

The appearance of very large data sets is not sufficient to motivate scaling efforts. The most commonly cited reason for scaling up algorithms is the (typically) increasing accuracy of algorithms when the size of the training data set increases (Catlett, 1991). In fact, learning from small data sets frequently decreases the accuracy of algorithms as a result of over-fitting.

For most scaling problems the limiting factor has been the number of samples and the number of features describing each sample. The growth rate of the training time of an algorithm as the data set size increases is an obvious question that arises. But temporal complexity does not reflect scaling in its entirety, and must be used in conjunction with other metrics. For scaling up learning algorithms the issue is not so much one of speeding up a slow algorithm as one of turning an impracticable algorithm into a practical one. The crucial point in question is seldom how fast you can run on a certain problem but rather how large a problem you can deal with (Provost & Kolluri, 1999). More precisely, space considerations are critical to scale up learning algorithms. The absolute size of the main memory plays a key role in this matter. Almost all existing implementations of learning algorithms operate with the training set entirely in main memory. If the spatial complexity of the algorithm exceeds the main memory then the algorithm will not scale well (regardless of its computational complexity) because page thrashing renders algorithms useless. Page thrashing is the consequence of many accesses to disk occurring in a short time, drastically cutting the performance of a system using virtual memory. Virtual memory is a technique for making a machine behave as if it had more memory than it really has, by using disk space to simulate RAM. But accessing disk is much slower than accessing RAM. In the worst-case scenario, out-of-memory exceptions will make algorithms unfeasible in practice.

It is a fact that most existing learning algorithms were developed to handle medium-size data sets. A promising approach to scale up algorithms is to circumvent the need to run them on very large data sets by partitioning the data. The data partitioning approach involves breaking the data set up into subsets, learning from one or more of the subsets, and possibly combining the results. On the one hand, data partitioning methods can be categorized based on whether they separate subsets of instances or subsets of features (Provost & Kolluri, 1999). On the other hand, they can be categorized based on whether they learn from single subsets or from multiple subsets. Furthermore, multiple subsets can be processed in sequence or concurrently. The methods most used in practice for data partitioning are the following:

• Instance sampling (single subset separated by instances).
• Feature selection (single subset separated by features).
• Incremental learning (multiple subsets processed in sequence).
• Distributed learning (multiple subsets processed concurrently).

2.1. Instance sampling

A common method to deal with large data sets is to select a single, smaller subset of samples. Instance sampling methods differ in the particular selection procedure used. The main methods for instance sampling involve simple random sampling, duplicate compaction, which removes duplicated instances from a data set, and stratified sampling, which selects instances of the minority classes with a greater frequency in order to even out the distribution (Catlett, 1991). The most important but very difficult task is to determine the appropriate sample size that maintains an acceptable accuracy, and some work on adaptive sampling has been carried out (Domingo, Gavaldà, & Watanabe, 2002; Weiss & Provost, 2001).
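As an illustration of the stratified sampling idea just described, the following Python sketch draws a class-balanced subsample from numpy arrays; the function name and the equal-quota scheme are ours, not taken from the cited works.

```python
import numpy as np

def stratified_sample(X, y, n_samples, rng=None):
    """Draw a class-balanced subsample: minority classes are selected with a
    greater relative frequency so the returned subset is evenly distributed."""
    rng = np.random.default_rng(rng)
    classes = np.unique(y)
    per_class = n_samples // len(classes)
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        take = min(per_class, members.size)   # a class may be smaller than its quota
        idx.append(rng.choice(members, size=take, replace=False))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```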

2.2. Feature selection

An orthogonal approach to instance sampling is feature selection, which selects a single, smaller subset of features. To the authors' knowledge, the majority of the literature on feature selection has not focused on scalability directly but on increasing the accuracy of learning algorithms by reducing the number of features in a proper manner. The three main methods for feature selection are the following (Guyon, 2006):

• Filters rely on the general characteristics of the training data, carrying out the feature selection process as a pre-processing step independently of the learning algorithm (a minimal example follows this list).
• Wrappers involve a learning algorithm as a black box and consist of using its prediction performance to assess the relative usefulness of subsets of variables. In other words, the feature selection algorithm uses the learning algorithm as a subroutine, with the computational burden that comes from calling the learning algorithm to evaluate each subset of features.
• Embedded methods perform feature selection in the process of training the classifier and are usually specific to given learning machines. Therefore, the search for an optimal subset of features is built into the classifier construction, and can be seen as a search in the combined space of feature subsets and hypotheses.
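To make the filter scheme concrete, the sketch below applies a generic filter before training a feedforward network with scikit-learn. The mutual-information score is only a stand-in for the filters studied in this paper, and the synthetic dataset and parameter values are illustrative, not the paper's experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neural_network import MLPClassifier

# Toy data standing in for a large dataset.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=8, random_state=0)

# Filter step: score features independently of any learner and keep the best 8.
selector = SelectKBest(score_func=mutual_info_classif, k=8).fit(X, y)
X_reduced = selector.transform(X)

# The learner is trained afterwards on the reduced input space.
clf = MLPClassifier(hidden_layer_sizes=(17,), max_iter=200).fit(X_reduced, y)
print(clf.score(X_reduced, y))
```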

Feature selection has proven to be effective in many diverse fields, such as DNA microarray analysis (Yu & Liu, 2004), intrusion detection (Bolón-Canedo, Sánchez-Maroño, & Alonso-Betanzos, 2010; Lee, Stolfo, & Mok, 2000), text categorization (Forman, 2003; Gomez, Boiy, & Moens, 2011) or information retrieval (Egozi, Gabrilovich, & Markovitch, 2008), including image retrieval (Dy, Brodley, Kak, Broderick, & Aisen, 2003) or music information retrieval (Saari, Eerola, & Lartillot, 2011).

2.3. Incremental learning

Incremental learning methods learn from multiple subsets of data processed in sequence. Incremental learning is a type of learning where the learner updates its model whenever new data become available. In other words, when multiple subsets are processed in sequence it is possible to take advantage of the knowledge learned in previous steps to guide learning in the next step. Incremental learning has been used to scale up to data sets that are too large for batch learning because of the limits of main memory. Some examples of incremental algorithms can be found in the literature (Parikh & Polikar, 2007; Pérez-Sánchez, Fontenla-Romero, & Guijarro-Berdiñas, 2010; Polikar, Upda, Upda, & Honavar, 2001; Ruping, 2001).

2.4. Distributed learning

Distributed learning methods learn from multiple subsets of data processed concurrently. In order to increase efficiency, learning can be parallelized by distributing the subsets of data to multiple processors, learning in parallel and then combining the results. Distributed learning precludes the sequential dependence of incremental learning, where prior knowledge is needed as input to a subsequent step. Distributed learning has been used to scale up to data sets that are too large for batch learning and for which incremental learning is too slow. Some examples of distributed algorithms can be found in the literature (Ananthanarayana, Subramanian, & Murty, 2000; Chan & Stolfo, 1993; Peteiro-Barral, Guijarro-Berdinas, Pérez-Sánchez, & Fontenla-Romero, 2011; Tsoumakas & Vlahavas, 2002). All of these algorithms process instances in a distributed manner. While not common, there are some other developments that process features (Kargupta, Park, Sanseverino, Silvestre, & Hershberger, 1998; McConnell & Skillicorn, 2004; Skillicorn & McConnell, 2008).

Instance sampling offers speedups because learning from fewer instances is faster. But Catlett's work (Catlett, 1991) showed that, in general, learning from a smaller subset of data decreases accuracy, and consequently this approach is discarded in this research. In previous works, some research was done on both incremental (Pérez-Sánchez et al., 2010) and distributed learning (Peteiro-Barral et al., 2011). However, the impact of feature selection on the scalability of learning algorithms has not yet been explored in depth (Bolón-Canedo, Peteiro-Barral, Alonso-Betanzos, Guijarro-Berdinas, & Sánchez-Marono, 2011).

3. Filter methods for feature selection

Feature selection consists of detecting the relevant features and discarding the irrelevant ones. It has several advantages (Guyon, 2006), such as:

• Improving the performance of machine learning algorithms.
• Data understanding, gaining knowledge about the process and perhaps helping to visualize it.
• Data reduction, limiting storage requirements and perhaps helping to reduce costs.
• Simplicity, the possibility of using simpler models and gaining speed.

The benefits provided by feature selection may help in reducing the computational effort, allocated memory and training time, the measures that will be considered to study the scalability of machine learning algorithms.

As stated in the Introduction, this work focuses on the filter model, due to the high computational demand of wrappers and embedded methods. Filters rely on the general characteristics of the training data in order to select features independently of any predictor, and are usually computationally less expensive than wrappers and embedded methods. Furthermore, filters have the ability to scale to large data sets and result in better generalization because they act independently of the induction algorithm. There exist two major approaches for filtering: individual evaluation and subset evaluation (Yu & Liu, 2004). On the one hand, individual evaluation, also known as feature ranking, assesses individual features by assigning them weights according to their degree of relevance. On the other hand, subset evaluation produces candidate feature subsets based on a certain search strategy. Each candidate subset is evaluated by a certain evaluation measure and compared with the previous best one with respect to this measure. While individual evaluation is not able to remove redundant features, because redundant features are likely to have similar rankings, the subset evaluation approach can handle feature redundancy together with feature relevance. However, methods in this framework can suffer from the inevitable problem caused by searching through the feature subsets required in the subset generation step, and thus both approaches will be studied in this research. In what follows, the four filters used in this work are described. The first three follow the subset evaluation framework whilst the last one belongs to the individual evaluation approach.

3.1. Correlation-based Feature Selection (CFS)

Correlation-based Feature Selection (CFS) is a simple filter algorithm suitable for both classification and regression tasks. It ranks feature subsets according to a correlation-based heuristic evaluation function (Hall, 1999). The bias of the evaluation function is toward subsets that contain features highly correlated with the output to be predicted and uncorrelated with each other. Irrelevant features should be ignored because they will have low correlation with the class. Redundant features should be screened out as they will be highly correlated with one or more of the remaining features. The acceptance of a feature will depend on the extent to which it predicts classes in areas of the instance space not already predicted by other features. CFS's feature subset evaluation function is:

$$M_S = \frac{k\,\overline{r}_{cf}}{\sqrt{k + k(k-1)\,\overline{r}_{ff}}},$$

where $M_S$ is the heuristic "merit" of a feature subset S containing k features, $\overline{r}_{cf}$ is the mean feature-class correlation (f ∈ S) and $\overline{r}_{ff}$ is the average feature-feature intercorrelation. The numerator of this equation can be thought of as providing an indication of how predictive of the class a set of features is, and the denominator of how much redundancy there is among the features.
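A minimal sketch of this merit computation, assuming continuous features and using Pearson correlation as the correlation measure (Hall's CFS uses symmetrical uncertainty for discrete data, so this is a simplification); the function name is ours.

```python
import numpy as np

def cfs_merit(X_subset, y):
    """Heuristic merit M_S = k*mean(r_cf) / sqrt(k + k*(k-1)*mean(r_ff))
    for a candidate subset given as a (samples x k) array."""
    k = X_subset.shape[1]
    # mean absolute feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X_subset[:, i], y)[0, 1]) for i in range(k)])
    if k > 1:
        # mean absolute feature-feature inter-correlation (off-diagonal entries)
        corr = np.abs(np.corrcoef(X_subset, rowvar=False))
        r_ff = (corr.sum() - k) / (k * (k - 1))
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```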

3.2. Consistency-based filter

The consistency-based filter (Dash & Liu, 2003) only supports classification; it evaluates the worth of a subset of features by the level of consistency in the values of the class to be predicted when the training instances are projected onto the subset of attributes. The algorithm generates a random subset S from the set of features in every round. If the number of features in S is less than that of the current best subset, the data restricted to the features prescribed by S are checked against the inconsistency criterion. If the inconsistency rate is below a pre-specified one, S becomes the new current best.
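The inconsistency criterion can be sketched as follows, assuming the training instances have already been projected onto the candidate subset and take discrete values; the helper name is ours.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels):
    """Fraction of instances that disagree with the majority class among the
    instances sharing the same projected feature values."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row)][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(labels)
```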

3.3. INTERACT

The INTERACT algorithm (Zhao & Liu, 2007) is a subset filter for classification tasks based on symmetrical uncertainty (SU) and the consistency contribution (c-contribution). The c-contribution of a feature is an indicator of how significantly the elimination of that feature will affect consistency. The algorithm consists of two major parts. In the first part, the features are ranked in descending order based on their SU values. In the second part, features are evaluated one by one starting from the end of the ranked feature list. If the c-contribution of a feature is less than an established threshold, the feature is removed; otherwise it is selected. The authors stated in Zhao and Liu (2007) that INTERACT can thus handle feature interaction and efficiently select relevant features.
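A sketch of the SU criterion used in INTERACT's first part, assuming discrete features; the c-contribution step is not shown here and the function names are ours.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx + hy - hxy) / (hx + hy)
```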


Table 1. Characteristics of each dataset.

Dataset Features Classes Training Test Task

Connect-4    42    3    60,000     7557     Classification
KDD Cup 99   42    2    494,021    311,029  Classification
Covertype    54    2/–  100,000    50,620   Classification/regression
MNIST        748   2/–  60,000     10,000   Classification/regression
Friedman     10    –    1,000,000  100,000  Regression
Lorenz       8     –    1,000,000  100,000  Regression



3.4. ReliefF

ReliefF (Kononenko, 1994) is an extension of the original Relief algorithm (Kira & Rendell, 1992) which supports both regression and classification tasks. The original Relief works by randomly sampling an instance from the data and then locating its nearest neighbors from the same and the opposite class of the output to be predicted. The values of the attributes of these nearest neighbors are compared to those of the sampled instance and used to update relevance scores for each attribute. The rationale is that a useful attribute should differentiate between instances from different classes and have the same value for instances of the same output. ReliefF adds the ability to deal with multiclass problems and is also more robust and capable of dealing with incomplete and noisy data. Since this filter follows the individual evaluation approach, a threshold is required in order to select the attributes whose relevance score is higher than it.
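The following sketch illustrates the original two-class Relief update described above (not the full ReliefF with k nearest neighbors and multiclass handling); the function name and the use of Manhattan distance are our choices.

```python
import numpy as np

def relief_scores(X, y, n_iter=100, rng=None):
    """Original (two-class) Relief sketch: reward features that separate the
    nearest miss and penalize features that separate the nearest hit."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # normalize per-feature differences
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                            # exclude the sampled instance itself
        same, other = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(other, dist, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / n_iter
```

Features whose score exceeds the chosen threshold would then be retained, as described above.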

4. Experimental study

In order to check the effect of different feature selection methods on the scalability of machine learning algorithms, four of the most popular training algorithms for ANNs were selected. Two of these algorithms are gradient descent (GD) (Bishop, 2006) and gradient descent with momentum and adaptive learning rate (GDX) (Bishop, 2006), whose complexity is O(n). The other algorithms are scaled conjugate gradient (SCG) (Moller, 1993) and Levenberg-Marquardt (LM) (More, 1978), whose complexities are O(n^2) and O(n^3), respectively.
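As a reference point for the simplest of these four algorithms, the sketch below trains a one-hidden-layer feedforward network by plain batch gradient descent in numpy. GDX, SCG and LM are not reproduced here, and all names and hyperparameters are illustrative rather than the paper's configuration.

```python
import numpy as np

def train_gd(X, y, n_hidden, lr=0.01, epochs=100, rng=None):
    """Plain batch gradient descent (the O(n) case) on a one-hidden-layer
    network with tanh hidden units and a linear output, minimizing MSE."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, 1)); b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                 # forward pass
        out = h @ W2 + b2
        err = out - y                            # gradient of 0.5 * mean squared error
        dW2 = h.T @ err / n; db2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1 - h ** 2)         # backpropagate through tanh
        dW1 = X.T @ dh / n; db1 = dh.mean(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```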

4.1. Data sets

Classification and regression are two of the most common tasks in machine learning. Table 1 depicts the datasets¹ used in this paper along with a brief description of them (number of features, classes, training examples and test examples).

Covertype and MNIST datasets, which are originally classification tasks, were also transformed into a regression task by predicting −1 for samples of class 1 and +1 for samples of class 2 (Collobert & Bengio, 2008). Friedman and Lorenz are artificial datasets. Friedman is defined by the equation

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \sigma(0,1),$$

where the input attributes $x_1,\ldots,x_{10}$ are generated independently, each of them uniformly distributed over the interval [0, 1]; variables $x_6$ to $x_{10}$ are randomly generated. On the other hand, Lorenz is defined by the simultaneous solution of the three equations

$$\frac{dX}{dt} = \delta Y - \delta X, \qquad \frac{dY}{dt} = -XZ + rX - Y, \qquad \frac{dZ}{dt} = XY - bZ,$$

where the system exhibits chaotic behavior for $\delta = 10$, $r = 28$ and $b = 8/3$. The goal of the network is to predict the current sample based on the four previous samples.

¹ Connect-4 and Covertype datasets are available at <http://archive.ics.uci.edu/ml/datasets.html>; KDD Cup 99 at <http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>; and MNIST at <http://yann.lecun.com/exdb/mnist/>.
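A sketch of how the Friedman data can be generated, under the assumption (ours) that the noise term σ(0,1) denotes standard Gaussian noise; the function name is ours.

```python
import numpy as np

def friedman(n_samples, rng=None):
    """Synthetic Friedman regression data as described above: 10 uniform
    inputs, of which x6..x10 do not enter the target, plus additive noise."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4]
         + rng.normal(0.0, 1.0, size=n_samples))   # assumed N(0, 1) noise
    return X, y
```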

4.2. Performance measures

In order to assess the performance of learning algorithms, common measures such as accuracy are insufficient, since they do not take into account all the aspects involved when dealing with large datasets. Accordingly, the goal for machine learning developers is to find a learning algorithm that achieves a low error in the shortest possible time using as few samples as possible. Since there are no standard measures of scalability, those defined in the PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008) are used:

• Fig. 1(a) shows the relationship between training time and test error, computed on the largest dataset size the algorithm is able to deal with, aimed at answering the question "which test error can we expect given limited training time resources?". Following the PASCAL Challenge, the different training time budgets are set to $10^{[\ldots,-1,0,1,2,\ldots]}$ seconds. We compute the following scalar measures based on this figure:
  – Err: minimum test error (standard class error for classification and mean squared error for regression (Weiss & Kulikowski, 1991)).
  – AuTE: area under the training time vs. test error curve (gray area).
  – Te5%: the time t for which the test error e falls below a threshold, $(e - Err)/e < 0.05$.
• Fig. 1(b) shows the relationship between different training set sizes and the test error of each one, aimed at answering the question "which test error can be expected given limited training data resources?". Following the PASCAL Challenge, the different training set sizes (training samples) are set to $10^{[2,3,4,\ldots]}$ up to the maximum size of the dataset. We compute the following scalar measures based on this figure:
  – AuSE: area under the training set size vs. test error curve (gray area).
  – Se5%: the size s for which the test error e falls below a threshold, $(e - Err)/e < 0.05$.
• Fig. 1(c) shows the relationship between different training set sizes and the training time for each one, aimed at answering the question "which training time can be expected given limited training data resources?". Again, the different training set sizes are set to $10^{[2,3,4,\ldots]}$ and the maximum size of the dataset. We compute the following scalar measure based on this figure (see the sketch after this list):
  – Eff: slope b of the curve using a least-squares fit to $ax^b$.

In order to establish a general measure of scalability, the final Score of an algorithm is calculated as the average rank of its contribution with regard to the six scalar measures defined above.

4.3. Experimental procedure

As a preprocessing step, several feature selection methods were applied over the training set to obtain a subset of features. CFS, INTERACT and the consistency-based filter were used for classification whilst CFS and ReliefF were employed for regression (see Section 3). The latter follows the individual evaluation framework, and therefore a threshold is required. In this work, we have opted for an aggressive reduction (lower number of features to retain) and a soft reduction (higher number of features to retain).

After the preprocessing step, different simulations (N = 10) were carried out over the training set to accurately estimate the scalability of the algorithms on each dataset, as shown in the following procedure:


Fig. 1. Performance measures: (a) training time vs. test error; (b) training set size vs. test error; (c) training set size vs. training time.

Table 2. Features selected by each feature selection method along with the time required for this task.

Data Filter Features Time (s) hh:mm:ss

Connect-4    None         42   0.00      00:00:00.00
             CFS          6    8.00      00:00:08.00
             Consistency  40   2760.73   00:46:00.73
             INTERACT     38   173.32    00:02:53.33
Forest       None         54   0.00      00:00:00.00
             CFS          13   17.28     00:00:17.28
             Consistency  31   7702.12   02:08:22.12
             INTERACT     30   203.57    00:03:23.58
KDD Cup 99   None         42   0.00      00:00:00.00
             CFS          5    61.94     00:01:01.95
             Consistency  7    382.01    00:06:22.01
             INTERACT     7    106.71    00:01:46.72
MNIST        None         748  0.00      00:00:00.00
             CFS          55   805.27    00:13:25.27
             Consistency  18   13775.09  03:49:35.10
             INTERACT     36   652.96    00:10:52.96


1. Select features over the training set using the methods presented in Section 3.
2. Set the number of hidden units of the ANN to 2 × number_of_inputs + 1 (Hecht-Nielsen, 1990) and train the network. It is important to remark that the aim here is not to investigate the optimal topology of an ANN for a given dataset, but to check the scalability of learning algorithms on large networks.
3. Compute the Score of the algorithms as the average rank of their contribution with regard to the six scalar measures defined in Section 4.2.
4. Apply a Kruskal-Wallis test to check if there are significant differences among the medians for each algorithm with and without FS for a level of significance α = 0.05.
5. If there are differences among the medians, then apply a multiple comparison procedure (Tukey's) to find the simplest approach whose score is not significantly different from the approach with the best score (see the sketch after this procedure).
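Steps 3-5 can be sketched as follows. The pairwise Mann-Whitney check below is a simplified stand-in for Tukey's multiple comparison procedure, and all names and the data layout are ours.

```python
import numpy as np
from scipy import stats

def compare_scores(score_samples, alpha=0.05):
    """score_samples: dict mapping approach name -> list of Scores over the
    N = 10 simulations (lower is better). Returns the approaches whose Scores
    are not significantly different from the best one."""
    names = list(score_samples)
    groups = [score_samples[n] for n in names]
    _h, p = stats.kruskal(*groups)              # step 4: Kruskal-Wallis test
    medians = {n: np.median(score_samples[n]) for n in names}
    best = min(medians, key=medians.get)
    if p >= alpha:                              # no significant differences at all
        return names
    # Step 5 (simplified): pairwise comparison against the best approach.
    keep = [best]
    for n in names:
        if n != best:
            _stat, p_pair = stats.mannwhitneyu(score_samples[n], score_samples[best])
            if p_pair >= alpha:
                keep.append(n)
    return keep
```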

5. Results

5.1. Classification

During the preprocessing step, the filters CFS, consistency-based and INTERACT were applied over the training set in order to obtain a subset of features, which will be employed in the classification stage. The number of features selected by each method, along with the time required for this task, is depicted in Table 2. It has to be noted that CFS is the filter which achieves the greatest reduction in the number of features in the minimum time in 3 out of the 4 datasets tested. On the other hand, the consistency-based filter is the one requiring the most time to perform the selection (note that for the Forest dataset the consistency-based filter takes over 2 h to perform this task while CFS only needs 17 s).

Table 3 presents the average test results obtained by the machine learning algorithms after applying the three different filters, compared with those where no feature selection was performed. Regarding the performance measures defined in Section 4.2, notice that the lower the result, the higher the scalability.

It has to be noted that not all the learning algorithms were able to deal with all available samples for every dataset, mostly due to the spatial complexity of the algorithms. In particular, on the MNIST dataset the LM algorithm is not able to train even on the smallest subset when no feature selection is applied. When this occurs, the measures explained in Section 4.2 were computed on the largest dataset that the learning algorithm was able to process and this fact is specified along with the results.

5.2. Regression

The filters CFS and ReliefF were selected for the regression task. It should be recalled that ReliefF provides a ranking of features and a threshold is required. In this work, we have opted for two different arrangements: an aggressive and a soft reduction, represented in the tables as ReliefF_ and ReliefF^, respectively.


Table 3. Performance measures for classification tasks. N/A stands for Not Applicable. Those algorithms whose Score average test results are not significantly worse than the best are labeled with a cross (†).

Method Filter Score Err AuTE AuSE Te5% Se5% Eff

(a) Connect-4
GD    None         8.67   0.38  5.16e1  0.97  1.08e2  1.00e2  0.43
      CFS          5.50†  0.51  1.01e1  1.24  1.39e1  1.00e2  0.26
      Consistency  7.83   0.40  4.32e1  0.94  8.44e1  1.00e2  0.40
      INTERACT     6.67†  0.35  3.37e1  0.90  8.14e1  1.00e2  0.40
GDX   None         7.00   0.31  3.71e1  0.92  7.98e1  6.00e4  0.40
      CFS          4.00†  0.32  9.44e0  0.89  1.70e1  1.00e4  0.25
      Consistency  4.83†  0.31  2.61e1  0.87  5.71e1  1.00e3  0.37
      INTERACT     4.83†  0.28  2.71e1  0.80  6.96e1  1.00e4  0.38
LM    None^a       8.83†  0.23  3.79e2  0.77  7.80e2  1.00e4  0.77
      CFS          6.67†  0.31  4.79e1  0.87  6.68e1  1.00e4  0.44
      Consistency  9.33†  0.27  2.31e2  0.85  5.01e2  1.00e4  0.71
      INTERACT     8.17†  0.24  1.60e2  0.79  3.50e2  1.00e4  0.68
SCG   None         7.17   0.21  7.01e1  0.77  2.62e2  1.00e4  0.50
      CFS          3.83†  0.29  9.97e0  0.82  2.28e1  1.00e4  0.31
      Consistency  6.83   0.23  5.34e1  0.72  1.44e2  6.00e4  0.47
      INTERACT     6.17†  0.23  4.95e1  0.70  1.41e2  6.00e4  0.47

(b) Forest
GD    None         9.00   0.38  1.24e2  1.20  2.78e2  1.00e3  0.49
      CFS          5.67†  0.45  2.74e1  1.36  3.34e1  1.00e2  0.35
      Consistency  7.67†  0.41  5.67e1  1.30  1.12e2  1.00e3  0.42
      INTERACT     7.00†  0.38  4.97e1  1.28  1.08e2  1.00e5  0.41
GDX   None         7.33   0.42  4.74e1  1.32  1.01e2  1.00e4  0.41
      CFS          5.33†  0.51  6.81e0  1.41  0.43e0  1.00e2  0.24
      Consistency  5.67†  0.38  3.21e1  1.23  7.20e1  1.00e5  0.37
      INTERACT     4.33†  0.40  2.26e1  1.11  4.93e1  1.00e3  0.35
LM    None^a       9.33†  0.24  6.41e2  0.94  1.74e3  1.00e4  0.84
      CFS          9.17†  0.32  2.99e2  0.95  5.15e2  1.00e3  0.58
      Consistency  8.17†  0.26  6.71e1  0.96  1.72e2  1.00e4  0.59
      INTERACT     6.67†  0.25  5.72e1  0.93  1.55e2  1.00e4  0.58
SCG   None         7.50   0.20  1.64e2  0.81  5.80e2  1.00e5  0.55
      CFS          4.00†  0.29  3.51e1  0.86  5.97e1  1.00e3  0.40
      Consistency  6.50†  0.23  7.21e1  0.84  2.48e2  1.00e4  0.48
      INTERACT     5.67†  0.20  6.21e1  0.84  1.70e2  1.00e5  0.47

(c) KDD Cup 99
GD    None^b       7.00   0.13  4.29e1  0.43  5.53e1  1.00e2  0.50
      CFS          4.67†  0.16  8.67e0  0.54  5.41e0  1.00e2  0.34
      Consistency  6.50   0.20  8.80e0  0.70  2.49e1  1.00e3  0.32
      INTERACT     3.33†  0.12  6.55e0  0.45  2.83e1  1.00e2  0.33
GDX   None^b       7.00   0.15  2.55e1  0.46  5.93e1  1.00e3  0.44
      CFS          1.83†  0.11  4.61e0  0.37  2.15e1  1.00e3  0.30
      Consistency  5.33   0.19  7.06e0  0.70  5.65e0  1.00e2  0.31
      INTERACT     3.50†  0.11  5.68e0  0.48  3.85e1  1.00e3  0.32
LM    None^a       9.17†  0.11  2.21e2  0.46  1.24e3  1.00e4  0.80
      CFS          8.00†  0.12  3.38e1  0.49  1.47e2  1.00e2  0.46
      Consistency  9.33†  0.17  3.11e1  0.63  1.10e2  1.00e4  0.42
      INTERACT     8.00†  0.12  2.89e1  0.55  6.32e1  1.00e5  0.44
SCG   None^b       9.67   0.14  1.10e2  0.51  3.54e2  1.00e4  0.55
      CFS          4.17†  0.08  1.13e1  0.31  4.40e1  1.00e4  0.38
      Consistency  7.83   0.18  2.06e1  0.70  4.88e1  1.00e3  0.39
      INTERACT     8.00   0.17  2.15e1  0.56  8.41e1  1.00e2  0.39

(d) MNIST
GD    None^a       9.17   0.36  1.41e2  0.85  2.26e2  1.00e2  0.65
      CFS          6.67   0.26  4.04e1  0.69  1.07e2  1.00e3  0.43
      Consistency  5.00†  0.37  1.72e1  1.07  3.99e1  1.00e3  0.33
      INTERACT     5.50†  0.31  2.92e1  0.79  7.46e1  1.00e3  0.39
GDX   None^a       9.00   0.22  2.30e2  0.66  6.91e2  1.00e3  0.72
      CFS          5.50   0.21  3.32e1  0.66  9.98e1  1.00e3  0.42
      Consistency  3.50†  0.22  1.50e1  0.68  3.98e1  1.00e4  0.33
      INTERACT     3.83†  0.21  2.33e1  0.64  6.80e1  1.00e3  0.38
LM    None         N/A    N/A   N/A     N/A   N/A     N/A     N/A
      CFS          9.17†  0.13  3.22e2  0.56  1.10e3  1.00e4  0.80
      Consistency  6.83†  0.10  8.52e1  0.49  3.38e2  6.00e4  0.55
      INTERACT     6.33†  0.13  6.45e1  0.48  2.05e2  1.00e4  0.62




SCG   None^a       8.00   0.05  2.85e2  0.40  1.62e3  1.00e4  0.81
      CFS          6.33†  0.11  4.77e1  0.49  2.42e2  6.00e4  0.50
      Consistency  3.83†  0.11  1.73e1  0.51  8.62e1  6.00e4  0.40
      INTERACT     5.50†  0.11  3.14e1  0.54  1.61e2  6.00e4  0.46

^a Largest training set it can deal with: 1e4 samples.
^b Largest training set it can deal with: 1e5 samples.

Table 4. Features selected by each feature selection method along with the time required for this task. ReliefF_ and ReliefF^ stand for aggressive and soft reduction, respectively.

Data Filter Features Time (s) hh:mm:ss

Forest     None      54   0.00       00:00:00.00
           CFS       23   21.76      00:00:21.76
           ReliefF_  5    11631.56   03:13:51.57
           ReliefF^  48   11631.56   03:13:51.57
Friedman   None      10   0.00       00:00:00.00
           CFS       5    17.39      00:00:17.39
           ReliefF_  4    289191.43  80:19:51.44
           ReliefF^  5    289191.43  80:19:51.44
Lorenz     None      8    0.00       00:00:00.00
           CFS       1    11.83      00:00:11.83
           ReliefF_  4    262575.40  72:56:15.40
           ReliefF^  6    262575.40  72:56:15.40
MNIST      None      748  0.00       00:00:00.00
           CFS       103  938.73     00:15:38.73
           ReliefF_  31   57048.71   15:50:48.72
           ReliefF^  418  57048.71   15:50:48.72



As for classification, Table 4 shows the number of features selected by each method along with the time required. Note that the time required by ReliefF is the same for the two arrangements, since the construction of the ranking is a common step. Again, CFS is able to perform the selection mostly in the order of seconds whilst ReliefF needs times in the order of hours.

Table 5 depicts the average test results achieved by the machine learning algorithms after applying the CFS and ReliefF filters, compared with those where no feature selection was performed. As in the classification case, when a learning algorithm is not able to train on all available samples, this fact is specified along with the results. LM was again not able to train on the MNIST dataset (see Table 5(d)). Even though the number of weights of an ANN is lower in a regression task than in a classification task (as the number of outputs is also lower), the spatial complexity of the LM algorithm is still very high.

6. Discussion

The aim of the experiments carried out in this work is to assess the performance of ANN algorithms in terms of scalability and not simply in terms of error, unlike the great majority of papers in the literature. All six scalar measures defined in Section 4.2 are considered to evaluate the scalability of learning algorithms, trying to achieve a balance among them. When applying feature selection, it is expected that some measures are positively affected by the dimensionality reduction (such as AuTE, Te5% or Eff) because the algorithms can deal with a larger number of samples employing the same execution time. Nevertheless, the goal of this research is to demonstrate that this reduction does not always negatively affect the remaining measures, and also to find a trade-off among all the measures, which will be reflected in the final average Score.

6.1. Classification

In general lines, the results without FS show a slightly lower error (although in some cases the error is maintained or even improved, depending on the filter) at the expense of a longer training time. On the other hand, the results after applying feature selection present a shorter training time.

As expected, the measures related to the training time (AuTE, Te5% and Eff) improve, since a shorter time is needed to train on the same amount of data. On the other hand, AuSE and Se5% deteriorate after applying feature selection. Although the error was expected to be higher after applying feature selection, INTERACT maintains or improves the classification error in most of the cases, also obtaining a good performance on the other measures.

Since the assessment of the scalability of learning algorithms is a multi-objective problem and there is no chance of defining a single optimal order of importance among the measures, we have opted to focus on the general measure of scalability (Score). Table 3 shows that in most cases applying feature selection was significantly better than not applying it (13 out of 16 cases). In order to decide which filter is the best option, Table 6 depicts the average Score for each filter on each dataset, as well as the average filtering time. Since the LM algorithm is not able to train over MNIST when no feature selection is applied, the results over this dataset are averaged over three learning algorithms instead of four.

In light of the results shown in Table 6, it remains clear that applying feature selection is better than not doing it. Among the three filters tested, CFS exhibits the best average Score, closely followed by INTERACT. Bearing in mind the average time required by each filter (see the last column in Table 6), the consistency-based filter does not seem to be a good option, since it obtains the worst score along with the highest processing time. CFS tends to select the smallest number of features at the expense of a slightly higher error than INTERACT, therefore the decision of which one is more adequate for the scalability of ANNs depends on whether the user is more interested in minimizing the error or the other measures.

6.2. Regression

Over the Forest and MNIST datasets, the performance measures follow the same trends as with classification tasks: those related to the training time (AuTE, Te5% and Eff) improve while AuSE and Se5% slightly worsen. However, this is not the case with the Friedman and Lorenz datasets. This fact is explained because these datasets have only 10 and 8 features, respectively, and reducing the input dimensionality does not lead to a significant reduction in the training time, whilst it may remove some important information, which affects the accuracy.

Although in general it seems that feature selection methods do not achieve results as good as over classification tasks, an in-depth analysis reveals that for all the combinations of dataset and learning algorithm, using feature selection obtained a significantly better or equal Score (for all filters) than not using it.


Table 5. Performance measures for regression tasks. N/A stands for Not Applicable. Those algorithms whose Score average test results are not significantly worse than the best are labeled with a cross (†). ReliefF_ and ReliefF^ stand for aggressive and soft reduction, respectively.

Method Filter Score Err AuTE AuSE Te5% Se5% Eff

(a) Forest
GD    None      9.17   0.90  1.26e3  3.62  5.38e2  1.00e4  0.55
      CFS       7.67   0.99  1.29e2  3.15  8.27e1  1.00e3  0.38
      ReliefF_  4.67†  0.95  2.37e1  2.96  1.54e1  1.00e3  0.26
      ReliefF^  8.67   0.90  3.90e2  2.98  1.59e2  1.00e4  0.44
GDX   None      8.50   0.68  1.01e3  4.17  4.54e2  1.00e5  0.53
      CFS       7.00†  1.13  6.91e1  3.32  3.59e1  1.00e3  0.32
      ReliefF_  4.67†  0.99  1.61e1  3.00  1.07e1  1.00e3  0.21
      ReliefF^  9.33   1.16  3.60e2  4.63  1.03e2  1.00e3  0.41
LM    None^a    9.17   0.60  1.02e4  3.42  1.35e3  1.00e4  0.82
      CFS       6.83†  0.80  4.20e2  2.48  8.83e1  1.00e3  0.40
      ReliefF_  4.50†  0.64  4.71e1  2.39  4.07e1  1.00e4  0.35
      ReliefF^  9.67   0.54  2.36e3  3.04  2.83e2  1.00e4  0.65
SCG   None      8.17   0.57  1.64e3  2.72  9.86e2  1.00e5  0.60
      CFS       7.00   0.81  1.41e2  2.66  5.89e1  1.00e4  0.42
      ReliefF_  3.67†  0.67  2.83e1  2.29  2.48e1  1.00e4  0.31
      ReliefF^  8.00   0.61  5.33e2  2.60  2.16e2  1.00e5  0.50

(b) Friedman
GD    None^b    6.50†  8.33  2.19e3  36.77  7.51e1  1.00e3  0.37
      CFS       7.83†  8.16  6.72e3  35.80  1.71e2  1.00e3  0.37
      ReliefF_  7.67†  9.90  6.30e3  43.70  1.36e2  1.00e4  0.35
      ReliefF^  8.67†  9.71  7.09e3  36.80  1.71e2  1.00e3  0.37
GDX   None^b    5.84†  4.41  1.83e3  24.57  7.20e1  1.00e5  0.37
      CFS       6.50†  3.86  6.36e3  23.00  1.69e2  1.00e5  0.37
      ReliefF_  6.33†  5.08  5.49e3  31.10  1.36e2  1.00e6  0.35
      ReliefF^  6.67†  4.00  6.44e3  23.71  1.69e2  1.00e4  0.37
LM    None^b    5.00†  0.11  1.11e3  8.57   8.74e2  1.00e5  0.59
      CFS       8.33   2.34  1.87e4  14.82  1.03e3  1.00e4  0.50
      ReliefF_  7.50   2.35  1.38e4  14.24  7.76e2  1.00e4  0.48
      ReliefF^  7.00†  0.27  1.79e4  6.98   1.03e3  1.00e4  0.50
SCG   None^b    5.00†  0.79  1.67e3  10.33  1.71e2  1.00e5  0.44
      CFS       6.33†  2.66  4.77e3  16.10  3.88e2  1.00e5  0.43
      ReliefF_  6.17†  2.85  4.49e3  17.81  3.10e2  1.00e6  0.41
      ReliefF^  5.33†  0.93  5.58e3  9.34   3.87e2  1.00e4  0.43

(c) Lorenz
GD    None^b    5.17†  0.74  4.82e2  2.98   6.17e1  1.00e2  0.36
      CFS       7.00†  6.57  1.57e3  26.01  5.60e1  1.00e2  0.29
      ReliefF_  7.67   1.44  2.16e3  6.37   1.37e2  1.00e3  0.35
      ReliefF^  8.17   0.94  2.26e3  4.30   2.02e2  1.00e6  0.38
GDX   None^b    4.83†  2.66  2.45e2  13.63  2.04e1  1.00e4  0.26
      CFS       4.83†  1.29  1.19e3  5.37   4.26e1  1.00e3  0.27
      ReliefF_  6.83†  4.02  1.76e3  13.23  7.53e1  1.00e3  0.31
      ReliefF^  7.17   5.45  1.57e3  18.71  7.73e1  1.00e2  0.32
LM    None^b    8.17†  0.00  3.26e3  0.00   5.19e2  1.00e5  0.54
      CFS       5.17†  0.18  8.40e2  0.80   1.24e2  1.00e3  0.36
      ReliefF_  7.83†  0.00  3.12e3  0.00   7.82e2  1.00e6  0.48
      ReliefF^  8.50†  0.00  8.97e3  0.00   1.44e3  1.00e6  0.52
SCG   None^b    5.83†  0.01  5.61e2  0.05   1.38e2  1.00e4  0.43
      CFS       5.00†  0.21  4.86e2  1.59   1.15e2  1.00e4  0.35
      ReliefF_  6.33†  0.01  1.21e3  0.10   3.11e2  1.00e6  0.41
      ReliefF^  7.17†  0.01  2.00e3  0.12   4.53e2  1.00e4  0.44

(d) MNIST
GD    None^a    8.17    303.12  1.66e3  903.14  6.60e0  1.00e2  0.44
      CFS       5.17†   37.14   1.44e2  143.11  3.23e0  6.00e4  0.23
      ReliefF_  3.50†   1.33    3.86e1  3.95    1.40e1  1.00e3  0.28
      ReliefF^  6.17†   210.98  9.11e2  528.32  4.55e0  1.00e2  0.35
GDX   None^a    9.33    9.25    7.49e4  66.06   9.71e2  1.00e4  0.75
      CFS       6.33    1.52    1.34e3  6.02    1.62e2  1.00e3  0.46
      ReliefF_  2.83†   0.85    3.73e1  3.86    1.26e1  1.00e4  0.27
      ReliefF^  8.83    12.92   1.51e4  76.51   3.06e2  1.00e4  0.62
LM    None^a    N/A     N/A     N/A     N/A     N/A     N/A     N/A
      CFS       N/A     N/A     N/A     N/A     N/A     N/A     N/A
      ReliefF_  4.83†   0.59    3.80e2  3.08    8.68e1  1.00e4  0.51
      ReliefF^  N/A     N/A     N/A     N/A     N/A     N/A     N/A




SCG   None      9.17   3.06  3.10e4  41.52  1.82e3  1.00e4  0.82
      CFS       5.83   0.44  1.09e3  3.62   4.20e2  6.00e4  0.55
      ReliefF_  3.00†  0.55  3.87e1  2.12   1.56e1  1.00e4  0.32
      ReliefF^  8.50   2.51  1.12e4  36.71  6.68e2  1.00e4  0.71

^a Largest training set it can deal with: 1e4 samples.
^b Largest training set it can deal with: 1e5 samples.

Table 6. Average Score for each filter on each dataset for classification tasks, along with the average time required by the filters.

Filter Connect-4 Forest KDD Cup 99 MNIST Average Time (s)

None         7.92  8.29  8.21  8.72  8.29  –
CFS          5.00  6.04  4.67  6.17  5.47  223.12
Consistency  7.21  7.00  7.25  4.11  6.39  6154.99
INTERACT     6.46  5.92  5.71  4.94  5.76  284.14

Table 7. Average Score for each filter on each dataset for regression tasks, along with the average time required by the filters.

Filter Forest Friedman Lorenz MNIST Average Time (s)

None      7.88  5.59  6.00  8.89  7.09  –
CFS       7.13  7.25  5.50  5.78  6.42  247.43
ReliefF_  4.38  6.92  7.17  3.11  5.39  155111.60
ReliefF^  8.92  6.92  7.75  7.83  7.86  155111.60



Studying the behavior of the filters in detail, one may find that ReliefF_ (aggressive reduction) is the best (or one of the best) methods, with a significant difference, in all cases but two. Table 7 reinforces this fact by showing that this method obtains the best average Score over all datasets and learning algorithms. On the other hand, ReliefF^ (soft reduction) presents the worst average Score. This fact can be explained because ReliefF^ selects a higher number of features, which leads to a slightly better error, but at the expense of requiring more training time and hence obtaining worse results on the other measures.

Although ReliefF_ is the best method according to the Score, it must be remembered that ReliefF requires a much higher computational time than CFS (see the last column in Table 7). For this reason, in the opinion of the authors, CFS is the best option when looking for a feature selection method that can help with the scalability of ANNs on regression tasks. In fact, its Score shows that applying CFS is better than not doing it and the computational time required is not prohibitive, as happens with ReliefF.

Regarding the results for the Friedman and Lorenz datasets, with 10 and 8 input features respectively, and bearing in mind the results depicted in Table 7, one may question the adequacy of feature selection on datasets with such a small number of features. However, there is no universal answer to this question, since it depends on the nature of the problem and on the presence of irrelevant features. In this work, no benefits were found after applying feature selection over the Friedman dataset, but it was worthwhile over Lorenz. In fact, the CFS filter over the latter dataset only needs a couple of seconds to perform the selection and it retains one single feature. Further experimentation showed that this feature is highly correlated with the output and obtained results as good as those with the whole set of features. It is remarkable that for the GDX algorithm, the lowest error was achieved by CFS using only that one feature (see Table 5(c)).

Finally, notice that the current performance measures and the aggregation of the ranks are detrimental to learning algorithms which are accurate but slow. A case in point of the change in approach for assessing training algorithms for ANNs followed in this paper concerns LM (see Tables 3 and 5), as it usually reaches the lowest error by a large margin but never ranks among the best. In this manner, the ranking could be scrambled by using a naive but fast training algorithm. For example, in the regression task on the MNIST dataset (see Table 5(d)), the algorithm GD ranks better than GDX and SCG due to its short training time (caused by early stopping), in spite of obtaining a huge error. Further experiments showed convergence problems with regard to the training process of the GD algorithm. This is an isolated case and the conclusions of this work are not disturbed by it. However, and this is the point, if a learning algorithm which shows convergence problems is able to beat other algorithms in terms of scalability, the performance measures must be revised.

7. Conclusions

When dealing with the performance of machine learning algorithms, most papers focus on the accuracy obtained by the algorithm. However, with the advent of high dimensionality problems, researchers must study not only accuracy but also scalability. When aiming at dealing with a problem as large as possible, feature selection can be helpful as it reduces the input dimensionality and therefore the run-time required by an algorithm.

In this work, the effectiveness of feature selection on the scalability of training algorithms for ANNs was evaluated, both for classification and regression tasks. Since there are no standard measures of scalability, those defined in the PASCAL Large Scale Learning Challenge were used to assess the scalability of the algorithms in terms of error, computational effort, allocated memory and training time. The results showed that feature selection as a preprocessing step is beneficial for the scalability of ANNs, even allowing certain algorithms to train on datasets on which it was previously impossible due to their spatial complexity. Moreover, some conclusions about the adequacy of the different feature selection methods for this problem were extracted.

As future work, it is necessary to develop fairer measures of scalability. We are aware that evaluating scalability is a multi-objective problem and there is no chance of establishing a single fair absolute order. However, we believe that some shortcomings of the PASCAL measures can be overcome by putting some constraints on learning algorithms (e.g. avoiding early stopping) and by adding some new scalar measures (e.g. the largest dataset the algorithm is able to deal with).

Acknowledgments

This work was supported by the Secretaría de Estado de Investigación of the Spanish Government under projects TIN2009-02402 and TIN2012-37954, and by the Xunta de Galicia through projects CN2011/007 and CN2012/211, all partially supported by the European Union ERDF. D. Peteiro-Barral and V. Bolón-Canedo acknowledge the support of the Xunta de Galicia under the Plan I2C Grant Program.

References

Catlett, J. (1991). Megainduction: Machine learning on very large databases. Ph.D. thesis, Basser Department of Computer Science, University of Sydney, Australia.
Sonnenburg, S., Franc, V., Yom-Tov, E., & Sebag, M. (2008). Pascal large scale learning challenge. Machine learning research.
Peteiro-Barral, D., Guijarro-Berdinas, B., Pérez-Sánchez, B., & Fontenla-Romero, O. (2011). On the scalability of machine learning algorithms for artificial neural networks. Journal of Neural Networks (under review).
Guyon, I. (2006). Feature extraction: Foundations and applications (Vol. 207). Springer-Verlag.
Provost, F., & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2), 131–169.
Domingo, C., Gavaldà, R., & Watanabe, O. (2002). Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Mining and Knowledge Discovery, 6(2), 131–152.
Weiss, G., & Provost, F. (2001). The effect of class distribution on classifier learning: An empirical study. Rutgers University.
Yu, L., & Liu, H. (2004). Redundancy based feature selection for microarray data. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 737–742). Springer.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2010). Feature selection and classification in multiple class datasets. An application to KDD Cup 99 dataset. Expert Systems with Applications, 38(5), 5947–5957.
Lee, W., Stolfo, S., & Mok, K. (2000). Adaptive intrusion detection: A data mining approach. Artificial Intelligence Review, 14(6), 533–567.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289–1305.
Gomez, J., Boiy, E., & Moens, M. (2011). Highly discriminative statistical features for email classification. Knowledge and Information Systems, 1–31.
Egozi, O., Gabrilovich, E., & Markovitch, S. (2008). Concept-based feature generation and selection for information retrieval. In AAAI08 (pp. 1132–1137).
Dy, J., Brodley, C., Kak, A., Broderick, L., & Aisen, A. (2003). Unsupervised feature selection applied to content-based retrieval of lung images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(3), 373–378.
Saari, P., Eerola, T., & Lartillot, O. (2011). Generalizability and simplicity as criteria in feature selection: Application to mood classification in music. IEEE Transactions on Audio, Speech, and Language Processing, 19(6).
Polikar, R., Upda, L., Upda, S., & Honavar, V. (2001). Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics. Part C: Applications and Reviews, 31(4), 497–508.
Ruping, S. (2001). Incremental learning with support vector machines. In Proceedings IEEE international conference on data mining, ICDM 2001 (pp. 641–642). IEEE.
Parikh, D., & Polikar, R. (2007). An ensemble-based incremental learning approach to data fusion. IEEE Transactions on Systems, Man, and Cybernetics. Part B: Cybernetics, 37(2), 437–450.
Pérez-Sánchez, B., Fontenla-Romero, O., & Guijarro-Berdiñas, B. (2010). An incremental learning method for neural networks based on sensitivity analysis. Current Topics in Artificial Intelligence, 42–50.
Chan, P., & Stolfo, S. (1993). Toward parallel and distributed learning by meta-learning. In AAAI workshop in knowledge discovery in databases (pp. 227–240).
Ananthanarayana, V., Subramanian, D., & Murty, M. (2000). Scalable, distributed and dynamic mining of association rules. High Performance Computing – HiPC 2000, 559–566.
Tsoumakas, G., & Vlahavas, I. (2002). Distributed data mining of large classifier ensembles. In Proceedings companion volume of the second hellenic conference on artificial intelligence (pp. 249–256). Citeseer.
Peteiro-Barral, D., Guijarro-Berdinas, B., Pérez-Sánchez, B., & Fontenla-Romero, O. (2011). A distributed learning algorithm based on two-layer artificial neural networks and genetic algorithms. In Proceedings of the ESANN11 (pp. 471–476).
Kargupta, H., Park, B., Johnson, E., Sanseverino, E., Silvestre, L., & Hershberger, D. (1998). Collective data mining from distributed vertically partitioned feature space. In Workshop on distributed data mining, international conference on knowledge discovery and data mining.
McConnell, S., & Skillicorn, D. (2004). Building predictors from vertically distributed data. In Proceedings of the conference of the centre for advanced studies on collaborative research (pp. 150–162). IBM Press.
Skillicorn, D., & McConnell, S. (2008). Distributed prediction from vertically partitioned data. Journal of Parallel and Distributed Computing, 68(1), 16–36.
Bolón-Canedo, V., Peteiro-Barral, D., Alonso-Betanzos, A., Guijarro-Berdinas, B., & Sánchez-Marono, N. (2011). Scalability analysis of ANN training algorithms with feature selection. Advances in Artificial Intelligence, 84–93.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research, 5, 1205–1224.
Hall, M. (1999). Correlation-based feature selection for machine learning. Ph.D. thesis, The University of Waikato.
Dash, M., & Liu, H. (2003). Consistency-based search in feature selection. Artificial Intelligence, 151(1–2), 155–176.
Zhao, Z., & Liu, H. (2007). Searching for interacting features. In Proceedings of the 20th international joint conference on artificial intelligence (pp. 1161–1516). Morgan Kaufmann Publishers Inc.
Kononenko, I. (1994). Estimating attributes: Analysis and extensions of relief. In Machine learning: ECML-94 (pp. 171–182). Springer.
Kira, K., & Rendell, L. (1992). A practical approach to feature selection. In Proceedings of the ninth international workshop on machine learning (pp. 249–256). Morgan Kaufmann Publishers Inc.
Bishop, C. (2006). Pattern recognition and machine learning. New York: Springer.
Moller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525–533.
More, J. (1978). The Levenberg–Marquardt algorithm: Implementation and theory. Numerical Analysis, 105–116.
Collobert, R., & Bengio, S. (2008). Support vector machines for large-scale regression problems. The Journal of Machine Learning Research, 1.
Weiss, S., & Kulikowski, C. (1991). Computer systems that learn: Classification and prediction methods from statistics, neural nets, machine learning, and expert systems. San Francisco: Morgan Kaufmann.
Hecht-Nielsen, R. (1990). Neurocomputing. Menlo Park, California: Addison-Wesley.