



Neurocomputing 71 (2008) 3553–3560

www.elsevier.com/locate/neucom

Automatic generation of the optimum threshold for parameter weighted pruning in multiple heterogeneous output neural networks

A. Luchetta

Department of Electronics and Telecommunications (DET), University of Florence, Via S. Marta 3, 50139 Florence, Italy

Received 29 January 2006; received in revised form 6 August 2007; accepted 18 August 2007

Communicated by K. Li

Available online 1 November 2007

Abstract

In this paper a new procedure for the selection of the pruning threshold in feedforward artificial neural networks (FANNs) is presented. It is based on the evaluation of a local sensitivity index calculated with respect to each single output of the network. Special emphasis is given to a particular class of neural networks with multiple heterogeneous outputs. The effectiveness of the proposed method is shown through the development of a neural architecture devoted to a specific multi-output inversion system. The proposed pruning technique provides criteria for deciding "when" and "how much" to prune the designed neural network.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Feedforward neural networks; Multiple output systems; Pruning techniques; Satellite data

1. Introduction

One of the most important issues in neural network applications is to find a way to improve the generalization behavior of the network. The majority of systems modeled by means of a "black box" approach require an acceptable compromise between the complexity of the network and its capability of fitting data when new, never seen situations are presented to the system. While a system trained by examples might be capable of accurately fitting the training data, it may fail miserably if even slightly different inputs are presented. The generalization aptitude depends on the number of neurons in each hidden layer and on the number of active connections between neurons [13]. Thus the optimum size can be obtained either by starting from a small network and dynamically increasing the number of neurons, or, oppositely, by starting from an oversized network and removing the less important connections or entire neurons [3]. The latter solution is usually called "pruning" of the neural network.


A definitive result on how to establish a deterministic relation between the number of network parameters and the generalization capability is still lacking. The problem has been explored by several authors who have proposed efficient solutions for specific applications, such as [2] for classification of multisource data sets, [11] for facial expression recognition, [4] for inversion of synthetic aperture radar data, [6,18] for radial basis function (RBF) networks for function approximation, [16] for hybrid networks, [19] in wireless applications, [20] for freeway accident detection and [21] for discrete Hopfield networks. Some of the generally available results for fully connected networks [5,8,12,14] are often very expensive in terms of calculation requirements.

A new method is proposed in this paper by modifying existing techniques [7,15,17] to explicitly adapt them to the class of feedforward neural networks with backpropagation training and multiple heterogeneous outputs. In these cases, the outputs may have different levels of significance. Therefore, a suitable technique for deciding which connections or neurons to prune should take into account the nature of the distinct outputs and adjust the pruning thresholds accordingly.


In fact, in the type of feedforward artificial neural networks (FANNs) treated in this work, the presence of several outputs often emerges due to the need to benefit from correlations among system parameters [9]. However, as described in Section 5, methods dealing with this specific hypothesis are lacking, even though it is concretely plausible for important existing physical systems. In order to address this, new parameters as well as some extensions to already existing mathematical tools have been introduced in this work. The most substantial one is the introduction of an "importance degree" (ID) associated with every single output, allowing the correct dimensioning of the pruning entity. In other words, it results in the determination of the right number of connections and neurons that can be removed from the network.

This paper is organized as follows. In Section 2, a general description of the problem is stated and a method for evaluating the importance of connections is presented. In Section 3 the pruning approach and method are described. In Section 4 the algorithm is explained and commented on. Finally, an example is given in Section 5 and some conclusions in Section 6.

2. Theoretical foundation and local sensitivity evaluation

The generic neural network architecture is assumed to be a two- or three-layer FANN. For generality, a three-layer network will be used in the rest of the paper. To begin with, the following notations, which will be used throughout the text, are introduced here:

$N_i$, $N_{out}$: number of neurons in the $i$th layer and in the output layer, respectively
$x_k$: $k$th component of the input vector of the FANN
$o_i$: output of the $i$th neuron in the output layer of the network
$w^{(h)}_{jk}$: weight value of the connection between node $j$ of layer $h$ and node $k$ of layer $h-1$
$net^{(h)}_j$: sum of the weighted inputs of the $j$th neuron in the $h$th layer
$y^{(h)}_j$: output of the $j$th neuron in the $h$th layer
$b^{(h)}_j$: bias input of the $j$th neuron in the $h$th layer
$e_q = (d_q - o_q)$: error between the target and actual output $q$
$SSE = \frac{1}{2}\sum_{q=1}^{N_{out}} (d_q - o_q)^2$: sum of output squared errors
$SE^{(q)} = \frac{1}{2}(d_q - o_q)^2$: squared error relating to output $q$
$f^{(h)}_j(\cdot)$: activation function of the $j$th neuron in the $h$th layer

For any weighted connection $w^{(h)}_{jk}$, the estimated sensitivity of the output $o_q$ with respect to the connection (WS) is the rate of change of $SE^{(q)}$ with respect to $w_{jk}$ at the $n$th training iteration, and it is calculated by using a slightly modified version of the Karnin equation [7]:

$$ WS^{(q)}_{jk} = \left[ \sum_{n=0}^{N-1} \frac{\partial SE^{(q)}}{\partial w_{jk}}(n)\, \Delta w_{jk}(n) \right] \frac{w_{jk}(\mathrm{final})}{w_{jk}(\mathrm{final}) - w_{jk}(\mathrm{initial})}, \qquad (1) $$

where $w_{jk}(\mathrm{final})$ is the final value of $w_{jk}$, $w_{jk}(\mathrm{initial})$ is the initial value of $w_{jk}$, $\Delta w_{jk}(n)$ is the change of weight $w_{jk}$ at the $n$th iteration, and $N$ is the number of training iterations. The index relating to the layer has been dropped for the sake of clarity. In the original Karnin equation, the global error is used instead.

In a similar way, the global sensitivity (GWS) is the rate of change of $SSE$ with respect to $w_{jk}$:

$$ GWS_{jk} = \left[ \sum_{n=0}^{N-1} \frac{\partial SSE}{\partial w_{jk}}(n)\, \Delta w_{jk}(n) \right] \frac{w_{jk}(\mathrm{final})}{w_{jk}(\mathrm{final}) - w_{jk}(\mathrm{initial})}. \qquad (2) $$
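As an illustrative aid only (not the author's original code), the running sums in Eqs. (1) and (2) can be accumulated alongside ordinary backpropagation training. The following Python sketch assumes hypothetical arrays grad_SEq (the per-output derivatives of Eqs. (4), (6) and (8)) and delta_w (the weight updates applied at the current iteration):

    import numpy as np

    def init_sensitivity(weights, n_out):
        # one accumulator per output q and per layer h, shaped like the weight matrices
        return [[np.zeros_like(w) for w in weights] for _ in range(n_out)]

    def accumulate_ws(ws_acc, grad_SEq, delta_w):
        # adds dSE(q)/dw(n) * Delta w(n) to the running sum of Eq. (1)
        for q, grads_q in enumerate(grad_SEq):
            for h, g in enumerate(grads_q):
                ws_acc[q][h] += g * delta_w[h]
        return ws_acc

    def finalize_ws(ws_acc, w_final, w_initial, eps=1e-12):
        # multiplies the accumulated sum by w_final / (w_final - w_initial), as in Eq. (1)
        return [[acc * wf / (wf - wi + eps)
                 for acc, wf, wi in zip(acc_q, w_final, w_initial)]
                for acc_q in ws_acc]

Summing the per-output accumulators over q yields the global sensitivity of Eq. (2).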

The rates of change of the squared errors (single and global) can be calculated with an approach analogous to the backpropagation algorithm. Starting from the output layer (layer 3), the rate of change of $SSE$ with respect to the weight connecting output neuron $j$ and second hidden layer neuron $k$ is

$$ \frac{\partial SSE}{\partial w^{(3)}_{jk}} = \frac{\partial SSE}{\partial e_j}\frac{\partial e_j}{\partial o_j}\frac{\partial o_j}{\partial net^{(3)}_j}\frac{\partial net^{(3)}_j}{\partial w^{(3)}_{jk}} = -\, e_j\, y^{(2)}_k\, \dot f^{(3)}_j(net^{(3)}_j + b^{(3)}_j), \qquad (3) $$

where $\dot f^{(3)}_j = \partial f^{(3)}_j / \partial net^{(3)}_j$. In this case (for the output layer) the previous expression is the same as the rate of change of $SE$ with respect to $w_{jk}$ (relating to the $j$th output):

$$ \frac{\partial SE^{(q)}}{\partial w^{(3)}_{jk}} = -\, e_j\, y^{(2)}_k\, \dot f^{(3)}_j(net^{(3)}_j + b^{(3)}_j), \qquad q = 1,\ldots,N_{out}. \qquad (4) $$

Using backpropagation, the rate of change of $SSE$ with respect to the weight connecting the second hidden layer neuron $j$ and the first hidden layer neuron $k$ is

$$ \frac{\partial SSE}{\partial w^{(2)}_{jk}} = \frac{\partial SSE}{\partial e_j}\frac{\partial e_j}{\partial y^{(2)}_j}\frac{\partial y^{(2)}_j}{\partial net^{(2)}_j}\frac{\partial net^{(2)}_j}{\partial w^{(2)}_{jk}} = -\, e_j\, y^{(1)}_k\, \dot f^{(2)}_j(net^{(2)}_j + b^{(2)}_j) = -\, y^{(1)}_k\, \dot f^{(2)}_j(net^{(2)}_j + b^{(2)}_j) \sum_{q=1}^{N_{out}} e_q\, w^{(3)}_{qj}\, \dot f^{(3)}_q(net^{(3)}_q + b^{(3)}_q), \qquad (5) $$

while in this layer the rate of change of $SE$ with respect to the weight connecting the second hidden layer neuron $j$ and the first hidden layer neuron $k$ is

$$ \frac{\partial SE^{(q)}}{\partial w^{(2)}_{jk}} = \frac{\partial SE^{(q)}}{\partial e_j}\frac{\partial e_j}{\partial y^{(2)}_j}\frac{\partial y^{(2)}_j}{\partial net^{(2)}_j}\frac{\partial net^{(2)}_j}{\partial w^{(2)}_{jk}} = -\, e_j\, y^{(1)}_k\, \dot f^{(2)}_j(net^{(2)}_j + b^{(2)}_j) = -\, y^{(1)}_k\, \dot f^{(2)}_j(net^{(2)}_j + b^{(2)}_j)\, e_q\, w^{(3)}_{qj}\, \dot f^{(3)}_q(net^{(3)}_q + b^{(3)}_q), \qquad q = 1,\ldots,N_{out}. \qquad (6) $$

Finally, the rate of change of $SSE$ with respect to the weight connecting the first hidden layer neuron $j$ and the input element $k$ is

$$ \frac{\partial SSE}{\partial w^{(1)}_{jk}} = \frac{\partial SSE}{\partial e_j}\frac{\partial e_j}{\partial y^{(1)}_j}\frac{\partial y^{(1)}_j}{\partial net^{(1)}_j}\frac{\partial net^{(1)}_j}{\partial w^{(1)}_{jk}} = -\, x_k\, \dot f^{(1)}_j(net^{(1)}_j + b^{(1)}_j) \left\{ \sum_{p=1}^{N_2} \left[ \sum_{q=1}^{N_{out}} e_q\, w^{(3)}_{qp}\, \dot f^{(3)}_q(net^{(3)}_q + b^{(3)}_q) \right] w^{(2)}_{pj}\, \dot f^{(2)}_p(net^{(2)}_p + b^{(2)}_p) \right\} \qquad (7) $$

and the rate of change of $SE$ with respect to the weight connecting the first hidden layer neuron $j$ and the input element $k$ is

$$ \frac{\partial SE^{(q)}}{\partial w^{(1)}_{jk}} = \frac{\partial SE^{(q)}}{\partial e_j}\frac{\partial e_j}{\partial y^{(1)}_j}\frac{\partial y^{(1)}_j}{\partial net^{(1)}_j}\frac{\partial net^{(1)}_j}{\partial w^{(1)}_{jk}} = -\, x_k\, \dot f^{(1)}_j(net^{(1)}_j + b^{(1)}_j) \sum_{p=1}^{N_2} w^{(2)}_{pj}\, \dot f^{(2)}_p(net^{(2)}_p + b^{(2)}_p)\, e_q\, w^{(3)}_{qp}\, \dot f^{(3)}_q(net^{(3)}_q + b^{(3)}_q), \qquad q = 1,\ldots,N_{out}. \qquad (8) $$
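To make Eqs. (3)-(8) concrete, the following sketch (hypothetical names, a single activation f for all layers, and the paper's convention y = f(net + b)) computes the per-output derivatives dSE(q)/dw for a three-layer network with numpy:

    import numpy as np

    def per_output_gradients(x, d, W1, b1, W2, b2, W3, b3, f, df):
        # forward pass: net(h) = W(h) y(h-1), activation applied to net(h) + b(h)
        net1 = W1 @ x;  y1 = f(net1 + b1)
        net2 = W2 @ y1; y2 = f(net2 + b2)
        net3 = W3 @ y2; o  = f(net3 + b3)
        e = d - o
        df1, df2, df3 = df(net1 + b1), df(net2 + b2), df(net3 + b3)
        grads = []                                   # grads[q] = (dSE(q)/dW1, dSE(q)/dW2, dSE(q)/dW3)
        for q in range(len(d)):
            delta3 = np.zeros_like(o)
            delta3[q] = e[q] * df3[q]                # only output q contributes to SE(q)
            delta2 = df2 * (W3.T @ delta3)           # back-propagated through the output weights
            delta1 = df1 * (W2.T @ delta2)
            grads.append((-np.outer(delta1, x),      # Eq. (8)
                          -np.outer(delta2, y1),     # Eq. (6)
                          -np.outer(delta3, y2)))    # Eq. (4)
        return grads, o

Summing grads over q gives the SSE derivatives of Eqs. (3), (5) and (7).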

At this point a Bi-Local Sensitivity Index can be defined as

$$ BLSI^{(q)}_{jk} = \frac{\left| WS^{(q)}_{jk} \right|}{\sum_{l=1}^{N_{wc}} \left| WS^{(q)}_{jl} \right|}, \qquad (9) $$

where $N_{wc}$ is the number of weighted connections in the specific layer. The index just defined is called "bi-local" to underline that it is "local" with respect to the considered layer and also to the single output of a multi-output network. It is obtained as in [15], adding the "local" nature of the index in relation to the output.
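A minimal sketch of Eq. (9), under the assumption that ws holds the accumulated WS(q) matrices per output and per layer (hypothetical names): within each layer, each output's sensitivities are normalized by the sum of the absolute sensitivities of that layer.

    import numpy as np

    def blsi(ws):
        # ws[q][h]: WS(q) for layer h (same shape as the layer's weight matrix)
        # returns BLSI(q), normalized over the Nwc connections of each layer, Eq. (9)
        return [[np.abs(w) / (np.abs(w).sum() + 1e-12) for w in ws_q] for ws_q in ws]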

Being a "local" index with respect to the given layer prevents the suppression of important connections that would be pruned if a global pruning method were used [13]. On the other hand, the choice of an index that is diversified for each output of the network allows classes of neural networks with several heterogeneous outputs to be handled adequately. In Section 5 a concrete case study will be analyzed.

3. Pruning scheme and criteria

In the previous section, a procedure to calculate the sensitivity of each output of the FANN with respect to each individual connection during the training phase has been described. In this section, further steps will be proposed in order to set a criterion for answering two questions:

1. When should the network be pruned?
2. How deeply should the network be pruned, using the calculated values of bi-local sensitivity? In other words, how many connections should be removed at the time of pruning?

Although several answers exist in the literature to the first question, few have been found to the second one. Moreover, a standardized approach for a neural network which has more than one non-homogeneous output does not exist. Non-homogeneous data lines can be considered those relating to different physical quantities (generally having different measurement units). With regard to the first point, it seems quite obvious that a connection $j\!-\!k$ should be eliminated when, during the training phase, the value of sensitivity drops under a threshold value: $BLSI_{jk} < a_{th}$. This approach requires the choice, in some way, of a threshold value. This might be done in a completely empirical way [15], with no control on the pruning strength, by adopting a "trial and error" process. This method is particularly onerous, given the nature of the problem and the fact that every pruning action should be followed by a new training phase in order to assess the quality of the pruning. A more rigorous approach can be introduced by using a quantitative definition of "overfitting" in cooperation with the bi-local sensitivity and its average value. Taking into account the previous considerations and the multi-output general nature of the system, the mean value of the bi-local sensitivity can be taken as a good reference value for the cut-off. The following rule seems to be reasonable for determining how many connections should be pruned:

when a pruning step occurs, prune all those connections between neuron $j$ of layer $h$ and neuron $k$ of layer $h-1$ whose weights $w^{(h)}_{jk}$ satisfy

$$ BLSI^{(q)}_{jk} < \lambda^{(q)}\, \mu^{(q)}_{BLSI} \quad \text{for any } q = 1,\ldots,N_{out} \text{ and for some given } \lambda^{(q)}, \qquad (10) $$

where $\mu^{(q)}_{BLSI}$ is the mean value of the bi-local sensitivity over the entire network. The parameter $\lambda^{(q)}$ should be dynamically chosen in order to ensure the best level of pruning in relation to the amount of overfitting observed in the network at a given epoch.

In order to do this, a quantitative measure of "overfitting" must be introduced in a way similar to the formulation proposed by Prechelt [17]. Let $\epsilon$ be the target error, that is, the chosen objective function of the training algorithm; it could be $\epsilon = SSE$, the sum of output squared errors, or an analogous definition. After adding the apex $(q)$ to $\epsilon$, $\epsilon^{(q)}$ takes on the meaning of a target error related only to the output $q$ (thus it could be $\epsilon^{(q)} = SE^{(q)}$). Then let $\epsilon_{tr}(n)$ be the error calculated at epoch $n$ over the training set and $\epsilon_{va}(n)$ the one calculated at epoch $n$ over the validation set, during the training phase. Moreover, let $\epsilon_{min}(n)$ be the lowest validation error obtained before epoch $n$:

$$ \epsilon_{min}(n) = \min_{n' < n} \epsilon_{va}(n'). \qquad (11) $$

Now a generalization decay (GD) at epoch $n$ can be defined (in percent) for the global case:

$$ GD(n) = \left( \frac{\epsilon_{va}(n)}{\epsilon_{min}(n)} - 1 \right) \times 100 \qquad (12) $$


and for the single output case:

$$ GD^{(q)}(n) = \left( \frac{\epsilon^{(q)}_{va}(n)}{\epsilon^{(q)}_{min}(n)} - 1 \right) \times 100. \qquad (13) $$

As the previous equations state, the GD expresses the relative increase of the validation error over the minimum-so-far, in percent. From a general point of view, a high value of GD is a good starting point for deciding the epoch of pruning, but it is still not enough. In fact, the generalization error trend can oscillate rapidly, mostly in the initial phase of training. Therefore, a better criterion is to start the pruning not when GD is high, but when it increases. In order to correctly evaluate when this happens, a further parameter can be defined, called $GDUP_s$. It is a Boolean quantity calculated at the $n$th epoch of training.

$GDUP_s(n)$ is true if:

$GD(k) > 0$ during the last $s$ epochs, that is, for $k = n, n-1, \ldots, n-s$;

$GD(n) - GD(n-1) > 0$.

Again, these quantities can be referred to a single output, $GDUP^{(q)}_s$. This criterion can be seen as an "early stopping" one.
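A possible sketch of Eqs. (11)-(13) and of the GDUP_s flag; the history lists are hypothetical bookkeeping, holding one value per epoch:

    def generalization_decay(err_va_history):
        # err_va_history: validation errors up to the current epoch n (global or per output)
        if len(err_va_history) < 2:
            return 0.0
        e_min = min(err_va_history[:-1])                     # minimum before epoch n, Eq. (11)
        return (err_va_history[-1] / e_min - 1.0) * 100.0    # Eq. (12)/(13), in percent

    def gdup(gd_history, s=5):
        # true when GD stayed positive over the last s epochs and increased at the last one
        if len(gd_history) < s + 1:
            return False
        recent = gd_history[-(s + 1):]
        return all(g > 0.0 for g in recent) and recent[-1] > recent[-2]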

Finally, $\lambda^{(q)}$ must be chosen. As previously pointed out, it seems reasonable to make this choice in an adaptive way, taking into account the fact that the pruning strength must be linked to the value of GD (and must increase with it). Thus a suitable choice could be

$$ \lambda^{(q)}(GD^{(q)}(n)) = \lambda^{(q)}_{MAX} \left( 1 - \frac{1}{1 + GD^{(q)}(n)} \right). \qquad (14) $$

In this way the value of $\lambda^{(q)}$ depends directly on two factors: the value of $GD^{(q)}$ (and in turn the epoch $n$) and the chosen $\lambda^{(q)}_{MAX}$. This double dependence is appropriate: it simultaneously allows balancing the pruning strength in relation to the quality of generalization while still leaving a degree of freedom on $\lambda^{(q)}_{MAX}$, which can be adequately dimensioned taking into account the nature of the network.
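Combining Eqs. (10) and (14), a per-output pruning mask for one layer can be sketched as follows (hypothetical names; how the masks of the different outputs are combined then follows the rule of Eq. (10)):

    def pruning_mask(blsi_q_layer, gd_q, lambda_max_q, mu_blsi_q):
        # Eq. (14): pruning strength grows with the generalization decay of output q
        lam_q = lambda_max_q * (1.0 - 1.0 / (1.0 + gd_q))
        # Eq. (10): candidate connections whose BLSI(q) falls below lambda(q) times the mean BLSI(q)
        return blsi_q_layer < lam_q * mu_blsi_q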

The FANN predisposed to the pruning has, in general, $N_{out}$ outputs. A multi-output network can be useful and might work better than a single output one in all those cases where the correlation among the output data is not negligible. On the other hand, this kind of neural network might present situations or ranges where one output is more significant than the others. In the application example this case will be clarified. At present, the assumption can be made that the network has $N_{out}$ outputs and that they are not all important at the same level; an importance degree ID(q) has been assigned to each output $q$. A value ID(q) $\leq 1$ can be used to de-emphasize one output in relation to another one.

In this paper it is not discussed how to determine the ID. In other words, the knowledge of the output importance is "a priori" information and its estimation depends on the specific application. Attention is not given to "how" this is done: it could come from heuristic deduction or experimental evidence, as a sort of "fuzzy" linguistic datum, or it could be the result of a previous elaboration. It is used to emphasize or de-emphasize the single output line, referring to the expected prediction error and not to the reliability of the available data. In any case, the development of the algorithm proposed here is independent of the way in which the ID value has been obtained. In other words, it belongs to the space of the "input" data of the algorithm. The ID(q) is used during the first pruning step to optimize the values of the $\lambda^{(q)}_{MAX}$'s. The choice of the optimization method could be discussed but, for the sake of brevity and because of its robustness, a simple genetic algorithm has been chosen. The genetic algorithm is applied to the specific case as illustrated here:

1. When the conditions for pruning arise, a genetic algorithm is prepared to extract the $\lambda^{(q)}_{MAX}$'s, for $q = 1,\ldots,N_{out}$.

2. The chromosomes of the genetic algorithm are just a representation of the $\lambda^{(q)}_{MAX}$'s; in the first step they are randomly initialized at a value around unity, and in each step of the algorithm the network is pruned with those values, following the rules given by Eqs. (10) and (14).

3. The fitness $F$ of the algorithm is calculated by combining the root mean square error of the network with the ID assigned to each output:

$$ F = \frac{1}{N_{out}} \sqrt{ \sum_{q=1}^{N_{out}} (ID^{(q)})^2 (d_q - o_q)^2 }. $$

After mutation and recombination, the new values of the $\lambda^{(q)}_{MAX}$'s are re-calculated. The procedure is stopped when the value of the fitness is stabilized (usually after a few tens of epochs).
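As a sketch, the ID-weighted fitness above can be evaluated for a candidate chromosome as follows (names are hypothetical; d and o are the target and network outputs obtained after pruning with the candidate λ(q)MAX values):

    import numpy as np

    def fitness(id_q, d, o):
        id_q, d, o = map(np.asarray, (id_q, d, o))
        # F = (1/Nout) * sqrt( sum_q ID(q)^2 (d_q - o_q)^2 )
        return np.sqrt(np.sum(id_q ** 2 * (d - o) ** 2)) / len(id_q)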

4. Discussion and algorithm

The proposed pruning scheme for a FANN has general validity for any kind of network, but it is particularly appropriate for networks with multiple heterogeneous outputs. It does not guarantee that a generalization error reduction is achieved (but none of the available methods does [17]); instead, it provides a rigorous approach in an attempt to optimize the structure of a neural network in the specific case of multiple heterogeneous outputs. Further considerations about the "bi-local" nature of the algorithm can be made in detail.

The defined index is local with respect to the layers. In [15] it is shown that this approach allows better generalization capability, avoiding sub-optimal reduction where even important connections are prematurely pruned.

The BLSI index is also local to any single output line of the network.


In some applications, the use of multi-output heterogeneous networks allows for exploiting the correlation among the output data during the backpropagation process [9]. On the other hand, it is plausible that the output data vary in a given "non-homogeneous" space of the solutions. Because of this, the importance of each network output depends on the "working point" of the network itself. In other words, for a complex architecture which includes several networks, one output may be more important than another in a given network and less important than the same one in another network. The suggested approach directly permits assigning the appropriate ID to each output in the given situation.

On the basis of the previous considerations and explanations, the procedure can be summarized in the following pseudo-code algorithm:

Choose ID(q) for each q = 1,…,Nout and the term k;
do
    Train the network for one epoch;
    if epoch MOD k = 0 then
        Compute εva(n), εmin(n), GD(q)(n) using the validation set, for each q = 1,…,Nout;
    endif
while (minq(GD(q)(n)) ≤ 1)
reset_network(to εmin);   // reload the network parameters exhibiting the minimum validation error
do
    Train the network for one epoch and compute the BLSI(q) values;
    if epoch MOD k = 0 then
        Compute εva(n), εmin(n), GD(q)(n) using the validation set, for each q = 1,…,Nout;
        if GDUP5(n) is true then
            prime the genetic algorithm to calculate the λ(q)MAX's, for q = 1,…,Nout;
            prune all connections whose weights satisfy BLSI(q)jk < λ(q)·μ(q)BLSI, where λ(q) is calculated using the previous step and formula (14);
        endif
    endif
while (epoch < epochMAX AND εva(n) > εMIN)

epochMAX and εMIN are empirical constants. Moreover, for GDUPs, s has been chosen equal to 5 in the algorithm, empirically assuming 5 epochs as an opportune range for early stopping control [15].

5. Simulation results

To test the complete procedure, a case study will be discussed here. It is a multiple heterogeneous output case that will also better clarify the procedure.

The data set used is formed by synthetic spectral radiances, the outputs of a High Resolution Infrared Atmospheric Sounding Interferometer (IASI). These have been generated on the basis of the line-by-line forward radiative transfer model called s-IASI [1], which has been designed to match the spectral range of the IASI interferometer for fast computation of spectral radiance. In this case, the aim of the network is to invert geophysical parameters of meteorological interest, such as temperature, water vapor and ozone profiles, from high-resolution infrared sensor spectra measured by the interferometer. These parameters form a "heterogeneous" set, because they are physical quantities which measure different entities using different units (water vapor in percent, ozone concentration, temperature in Kelvin degrees). The complete set of data is formed by 8461 potential channels to be exploited for the inversion of geophysical parameters. The classical approach to retrieve geophysical parameters from infrared radiance relies on physical inversion schemes, which are inevitably extremely slow. The computation time required for solving the inverse problem in this case becomes prohibitive, especially when the inversions are intended to feed numerical weather forecast models, which need to run in real time. The implemented neural system is a multilayer backpropagation feedforward network that comes from a simultaneous strategy: temperature, water vapor and ozone are simultaneously retrieved at each atmospheric layer (Fig. 1).

The present architecture of the net is generic and can therefore be adapted to different instruments and atmospheric layering by changing the appropriate input parameters. The following choices have been made:

• The first 40 s-IASI atmospheric layers have been used to represent the atmosphere from the ground level to the average altitude of the ER-2 airplane, assumed to be 20 km; consequently, 40 neural nets corresponding to the 40 atmospheric layers from 0 to 20 km have been implemented and trained. Each neural net yields the triplet (T, H2O, O3) corresponding to a given atmospheric layer.
• The surface temperature is included in the first network, which then has four outputs.
• The large number of channels (8461) makes handling the full spectra prohibitive. Thus, convenient subsets of channels have been selected according to the characteristics of absorption by the gaseous constituents of the earth's atmosphere. This pre-elaboration leads to a spectral radiance vector obtained by considering the following four spectral ranges: 670–800, 1010–1080, 1350–1450 and 2160–2260 cm−1, for a total of 1604 spectral data points.

The total number of 1604 selected spectral ordinates is still too high to present all channels directly to the network. Moreover, the information content of these channels is highly redundant, so the input may be efficiently reduced by resorting to a principal component analysis (PCA) pre-elaboration. In this way the data space (spectral radiance) is initially represented by 50 principal components obtained by a truncated Hotelling transform of the original radiance vector.


Fig. 1. Neural network architecture.

Table 1
Importance degree of the network outputs on some atmospheric layers

Output position   Quantity            ID(q)
1                 Surface T           1.0
2                 1st layer air T     0.5
3                 1st layer H2O       0.3
4                 1st layer O3        0.1
1                 12th layer air T    1.0
2                 12th layer H2O      0.7
3                 12th layer O3       0.7
1                 30th layer air T    0.2
2                 30th layer H2O      0.05
3                 30th layer O3       1.0


The Hotelling or PCA transform is obtained by singular value decomposition of the covariance operator of the training data set [3]. This phase is performed offline, and this initial choice guarantees a projection fidelity index of 99.5%. The 99.5% fidelity index means that the retained 50 components contribute 99.5% of the total variation (variance) in the data set. The initial range of the network parameters (where they are randomly distributed) is not a critical choice, and it has been determined by some empirical tests in order to obtain a good starting convergence rate. The network consists of the input layer, two hidden layers of 40 and 10 neurons, respectively, and 3 output neurons (4 in the first network, which includes the surface temperature). The large number of input channels requires the use of PCA as explained before, taking into consideration that it is not so simple to choose the number of principal components to be used; the fidelity index can be used, but it is not a fully reliable indicator. Each layer is then trained by using the previous algorithm. Because each layer is represented by one network, only one of them will be taken into account.
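As an illustration of the pre-elaboration described above (a standard truncated PCA/Hotelling transform, not the paper's original code; array names are hypothetical), the number of retained components can be chosen from a target fidelity index:

    import numpy as np

    def truncated_pca(radiances, fidelity=0.995):
        # radiances: (n_samples, n_channels) matrix of the selected spectral radiances
        mean = radiances.mean(axis=0)
        centered = radiances - mean
        U, s, Vt = np.linalg.svd(centered, full_matrices=False)
        var = s ** 2
        k = int(np.searchsorted(np.cumsum(var) / var.sum(), fidelity)) + 1
        components = Vt[:k]                  # retained principal directions
        scores = centered @ components.T     # reduced input vectors fed to the networks
        return scores, components, mean, k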

The considered layer is the first one, which also includes the earth's surface temperature; therefore this portion of the network has 4 output lines. The available data set is formed by 377 samples, 200 of which are used for training and the remaining ones for validation. Each sample is constituted by a pair of I/O vectors, where the dimension of the input vector is initially 50 and the dimension of the output vector is 3 (4 in the first network). The algorithm given in the previous section is then applied. In Table 1 the preliminary choice of ID(q) is reported. The choice is motivated by the expected quality of the given output line; it is not essential to use a strict evaluation, but just a qualitative one. In the lowest layer the best performance is expected for the surface temperature, to which a unitary value of ID(1) has been assigned. The first air temperature generates an error which is about twice as large, while the water vapor relative error is about 3 times greater. Finally, the ozone error is much larger at the earth's surface level and its contribution is completely de-emphasized, attributing a very small value to ID(4). In other words, for this specific application, the determination of the ID comes from physical model considerations, confirmed by the heuristic evaluation of existing data. In this application, due to the 40 nets modeling the atmosphere and their corresponding 3 outputs, the number of IDs to be chosen is quite high. However, this is not the focal point of the work, for two reasons: (1) they are parameters "less" heuristic than those usually introduced in a neural network (initial weights, learning parameter, momentum term, etc.); moreover, their choice can descend from expert knowledge or other analytical approaches that will be delved into in future works, and they do not affect the statement of the present paper; (2) in this particular example the number is actually quite high, but in other kinds of systems it could be much smaller, and the importance resides in the generality of the approach. On the other hand, in this particular example, the IDs essentially come from expert knowledge and could be grouped together, so they actually are a subset of the total (number of networks × number of outputs).


In Fig. 2 the overall error trend is plotted both for the training and the validation data, and the epochs at which pruning occurs are clearly evident. In Table 2 the values of the $\lambda^{(q)}_{MAX}$ as calculated by the genetic algorithm are reported, along with the validation error just before the pruning step and after some epochs. The "before pruning" error reported in Table 2 coincides with the one obtained in previous works on the same data [10], when pruning was not performed. Moreover, in Table 2, the input dimension and the number of connections are given before and after the pruning steps. It must be taken into account that the first pruning step is in any case the most important one, even if successive refinements are eventually done, following the algorithm steps. The input size has decreased from the initial 50 principal components to 28; the other 22 inputs have been removed by the pruning process itself, so they can be discarded with no information loss.

Fig. 2. rms error curves on training (continuous line) and validation (dashed line) data for the first atmospheric layer.

Table 2
Results after the pruning steps on some atmospheric layers

1st layer                   λ(q)MAX   Before pruning   After 1st step   After 2nd step   After 3rd step
                                      (epoch 60)       (epoch 70)       (epoch 105)      (epoch 160)
Surface T (error in K)      0.5       0.48             0.44             0.44             0.43
Air T (error in K)          0.95      1.33             1.22             1.19             1.19
H2O (rms % error)           0.64      12.12            12.00            11.47            11.37
O3 (rms % error)            0.42      34.24            32.68            32.90            32.80
Input dimension             –         50               32               32               28
Total weight number         –         2494             654              590              562

12th layer                            (epoch 70)       (epoch 90)       (epoch 110)      (epoch 180)
Air T (error in K)          0.82      1.69             1.56             1.50             1.50
H2O (rms % error)           0.54      17.10            18.42            18.00            17.20
O3 (rms % error)            0.48      16.84            15.80            15.85            15.30
Input dimension             –         50               40               35               35
Total weight number         –         2483             965              944              911

30th layer                            (epoch 57)       (epoch 70)       (epoch 145)      (no more pruning)
Air T (error in K)          0.5       1.01             9.96             0.92             –
H2O (rms % error)           0.15      35.12            33.37            32.15            –
O3 (rms % error)            0.42      10.20            9.90             9.34             –
Input dimension             –         50               32               30               –
Total weight number         –         2483             782              712              –

The error adjustments are not huge, yet the overall procedure allows not only an improvement in generalization, but also a correct dimensioning of the input and network size. The error on the validation data is reduced after pruning; in general this can be more or less evident, or almost absent, but it is common to every pruning method. The better generalization, even if slight, arises along with a drastic weight reduction and the dropping of input lines, also helping in the right choice of the size of the input space after the principal component extraction. In Table 2 the number of weights removed after each step is also shown.

6. Conclusions

The method for training and pruning a FANN proposed in this paper combines several aspects in order to optimize the architecture of neural networks with several inter-correlated outputs which have different importance inside the same network. The results are encouraging enough to continue and extend these methods towards a more generic theory. A more generic theory could remove the manual choice of the importance degree (ID), which should then be determined in an automatic way, adding a module which evaluates the correlation between the outputs and a given network (the specific case corresponds to a given atmospheric layer, but the theory should be generalized to every typology of system).

Acknowledgments

Data shown in the example are provided by the European center for the exploitation of meteorological


satellites, which funded this research (Contract EUM/CO/02/1053/PS).

References

[1] U. Amato, G. Masiello, C. Serio, M. Viggiano, The s-IASI code for calculation of infrared atmosphere radiance and its derivatives, Environ. Modelling Software 17 (2002) 651–667.

[2] J.A. Benediktsson, J.R. Sveinsson, Multisource remote sensing data classification based on consensus and pruning, IEEE Trans. Geosci. Remote Sensing 41 (4) (2003) 932–936.

[3] A. Cichocki, R. Unbehauen, Neural Networks for Optimization and Signal Processing, Wiley, Stuttgart, 1994.

[4] F. Del Frate, D. Solimini, On neural network algorithms for retrieving forest biomass from SAR data, IEEE Trans. Geosci. Remote Sensing 42 (1) (2004) 24–34.

[5] A.P. Engelbrecht, A new pruning heuristic based on variance analysis of sensitivity information, IEEE Trans. Neural Networks 12 (6) (2001) 1386–1399.

[6] G.B. Huang, P. Saratchandran, N. Sundararajan, A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation, IEEE Trans. Neural Networks 16 (1) (2005) 57–67.

[7] E. Karnin, A simple procedure for pruning backpropagation trained neural networks, IEEE Trans. Neural Networks 1 (1990) 239–242.

[8] P. Lauret, E. Fock, T.A. Mara, A node pruning algorithm based on a Fourier amplitude sensitivity test method, IEEE Trans. Neural Networks 17 (2) (2006) 273–293.

[9] A. Luchetta, C. Serio, M. Viggiano, A neural network to retrieve atmospheric parameters from infrared high resolution sensor spectra, in: Proceedings of the 2003 IEEE International Symposium on Circuits and Systems, 2003.

[10] A. Luchetta, C. Serio, M. Viggiano, A soft computing approach to the elaboration of satellite data, in: Proceedings of the 2005 IEEE Workshop on Soft Computing Applications, 2005.

[11] L. Ma, K. Khorasani, Facial expression recognition using constructive feedforward neural networks, IEEE Trans. Systems Man Cybernet. Part B 34 (3) (2004) 1588–1595.

[12] S. Miyoshi, M. Okada, Storage capacity diverges with synaptic efficiency in an associative memory model with synaptic delay and pruning, IEEE Trans. Neural Networks 15 (5) (2004) 1215–1227.

[13] J.E. Moody, The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems, Adv. Neural Inf. Process. Systems 4 (1992) 847–854.

[14] J. Ni, Q. Song, Dynamic pruning algorithm for multilayer perceptron based neural control systems, Neurocomputing 69 (16–18) (2006) 49–61.

[15] P.V.S. Ponnapalli, K.C. Ho, M. Thomson, A formal selection and pruning algorithm for feedforward artificial neural network optimization, IEEE Trans. Neural Networks 10 (4) (1999) 964–968.

[16] A. Porto, A. Araque, J. Rabunal, J. Dorado, A. Pazos, A new hybrid evolutionary mechanism based on unsupervised learning for connectionist systems, Neurocomputing 70 (16–18) (2007) 2799–2808.

[17] L. Prechelt, Connection pruning with static and adaptive pruning schedules, Neurocomputing 16 (1) (1997) 49–61.

[18] E. Ricci, R. Perfetti, Improved pruning strategy for radial basis function networks with dynamic decay adjustment, Neurocomputing 69 (13–15) (2006) 1728–1732.

[19] H. Shi, L. Wang, Broadcast scheduling in wireless multihop networks using a neural-network-based hybrid algorithm, Neural Networks 18 (5–6) (2005) 765–771.

[20] D. Srinivasan, X. Jin, R.L. Cheu, Evaluation of adaptive neural network models for freeway incident detection, IEEE Trans. Intell. Transport. Systems 5 (1) (2004) 1–11.

[21] J. Wang, An improved discrete Hopfield neural network for Max-Cut problems, Neurocomputing 69 (13–15) (2006) 1665–1669.

Antonio Luchetta graduated in electronic engineering at the University of Florence, Italy, in 1993. From 1995 to 2001, he was an Assistant Professor at the Department of Environmental Engineering and Physics of the University of Basilicata, Italy. From November 2001 to October 2005, he was an Assistant Professor at the Department of Electronics and Telecommunications of the University of Florence, where he is at present an Associate Professor. His research interests are in the areas of circuit theory, neural networks, symbolic analysis of analog circuits, and environmental and satellite data elaboration.