
Chemometrics and Intelligent Laboratory Systems 87 (2007) 231–240, www.elsevier.com/locate/chemolab

Automatic design of growing radial basis function neural networks based on neighborhood concepts

Frédéric Ros a, Marco Pintore b, Jacques R. Chrétien b

a GEMALTO, avenue de la Pomme de Pin, St. Cyr en Val, 45060 Orléans Cedex, France
b BioChemics Consulting, 16 rue Leonard de Vinci, 45074 Orléans Cedex 2, France

Received 2 November 2006; accepted 6 February 2007; available online 21 February 2007

Abstract

Despite the reputation of RBFNs (Radial Basis Function Neural Networks), RBFN design is not straightforward since the efficiency of the model depends on many parameters. RBFNs often require many manual parameter adjustments, which is a serious weakness especially when they have to be used automatically. In this paper, a method to design RBFNs for classification problems is proposed, with a view to obtaining classification models rapidly by minimizing manual parameters, with performances very close to the best attainable from numerous trials. The RBFN can be initiated automatically via the use of advanced clustering algorithms adapted to supervised contexts to find preliminary cells. The final architecture is obtained via a growing process controlled by different mechanisms in order to find small and reliable RBF classifiers. A candidate pattern is selected for creating a new unit only if it produces a significant quadratic error while presenting a significant classification potential from its neighborhood properties. The efficiency of the method is demonstrated on artificial and real data sets from the field of chemometrics.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Clustering; Classification; Prototype selection; RBF; k-nearest neighbor method

1. Introduction

Artificial Neural Networks [1,2] are widely used in the biological and pharmaceutical fields because they generally enable the improvement of traditional performances thanks to their ability to take non linearity in the data into account. BPNNs (Back Propagation Neural Networks) [3,4] are the most well-known Neural Networks but the complexity of the modeling functions they provide is often very high. This black box behavior is a serious weakness as it prevents the extraction of relevant information to explain the models and, therefore, to deliver a better understanding of biological mechanisms. RBFNs [5–8] (Radial Basis Function Neural Networks) are less commonly used than BPNNs despite similar theoretical efficiencies and an enhanced power of interpretability. They have, however, gained increasing popularity due to their simple structure, well-established theoretical basis and fast learning speed, which is a crucial factor in real time applications.


Nevertheless, RBF design is not straightforward. In several cases, obtaining reliable models without an exhaustive search proves to be a complex task. The presence of many parameters [9,10] to be tuned makes their use difficult in situations where an on-line and automatic model generation is required. For example, in the specific context of feature selection, parameter adjustments [11] such as the number, positions and widths of gaussian functions have to be determined automatically since RBFs need to be applied continuously during the selection procedure. Many methods found in the literature [12–15] based on varying concepts show promising performances in that they are able to converge and operate quickly. However, very few approaches are simple, interpretable and parameter-less, three key points regarding chemometric applications. Such networks suffer from the need for a large number of parameters. As there is no guide to set up all the parameters, finding the most appropriate ones can be highly time-consuming. The objective of this paper is therefore to propose a simple and efficient way to design RBF neural networks capable of providing reliable results quickly. It has been designed for non-specialists in pattern recognition (chemists) who need to create and interpret classification models. This


Fig. 2. Network evolution with the cycle iterations. Seven units have been created. The quadratic error and classification score have been multiplied by 100 to be in the same range as the number of patterns inside each created hypersphere; R_hyper defines this number.


means that while the reliability of the model is of great importance, the main focus of attention will be on the automatic aspect of the neural network development rather than on the results themselves. Compared to most current studies where only intrinsic performances are looked for, the objective here is to obtain acceptable results by limiting parameter adjustments in order to be application-oriented. This study is limited to the development of RBFNs for classification problems (networks are trained to provide binary outputs). It does not deal with feature selection (the feature space is supposed to be pre-selected).

The current proposal is the result of a long and exacting work to apply RBF in real world problems and of our experience in designing efficient classifiers [16]. It involves various ideas from the literature. The usefulness of the method is shown for classic artificial data often used in the pattern recognition community and for a data set from the field of chemometrics. This paper focuses on various dimensional classification examples that may be prototypical of the current chemometric problems that occur on a large scale.

2. Theory

2.1. Global strategy overview

The RBF proposed in this paper aims at reaching the multiple objectives of being interpretable for chemical purposes, easy to manage and endowed with a good generalization capability. Some native concepts can be found in Cho [17] and significant improvements have been made. The generalization potential of the RBF is strongly dependent on the relevance of the gaussian units: they define the final network architecture and its associated weights. Given this constraint, the idea of the proposed method is to set up constructive rules able to control the number and the relevance of the created units while making network convergence easier.

From an empty neuronal structure or a structure initialized with units provided by an adapted pre-processing procedure, the network is constructed by a dynamic process. Network parameters are optimized with basic gradient procedures. The

Fig. 1. Network configuration obtained after seven iteration cycles giving 98.4% of correct classification. The number close to the unit center defines the insertion order.

network evolves by progressively adding new units until the established configuration is able to handle the data complexity while maintaining the generalization capability. The control of the units added to the initial network configuration is partially based on the introduction of a region of interest with a spherical shape associated to each created unit. This interest region delineates the unit influence, and no additional unit can be inserted inside it. It acts as a protective area and therefore prevents the creation of too many units. At each cycle, corresponding to the propagation of all the database patterns through the network, a reduced subset of candidate patterns for initiating new unit positions is determined. The selected patterns are all outside the existing protective regions of the network. They have to present significant quadratic errors and a significant classification potential revealed by neighborhood properties. For classification purposes, the quadratic error criterion taken alone is not sufficiently relevant to enable the selection of new units.

At each cycle, only a reduced number (say 1) of admissible candidate patterns will contribute to enlarging the network, in order to leave time for optimizing the parameters of the current architecture. Two alternatives are suggested to drive the growing process: the first one consists in selecting, among the candidates presenting a minimum classification potential, those presenting the worst quadratic errors. The second (called the hierarchical mode) consists in selecting, among the candidates presenting insufficient quadratic errors, those having the highest potential for reducing the problem size. The latter strategy proceeds from a coarse description level to levels of increasing detail. The network parameter optimization is favored by dynamically calculating the initial shape of the new units to be inserted into the network. It is also eased via an appropriate initialization of the regression weights. Weight modifications are also controlled to avoid premature and unjustified updates during the growing process. The network evolves by firstly focusing on the worst quadratic errors; the smallest ones are considered later. Modifications are supervised by a variable threshold which decreases with the number of iterations down to a low limit. The global network


evolution can also be controlled by cross validation if there is a validation database available.

Figs. 1 and 2 illustrate the growing process applied to a simple classification problem from an empty neuronal structure. With only seven inserted units and seven iteration cycles it is possible to construct an RBF with more than 98.4% of correct classification. Fig. 3 illustrates the growing process applied to the same classification problem from an initial neuronal structure provided by the pre-processing phase developed in a previous paper [18]. From the initial configuration consisting of two initial units per category, the growing process succeeds in completing the architecture with 4 additional units, presenting more than 98.4% of correct classification.

2.2. Network architecture

Radial Basis Function networks were introduced into the neural network literature by Broomhead and Lowe [19]. Some previous work by Micchelli [20] and Duda and Hart [21] explored these topics from a more mathematical point of view. Moody and Darken's study [22] is currently taken as a reference in most papers related to RBF neural networks. These researchers are considered as the first users of RBF techniques in the field of neural networks. The RBF network model is motivated by the locally tuned response observed in biological neurons. By using overlapping localized regions, complex decision regions can be obtained. It has been shown that with localized units one hidden layer suffices in principle to approximate any continuous function. The architecture of RBF used in this paper is classical, close to that proposed by Moody and Darken; the set up and training algorithm is more specific. Such networks consist of one layer of receptive field units. The ith receptive field unit denoted by Gi(x) is typically chosen as:

$$G_i(x) = \exp\left(-\sum_{j=1}^{n} \frac{(x_j - c_{ij})^2}{2\,\sigma_{ij}^2}\right), \qquad i = 1, 2, \ldots, n\_unit \qquad (1)$$

where x is an n-dimensional input vector, ci is a vector with the same dimension as x, σi is the vector variance of the ith receptive field unit, and n_unit is the number of receptive field

Fig. 3. Iteration cycles giving 98.4% of correct classification. The first four units are provided by the pre-processing step and the others by the growing process.

units. The output is obtained by a normalized linear combination of the basis function units:

$$O_k(x) = \frac{\sum_{i=1}^{n\_unit} G_i(x)\, w_{ik}}{\sum_{i=1}^{n\_unit} G_i(x)}, \qquad k = 1, 2, \ldots, p \qquad (2)$$

where p is the number of output nodes corresponding to the category number, and k the output index. Various other implementations with a serious scientific basis have been proposed in the literature. We argue that the simplicity of our approach, which minimizes the degrees of freedom of the system, is a key point to lead to generalist systems. It also contributes to improving interpretability, a notion often neglected by designers despite its importance for applications.
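To make Eqs. (1) and (2) concrete, a minimal NumPy sketch of the forward pass could read as follows; the array names (centers, sigmas, weights) are illustrative and not part of the published implementation.

```python
import numpy as np

def rbf_activations(x, centers, sigmas):
    """Eq. (1): Gaussian response G_i(x) of each receptive field unit.

    x       : (n,) input pattern
    centers : (n_unit, n) unit centers c_i
    sigmas  : (n_unit, n) per-dimension widths sigma_ij
    """
    return np.exp(-np.sum((x - centers) ** 2 / (2.0 * sigmas ** 2), axis=1))

def network_output(x, centers, sigmas, weights):
    """Eq. (2): normalized linear combination of the basis units.

    weights : (n_unit, p) regression weights w_ik, one column per category output
    """
    g = rbf_activations(x, centers, sigmas)
    return g @ weights / np.sum(g)
```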

2.3. Preprocessing phase for network initialization

This part is detailed in a previous paper [18]. It consists in selecting among the training set relevant patterns able to represent hidden cells of the RBFN. The double selection procedure QSC+FCSM (Quick Supervised Clustering + Fuzzy Supervised C-Means) is supervised and therefore takes into account the probability distribution of the different categories to be discriminated. The QSC algorithm is based on a dual distance principle and a neighborhood concept, which aims at finding leaders iteratively by respecting the following main rules: each new leader has to be at the same time far from the average of all the selected leaders and far from the position average of the leader group while presenting a classification potential. FCSM is a derivative of the well-known Fuzzy C-means algorithm to which supervised constraints have been added, leading rapidly to a better leader distribution in the descriptor space. The role of the initialization procedure is to select a few pattern leaders in each category in order to define the skeleton of the RBF neural network. The resulting preliminary group of units is not automatically complete under this strategy, and has to be combined with a growing approach to specify the structure and the network parameters. Fig. 3 illustrates this point: the four units provided by the approach are not sufficient to match the data structure but give a "good" network when completed with four other units. The consistency of each leader i of category j is defined by an influence hypersphere of radius Ri containing only patterns belonging to category j.

$$R_i = k_{ratio} \cdot D_i, \quad \text{with} \quad D_i = \min\{\, d(c_i, x_k) : k \in [1, n\_pattern],\ \mathrm{cat}(x_k) \neq \mathrm{cat}(c_i) \,\} \qquad (3)$$

ci is the leader position, xk the kth training pattern and kratio a constant in the range [0.5, 1]. We also define Rinit(i) as the minimum of the radii found for the ith category.

We should underline that this pre-processing phase can speed up the growing process. It is not mandatory if the hierarchical selection alternative is selected.
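Assuming Euclidean distances, the influence radius of Eq. (3) could be computed as in the sketch below (all names are illustrative):

```python
import numpy as np

def influence_radius(center, center_cat, patterns, categories, k_ratio=0.75):
    """Eq. (3): radius of a leader's influence hypersphere.

    The hypersphere must contain only patterns of the leader's own category,
    so the radius is proportional to the distance from the leader to the
    nearest pattern of any other category.
    """
    other = patterns[categories != center_cat]               # patterns of the other categories
    d_min = np.min(np.linalg.norm(other - center, axis=1))   # nearest "foreign" pattern
    return k_ratio * d_min
```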


2.4. Growing process

The process is very simple and does not require parameters to be adjusted, as the proposed strategy is driven by rules that limit the network degrees of freedom. The parameters involved in the growing process play a secondary role, and can be defined automatically:

The kv parameter expresses the minimum number of nearest neighbors belonging to the same category for a unit candidate to be inserted in the network. We have adopted the following simple rule: $k_v = \lambda \sqrt{n}$, where n is the pattern number of the active category and λ ≈ 0.3. However, kv = 0 can be used, especially when hierarchical selection (see Section 2.5.2) is adopted. As the unit creation is limited by the algorithm construction, the unit created at this low limit can help the regression part of the network to obtain better classification results.

The kratio parameter [0.5, 1] serves to control the radius of the influence region of a given new unit.

The gradient parameters αw, αc, ασ, respectively for optimizing the neural weights, unit positions and shapes via the gradient rule, are set constant and assigned to small values [0.005, 0.01].

The threshold error em from which a network modification can occur is set to ∼0.6 and decreases linearly with the iterations down to a low limit (∼0.4). Let tik be the desired value of the kth output unit for the ith training pattern, yik the actual value of the kth output unit when the ith training pattern is propagated through the network, C the set of candidate patterns at the current iteration, H the set of existing protecting hyperspheres and Count the number of iterations without unit creation.

Initial settings: set up the different parameters. Launch the preprocessing step to set up the initial network configuration, or start from an empty neural configuration.

Do until convergence:

Step 1:
• For each pattern i, calculate $e_i = \frac{1}{p}\sum_{k=1}^{p} \| t_{ik} - y_{ik} \|$
  ○ If pattern i ∉ H and e_i ≥ e_m and the creation process is active, then add it to C.
  ○ If pattern i ∈ H and e_i ≥ e_m, modify the neural weights (Section 2.5.4).

Step 2:
• From C, select the best candidates (Section 2.5.2).
  ○ If there is a valid pattern candidate, create a new cell centered on it and initialize the unit parameters Ri, σi, ci (Sections 2.5.3 and 2.5.4).
  ○ Else, Count = Count + 1; if Count exceeds 10% of the training population, freeze the creation process.

Step 3:
• Calculate the global error for the training and cross validation sets.
• Update the controlling parameters (em, αw, …).
• If the network shows satisfactory results on the training and cross validation sets, or the maximum number of iterations is reached, end; otherwise, go to Step 1.
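As an illustration of Step 1 above, the sketch below computes the per-pattern error e_i and splits the patterns into candidates for unit creation and patterns that only trigger a weight update; the names are ours, and the desired and actual outputs are assumed to be stored as (n_pattern, p) arrays T and Y.

```python
import numpy as np

def growing_cycle_step1(T, Y, in_hypersphere, e_m, creation_active=True):
    """Step 1 of the growing cycle.

    T, Y           : (n_pattern, p) desired and actual outputs t_ik, y_ik
    in_hypersphere : (n_pattern,) bool, True if the pattern lies inside an
                     existing protective hypersphere (the set H)
    e_m            : current quadratic error threshold (decreases with the cycles)
    """
    errors = np.mean(np.abs(T - Y), axis=1)            # e_i = (1/p) * sum_k |t_ik - y_ik|
    high = errors >= e_m
    candidates = np.where(high & ~in_hypersphere)[0] if creation_active else np.array([], int)
    to_update = np.where(high & in_hypersphere)[0]     # weight modification only
    return errors, candidates, to_update
```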

2.5. Growing control

2.5.1. Limiting modifications and admissible pattern candidates

Minimizing the sum-of-squares error during the training is well adapted to function approximation. It needs to be reconsidered for classification problems. A non-zero error can lead to a perfect category separation while a small error results in a rough classification. Reducing the quadratic error generated by a well-classified pattern may have a bad effect on the overall network performances. This is especially verified in the case of a complex problem with overlapping categories.

This aspect is a key point and needs to be carefully managed. Regarding the defined architecture, the network output is obtained by a normalized linear combination of the basis function units. The level of the quadratic error generated by a pattern has however a real significance in terms of classification performance. It facilitates the definition of a relevant threshold to freeze or not any network modification (unit creation or parameter unit update).

The quadratic threshold error em decreases with the number of iterations until a low limit is reached. This enables a better control of the final structure of the network. By focusing first on the worst errors, the creation of new units is prioritized, thus avoiding a premature (and sometimes useless) weight optimization. By starting with a rather large threshold, the number of candidate patterns is limited to the patterns that are the worst for classification purposes. This enables the pattern space to be more completely explored, by inserting units in unexplored regions and in others where the data complexity imposes the presence of additional units. The presence of a low limit (say 0.4 for two classes) for the threshold is more important for the network modification than for the insertion (controlled via neighborhood considerations). If a pattern is already well classified, decreasing its corresponding quadratic error is not justified. It can even decrease the overall classification performance.

2.5.2. Candidate selection for creating new units

All the admissible patterns are stored in C (c1, …, cp). They all present a quadratic error greater than em. The selection procedure consists in selecting the best one according to a neighborhood criterion directly linked to the classification objective. This selection is done via a first step consisting in disabling all the patterns which do not present the minimum requirement according to their nearest neighbors. The remaining ones are all "competent" for a classification purpose. Let "active" be the category of a candidate pattern. If the kv nearest neighbors of this pattern are from the same category "active", then this candidate is selected and placed in C' (c1, …, cp'); otherwise it is disabled for the rest of the growing process. If C' is empty, no unit can be inserted in the network, thus producing a network with good generalization properties. All the patterns contained in C' present a good classification potential. It means that the simplest and quickest sorting procedure consists in selecting the pattern with the largest quadratic error.


This first alternative is particularly suitable when a pre-processing phase has been performed. The hierarchical alternative consists in selecting the pattern with the best potential for the classification purpose according to its local properties. As already mentioned, kv has a secondary influence in this case. The winner is the pattern presenting the largest number of nearest neighbors (say nl) belonging to the "active" category. This has the advantage of reducing the problem complexity as a function of its size: for an application of n training patterns, this selection can be estimated as a size reduction of nl/n. At each iteration cycle, nl being the maximum according to the proposed criterion, the strong network structure is quickly reached and the optimization can be easily completed. The mathematical expression to determine the winner cwin is the following:

$$c_{win} = c_{i'}, \quad \text{where } n_l = \max_{i \in [1, p']} O_c(c_i, D_i) \text{ is reached for } i = i', \quad D_i = \min(d_1(i), d_2(i)) \qquad (4)$$

Oc(x, r) defines the number of patterns inside the hypersphere (x, r) containing only patterns of the same category as x. d1(i) = β · min(d(ci, ck)) for k ∈ [1, n_unit]; if cat(ci) = cat(ck) then β is usually fixed to 0.6 [0.5, 1], else to 0.99 [0.9, 1]. The initial radius of a new unit always has to be bounded above by the distance to the nearest unit. This distance is useful when the nearest unit belongs to the same category as ci, as it conserves the notion of local influence. d2(i) = min(d(ci, xk)) for k ∈ [1, n_pattern] and cat(ci) ≠ cat(xk). The initial radius of a new unit always has to be bounded above by the distance to the nearest pattern of a different category. It should be mentioned that Di = d2(i) when the nearest unit of cwin belongs to a different category.
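The two selection alternatives of this section can be sketched as follows. For readability, the hierarchical count O_c is computed here with a simplified radius (the distance to the nearest pattern of another category) rather than the full D_i = min(d_1, d_2) of Eq. (4), and all names are illustrative.

```python
import numpy as np

def select_winner(cand_idx, X, y, k_v, errors, hierarchical=True):
    """Filter candidates by their k_v nearest neighbours, then pick the winner."""
    valid = []
    for i in cand_idx:
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k_v + 1]                  # k_v nearest neighbours, self excluded
        if k_v == 0 or np.all(y[nn] == y[i]):
            valid.append(i)                            # the candidate joins C'
    if not valid:
        return None                                    # C' empty: no unit inserted this cycle

    if not hierarchical:
        # first alternative: largest quadratic error among admissible candidates
        return max(valid, key=lambda i: errors[i])

    def same_category_count(i):                        # simplified O_c(c_i, D_i)
        d = np.linalg.norm(X - X[i], axis=1)
        radius = np.min(d[y != y[i]])                  # nearest pattern of another category
        return np.sum((d < radius) & (y == y[i]))

    # hierarchical alternative: candidate whose hypersphere covers the most patterns
    return max(valid, key=same_category_count)
```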

2.5.3. Shape initialization

One of the most important aspects to be addressed is the determination of the width of the unit function, and its inherent complexity. These parameters control the amount of overlapping of the kernel functions as well as the network generalization. Small values yield a rapidly decreasing function whereas larger values result in a more gently varying function. It is clear that the setting of kernel widths for classification aims is a critical issue. The goal is to set up the width in such a way as to obtain good generalization. Many heuristics and techniques based on solid mathematical concepts have been proposed in the literature [23,24]. They generally take into account the statistical distribution of the training patterns attracted by each unit or the proximity of the units. We argue that complex width estimation is not worth investigating: firstly, the local distribution serving for the estimation may not be statistically representative enough to reach a good estimation. Secondly, complex width estimation is time-consuming. In addition, an approximate estimation can be easily compensated by the regression part of the neural network: there is a question of balance between the kernel number and their shapes. Simple shapes produce more interpretable classifiers, which is a relevant factor for the chemometric field.

In this paper, the RBF growing approach is based on the concept of hypersphere strategies. The role of a hypersphere is to prevent the creation of too many cells during the growing process. It represents a region of interest delineating the strong influence of a given cluster. In this zone, no additional cell can be added. For instance, B.K. Cho et al. [17] propose to create an effective radius describing the accommodation boundary of the ith hidden node by taking into account the shape of the associated unit. In fact, their idea is to define the same radius for the influence region whenever a new unit is created. Both σi and Ri are defined without taking into account the training data distribution. This classic scheme of involving hyperspheres can work well in academic and non-complex classification problems (without overlapping, and small dimension space). For our objective, however, where the parameter adjustment has to be minimized, these static approaches are not suitable despite their intrinsic possibilities: many trials may be needed to obtain a correct configuration. If influence regions are globally too small compared to the data distribution, the network is likely to lose the ability to generalize. On the contrary, if they are too large, the network may lack units, blocking its evolution. Generally, when data come from spaces of more than four dimensions, flexible approaches [25] are preferable.

There are two possibilities to set up the Ri and σi associated to ci.

If the best candidate has been selected under the first alternative, coupling quadratic error and nearest neighbors, Ri and σi are set up by considering the Rinit(i) determined during the pre-processing phase (if performed), as these values reflect the data structure complexity.

$$R_i = R_{init}(c), \ c \text{ being the category of } c_i; \qquad \sigma_{ij} = R_i \ \text{for } j \in [1, n] \qquad (5)$$

The Rinit(i) values are decreased with the number of iterations to insert units with smaller and smaller hyperspheres. If no pre-processing phase is used, Ri is set up according to the distance between the unit center and its kv nearest neighbors.

If the best candidate has been selected under the hierarchical alternative, then the initial radius of the active hypersphere is directly linked to the calculation done to obtain it. The protective region contains only patterns belonging to the same category.

$$R_i = \begin{cases} D_i & \text{if the nearest unit is of the same category} \\ k_{ratio} \cdot D_i, \ k_{ratio} \in [0.5, 1] & \text{otherwise} \end{cases} \qquad \sigma_{ij} = R_i \ \text{for } j \in [1, n] \qquad (6)$$

The role of kratio is to control the influence of the hypersphere. By taking small values of kratio, the number of cells is likely to be greater, as the subspace volume is smaller. By taking large values, the number of cells is likely to be smaller, as the subspace volume is larger. For example, when kratio = 1, Ri represents the exact distance between the new unit center and the nearest pattern of a different category. The hypersphere then has the maximum size. This generates a small overlapping between hyperspheres which is easily compensated for by the regression part of the network.
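Under the hierarchical alternative, the radius and width initialization of Eq. (6) could look like the following sketch (Euclidean distances, illustrative names, and β fixed to 0.6 as suggested above):

```python
import numpy as np

def init_unit_shape(c, c_cat, X, y, unit_centers, unit_cats, k_ratio=0.75, beta=0.6):
    """Eq. (6): initial radius R_i and isotropic width sigma_ij of a new unit at c."""
    d2 = np.min(np.linalg.norm(X[y != c_cat] - c, axis=1))      # nearest pattern of another category
    if len(unit_centers) == 0:
        R = k_ratio * d2                                        # no existing unit yet
    else:
        d_units = np.linalg.norm(np.asarray(unit_centers) - c, axis=1)
        nearest = int(np.argmin(d_units))
        if unit_cats[nearest] == c_cat:
            D = min(beta * d_units[nearest], d2)                # D_i = min(d1, d2), same-category unit
            R = D                                               # R_i = D_i
        else:
            R = k_ratio * d2                                    # D_i = d2, R_i = k_ratio * D_i
    sigma = np.full(X.shape[1], R)                              # sigma_ij = R_i for every dimension j
    return R, sigma
```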


It has to be noted that the proposed hypersphere strategy allows the presence of database patterns outside of any existing region of influence. The algorithm target is not to find and fill hyperspheres until all the patterns of the database are involved. This highlights a significant difference with strategies employed in well-known methods derived from the preliminary RCE (Restricted Coulomb Energy) classification algorithm [26], where every pattern belongs to at least one hypersphere. In the proposed algorithm, if the propagation of a given input pattern lying outside any hypersphere leads to a correct classification result thanks to the regression part of the RBFN, there is no reason to create a new influence hypersphere around it. In the case where this scheme is not possible, there is a question of feature relevance: the performance of pattern recognition systems is basically linked to the method of classification but highly dependent on the measured features representing the pattern.

2.5.4. Weight initialization and optimization

Training data are supplied to the neural network in the form of pairs of input and target vectors. The learning algorithm aims to minimize the sum-of-squares error by modifying the network parameters, namely the neural weights and the positions and shapes of the existing units.

$$E = \sum_{x=1}^{n\_pattern} E(x), \qquad E(x) = \sum_{k=1}^{p} \left( L_k(x) - O_k(x) \right)^2 = \sum_{k=1}^{p} E_k(x)^2, \qquad O_k(x) = \frac{\sum_{j=1}^{n\_unit} G_j(x)\, w_{jk}}{\sum_{j=1}^{n\_unit} G_j(x)} \qquad (7)$$

Lk(x) is the desired output when x is propagated through the network. Modifications are performed through the static gradient descent procedures. Details for the normalized gaussian network can be found in Cho [17]. Different optimization techniques can be applied to find the appropriate parameters. Despite their well-known disadvantages, gradient procedures remain interesting in terms of performances, and chiefly regarding the time consumed.

In order to counteract the well-known drawbacks of the approach (local minima, slowness, …), and in addition to the different controls already proposed, the following set up is suggested. Let wij be the weights of the connections between the new kernel and the p outputs of the network, and let l (l ∈ {1, …, p}) be the category of the kernel being created. The connection between this kernel and the output corresponding to the category l is set to 1 − g1(t) while the other connections are set to 0 + g2(t). Then, as the network is trained with normalized outputs, this set up prepares the convergence process without disturbing the weights of the neighbor cells. It imposes initial weights such that, when the pattern representing the created kernel is propagated the first time through the network, it gives a correct classification score. This set up is particularly useful as it prevents the large parameter changes which are required when a random initialization is adopted. The functions g1 and g2 are simply random functions providing a number between 0 and 0.1. They do not affect the classification score negatively when the pattern at the basis of the kernel is propagated through the network. They make the gradient descent active at the beginning. They also avoid a possible static scheme by generating a correct classification score with small quadratic errors. It should be noted that this simple rule is specific to the context of the proposed classification approach. It does not, however, constitute a general way to initialize RBF weights. For example, a different rule is needed in the context of function approximation.
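The weight initialization rule and a subsequent delta-rule step on the output weights might be written as below; the gradient used folds the constant factor 2 into the learning rate, and all names are illustrative.

```python
import numpy as np

def init_new_unit_weights(weights, unit_index, category, rng=None):
    """Connections of a freshly created kernel: 1 - g1 towards its own category, 0 + g2 elsewhere."""
    rng = rng or np.random.default_rng()
    p = weights.shape[1]
    weights[unit_index, :] = rng.uniform(0.0, 0.1, size=p)        # 0 + g2(t)
    weights[unit_index, category] = 1.0 - rng.uniform(0.0, 0.1)   # 1 - g1(t)
    return weights

def delta_rule_step(weights, activations, target, alpha_w=0.01):
    """One gradient step on the output weights for a single pattern (normalized outputs)."""
    h = activations / np.sum(activations)               # normalized basis responses
    output = h @ weights                                 # O_k(x) of Eq. (7)
    weights += alpha_w * np.outer(h, target - output)   # dE/dw_jk proportional to -(L_k - O_k) h_j
    return weights
```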

2.5.5. Robustness via cross validation

One of the most important constraints when developing neural networks is the generalization aspect, i.e. the network's ability to predict unknown patterns as well as training patterns. Even if all the necessary precautions are taken in order to create a minimum number of cells and thereby minimize the degrees of freedom of the network, common concerns about generalization ability remain. In real world applications, databases are always incomplete, making data distribution estimation almost impossible. One way to overcome this problem and prevent overfitting is to ensure the relevance of the model by validating during the training procedure that unknown patterns are well predicted by the model. This requires the availability of two databases, which is not always possible when managing real data. The first database is devoted to the network training itself while the second validates and preserves the network generalization ability. Then, at each cycle, both scores can be taken into consideration in order to stop the procedure. When using this network in a feature selection algorithm, a simple rule can be applied to ensure that the results of the validation steps do not decrease too much. In other cases, the curve evolution can be studied to find the best iteration number. The training can then be launched again and stopped at this level, since the evolution is completely deterministic. For this procedure to be significant, the training and validation sets have to be defined to cover the feature space in a similar way.

3. Results and discussion

In order to prove the efficiency of the proposed method we compare results with standard RBFNs without growing process. By standard we mean networks consisting of two consecutive phases: the first one dedicated to the unit placement and the second one focusing only on optimizing the different network parameters (kernels and regression weights) without growing process. Two growing strategies are also considered: the first one developed by Cho [17], which can be seen as the basis of the approach proposed in this paper, and the second one developed by Fritzke [12], which is conceptually different. Based on a controlled growth process with an unsupervised and a supervised variant, the latter method appears as an extension of Kohonen neural networks. It has proved capable of generating networks which generalize very well and are relatively small. The k-dimensional topology of the Fritzke generated networks was 1 (line segment) and 2 (triangle), and the option of removing superfluous cells was not used.


Table 1. Classification performances for the two-spiral data set according to the maximum number of units allowed, after 100 cycles of iteration.

| Configuration | Initial units per category | Growing strategy | Final number of units | Classification performance | Quadratic error |
| Fixed topology | 25 | No | 25 | 0.805 | 0.082 |
| Fixed topology | 50 | No | 50 | 1 | 0.081 |
| Preprocessing + growing control by quadratic error only | 10 | Yes | 50 | 0.977 | 0.079 |
| Preprocessing + growing control by automatic unit initialization (first option) | 5 | Yes | 43 | 1 | 0.074 |
| Growing control by automatic unit initialization (hierarchical option) | 0 | Yes | 43 | 1 | 0.073 |


3.1. Two-spiral data for classification

A two-spiral data set has been chosen as it consists in delineating complex classification boundaries. In addition to the interest related to the problem complexity, this case study enables data visualization due to the low data dimensionality. It is a very well-known set, taken as reference in many papers to show the ability of proposed classifiers. It is a non-overlapping 2-dimensional set based on two categories, each of them composed of one hundred and eighty patterns.

The proposed growing approach works very well for this application. Table 1 shows the classification performances for the learning set according to the number of initialization leaders and to different numbers of additional units. The average proportion of misclassification slowly decreases as a function of the number of units: for example, the fixed topology gives 80% of correct classification for 25 units and 85% for 35 units. The growing process actually leads to 43 units and a score very close to 100%. Whatever the additional unit approach, the results are comparable in terms of classification score due to the absence of overlapping between the categories, the difference being in the final architecture. When selecting the new units on the basis of the quadratic errors only, the score remains comparable for the training set while a larger unit number is required to obtain the same classification score. Comparable results are also obtained both with and without the initialization phase (Figs. 4 and 5). This can be easily explained by the special data distribution present: the shapes of the data are quite complex and, especially, the number of data required to estimate the real distribution is great compared to the number of training data. The best Fritzke 1-topology (2-topology) model gave 95% (98.8%) of correct classification and a quadratic error of 0.073 (0.049) for 54 (65) cells. Although all the parameter configurations gave good results, the models were rather sensitive: the classification score could vary by up to 5% over several iterations or when different but very close convergence parameters are used. This case is very special regarding the complexity of the boundaries and rather remote from real applications. It however points out the

Fig. 4. Spiral problem: initialization with FCSM preprocessing; 100% of correct classification obtained.

weakness of fixed-unit-number networks, and the real interest of growing procedures: they can fit the data structure.

Without a growing concept, it is impossible to solve this classification problem unless the unit number matches the complexity of the problem. It is important to note that all the solutions in the literature imposing a fixed number of units, usually considerably smaller than the total number of data points, fail. While good results are reported under this hypothesis, it is very easy to find datasets like the present one where they would not perform well. Indeed, these units are usually distributed by unsupervised approaches and their objective is very different from that of placing units in order to achieve a good performance in classification.

Using supervised approaches is generally better but the question of the representative unit number remains. No universal solution exists to manage this open problem. With many units, the developed networks are generally heavy and their performances arbitrary; on the contrary, with too few units, performances are poor as the unit number is not sufficient to cover the descriptor space. In order to understand the performances during the training process, the number of patterns outside any influence hypersphere is calculated (Table 1). This parameter is neither exhaustive nor absolute; it is nevertheless very interesting since it clearly shows whether or not the descriptor space is sufficiently covered by the existing hyperspheres. For example, from 20 units to 30, the number of patterns outside existing hyperspheres passes from 60 to 45. In this case, it clearly proves that the poor performances are due to an insufficient number of elementary units. Due to the specificity of this application, results become more and more satisfactory as the number of units increases.
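The coverage diagnostic mentioned above (number of patterns outside any influence hypersphere) is straightforward to compute; the sketch below assumes Euclidean distances and illustrative names.

```python
import numpy as np

def patterns_outside_hyperspheres(X, centers, radii):
    """Count training patterns not covered by any existing influence hypersphere."""
    if len(centers) == 0:
        return len(X)
    d = np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :], axis=2)  # (n_pattern, n_unit)
    return int(np.sum(np.all(d > np.asarray(radii), axis=1)))
```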

3.2. Chemometric data

This data set comes from the field of chemometrics and presents a good illustration of the different algorithm possibilities. The objective

Fig. 5. Spiral problem: without pre-processing phase; 100% of correct classification obtained.


Table 2. Classification performances obtained with different network strategy developments for the chemometric application.

| Configuration | Number of initial units | Number of final units | Training score | Test score |
| Fixed topology, random initialization | 50 | 50 | 0.629 | 0.82 |
| Fixed topology, C_means initialization | 50 | 50 | 0.66 | 0.84 |
| Fixed topology, FCSM initialization | 50 | 50 | 0.69 | 0.865 |
| Simple growing with static Rinit and σinit | 0 | [30–100] | 0.35 | 0.39 |
| Simple growing + control by decreasing Rinit and σinit | 0 | [30–100] | 0.48 | 0.46 |
| Last configuration + control by dynamic em | 0 | [30–100] | 0.55 | 0.53 |
| Last configuration + pre-processing phase | 24 | [30–100] | 0.65 | 0.723 |
| Pre-processing phase + control by automatic unit initialization (first alternative) | 24 | [30–65] | 0.724 | 0.87 |
| Pre-processing phase + control by automatic unit initialization (hierarchical alternative) | 24 | [30–65] | 0.74 | 0.88 |
| Last configuration without pre-processing phase | 0 | [30–65] | 0.74 | 0.88 |

Classification scores are the average of the 10% best obtained scores when a parameter search was required.

Fig. 6. Classification results obtained for the chemometric application as a function of em, when only the quadratic error is considered to insert new units.


is to discriminate eight partially overlapped categories in an 8-D space. It was originally composed of 171-D data ordered in eight categories and representing 987 patterns, 680 serving for the training phase and the remainder for the test. A home-made feature selection algorithm [25] based on genetic concepts has been applied to the native matrix in order to reduce the number of descriptors to 8.

This application firstly shows that the growing strategy proposed in this paper produces very satisfactory results. Depending on the different variants to select the best candidates and on the use of the preprocessing phase, more than 80% of correct classification has been obtained for the test set. This points out the generalization capability of the proposed strategy. It also proves in particular that, by imposing coherent rules of construction, good network architectures can be obtained automatically. Without controlling the network growing it is very difficult to obtain comparable results without an exhaustive parameter search.

Different configurations have been studied in order to determine the role of the different controls favoring the network development. Regarding the number of training patterns, a fixed maximum unit number was set to 100. Starting from a basic growing strategy comparable to the Cho [17] proposition, with which it was difficult to obtain more than 50% of correct classification, we have successively introduced the different controls presented in this paper to reach a score above 80%. We have also applied the Fritzke method, which gave bad results for this application: after numerous parameter adjustments, it was impossible to reach more than 57% of correct classification for the training and test sets. Admittedly, this is more a question of finding the best parameters than a methodology concern. The weakness of the approach is the real difficulty in finding the appropriate parameters.

Table 2 summarizes the main results obtained with the different configurations and growing alternatives. It points out the fundamental role of the neighborhood criterion in determining the best candidate patterns as new units. It is clear that the scores become really interesting as soon as the quadratic criterion to select the units to be inserted is coupled with the neighborhood one. For example, when 24 units are introduced by the pre-processing phase based on a neighborhood criterion and the growing is controlled by the quadratic error criterion taken alone, the classification score is increased by about 10%. When the neighborhood criterion is coupled with the quadratic error one during the growing process, the results are again improved in the same way, reaching 88% of correct classification for the test set. When a neighborhood criterion is directly integrated in the choice of new units, the pre-processing phase does not play a key role for the classification scores.

The pre-processing phase reduces the time consumed to obtain the final network configurations but generally leads to larger architectures. When the hierarchical growing strategy is applied, better classification results are obtained on the basis of a similar number of units. For example, with only 30 units, classification results are most of the time above 80% of correct classification for the test set. The same network size obtained via the other growing approaches and fixed topologies gives classification results most of the time below 80%.

Fig. 6 illustrates the role of the threshold introduced to control the creation of a new unit or a parameter modification regarding the quadratic error. In this experiment, only the quadratic error criterion was used, to skip the neighborhood criterion effect. The network performance is clearly dependent on the initial value of em, and also sensitive to its change during the training process. For example, with a fixed em value passing from 0.3 to 0.2, the learning score passes from 0.357 to 0.347; with a decreasing em value passing from 0.2 to 0.5, the classification score passes from 0.358 to 0.53. When mixed with the neighborhood criterion, the effect is not so obvious but does exist. While the classification score is an important strength, the real advantage of this configuration is the small number of parameters to be defined compared to configurations where manual adjustment is required.

Table 3 depicts the results obtained for different values of kratio and kv. The results are not the same whatever the value of kratio; they are linked to the fixed parameters like em or the maximum number of iteration cycles. It is however interesting to note that they are close and, chiefly, that there is a coherent understanding of the results. A large kratio value is likely to limit the unit number, as the unit shape is also large. It can lead to more unit creation in the case of regions with partial data overlapping requiring more specific units. A small kratio value is likely to produce more units, depicting more closely the local data distribution in the active category. For example, when kv was fixed to 0, the number of created units goes from 64 to 44 when kratio goes from 0.6 to 0.9. The appropriate kratio value is application dependent and linked to the data overlapping. For this application, kratio = 0.75 enables relatively stable results to be obtained whatever the coherent kv values, both for the training and the test set.


Table 3. Results related to the chemometric application for different values of kratio and kv after 100 iteration cycles.

| kratio | kv | Unit number | Quadratic learning error | Quadratic test error | Learning score | Test score |
| 0.6 | 0 | 64 | 0.07 | 0.055 | 0.833 | 0.899 |
| 0.6 | 1 | 28 | 0.071 | 0.056 | 0.698 | 0.850 |
| 0.6 | 2 | 22 | 0.069 | 0.056 | 0.68 | 0.837 |
| 0.6 | 3 | 17 | 0.07 | 0.057 | 0.65 | 0.83 |
| 0.6 | 4 | 17 | 0.072 | 0.058 | 0.655 | 0.83 |
| 0.6 | 5 | 15 | 0.076 | 0.058 | 0.635 | 0.798 |
| 0.7 | 0 | 63 | 0.070 | 0.051 | 0.81 | 0.90 |
| 0.7 | 1 | 48 | 0.071 | 0.052 | 0.778 | 0.899 |
| 0.7 | 2 | 33 | 0.069 | 0.051 | 0.74 | 0.885 |
| 0.7 | 3 | 24 | 0.07 | 0.051 | 0.711 | 0.885 |
| 0.7 | 4 | 21 | 0.072 | 0.051 | 0.70 | 0.859 |
| 0.7 | 5 | 21 | 0.076 | 0.051 | 0.70 | 0.859 |
| 0.75 | 0 | 45 | 0.070 | 0.049 | 0.763 | 0.90 |
| 0.75 | 1 | 44 | 0.071 | 0.05 | 0.763 | 0.90 |
| 0.75 | 2 | 40 | 0.069 | 0.05 | 0.754 | 0.912 |
| 0.75 | 3 | 32 | 0.07 | 0.05 | 0.741 | 0.905 |
| 0.75 | 4 | 28 | 0.072 | 0.051 | 0.729 | 0.899 |
| 0.75 | 5 | 24 | 0.076 | 0.052 | 0.702 | 0.873 |
| 0.8 | 0 | 44 | 0.071 | 0.049 | 0.759 | 0.90 |
| 0.8 | 1 | 59 | 0.071 | 0.049 | 0.775 | 0.915 |
| 0.8 | 2 | 57 | 0.069 | 0.051 | 0.751 | 0.918 |
| 0.8 | 3 | 45 | 0.07 | 0.05 | 0.742 | 0.921 |
| 0.8 | 4 | 34 | 0.072 | 0.051 | 0.701 | 0.88 |
| 0.8 | 5 | 27 | 0.076 | 0.053 | 0.635 | 0.827 |
| 0.9 | 0 | 44 | 0.072 | 0.051 | 0.731 | 0.90 |
| 0.9 | 1 | 52 | 0.073 | 0.051 | 0.757 | 0.90 |
| 0.9 | 2 | 58 | 0.072 | 0.051 | 0.75 | 0.921 |
| 0.9 | 3 | 50 | 0.07 | 0.051 | 0.698 | 0.895 |
| 0.9 | 4 | 18 | 0.072 | 0.058 | 0.586 | 0.765 |
| 0.9 | 5 | 18 | 0.073 | 0.06 | 0.586 | 0.765 |


It is also interesting to note that the classification results for the test set are always better than those of the training set. The training and test sets were provided by chemists and used without any change. The results however prove that there is a covering issue between the different sets, especially for the training set. This is obvious when observing the classification results obtained when kv = 0. In this configuration, the growing strategy develops the network architecture hierarchically without any limit regarding the local distribution of the candidate patterns. In a classic pattern recognition scheme, this is likely to lead to specialist classifiers having worse test results. By analyzing the test results, we can see that they are rather constant around 0.9 while the training results increase by about 5%. To confirm this point, we have considered the test set as a cross validation file to add a new control related to the growing. With 25 units the validation results were about 88% of correct classification; they do not noticeably progress with additional units, and the leave-one-out test procedure gives similar classification results.

It is important to distinguish the internal parameters whose value is important for the overall behavior of the model from those that give a reasonable performance over a large range of values. The kv nearest-neighbor parameter falls into the latter category: a large range of values leads to satisfactory results and its exact role is perfectly known. The value of kv has to be fixed in order to produce the best trade-off between the dual objectives for the model: being able to generalize and being sufficiently efficient. Whatever the shape of the data, the specified value has a real signification even if its role is more secondary when the hierarchical approach is used. Similarly, the same gradient parameters have been used to apply the classical delta rule algorithm without any modification whatever the application. This lack of sensitivity is easily explained by the calculations performed previously in order to find the appropriate context for the gradient procedure.

4. Conclusion

In this paper, a global strategy for designing RBFN classifiers has been proposed and assessed in two studies, one empirical, derived from well-known studies in pattern recognition, and one in the field of chemometrics. The networks generated are simple, interpretable and easy to generate. In order to initialize the RBFN, a two-step deterministic supervised clustering procedure is proposed to set up the initial RBFN architecture via clustering, completed by the RBF growing process. This growing process consists in optimizing the RBF architecture by recruiting the new units adaptively and adjusting the different parameters to improve its generalization capability. The introduction of several heuristics during the growing process also leads to a better control of the final network architecture. In particular, introducing new units by considering both the quadratic errors generated and the classification skills is particularly efficient: it reduces the problem size continuously until convergence is reached. Many simulations have been performed to prove the efficiency of the method. The spiral problem points out the limit of static structures compared to growing ones. The chemometric set shows the good behavior of the proposal in the presence of overlapped data, in the context of a multi-classification problem and a relatively large data space. It would however be possible to improve the network efficiencies by considering fuzzy hyperspheres. This is not as easy as it sounds since the need is to improve the generalization power while keeping the network interpretability. Future research developments will move in this direction. They will investigate how to improve control of the growing by, for example, exploiting as much as possible the current structure before inserting new units.

References

[1] D. Kireev, D. Bernard, J.R. Chretien, F. Ros, Application of Kohonen Neural Networks in classification of biologically active compounds, SAR and QSAR in Environmental Research 8 (1998) 93–107.

[2] M. Chastrette, D. Cretin, C.E. Aidi, Structure–odor relationships: using Neural Networks in the estimation of camphoraceous or fruity odors and olfactory thresholds of aliphatic alcohols, Journal of Chemical Information and Computer Sciences 36 (1996) 108–113.

[3] R.P. Lippmann, An introduction to computing with Neural Nets, IEEE ASSP Magazine 4 (1987) 4–22.

[4] N.B. Emery, The Inspection of Blue Denim Fabric with Computer Vision, Doctoral thesis, North Carolina State University, Raleigh, NC, 1990.

[5] J. Nie, D.A. Linkens, Learning control using fuzzified self-organizing Radial Basis Function Network, IEEE Transactions on Fuzzy Systems 4 (1993) 280–287.

[6] B.D. Ripley, Statistical Aspects of Neural Networks, Semstat, Denmark, 1992.

[7] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.

[8] D.F. Specht, Probabilistic Neural Networks for classification, mapping or associative memory, ICNN 88 Conference Proceedings, 1988.

[9] J.D. Powell, The theory of radial basis function approximation, Advances in Numerical Analysis 2 (1990) 105–210.


[10] Z. Wang, T. Zhu, An efficient learning algorithm for improving generalization performance of radial basis function neural networks, Neural Networks 13 (2000) 545–553.

[11] M. Musavi, W. Ahmed, K. Chan, K. Faris, D. Hummels, On the training of radial basis function classifiers, Neural Networks 5 (1992) 595–603.

[12] B. Fritzke, Growing self-organizing Networks—why? ESANN 96, European Symposium on Artificial Neural Networks, Brussels, 1996, pp. 61–71.

[13] Y. Li, M.J. Pont, N.B. Jones, Improving the performance of radial basis function classifiers in condition monitoring and fault diagnosis applications where unknown faults may occur, Pattern Recognition Letters 23 (2002) 569–577.

[14] I. Rojas, H. Pomares, J.L. Bernier, J. Ortega, B. Pino, F.J. Pelayo, A. Prieto, Time series analysis using normalized PG-RBF network with regression weights, Neurocomputing 42 (2002) 267–285.

[15] F. Behloul, B.P.F. Lelieveldt, A. Boudraa, J.H.C. Reiber, Optimal design of radial basis function neural networks for fuzzy-rule extraction in high dimensional data, Pattern Recognition 35 (2002) 659–675.

[16] F. Ros, S. Guillaume, An efficient nearest neighbor classifier, in: Hybrid Evolutionary Systems, Studies in Computational Intelligence, Springer-Verlag (in press).

[17] B.K. Cho, B.H. Wang, Radial basis function based adaptive fuzzy systems and their applications to system identification and prediction, Fuzzy Sets and Systems 83 (1995) 325–339.

[18] F. Ros, M. Pintore, A. Demand, J.R. Chretien, Automatic initialization of RBF Neural Networks, Chemometrics and Intelligent Laboratory Systems (in press).

[19] D. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems 2 (1988) 321–355.

[20] C.A. Micchelli, Interpolation of scattered data: distance matrices and conditionally positive definite functions, Constructive Approximation 2 (1986) 11–22.

[21] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.

[22] J. Moody, C.J. Darken, Fast learning in networks of locally tuned processing units, Neural Computation 1 (1989) 281–294.

[23] F. Schwenker, H. Kestler, G. Palm, Three learning phases for radial basis function networks, Neural Networks 14 (2001) 439–458.

[24] F. Schwenker, C. Dietrich, Initialization of radial basis function networks by means of classification trees, Neural Networks 10 (2000) 476–482.

[25] F. Ros, M. Pintore, J.R. Chretien, Molecular description selection combining genetic algorithms and fuzzy logic: application to database mining procedures, Chemometrics and Intelligent Laboratory Systems 63 (2002) 15–26.

[26] D.L. Reilly, L.N. Cooper, C. Elbaum, A neural network for category learning, Biological Cybernetics 45 (1982) 35–41.