
Neurocomputing 62 (2004) 39-64
www.elsevier.com/locate/neucom

Training RBF networks with selective backpropagation

Mohammad-Taghi Vakil-Baghmisheh, Nikola Pavešić
Laboratory of Artificial Perceptions, Systems and Cybernetics, Faculty of Electrical Engineering,

    University of Ljubljana, Slovenia

    Received 11 March 2002; received in revised form 8 July 2003; accepted 19 November 2003

    Abstract

Backpropagation with selective training (BST) is applied to the training of radial basis function (RBF) networks. It improves the performance of the RBF network substantially, in terms of convergence speed and recognition error. Three drawbacks of the basic backpropagation algorithm, i.e. overtraining, slow convergence at the end of training, and inability to learn the last few percent of patterns, are solved. In addition, it has the advantages of shortening training time (up to 3 times) and de-emphasizing overtrained patterns. The simulation results obtained on 16 datasets of the Farsi optical character recognition problem prove the advantages of the BST algorithm. Three activity functions for output cells are examined, and the sigmoid activity function is preferred over the others, since it results in less sensitivity to learning parameters, faster convergence and lower recognition error.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Neural networks; Radial basis functions; Backpropagation with selective training; Overtraining; Farsi optical character recognition

    1. Introduction

Neural networks (NNs) have been used in a broad range of applications, including pattern classification, pattern completion, function approximation, optimization, prediction, and automatic control. In many cases, they even outperform their classical

Corresponding author. LUKS, Fakulteta za elektrotehniko, Tržaška 25, 1000 Ljubljana, Slovenia. Tel.: +386-1-4768839; fax: +386-1-4768316.

E-mail addresses: [email protected] (M.-T. Vakil-Baghmisheh), [email protected] (N. Pavešić).

0925-2312/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2003.11.011


[Figure 1: a two-layer RBF network with input cells x_1, ..., x_{l_1}, hidden cells y_1, ..., y_{l_2} connected through the prototype matrix V, and output cells z_1, ..., z_{l_3} connected through the weight matrix U; i = 1, ..., l_1, m = 1, ..., l_2, j = 1, ..., l_3.]

Fig. 1. Configuration of the RBF network (for explanation see Appendix A).

counterparts. In spite of different structures and training paradigms, all NNs perform essentially the same function: vector mapping. Likewise, all NN applications are special cases of vector mapping. Development of detailed mathematical models for NNs began in 1943 with the work of McCulloch and Pitts [12] and was continued by others.

According to Wasserman [20], the first publication on radial basis functions for classification purposes dates back to 1964 and is attributed to Bashkirov et al. [4] and Aizerman et al. [1]. In 1988, based on Cover's theorem on the separability of patterns [6], Broomhead and Lowe [5] employed radial basis functions in the design of NNs.

The RBF network is a two-layered network (Fig. 1), and the common method for its training is the backpropagation algorithm. The first version of the backpropagation algorithm, based on the gradient descent method, was proposed by Werbos [21] and Parker [13] independently, but gained popularity after publication of the seminal book by Rumelhart et al. [15]. Since then, many modifications have been offered by others, and Jondarr [10] has reviewed 65 varieties.

Almost all variants of the backpropagation algorithm were originally devised for the multilayer perceptron (MLP). Therefore, any variant of the backpropagation algorithm which is used for training the radial basis function (RBF) network should be customized to suit this network, so it will be somewhat different from the variant suitable for the MLP. Using the backpropagation algorithm for training the RBF network has three main drawbacks:

• overtraining, which weakens the network's generalization property,
• slowness at the end of training,
• inability to learn the last few percent of vector associations.

A solution offered for the overtraining problem is early stopping by employing the cross-validation technique [9]. There are plenty of research reports that argue against the usefulness of the cross-validation technique in the design and the training of NNs. For detailed discussions the reader is invited to see [2,3,14].


From our point of view, there are two major reasons against using early stopping and the cross-validation technique on our data:

(1) Cross-validation stops training on both learned and unlearned data. While the logic behind early stopping is preventing overtraining on learned data, there is no logic for stopping the training on unlearned data, when the data is not contradictory.

(2) In the RBF and the MLP networks, the learning trajectory depends on the randomly selected initial point. This means that the optimal number of training epochs obtained by CV is useful only if we always start training from the same initial point, and the network always traverses the same learning trajectory!

To improve the performance of the network, the authors suggest selective training, as there is no other way to improve the performance of the RBF network on the given datasets. The paper shows that if we use early stopping or continue the training with the whole dataset, the generalization error will be much larger than the results obtained by selective training. In [19] the backpropagation with selective training (BST) algorithm was presented for the first time and was used for training the MLP network.

Based on the results obtained on our OCR datasets, the BST algorithm has the

    following advantages over the basic backpropagation algorithm:

• it prevents overtraining,
• it de-emphasizes the overtrained patterns,
• it enables the network to learn the last few percent of unlearned associations in a short period of time.

As there is no universally effective method, the BST algorithm is no exception. Since contradictory data or the overlapping part of the data cannot be learned, applying selective training to data with a large overlapping area will destabilize the system, but it is quite effective when the dataset is error-free and non-overlapping, as is the case with every error-free character-recognition database, when a sufficient number of proper features is extracted.

Organization of the paper: The RBF network is reviewed in Section 2. In Section 3, the training algorithms are presented. Simulation results are presented in Section 4, and conclusions are given in Section 5. In addition, the paper includes two appendices. In most of the resources the formulations for calculating the error gradients of RBF networks are either erroneous and conflicting (for instance see formulas 4.57, 4.60, 7.53, 7.54, 7.55 in [11]), or not given at all (see for instance [20,16]). Thus, in Appendix A we obtain these formulas for three forms of output cell activity function. Appendix B presents some information about the feature extraction methods used for creating the Farsi optical character recognition datasets, which are used for the simulations in this paper.

Remark. Considering that in the classifier every pattern is represented by its feature vector as the input vector to the classifier, classifying the input vector is equivalent to classifying the corresponding pattern. Frequently in the paper, the vector which is to be classified is referred to as the input pattern, or simply the pattern, and vice versa.

    2. RBF networks

In this section, the structure, training paradigms and initialization methods of RBF networks are reviewed.

    2.1. Structure

While there are various interpretations of RBF, in this paper we will consider it from the pattern recognition point of view. The main idea is to divide the input space into subclasses, and to assign a prototype vector to every subclass, at its center. The membership of every input vector in each subclass is then measured by a function of its distance from the prototype (or kernel) vector, that is f_m(x) = f(‖x − v_m‖). This membership function should have four specifications:

1. Attaining the maximum value at the center (zero distance).
2. Having a considerable value in the close neighborhood of the center.
3. Having a negligible value at far distances (i.e. in regions close to other centers).
4. Differentiability.

In fact, any differentiable and monotonically decreasing function of ‖x − v_m‖ will fulfill these conditions, but the Gaussian function is the common choice. After obtaining the membership values (or similarity measures) of the input vector in the subclasses, the results should be combined to obtain the membership degrees in every class. The two-layered feed-forward neural network depicted in Fig. 1 is capable of performing all these operations, and is called the RBF network.

The neurons in the hidden layer of the network have a Gaussian activity function and

their input-output relationship is
$$y_m = f_m(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{v}_m\|^2}{2\sigma_m^2}\right), \qquad (1)$$

where v_m is the prototype vector or the center of the mth subclass and σ_m is the spread parameter, through which we can control the receptive field of that neuron. The receptive field of the mth neuron is the region in the input space where f_m(x) is high.

The neurons in the output layer could be sigmoid, linear, or pseudo-linear, i.e. linear with some squashing property, so the output could be calculated using one of the


following equations:
$$z_j = \begin{cases} \dfrac{1}{1 + e^{-s_j}}, & \text{sigmoid}, \\[2mm] \dfrac{s_j}{l_2}, & \text{linear, with } \dfrac{1}{l_2} \text{ squashing function}, \\[2mm] \dfrac{s_j}{\sum_m y_m}, & \text{pseudo-linear, with } \dfrac{1}{\sum_m y_m} \text{ squashing function}, \end{cases} \qquad (2)$$
where
$$s_j = \sum_{m=1}^{l_2} y_m u_{mj}, \qquad j = 1, \ldots, l_3. \qquad (3)$$

Although in most of the literature neurons with a linear or pseudo-linear activity function have been considered for the output layer, we strongly recommend using the sigmoidal activity function, since it results in less sensitivity to learning parameters, faster convergence and lower recognition error.
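To make the forward computation concrete, the following minimal sketch (our illustration, not the authors' code; the array shapes follow the notation of Fig. 1 and Appendix A, and the function name rbf_forward is ours) evaluates Eqs. (1)-(3) for one input vector:

```python
import numpy as np

def rbf_forward(x, V, sigma, U, output="sigmoid"):
    """Forward pass of the RBF network of Fig. 1.

    x     : input vector, shape (l1,)
    V     : prototype (kernel) vectors, shape (l1, l2)
    sigma : spread parameters, shape (l2,)
    U     : output-layer weights, shape (l2, l3)
    """
    # Hidden layer, Eq. (1): Gaussian membership of x in each subclass
    dist2 = np.sum((V - x[:, None]) ** 2, axis=0)   # ||x - v_m||^2
    y = np.exp(-dist2 / (2.0 * sigma ** 2))         # y_m

    # Net input of the output cells, Eq. (3)
    s = y @ U                                       # s_j

    # Output layer, Eq. (2): three activity functions
    if output == "sigmoid":
        z = 1.0 / (1.0 + np.exp(-s))
    elif output == "linear":
        z = s / V.shape[1]                          # 1/l2 squashing
    else:                                           # pseudo-linear
        z = s / np.sum(y)                           # 1/sum(y_m) squashing
    return y, z
```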

    2.2. Training paradigms

Before starting the training, a cost function should be defined, and through the training process we will try to minimize it. The total sum-squared error (TSSE) is the most popular cost function.

Three paradigms of training have been suggested in the literature:

1. No-training: In this, the simplest case, all the parameters are calculated and fixed in advance and no training is required. This paradigm does not have any practical value, because the number of prototype vectors should be equal to the number of training samples, and consequently the network will be too large and very slow.

2. Half-training: In this case the hidden layer parameters (kernel vectors and spread parameters) are calculated and fixed in advance, and only the connection weights of the output layer are adjusted through the backpropagation algorithm.

3. Full-training: This paradigm requires the training of all parameters, including kernel vectors, spread parameters, and the connection weights of the output layer (the v_m's, σ_m's and u_mj's), through the backpropagation algorithm.

    2.3. Initialization methods

The method of initialization of any parameter will depend on the selected training paradigm. To determine the initial values of the kernel vectors, many methods have been suggested, among which the most popular are:

1. the first samples of the training set,
2. some randomly chosen samples from the training set,
3. subclass centers obtained by some clustering or classification method, e.g. the k-means algorithm or the LVQ algorithm.

Theodoridis [16] has reviewed some other methods and cited some related review papers.

Wasserman [20] presents a heuristic which can be useful in determining the method of calculating initial values of the spread parameters:

Heuristic: The object is to cover the input space with receptive fields as uniformly as possible. If the spacing between centers is not uniform, it may be necessary for each subclass to have its own value of σ. For subclasses that are widely separated from others, σ must be large enough to cover the gap, whereas for subclasses that are close to others, σ must have a small value.

Depending on the dataset and training paradigm, and according to the heuristic, one of the following methods can be adopted:

1. Assigning a small fixed value, say σ = 0.05 or 0.1, which requires a large number of hidden neurons to cover the input space.
2. $\sigma = d/\sqrt{2 l_2}$, where d is the maximum distance between the chosen centers, and l_2 is the number of centers.
3. In the case of using the k-means algorithm to find the kernel vectors, σ_m could be the standard deviation of the vectors in the pertaining subclass.
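A minimal sketch of these three spread-initialization policies, assuming the $d/\sqrt{2 l_2}$ reading of the second heuristic (function and argument names are illustrative only, not from the paper):

```python
import numpy as np

def init_spreads(V, clusters=None, method="dmax", sigma_fixed=0.1):
    """Initial spread parameters for the hidden layer (a sketch of the
    three policies listed above).

    V        : kernel vectors, shape (l1, l2)
    clusters : optional list of arrays; clusters[m] holds the training
               vectors assigned to the m-th kernel (for the k-means policy)
    """
    l2 = V.shape[1]
    if method == "fixed":                       # policy 1: small fixed value
        return np.full(l2, sigma_fixed)
    if method == "dmax":                        # policy 2: sigma = d / sqrt(2*l2)
        d = max(np.linalg.norm(V[:, a] - V[:, b])
                for a in range(l2) for b in range(a + 1, l2))
        return np.full(l2, d / np.sqrt(2.0 * l2))
    # policy 3: per-cluster standard deviation from k-means
    return np.array([np.sqrt(np.mean(np.sum((c - V[:, m]) ** 2, axis=1)))
                     for m, c in enumerate(clusters)])
```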

    To assign initial values to the weights in the output layer, there are two methods:

1. Some random values in the range [−0.1, +0.1]. This method necessitates weight adjustment through an iterative process (the backpropagation algorithm).

    2. Using the pseudo-inverse matrix to solve the following matrix equation:

$$\mathbf{Y}\mathbf{U} = \mathbf{Z}, \qquad \mathbf{Y} = \begin{bmatrix} \mathbf{y}^1 \\ \vdots \\ \mathbf{y}^Q \end{bmatrix}, \qquad \mathbf{Z} = \begin{bmatrix} \mathbf{z}^1 \\ \vdots \\ \mathbf{z}^Q \end{bmatrix}, \qquad \mathbf{y}^q \in \mathbb{R}^{l_2}, \quad \mathbf{z}^q \in \mathbb{R}^{l_3}, \qquad (4)$$

where y^1, ..., y^Q and z^1, ..., z^Q are the row vectors obtained from the hidden and output layers, respectively, in response to the row vectors x^1, ..., x^Q in the input layer, and the equation YU = Z is built as follows: for each input vector x^q in the training set, the outputs from the hidden layer are made a row of the matrix Y, the target outputs are placed in the corresponding row of the target matrix Z, and each set of weights associated with an output neuron is made a column of the matrix U.

Considering that in large-scale problems the dimension of Y is high and (YᵀY)⁻¹ is ill-conditioned, despite the superficial appeal of the pseudo-inverse matrix method, the first, iterative, method is the only applicable one.
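For completeness, a least-squares sketch of the pseudo-inverse initialization of Eq. (4) is given below (an illustration only, subject to the conditioning caveat just mentioned); it assumes the row-stacked matrices Y and Z defined above:

```python
import numpy as np

def init_output_weights_pinv(Y, Z):
    """Solve Y U = Z for U in the least-squares sense, Eq. (4).

    Y : hidden-layer outputs, one row per training pattern, shape (Q, l2)
    Z : target outputs,       one row per training pattern, shape (Q, l3)
    """
    # lstsq avoids forming (Y^T Y)^{-1} explicitly, which the text notes
    # is ill-conditioned for large problems
    U, *_ = np.linalg.lstsq(Y, Z, rcond=None)
    return U                                    # shape (l2, l3)
```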


    3. Training algorithms

In this section, we will present two training algorithms for the RBF network. First the basic backpropagation (BB) algorithm is reviewed, and then the modified algorithm is presented.

    3.1. Basic backpropagation for the RBF network

Here we will consider the algorithm for the full-training paradigm; customizing it for half-training is straightforward and can be done simply by eliminating the gradient calculations and weight updates corresponding to the appropriate parameters.

    Algorithm.

(1) Initialize the network.
(2) Forward pass: Insert the input and the desired output, and compute the network outputs by proceeding forward through the network, layer by layer.
(3) Backward pass: Calculate the error gradients with respect to the parameters, layer by layer, starting from the output layer and proceeding backwards: ∂E/∂u_mj, ∂E/∂v_im, ∂E/∂σ²_m (see Appendix A).
(4) Update parameters:
$$u_{mj}(n+1) = u_{mj}(n) - \eta_3 \frac{\partial E}{\partial u_{mj}}, \qquad (5)$$
$$v_{im}(n+1) = v_{im}(n) - \eta_2 \frac{\partial E}{\partial v_{im}}, \qquad (6)$$
$$\sigma_m^2(n+1) = \sigma_m^2(n) - \eta_1 \frac{\partial E}{\partial \sigma_m^2}, \qquad (7)$$

where η₁, η₂, η₃ are learning rate factors in the range [0, 1].
(5) Repeat the algorithm for all training inputs. If one epoch of training is finished, repeat the training for another epoch.
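A minimal sketch of one pattern-mode update, Eqs. (5)-(7), is given below; it assumes a helper grad_E that returns the three gradients of Appendix A (such a helper is sketched at the end of Appendix A), and eta3, eta2, eta1 stand for η₃, η₂, η₁:

```python
def bb_update(x, t, V, sigma2, U, grad_E, eta3, eta2, eta1):
    """One pattern-mode update of the basic backpropagation algorithm,
    Eqs. (5)-(7). grad_E is assumed to return (dE_dU, dE_dV, dE_dsigma2)
    with the shapes of U, V and sigma2, respectively."""
    dE_dU, dE_dV, dE_dsigma2 = grad_E(x, t, V, sigma2, U)
    U      -= eta3 * dE_dU        # Eq. (5): output-layer weights
    V      -= eta2 * dE_dV        # Eq. (6): kernel vectors
    sigma2 -= eta1 * dE_dsigma2   # Eq. (7): squared spread parameters
    return V, sigma2, U
```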

Remarks. (1) Based on our experience, the addition of the momentum term, as is common for the MLP, does not help in training the RBF network.
(2) If the sigmoidal activity function is used for the output cells, adding the sigmoid prime offset (see footnote 1) [8] will improve training substantially, similar to the MLP.
(3) Stopping should be decided based on the results of the network test, which is carried out every T epochs after the cost function becomes smaller than a threshold value C.

1 As the output of the neurons approaches extreme values (0 or 1) there will be just a little learning or no learning. A solution to this problem is adding a small offset (about 0.1) to the derivative ∂z_j/∂s_j in Eq. (A.9), which is called the sigmoid prime offset; thus ∂z_j/∂s_j never reaches zero. Based on our experience, adding such a term is helpful only in the calculation of (A.11), but not in the calculation of (A.26) or (A.30).


(4) To get better generalization performance, using the cross-validation method [9] has been reported in some cases as a stopping criterion; this, however, as was mentioned in the introduction, is unsatisfactory and unconvincing, because it stops training on both learned and unlearned inputs.
(5) The number of output cells depends on the number of classes and the approach of coding; however, it is highly recommended to make it equal to the number of classes.
(6) Sometimes, in the net input of the sigmoid function or the linear output, a constant term is also considered (called the threshold term), which is implemented using a constant input (equal to 1). In some cases this term triggers the moving target phenomenon and hinders training, and in some other cases without it there is no solution. Therefore, it must be examined for every case separately.
(7) In the rest of this paper, by BB we mean the backpropagation algorithm with the sigmoid prime offset as explained in footnote 1, without the momentum term.

    3.2. Backpropagation with selective training

The difference between the BST algorithm and the BB algorithm lies in the selective training, which is appended to the BB algorithm. When most of the vector associations have been learned, every input vector should be checked individually; if it is learned there should be no training on that input, otherwise training will be carried out. In doing so, a side effect will arise: the stability problem. That is to say, when we continue training on only some inputs, the network usually forgets the other input-output associations which were already learned, and in the next epoch of training it will make wrong predictions for some of the inputs that were already classified correctly. The solution to this side effect consists of considering a stability margin in the definition of correct classification in the training step. In this way we also carry out training on marginally learned inputs, which are on the verge of being misclassified.

Selective training has its own limitations, and cannot be used on conflicting data, or on a dataset with large overlapping areas of classes. Based on the obtained results, using the BST algorithm on an error-free OCR dataset has the following advantages:

• it prevents overtraining,
• it de-emphasizes the overtrained patterns,
• it enables the network to learn the last few percent of unlearned associations in a short period of time.

    BST algorithm.

(1) Start training with the BB algorithm, which includes two steps: forward propagation and backward propagation.
(2) When the network has learned most of the vector mappings and the training procedure has slowed down, i.e. TSSE becomes smaller than a threshold value C, stop the main algorithm and continue with selective training.


(3) For any pattern, perform forward propagation and examine the prediction of the network:
$$z_J = \max_j(z_j), \quad j = 1, \ldots, l_3;$$
$$\begin{cases} \text{if } z_J \geq z_j + \varepsilon \ \ \forall\, j \neq J & \Rightarrow\ J \text{ is the predicted class}, \\ \text{if } z_J < z_j + \varepsilon \ \ \text{for some } j \neq J & \Rightarrow\ \text{no prediction}, \end{cases} \qquad (8)$$

where ε is a small positive constant called the stability margin.
(4) If the network makes a correct prediction, do nothing; go back to step 3 and repeat the algorithm for the next input. Otherwise (the network does not make a correct prediction, including the no-prediction case), carry out the backward propagation.
(5) If the epoch is not complete, go to step 3; else check the total number of wrong predictions: if its trend is decreasing, go to step 3 and continue the training, otherwise stop training.

(6) Examine the network performance on the test set: do only forward propagation, and then, for any input, z_J = max_j(z_j), j = 1, ..., l_3, and J is the predicted class.
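As an illustration of steps 3-5 and the stability-margin test of Eq. (8), the sketch below performs one epoch of selective training; it reuses the hypothetical rbf_forward and bb_update helpers sketched earlier, and epsilon is the stability margin ε:

```python
import numpy as np

def selective_epoch(X, T, V, sigma2, U, grad_E, etas, epsilon=0.2):
    """One epoch of selective training (steps 3-5 of the BST algorithm).

    X, T : training inputs and one-hot targets, shapes (Q, l1) and (Q, l3)
    Returns the number of wrong (or missing) predictions in this epoch.
    """
    wrong = 0
    for x, t in zip(X, T):
        _, z = rbf_forward(x, V, np.sqrt(sigma2), U)    # forward pass only
        J = int(np.argmax(z))                           # winner output cell
        others = np.delete(z, J)
        confident = np.all(z[J] >= others + epsilon)    # Eq. (8) margin test
        correct = confident and J == int(np.argmax(t))
        if not correct:
            # train only on unlearned or marginally learned patterns
            wrong += 1
            bb_update(x, t, V, sigma2, U, grad_E, *etas)  # backward propagation
    return wrong
```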

Remarks. (1) An alternative condition for starting the selective training is as follows: after TSSE becomes smaller than a threshold value C, carry out a recognition test on the training set every T epochs, and if it fulfills a second threshold condition, that is (recognition error ≤ C₁), start selective training. The recommended value for T is 3 ≤ T ≤ 5.
(2) If ε is chosen too large, training will be continued for most of the learned inputs, and this will make our method ineffective. On the other hand, if ε is chosen too small, during training we will face a stability problem, i.e. with a small change in the weight values, which happens in every epoch, a considerable number of associations will be forgotten; thus the network will oscillate and training will not proceed. After training, a small change in the feature vector that causes a small change in the output values will change the prediction of the network, or for feature vectors from the same class but with minor differences we will get different predictions. This also causes vulnerability to noise and weak performance on both the test set and real-world data outside both the training set and the test set. The optimum value of ε should be small enough to prevent training on learned inputs, but not so small as to allow the winner neuron to change with minor changes in the weight values or input values. Our simulation results show that for the RBF network a value in the range [0.1, 0.2] is optimal for our datasets.
(3) It is also possible to consider a no-prediction state in the final test, that is

$$z_J = \max_j(z_j), \quad j = 1, \ldots, l_3;$$
$$\begin{cases} \text{if } z_J \geq z_j + \varepsilon_1 \ \ \forall\, j \neq J & \Rightarrow\ J \text{ is the predicted class}, \\ \text{if } z_J < z_j + \varepsilon_1 \ \ \text{for some } j \neq J & \Rightarrow\ \text{no prediction}, \end{cases} \qquad (9)$$


in which 0 < ε₁ < ε. This no-prediction state will decrease the error rate at the cost of decreasing the correct prediction rate.

    4. Experiments

In this section, we first give some explanation about the datasets on which the simulations have been carried out. Then the simulation results are presented.

    4.1. Datasets

A total of 18 datasets, composed of feature vectors of 32 isolated characters of the Farsi alphabet and sorted into three groups, were created through various feature extraction methods, including principal component analysis (PCA), vertical, horizontal and diagonal projections, zoning, pixel change coding and some combinations of them, with the number of features varying from 4 to 78 per character according to Table 1.

For creating these datasets, 34 Farsi fonts, which are used in publishing online newspapers and web sites, were downloaded from the Internet. Fig. 2 demonstrates a whole set of isolated characters of one sample font printed as text. Then, the 32 isolated characters of these 34 Farsi fonts were printed in an image file; 11 sets of these fonts were boldface and one set was italic. In the next step, by printing the image file and scanning it with various brightness and contrast levels, two additional image files were obtained. Then, using a 65 × 65 pixel window, the character images were separated into images of isolated characters. After applying a shift-radius invariant

Table 1
Brief information about Farsi OCR datasets

Database         Extraction method      Explanation                          No. of features per character
Group A
  db1            PCA                                                         72
  db2            PCA                                                         54
  db3            PCA                                                         72
  db4            PCA                                                         64
  db5            PCA                                                         48
Group B
  dbn1 to dbn5   PCA                    Normalized versions of db1 to db5
Group C
  db6            Zoning                                                      4
  db7            Pixel change coding                                         48
  db8            Projection             Horizontal and vertical              48
  db9            Projection             Diagonal                             30
  db10           Projection             db8 + db9                            78
  db11                                  db6 + db7                            52
  db12                                  db6 + db8                            52
  db13                                  db6 + db7 + parts of db8             72


    Fig. 2. A sample set of machine-printed Farsi characters (isolated case).

image normalization [18], and by reducing the sizes of the character images to 24 × 24 pixels, the feature vectors were extracted as explained in Appendix B.

    4.2. Simulation results

In all of the simulations, two-thirds of the feature vectors, obtained from the original image and the first scanned image, were assigned to the training set, and the one-third of the feature vectors obtained from the second scanned image were assigned to the test set. Therefore, 68 samples per character were used for training and 34 samples per character for testing. Thus, the total numbers of samples in the training sets and test sets are 2176 and 1088, respectively.

We considered three types of activity functions for the output layer: linear, pseudo-linear, and sigmoid. We faced numerous problems with both the linear and pseudo-linear activity functions. These problems are explained later (in the current section, see Considerations, item 2). Hence, in the sequel we present only the simulation results obtained with the sigmoidal output neurons.

In Table 2 the results obtained on the datasets of groups A-C by the BST and the BB algorithms are compared against each other.

4.2.1. Settings and initializations
In all the cases, we considered one prototype pattern per class, i.e. 32 prototype vectors, and 32 output cells for the output layer. Thus, the network configuration is l₁ × 32 × 32, where l₁ is the dimension of the input vector, which is equal to the number of features per character (see Table 1). Adding the threshold term to the sigmoid activity functions triggered the moving target phenomenon, so we eliminated it. Also, we did not add the momentum term, because it did not help to improve training. However, adding the sigmoid prime offset boosted the performance of the network substantially.

Two training paradigms were examined, i.e. half- and full-training. In Table 2, we have presented only the results obtained by the full-training paradigm; the consequences of adopting the half-training paradigm will be discussed at the end.

For initializing the kernel vectors, two methods were adopted: the first samples, and the cluster centers obtained by the k-means algorithm. Considering that k was set equal to one, these centers are simply the averages of the training samples of every class.


Table 2
Comparing the recognition errors of the RBF network obtained by the BB and the BST algorithms

            BB                        BST                          First phase          Second phase           Threshold   σ²
Database    Epoch N  Train   Test     Epochs n, N   Train   Test   η₃; η₂; η₁           η₃; η₂; η₁
db1         100      11      25       64, 104       0       17     5; 0.04; 0.001       1.7; 0.013; 0.0003     60          6
db2         100      22      32       57, 90        4       22     5; 0.04; 0.001       1.7; 0.013; 0.0003     110         5
                                      100, 140      2       19     5; 0.04; 0.001       1.7; 0.013; 0.0003                 5
db3         100      11      7        51, 90        0       0      5; 0.04; 0.001       1.7; 0.013; 0.0003     80          6
db4         100      9       6        39, 55        0       0      8; 0.04; 0.001       2.7; 0.013; 0.0003     60          7
db5         100      14      6        58, 95        0       0      5; 0.04; 0.001       1.7; 0.013; 0.0003     80          6
db7         100      63      31       65, 105       3       1      1.4; 0.001; 5e-6     0.5; 0.00033; 1.7e-6   220         0.05
                                      100, 140      2       0      1.4; 0.001; 5e-6     0.5; 0.00033; 1.7e-6               0.05
db8         100      83      40       77, 115       14      7      3; 0.001; 5e-6       1; 0.00033; 1.7e-6     220         0.2
db10        100      20      10       68, 108       2       1      6; 0.001; 5e-6       2; 0.0003; 1.7e-6      120         0.2
                                      100, 140      1       0      6; 0.001; 5e-6       2; 0.0003; 1.7e-6                  0.2
db11        100      49      28       48, 65        0       0      3; 0.001; 5e-6       1; 0.0003; 1.7e-6      170         0.05
db12        100      68      32       77, 117       5       1      5; 0.001; 5e-6       1.7; 0.0003; 1.7e-6    200         0.15
                                      100, 140      5       1      5; 0.001; 5e-6       1.7; 0.0003; 1.7e-6                0.15
db13        100      4       2        49, 60        0       0      5; 0.001; 5e-6       2; 0.0003; 1.7e-6      80          0.15
dbn1        100      12      18       71, 113       0       16     5; 0.001; 5e-6       1.7; 0.0003; 1.7e-6    50          0.5
                                      100, 120      0       13     5; 0.001; 5e-6       1.7; 0.0003; 1.7e-6                0.5
dbn2        100      1       27       64, 100       4       20     4; 0.001; 5e-6       1.3; 0.0003; 1e-6      100         0.4
                                      100, 140      0       15     4; 0.001; 5e-6       1.3; 0.0003; 1e-6                  0.4
dbn3        100      8       5        53, 93        0       0      5; 0.001; 5e-6       1.7; 0.0003; 1e-6      60          0.5
                                      100, 130      0       0      5; 0.001; 5e-6       1.7; 0.0003; 1e-6                  0.5
dbn4        100      3       1        45, 75        0       0      4; 0.001; 5e-6       1.3; 0.0003; 1e-6      70          0.6
dbn5        100      17      9        65, 100       0       0      4; 0.001; 5e-6       1.3; 0.0003; 1e-6      80          0.6


For initializing the spread parameters, two policies were adopted:

(1) A fixed number slightly larger than the minimum of the standard deviations of all clusters created by the k-means algorithm, defined as in Eq. (10):
$$\sigma^2 = \min_i(\sigma_i^2), \qquad (10)$$
where σ²_i is the variance of the training patterns of the ith cluster, defined by
$$\sigma_i^2 = \frac{1}{68} \sum_{\mathbf{x}_q \in \text{cluster } i} \|\mathbf{x}_q - \mathbf{m}_i\|^2, \qquad i = 1, \ldots, 32, \qquad (11)$$
and m_i is the average vector of the ith cluster.

(2) Different initial values, equal to the variances of every cluster obtained by the k-means algorithm.

The total sum-squared error (TSSE) was considered as the cost function, and random numbers in the range [−0.1, +0.1] were assigned to the initial weight matrix of the output layer. The results presented in Table 2 were obtained while the initial spread parameters were equal, and the initial kernel vectors were set equal to the cluster centers of the k-means algorithm. For the BST algorithm, the stability margin ε was set equal to 0.2. This value was obtained based on the empirical results.

4.2.2. Training algorithms
(1) BB algorithm: The network was trained for 100 epochs with the BB algorithm. The obtained results are given under column BB.
(2) BST algorithm: A threshold value for TSSE was chosen, after which the training procedure slows down. This threshold value was acquired from the first training experiment with the BB algorithm. Training was restarted by the BB algorithm from the same initial point of the first experiment; when TSSE reached the threshold value, we changed the training algorithm from unselective to selective and continued for a maximum of 40 epochs, with the values of the learning parameters, i.e. η₃, η₂ and η₁, decreased almost three times. Every five epochs the network was tested; if the recognition error on the training set was zero, training was stopped. Training was also stopped if either the dynamic recognition error (see footnote 2) reached zero or 40 epochs of training were over. The obtained results are given under column BST. In this column, n and N represent the epoch numbers where selective training starts and ends.

(3) BST algorithm: We did not set any threshold; training was carried out for 100 epochs on datasets dbn1, db2, dbn2, dbn3, db7, db10 and db12 with the BB algorithm, then followed with selective training for a maximum of 40 epochs. The obtained results are given under column BST, where the threshold value has not been specified, or where n is equal to 100.

2 Dynamic recognition error is obtained while the network is under training, and after presenting any pattern the network parameters will probably change. Therefore, it is different from the recognition error which is obtained after training. Generally, in selective training, the dynamic recognition error on the training set will be larger than the recognition error; therefore we could stop training earlier by performing a test on the training set after every few epochs of training, but this would violate the stability condition. Although in our case study this does not cause any problem, it can increase the recognition error in real on-line operation.

4.2.3. Analysis
(1) Results obtained by the BB algorithm: The recognition errors on datasets dbn4, db13, dbn3, db4, db3 and db5 are lower than those on the others, and db12, db8 and db7 must be considered the worst ones. Data normalization has improved the recognition rate in all cases, excluding dbn5. We notice that the performances on the test sets of db1 and db2 are weaker than those on the training sets of these datasets, and this must be attributed to the inappropriate implementation of the feature extraction method. The learning rates of the kernel vectors and spread parameters, i.e. η₂ and η₁, are much smaller for the datasets of groups B and C than for the datasets of group A, but the value of η₃ (the learning rate for the weight matrix of the output layer) does not change substantially for datasets from different groups. The initial spread parameters for the datasets of group A are much larger than for the datasets of groups B and C.
(2) Results obtained by the BST algorithm: The first eminent point is the

decreased recognition errors on all datasets. The BST algorithm has achieved much better results in a shorter time; especially on db3, db4, db5, dbn3, dbn4, dbn5, db11, and db13 the recognition error has reached zero. For evaluating and ranking these datasets we have two other measures: convergence speed and the number of features. Regarding convergence speed, the best datasets are db4, db13, db11, dbn4, db3, dbn3, db5, dbn5, although in some cases the differences are too slight to be meaningful; db3, dbn3 and db13 should be ruled out because of the high dimension of their feature vectors. In addition, training does not benefit from data normalization. The BST algorithm also solves the overtraining problem. We notice that it has decreased the error rate on the training set, but not at the cost of increased error on the test set. This is even more obvious from the results obtained on db1 and db2. In addition, this can be verified from the results demonstrated in Table 3.
(3) In Table 3 we have compared the recognition errors at the epochs n and N, i.e. at the beginning and the end of the selective training obtained by the BST algorithm, against the recognition errors at the same epochs obtained by the BB algorithm, on four datasets. It shows that after TSSE reaches the threshold value, if we continue training with basic backpropagation, the recognition error either decreases only trivially or even increases (e.g. on the test sets of db4 and db11). Some researchers use the cross-validation technique to find this point and stop training at it, but we oppose applying the cross-validation method to neural network training.
(4) Table 4 shows the results of another experiment performed on db4 with different network settings. The threshold value was set equal to 100, the stability margin equal to 0.1, the learning rate parameters were divided only by two for the second phase of training, and the initial weight matrix was changed. While at epoch number 22 the error rates of both algorithms on the training set and the test set are equal, i.e. 26 and 14, respectively, the BST algorithm has reached zero error on both sets in 23 epochs of selective training, whereas the error rates of the BB algorithm after 100 epochs of unselective training are 6 and 3 on the training set and test set, respectively.


Table 3
Comparing the recognition errors of the BB and the BST algorithms in two points of training

            BB                        BST
Database    Epoch    Train   Test     Epoch    Train   Test
db1         65       20      29       65       20      29
            104      11      25       104      0       17
db4         40       9       4        40       9       4
            100      9       6        55       0       0
db8         78       89      41       78       89      41
            115      83      40       115      14      7
db11        48       50      26       48       50      26
            100      49      28       65       0       0

Table 4
Comparing the recognition errors of the RBF network on db4 obtained by the BB and the BST algorithms

BB                             BST
Epoch n   Train   Test         Epochs n, N   Train   Test
22        26      14           22, 45        0       0
100       6       3            100, 120      0       0

(5) The reader should recall that in the selective training mode, the calculation for weight updating (or backpropagating), which is the most time-consuming step of training, is carried out only for misclassified patterns, whose number at the beginning of selective training is less than 89, or 5% of all training samples (see Tables 2 and 3); thus one epoch of selective training is at least five times faster than one of unselective training, and as the number of misclassified patterns decreases over time it becomes faster and faster. Therefore, the BST algorithm is at least three times faster than the BB algorithm.
(6) Fig. 3 demonstrates TSSE versus the epoch number, obtained on db4, corresponding to the experiment of Table 4. By changing the training algorithm we face a sudden descent in TSSE, and this must be attributed to the sharp decrease of the learning rate factors. Our explanation for this phenomenon is as follows: after approaching the minimum well, by using large values of the learning rate and momentum factor we step over the well; but by decreasing the step size we step into the middle of the minimum well and fall to the bottom. This phenomenon inspired us to devise the BDLRF [17] and BPLRF [19] algorithms.


Fig. 3. Convergence diagrams (TSSE versus epoch number) of the RBF network obtained by the BB and the BST algorithms on db4, corresponding to Table 4.

BDLRF and BPLRF are acronyms for backpropagation with declining learning rate factor and backpropagation with plummeting learning rate factor, respectively. In [17,19] we have shown how to speed up training and improve the recognition rate in the MLP by decreasing the learning rate factor. We have also shown that a larger decrease in the values of the training factors can result in a larger decrease in the cost function, and a better recognition rate.
(7) In addition, we notice that during training in the second phase, while the recognition error decreases, TSSE increases, which substantiates our statement that our method does not overtrain the network on learned patterns. On the contrary, if the network has been overtrained on some patterns in the first phase, by increasing TSSE and decreasing the recognition error it is de-emphasizing the already overtrained patterns. In other words, the decreased recognition error on unlearned patterns must be a result of the decreased SSE (sum-squared error) resulting from the same patterns. Thus, for an increase of TSSE simultaneous with a decrease of the recognition error, there has to be an increase in the SSE resulting from already learned patterns without crossing the stability margin, and this means de-emphasizing the overtrained patterns. Therefore, our method decreases the error on the training set, but not at the cost of overtraining and increased error on the test set.

4.2.4. Considerations
(1) By starting from a different initial point, the number of training epochs will change slightly, but not so much as to affect the general conclusions.
(2) As already mentioned, we considered three types of activity functions for the output layer: sigmoid, linear, and pseudo-linear. We faced numerous problems


with both the linear and pseudo-linear activity functions, as explained in the following:

• Slow learning: they do not allow the use of large learning rate factors. If a large learning rate is used, the network will oscillate and will not converge to a minimum.
• High sensitivity to parameter values, and a wide range in which the optimal parameter values lie for different datasets. The optimal values change enormously for different datasets (up to four orders of magnitude), which makes the parameter tuning procedure an exhausting task.
• Contrary to their superficial simplicity, they need far more computations per iteration (refer to Appendix A for their formulations). Thus, they are slower than the sigmoid output from this aspect as well.
• Their recognition errors are higher than those of the sigmoidal activity function.
• The aforesaid problems worsen on the datasets of groups B and C.
• The pseudo-linear activity function has better performance than the linear one, in terms of convergence speed, recognition rate, and sensitivity to learning parameters.

Therefore, output cells with a sigmoid activity function are preferred over other activity functions, since they result in less sensitivity to learning parameters, faster convergence, and lower recognition error. Although applying the BST algorithm to the RBF network with linear and pseudo-linear outputs does improve their performance, they do not surpass the RBF network with sigmoid output trained with the BST algorithm.
(3) We tried three training paradigms:

• Half training: Only the weight matrix of the output layer was under training.
• Half training: The weight matrix of the output layer and the kernel vectors were under training.
• Full training: The weight matrix of the output layer, the kernel vectors and the spread parameters were under training.

If the half-training paradigm is chosen, then, considering that the kernel vectors will not be in the optimal positions and the spread parameters will not have the optimum values, we have to increase the number of kernel vectors, otherwise the recognition error will increase. In the case of increasing the number of kernel vectors, both training and on-line operation will slow down drastically. On the other hand, if the full-training paradigm is chosen, the number of learning parameters increases to three, that is η₃, η₂ and η₁, adjusting which is an exhausting task. All in all, the full-training paradigm seems to be the most beneficial method.
(4) We adopted two policies for initializing the kernel vectors:

• the first samples,
• the prototype vectors resulting from the k-means algorithm.


Table 5
Comparing performance of the RBF network with different initialization policies

BB                            BST                           Initial values
Epoch n   Train   Test        Epochs n, N   Train   Test    Kernel vectors             σ²
22        26      14          22, 45        0       0       First samples              7
100       6       3           100, 120      0       0
21        25      14          21, 45        0       0       k-means cluster centers    7
100       9       6           100, 110      0       0
27        31      16          27, 45        0       0       First samples              From k-means
100       24      14          100, 104      0       0
38        27      13          38, 78        0       0       k-means cluster centers    From k-means
100       19      9           100, 110      0       0

Although the last method of initialization does yield faster convergence, the difference between these two types of initialization becomes trivial when the number of kernel vectors grows smaller. More precisely, when the number of kernel vectors is kept small, using the second initialization method speeds up convergence only at the very beginning, but in the middle and at the end of training convergence slows down, and the global convergence is not better than that obtained by the first method (see Table 5). Notwithstanding, before training the RBF network we need to run the k-means algorithm to get initial values for the spread parameters, and using the created cluster centers can be done in no time.
(5) A major drawback of the RBF network lies in its size, and therefore its speed. Unlike the MLP, in the RBF network we cannot increase the size of the network incrementally. For instance, in our case study the network size was l₁ × 32 × 32; if we had decided to enlarge the network, the next size would have been l₁ × 64 × 32, and this means that the network size would be doubled. Considering that speed will decrease by an order larger than one, the improved performance would cost a substantial slow-down in both training and on-line operation, which would make the RBF network unable to compete with other networks, and therefore practically useless. Consequently, usage of the RBF network is not recommended if the number of pattern classes is high.
(6) The best policy for the initial values of the spread parameters is to set equal initial values for all of them, but to change them through training. Considering the kernel vectors, the most important aspect is adjusting them during training. In this case, selecting the first samples or the prototype vectors derived from the k-means algorithm yields similar results (see Table 5).
(7) The data normalization method offered by Wasserman was tried [20, p. 161]. Wasserman suggests, for each component of the input training vectors:

    1. Find the standard deviation over the training set.


2. Divide the corresponding component of each training vector by this standard deviation.

Considering that some components of the feature vectors are equal for all patterns except for some from specific classes, the standard deviations of these components are very small, and by dividing these components by their standard deviations their values increase enormously; this attenuates the impact of the other components in the norm of the difference vector ‖x_q − v_m‖ almost to zero. The only effect of Wasserman's normalization method was destabilizing the whole system.
(8) Decreasing the learning parameters during selective training stabilizes the network and speeds up training by preventing repeated overtraining and unlearning of some patterns.

    5. Conclusions

(1) In this paper we presented the BST algorithm, and showed that on the given datasets the BST algorithm improves the performance of RBF networks substantially, in terms of convergence speed and recognition error. The BST algorithm achieves much better results in a shorter time. It solves three drawbacks of the backpropagation algorithm: overtraining, slow convergence at the end of training, and inability to learn the last few percent of patterns. In addition, it has the advantages of shortening training time (up to three times) and partially de-emphasizing overtrained patterns.

(2) As there is no universally effective method, the BST algorithm is no exception and has its own shortcomings. Since contradictory data or the overlapping part of the data cannot be learned, applying selective training to data with a large overlapping area will destabilize the system. But it is quite effective when the dataset is error-free and non-overlapping, as is the case with every error-free character-recognition database, when a sufficient number of proper features is extracted.

(3) The best training paradigm is full training, because it utilizes all the capacity of the network. Using the sigmoidal activity function for the neurons of the output layer is recommended, because it results in less sensitivity to learning parameters, faster convergence and lower recognition error.

    Acknowledgements

This work has been partially supported within the framework of the Slovenian-Iranian Bilateral Scientific Cooperation Treaty. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the quality of the paper. We would like to thank Alan McConnell Duff for the linguistic revision.


    Appendix A. Error gradients of RBF network

Let the network have the configuration as depicted in Fig. 1, and

x      input vector ($= [x_1, x_2, \ldots, x_{l_1}]^{\mathrm{T}}$)
l_1    dimension of the input vector
l_2    number of neurons in the hidden layer
v_m    prototype vector corresponding to the mth hidden cell ($= [v_{1m}, v_{2m}, \ldots, v_{l_1 m}]^{\mathrm{T}}$)
V      matrix of prototype vectors ($= [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{l_2}]$)
y_m    output of the mth hidden cell
l_3    dimension of the output vector
u_j    weight vector of the jth output cell ($= [u_{1j}, u_{2j}, \ldots, u_{l_2 j}]^{\mathrm{T}}$)
U      weight matrix of the output layer ($= [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{l_3}]$)
z_j    actual output of the jth output cell
t_j    desired output of the jth output cell
Q      number of training patterns

Let TSSE be the cost function defined as
$$\mathrm{TSSE} = \sum_q E_q, \qquad E_q = \sum_k (t_k^q - z_k^q)^2 \quad (q = 1, \ldots, Q), \qquad (A.1)$$
and let E be the simplified notation for E_q:
$$E = \sum_j (t_j - z_j)^2. \qquad (A.2)$$

We will calculate the error gradients for pattern mode training. Obtaining the error gradients for batch mode training is straightforward, as explained in the remark at the end of this appendix.

We will consider three types of activity functions for the output cells:
$$z_j = \begin{cases} \dfrac{1}{1 + e^{-s_j}}, & \text{sigmoid}, \\[2mm] \dfrac{s_j}{l_2}, & \text{linear, with } \dfrac{1}{l_2} \text{ squashing function}, \\[2mm] \dfrac{s_j}{\sum_m y_m}, & \text{pseudo-linear, with } \dfrac{1}{\sum_m y_m} \text{ squashing function}, \end{cases} \qquad (A.3)$$
where
$$s_j = \sum_m y_m u_{mj} \qquad (A.4)$$
and
$$y_m = \exp\left(-\frac{\|\mathbf{x} - \mathbf{v}_m\|^2}{2\sigma_m^2}\right). \qquad (A.5)$$


A.1. Part 1: Error gradients versus weights of the output layer

By using the chain rule for derivatives we get
$$\frac{\partial E}{\partial u_{mj}} = \underbrace{\frac{\partial E}{\partial z_j}}_{\mathrm{I}}\;\underbrace{\frac{\partial z_j}{\partial s_j}}_{\mathrm{II}}\;\underbrace{\frac{\partial s_j}{\partial u_{mj}}}_{\mathrm{III}} \qquad (A.6)$$
and we will calculate all three terms in three cases.

Computing (I): The first term is the same for all three cases, i.e.
$$\mathrm{I} = \frac{\partial E}{\partial z_j} = -2(t_j - z_j) \quad \text{for cases (1)-(3)}. \qquad (A.7)$$

Computing (II):
$$\mathrm{II} = \frac{\partial z_j}{\partial s_j} = \begin{cases} \dfrac{1}{l_2} & \text{for case (1)}, \\[2mm] \dfrac{1}{\sum_m y_m} & \text{for case (2)}, \end{cases} \qquad (A.8)$$
and for the third case (sigmoid output) we have
$$\mathrm{II} = \frac{\partial z_j}{\partial s_j} = \frac{e^{-s_j}}{(1 + e^{-s_j})^2} = \frac{1 + e^{-s_j} - 1}{(1 + e^{-s_j})^2} = z_j - z_j^2 = z_j(1 - z_j). \qquad (A.9)$$

Computing (III): The third term is identical for all cases:
$$\mathrm{III} = \frac{\partial s_j}{\partial u_{mj}} = y_m. \qquad (A.10)$$

By putting all partial results together we obtain
$$\frac{\partial E}{\partial u_{mj}} = \begin{cases} -2(t_j - z_j)\,\dfrac{y_m}{l_2}, & \text{case (1)}, \\[2mm] -2(t_j - z_j)\,\dfrac{y_m}{\sum_m y_m}, & \text{case (2)}, \\[2mm] -2(t_j - z_j)\,z_j(1 - z_j)\,y_m, & \text{case (3)}. \end{cases} \qquad (A.11)$$

A.2. Part 2: Error gradients versus components of prototype vectors

We have
$$\frac{\partial E}{\partial v_{im}} = \sum_j \underbrace{\frac{\partial E}{\partial z_j}}_{\mathrm{I}}\;\underbrace{\frac{\partial z_j}{\partial y_m}}_{\mathrm{II}}\;\underbrace{\frac{\partial y_m}{\partial v_{im}}}_{\mathrm{III}}. \qquad (A.12)$$

Computing (I): For all three cases we have
$$\mathrm{I} = \frac{\partial E}{\partial z_j} = -2(t_j - z_j). \qquad (A.13)$$


    Computing (II): For case (1)

$$z_j = \frac{s_j}{l_2} \;\Rightarrow\; \mathrm{II} = \frac{\partial z_j}{\partial y_m} = \frac{u_{mj}}{l_2}. \qquad (A.14)$$

    For case (2)

$$z_j = \frac{s_j}{\sum_m y_m} = \frac{\sum_m y_m u_{mj}}{\sum_m y_m}; \qquad (A.15)$$
since m is a dummy variable, we can change it to k:
$$z_j = \frac{s_j}{\sum_k y_k} = \frac{\sum_k y_k u_{kj}}{\sum_k y_k}, \qquad (A.16)$$
$$\mathrm{II} = \frac{\partial z_j}{\partial y_m} = \frac{u_{mj}\sum_k y_k - \sum_k y_k u_{kj}}{\left(\sum_k y_k\right)^2} = \frac{u_{mj}\sum_k y_k - s_j}{\left(\sum_k y_k\right)^2}. \qquad (A.17)$$

    For case (3)

$$\frac{\partial z_j}{\partial y_m} = \underbrace{\frac{\partial z_j}{\partial s_j}}_{\mathrm{IV}}\;\underbrace{\frac{\partial s_j}{\partial y_m}}_{\mathrm{V}}; \qquad (A.18)$$
considering
$$z_j = \frac{1}{1 + e^{-s_j}} \qquad (A.19)$$
and
$$s_j = \sum_m u_{mj} y_m, \qquad (A.20)$$
we have
$$\mathrm{IV} = z_j(1 - z_j) \qquad (A.21)$$
and
$$\mathrm{V} = u_{mj}. \qquad (A.22)$$
Then, by putting the partial derivatives together, we obtain
$$\mathrm{II} = \frac{\partial z_j}{\partial y_m} = z_j(1 - z_j)\,u_{mj}. \qquad (A.23)$$

    Computing (III): For all three cases we have

$$y_m = \exp\left(-\frac{\|\mathbf{x} - \mathbf{v}_m\|^2}{2\sigma_m^2}\right) = \exp\left(-\frac{\sum_i (x_i - v_{im})^2}{2\sigma_m^2}\right), \qquad (A.24)$$


    then

$$\mathrm{III} = \frac{\partial y_m}{\partial v_{im}} = y_m\left(\frac{x_i - v_{im}}{\sigma_m^2}\right). \qquad (A.25)$$

    By putting the partial results together we have

$$\frac{\partial E}{\partial v_{im}} = \begin{cases} \displaystyle -\sum_j 2(t_j - z_j)\,\frac{u_{mj}}{l_2}\,\frac{y_m}{\sigma_m^2}\,(x_i - v_{im}), & \text{case (1)}, \\[3mm] \displaystyle -\sum_j 2(t_j - z_j)\,\frac{u_{mj}\sum_k y_k - s_j}{\left(\sum_k y_k\right)^2}\,\frac{y_m}{\sigma_m^2}\,(x_i - v_{im}), & \text{case (2)}, \\[3mm] \displaystyle -\sum_j 2(t_j - z_j)\,z_j(1 - z_j)\,u_{mj}\,\frac{y_m}{\sigma_m^2}\,(x_i - v_{im}), & \text{case (3)}. \end{cases} \qquad (A.26)$$

A.3. Part 3: Error gradients versus spread parameters

We have
$$\frac{\partial E}{\partial \sigma_m^2} = \sum_j \underbrace{\frac{\partial E}{\partial z_j}}_{\mathrm{I}}\;\underbrace{\frac{\partial z_j}{\partial y_m}}_{\mathrm{II}}\;\underbrace{\frac{\partial y_m}{\partial \sigma_m^2}}_{\mathrm{III}}; \qquad (A.27)$$
terms I and II are exactly as in Part 2, therefore we only need to calculate the third term, which will have an identical formulation in all three cases:
$$y_m = \exp\left(-\frac{\|\mathbf{x} - \mathbf{v}_m\|^2}{2\sigma_m^2}\right) \qquad (A.28)$$
and
$$\mathrm{III} = \frac{\partial y_m}{\partial \sigma_m^2} = y_m\,\frac{\|\mathbf{x} - \mathbf{v}_m\|^2}{2\sigma_m^4}; \qquad (A.29)$$
by putting the partial results together we obtain
$$\frac{\partial E}{\partial \sigma_m^2} = \begin{cases} \displaystyle -\sum_j 2(t_j - z_j)\,\frac{u_{mj}}{l_2}\,y_m\,\frac{\|\mathbf{x} - \mathbf{v}_m\|^2}{2\sigma_m^4}, & \text{case (1)}, \\[3mm] \displaystyle -\sum_j 2(t_j - z_j)\,\frac{u_{mj}\sum_k y_k - s_j}{\left(\sum_k y_k\right)^2}\,y_m\,\frac{\|\mathbf{x} - \mathbf{v}_m\|^2}{2\sigma_m^4}, & \text{case (2)}, \\[3mm] \displaystyle -\sum_j 2(t_j - z_j)\,z_j(1 - z_j)\,u_{mj}\,y_m\,\frac{\|\mathbf{x} - \mathbf{v}_m\|^2}{2\sigma_m^4}, & \text{case (3)}. \end{cases} \qquad (A.30)$$

Remark. All the above formulas have been calculated for pattern mode training. For batch mode training, only the term $\sum_{q=1}^{Q}$ should be added in front of Eqs. (A.11), (A.26) and (A.30), i.e. all partial results should be summed up. Our experience shows that batch mode training is much slower than pattern mode training, in addition to its implementation intricacy and high memory demand.
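For the sigmoid output cells (case (3)), the three gradients (A.11), (A.26) and (A.30) can be collected into a single routine. The NumPy sketch below is written directly from the formulas above (our illustration, not the authors' code) and could serve as the grad_E helper assumed in the training-loop sketches of Section 3; the sigmoid prime offset of footnote 1 is omitted here:

```python
import numpy as np

def grad_E(x, t, V, sigma2, U):
    """Pattern-mode gradients of E = sum_j (t_j - z_j)^2 for sigmoid output
    cells, following Eqs. (A.11), (A.26) and (A.30), case (3).

    Shapes: x (l1,), t (l3,), V (l1, l2), sigma2 (l2,), U (l2, l3).
    """
    diff = x[:, None] - V                          # (l1, l2): x_i - v_im
    dist2 = np.sum(diff ** 2, axis=0)              # ||x - v_m||^2
    y = np.exp(-dist2 / (2.0 * sigma2))            # hidden outputs, Eq. (A.5)
    s = y @ U                                      # net inputs, Eq. (A.4)
    z = 1.0 / (1.0 + np.exp(-s))                   # sigmoid outputs

    delta = -2.0 * (t - z) * z * (1.0 - z)         # (l3,): terms I * II
    dE_dU = np.outer(y, delta)                     # (l2, l3): Eq. (A.11), case (3)

    back = U @ delta                               # (l2,): sum_j delta_j * u_mj
    dE_dV = diff * (back * y / sigma2)[None, :]    # (l1, l2): Eq. (A.26), case (3)
    dE_dsigma2 = back * y * dist2 / (2.0 * sigma2 ** 2)   # (l2,): Eq. (A.30)
    return dE_dU, dE_dV, dE_dsigma2
```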


    Appendix B. Feature extraction methods

B.1. Group A: db1-db5

These datasets were created by extracting principal components with a single-layer feedforward linear neural network trained with the Generalized Hebbian Training Algorithm (GHA) [9,7], as summarized in the following. For extracting the principal components, we used an m × l single-layer feedforward linear network with the Generalized Hebbian Training Algorithm (GHA).

db1: To train the network, 8 × 8 non-overlapping blocks of the image of every character were considered as input vectors. The image was scanned from top left to bottom right and l was set equal to 8. Therefore, for every character 72 features were extracted. The training was performed with 34 samples per character and the learning rate was set to η = 7 × 10⁻³. To give the network enough time to learn the statistics of the data, the training procedure was repeated for three epochs.

db2: The same as db1, but l was set equal to 6; thus for any character 54 features were extracted.
db3: The image matrix of any character was converted into a vector by scanning vertically from top left to bottom right; then this vector was partitioned into 9 vectors, which were inserted into the network as 9 input vectors. In this way, 72 features were extracted for every character.
db4: Similar to db1, but the dimension of the input blocks was considered to be 24 × 3, i.e. every three rows were considered as one input vector. In this way, for any character 64 features were extracted.
db5: Similar to db4, but l was set equal to 6; thus for any character 48 features were extracted.

B.2. Group B: dbn1-dbn5

These datasets are normalized versions of db1-db5. After creating any dataset, the feature vectors were normalized by mapping the ith component of all the vectors into the interval [0, 1].
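A minimal sketch of this component-wise normalization (a plain min-max mapping is assumed here, since the text only states that each component is mapped into [0, 1]):

```python
import numpy as np

def normalize_components(F):
    """Map the i-th component of all feature vectors into [0, 1].

    F : feature vectors, one row per pattern, shape (Q, n_features)
    """
    lo = F.min(axis=0)
    rng = F.max(axis=0) - lo
    rng[rng == 0] = 1.0          # leave constant components unchanged at 0
    return (F - lo) / rng
```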

B.3. Group C: db6-db13

db6: This dataset was created by the zoning method. Each character image was divided into four overlapping squares and the percentage of black pixels in each square was obtained. The size of overlap was set to two pixels in each edge, which yields the best recognition rate. The best recognition rate on this dataset does not exceed 53%, so its features were used only in combination with other features.

db7: Pixel change coding was used to extract the feature vectors of this dataset.
db8: The feature vectors of this dataset were extracted by vertical and horizontal projections (a minimal sketch of such projection features is given after this list).


db9: The feature vectors of this dataset were extracted by diagonal projection. Ten components from the beginning and seven components from the end were deleted, because their values were zero for all characters. The best recognition rate on this dataset does not reach 85%, so its features were used only in combination with other features.
db10: By concatenating the feature vectors of db8 and db9, the feature vectors of this dataset were created.
db11: The feature vectors of this dataset were created by concatenating the feature vectors of db6 and db7.
db12: The feature vectors of this dataset were created by concatenating the feature vectors of db6 and db8.
db13: The feature vectors of this dataset were created by concatenating the feature vectors of db11 and some selected features from db8, that is, 10 features from the middle of both the vertical and horizontal projections.
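As referenced in the db8 entry above, the following sketch illustrates projection-based features for a binary character image (the 24 × 24 size follows Section 4.1; the exact feature layouts of db8-db10 shown in the comments are assumptions for illustration only):

```python
import numpy as np

def projection_features(img):
    """Horizontal, vertical and diagonal projection profiles of a binary
    character image (0 = background, 1 = ink), e.g. a 24 x 24 array."""
    horizontal = img.sum(axis=1)                 # ink count per row
    vertical = img.sum(axis=0)                   # ink count per column
    diag = np.array([np.trace(img, offset=k)     # ink count per diagonal
                     for k in range(-img.shape[0] + 1, img.shape[1])])
    return horizontal, vertical, diag

# db8-style vector: horizontal and vertical projections concatenated
# db10-style vector: db8 features followed by the diagonal projection
```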

    References

[1] M.A. Aizerman, E.M. Braverman, I.E. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Remote Control 25 (1964) 821-837.
[2] S. Amari, N. Murata, K.R. Müller, M. Finke, H.H. Yang, Asymptotic statistical theory of overtraining and cross-validation, IEEE Trans. Neural Networks 8 (5) (1997) 985-996.
[3] T. Andersen, T. Martinez, Cross validation and MLP architecture selection, Proceedings of the International Joint Conference on Neural Networks, IJCNN'99, Cat. No. 99CH36339, Vol. 3 (part 3), 1999, pp. 1614-1619.
[4] O.A. Bashkirov, E.M. Braverman, I.B. Muchnik, Potential function algorithms for pattern recognition learning machines, Automat. Remote Control 25 (1964) 629-631.
[5] D.S. Broomhead, D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems 2 (1988) 321-355.
[6] T.M. Cover, Geometrical and statistical properties of systems of linear inequalities with application in pattern recognition, IEEE Trans. Electron. Comput. EC-14 (1965) 326-334.
[7] K.I. Diamantaras, S.Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, New York, 1996.
[8] S.C. Fahlman, An empirical study of learning speed in backpropagation networks, Technical Report CMU-CS-88-162, Carnegie Mellon University, Pittsburgh, PA 15213, September 1988.
[9] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, USA, 1999.
[10] G. Jondarr, Backpropagation family album, Technical Report, Department of Computing, Macquarie University, New South Wales, August 1996.
[11] C.G. Looney, Pattern Recognition Using Neural Networks, Oxford University Press, New York, 1997.
[12] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943) 115-133.
[13] D.B. Parker, Learning logic, Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, March 1982.
[14] L. Prechelt, Automatic early stopping using cross validation: quantifying the criteria, Neural Networks 11 (4) (1998) 761-767.
[15] D.E. Rumelhart, J.L. McClelland, Parallel Distributed Processing: Exploration in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.
[16] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, USA, 1999.
[17] M.T. Vakil-Baghmisheh, N. Pavešić, Backpropagation with declining learning rate, Proceedings of the 10th Electrotechnical and Computer Science Conference, Portorož, Slovenia, Vol. B, September 2001, pp. 297-300.
[18] M.T. Vakil-Baghmisheh, Farsi character recognition using artificial neural networks, Ph.D. Thesis, Faculty of Electrical Engineering, University of Ljubljana, Slovenia, October 2002.
[19] M.T. Vakil-Baghmisheh, N. Pavešić, A fast simplified fuzzy ARTMAP network, Neural Process. Lett., 2003, in press.
[20] P.D. Wasserman, Advanced Methods in Neural Computing, Van Nostrand Reinhold, New York, 1993.
[21] P.J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Harvard University, Cambridge, MA, 1974.

Mohammad-Taghi Vakil-Baghmisheh was born in 1961 in Tabriz, Iran. He received his B.Sc. and M.Sc. degrees in electronics from Tehran University in 1987 and 1991. In 2002, he received his Ph.D. degree from the University of Ljubljana, Slovenia, with a dissertation on neural networks in the Faculty of Electrical Engineering.

Nikola Pavešić was born in 1946. He received his B.Sc. degree in electronics, M.Sc. degree in automatics, and Ph.D. degree in electrical engineering from the University of Ljubljana, Slovenia, in 1970, 1973 and 1976, respectively. Since 1970 he has been a staff member at the Faculty of Electrical Engineering in Ljubljana, where he is currently head of the Laboratory of Artificial Perception, Systems and Cybernetics. His research interests include pattern recognition, neural networks, image processing, speech processing, and information theory. He is the author and co-author of more than 100 papers and 3 books addressing several aspects of the above areas.

Professor Nikola Pavešić is a member of IEEE, the Slovenian Association of Electrical Engineers and Technicians (Meritorious Member), the Slovenian Pattern Recognition Society, and the Slovenian Society for Medical and Biological Engineers. He is also a member of the editorial boards of several technical journals.
