
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 2, MARCH 2004 283

Entropy-Based Generation of Supervised Neural Networks for Classification of Structured Patterns

Hsien-Leing Tsai and Shie-Jue Lee

Abstract—Sperduti and Starita [32] proposed a new type of neural network which consists of generalized recursive neurons for classification of structures. In this paper, we propose an entropy-based approach for constructing such neural networks for classification of acyclic structured patterns. Given a classification problem, the architecture, i.e., the number of hidden layers and the number of neurons in each hidden layer, and all the values of the link weights associated with the corresponding neural network are automatically determined. Experimental results have shown that the networks constructed by our method can have a better performance, with respect to network size, learning speed, or recognition accuracy, than the networks obtained by other methods.

Index Terms—Delta rules, generalized recursive neuron, information entropy, multilayer perceptrons, structured patterns.

I. INTRODUCTION

CLASSIFICATION of structured patterns is necessary in many applications such as molecular biology and chemistry classifications, speech and text processing, geometrical and spatial reasoning, medical diagnoses, and so on. Conventional neural networks are usually believed to be inadequate when dealing with structured patterns due to the sensitivity to the features selected for representation and the incapacity to represent specific relationships among the components of the structures [32]. To overcome the difficulties, Sperduti and Starita [32] proposed a new type of network architecture which consists of generalized recursive neurons for classification of structured patterns. Each generalized recursive neuron receives two kinds of inputs, the neuron outputs of the previous layer and the previously computed outputs of the related vertices indicated in the structure of the underlying pattern, to generate its output. The architecture has been shown useful in performing classification tasks involving structured patterns. However, constructing such a network automatically for a certain application would be desirable, especially when the patterns under consideration are complex.

Many methods have been proposed for determining the architecture of a neural network for a certain application. Kung and Hwang [19] used the algebraic projection analysis to specify the size of hidden layers. Fahlman and Lebiere [12] proposed cascade-correlation neural networks (CCNN) in which new hidden nodes are added dynamically by minimizing the mean square error and maximizing a correlation measure. Goodman et al. [16] used a J-measure to derive from training data a set of rules which are then used to construct a neural network.

Manuscript received August 8, 2002; revised September 23, 2003. This work was supported by the National Science Council under Grant NSC-90-2213-E-110-013.

H.-L. Tsai and S.-J. Lee are with the Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan (e-mail: leesj@ee.nsysu.edu.tw).

Digital Object Identifier 10.1109/TNN.2004.824253

Nadal [28], Bichsel and Seitz [4], Cios and Liu [10], and Lee et al. [21], [22] used entropy to determine the number of hidden layers and the number of nodes in each hidden layer. Tontini et al. [34], Dae [11], and Lee and Ho [20] obtained the initial network prototype by the use of adaptive resonance theory [7], [17]. However, these methods were all developed for conventional neural networks.

Recurrent multilayer perceptrons (MLP) proposed in [29], [35] contain recurrent feedbacks. Such a network requires that all the recurrent feedbacks of a layer connect to each node of this layer. Fahlman [13] proposed recurrent CCNN for processing sequential data. Each hidden neuron receives information not only from the input neurons and all the previous hidden neurons but also from a recurrent feedback. Both recurrent MLP and recurrent CCNN networks can be used to process structured data by conveying structured information through recurrent feedbacks. However, the number of layers and hidden nodes for a recurrent MLP network should be determined manually by trial-and-error. Sperduti and Starita [32], [3] proposed a cascade-correlation algorithm, abbreviated as S&S CCNN, to construct CCNN networks for processing structured patterns. However, constructing both recurrent CCNN and S&S CCNN networks requires the determination of the number of candidates by the user. Furthermore, trying all the candidates to select the best choice for each neuron results in a long training time.

We propose an entropy-based approach for constructing neural networks that are composed of generalized recursive neurons. For a classification problem with acyclic structured patterns, a corresponding neural network is generated automatically. The number of hidden layers and the number of neurons in each hidden layer are determined by minimizing the information entropy function associated with the partition induced by the participating hyperplanes. A learning process combining delta rules and simulated annealing is used to obtain optimal link weights for each neuron. Experimental results have shown that the networks constructed by our method can have a smaller size, a faster learning speed, and a better recognition accuracy than the networks obtained by other methods.

The rest of the paper is organized as follows. Section II gives a brief introduction to generalized recursive neurons and the networks composed of them. Section III describes entropy measures for deriving hyperplanes for hidden and output neurons. Section IV develops the delta rules to be used for finding optimal hyperplanes systematically. Section V describes the procedure for building the whole neural network, together with an example for illustration. Modified algorithms for constructing simpler networks are presented in Section VI. Section VII gives some experimental results to demonstrate the effectiveness of our proposed method. Finally, concluding remarks are given in Section VIII.


Fig. 1. Structured pattern g of class "c" and its corresponding sorted data set.

II. STRUCTURED NEURAL NETWORKS

The networks proposed by Sperduti and Starita [32] deal with patterns represented by directed graphs. For convenience, we call such networks structured neural networks. Let g be a structured pattern of a data set G. A vertex is called the super-source vertex of g if every vertex in g can be reached by a path starting from it. We assume that each structured pattern has a unique super-source vertex. The out-degree of a vertex in g is defined to be the number of edges leaving it. The valence of g is defined to be the maximum out-degree of all the vertices in g. The valence of G is the maximum valence of all patterns in G. For a vertex with children, we refer to them as its first child, second child, and so on.

Usually, each vertex of a structured pattern is associated with a set of attributes, represented by a vector called the vertex feature vector. To facilitate processing, a structured pattern is expressed by a sorted data set in which vertices are listed level by level, starting from bottom to top and left to right. Fig. 1 shows an example pattern g of class "c" and its corresponding sorted data set. Note that the last listed vertex is the super-source vertex of g, and the valence of g is 2. For clarity, the sorted data set of g is represented in columns. The first column lists vertex names, and the vertex feature vector of each vertex is listed in the second column. The third and fourth columns give the left child and right child, respectively, of the underlying vertex, with "nil" indicating no child for the entry. The class that g belongs to is specified in the last column of the entry for the super-source vertex, and a do-not-care "—" is given in the entries of the other vertices. In the rest of the paper, we will use vertex and vertex feature vector interchangeably if no confusion arises.
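To make the sorted-data-set layout concrete, a pattern like the one in Fig. 1 can be held in a small table-like structure, sketched below in Python. The vertex names, feature values, and five-vertex shape are invented placeholders patterned after the time-step walkthrough later in this section, not the actual contents of Fig. 1.

# Each row: (vertex name, vertex feature vector, left child, right child, class label).
# Vertices are listed level by level, bottom to top and left to right, so the
# super-source vertex comes last; None children and a None class play the roles
# of "nil" and the do-not-care "—" in the column conventions described above.
sorted_data_set = [
    ("v1", [1, 0, 1], None, None, None),   # leaf vertex: no children, class is "don't care"
    ("v2", [0, 1, 0], None, None, None),   # leaf vertex
    ("v3", [1, 1, 0], "v1", "v2", None),   # internal vertex with two children
    ("v4", [0, 0, 1], None, None, None),   # leaf vertex
    ("v5", [1, 0, 0], "v3", "v4", "c"),    # super-source vertex carries the class label
]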

A structured neural network consists of generalized recursive neurons and a control center [32]. The neurons are organized in layers. The input and output layers are the first and last layers, respectively, and the hidden layers appear in between. Each neuron is connected with two kinds of links, standard links and recursive links, weighted by the standard weights and recursive weights, respectively, as shown in Fig. 2. The outputs of the neurons in one layer are forwarded to the neurons of the next layer through standard weights. The previously computed neuron outputs, stored in the control center, are fed back appropriately through recursive weights. The output of a neuron in a given layer for a vertex of pattern g is computed as follows:

(1)

Fig. 2. Generalized recursive neuron.

Fig. 3. Vertex A is being processed by a structured network.

where the first sum runs over the outputs of the neurons in the previous layer (the first layer being the input to the network), the second sum runs, for each child vertex of the current vertex (up to the valence of the data set being considered), over the previously computed outputs of the neurons of this layer for that child, and the activation is the sigmoidal function. After the output of a neuron for a vertex is computed, it is passed forward to the next layer. Also, it is copied to the control center to be used as feedback values later. The number of nodes in the input layer equals the dimension of the vertex feature vector. Each input neuron receives one individual attribute value and passes the same value to the next layer.
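The body of (1) is missing from this copy of the text. Based on the variable descriptions above and the generalized recursive neuron of Sperduti and Starita [32], a plausible reconstruction, with the symbols o, w, \hat{w}, N, val, and ch assumed rather than taken from the original, is

o_i^{(l)}(v) = f\Big( \sum_{j=1}^{N_{l-1}} w_{ij}\, o_j^{(l-1)}(v) + \sum_{k=1}^{val} \sum_{j=1}^{N_l} \hat{w}_{ijk}\, o_j^{(l)}\big(ch_k(v)\big) \Big)

where o_i^{(l)}(v) is the output of neuron i in layer l for vertex v, w are standard weights, \hat{w} are recursive weights, N_{l-1} and N_l are the numbers of neurons in layers l-1 and l, val is the valence of the data set, ch_k(v) is the kth child vertex of v, and f is the sigmoidal activation. The feedback terms o_j^{(l)}(ch_k(v)) are taken from the control center and are zero for leaf vertices.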

Let us use an example for illustration. Consider the processing of pattern g of Fig. 1 by a structured network consisting of three input nodes, two layers of hidden nodes with two nodes in each layer, and one output node, as shown in Fig. 3. Note that output neurons are represented by solid rectangles, hidden neurons are represented by solid circles, and feedback values are represented by hollow shapes in this figure. We assume that the valence of the data set being considered is 2, so the number of recursive links of each layer is determined by the valence and the number of neurons whose outputs are fed back. At time step 0, the first vertex in the sorted data set, a leaf vertex, is fed to the neural network. Since it has no child vertices, all the feedback values are zero. Its neuron outputs are computed and stored in the control center. At time step 1, the second vertex is fed to the neural network.


Its neuron outputs are similarly computed and kept in the control center; again, this vertex has no child vertices, so the feedback values for it are all zero. At time step 2, the third vertex is fed to the neural network. Since it has two child vertices, namely the two vertices processed at time steps 0 and 1, the neuron outputs of the previous layer and the previously computed neuron outputs for those two children have to be considered. For example, to compute the output of a neuron in the second layer, we have to consider the values passed from the input nodes and the previously computed second-layer outputs for the two children accommodated by the control center. Similarly, to compute the third-layer outputs, we have to consider the second-layer outputs together with the feedback values of the third-layer neurons for the two children. The computed values are then retained in the control center. At time step 3, the fourth vertex is fed to the neural network. This time the neuron outputs are computed with feedback values being zero, and the computed values are kept in the control center. Finally, at time step 4, the super-source vertex, which has two child vertices, is fed to the neural network. To compute its neuron output values, we have to consider the neuron outputs of the previous layer and the previously computed neuron outputs for its two children, as shown in Fig. 3. Because this vertex is the super-source vertex of g, the output of the output neuron represents the response of this neural network to the structured pattern g.
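To make the time-step walkthrough above concrete, the following Python sketch processes a sorted data set bottom-up, caching each vertex's layer outputs in a dictionary that plays the role of the control center. All names (forward_pattern, the weight arrays W and Wr) are hypothetical illustrations under the assumptions of (1) as reconstructed earlier, not the authors' code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pattern(sorted_vertices, layers, valence):
    """sorted_vertices: list of (name, feature_vector, children) listed bottom-up,
    where children is a list of child vertex names (or None entries).
    layers: list of dicts with 'W' (standard weights, shape n_l x n_{l-1}) and
    'Wr' (recursive weights, shape valence x n_l x n_l).
    Returns the last-layer output for the super-source vertex (listed last)."""
    control_center = {}                      # vertex name -> list of per-layer output vectors
    for name, features, children in sorted_vertices:
        prev = np.asarray(features, dtype=float)   # input layer just passes the attributes
        outputs = [prev]
        for l, layer in enumerate(layers, start=1):
            net = layer['W'] @ prev                # standard links from the previous layer
            for k in range(valence):               # one group of recursive links per child slot
                if k < len(children) and children[k] is not None:
                    child_out = control_center[children[k]][l]   # previously computed output
                    net += layer['Wr'][k] @ child_out
                # missing children contribute zero feedback
            prev = sigmoid(net)
            outputs.append(prev)
        control_center[name] = outputs             # store for later use as feedback values
    return outputs[-1]                             # response for the super-source vertex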

III. ENTROPY MEASURES

Entropy was successfully used for determining the architecture of conventional neural networks [28], [4], [10], [21], [22]. Intuitively, each neuron performs the function of a hyperplane which divides the training data set into subsets. Once we obtain a hyperplane, a corresponding neuron can be created. This idea can also be applied to determine the architecture of structured neural networks. As mentioned in Section II, the output of each neuron, and hence the function value of its associated hyperplane, depends on the neuron's inputs as well as the standard and recursive weights. We describe here the entropy measures for deriving the hyperplanes of hidden and output neurons. The learning rules based on these entropy measures will be developed in the next section.

Suppose we have a data set G of structured patterns belonging to several classes, and let S be a subset of G. A single hyperplane, applied to the representation of the super-source vertex of each pattern, separates S into two disjoint subsets, one on each side of the hyperplane. Assume that S is divided by some hyperplanes into a partition of several sets. The class entropy of a set in the partition is defined as

(2)

where each term involves the proportion of patterns in the set that belong to the corresponding class. The information entropy of the partition is defined as

(3)

where the class entropy of each set is weighted by the number of elements in that set relative to the whole.

Suppose another hyperplane is added to refine the partition into a new partition. For convenience, we write the new partition in a way that makes explicit the effect of the added hyperplane on the old one. The entropy gain obtained by the addition of the hyperplane is defined as

(4)

i.e., the information entropy of the old partition minus that of the refined partition. The hyperplane for which the entropy gain is maximal, or equivalently for which the information entropy of the refined partition is minimal, among all the candidate hyperplanes is selected as the best hyperplane to refine the partition.
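The bodies of (2)-(4) did not survive extraction. Written in standard form, and with the symbols S, S_i, c_j, p_{ij}, \pi, and H assumed rather than copied from the original, the definitions described in the text are

E(S_i) = -\sum_{j=1}^{m} p_{ij} \log_2 p_{ij}        (class entropy of S_i, with p_{ij} the proportion of patterns in S_i belonging to class c_j)

E(\pi) = \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, E(S_i)    (information entropy of the partition \pi = \{S_1, \dots, S_k\} of S)

Gain(H) = E(\pi) - E(\pi_H)                            (entropy gain of adding hyperplane H, with \pi_H the refined partition)

Under these forms, maximizing the gain of a candidate hyperplane is the same as minimizing the entropy of the partition it induces, which matches the selection rule stated above.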

A. Entropy Function for Hidden Neurons

Suppose we want to refine the current partition by adding one more hidden neuron with its associated hyperplane. Let each set in the partition be divided by the new hyperplane into the following two subsets

Obviously, the two subsets of each set are disjoint and together make up that set. Then from (3) we have

(5)

where the weights are the relative set sizes. Let

where the counts are taken with respect to the class of each pattern. Then the following relations hold:

Therefore, from (5), we have [22]

(6)

286 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 2, MARCH 2004

B. Entropy Function for Output Neurons

Given a data set, we would like to test if a certain class can be linearly separated from the other classes by a single hyperplane. If so, an output neuron with this hyperplane will be created for that class. Let a hyperplane divide the data set into the following two sets:

Obviously, the two sets are disjoint and together make up the data set. We have

(7)

where

and the second probability is that of patterns in a set that do not belong to the class under consideration. Let

for which the following relations hold:

Therefore, from (7), we have [22]

(8)

Note that if the class is separated from the other classes by the hyperplane, then by (7), (8) is minimized and has a value of zero.

IV. GENERALIZED DELTA RULES

A good heuristic for choosing the best hyperplane each time, keeping the others unchanged, is to minimize the information entropy of the resulting partition. Once the hyperplane is found, a corresponding hidden or output neuron can then be created. However, finding such a hyperplane with an exhaustive search is apparently impossible since the search space is infinite. Therefore, generalized delta rules are developed to guide the finding of such hyperplanes. Since both the vertices and the relationships among them contribute to the neuron outputs of a structured network, we have to perform the delta rules recursively from the super-source vertex to all leaf vertices when the network is trained with a structured pattern.

A. Architecture of Constructed Networks

In a network proposed by [32], hidden layers are placed between the input and output layers. Each neuron of a layer is connected to the neurons of the previous layer through standard links and to feedback values through recursive links. Given the valence of the data set and the number of neurons in a layer, each neuron of that layer is connected to the input and hidden neurons of the previous layer via standard links and to the corresponding feedback values via recursive links. The recursive links, arranged in one group per child position, pass forward the previously computed outputs, from the control center, of the neurons for the child vertices of the current vertex being considered.

The architecture of our constructed networks is a little bit different from that of [32]. First, we do not place all the output neurons in the last layer. Since we check for output neurons in each layer, they may appear in any layer except the input layer. Second, each noninput neuron is connected to the input neurons and all the hidden neurons of lower layers through standard links due to the consideration of better efficiency [10], [22]. Third, the recursive links connected to a hidden neuron are restricted. The feedback values provided to a hidden neuron in a layer are those obtained from itself and the previously created neurons in this layer. The reason for this restriction is that neurons are created one by one: when a hidden neuron is being created, only the previously created hidden neurons and itself are available. Thus each hidden neuron receives standard links from the input neurons and all the hidden neurons of lower layers, and recursive links from the feedback values of itself and the previously created neurons of its own layer, as shown in Fig. 4(a). Finally, each output neuron only receives feedback values which are its own previously computed outputs for the child vertices. Connections to output neurons are shown in Fig. 4(b).

B. Delta Rules for Hidden Neurons

Let a data set be already divided into a partition by the hyperplanes of the hidden neurons created so far in the current layer. Suppose we want to add one more hidden neuron with a hyperplane such that the information entropy of the refined partition is minimized.


Fig. 4. Connections for (a) hidden neurons and (b) output neurons.

Fig. 5. Deriving weights for a hidden neuron.

Note that the hyperplane function of the new hidden neuron, referring to Fig. 5, takes the following form:

(9)

where the first sum runs over the input neurons and all the hidden neurons in lower layers (with the input vector to the network providing the first-layer outputs), the second sum runs over the child vertices of the current vertex up to the valence of the data set, and the neuron output is the sigmoid function of the following form:

(10)
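The body of (10) is not present in this copy; the text identifies it as the sigmoid, presumably the standard logistic form

f(net) = \frac{1}{1 + e^{-net}}

whose derivative f'(net) = f(net)\,(1 - f(net)) is presumably what the chain-rule steps below rely on when they refer back to (10).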

For convenience, we use a shorthand for the hyperplane function in the following discussion. Such a hyperplane can be obtained by adjusting the standard and recursive weights in the following manner:

(11)

(12)

where one argument denotes the value at the present time, the other denotes the value at the previous time, and the learning rate is a constant. By the chain rule, we have

By taking partial differentiation on (6), we have

after a little manipulation [22]. Now we compute the required partial derivatives of the neuron output with respect to the standard and recursive weights. Since the function involved is not differentiable, we modify its definition to make it continuous, with the modified term defined as a smooth approximation of the original.

By applying the chain rule again, we have

where

by (10). Therefore


where the required derivatives are computed from the super-source vertex to all leaf vertices recursively for the pattern. From (9), we have

(13)

(14)

Note that the third term on the right-hand side of (13) is zero, since the quantity involved does not depend on the weight under consideration. Equation (14) can be expanded by recursively taking the corresponding derivatives until leaf vertices are encountered.

For a leaf vertex, we have

Similarly, we have

(15)

(16)

Again, the third term on the right-hand side of (15) is zero for the same reason. Equation (16) can be expanded until leaf vertices are encountered. For a leaf vertex, we have

Fig. 6. Deriving weights for an output neuron.

In summary, to find the hyperplane of the next hidden neuron when the data set is already divided into a partition by the hyperplanes of the existing hidden neurons in the layer, we use (11) and (12) to adjust the standard and recursive weights until the information entropy of the refined partition is minimized. This process can be described by the following algorithm.

procedure Find Hidden Neuron n

Initialize the standard and recursive weights;

while the information entropy of the refined partition is not minimized do

Update the weights by (11) and (12);

endwhile;

return the weights;

end Find Hidden Neuron n

As usual, the above procedure has the problem of being trapped in local minima [10]. A fast simulated annealing strategy, called the Cauchy machine [33], is used to help escape from local minima. In particular, we use a Monte Carlo [5], [18] based simulated annealing method for our purpose [22].
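As a rough illustration of how the entropy-guided weight search and a Cauchy-style annealing schedule can be combined, here is a minimal Python sketch. The entropy function, the gradient step, and the schedule constants are placeholders standing in for (11), (12), and the paper's actual settings, so everything named here (partition_entropy, gradient_step, T0) is an assumption, not the authors' implementation.

import math
import random

def anneal_hyperplane(weights, partition_entropy, gradient_step,
                      T0=50.0, T_min=0.1, max_iters=10_000):
    """Search for hyperplane weights that minimize the partition entropy.
    partition_entropy(weights) -> float; gradient_step(weights) -> new weights."""
    best = list(weights)
    best_E = partition_entropy(best)
    current, current_E = list(best), best_E
    for t in range(1, max_iters + 1):
        T = T0 / (1.0 + t)                      # Cauchy-machine-style fast cooling schedule
        if T < T_min:
            break
        candidate = gradient_step(current)      # delta-rule move, cf. (11) and (12)
        cand_E = partition_entropy(candidate)
        # Accept downhill moves always; accept uphill moves with a
        # temperature-dependent probability to escape local minima.
        if cand_E <= current_E or random.random() < math.exp((current_E - cand_E) / T):
            current, current_E = candidate, cand_E
            if current_E < best_E:
                best, best_E = list(current), current_E
    return best, best_E

A call such as anneal_hyperplane(w0, E, step) would return the lowest-entropy weights seen; the acceptance rule used here is the usual Metropolis form, which may differ from the particular Monte Carlo variant of [22].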

C. Delta Rules for Output Neurons

Given a data set, we want to find a hyperplane that separates a class from the other classes, such that the entropy in (8) is minimized and has a value of zero. If such a hyperplane can be found, then an output neuron is created correspondingly. For this case, the hyperplane function, referring to Fig. 6, is defined with the connections of an output neuron. We have the following delta rules for the standard and recursive weights:

(17)

(18)

where


where the two factors are obtained by taking partial differentiation on (8) appropriately [22]. We modify the definition as before to make it differentiable. Then

(19)

(20)

Again, the required derivatives are computed from the super-source vertex to all leaf vertices recursively for the given input pattern. Therefore

(21)

which can be expanded until leaf vertices are encountered. For a leaf vertex, we have

Similarly, we have

(22)

and for a leaf vertex,

In summary, to find the hyperplane of the output neuron for a class, we use (17) and (18) to adjust the weights until the entropy in (8) is zero. If it cannot be driven to zero, then the class cannot be separated from the other classes by a single hyperplane, and in this case we do not create an output neuron for the class. This process can be described by the following algorithm:

procedure Find Output Neuron

Initialize the standard and recursive weights;

while the entropy in (8) is not minimized do

Update the weights by (17) and (18);

endwhile;

if the entropy is zero

then return the weights;

else return fail;

end Find Output Neuron

Likewise, the Cauchy machine [33] is used to help the procedure escape from local minima.

D. An Illustration

Suppose we are working with a particular layer, and the current pattern being considered is g of Fig. 1. First we determine whether an output neuron is to be created. We show here the update for the weight associated with the first recursive link of the output neuron. By (22), we have


Next, consider the construction of the third hidden neuron of the layer. Let us compute the update for the weight associated with the standard link from the second neuron of a lower layer. By (14), we have

V. CONSTRUCTION OF STRUCTURED NETWORKS

Suppose we are given a set of structured training patterns with a number of classes, and let each vertex be represented by a vector of attributes. We would like to have one output neuron for each class in the constructed network. Each output neuron gives 1 for any input structured pattern of its own class, and gives 0 otherwise. At the beginning, the network consists of only the first layer, with one input node corresponding to each attribute. Then we build the second layer. We use procedure Find Output Neuron to test if any class can be separated from the other classes. For each success, we create one output neuron for the underlying class. Then we apply procedure Generate Hidden Layer shown below to create hidden nodes for this layer.

procedure Generate Hidden Layer

Initialize the partition to the whole training set and the neuron counter to 1;

while the information entropy of the partition is not zero do

Find Hidden Neuron;

create the next hidden node with the returned weights;

update the partition and increment the neuron counter;

endwhile;

end Generate Hidden Layer

Note that hidden nodes are generated by procedure Find Hidden Neuron until the information entropy of the partition is zero, i.e., all the data points in each set of the partition belong to the same class.

Fig. 7. Two-valence structured data set G: (a) class c1, (b) class c2, and (c) class c3.

Fig. 8. Constructed network for the example data set G.

The number of hidden nodes obtained is equal to the number of iterations for which procedure Find Hidden Neuron has been applied. Then, using the first layer and the hidden nodes of the second layer, we build the third layer. This process iterates until we have created one output neuron for each class.

We give a simple example to illustrate the whole construction process. Assume that we have a two-valence structured data set G with three classes c1, c2, and c3, as shown in Fig. 7. Note that three patterns belong to each class and each vertex is labeled with its vertex feature vector. Let each vertex feature vector be a three-dimensional binary string. At the beginning, the neural network contains only the input layer, the first layer, with three nodes. Then we start to build the second layer. First, we find that the patterns of one class can be separated from the patterns of the other classes by a single hyperplane, so we create an output neuron for that class. Then we compute hyperplanes for hidden neurons. We find a hyperplane which minimizes the information entropy and create a new hidden neuron for it. However, the entropy is not zero, so we need to find more hyperplanes. Another hyperplane is found and a corresponding new hidden neuron is created for it. Now the entropy becomes zero with the partition induced by these two hyperplanes, so the processing of the second layer terminates. Next, we build the third layer. We find that the patterns of a second class can be separated by one single hyperplane, so an output neuron is created for that class. Then we generate two more hidden neurons for the third layer. Finally, an output neuron is created for the remaining class in the fourth layer and we are done. The constructed network for data set G is shown in Fig. 8.
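The layer-by-layer construction just illustrated can be summarized in the following Python-style sketch. It is a paraphrase of procedures Find Output Neuron, Find Hidden Neuron, and Generate Hidden Layer; the helper names (try_separate_class, fit_hidden_hyperplane) and data shapes are invented stand-ins for the paper's entropy-driven searches.

def build_structured_network(train_set, classes, try_separate_class,
                             fit_hidden_hyperplane, max_layers=10):
    """Grow layers until every class has an output neuron.
    train_set: list of (pattern, label) pairs.
    try_separate_class(c, features) -> output-neuron weights or None (Find Output Neuron).
    fit_hidden_hyperplane(partition, features) -> (weights, refined_partition) (Find Hidden Neuron)."""
    network = {'hidden_layers': [], 'output_neurons': {}}
    features = 'input'                       # only input attributes are available at first
    remaining = set(classes)
    while remaining and len(network['hidden_layers']) < max_layers:
        # Try to create an output neuron for each still-unseparated class.
        for c in sorted(remaining):
            w = try_separate_class(c, features)
            if w is not None:
                network['output_neurons'][c] = w
                remaining.discard(c)
        if not remaining:
            break
        # Otherwise grow a hidden layer until its partition is class-pure (entropy zero).
        layer, partition = [], [list(train_set)]
        while not all(len({label for _, label in s}) <= 1 for s in partition):
            w, partition = fit_hidden_hyperplane(partition, features)
            layer.append(w)
        network['hidden_layers'].append(layer)
        features = ('input+hidden', len(network['hidden_layers']))  # lower layers now feed the next one
    return network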

VI. MODIFIED ALGORITHMS

Overfitting the training data is an important issue in neural network learning. It occurs when a neural network is over-specialized and cannot generalize well for other unknown patterns,


Fig. 9. (a) An output hyperplane for separating one class from the other classes; (b) adding a hyperplane in a layer of hidden neurons.

especially when the number of adjustable parameters is large compared to the available training data [30], [31], [36]. Our construction algorithms also encounter the risk of overfitting. Recall that in procedure Find Output Neuron of Section IV-C, we required that the entropy in (8) be zero in order to generate an output neuron. Also, in Section V, we demanded in procedure Generate Hidden Layer that hidden neurons be generated until the information entropy of the partition decreases to zero. These 100% strict requirements on training accuracy may result in big network sizes and aggravate the problem of overfitting. One way to alleviate overfitting is to loosen the requirements. That is, we do not require that these entropies be zero when deriving hidden or output neurons. Instead, we stop learning for an output neuron, or stop generating more hidden neurons for a hidden layer, when a specified training accuracy is achieved. Let us consider the learning of an output neuron first. Refer to Fig. 9(a). Suppose a hyperplane is being considered for separating a class from the other classes, and count the number of correctly recognized training patterns due to this hyperplane. A correctly recognized pattern is a pattern of the class that is correctly classified to the class, or a pattern that is not a member of the class and is not classified to it. For example, suppose the hyperplane in Fig. 9(a) intends to separate the class labeled with one symbol from the other two classes, and assume that the left side of the hyperplane is assigned to that class. Then the number of correctly recognized training patterns is 4 on the left side and 9 on the right side of the hyperplane. We accept the hyperplane if the recognition rate, the number of correctly recognized patterns divided by the total number of training patterns, is greater than or equal to a predefined threshold, even though the entropy is not zero. For example, the recognition rate induced by the hyperplane in Fig. 9(a) follows from the 4 + 9 = 13 correctly recognized patterns. Therefore, procedure Find Output Neuron is modified as follows.

procedure Find Output Neuron

Initialize the standard and recursive weights;

while the recognition rate is less than a predefined threshold do

Update the weights by (17) and (18);

if the weights change too little

then return fail;

endwhile;

return the weights;

end Find Output Neuron

Note that we added one more statement to check how the weight vector differs from its previous version. If the change is too small, we assume that the demanded recognition rate cannot be achieved and the procedure is terminated with a report of failure.

Next, we consider the construction of a hidden layer. Suppose we are adding a hyperplane to the hidden layer, as shown in Fig. 9(b). Let the training patterns be partitioned into sets due to this addition. A set belongs to a class if the majority of the patterns in the set belong to that class. For example, suppose a set contains 3 patterns of one class, 5 patterns of a second class, 9 patterns of a third class, and nothing else; then the set belongs to the third class. Now count the number of consistently recognized training patterns due to the addition of this hyperplane. For a set that belongs to a class, a consistently recognized pattern is a pattern that is a member of that class and is located in that set. For example, the numbers of consistently recognized patterns are 4, 3, 4, and 3, respectively, in the subsets of Fig. 9(b). We accept the hyperplane and stop learning more hidden nodes if the recognition rate, the number of consistently recognized patterns divided by the total number of training patterns, is greater than or equal to a predefined threshold, even though the information entropy of the partition is not zero. For example, the recognition rate induced by the added hyperplane in Fig. 9(b) follows from these counts. Similarly, procedure Generate Hidden Layer is modified as follows.

procedure Generate Hidden Layer

Initialize the partition to the whole training set and the neuron counter to 1;

while the consistency rate is less than a predefined threshold do

Find Hidden Neuron;

if the consistency rate changes too little due to the addition of the new node

then return fail;

create the new hidden node with the returned weights;

update the partition and increment the neuron counter;

endwhile;

end Generate Hidden Layer

Again, we added one more statement to check whether the demanded recognition rate can be achieved. However, instead of checking weight values, we check how the current value of the consistency rate differs from its previous value. The reason is that in procedure Find Output Neuron we deal with the weights of a single hyperplane, while in procedure Generate Hidden Layer we deal with the weights of different hyperplanes. If the change in the rate is too small, we assume that the demanded recognition rate cannot be achieved, and the procedure is terminated with a report of failure.
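The two stopping criteria above, the per-class recognition rate for an output hyperplane and the majority-class consistency rate for a hidden-layer partition, amount to simple counting, sketched below in Python with hypothetical data structures: each sample is a (pattern, label) pair, and side(pattern) in {0, 1} says which side of the hyperplane the pattern falls on.

from collections import Counter

def output_recognition_rate(samples, target_class, side, positive_side=0):
    """Fraction of patterns correctly handled by an output hyperplane:
    members of target_class on the positive side, non-members on the other side."""
    correct = sum(
        1 for pattern, label in samples
        if (label == target_class) == (side(pattern) == positive_side)
    )
    return correct / len(samples)

def hidden_consistency_rate(partition):
    """Fraction of patterns lying in a subset whose majority class they share.
    partition: list of subsets, each a list of (pattern, label) pairs."""
    total = sum(len(s) for s in partition)
    consistent = 0
    for subset in partition:
        if not subset:
            continue
        majority_class, count = Counter(label for _, label in subset).most_common(1)[0]
        consistent += count          # majority members are the consistently recognized ones
    return consistent / total

With the Fig. 9(b) counts of 4, 3, 4, and 3 majority members, hidden_consistency_rate would return their sum divided by the total number of training patterns, matching the rate used to decide when to stop adding hidden nodes.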


VII. EXPERIMENTAL RESULTS

In this section, we present results of six experiments to demonstrate the effectiveness of our method. We apply our method to the construction of neural networks with different types of training patterns. We also compare our method with other types of networks, including standard multilayer perceptrons (S-MLP) [26], recurrent multilayer perceptrons (R-MLP) [29], [35], standard CCNN (S-CCNN) networks [12], recurrent CCNN (R-CCNN) networks [13], and S&S CCNN networks [3]. The learning rate of all the methods is set to be 0.1. Like our method, all the other methods run with simulated annealing. The initial and end temperatures are set to be 50 and 0.1, respectively. For S-CCNN, R-CCNN, and S&S CCNN, the threshold percentage changes for the error function and correlation function are both set to be 0.001, except for Experiment 1 in which they are set to be 0.0001. In an S&S CCNN network, each hidden neuron is connected to the input neurons and all the previously created hidden neurons. Besides, we use 4, 8, and 16 candidates for generating networks for S-CCNN, S&S CCNN, and R-CCNN, respectively. Also, the architecture of S-MLP and R-MLP networks is determined by the trial-and-error method. For each case, we explore five alternatives and select the one with the best generalization performance for comparison.

The first experiment concerns a set of unstructured data. Each pattern is a collection of attribute values. Experiment 2 deals with the recognition of spoken English words, with each pattern represented as a set of sequential data. Experiment 3 concerns the recognition of handwritten Chinese characters which are represented as structured patterns. Experiment 4 deals with the classification of molecular structural formulas. Experiment 5 investigates the balance between the computation load and performance improvements caused by simulated annealing. Finally, the last experiment concerns the effects of noise on the recognition of handwritten Chinese characters.

A. Experiment 1

Thyroid gland classification is a practical real-world problem, taken from the UCI Repository of Machine Learning Databases, P. M. Murphy and D. W. Aha, Department of Information and Computer Science, University of California, Irvine, CA. There are 215 instances in total in this data set, each belonging to one of three classes: nonthyroidism, hypothyroidism, and hyperthyroidism. Among the 215 instances, 150 instances belong to nonthyroidism, 35 to hypothyroidism, and 30 to hyperthyroidism. Each instance has five attributes, e.g., anamnesis, scan, etc. Since the data are unstructured, each instance contains only one vertex and the dimension of the vertex feature vector is five. In this experiment, the method of N-fold cross-validation [6] is adopted. All instances of the data set are randomly divided into five groups of equal size. That is, each group contains 30 instances of nonthyroidism, 7 of hypothyroidism, and 6 of hyperthyroidism. Each time, four groups are used as training examples for generating/training a network and the other one is used as testing examples for checking the classification capability of the constructed network. Therefore,

TABLE I
RESULTS FOR THE THYROID GLAND PROBLEM WITH FIVE ATTRIBUTES, THREE CLASSES, AND 215 EXAMPLES

five tests are done. The result obtained as the average of the five tests is listed in Table I.

In Table I, the column "Stop Rate" indicates the lower bound of the recognition rate for training patterns for a training process to terminate. The column "No of Neurons" shows the total number of neurons, including input, hidden, and output neurons, in the obtained network, listed layer by layer from the first layer to the last layer. The column "No of Conn" indicates the total number of connections between nodes in the obtained network. The columns "Train Rate" and "Test Rate" indicate the rate of correct classification for training and testing examples, respectively. Finally, the column "Train Time(sec)" indicates the time in seconds for training a network, obtained by running on a PC with a PIII 800 CPU and 256 MB of memory.

Note that output nodes are not necessarily placed at the last layer of a network generated by our method. Therefore, the number of nodes at the last layer of our networks is different from that of S-MLP and S-CCNN networks. Also, if a training process successfully converges, one should have "Train Rate" greater than or equal to "Stop Rate." For the case of "Stop Rate" set to 100%, S-MLP and S-CCNN cannot converge and a 100% recognition rate for training patterns cannot be achieved. All the networks obtained are comparable in size. However, the networks obtained by our method learn faster and have a higher generalization accuracy than S-MLP and S-CCNN networks. Besides, the networks obtained with a lower "Stop Rate" learn faster and have a smaller size, but generalize less accurately than the networks obtained with a higher "Stop Rate."

B. Experiment 2

The goal of this experiment is to classify the following 20 spoken English words: "erase," "enter," "go," "help," "no," "rubout," "repeat," "start," "stop," "yes," "one," "two," "three," "four," "five," "six," "seven," "eight," "nine," and "zero." The rate of sampling the speech signals is 11 025 Hz, and the period of each word is divided into fixed-length frames each of which consists of 256 samples. Each frame is represented


TABLE II
RESULTS FOR SPOKEN ENGLISH WORDS WITH 10 ATTRIBUTES, 20 CLASSES, AND 280 EXAMPLES

by ten LPC (linear predictive coding) coefficients [2], to be described below. In this way, each spoken word is transformed to a list structure in which a vertex denotes a frame and the ten LPC parameters of each frame become the vertex feature vector of the corresponding vertex. Note that the first frame is the super-source vertex, the second frame is the child of the super-source vertex, etc., and the last frame is the leaf vertex of a list structure. The average number of frames for each word is 16. That is, the structure of each word is a 16-level graph.

Let the LPC coefficients of a frame be given, and denote the speech signal at each sampling point accordingly. The LPC method uses the previously sampled signals to predict the currently sampled signal as follows:

(23)

where the prediction is compared against the actual signal. The coefficient values for each frame are chosen to minimize the following error:

(24)

Various methods, such as the autocorrelation method [25], the covariance method [2], and the lattice method [24], have been proposed to minimize (24). We use the autocorrelation method to obtain the LPC coefficients for each frame.
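Equations (23) and (24) are blank in this copy; the standard pth-order LPC predictor and its squared error, which the surrounding description points to, read (with the symbols a_k, s(n), \hat{s}(n), and p assumed)

\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)        (23)

E = \sum_{n} \big( s(n) - \hat{s}(n) \big)^2     (24)

Here p = 10 for the ten coefficients per frame, and minimizing E over the a_k by the autocorrelation method, typically via the Levinson-Durbin recursion, yields the coefficients that form each vertex feature vector.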

The data set for this experiment contains 280 instances, with each class (word) containing 14 instances taken from different persons. All instances of the data set are randomly divided into seven groups of equal size. Each group contains two instances of each class. Each time, six groups are used as training examples for generating/training a network and the other one is used as test examples for checking the classification capability of the constructed network. Therefore, seven tests are done. The results obtained are listed in Table II. Our network has 69 neurons in total, which includes 10 input neurons, 20 output neurons, and 39 hidden neurons. Both R-MLP and R-CCNN networks learn much slower than our network. Note that for an R-MLP network, all the weights have to be updated each time an example is learned. For a network of ours or R-CCNN, only the weights of the underlying neuron have to be updated. However, one has to try all the candidates of each neuron for an R-CCNN network in order to select the best choice for the neuron. From the table, we can see that our network has the highest generalization accuracy and the R-CCNN network is the biggest in size.

Fig. 10. (a) Example Chinese character and (b) five strokes represented by straight line segments.

If 100% training accuracy is demanded, our method can still generate a network in 67 044 seconds, with a test accuracy of 86%. However, neither R-MLP nor R-CCNN can generate networks to achieve the desired training accuracy for this case.

C. Experiment 3

We apply our method to construct networks for learning and recognizing handwritten Chinese characters which are represented as directed graphs. Handwritten Chinese characters are very complex and various approaches have been proposed to process them. Structural approaches take advantage of the structural characteristics of Chinese characters and have attracted a lot of attention [14], [15], [8]. In these approaches, Chinese characters are converted into structured patterns, and recognition of Chinese characters is done by using various similarity measures between the input and stored structured patterns.

Strokes are basic structural units in Chinese characters, and have been proved to be useful for Chinese character recognition [9], [23]. Here we describe a very simple method to convert handwritten Chinese characters into structured patterns consisting of strokes and their relationships. We focus on how structured networks are constructed from the obtained structured patterns, rather than the efficiency of the obtained structured patterns for Chinese character recognition. We assume that each character is normalized, thinned, and oriented appropriately [15]. Given a character, it is scanned and divided into strokes. If a stroke is close to being straight, it is represented by a straight line segment connecting its two end points. Otherwise, it is segmented into several pieces and each piece is represented by one straight line segment. For example, the Chinese character

shown in Fig. 10(a) can be regarded as the composition of five strokes shown as dotted lines in Fig. 10(b).

Next, we give each straight line a direction. A direction can be one of eight categories: EW (east-west), ENE (east-northeast), NE (northeast), NNE (north-northeast), N (north), NNW (north-northwest), NW (northwest), and WNW (west-northwest), as shown in Fig. 11. A straight line segment is assigned the direction that best matches it.


Fig. 11. Directions and their codes.

Each direction is coded with four bits. For example, the bit string of direction N is 0000, direction EW is 1111, etc. Note that the number of bits that differ between two bit strings can be used as a measure of dissimilarity of the corresponding two line segments. The more bits differ between the bit strings of two line segments, the less similar these two line segments are. For example, EW is close to ENE and the number of differing bits between them is 1. However, EW differs from N by four bits, indicating that EW is quite different from N.

Now we transform a character into a graph structure by scanning the straight line segments from top to bottom and from left to right. The obtained straight line segments become the vertices and the direction code of each line segment becomes the vertex feature vector of the corresponding vertex. If a graph has more than one source vertex, we add a virtual node, with code 0110, as the super-source vertex and let it be the parent of all the source vertices. For example, the procedure of obtaining the graph structure for Fig. 10(b) is shown in Fig. 12. Note that the graph structure contains five vertices and its valence is three. Finally, the graph structure is converted to the sorted data set and is ready for use.
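The 4-bit direction coding and its Hamming-distance reading of dissimilarity can be sketched as follows in Python; apart from N = 0000, EW = 1111, and the virtual super-source code 0110 mentioned in the text, the particular code assignments are guesses for illustration only.

# Direction codes: N = 0000 and EW = 1111 are given in the text; the codes for the
# remaining six directions (and their Gray-code-like ordering) are assumed here.
DIRECTION_CODES = {
    'N': 0b0000, 'NNE': 0b0001, 'NE': 0b0011, 'ENE': 0b0111,
    'EW': 0b1111, 'WNW': 0b1110, 'NW': 0b1100, 'NNW': 0b1000,
}
VIRTUAL_SUPER_SOURCE = 0b0110   # code assigned to the added virtual node

def dissimilarity(code_a, code_b):
    """Number of differing bits between two 4-bit direction codes."""
    return bin(code_a ^ code_b).count('1')

# EW vs. ENE differ in one bit; EW vs. N differ in four bits, as stated in the text.
assert dissimilarity(DIRECTION_CODES['EW'], DIRECTION_CODES['ENE']) == 1
assert dissimilarity(DIRECTION_CODES['EW'], DIRECTION_CODES['N']) == 4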

We apply our construction algorithm to the recognition of seven similar Chinese characters

Fig. 12. Obtaining graph structure for Fig. 10(b).

A total of 140 instances, with 20 instances for each character, were collected for this experiment. Each instance is transformed to a graph structure and corresponding sorted data set by the procedure described above. Some examples and their structured patterns are shown in Fig. 13. Note that these characters are very similar in shape, and that is why separating them apart is not a simple task. All instances of the data set are randomly divided into 10 groups of equal size. Each group contains two instances of each class. Each time, nine groups are used as training examples for generating/training a network and the other one is used as test examples for checking the classification capability of the constructed network. Therefore, ten tests are done. The results obtained as the average of the ten tests are listed in Table III. Obviously, our networks tend to be smaller and learn faster than R-MLP and S&S CCNN networks, and keep an equally good generalization accuracy.

D. Experiment 4

We apply our method to the classification of aliphatics based on the structure of molecular structural formulas. In this case, we would like to classify 48 instances into one of three classes: alkanes, alkenes, and alkynes. There are 16 instances for each class [27] in the data set. A formula contains atoms which are connected by covalent bonds. The chemical structural formula of an instance of alkanes, called Isopentane, is shown in Fig. 14(a). To classify the instances, we adopt an arbitrary method to convert their formulas to acyclic graphs. We find the longest chain in a formula and the atom at the utmost left of the chain becomes the super-source. Then the other atoms of the formula can be added to the graph structure in a natural way. The converted graph for Isopentane is shown in Fig. 14(b). The atoms become the vertices, and the vertex feature vector of a vertex includes the atomic mass and the maximal covalent bond of the corresponding atom. Note that the valence of the data set is 3.

All instances of the data set are randomly divided into eight groups of equal size. Each group contains two instances of each class. Each time, seven groups are used as training examples for generating/training a network and the other one is used as test examples for checking the classification capability of the constructed network. Therefore, eight tests are done. The results obtained as the average of the eight tests are listed in Table IV. The networks obtained by our method have about the same size and generalization accuracy as S&S CCNN networks, but learn much faster. Note that R-MLP networks have difficulties with this application. Too much time is needed for training an R-MLP


Fig. 13. One set of example Chinese characters and their structured patterns.

TABLE III
RESULTS FOR HANDWRITTEN CHINESE CHARACTERS WITH FOUR ATTRIBUTES, SEVEN CLASSES, AND 140 EXAMPLES

network. As mentioned, all the weights have to be updated each time for R-MLP networks. For this case, the highest level of the

Fig. 14. Isopentane, an instance of alkanes: (a) chemical structural formula and (b) converted graph structure.

converted graphs is 11 which, together with the valence being 3, leads to a huge amount of recursive computation before an R-MLP network is trained.

E. Experiment 5

We mentioned earlier that simulated annealing is used to help the generalized delta rules escape from local minima. The Cauchy machine [33] used in our construction algorithms allows an iterative update to the weights of the hyperplane being derived


TABLE IV
RESULTS FOR ALIPHATICS WITH TWO ATTRIBUTES, THREE CLASSES, AND 48 EXAMPLES

TABLE VEFFECT OF TEMPERATURE DECREASING RATES IN SIMULATED ANNEALING

by decreasing the temperature until it is less than a specified minimum value. In our experiments, we adopt a temperature schedule of the following form:

(25)

where one quantity is a constant, another is increased by one in each iteration, and one is usually set to 1. The decreasing rate of the temperature can be controlled by the value of the constant: the smaller its value, the more slowly the temperature is decreased.

Local minima are less likely to trap the search if the temperature is decreased very slowly each time. However, this may take a lot of time in obtaining a hyperplane. On the other hand, if the temperature is decreased too fast, a neuron may be created faster, but more hidden layers and nodes may be generated, and such a big network may aggravate the risk of overfitting too. To see the balance between the computation load and the performance improvements caused by simulated annealing, we do an experiment on the data set of handwritten Chinese characters with "Stop Rate" set to 95%. The results are shown in Table V. Using large values of the rate constant tends to save training time. However, values that are too large increase training time again; the reason is that more and more neurons have to be generated and trained as the constant gets larger. Besides, we can see that overfitting occurs for the largest values.

F. Experiment 6

This experiment concerns the effects of noise on the recognition of handwritten Chinese characters. The noise is added randomly to the data set in two ways: we add 8% of noise by adding

TABLE VINETWORKS OBTAINED BY DIFFERENT METHODS FOR NOISY DATA

extra vertices into or deleting vertices from the graph structures, and we add another 5% of noise by increasing or decreasing the values of vertex feature vectors. The results are shown in Table VI. Obviously, our method works better than R-MLP and S&S CCNN networks. Our networks learn faster and generalize more accurately than R-MLP and S&S CCNN networks.

VIII. CONCLUSION

We have presented an entropy-based approach for automatically constructing neural networks consisting of generalized recursive neurons for classification of acyclic structured patterns. Given a classification problem, the architecture, i.e., the number of hidden layers and the number of nodes in each hidden layer, and all the values of the weights associated with the corresponding neural network are automatically determined. As a result, the burden of trial-and-error imposed on the user can be avoided. We have also demonstrated the effectiveness of our method by applying it to the construction of neural networks with unstructured and structured data.

Our proposed method encounters some difficulties. One is associated with the case when the valence of testing patterns is larger than that of training patterns [32]. A simple solution is to use a larger valence than necessary during the construction phase. However, this may slow down the construction process and increase the complexity of the obtained networks. Another difficulty is that the method is limited to application to acyclic structured patterns. A network will not provide a stable output if the trajectory of each cyclic graph does not converge to an equilibrium. When the norm of the weights in a network is sufficiently small, the network's trajectory can be shown to converge to a unique equilibrium for a given cyclic graph [1]. However, it is not guaranteed that our learning algorithms can learn such small weights for any given application. Therefore, our method can only be used to construct networks for classification of acyclic structured patterns.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their constructive comments and suggestions.


REFERENCES

[1] L. B. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinatorial environment," in Proc. IEEE Int. Conf. Neural Networks, New York, 1987, pp. 609-618.
[2] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am., vol. 50, pp. 637-665, 1971.
[3] A. M. Bianucci, A. Micheli, A. Sperduti, and A. Starita, "Application of cascade correlation networks for structures to chemistry," J. Appl. Intelligence, vol. 12, no. 1/2, pp. 117-146, 2000.
[4] M. Bichsel and P. Seitz, "Minimum class entropy: A maximum information approach to layered networks," Neural Networks, vol. 2, pp. 133-141, 1989.
[5] K. Binder, Monte Carlo Methods in Statistical Physics. New York, NY: Springer, 1978.
[6] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[7] G. A. Carpenter and S. Grossberg, "The ART of adaptive pattern recognition by a self-organizing neural network," Computer, vol. 21, pp. 77-88, Mar. 1988.
[8] K. P. Chan and Y. S. Cheung, "Fuzzy-attribute graph with application to Chinese character recognition," IEEE Trans. Syst., Man, Cybern., vol. 22, pp. 153-160, 1992.
[9] H. H. Chang and H. Yang, "Analysis of stroke structures of handwritten Chinese characters," IEEE Trans. Syst., Man, Cybern., vol. 29, pp. 47-61, Jan. 1999.
[10] K. J. Cios and N. Liu, "A machine learning method for generation of a neural network architecture: A continuous ID3 algorithm," IEEE Trans. Neural Networks, vol. 3, pp. 280-290, 1992.
[11] Y. L. Dae, M. K. Byung, and S. C. Hyung, "A self-organized RBF network combined with ART II," in Proc. IEEE Int. Joint Conf. Neural Networks, vol. 3, 1999, pp. 1963-1968.
[12] S. E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, vol. 2, pp. 524-532.
[13] S. E. Fahlman, "The Recurrent Cascade-Correlation Architecture," Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-91-100, 1991.
[14] K. S. Fu, Syntactic Methods in Pattern Recognition. New York, NY: Academic, 1974.
[15] K. S. Fu, Syntactic Pattern Recognition and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[16] R. M. Goodman, C. M. Higgins, and J. W. Miller, "Rule-based neural network for classification and probability estimation," Neural Computation, vol. 4, pp. 781-804, 1992.
[17] F. M. Ham and I. Kostanic, Principles of Neurocomputing for Science and Engineering. Singapore: McGraw-Hill International, 2000.
[18] W. Hastings, "Monte Carlo sampling methods using Markov chains and their application," Biometrika, vol. 57, pp. 97-109, 1970.
[19] S. Y. Kung and J. N. Hwang, "An algebraic projection analysis for optimal hidden units size and learning rate in back-propagation learning," in Proc. IEEE Int. Conf. Neural Networks, vol. 1, San Diego, CA, July 1988, pp. 363-370.
[20] S.-J. Lee and C.-L. Ho, "An ART-based construction of RBF networks," IEEE Trans. Neural Networks, vol. 13, pp. 1308-1321, Nov. 2002.
[21] S.-J. Lee and M.-T. Jone, "An extended procedure of constructing neural networks for supervised dichotomy," IEEE Trans. Syst., Man, Cybern. B, vol. 26, pp. 660-665, Aug. 1996.
[22] S.-J. Lee, M.-T. Jone, and H.-L. Tsai, "Constructing neural networks for multi-class discretization based on information entropy," IEEE Trans. Syst., Man, Cybern. B, vol. 29, pp. 445-453, June 1999.
[23] J.-W. Lin, S.-J. Lee, and H.-T. Yang, "A stroke-based neuro-fuzzy system for handwritten Chinese character recognition," Appl. Artificial Intelligence, vol. 15, no. 6, pp. 561-586, 2001.
[24] J. Makhoul, "Stable and efficient lattice methods for linear prediction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, no. 5, pp. 423-428, 1977.
[25] J. D. Markel and A. H. Gray, Jr., "On autocorrelation equations as applied to speech analysis," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 69-79, 1973.
[26] J. L. McClelland and D. E. Rumelhart, Parallel Distributed Processing (Two Volumes). Cambridge, MA: MIT Press, 1986.
[27] J. McMurry, Organic Chemistry, 5th ed. Pacific Grove, CA: Brooks/Cole, 2000.
[28] J. P. Nadal, "New algorithms for feedforward networks," in Neural Networks and Spin Glasses, J. P. Theumann and J. P. Koberle, Eds. Singapore: World Scientific, 1989, pp. 80-88.
[29] A. G. Parlos, K. T. Chong, and A. F. Atiya, "Application of the recurrent multilayer perceptron in modeling complex process dynamics," IEEE Trans. Neural Networks, vol. 5, pp. 255-266, Apr. 1994.
[30] W. S. Sarle, "Stopped training and other remedies for overfitting," in Proc. 27th Symp. Interface of Computing Science and Statistics, July 1995, pp. 352-360.
[31] M. Smith, Neural Networks for Statistical Modeling. Boston, MA: International Thomson Computer Press, 1996.
[32] A. Sperduti and A. Starita, "Supervised neural networks for classification of structures," IEEE Trans. Neural Networks, vol. 8, pp. 714-735, May 1997.
[33] H. Szu and R. Hartley, "Fast simulated annealing," Phys. Lett. A, vol. 122, pp. 157-162, 1987.
[34] G. Tontini and A. A. de Queiroz, "RBF FUZZY-ARTMAP: A new fuzzy neural network for robust on-line learning and identification of patterns," in Proc. IEEE Int. Conf. Systems, Man and Cybernetics, vol. 2, 1996, pp. 1364-1369.
[35] K. Tutschku, "Recurrent multilayer perceptrons for identification and control: The road to application," Univ. Würzburg, Germany, Research Report Series, 1995.
[36] A. Weigend, "On overfitting and the effective number of hidden units," in Proc. Connectionist Models Summer School, July 1994, pp. 335-342.

Hsien-Leing Tsai was born in Taoyuan, Taiwan, R.O.C., on December 26, 1969. He received the B.S. degree in computer science from private Feng Chia University, Taiwan, R.O.C., in 1993. He is currently pursuing the Ph.D. degree at the Department of Electrical Engineering, National Sun Yat-Sen University, Taiwan, R.O.C. His main research interests include artificial intelligence and pattern recognition.

Shie-Jue Lee (S'88–M'90) was born in Kin-Men, Taiwan, R.O.C., on August 15, 1955. He received the B.S.E.E. and M.S.E.E. degrees from National Taiwan University, Taiwan, R.O.C., in 1977 and 1979, respectively, and the Ph.D. degree from the Department of Computer Science, University of North Carolina, Chapel Hill, in 1990.

He joined the faculty of the Department of Electrical Engineering, National Sun Yat-Sen University, Taiwan, R.O.C., in 1983, where he became a Professor in 1994, served as the Acting Director and

Director of the Telecommunication Development and Research Center from 1997 to 2000, and was the Chairman of the Electrical Engineering Department from 2000 to 2003. He was the Director of the Southern Telecommunications Research Center, National Science Council, Taiwan, R.O.C., from 1998 to 1999. His research interests include machine intelligence, data mining, soft computing, multimedia communications, and chip design.

Dr. Lee is a Member of the Association for Automated Reasoning, the Institute of Information and Computing Machinery, the Chinese Fuzzy Systems Association, and the Taiwanese Association of Artificial Intelligence. He received the Excellent Teaching Award of National Sun Yat-Sen University and the Distinguished Teachers Award of the Ministry of Education, both in 1993. He was awarded the Outstanding M.S. Thesis Supervision by the Chinese Institute of Electrical Engineering in 1997. He also received the Distinguished Paper Award of the Computer Society of the Republic of China and the Distinguished Research Award of National Sun Yat-Sen University, both in 1998, and the best paper award of the 7th Conference on Artificial Intelligence and Applications, in 2002. He served as the Program Chair for the International Conference on Artificial Intelligence (TAAI-96), Kaohsiung, Taiwan, R.O.C., December 1996, the International Computer Symposium—Workshop on Artificial Intelligence, Tainan, Taiwan, R.O.C., December 1998, and the 6th Conference on Artificial Intelligence and Applications, Kaohsiung, Taiwan, R.O.C., November 2001.