
Neural Networks, Vol. 2, pp. 133-141, 1989 0893-6080/89 $3.00 + .00 Printed in the USA. All rights reserved. Copyright © 1989 Pergamon Press plc

ORIGINAL CONTRIBUTION

Minimum Class Entropy: A Maximum Information Approach to Layered Networks

MARTIN BICHSEL* AND PETER SEITZ

Paul Scherrer Institute (PSI) and *Swiss Federal Institute of Technology (ETH), Zurich, Switzerland

(Received 20 June 1988; revised and accepted 28 September 1988)

Abstract--Layered feedforward networks are viewed as multistage encoders. This view provides a link between neural networks and information theory and leads to a new measure for the performance of hidden units as well as output units. The measure, called conditional class entropy, not only allows existing networks to be judged but is also the basis of a new training algorithm with which an optimum number of neurons with optimum connecting weights can be found.

Keywords--Layered networks, Information theory, Minimum entropy, Performance measure, Teaching algorithms.

1. INTRODUCTION

Most applications of neural networks are classification problems in which an analog or digital input pattern should be transformed into a digital code word describing the class of this particular input pattern. It has been proved by Duda and Hart (1973) that any such classifier can be realized with a hard limiter feedforward network containing only two hidden layers. This is a very powerful statement but, as soon as one tries to implement a real neural network for a particular classification problem, one is confronted with two fundamental questions:

• What is the smallest possible number of units (neurons) in a particular hidden layer for best possible operation?

• Is the optimum number of hidden layers equal to the minimum of two (as stated above), are fewer layers possible, or should additional layers with fewer units per layer be used to get a lower total number of units and therefore a smaller computational cost?

The helpful comments of the reviewers of this paper are gratefully acknowledged, especially pointing out the work of Linsker, previously unknown to us. We would also like to thank one of the reviewers for his suggestion that increasing the phase space might make it easier to find a good minimum.

Requests for reprints should be sent to Martin Bichsel, Paul Scherrer Institute, c/o Laboratories RCA, Badenerstrasse 569, CH-8048 Zurich, Switzerland.

To answer these questions we must be able to judge the performance of the units in the hidden layers. In particular we want to measure how much a single unit or a group of units in a hidden layer contributes to the overall performance of a network. This aim can be achieved by interpreting hard limiter feedforward networks (for a nice illustration, see Lippmann (1987)) as multistage encoders within the framework of information theory. The well-developed information theory provides us with a natural measure of performance, the conditional entropy, which allows us not only to judge the overall performance of a network but also to judge the performance of any group of units in a particular hidden layer. This unconventional view of layered neural networks in the context of information theory is introduced in Section 2.

Linsker (1988) has demonstrated that Hebbian learning tends to conserve information from the input layer to further layers. The idea of treating neural networks in the context of information theory is not completely new. Gardner (1986) characterized a learning network as a conditional probability computer. As we shall point out, however, in most problems we are not interested in the complete information of the input patterns, as it was used for example by Gardner (1986) and Linsker (1988), but rather we want to separate this information from the information concerning the class of these patterns. Therefore, the conditional entropy with respect to the class of the input pattern is a much better performance measure than the conditional entropy with respect to the whole information content in the input patterns, and it follows that Hebbian learning is not, in general, optimum. In Section 3 it is shown how a new network is constructed by determining optimum weights with respect to the entropy measure. The numerical results of the example in Section 4 illustrate a successful application of our new approach, where a neural network solves a problem in image processing, classifying shift-invariant symmetry axes. Section 5 discusses general problems of teaching algorithms and particular difficulties with our new algorithm.

2. MEASURING THE RELEVANT INFORMATION IN A HARD-LIMITER FEEDFORWARD NET

In the following we consider layered feedforward nets of neurons with a hard-limiter activation function according to the McCulloch-Pitts model (1943). The output x_j of a single neuron is given by

x_j = \mathrm{sign}\left( \sum_i w_{ji} y_i - \Theta_j \right)    (1)

with appropriate summation, so that x_j denotes the output of a neuron in a hidden layer (or the final output layer) and y_i refers to the output of a neuron in a layer nearer to the input or to an input connection.

If we present a certain input pattern to such a network then the neurons in a particular hidden layer (or in the output layer) show a corresponding pattern x = (x_1, x_2, ..., x_n) of outputs which is determined by the particular set of weights. Since x_j can be in one of only two possible states, it is natural to interpret the patterns in a certain layer as a code word and the whole network as a multistage encoder. The weights between two layers determine how the code in one layer is transformed into the corresponding code in the next layer. With this interpretation, analysis of neural networks in the framework of information theory has become possible.

A natural measure for the effectiveness of a given code is its (statistical) information content, which, according to Shannon (1948), is called its "entropy". The entropy S of a code in a certain layer is

S = -\sum_k P(x_k) \log_2 P(x_k)    (2)

where P(x_k) denotes the probability that the pattern x_k appears in the layer under study. The entropy S measures how uncertain we are in telling what code word will appear in the particular layer if we do not know which input pattern is presented to the network. The smaller S is, the higher is our confidence that we can predict the response of the units in the layer and, at S = 0, perfect prediction is possible. The entropy is also the mean amount of information

obtained when looking at the code word caused by a certain input pattern. As an example, consider the case of only two units in a particular layer, where the four possible output patterns {(-1, -1), (-1, 1), (1, -1), (1, 1)} are equally likely. In this situation two bits of information are necessary to describe the response of the layer. Assuming now that the outputs (1, -1) and (-1, 1) are extremely unlikely and that the two remaining patterns appear with the same probability, it can easily be seen that, in this case, only one bit of information is gained.
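The entropy of eqn (2) can be computed directly from the pattern probabilities. The following short Python sketch (the function name and the probability lists are ours, not from the paper) reproduces the two-bit and one-bit values of the example above.

```python
import math

def entropy(probs):
    # Shannon entropy of a code (eqn 2), in bits; terms with P = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two units, all four output patterns equally likely: two bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# (1, -1) and (-1, 1) never occur, the remaining two patterns equally likely: one bit.
print(entropy([0.5, 0.0, 0.5, 0.0]))       # 1.0
```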

In order to apply the concept of entropy to the analysis of the performance of neural networks, care has to be taken because, for a given problem, there is relevant and irrelevant information. Most applications of neural networks are classification problems where the relevant information is the class index. This information must be separated from a wide range of irrelevant information, for example, intra-class variability (the different input patterns of the same class are not identical) or noise.

The classical XOR problem of Minsky and Papert (1969) is an illustrative example of a classification problem. There we want to assign the input patterns with similar bits {(0, 0), (1, 1)} to class 1 and the remaining two {(0, 1), (1, 0)} to class 2. Thus we only want to know if the two inputs are equal or not: their particular value is of no interest. Therefore we must try to separate the information describing the equality of the two inputs from the total information content of the input pattern.

How then can the information content of the set of code words be measured with respect to the given classification problem? This can be achieved by measuring how uncertain we are (in the mean sense) about the class of the input pattern if a particular code word is observed in a certain (hidden or output) layer. The solution, therefore, is to measure how much information is obtained if we know both the pattern in a certain layer and the class of the input which produces this observed pattern, instead of the pattern alone. Knowing the output pattern, eqn (2) can be used for the calculation of the uncertainty of the class to which the generating input belongs. Now, however, the probabilities P(x_k) have to be replaced by the conditional probabilities P_{ik} = P(class = i | pattern = x_k) that an input pattern from class i was presented to the network, given the observed pattern x_k in the considered layer. Therefore the uncertainty S_k in class assignment for a given pattern x_k is described by

S_k = -\sum_{i=1}^{N_{class}} P_{ik} \log_2 P_{ik}    (3)

The mean uncertainty about the class for all observable patterns x_k is now easily calculated by taking the mean over all possible patterns x_k weighted with


the probability of the occurrence of the pattern:

\bar{S} = \sum_{k=1}^{N} P(x_k) S_k = -\sum_{k=1}^{N} P(x_k) \sum_{i=1}^{N_{class}} P_{ik} \log_2 P_{ik}    (4)

In the following we refer to \bar{S} as "conditional class entropy" (CCE) by analogy with Shannon (1948). The CCE measures, in a particular layer, the amount of missing information about the class index of the inputs.
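A direct transcription of eqns (3) and (4) into code may help fix the definition. This is only a sketch; the function name and argument layout below are illustrative choices, not from the paper.

```python
import math

def conditional_class_entropy(p_pattern, p_class_given_pattern):
    # Conditional class entropy (eqn 4).
    # p_pattern[k]                -- P(x_k), probability of pattern x_k in the layer
    # p_class_given_pattern[k][i] -- P_ik, probability of class i given pattern x_k
    cce = 0.0
    for p_k, p_ik in zip(p_pattern, p_class_given_pattern):
        s_k = -sum(p * math.log2(p) for p in p_ik if p > 0)   # class uncertainty S_k (eqn 3)
        cce += p_k * s_k
    return cce

# A layer whose observed patterns determine the class uniquely has CCE = 0.
print(conditional_class_entropy([0.25, 0.5, 0.25], [[1, 0], [0, 1], [1, 0]]))   # 0.0
```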

This may be illustrated with the XOR problem and two inputs: we allow only connections between neighbouring layers and assume that any of the four input patterns {(0, 0), (0, 1), (1, 0), (1, 1)} is possible and equally probable. In this case the smallest possible network contains one hidden layer with two hidden units and one output neuron, see Lippmann (1987). There are 16 different successful response strategies for the two hidden units and infinitely many realizations in weight space. All realizations belong to one of two different types, which are illustrated in Figure 1. The straight lines of sign-change for the two units form a wedge in the 2-dimensional space of input patterns and enclose either the two points (0, 0) and (1, 1) or the two points (1, 0) and (0, 1). All these strategies have CCE zero. This means that we can uniquely determine the class from the output pattern of the hidden units.

The entropy of the output pattern can be calculated using eqn (2). As an example we consider the (successful) strategy which maps the four input patterns {(0, 0), (0, 1), (1, 0), (1, 1)} onto the output patterns {(-1, -1), (1, -1), (1, -1), (1, 1)} in the hidden layer. This means that two distinct input patterns are mapped onto the same output pattern (1, -1). For this strategy the output patterns {(-1, -1), (-1, 1), (1, -1), (1, 1)} have probabilities {1/4, 0, 1/2, 1/4} to occur. Thus we calculate an entropy:

S = -\tfrac{1}{4}\log_2\tfrac{1}{4} - 0 - \tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 1.5    (5)

where the input entropy was 2 bits. The same entropy is calculated for all other successful strategies. This illustrates that all successful strategies succeed in discarding 0.5 bits of information without losing any class information.

Of course, all of the preceding results apply not only to the complete patterns x_k in a layer but also to any subpattern in it. As an example we shall consider one of the two hidden units in the XOR problem. As Minsky and Papert (1969) pointed out, this unit has 14 possible strategies in responding to the four input patterns but, as can be seen in Figure 1, only for 8 of these strategies can a solution for the whole network be found. In all successful solutions, the line of sign-change separates one input pattern (e.g., (0, 0)) from the others. The same strategies were found by Sejnowski, Kienker, and Hinton (1986) for a network containing only one hidden unit but allowing (symmetric) connections between input and output units, that is, by introducing a direct path between input and output.

FIGURE 1. Strategies for a pair of hidden units solving the XOR problem. The two dashed lines show the lines of sign-change of the pair of units as a function of the (continuous) inputs. (a) and (b) show the two types of strategies where the dashed lines enclose either (1, 1) and (0, 0) or (1, 0) and (0, 1). We can choose which unit belongs to which dashed line (2 possibilities) and for each unit we can determine if the output changes from +1 to -1 or v.v. (4 possibilities). This leads in total to 16 different strategies.

Why these strategies are successful can be seen in their CCE. Suppose that all input patterns are equally likely and consider a single neuron of a pair of successful hidden units. To be specific, let the output only be -1 for the first input pattern (0, 0) and 1 for all other inputs. In this case the conditional probabilities are

P(class = 1 | output = -1) = 1
P(class = 2 | output = -1) = 0
P(class = 1 | output = 1) = 1/3
P(class = 2 | output = 1) = 2/3    (6)

The probabilities of the output patterns are:

P(x_1 = -1) = 1/4
P(x_1 = 1) = 3/4    (7)

Using eqn (4) we obtain the CCE

\bar{S} = -\tfrac{1}{4}(1 \log_2 1 + 0) - \tfrac{3}{4}\left(\tfrac{1}{3}\log_2\tfrac{1}{3} + \tfrac{2}{3}\log_2\tfrac{2}{3}\right) = 0.689    (8)

We arrive at the same result for all other successful strategies but a CCE \bar{S} = 1 is obtained for all unsuccessful strategies. Thus the successful strategies succeed in concentrating part of the class information onto the first unit: an achievement where the unsuccessful strategies fail.
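The number in eqn (8) is easy to verify numerically; a minimal check, with the values taken from eqns (6) and (7):

```python
import math

p_out = [1/4, 3/4]                   # P(x_1 = -1), P(x_1 = 1) from eqn (7)
p_class = [[1.0, 0.0], [1/3, 2/3]]   # P_ik from eqn (6), one row per output value
cce = -sum(p_out[k] * sum(p * math.log2(p) for p in p_class[k] if p > 0)
           for k in range(2))
print(round(cce, 3))                 # 0.689, as in eqn (8)
```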

The CCE of a pattern exhibits the weak properties of being positive and not larger than the class uncertainties of any of its subpatterns. Therefore, adding one more unit to a layer does not increase (but possibly decreases) the CCE of the combined patterns in this layer. Unfortunately, it is not, in general, possible to derive a better (i.e., lower) upper limit for the CCE of the combined patterns, knowing separately the CCE of the old train of units and of the new unit. The origin of this problem is the averaging over different numbers of patterns.

In practice no a priori knowledge about the probability distribution of the input patterns is normally available. Therefore the conditional probabilities P_{ik} cannot be determined directly; rather, they must be estimated using a finite number of input patterns. The probability P_k for the pattern x_k is estimated by

P_k = N(x_k) / N_{total}    (9)

where N(x_k) is the number of occurrences of the pattern x_k in the layer (hidden or output) for which the test set of input patterns has been generated. N_{total} is the total number of input patterns presented.

In the same way the joint probability P(class = i and pattern = x_k) can be estimated, describing the simultaneous occurrence of pattern x_k in the layer under examination and the input pattern belonging to class i:

P(class = i and pattern = x_k) = N(class = i and pattern = x_k) / N_{total}    (10)

N(class = i and pattern = x_k) is the number of occurrences of the pattern x_k in a layer for all input patterns belonging to class i.

The conditional probability P_{ik} = P(class = i | pattern = x_k) that an input pattern from class i was presented to the network, given the observation of the pattern x_k in the layer under examination, can now be determined using the frequency interpretation of conditional probabilities, Papoulis (1965):

P_{ik} = P(class = i and pattern = x_k) / P(x_k) = N(class = i and pattern = x_k) / N(x_k)    (11)

Note that these probabilities are discontinuous for the hard-limiter input-output relation in eqn (1). To make the probabilities continuous we could, however, replace eqn (1) by a probabilistic input-output relation with a smooth sigmoid probability law.
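Eqns (9)-(11) amount to simple counting over the presented inputs. The sketch below (the function and argument names are illustrative, not from the paper) estimates the CCE of a layer from its observed responses and the corresponding class labels.

```python
from collections import Counter
import math

def estimate_cce(layer_patterns, class_labels):
    # Estimate the conditional class entropy of a layer from a finite sample.
    # layer_patterns -- one output pattern x_k (e.g. a tuple of +1/-1) per presented input
    # class_labels   -- the class index of each presented input
    n_total = len(layer_patterns)
    n_pattern = Counter(layer_patterns)                   # N(x_k)
    n_joint = Counter(zip(class_labels, layer_patterns))  # N(class = i and pattern = x_k)
    cce = 0.0
    for (_cls, pat), n_ik in n_joint.items():
        p_k = n_pattern[pat] / n_total                    # eqn (9)
        p_ik = n_ik / n_pattern[pat]                      # eqn (11)
        cce -= p_k * p_ik * math.log2(p_ik)               # accumulate eqn (4)
    return cce

# XOR with the successful pair of hidden units of eqn (5): the class is determined
# uniquely by the hidden pattern, so the estimated CCE is zero.
patterns = [(-1, -1), (1, -1), (1, -1), (1, 1)]   # responses to (0,0), (0,1), (1,0), (1,1)
classes = [1, 2, 2, 1]
print(estimate_cce(patterns, classes))            # 0.0
```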

3. A MAXIMUM CLASS INFORMATION TEACHING ALGORITHM

In the previous section we have illustrated that entropy and CCE are natural and good measures for judging the performance of a multilayer network interpreted as a multistage encoder. With this measure we are not only able to judge the performance of existing networks but we are also able to construct a network from scratch. The basic idea is to concentrate all (or almost all) class information on as few neurons (units) as possible. In this way the relevant information is contained in the "first few" neurons. This process can be carried out iteratively: One starts with a single neuron in the first hidden layer and determines the optimum weights w_{ij} to that unit, that is, the weights that produce minimum CCE of the output of that unit with respect to the training set of input patterns. We proceed by adding new units to this layer (with appropriate weights) until the CCE is zero or has become sufficiently small. Each time a new unit is added we optimize only the weights to that unit to save computation time. The CCE is calculated for the complete set of possible output vectors x_k. This implies that for n hidden units in the layer under construction and m classes there is a maximum of 2^n different output patterns and we have to estimate (at most) m 2^n conditional probabilities P_{ik}. Fortunately in many real problems only a few different output patterns occur and the number of conditional probabilities we have to estimate is much smaller.
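The iterative construction just described might be organized as in the following sketch. Here optimize_unit (the per-unit weight search, for example by simulated annealing as discussed below) and estimate_cce (an estimate of eqn (4) by counting) are assumed to be supplied by the reader; all names, the stopping tolerance, and the unit limit are illustrative choices, not the paper's.

```python
import numpy as np

def layer_responses(inputs, weights, thresholds):
    # Hard-limiter responses (eqn 1) of the units built so far; one row per input pattern.
    return np.sign(np.asarray(inputs) @ np.asarray(weights).T - np.asarray(thresholds))

def build_hidden_layer(inputs, classes, optimize_unit, estimate_cce,
                       target_cce=1e-3, max_units=20):
    # Greedy construction: add one unit at a time, optimizing only the new unit's
    # weights and threshold, until the conditional class entropy is small enough.
    weights, thresholds = [], []
    while len(weights) < max_units:
        fixed = layer_responses(inputs, weights, thresholds) if weights else None
        w, theta = optimize_unit(inputs, classes, fixed)
        weights.append(w)
        thresholds.append(theta)
        outputs = layer_responses(inputs, weights, thresholds)
        patterns = [tuple(row) for row in outputs]
        if estimate_cce(patterns, classes) <= target_cce:
            break
    return weights, thresholds
```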

After having calculated the weights for one complete hidden layer, one could, in principle, proceed in the same way by adding a new hidden layer with one neuron and adding neurons until the CCE for the optimum weights has again become small enough. In all the problems we have investigated, however,


the necessary number of hidden units in the first layer turned out to be surprisingly small (usually less than ten) and also the number of different output patterns was small (less than one hundred). Therefore we have found it possible to proceed by brute force, that is, by adding another hidden layer with exactly one neuron per output pattern of the first layer and choosing the weights so that every unit in the second hidden layer responds to only one output pattern of the first hidden layer. In this way a neuron in the second hidden layer reacted only to input patterns of the same class. The output neurons had then only to be connected (using positive weights) to those units in the second hidden layer which responded to the appropriate class. Therefore the CCE of the output layer is the same as the CCE of the first hidden layer and it cannot decrease from layer to layer. Obviously this is a network with optimum performance in the link between first layer and output layer, though not necessarily with the minimum number of neurons.

The optimum connections (weights) to a particular neuron can be determined by solving a minimization problem in the (q + 1)-dimensional space of the q incoming weights and the threshold. If the weights are normalized, the minimization problem still has dimension q. An efficient way of finding good solutions has been proved to be Simulated Annealing, Kirkpatrick, Gelatt, and Vecchi (1983), Vanderbilt and Louie (1984), and Corana, Marchesi, Martini, and Ridella (1987). We have used the CCE as the "energy function" whose global minimum (or at least a good local minimum) has to be determined. A path in the q-dimensional space of variables (normalized weights to a neuron) was traversed where, at each step, only one coordinate was (randomly) changed. If the energy E (the CCE) was lower than at the previous location then this step was accepted, otherwise the step was only accepted with probability e^{-\Delta E / T}, where T denotes an annealing parameter (physically the "temperature" of the annealing process) which has to be lowered slowly.

Unfortunately the energy surface becomes very flat if one is far away from a solution as illustrated in Figure 2a. In particular, if the threshold becomes infinitely high, the same CCE results as if the neuron under study were not present at all. The problem is to restrict the search space in such a way as to be certain that the global minimum is within the search space but, at the same time, the parameter space should not be too large because of the vast increase in computation time.

We have found a solution to this problem by noting that the class uncertainty cannot increase (at most it can stay constant) if one neuron is added. This constitutes the upper limit S_{max} for the CCE which can be used to construct a new energy function, see Figure 2b, having approximately the same shape near the minimum but tending to infinity in the previously flat energy region:

E = S + \epsilon / (S_{max} - S)    (12)

with a suitably chosen \epsilon, small compared to the global CCE variation.

FIGURE 2. (a) Typical conditional class entropy (CCE) on a straight line through the N + 1 dimensional space of weights and threshold. Far away from the minimum the CCE S tends towards the upper boundary S_{max}. Thus if we choose the CCE as our energy function and make the random walk of simulated annealing at finite temperature then the random walk will escape from the interesting region after a finite time and will not return. (b) The energy function is changed to E = S + \epsilon (S_{max} - S)^{-1} with \epsilon small compared with the global CCE variation. Now the random walk is confined to the interesting region but the energy surface has the shape of the CCE function near a global minimum.
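A minimal sketch of the annealing search over one unit's parameters is given below. It assumes an energy(x) callable that returns E of eqn (12) for a candidate vector x of weights and threshold (built, for example, from the counting estimate of the CCE); the function name, the geometric cooling schedule, and the step size are our choices, not the paper's.

```python
import math
import random

def anneal_unit(energy, dim, n_steps=20000, t_start=1.0, t_end=0.01, step=0.3):
    # Simulated annealing over the weights and threshold of a single unit.
    # At each step one randomly chosen coordinate is perturbed; worse moves are
    # accepted with probability exp(-dE / T), and T is lowered slowly.
    x = [random.uniform(-1.0, 1.0) for _ in range(dim)]
    e = energy(x)
    best_x, best_e = list(x), e
    for i in range(n_steps):
        t = t_start * (t_end / t_start) ** (i / n_steps)   # geometric cooling schedule
        j = random.randrange(dim)
        old = x[j]
        x[j] += random.uniform(-step, step)
        e_new = energy(x)
        if e_new <= e or random.random() < math.exp(-(e_new - e) / t):
            e = e_new
            if e < best_e:
                best_x, best_e = list(x), e
        else:
            x[j] = old                                     # reject the move
    return best_x, best_e
```

In this setting S_{max} would be taken as the CCE of the layer before the new unit is added, so that the penalty term of eqn (12) keeps the random walk inside the useful region of parameter space.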

No optimization algorithm can guarantee to find the global minimum of a function in finite time. Simulated Annealing, however, offers a way to escape local minima. This property makes it a well-suited optimization procedure for complex functions of many variables like our CCE. Another advantage in our present formulation, using hard-limiter output functions of neurons, is its ability to work also with discontinuous functions.

Another property of Simulated Annealing could be advantageous. It is more probable that a local minimum with reasonable extent is found than a very narrow global minimum. This is very important because physical realizations contain inaccuracies (values digitized to a few bits only, electronic noise in analog implementations, or the failure of a neuron in the redundant network). A narrow minimum is much more affected by these inaccuracies and might lead to performance inferior to that obtained with the more tolerant choice of parameters indicated by an extended local minimum.

It is clear that Simulated Annealing is not equally well-suited for all conceivable problems. If a CCE function has a particularly simple shape, faster and more direct methods perform much better. Our experience indicates that in certain applications the CCE is well behaved in that all minima are close in value to the global minimum. In such cases we prefer to replace Simulated Annealing by a much faster steepest descent method (which requires the hard-limiter function in eqn (1) to be replaced by a sigmoid and differentiable function).

This just means that no general recipe for efficient CCE optimization can be given. In general, Simulated Annealing is a good first approach because it does not assume much about the CCE function under study.

4. RESULTS

A very rewarding field for the application of neural networks is image processing. Especially space-invariant pattern and object recognition are of practical interest. The problem presented here is the classification of mirror symmetry axes in binary images, where the axes of symmetry are allowed to translate. A similar problem has been studied by Sejnowski et al. (1986) and they consider it to be a difficult test case for a neural network.

The patterns whose classification we wanted to teach our network consisted of 8 x 8 arrays of randomly generated binary images which were symmetrical with respect to one of three symmetry axes as shown in Figure 3. This axis was either horizontal, vertical or diagonal and was aligned with pixel boundaries. Vertical and horizontal boundary conditions were such that the images exhibited toroidal topology. The resulting patterns can be interpreted as 8 x 8 subarrays of a periodic pattern with period 8 and being mirror-symmetrical along one of three symmetry axes at any position.
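For concreteness, one way such training patterns could be generated is sketched below. The paper does not give the construction explicitly, so the reflection maps (in particular the one used for the diagonal axis) and all names here are our assumptions.

```python
import numpy as np

def symmetric_pattern(axis, size=8, rng=np.random.default_rng()):
    # Random binary pattern on a torus, mirror-symmetric about a translated axis.
    # axis -- 0: vertical, 1: horizontal, 2: diagonal
    img = rng.integers(0, 2, size=(size, size))
    shift = int(rng.integers(0, size))                # random position of the axis
    i, j = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    if axis == 0:        # reflect columns about a vertical axis on pixel boundaries
        mi, mj = i, (2 * shift - 1 - j) % size
    elif axis == 1:      # reflect rows about a horizontal axis
        mi, mj = (2 * shift - 1 - i) % size, j
    else:                # one possible translated diagonal reflection on the torus
        mi, mj = (j + shift) % size, (i - shift) % size
    # Enforce the symmetry: give both members of every mirror pair the value of
    # one representative (the member with the smaller flat index).
    rep = np.minimum(i * size + j, mi * size + mj)
    return img.reshape(-1)[rep].reshape(size, size)
```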

A set of 8000 randomly generated patterns was used to train the network. We proceeded as described in Section 3 by adding new neurons and optimizing the neuron weights with respect to the CCE. The successful emergence of a network solving the shift-invariant symmetry classification problem can be illustrated by inspecting the two-dimensional pattern of weights to a neuron: Sejnowski et al. (1986) showed that, in order to recognize a particular symmetry axis, this pattern of weights should be antisymmetric with respect to the particular symmetry axis at any translated position of this axis. Our training algorithm actually generated such patterns, for which an example is shown in Figure 4a. The plot in Figure 4b shows how the CCE decreased as a function of the number of neurons in the network.

FIGURE 3. Examples of input patterns for the 8 x 8 mirror symmetry problem with periodicity such that the array has the topology of a torus. The light squares represent input values 1, the dark ones represent value 0. The symmetry axes lie vertical in the first, horizontal in the second, and diagonal in the third row. Because of the implicit periodicity every pattern shows two parallel symmetry axes.

FIGURE 4. (a) Representation of the weights found by the maximum information algorithm for the 8 x 8 mirror symmetry problem. Only the 8 x 8 array of weights from the input units to 6 hidden units is shown. Positive weights are represented as white squares and negative weights as black squares. The area of the squares is proportional to the magnitude of the corresponding weights. Every unit has its weights antisymmetric with respect to all positions of the symmetry axes of one single class, leading to an invariant output for all members of that class. (b) Plot of the conditional class entropy (CCE) S as a function of the number n of hidden units. The corresponding weights are shown in (a).

5. DISCUSSION

An unfortunate property of most neural networks is the large number of training patterns necessary for the teaching. Typically this number is large compared to the total number of weights between input connections and the first hidden layer. This implies that the training phase is very computation-intensive. Why this has to be the case can easily be illustrated with the simple classification problem where we have only two classes and one hidden unit:

If the input patterns are linearly independent, it is always possible to assign the input patterns to two arbitrary classes so that the two classes are separated by a hyperplane. Therefore we can obtain correct classification of all training patterns t_k = (t_{1,k}, t_{2,k}, ..., t_{n,k}) with the following property of the dot products t_k \cdot w, where w is a weight vector: t_k \cdot w > \Theta for all training patterns belonging to the first class and t_k \cdot w < \Theta for the patterns of the other class. This requirement can be written as a system of linear equations

t_k \cdot w = c_k    (13)

with any set of constants c_k for which the inequality sign in c_k \gtrless \Theta has been chosen properly for the particular classes to which the t_k belong.

The linear system of eqn (13) always has a solution if the number n of input connections is larger than the total number of training patterns, and we are still left with the choice of selecting particular constants c_k. This implies that, if the number of training patterns is not large enough, the network is not forced to exploit common features of the patterns: it is sufficient to "store" the patterns in the network.
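A quick numerical illustration of this point (the sizes, names, and the zero threshold are our choices): when the number of input connections exceeds the number of training patterns, eqn (13) can be satisfied exactly for arbitrary class targets by ordinary linear algebra, i.e. the patterns are simply stored.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_patterns = 64, 40                     # more input connections than patterns
T = rng.normal(size=(n_patterns, n_inputs))       # training patterns t_k as rows
c = np.where(rng.integers(0, 2, n_patterns) == 1, 1.0, -1.0)   # arbitrary class targets

# Minimum-norm solution of t_k . w = c_k (eqn 13); the threshold is taken as zero.
w, *_ = np.linalg.lstsq(T, c, rcond=None)
print(np.allclose(T @ w, c))                      # True: no common features are needed
```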

Only when the number of training patterns is very large is it necessary to make trade-offs between the values of the weights, based on common properties of the presented patterns. It is exactly here that our new concept of a multilayer network as a multistage encoder comes into play. The minimum entropy concept (maximum information content) takes full advantage of the regularities in the training patterns and ensures an optimum choice of weights. Thus the number of training patterns must always be larger than the total number of weights leading to the neurons in a layer if the weights are optimized simultaneously.

In principle, the construction process for neural networks outlined in Section 3 will generate weights which solve any given problem. Unfortunately no strong inequality relation exists between the CCE of a group of neurons and the class uncertainties of its subgroups. It is therefore possible that a much better solution can sometimes be found if the iterative algorithm of Section 3 is not used, but rather the weights to all neurons in a layer are optimized simultaneously.

This can be illustrated with the two-class problem shown in Figure 5a. The two inputs (z_1, z_2) can only take on values in the square with |z_1| < 1 and |z_2| < 1. Assume that an input pattern z belongs to the first class if z_1 \geq |z_2|, and to the second class in all other cases. A solution to this problem with only two hidden units can be found geometrically and that is shown in Figure 5b. This particular solution has the (perfect) CCE of zero. With only one hidden unit a good solution exists, and that is shown in Figure 5c. This solution has a (little) smaller CCE than any one of the two units in Figure 5b on their own. This good solution for one neuron, however, is of less use if we add more neurons and optimize the connections to each of these neurons separately: in this case the optimum solution will not be found with the iterative approach discussed before.

If all the weights are optimized simultaneously, it is conceivable that a good solution can be found more easily: by greatly increasing the available phase space ("dimension" of the problem) one offers a general optimizing algorithm more paths to a good CCE minimum and therefore could avoid critical points (e.g., a local minimum might display its true nature of being a saddle point in higher dimensions). This problem has not been investigated fully and only experience with many different problems will tell how important simultaneous optimization is in practical problems.

FIGURE 5. Failure of an iterative strategy. (a) Example of a two-class problem for a network with two input units. Only input patterns z with |z_1| < 1 and |z_2| < 1 occur. In this square every input pattern is assumed to be equally likely. An input pattern belongs to class I if z_1 \geq |z_2|, otherwise to class II. (b) Optimum solution for the problem in Figure 5(a) using two hidden units. The two dashed lines show the lines of sign-change of the output of the units. These lines divide the square of input patterns into 4 sectors with a different output pattern of the hidden unit pair in every sector. (c) The good solution with a single hidden unit; its line of sign-change differs from both lines of the solution in Figure 5(b). Thus, an iterative algorithm which keeps this unit and only optimizes the weights to the newly added hidden unit will not find the optimum solution for two hidden units. Only if the weights to all hidden units are optimized simultaneously will the optimum solution be found.

Of course to optimize all weights simultaneously

greatly increases the computational load, needs more training patterns, and could make simultaneous optimization unpractical with actual problems. For these reasons we prefer the iterative algorithm if it leads to good solutions. However, in analysing a particular type of problem where one would like to know the minimum number of parameters (weights, layers, neurons), only simultaneous optimization can produce the desired result.

In conclusion, we have found that our new approach to hard-limiter feedforward networks viewed as multistage encoders in the context of information theory has led to a deeper understanding of the performance of these networks. The concept of maximum information (minimum entropy) for classification problems introduces a measure with which the performance of networks can be judged. This is particularly useful in problems where no a priori knowledge or deeper insight is available to reduce the total number of neurons in the network.

We have shown that it is also possible to construct networks with optimum performance using an algorithm based on the minimum entropy principle. This principle might be a powerful tool for the general problem of solving any classification problem with an optimized neural network.


REFERENCES

Corana, A., Marchesi, M., Martini, C., & Ridella, S. (1987). Minimizing multimodal functions of continuous variables with the "simulated annealing" algorithm. ACM Transactions on Mathematical Software, 13, 262-280.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: John Wiley & Sons.

Gardner, S. B. (1986). Application of neural network algorithms and architectures to correlation/tracking and identification. In J. S. Denker (Ed.), Neural networks for computing (pp. 153- 157). New York: American Institute of Physics.

Kirkpatrick, S., Gelatt Jr., C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671-680.

Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105-117.

Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE Acoustics, Speech, and Signal Processing Magazine, 4, 4-22.

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.

Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge: MIT Press.

Sejnowski, T. J., Kienker, P. K., & Hinton, G. E. (1986). Learning symmetry groups with hidden units: Beyond the perceptron. Physica D, 22, 260-275.

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379-423, 623-656.

Papoulis, A. (1965). Probability, random variables, and stochastic processes. New York: McGraw-Hill.

Vanderbilt, D., & Louie, S. G. (1984). A Monte Carlo simulated annealing approach to optimization over continuous variables. Journal of Computational Physics, 56, 259-271.