
Pergamon

CONTRIBUTED ARTICLE

0893-6080(94)00080-8

Neural Networks, Vol. 8, No. 4, pp. 579-596, 1995 Copyright © 1995 Elsevier Science Ltd Printed in the USA. All rights reserved

0893-6080/95 $9.50 + .00

Properties of Block Feedback Neural Networks

SIMONE SANTINI AND ALBERTO DEL BIMBO

Dipartimento di Sistemi e Informatica

(Received 27 September 1993; accepted 19 July 1994)

Abstract--In this paper, we discuss some properties of Block Feedback Neural Networks (BFN). In the first part of the paper, we study network structures. We define formally what a structure is, and then show that the set $\mathcal{F}_n$ of n-layer BFN structures can be expressed as the direct sum of the set $\mathcal{A}_n$ of n-layer BFN architectures and the set $\mathcal{D}_n$ of n-layer BFN dimensions. Both $\mathcal{A}_n$ and $\mathcal{D}_n$ are shown to have the structure of a distributive lattice and to induce such a structure in $\mathcal{F}_n$. Moreover, we show that the computing capabilities of BFNs are monotonically nondecreasing with the elements of $\mathcal{A}_n$ ordered according to the lattice structure. In the second part we show that the increase in computing power allows the BFNs to be universal computers, having the same computing power as a Turing machine.

Keywords---Block feedback neural networks, Network structures, Network architectures, Computing capabilities.

1. INTRODUCTION

In this paper, we discuss some properties of the Block Feedback Networks (BFN) model, which we presented in Santini, Del Bimbo, and Jain (1993). BFNs are discrete-time dynamic multilayer perceptrons, built according to a recursive block-based scheme. We used these networks for nonlinear system identification and image sequence prediction (Del Bimbo, Landi, & Santini, 1992, 1993).

Being built according to a recursive scheme, the BFN model enjoys a number of unique properties, some of which we investigate in this paper. Most of these properties stem from the great structural flexibility allowed by the recursive scheme. Intuitively (the term will be formally defined in the paper), the structure of a network is a description of how the neurons are connected, regardless of the weight values and the details of how the neurons work.

Most current network models are based on a well-defined structure, and instances of the models are built by selecting the appropriate number of neurons in various parts of a network. For all these networks, researchers study how several properties depend on the number of neurons, the structure being fixed. A number of structures have been studied for generalization (Schwarze & Hertz, 1992), approximation capabilities (Blum & Li, 1991), and probability estimation from

Requests for reprints should be sent to Simone Santini, Department of Computer Science, UCSD, 9500 Gilman Drive, La Jolla, CA 92093-0116.

finite training sets (Moody, 1992; Murata, Yoshizawa, & Amari, 1992).

On the other hand, if one wants to study the possibilities of entirely arbitrary models, one is overwhelmed by the countless structural possibilities and cannot, usually, obtain general results.

BFNs are in quite a unique position. The model is general enough to allow the definition of a great number of structures and therefore to make structural issues worth studying. Yet, the systematic way in which networks are built suggests that general results can be inferred.

We argue that structural issues play a central role in neural network models and that the possibility of deriving general properties for BFNs should be regarded as one of their most desirable features.

This paper is divided into two parts. In the first part, we discuss the algebraic properties of the set of BFN structures. We begin by giving a formal definition of the terms "structure" and "architecture," and then we introduce a partial ordering in the set of architectures and in the orthogonal set of BFN dimensions. At the end of this section, we show that the partial ordering corresponds to an ordering of the computing capabilities of the BFNs.

This leaves the open question of what can be done with this model. In the second part we show that, if BFNs are seen as sequential computing devices, their computing power is the same as that of a Turing machine.

Before discussing the properties of BFNs, we include a brief reminder of the model, which also serves


FIGURE 1. Cascade connection (a) and feedback connection (b). N1 is the embedded block, that is, a feedback neural network whose input-output behavior is known but whose internal details are unknown. The weight matrix W in (a) and A, B in (b) contain the parameters to be modified when the connection is trained. We suppose we know the derivatives of the cost with respect to the output of the connection (i.e., with respect to the output of N1) and the derivatives of the outputs of N1 with respect to the inputs of N1. The learning algorithm makes it possible to compute the derivatives of the cost with respect to all the parameters in W, A, and B and the derivatives of the outputs of the overall block with respect to the inputs of the overall block.

as an introduction to the symbolism used throughout the paper.

2. INTRODUCTION TO BLOCK FEEDBACK NETWORKS

Block feedback networks have been extensively discussed in Santini, Del Bimbo, and Jain (1991) and Santini, Del Bimbo, and Jain (1995). The BFN model is based on a block diagram notation, inspired by that in Narendra and Parthasarathy (1990). A block can be a single layer or a whole network, which is regarded as a black box, whose input-output behavior can be specified without knowledge of the internal details.

Blocks can be connected by using a limited number of connections. In this paper, we consider two such connections: the cascade and the feedback connection.

The two connections are shown in Figure 1. We assume that the block N1 is a BFN built by repeated applications of the same elementary connections. The network we obtain with the connections can also be considered as a block, to be embedded in further connections in a recursive way.

Throughout this paper we will use a compact notation to represent BFNs: we denote single layers with lower-case Latin letters, the cascade of layer n and block N with $n \cdot N$, and the feedback connection of the block N and the layer q as $q\{N\}$. For instance, the network of Figure 2 is represented as:

$a \cdot b\{c\{d\}\} \cdot e$.  (1)
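As an aside (our own sketch, not part of the paper), the recursive block-based scheme can be mirrored by a tiny expression tree; the class and function names below are hypothetical illustrations of the compact notation just introduced.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Layer:
    name: str

@dataclass
class Cascade:            # the cascade connection of Figure 1a
    first: "Block"
    second: "Block"

@dataclass
class Feedback:           # the feedback connection of Figure 1b: layer q wrapped around block N
    layer: Layer
    inner: "Block"

Block = Union[Layer, Cascade, Feedback]

def show(b: Block) -> str:
    """Render a block in the paper's compact notation."""
    if isinstance(b, Layer):
        return b.name
    if isinstance(b, Cascade):
        return f"{show(b.first)}·{show(b.second)}"
    return f"{b.layer.name}{{{show(b.inner)}}}"

# The network of Figure 2, eqn (1):
net = Cascade(Layer("a"),
              Cascade(Feedback(Layer("b"), Feedback(Layer("c"), Layer("d"))),
                      Layer("e")))
print(show(net))   # a·b{c{d}}·e
```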

A layer is characterized by two weight matrices, the feedforward weight matrix and the feedback weight matrix, and by the array f of output functions. For the layer of Figure 1b, A is the feedforward weight matrix and B is the feedback weight matrix. The layer of Figure 1a has feedforward weight matrix W and no feedback weight matrix.

When the internal structure of a layer needs to be made explicit, we will use the symbol

where A is the feedforward weights matrix and B the feedback weights matrix. A feedforward layer will be indicated as

This symbol can be used in more complex network specifications. For instance, a complete specification of the network of Figure 2 is

(2)

FIGURE 2. Sample block feedback neural network.


The application of the network N to the input vector x yielding the output vector y will be indicated by

xNy. (3)

In Part II, we will need to build networks with a computing power at least equal to that of a given network. We can formalize this with the following definition.

DEFINITION 2.1. Given two networks $N_1$ and $N_2$, we say that $N_2$ covers $N_1$ ($N_1 \trianglelefteq N_2$) if, for any weights configuration of $N_1$, there exists a weights configuration of $N_2$ such that, for every x, $xN_1y \Rightarrow xN_2y$.

The covering relation is reflexive and transitive.

We introduce a further operation over networks: parallelization. It is based on the following lemma.

LEMMA 2.1. Let $N_1$ and $N_2$ be two layers, with feedforward weight matrices $A_1$ and $A_2$ and feedback weight matrices $B_1$ and $B_2$, such that

$[x \mid x_1]\, N_1\, y_1, \qquad [x \mid x_2]\, N_2\, y_2,$

with

$A_1 = [A_{11}\ A_{12}], \qquad A_2 = [A_{21}\ A_{22}],$

where $x \in \mathbb{R}^n$, $x_1 \in \mathbb{R}^m$, $x_2 \in \mathbb{R}^p$, $y_1 \in \mathbb{R}^k$, $y_2 \in \mathbb{R}^h$, $A_{11} \in \mathbb{R}^{k \times n}$, $A_{12} \in \mathbb{R}^{k \times m}$, $A_{21} \in \mathbb{R}^{h \times n}$, $A_{22} \in \mathbb{R}^{h \times p}$, $B_1 \in \mathbb{R}^{k \times k}$ and $B_2 \in \mathbb{R}^{h \times h}$.

Then there exists a layer W such that

$[x \mid x_1 \mid x_2]\, W\, [y_1 \mid y_2].$

Proof. To prove the lemma we just have to build such a layer. This is easily done by setting

$A = \begin{bmatrix} A_{11} & A_{12} & 0 \\ A_{21} & 0 & A_{22} \end{bmatrix}, \qquad B = \begin{bmatrix} B_1 & 0 \\ 0 & B_2 \end{bmatrix},$

where A is the feedforward weight matrix and B the feedback weight matrix of W. It is easy to see that this layer has k + h outputs, that its first k outputs are equal to $y_1$, and that the other h outputs are equal to $y_2$. From the arbitrariness of $A_1$, $A_2$, $B_1$, $B_2$ the lemma follows. ∎

Thus, if we have two layers, possibly with some common inputs and separate outputs, with $n_1$ and $n_2$ neurons, respectively, a single layer with $n_1 + n_2$ neurons can generate both outputs in parallel, while receiving the same inputs as the two layers. This operation will be referred to as parallelization. The parallelization of layers N and M will be denoted by

Note that the parallelization of two layers is still a layer, and not a more complex structure. In the real world there is no such thing as a "parallelized" pair of layers, and there is no need to define a learning procedure for them. Parallelization is just a modelling tool, not a third type of connection. Parallelization can be applied to whole networks, on a layer-by-layer basis.

PART I: STRUCTURAL PROPERTIES

In this part, we study some structural properties of BFNs. Loosely speaking (the term will be defined more precisely in the next section), structural properties are those referring to the way neurons are connected, independently of the values of the synaptic weights and of the characteristics of the neurons. In BFNs, structural properties are independent of the training algorithm, because training merely adjusts the weight values. This would not be true, on the other hand, for cascade-correlation type algorithms (Fahlman & Lebiere, 1989) or for pruning algorithms (Karnin, 1990; Mozer & Smolensky, 1989), which change both the weights and the network structure.

We identify two elements of the network structure: the architecture of the network and the dimension of the network. The former translates into mathematical terms the intuitive notion of the "connection scheme" of the layers, whereas the latter retains the information on the layer sizes. Both architecture and dimension determine partitions in the set of BFNs and, therefore, can be used to build two distinct quotient sets.

We study the quotient set $\mathcal{A}_n$ of n-layer BFN architectures and the quotient set $\mathcal{D}_n$ of n-layer BFN dimensions. We introduce a partial ordering in $\mathcal{A}_n$ and $\mathcal{D}_n$ and show that this endows both partitions with the structure of a distributive and pseudocomplemented lattice. Then we show that the two quotient sets $\mathcal{A}_n$ and $\mathcal{D}_n$ make up an orthogonal decomposition of the set of n-layer BFNs.

Finally, we show that the ordering induced by $\mathcal{A}_n$ and $\mathcal{D}_n$ in the BFNs corresponds to an increasing computing capability.

3. SOME DEFINITIONS

Our task in this section is to give a formal definition of the term "network architecture." We first introduce a few definitions that restrict the scope of the discussion to multilayer neural networks. Then we give a definition of "structure," which reflects the intuitive notion of something obtained from a neural network by ignoring the values of the weights. Finally, we refine this definition to obtain a definition of network architecture.

Let $W_N$ be the set of all weights of the network N, $V_N$ the set of all neurons of the network N, and $Z_N$ the set of all delay units of the network N.

If the output of the neuron $n \in V_N$ is connected to an input of the neuron $m \in V_N$ by the weight w, we use the notation $n(w)m$. If the output of the neuron n is connected to an input of the neuron m by the weight w and the delay unit z, we use the notation $n(w,z)m$.

DEFINITION 3.1. A feedforward chain of order r, $C_r$, is an ordered set of r neurons $\{n_1, \ldots, n_r\}$ such that

$\forall i \in [1, \ldots, r-1], \ \exists w_i \in W_N : n_i(w_i)n_{i+1}$.  (5)

Two neurons p, q are r-connected if there exists a feedforward chain of order r, $C_r$, such that either $C_r = \{p, n_2, \ldots, n_{r-1}, q\}$ or $C_r = \{q, n_2, \ldots, n_{r-1}, p\}$.

This definition doesn't consider the feedback connections. Two neurons are part of a feedforward chain only if there is a feedforward (i.e., instantaneous) connection between them. Two neurons are said to be disconnected if they are not r-connected for any value of r.

DEFINITION 3.2. A neuron $n_o$ is an output neuron if there are no $m \in V_N$ and $w \in W_N$ such that $n_o(w)m$. A neuron $n_i$ is an input neuron if there are no $m \in V_N$ and $w \in W_N$ such that $m(w)n_i$.

We assume, without loss of generality, that no neuron is both an input and an output neuron.

DEFINITION 3.3. A network N is multilayered if there exists a partition $\mathcal{L} = \{L_1, \ldots, L_n\}$ of $V_N$ such that, for every pair of neurons $n, m \in L_i$ ($p, q \in V_N$, $w_1, w_2 \in W_N$):
1. m and n are disconnected.
2. Both n and m are output neurons, or

$n(w_1)q, \ m(w_2)p \Rightarrow p, q \in L_j, \ i \ne j$.  (6)

3. Both n and m are input neurons, or

$q(w_1)n, \ p(w_2)m \Rightarrow p, q \in L_j, \ i \ne j$.  (7)

The sets $L_i$ are called layers of the network.

From the definition, it is clear that if the neuron $n \in L_i$ is connected to the neuron $m \in L_j$, then all the neurons of the layer $L_i$ are either disconnected or connected to neurons of the layer $L_j$. We can briefly say that the layer $L_i$ is connected to the layer $L_j$. Note that the concept of r-feedforward chain $C_r$ can easily be extended to layers.

DEFINITION 3.4. Two layers $L_1$, $L_2$ are feedforward connected ($L_1(w)L_2$) if there exist a neuron $n \in L_1$ and a neuron $m \in L_2$ such that $n(w)m$, with $w \in W_N$.

Two layers $L_1$, $L_2$ are feedback connected ($L_2(w,z)L_1$) if there exist a neuron $n \in L_1$ and a neuron $m \in L_2$ such that $n(w,z)m$, with $w \in W_N$ and $z \in Z_N$.

DEFINITION 3.5. A network is fully connected if

$L_1(w)L_2 \Rightarrow \forall n \in L_1, \forall m \in L_2 : n(w)m$  (8)

and

$L_1(w,z)L_2 \Rightarrow \forall n \in L_1, \forall m \in L_2 : n(w,z)m$.  (9)

This means that if a neuron of layer $L_1$ is connected (feedback connected) to a neuron of layer $L_2$, then every neuron of layer $L_1$ is connected (feedback connected) to all the neurons of layer $L_2$.

The feedforward connections also induce a natural ordering in the set of layers.

DEFINITION 3.6. Let $\mathcal{L} = \{L_1, \ldots, L_n\}$ be the ordered set of layers of the neural network N. The set $\mathcal{L}$ is properly ordered if, for each i, $L_i$ and $L_{i+1}$ are connected by a forward connection, that is,

$\forall n_h \in L_i, m_k \in L_{i+1} \ \exists w_{ihk} \in W_N : n_h(w_{ihk})m_k$.

To study structural relations among BFNs, we must abstract from the details of the particular networks. For instance, we don't care about the specific weight values, because the intuitive notion of structure is something that is left unchanged when the weight values change. We capture this with the following definition.

DEFINITION 3.7. Two neural networks N and M are structurally equal ($N \stackrel{s}{\equiv} M$) if they have the same number of neurons (i.e., $|V_N| = |V_M|$), and $V_N$ and $V_M$ can be ordered as $V_N = \{n_1, \ldots, n_s\}$, $V_M = \{m_1, \ldots, m_s\}$ so that
1. $n_i(w)n_j \Leftrightarrow m_i(\bar{w})m_j$
2. $n_i(w,z)n_j \Leftrightarrow m_i(\bar{w},\bar{z})m_j$.

This definition does not imply that N and M implement the same function, because structurally equal networks may have different weights. On the other hand, the definition ensures that $|W_N| = |W_M|$.

THEOREM 3.1. The structural equality relation (3.7) is an equivalence relation.

We omit the proof of this property for the sake of brevity. It can be found in Santini and Del Bimbo (1993). It is easy to see that Definitions 3.4 and 3.7 ensure that structurally equal networks have the same number of layers.

THEOREM 3.2. Let M and N be such that $M \stackrel{s}{\equiv} N$; let $\{Q_1, \ldots, Q_n\}$ be the layers of M, and $\{L_1, \ldots, L_n\}$ the layers of N. These two sets can be ordered such that if $L_i$ is r-connected to $L_j$ for some r, then $Q_i$ is r-connected to $Q_j$ for the same r.

Let $S_n$ be the set of all multilayer neural networks with n layers. The equivalence relation $\stackrel{s}{\equiv}$ (3.7) can be used to build equivalence classes in $S_n$, defined as

$[N](\stackrel{s}{\equiv}) = \{M \in S_n \mid M \stackrel{s}{\equiv} N\}$.  (10)


DEFINITION 3.8. Let $S_n^B$ be the set of n-layer BFNs. An n-layer structure is an element of the quotient set

$\mathcal{F}_n = S_n^B/(\stackrel{s}{\equiv})$.  (11)

The definitions and properties given so far are quite general, and they are valid for a large class of networks. In this paper, however, we are concerned with BFNs, for which architectures are determined by the connections of Figure 1. We now discuss some specific properties of the BFN model. These properties stem from the recursive mechanism that creates BFN networks.

First, we establish a principle that governs the transmission of properties among BFNs. Let P be an arbitrary logic predicate with domain in the set of BFNs, and N a BFN. If P is true for N, we will use the notation PN. Then, the following holds.

LEMMA 3.1. (Inheritance Principle) If Pq holds for each single layer q, and if

$PN_1, PN_2 \Rightarrow P(N_1 \cdot N_2)$

$PN \Rightarrow P(q\{N\})$ for each layer q,

then P holds for all block feedback neural networks.

THEOREM 3.3. Let N be a BFN with layers $\{L_1, \ldots, L_n\}$. Then:
1. it is possible to find a permutation $p_1, \ldots, p_n$ of $1, \ldots, n$ such that, for the ordered set $\{L_{p_1}, \ldots, L_{p_n}\}$:

(a) each nondelay connection is forward-directed, that is,

$n \in L_{p_i}, \ m \in L_{p_j}, \ n(w)m \Rightarrow j = i + 1$;

(b) each delay connection is backward-directed, that is,

$n \in L_{p_i}, \ m \in L_{p_j}, \ n(w,z)m \Rightarrow j \le i$;

2. each layer has its input connected with the output of at most one layer: if $m, n \in L_i$, $p \in L_k$, and $q \in L_h$, then

$p(w_1,z_1)m, \ q(w_2,z_2)n \Rightarrow h = k$.

Proof. We will prove only the last assertion (the proofs of the other two being similar) by using the inheritance principle.

The property is certainly true for a single layer, because it receives either no feedback or feedback from itself. Suppose now that it is true for a network N, and consider a network N' obtained by application of the elementary connections to N and an arbitrary layer l. There are four different ways in which we can use the two elementary connections to put together the layer l and the network N:

$l \cdot N$

$l\{N\}$

$l\{\ \} \cdot N$

$N \cdot l\{\ \}$

In the first case, the layers of N receive exactly the same number of feedbacks as before the connection (i.e., none or one) by hypothesis; in case 2, the layers of N are untouched, and layer l receives feedback only from the (k + 1)th layer; in cases 3 and 4, the layer l receives feedback from itself only. Thus, the property holds for the whole network. ∎

DEFINITION 3.9. Let $N_1$ and $N_2$ be two layered networks, with $|\mathcal{L}_1| = |\mathcal{L}_2|$ (i.e., let the two networks have the same number of layers). The network $N_1$ is said to be φ-related to the network $N_2$ ($N_1 \stackrel{s}{\equiv} N_2(\phi)$) if there are $L_i^1 \in \mathcal{L}_1$, $L_i^2 \in \mathcal{L}_2$ with $L_i^1 \ne \emptyset$ and $L_i^2 \ne \emptyset$, and there is a neuron $n \notin V_{N_2}$ such that

$N' = \{L_1^2, \ldots, L_{i-1}^2, L_i^2 \cup \{n\}, L_{i+1}^2, \ldots, L_n^2\} \stackrel{s}{\equiv} N_1$,  (12)

that is, there are two nonempty layers $L_i^1 \in \mathcal{L}_1$, $L_i^2 \in \mathcal{L}_2$ such that the two networks can be made structurally equal by adding a neuron to $L_i^2$.

This definition doesn't allow layer creation and destruction. The two layers $L_i^1$ and $L_i^2$ must be nonempty before we can add a neuron to $L_i^2$.

This definition is evidently antisymmetrical: if $N_1 \stackrel{s}{\equiv} N_2(\phi)$, then we cannot have $N_2 \stackrel{s}{\equiv} N_1(\phi)$. We can state the following symmetrical version.

DEFINITION 3.10. Two networks $N_1$ and $N_2$ are Ψ-related ($N_1 \stackrel{s}{\equiv} N_2(\Psi)$) if either $N_1 \stackrel{s}{\equiv} N_2(\phi)$ or $N_2 \stackrel{s}{\equiv} N_1(\phi)$.

The following property of the Ψ relation is easily verified.

LEMMA 3.2. If $N_1 \stackrel{s}{\equiv} N_2$ then $N_1 \stackrel{s}{\equiv} N_2(\Psi)$.

The Ψ relation, though symmetric, is not an equivalence relation, because it is not transitive.

DEFINITION 3.11. Two networks $N_1$ and $N_2$ are said to be Θ-related, written $N_1 \equiv N_2(\Theta)$, if there exists an ordered set of networks $\{Q_0, Q_1, \ldots, Q_r\}$ such that $N_1 \stackrel{s}{\equiv} Q_0$, $N_2 \stackrel{s}{\equiv} Q_r$, and

$\forall i \in [1, r-1] \quad Q_i \stackrel{s}{\equiv} Q_{i+1}(\Psi)$.  (13)

For the Θ relation, the following property holds (see Santini & Del Bimbo, 1993, for the proof).

THEOREM 3.4. The Θ relation is an equivalence relation.

As an equivalence relation, Θ can be used to build equivalence classes in the set $\mathcal{F}_n$ of all n-layer BFN structures:

$[N](\Theta) = \{M \in \mathcal{F}_n \mid M \equiv N(\Theta)\}$.  (14)

This also generates the quotient set $\mathcal{A}_n = \mathcal{F}_n/\Theta$: the set of all the Θ equivalence classes of $\mathcal{F}_n$.

We can now define the term network architecture.


DEFINITION 3.12. A network architecture for n-layer BFNs is a Θ equivalence class over the set of all n-layer neural networks.

The set $\mathcal{A}_n = \mathcal{F}_n/\Theta$ is the set of n-layer network architectures.

Note that, because the network architecture is defined in terms of the Θ equivalence, structurally equal networks, φ-related networks, and Ψ-related networks all have the same architecture.

An equivalence class can be represented by one of its elements, because if N and M are Θ equivalent, then $[N](\Theta) = [M](\Theta)$. Therefore, we can pick one of the networks belonging to a particular equivalence class and use it to label the whole class. We exploit the following lemma, whose proof is in Santini and Del Bimbo (1993).

LEMMA 3.3. Given an equivalence class $[N](\Theta) \in \mathcal{F}_n/\Theta$, there is a network $\eta_N \in [N](\Theta)$ having exactly one neuron in each layer.

Thus, we can label a network architecture using the "single-neuron" network. In the following we will often refer to the "$\eta_N$ architecture." This will mean that we refer to the equivalence class that $\eta_N$ belongs to.

4. ARCHITECTURE FUNCTIONS

The structure of the set $\mathcal{A}_n$ can be more easily studied if we find a way to characterize an architecture in terms of numerical properties only, without any reference to the recursive process of actually building an architecture.

In this section we associate a descriptive function with each network architecture. We first give an intuitive description of the architecture function, and then the technical definitions.

The idea is quite simple: assign a number from zero to n to each layer of an n-layer network. The value of the function for layer i will be equal to the layer where the feedback path ending in layer i starts from.

For instance, consider the network $a \cdot b\{c\{d\}\} \cdot e$ [eqn (1) and Figure 2]. This network has five layers. The feedback path ending in layer b (layer number two) starts from layer d (layer number four); therefore, for this network, f(2) = 4.

There is a problem in this representation: which value should be assigned to the layers without any feedback path ending on them. A simple idea would be to define the value as equal to the layer number. Unfortunately, this does not work. Consider the following two networks:

$a \cdot b \cdot c$

and

$a \cdot b\{\ \} \cdot c$.

According to this definition, they would both have f(1) = 1, f(2) = 2, and f(3) = 3. Nevertheless, they have different architectures (i.e., they do not belong to the same Θ class).

To work the problem out, we must modify the representation. Imagine, after any layer, the presence of a ghost layer. The architecture functions take their arguments on the real layers only, but take values on the set of real plus ghost layers, so that, for an n-layer network, $f : \{1, \ldots, n\} \to \{1, \ldots, 2n\}$. In the target set, real layers correspond to odd numbers and ghost layers to even numbers. The value of f(i) is equal to the real layer corresponding to i (i.e., 2i − 1) if no feedback path ends in layer i. Otherwise, the value is always even, and corresponds to the ghost layer associated with the real layer where the feedback path starts.

For instance, for the first network above, we have f(1) = 1, f(2) = 3, f(3) = 5. For the second network we have f(1) = 1, f(2) = 4 (the ghost layer after layer b), and f(3) = 5.

In the following we will sometimes use a shorthand representation for the nestings. A function f is represented as an n-tuple whose ith element is equal to the value of f(i). For instance, the network $a\{b\} \cdot c$ will be represented by the vector [4, 3, 5].

This is an isomorphic representation; that is, the mapping from networks to functions is one-to-one and onto. We now give a more formal definition of the concepts introduced so far.

Let $\mathbb{N}_n$ be the set of integer numbers from 1 to n, $\mathbb{E}_n$ the set of even numbers from 1 to n, and $\mathbb{O}_n$ the set of odd numbers from 1 to n.

DEFINITION 4.1. Let an n-layer BFN N be given, and let $\eta_N$ be a unitary network such that

$\eta_N \equiv N(\Theta)$,

with layers $\{L_1, \ldots, L_n\}$. The architecture function of N is a function $f : \mathbb{N}_n \to \mathbb{N}_{2n}$ defined as follows: for each $n \in L_i$, if there exists $m \in L_j$ such that $m(w,z)n$, then define $f(i) = 2j$; otherwise define $f(i) = 2i - 1$.

It is easy to see that Definitions 3.3 and 3.4 and the last point of Theorem 3.3 ensure that the function f is properly defined.

A set of functions isomorphic to architecture functions can be defined without making any reference to BFNs. To this end, we introduce the following class of functions.

DEFINITION 4.2. A function $f : \mathbb{N}_n \to \mathbb{N}_{2n}$ is a nesting if:
1. $f(i) \ge 2i - 1$
2. $f(i) \in \mathbb{O}_{2n} \Rightarrow f(i) = 2i - 1$
3. $f(i) = 2k \Rightarrow \forall j : i \le j \le k, \ f(j) \le 2k$.
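The three conditions of Definition 4.2 are easy to check mechanically on the vector shorthand introduced above. The following Python sketch is ours (the function name is hypothetical) and is only meant to make the definition concrete.

```python
def is_nesting(f):
    """Check Definition 4.2 on the vector shorthand: f[i-1] == f(i), i = 1..n."""
    n = len(f)
    for i in range(1, n + 1):
        v = f[i - 1]
        if v < 2 * i - 1:                      # condition 1
            return False
        if v % 2 == 1 and v != 2 * i - 1:      # condition 2 (odd values)
            return False
        if v % 2 == 0:                         # condition 3: f(j) <= 2k for i <= j <= k
            k = v // 2
            if any(f[j - 1] > 2 * k for j in range(i, k + 1)):
                return False
    return True

print(is_nesting([1, 3, 5]))      # a·b·c        -> True
print(is_nesting([1, 4, 5]))      # a·b{ }·c     -> True
print(is_nesting([4, 3, 5]))      # a{b}·c       -> True
print(is_nesting([1, 2, 2]))      # violates condition 1 at i = 3 -> False
```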

The relation between nesting functions and BFNs is established by Theorems 4.1 and 4.2.

THEOREM 4.1. If N is a BFN, its architecture description function is a nesting function or, using the term N-representable to indicate entities that can be represented as nestings, every BFN is N-representable.

Proof. We use the inheritance principle. First we prove that a single layer is N-representable; then we prove that the two allowed operations, cascade and feedback, map N-representable networks (or, in the case of cascade, pairs of N-representable networks) into N-representable networks.

A single layer is described by the function f(1) = 1, which is trivially a nesting. Let $N_1$ be a network described by $f_1$ and $N_2$ a network described by $f_2$, with both $f_1$ and $f_2$ nesting functions. For ease of notation, we suppose both networks are made up of n layers. Let $N = N_1 \cdot N_2$, and let $f : \mathbb{N}_{2n} \to \mathbb{N}_{4n}$ be the corresponding describing function. Then:

$f(i) = f_1(i)$ if $i \le n$; $f(i) = f_2(i - n) + 2n$ if $i > n$.  (15)

To prove the theorem, we have to show that f is a nesting, that is, that it fulfills the three parts of Definition 4.2:
1. First, we prove that $f(i) \ge 2i - 1$: if $i \le n$, then $f(i) = f_1(i) \ge 2i - 1$; if $i > n$, then $f_2(i - n) \ge 2(i - n) - 1$; thus $f(i) = f_2(i - n) + 2n \ge 2i - 1$.
2. If $f(i) \in \mathbb{O}_{4n}$, then either $i \le n$ and $f_1(i) \in \mathbb{O}_{2n}$, or $i > n$ and $f_2(i - n) \in \mathbb{O}_{2n}$. The first case is obvious: because $f(i) = f_1(i)$, then $f(i) = 2i - 1$. For the second case: if $f(i) = f_2(i - n) + 2n \in \mathbb{O}_{4n}$, then $f_2(i - n) \in \mathbb{O}_{2n}$. Because $f_2$ is a nesting, this implies that $f_2(i - n) = 2(i - n) - 1$, thus $f(i) = 2(i - n) - 1 + 2n = 2i - 1$.
3. Finally, let $f(i) = 2k$. We must show that, for all j such that $i \le j \le k$, we have $f(j) \le f(i)$. It can be either $i \le n$ (and then $k \le n$) or $i > n$ (and then $k > n$). The first case is straightforward, f being equal to $f_1$. For the second case, let $j > i$. Then $f(j) = f_2(j - n) + 2n$. But, because $i \le j \le k$ and $f_2$ is a nesting, $f_2(j - n) \le f_2(i - n)$. Thus, $f(j) = f_2(j - n) + 2n \le f_2(i - n) + 2n = f(i)$.
Thus, the describing function of the cascade of two N-representable networks is a nesting.

Now, let N be an N-representable network, and let q be a layer. We build the feedback network $q\{N\}$. If $f : \mathbb{N}_n \to \mathbb{N}_{2n}$ is the representation function of N, then the representation function of the whole network is $g : \mathbb{N}_{n+1} \to \mathbb{N}_{2(n+1)}$ defined as:

$g(i) = 2(n + 1)$ if $i = 1$; $g(i) = f(i - 1) + 2$ if $i > 1$.  (16)

It can be shown, using the same technique used above, that if f is a nesting, g is also a nesting. We skip the proof for the sake of brevity. ∎
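Equations (15) and (16) give a direct way to compute the architecture function of a cascade or feedback composition from those of its parts. The sketch below is ours (vector shorthand, hypothetical names); it rebuilds the function of the network $a \cdot b\{c\{d\}\} \cdot e$ of Figure 2 from single layers.

```python
def cascade(f1, f2):
    """Architecture function of the cascade N1·N2 (eqn (15)), vector shorthand."""
    n = len(f1)
    return list(f1) + [v + 2 * n for v in f2]

def feedback(f):
    """Architecture function of q{N} (eqn (16)): the new first layer is fed back
    from the last layer of N; every other value is shifted by one layer."""
    n = len(f)
    return [2 * (n + 1)] + [v + 2 for v in f]

b_c_d = feedback(feedback([1]))             # b{c{d}}      -> [6, 6, 5]
full = cascade([1], cascade(b_c_d, [1]))    # a·b{c{d}}·e  -> [1, 8, 8, 7, 9]
print(b_c_d, full)
```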

To prove the second theorem, we first state the following lemma.

LEMMA 4.1. Consider a nesting function f with $f(i) = 2k > 2i - 1$, and the set of the restrictions of such functions to $[i, k]$:

$\mathcal{R} = \{f|_{[i,k]} : [i, k] \to [2i - 1, 2k]\}$.

Then $\mathcal{R}$ is isomorphic to the set of the nesting functions $g : \mathbb{N}_{k-i+1} \to \mathbb{N}_{2(k-i+1)}$, with an isomorphism defined by:

$g(j) = f(i + j - 1) - 2(i - 1), \quad j = 1, \ldots, k - i + 1$.

Proof. The proof is a straightforward application of the definition.

THEOREM 4.2. If f is a nesting function, then there exists a BFN N such that the architecture function of N is equal to f. That is, every nesting function represents a network.

Proof. The proof is by induction on the size n of the function domain. It can be verified by direct enumeration that for each nesting $f : \mathbb{N}_2 \to \mathbb{N}_4$ there is at least one BFN architecture described by f.

Suppose now the property holds for all the nestings $f_n : \mathbb{N}_n \to \mathbb{N}_{2n}$, and consider a nesting $f_{n+1} : \mathbb{N}_{n+1} \to \mathbb{N}_{2(n+1)}$.

Consider the value of f(1). We must discuss two cases: f(1) = 1 and f(1) = 2j, j ≥ 1.

Consider the case f(1) = 1. By Lemma 4.1, the restriction of f to $\{2, \ldots, n+1\}$ can be mapped onto a nesting function $f_2 : \mathbb{N}_n \to \mathbb{N}_{2n}$. Because the domain of this function has size n, by hypothesis it is the architectural description of a BFN. Let N be this network, and consider the network $1 \cdot N$. For the architecture function $f_a$ of this network, it holds: $f_a(1) = 1$, $f_a(i)|_{i>1} = f_2(i-1) + 2 = f(i)$. Thus, $f_a = f$, and f is the description function of the network.

Assume now f(1) = 2j, with j ≥ 1. Consider the two functions $f_2$ and $f_3$, $f_2$ being the restriction of f to $\{2, \ldots, j\}$, and $f_3$ the restriction of f to $\{j+1, \ldots, n+1\}$. Both $f_2$ and $f_3$ can be mapped, by virtue of Lemma 4.1, onto two nestings with a domain of size less than or equal to n; thus, by assumption, they are the architectural descriptions of two networks $N_1$ and $N_2$. If we consider the network $1\{N_1\} \cdot N_2$, it is easy to see, with an argument similar to that above, that f is the architecture function of this network. ∎

We have proved that all architectural description functions are nestings, and all nestings are architectural descriptions. We still need to prove that the correspon- dence is unique.

THEOREM 4.3. Each B F N architecture is described by a unique nesting.

Proof. Because we have shown that architecture functions are nestings, it will be sufficient to show that the association is unique.

Suppose $f_1$ and $f_2$ are two distinct nesting functions describing the same network N, and assume that $f_1(k) \ne f_2(k)$.


Let us distinguish three cases:
1. If both $f_1(k) \in \mathbb{O}_{2n}$ and $f_2(k) \in \mathbb{O}_{2n}$, then, by point 2 of Definition 4.2, $f_1(k) = 2k - 1 = f_2(k)$, thus contradicting the hypothesis.
2. Assume $f_1(k) \in \mathbb{E}_{2n}$ and $f_2(k) \in \mathbb{O}_{2n}$. In this case, $f_1(k) = 2h$ and there are neurons $n \in L_k$, $m \in L_h$ such that $m(w,z)n$. But if such a connection exists, then, by Definition 3.3, all the neurons in the layer $L_h$ are connected to neurons in the layer $L_k$ and, by the definition of Θ equivalence, the hth layer of $\eta_N$ is connected to the kth layer. Thus, we must have $f_2(k) = 2h$, contradicting the hypothesis. The case $f_2(k) \in \mathbb{E}_{2n}$ and $f_1(k) \in \mathbb{O}_{2n}$ is evidently symmetric.
3. If $f_1(k) = 2h$ and $f_2(k) = 2q$, with $h \ne q$, then by a reasoning analogous to point 2 we can see that this contradicts Theorem 3.3. ∎

THEOREM 4.4. If two architectures are described by the same nesting, they are equal.

Proof. Suppose A and B are two network architectures, that is, two networks with just one neuron in each layer. Because they are described by the same function $f : \mathbb{N}_n \to \mathbb{N}_{2n}$, they have the same number of layers, namely n. Because they are both layered networks, if $\{L_{A1}, \ldots, L_{An}\}$ are the layers of the first network, and $\{L_{B1}, \ldots, L_{Bn}\}$ are the layers of the second, we have

$n \in L_{Ai}, \ m \in L_{A(i+1)} \Rightarrow n(w)m$.

Similarly,

$n \in L_{Bi}, \ m \in L_{B(i+1)} \Rightarrow n(w)m$.

Thus, the feedforward connections satisfy the equality requirements. We have to show that the feedback connections do the same.

Let $f(i) = 2j \ge 2i$. Then

$n \in L_{Ai}, \ m \in L_{Aj} \Rightarrow m(w,z)n$.

Because B is described by the same function,

$n \in L_{Bi}, \ m \in L_{Bj} \Rightarrow m(w,z)n$.

If $f(i) = 2i - 1$, then there does not exist m such that $n \in L_{Ai}$ and $m(w,z)n$. Similarly, there does not exist a corresponding m for the second network.

Thus, for each n, m in A such that $n(w)m$ or $m(w,z)n$, there exists a corresponding couple in B for which the same connection holds. ∎

The above properties imply that the relation between network architectures and nesting functions is one-to-one and onto (i.e., it is an isomorphism).

5. LATTICE STRUCTURE OF NETWORK ARCHITECTURES

Given the isomorphism between network architectures and nesting functions, we can study the properties of the nesting functions to investigate the structure of the set $\mathcal{A}_n$.

Because of the isomorphism, we can use the same symbol $\mathcal{A}_n$ to indicate the set of all nestings $f : \mathbb{N}_n \to \mathbb{N}_{2n}$. In $\mathcal{A}_n$ we can define a partial ordering as follows.

DEFINITION 5.1. Let $f_1, f_2 \in \mathcal{A}_n$. We say that $f_1 \le f_2$ if

$\forall i \in [1, \ldots, n] \quad f_1(i) \le f_2(i)$.

This definition makes the set $\mathcal{A}_n$ a poset. The meet (or infimum, ∧) operator can be defined as:

DEFINITION 5.2. $\forall f, f_1, f_2 \in \mathcal{A}_n$:

$f = f_1 \wedge f_2 \Leftrightarrow f(i) = \min(f_1(i), f_2(i)) \quad \forall i \in [1, \ldots, n]$.

It is easy to verify that $f_1, f_2 \in \mathcal{A}_n \Rightarrow f_1 \wedge f_2 \in \mathcal{A}_n$; thus $\mathcal{A}_n$ is closed with respect to ∧.

To make $\mathcal{A}_n$ into a lattice, we have to show that, for all $f_1, f_2 \in \mathcal{A}_n$, $\sup\{f_1, f_2\} = f_1 \vee f_2$ exists. This is shown by the following lemma.

LEMMA 5.1. For all $f_1, f_2 \in \mathcal{A}_n$, there exists $f \in \mathcal{A}_n$ such that $f = \sup\{f_1, f_2\}$.

Proof. Given $f_1$ and $f_2$, consider the set

$A_{f_1,f_2} = \{g \in \mathcal{A}_n : f_1 \le g \text{ and } f_2 \le g\}$.

This set is surely nonempty, because the "1" function, defined as $1(i) = 2n$ for all i, belongs to it. Moreover, $A_{f_1,f_2}$ is a finite partial lattice, because $f, g \in A_{f_1,f_2} \Rightarrow f \wedge g \in A_{f_1,f_2}$. Therefore, there is a minimum in $A_{f_1,f_2}$, that is, an element a such that

$x \in A_{f_1,f_2} \Rightarrow a \le x$.

It is easy to see that $a = f_1 \vee f_2$. ∎

The closure property of $\mathcal{A}_n$ with respect to ∨ is a straightforward consequence of the definition. These operators are associative, idempotent, and satisfy the absorption identities (Grätzer, 1971):

$a \wedge (a \vee b) = a, \qquad a \vee (a \wedge b) = a$.  (17)

From this, the next theorem easily follows.

THEOREM 5.1. For each n, the set $\mathcal{A}_n$ is a lattice.

Figures 3 and 4 show the lattice diagrams for two- and three-layer feedback neural networks. The codes beside the lattice elements are the corresponding function representations.

The covering relation between two functions f and g is defined as: f covers g ($g \prec f$) if $g \le f$ and there is no element h, distinct from f and g, such that $g \le h \le f$.

THEOREM 5.2. Let $f, g \in \mathcal{A}_n$. Then $g \prec f$ iff $g(i) = f(i)$ for $i \ne k$ and either:
1. $g(k) = 2k - 1$ and $f(k) = 2k$, or
2. $g(k) = 2j$ and $f(k) = \max\{2(j+1), g(j+1)\}$.

The proof of this theorem is omitted; it can be found in Santini and Del Bimbo (1993).
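On the vector shorthand, the meet of Definition 5.2 is an elementwise minimum, while the join of Lemma 5.1 is the least element of the set of common upper bounds. The brute-force sketch below is ours (hypothetical names; exhaustive enumeration is only feasible for small n) and mirrors exactly that construction.

```python
from functools import reduce
from itertools import product

def nestings(n):
    """Enumerate all nesting functions on n layers (vector shorthand), brute force."""
    def ok(f):
        for i in range(1, n + 1):
            v = f[i - 1]
            if v < 2 * i - 1:                                   # condition 1
                return False
            if v % 2 == 1 and v != 2 * i - 1:                   # condition 2
                return False
            if v % 2 == 0 and any(f[j - 1] > v for j in range(i, v // 2 + 1)):
                return False                                    # condition 3
        return True
    return [list(f) for f in product(range(1, 2 * n + 1), repeat=n) if ok(f)]

def meet(f1, f2):
    return [min(a, b) for a, b in zip(f1, f2)]                  # Definition 5.2

def join(f1, f2, n):
    ups = [g for g in nestings(n)
           if all(a <= c and b <= c for a, b, c in zip(f1, f2, g))]
    return reduce(meet, ups)     # the upper-bound set is closed under meet (Lemma 5.1)

f1, f2 = [1, 4, 5], [4, 3, 5]    # a·b{ }·c  and  a{b}·c
print(meet(f1, f2))              # [1, 3, 5]  (a·b·c)
print(join(f1, f2, 3))           # [4, 4, 5]  (a{b{ }}·c)
```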

FIGURE 3. Lattice diagram of $\mathcal{A}_2$.

The passage from an element of the lattice $\mathcal{A}_n$ to another element that covers it corresponds to one of the two following "elementary" operations:
1. the insertion of a narrow feedback loop (i.e., a loop with no layers embedded into it) in a feedforward layer, or
2. the extension of an existing feedback loop to the next available position.

6. DISTRIBUTIVITY AND PSEUDOCOMPLEMENTATION

The lattice of BFN architectures has several noticeable properties, stated in the two theorems that follow. To prove the first, we begin by proving the following lemma.

LEMMA 6.1. Let I and J be two ideals of $\mathcal{A}_n$, and let $p \in I$, $q \in J$. If $g \le p \vee q$, then there exist $\hat{p} \in I$ and $\hat{q} \in J$ such that $g = \hat{p} \vee \hat{q}$.

Proof. Define $\hat{p} = p \wedge g$ and $\hat{q} = q \wedge g$, that is,

$\hat{p}(i) = \min\{p(i), g(i)\}$

$\hat{q}(i) = \min\{q(i), g(i)\}$

and $h = \hat{p} \vee \hat{q}$. According to the definition of ∨ (Lemma 5.1), if

$S_h = \{h \in \mathcal{A}_n : \hat{p} \le h, \ \hat{q} \le h\}$,

then the proof of the theorem is equivalent to the proof that $g = \inf\{S_h\}$.

To prove this, first note that from the definition of $\hat{p}$ and $\hat{q}$ it follows that $g \in S_h$.

Then, let $s \in S_h$, and consider the value of s(i). By definition, $s(i) \ge \max\{\hat{p}(i), \hat{q}(i)\}$. That is, considering s(i) as an element of the lattice of integer numbers,

$s(i) \ge \hat{p}(i) \vee \hat{q}(i) = (p(i) \wedge g(i)) \vee (q(i) \wedge g(i))$,

but the lattice of integer numbers is distributive; thus, for the single values s(i), the distributive law holds:

$s(i) \ge g(i) \wedge (p(i) \vee q(i))$

and, because $g \le p \vee q$, we have $g(i) \wedge (p(i) \vee q(i)) = g(i)$; thus $s(i) \ge g(i)$ and, repeating the argument for all i, $s \ge g$. Therefore,

$s \in S_h \Rightarrow s \ge g$.

Because we have just seen that $g \in S_h$, it must be $g = \min\{S_h\}$. This proves that $g = \hat{p} \vee \hat{q}$.

We still have to prove that $\hat{p} \in I$ and $\hat{q} \in J$. From the definition, it is apparent that $\hat{p} \le p$ and $\hat{q} \le q$.

Because $\mathcal{A}_n$ is finite, every ideal is principal. Thus, there exist $i, j \in \mathcal{A}_n$ such that $I = \{x \mid x \le i\}$ and $J = \{x \mid x \le j\}$. This means that $p \in I$, $\hat{p} \le p \Rightarrow \hat{p} \in I$, and similarly, $\hat{q} \in J$.

This completes the proof of the lemma. ∎

Moreover, it is possible to prove that the following theorem holds (Grätzer, 1971).

THEOREM 6.1. A lattice L is distributive iff, for any two ideals I and J of L,

$I \vee J = \{i \vee j \mid i \in I, j \in J\}$.

The first relevant property of $\mathcal{A}_n$ is stated in the following theorem.

THEOREM 6.2. The lattice $\mathcal{A}_n$ is distributive.

Proof. The proof is an application of Theorem 6.1. Let I and J be two ideals of $\mathcal{A}_n$, and H the subset of $\mathcal{A}_n$ such that $I \vee J = (H]$. The theorem can then be proved by showing that

$(H] = \{a : a = i \vee j, \ i \in I, \ j \in J\}$.

FIGURE 4. Lattice diagram of $\mathcal{A}_3$.

For a general property of ideals, it holds (Grätzer, 1971):

$g \in (H] \Rightarrow \exists h_1, \ldots, h_n \in H : g \le h_1 \vee \cdots \vee h_n$.

Because $H = I \cup J$, for every $h_i$, either $h_i \in I$ or $h_i \in J$ (or both).

Let us order the $h_i$ as $\{h_1, \ldots, h_n\} = \{p_1, \ldots, p_r, q_1, \ldots, q_s\}$ with $p_i \in I$ and $q_i \in J$. Thus,

$g \le p_1 \vee \cdots \vee p_r \vee q_1 \vee \cdots \vee q_s$.

But, being ideals, I and J are sublattices; thus, there are two elements

$\bar{p} = p_1 \vee \cdots \vee p_r \in I$

$\bar{q} = q_1 \vee \cdots \vee q_s \in J$

such that

$g \in (H] \Rightarrow g \le \bar{p} \vee \bar{q}, \quad \bar{p} \in I, \ \bar{q} \in J$.

The distributivity is then a consequence of Lemma 6.1. ∎

The second property of $\mathcal{A}_n$ is stated in the following theorem.

THEOREM 6.3. The lattice $\mathcal{A}_n$ is pseudocomplemented.

Proof. The proof is by construction. Given $f \in \mathcal{A}_n$, consider the function g defined as:

$g(i) = 2i - 1$ if $f(i) > 2i - 1$; $g(i) = 2n$ if $f(i) = 2i - 1$.  (18)

It is easy to prove that $g = f^*$. The proof is in Santini and Del Bimbo (1993). ∎

Moreover, it is easy to check that for all f in $\mathcal{A}_n$, it holds $f^* \vee f^{**} = 1$.
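Equation (18) gives the pseudocomplement explicitly, so it can be computed directly on the vector shorthand; the short sketch below is ours and only illustrates the formula.

```python
def pseudocomplement(f):
    """Pseudocomplement f* of an architecture function, per eqn (18)."""
    n = len(f)
    return [2 * i - 1 if f[i - 1] > 2 * i - 1 else 2 * n
            for i in range(1, n + 1)]

f = [1, 4, 5]                      # a·b{ }·c
fs = pseudocomplement(f)           # f*  = [6, 3, 6]
fss = pseudocomplement(fs)         # f** = [1, 6, 5]
print(fs, fss)                     # their join is the "1" function [6, 6, 6]
```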

THEOREM 6.4. The lattice $\mathcal{A}_n$ is a Stone algebra $(\mathcal{A}_n, \wedge, \vee, {}^*)$.

7. THE LATTICE OF THE LAYER DIMENSIONS

So far, we have considered network architectures, that is, networks containing, by definition, a single neuron in each layer. In this section we turn our attention to the number of neurons in the layers. We will not consider the feedback structure of the network, which is described by the elements of $\mathcal{A}_n$. The feedforward connections will be implicitly considered, because we will use the proper ordering (Definition 3.6) that is induced by the feedforward connections. All sets of layers considered in the following will be assumed to be properly ordered.

Consider the set of networks with n layers. We first need to establish an equivalence principle, different from the Θ equivalence, that focuses on layer dimension properties. To this end, we state the following definition.

DEFINITION 7.1. Let $N_1$ and $N_2$ be two neural networks with n layers, and let $L = \{L_1, \ldots, L_n\}$ and $K = \{K_1, \ldots, K_n\}$ be the respective layer sets. $N_1$ and $N_2$ are dimensionally equivalent ($N_1 \equiv N_2(D)$) if, for all i, $|L_i| = |K_i|$, that is, the number of neurons in $L_i$ is equal to the number of neurons in $K_i$.

It is easy to show the following (we omit the proof).

THEOREM 7.1. D is an equivalence relation.

Thus, D induces equivalence classes in the set $\mathcal{F}_n$ of neural network structures

$[N](D) = \{M : M \equiv N(D)\}$  (19)

and a quotient set $\mathcal{D}_n = \mathcal{F}_n/(D)$. Our next step is to make $\mathcal{D}_n$ into a poset.

DEFINITION 7.2. Let $N_1, N_2 \in \mathcal{F}_n$ and let $L = \{L_1, \ldots, L_n\}$ and $K = \{K_1, \ldots, K_n\}$ be the respective layer sets. Then,

$N_1 \le N_2$ iff $\forall i \ |L_i| \le |K_i|$.

Then, we can make $\mathcal{D}_n$ into a lattice.

DEFINITION 7.3. Let $N_1, N_2, M \in \mathcal{F}_n$ and let $L = \{L_1, \ldots, L_n\}$, $K = \{K_1, \ldots, K_n\}$ and $Q = \{Q_1, \ldots, Q_n\}$ be the respective layer sets; then:
1. $M = N_1 \wedge N_2 \Leftrightarrow \forall i \ |Q_i| = \min(|L_i|, |K_i|)$
2. $M = N_1 \vee N_2 \Leftrightarrow \forall i \ |Q_i| = \max(|L_i|, |K_i|)$.
It is easy to verify that these two operations are associative, idempotent, and satisfy the absorption identities. Moreover, for all $N, M \in \mathcal{D}_n$, we have $N \wedge M \in \mathcal{D}_n$ and $N \vee M \in \mathcal{D}_n$.

If $\mathbb{N}$ is the set of integer numbers, made into a lattice by the natural ordering, then it is easy to see from the definition of direct product that $\mathcal{D}_n \cong \mathbb{N}^n$. This implies that $\mathcal{D}_n$ is distributive and pseudocomplemented and, as a lattice, it is a Stone algebra.

As in the case of neural network architectures, we will pick a particular element of each equivalence class and use it as a representative for the class. We exploit the following theorem, stated without proof.

THEOREM 7.2. For each n-layer neural network N, there exists a feedforward network $\delta_N$ such that $\delta_N \in [N](D)$.

The feedforward network will be used as the "class placeholder."

8. THE LATTICE OF NEURAL NETWORKS

In the previous sections, we created two distinct partitions in the set $\mathcal{F}_n$ of n-layer neural network structures. The first led to the lattice $\mathcal{A}_n$ of neural network architectures, the second to the lattice $\mathcal{D}_n$ of neural network dimensions.

We now study the relation between these two partitions. Before doing so, we want to stress again that $\mathcal{F}_n$ is not the set of BFNs with n layers, but the set of neural network structures; that is, $\mathcal{F}_n$ is the quotient set of the set $S_n$ under the structural equivalence relation $\stackrel{s}{\equiv}$ (3.7). To study the two partitions we have introduced, it will be easier to refer directly to $S_n$ rather than to $\mathcal{F}_n$. Because $\mathcal{F}_n$ is a partition of $S_n$, the Θ and the D equivalences also are partitions of $S_n$.

We define two "pro jec t ion" operators:

7r . : S. --, ~A.

where, for each network M E S,, the two projections are defined as

7L,(M) = ~IM

7%(M) = ~M.

We want to show that the two quotient lattices we have defined completely characterize the lattice $\mathcal{F}_n$ of neural network structures.

First, we state an intermediate result with the following lemma, whose proof is omitted for the sake of brevity and can be found in Santini and Del Bimbo (1993).

LEMMA 8.1. Let $N, M \in S_n$. Then $\pi_a(M) = \pi_a(N)$ and $\pi_d(M) = \pi_d(N)$ iff $M \stackrel{s}{\equiv} N$.

With the aid of this lemma, we can prove the following theorem, which states the result we were looking for.

THEOREM 8.1. The lattice $\mathcal{F}_n$ is isomorphic to the direct product of $\mathcal{A}_n$ and $\mathcal{D}_n$, that is,

$\mathcal{F}_n \cong \mathcal{A}_n \times \mathcal{D}_n$.  (20)

If $s \in \mathcal{F}_n$, let $s = [N](\stackrel{s}{\equiv})$ [see eqn (11)], and $M \in [N](\stackrel{s}{\equiv})$; then an isomorphism between $\mathcal{F}_n$ and $\mathcal{A}_n \times \mathcal{D}_n$ is given by

$\Lambda(s) = (\pi_a(M), \pi_d(M))$.  (21)

Proof. We first prove that Λ is one-to-one, by contradiction. Suppose $s_1, s_2 \in \mathcal{F}_n$ with $s_1 \ne s_2$ and $\Lambda(s_1) = \Lambda(s_2)$. Let $N_1, N_2 \in S_n$, $s_1 = [N_1](\stackrel{s}{\equiv})$, and $s_2 = [N_2](\stackrel{s}{\equiv})$. Because

$\Lambda(s_1) = \Lambda(s_2) \Leftrightarrow \pi_a(N_1) = \pi_a(N_2), \ \pi_d(N_1) = \pi_d(N_2)$,

from Lemma 8.1 it follows that $N_1 \stackrel{s}{\equiv} N_2$ and thus that

$s_2 = [N_2](\stackrel{s}{\equiv}) = [N_1](\stackrel{s}{\equiv}) = s_1$,

which is a contradiction.

If $N \in S_n$, then, by Theorem 4.1, there is always a nesting associated with it and, thus, an element of $\mathcal{A}_n$. Moreover, an element of $\mathcal{D}_n$ associated with N can be trivially built by counting the neurons in its layers. Therefore, $\pi_a$ and $\pi_d$ are defined for every element in $S_n$ (i.e., Λ is onto). ∎

The isomorphism Λ induces a lattice structure in the set $\mathcal{F}_n$. We call $\mathcal{F}_n$ the lattice of BFNs.
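Theorem 8.1 says that a structure is fully described by the pair (architecture function, layer-size vector), and lattice operations act componentwise on the pair. A minimal sketch of the meet (ours, with hypothetical names):

```python
# A structure as a pair (architecture function, layer sizes); by Theorem 8.1
# the pairing is one-to-one, so the meet acts componentwise on the pair.
def structure_meet(s1, s2):
    (a1, d1), (a2, d2) = s1, s2
    return ([min(x, y) for x, y in zip(a1, a2)],   # meet in A_n (Definition 5.2)
            [min(x, y) for x, y in zip(d1, d2)])   # meet in D_n (Definition 7.3)

s1 = ([1, 4, 5], [3, 5, 2])   # architecture a·b{ }·c, layer sizes 3, 5, 2
s2 = ([4, 3, 5], [4, 1, 2])   # architecture a{b}·c,  layer sizes 4, 1, 2
print(structure_meet(s1, s2)) # ([1, 3, 5], [3, 1, 2])
```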


9. SOME RELATION BETWEEN LATTICES AND COMPUTING CAPABILITIES

In the previous sections we discussed some algebraic properties of the set of BFNs, which derive from the introduction of partial orderings into the sets $\mathcal{A}_n$ and $\mathcal{D}_n$. The resulting algebra allows us to define operations over networks.

It is interesting to investigate whether there is a relation between the orderings introduced so far and certain network properties. In this section, we present a first result in this sense.

Let $x = \{x_1, \ldots, x_t, \ldots\}$ and $y = \{y_1, \ldots, y_t, \ldots\}$ be two sequences of vectors ($x_t \in \mathbb{R}^{n_x}$ and $y_t \in \mathbb{R}^{n_y}$ for all t), and let N be a neural network with $n_x$ inputs and $n_y$ outputs. Suppose that when the sequence x is given as the network input, the network outputs describe the sequence y; we will say in this case that y is generated by x via N, and use the symbol $xN \to y$. If $f_N : \mathbb{R}^{n_x} \times \mathbb{N} \to \mathbb{R}^{n_y}$ is the function implemented by N, then

$xN \to y \Leftrightarrow \forall t \ f_N(x_t, t) = y_t$.

DEFINITION 9.1. If $N_1, N_2 \in S_n$, we say that $N_1$ is covered by $N_2$ within ε ($N_1 \trianglelefteq N_2(\epsilon)$) if, given a weights configuration $w_1$ of $N_1$, it is possible to find a weights configuration $w_2$ of $N_2$ such that

$\forall x, y \quad xN_1 \to y \Rightarrow xN_2 \to \tilde{y}$  (22)

$\forall t \quad \|y(t) - \tilde{y}(t)\| \le \epsilon$.  (23)

It is easy to see that, for a fixed ε, $\trianglelefteq$ is a partial ordering.

The results in the following are strictly related to the universal approximation theorem. There are a number of proofs of the theorem (see, for example, Hecht-Nielsen, 1989; Hornik, 1991; Cotter, 1990) under different hypotheses.

One form of these theorems (see Santini & Del Bimbo, 1993, for the proof) is the following lemma.

LEMMA 9.1. If φ is a function computed by a single layer whose output function satisfies the hypotheses of the universal approximation theorem, then the same function can be approximated arbitrarily well with a cascade of two layers.

By multiple applications of the above lemma, it follows that:

LEMMA 9.2. If g is a function computed by a single layer whose output function satisfies the hypotheses of the universal approximation theorem, then the same function can be approximated arbitrarily well with a cascade of n layers, n ≥ 2.

A first result that relates this definition with the lattice structures is the following.

LEMMA 9.3. Let $s_1, s_2 \in \mathcal{A}_n$, with $s_1 \prec s_2$, and fix $\epsilon > 0$. Then there exist two networks $N_1, N_2 \in S_n$ such that $\pi_a(N_1) = s_1$, $\pi_a(N_2) = s_2$ and $N_1 \trianglelefteq N_2(\epsilon)$.


Proof. Let $f_1$ and $f_2$ be the architecture functions of $s_1$ and $s_2$, respectively. By Theorem 5.2, there is a value k such that $f_1(i)$ and $f_2(i)$ differ only for $i = k$. We have two possibilities.

$f_1(i) = 2i - 1$: In this case, by Theorem 5.2, we have $f_2(i) = 2i$. If $\pi_a(N_1) = s_1$, the ith layer of $N_1$ must be without feedback, that is, a layer with feedforward matrix $A_i$ and no feedback matrix. Let $N_2$ be equal to $N_1$ everywhere but in the ith layer, and let the ith layer of $N_2$ have feedforward matrix $A_i$ and feedback matrix 0. Then the two networks compute the same function, and $f_2(i) = 2i$, as required.

$f_1(i) = 2j$: Suppose, for the sake of simplicity, that $f_1(j+1) = 2(j+1) - 1$; thus, according to Theorem 5.2, $f_2(i) = 2(j+1)$. Let the layers of $N_1$ between the ith and the jth have feedforward matrices $A_k$ and feedback matrices $B_k$, $k = i, \ldots, j$, where, for some k, it might be $B_k = 0$; the $(j+1)$th layer has feedforward matrix $A_{j+1}$ and no feedback. Let $x_t^{(k)}$ be the output vector of the kth layer; then the following holds:

$x_t^{(i)} = \psi(A_i x_t^{(i-1)} + B_i x_{t-1}^{(j)})$

$x_t^{(k)} = \psi(A_k x_t^{(k-1)} + B_k x_{t-1}^{(h_k)}), \quad h_k = \tfrac{1}{2} f_1(k) \le j$

$x_t^{(j+1)} = \psi(A_{j+1} x_t^{(j)})$.  (24)

Let $m_k$ be the size of the kth layer of $N_1$. The network $N_2$ (also with n layers) is defined as follows:

• The jth layer of $N_2$ has size $m_j + M$, for some suitable M, with feedforward matrix $\begin{bmatrix} A_j \\ \tilde{A}_j \end{bmatrix}$ and feedback matrix $\begin{bmatrix} B_j & 0 \\ \tilde{B}_j & 0 \end{bmatrix}$,  (25)

where $A_j \in \mathbb{R}^{m_j \times m_{j-1}}$, $\tilde{A}_j \in \mathbb{R}^{M \times m_{j-1}}$, $B_j \in \mathbb{R}^{m_j \times m_j}$, $\tilde{B}_j \in \mathbb{R}^{M \times m_j}$. Note that, because $f(i) = 2j$, by definition of nesting function the layer j either has no feedback or has a narrow feedback onto itself.

• The $(j+1)$th layer has size $m_{j+1} + m_j$ and feedforward matrix $\begin{bmatrix} A_{j+1} & 0 \\ 0 & \tilde{A}_{j+1} \end{bmatrix}$,  (26)

where $A_{j+1} \in \mathbb{R}^{m_{j+1} \times m_j}$, $\tilde{A}_{j+1} \in \mathbb{R}^{m_j \times M}$, and the feedback matrix is empty because $f(j+1) = 2(j+1) - 1$.

• The $(j+2)$th layer has size $m_{j+2}$, feedforward matrix $[A_{j+2} \mid 0]$ and feedback matrix $B_{j+2}$.  (27)

• The ith layer has size $m_i$, feedforward matrix $A_i$ and feedback matrix $[0 \mid B_i]$,  (28)

with the 0 matrix appearing in the feedback matrix an element of $\mathbb{R}^{m_i \times m_{j+1}}$. Moreover, the ith layer has its feedback path attached to layer $j+1$ instead of layer j.

• All the other layers have the same size and structure as the corresponding layers in $N_1$.

One can easily be convinced that the architecture function of the network $N_2$ so defined is indeed $f_2$. We must show that $N_2$ can compute the same function as $N_1$ within an approximation ε.

To this end, let $y_t^{(j)}$ be the output of the jth layer of $N_2$ at time t. For the sake of simplicity, the outputs of the layers j and j+1 will be divided as

$y_t^{(j)} = \begin{bmatrix} z_t^{(j)} \\ q_t^{(j)} \end{bmatrix}$  (29)

$y_t^{(j+1)} = \begin{bmatrix} z_t^{(j+1)} \\ q_t^{(j+1)} \end{bmatrix}$  (30)

with $z_t^{(j)} \in \mathbb{R}^{m_j}$ and $z_t^{(j+1)} \in \mathbb{R}^{m_{j+1}}$. From the structure (25), we can write the equations for the jth layer as

$z_t^{(j)} = \psi(A_j y_t^{(j-1)} + B_j z_{t-1}^{(j)})$

$q_t^{(j)} = \psi(\tilde{A}_j y_t^{(j-1)} + \tilde{B}_j z_{t-1}^{(j)})$,  (31)

the equations for the $(j+1)$th layer as

$z_t^{(j+1)} = \psi(A_{j+1} z_t^{(j)})$  (32)

$q_t^{(j+1)} = \psi(\tilde{A}_{j+1} q_t^{(j)})$,  (33)

and the equation for the layer j+2 as

$y_t^{(j+2)} = \psi(A_{j+2} z_t^{(j+1)} + B_{j+2} y_{t-1}^{(j+2)})$.  (34)

Note that eqn (33) can be written as

$q_t^{(j+1)} = \psi(\tilde{A}_{j+1} \psi(\tilde{A}_j y_t^{(j-1)} + \tilde{B}_j z_{t-1}^{(j)}))$.  (35)

By Lemma 9.2, it is possible to choose M, $\tilde{A}_{j+1}$, $\tilde{A}_j$ and $\tilde{B}_j$ such that

$\|q_t^{(j+1)} - z_t^{(j)}\| < \delta$

for any specified δ.

If we start with the equality $x_t^{(i-1)} = y_t^{(i-1)}$, whose validity derives from the equality of the two networks up to layer i, we can see, from the equations above, that, if we could take δ = 0 (that is, if we could reproduce exactly the function of layer j+1), then we would have $z_t^{(j)} = x_t^{(j)}$, $z_t^{(j+1)} = x_t^{(j+1)}$, and $y_t^{(j+2)} = x_t^{(j+2)}$.

This is not possible in general but, because the function computed by the network is continuous, we find that

$\forall \epsilon > 0 \ \exists \delta > 0 : \|q_t^{(j+1)} - z_t^{(j)}\| < \delta \Rightarrow \|y_t^{(n)} - x_t^{(n)}\| < \epsilon$.  (36)

If $f(j) = 2h$, the computation is more involved, because the feedback connection of layer i must be moved to layer h, and all the layers between j and h must be augmented, just like layer j+1 in the case discussed here. However, the arguments are the same, and are based on the application of Lemma 9.2. ∎

By reiterated application of Lemma 9.3, we can prove the following theorem.

THEOREM 9.1. If $s_1, s_2 \in \mathcal{A}_n$ and $s_1 \le s_2$, then, for every network $N_1$ such that $\pi_a(N_1) = s_1$, and for every $\epsilon > 0$, there exists a network $N_2$ with $\pi_a(N_2) = s_2$ such that $N_1 \trianglelefteq N_2(\epsilon)$.

This theorem can also be stated in a way that is not related to any particular realization $N_1$ and $N_2$. For $s \in \mathcal{A}_n$, let:

$\mathcal{F}_s^{(\epsilon)} = \{\varphi \in C \mid \exists N \in S_n : \pi_a(N) = s, \ y_t = \varphi(x_t), \ xN \to \tilde{y} \Rightarrow \|y_t - \tilde{y}_t\| < \epsilon\}$.  (37)

Then Theorem 9.1 is equivalent to:

THEOREM 9.2. If $s_1, s_2 \in \mathcal{A}_n$, then, for all ε,

$s_1 \le s_2 \Rightarrow \mathcal{F}_{s_1}^{(\epsilon)} \subseteq \mathcal{F}_{s_2}^{(\epsilon)}$.  (38)

PART II: COMPUTING POWER OF BLOCK FEEDBACK NETWORKS

In Part I, we have seen that BFN architectures make up an algebra and that the partial ordering implied by this algebra corresponds to an increasing computing power of the networks. This leaves open the question of what kind of devices can be built within the BFN framework. More specifically, it leaves open the question of whether the BFN model is universal, that is, whether any computing device can be represented as a BFN.

In this part, we show that the BFN model has the same computing power as a Turing machine. That is, for any Turing-computable function $f : \mathbb{N} \to \mathbb{N}$ there exists a BFN $\mathcal{N}(f)$ such that, considering a finite set $M \subset \mathbb{N}$, for every $n \in M$, if $n$ is given as input to the BFN, then, after a finite number of steps, the output of the network is equal to $f(n)$.

This will be proved by proving the equivalence of the class of BFN-computable functions with the class of μ-recursive functions which, in turn, is known to be equivalent to the class of Turing-computable (T-computable) functions.

10. μ RECURSIVITY AND TURING COMPUTABILITY

Throughout this second part of the paper, we will consider functions defined on the set of integer numbers and taking integer values. This may seem a strong limitation, because one often uses neural networks to approximate functions defined on $\mathbb{R}^n$. However, integer numbers can be used to represent the elements of any set whose cardinality is, at most, $\aleph_0$. This can be done, for example, by using the Gödel numbering (Gödel, 1986). Among the sets whose cardinality is at most $\aleph_0$ there is the set $\mathbb{Q}$ of rational numbers, that is, of the numbers of the form $p/q$, $p, q \in \mathbb{N}$, $q \neq 0$. This set is dense in the set of real numbers, and so $\mathbb{Q}^N$, which also has cardinality $\aleph_0$, is dense in $\mathbb{R}^N$. Therefore, any function $f : \mathbb{R}^N \to \mathbb{R}^M$ can be approximated arbitrarily well by a function $f : \mathbb{Q}^N \to \mathbb{Q}^M$ and, consequently, by a network implementing a function defined on the set of integers.
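As a concrete illustration of this kind of encoding, the sketch below (ours, for illustration only; the function names are not taken from the paper) uses the Cantor pairing function to map a pair $(p, q)$, and hence a rational $p/q$, to a single natural number, which is the sort of integer code a network defined on $\mathbb{N}$ can operate on.

```python
def cantor_pair(p: int, q: int) -> int:
    """Map a pair of naturals (p, q) bijectively to a single natural number."""
    return (p + q) * (p + q + 1) // 2 + q

def cantor_unpair(z: int) -> tuple:
    """Invert cantor_pair."""
    w = int(((8 * z + 1) ** 0.5 - 1) // 2)   # index of the largest triangular number <= z
    q = z - w * (w + 1) // 2
    return w - q, q

# A rational p/q is represented by the integer code of the pair (p, q).
code = cantor_pair(3, 7)                     # encodes 3/7
assert cantor_unpair(code) == (3, 7)
```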

10.1. Primitive Recursive Functions

Let $h_1, \ldots, h_r$ be $n$-ary functions (i.e., $h_i : \mathbb{N}^n \to \mathbb{N}$), $g$ be an $r$-ary function, and $x = (x_1, \ldots, x_n) \in \mathbb{N}^n$. If, for any $x$, it holds that

$$f(x) = g\bigl(h_1(x), \ldots, h_r(x)\bigr), \qquad (39)$$

the function $f$ is said to be obtained from $g$ by substitution of $h_1, \ldots, h_r$.

Let $g$ be an $n$-ary function, and $h$ an $(n+2)$-ary function. If, for any $x \in \mathbb{N}^n$, $y \in \mathbb{N}$, it holds that

$$f(x, 0) = g(x), \qquad f(x, y') = h\bigl(x, y, f(x, y)\bigr) \qquad (40)$$

(where $y'$ is the successor of $y$), then $f$ is said to be defined by induction from $g$ and $h$.

We also consider three elementary functions, whose definitions are self-contained:
1. the successor function, which, for $x \in \mathbb{N}$, has value $S(x) = x' = x + 1$;¹
2. the identity functions $U_i^n$, $1 \le i \le n$, defined as $U_i^n(x_1, \ldots, x_n) = x_i$ for any $x_1, \ldots, x_n$;
3. the 0-ary constant $C^0$ (whose value is 0).

DEFINITION 10.1. A function is said to be primitive recursive if it is one of the functions 1, 2, 3 above, or if it is obtained by finite application of induction and substitution to the functions 1, 2, 3 above.

To state it differently, the function $f$ is primitive recursive if it is obtained by applying substitution or induction to primitive recursive functions. For instance, if the function $h$ in eqn (40) is primitive recursive, so is $f$.
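The two schemes can be paraphrased in a short Python sketch (ours; the helper names substitution and induction are merely illustrative). Addition, defined by induction from the identity and the successor exactly as in footnote 1, serves as a check.

```python
def successor(x):             # S(x) = x'
    return x + 1

def identity(i):              # U_i^n: projection onto the i-th argument (1-based)
    return lambda *xs: xs[i - 1]

def substitution(g, *hs):     # f(x) = g(h_1(x), ..., h_r(x)), eqn (39)
    return lambda *xs: g(*(h(*xs) for h in hs))

def induction(g, h):          # f(x, 0) = g(x); f(x, y') = h(x, y, f(x, y)), eqn (40)
    def f(*args):
        *x, y = args
        acc = g(*x)
        for k in range(y):
            acc = h(*x, k, acc)
        return acc
    return f

# The sum, defined by induction as in footnote 1: f(x, 0) = x, f(x, y') = S(f(x, y)).
add = induction(identity(1), lambda x, y, fxy: successor(fxy))
assert add(3, 4) == 7

# Substitution example: double(x) = add(U_1(x), U_1(x)).
double = substitution(add, identity(1), identity(1))
assert double(5) == 10
```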

A similar definition holds for predicates on $n$-tuples of integer numbers.

DEFINITION 10.2. An $n$-ary predicate $P$ ($n \ge 1$) is primitive recursive iff there exists an $n$-ary primitive recursive function $f$ such that

¹ In the definition of $S(x)$ we have used the sum $x + 1$. This may lead to the impression that $S(x)$ is defined in terms of the sum. Actually, the sum was used only as a shorthand. The successor should be considered as a primitive function, not defined in terms of anything else, as in Peano's axioms. The sum is a function defined by induction in terms of it. If $x + y$ is regarded as a function of $y$, parametrized by $x$, then

$$f(x, 0) = x, \qquad f(x, y') = S\bigl(f(x, y)\bigr).$$


$$Px \;\Leftrightarrow\; f(x) = 0; \qquad (41)$$

$f$ is called the characteristic function of the predicate $P$.

Given an $(n+1)$-ary predicate $Pxy$, the μ operator applied to $P$, $\mu y\,Pxy$, gives the lowest value $y$ for which $Pxy$ holds. Note that the μ operator gives a result only if there exists at least one $y$ such that $Pxy$. A predicate $Pxy$ is regular if such a $y$ exists for all $x$. Similarly, the primitive recursive function $f(x, y)$ is said to be regular if, for any $x$, there exists at least one $y$ such that $f(x, y) = 0$.

If $g(x, y)$ is a regular function, the function $f(x)$ is obtained by application of the μ operator if:
1. for any $x$ there exists at least one $y$ such that $g(x, y) = 0$, and
2. $f(x)$ is the smallest $y$ such that $g(x, y) = 0$.

DEFINITION 10.3. A function $f$ is said to be μ-recursive if it is obtained by repeated application of
1. substitution,
2. inductive definition,
3. the μ operator (applied to regular functions)
to the three elementary functions.
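A minimal sketch (ours) of the μ operator under the regularity assumption: the unbounded search below terminates for every x precisely because a zero of g is assumed to exist.

```python
def mu(g):
    """Return f(x) = the least y with g(x, y) = 0.

    The search is unbounded; it terminates for every x only if g is regular,
    i.e., if such a y always exists (the operator is applied to regular
    functions only).
    """
    def f(*x):
        y = 0
        while g(*x, y) != 0:
            y += 1
        return y
    return f

# Example: integer square root as the least y with (y + 1)^2 > x, written as a zero test.
isqrt = mu(lambda x, y: 0 if (y + 1) ** 2 > x else 1)
assert isqrt(10) == 3
```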

The importance of the definition of μ computability lies in the following theorem (Hermes, 1969).

THEOREM 10.1. Any Turing-computable function is μ-recursive, and any μ-recursive function is Turing computable.

11. BUILDING UP ELEMENTARY TOOLS

Our proof of the equivalence between BFN computability and μ recursivity is based upon the construction of a BFN that computes an arbitrarily selected μ-recursive function. We do this by developing networks that implement the elementary functions, the substitution and induction schemes, and the μ operator.

All these networks must be integrated together to make up a single complex network that computes the required function. Integration is done by a pair of neurons that we call the "go" and the "done" neurons. Suppose the network $N_1$ has to perform some operation which will, in general, take several time steps to be completed. The network $N_1$ needs to know when to start computing, and the environment of $N_1$ needs to know when $N_1$ has finished and its results are ready to be used. To do this, we add to the network $N_1$ an extra input neuron, which we call the go neuron, and an extra output neuron, which we call the done neuron (see Figure 5). We assume that the inputs to the network remain constant while go is active, and that go remains active at least until done becomes active. This means that the network $N_1$ "senses" constant inputs and constant activations throughout its computation. This will not constitute a limitation, thanks to the memory block we will develop shortly.

FIGURE 5. The "go" and "done" organization of subnetworks. The subnetwork illustrated receives the input data and a "go" signal. During its work, the data are kept constant. When the network finishes, it outputs the results and activates the "done" signal.

11.1. Number Representation and Output Functions

To develop network-implemented functions taking values on the set of integer numbers (and, by extension, on any set whose cardinality is at most $\aleph_0$), we need to define a representation of integer numbers in terms of neurons. Throughout this paper, we will use the binary representation, with every neuron representing either a 0 or a 1 value. The numbers from 0 to $n - 1$ can be represented in this way using $\lceil \log_2 n \rceil$ neurons. Of course, any other consistent representation would do as well.

To compute any T-computable function, for any value of the argument, we would need an infinite number of neurons. This is a general property of the number representation, and in a sense it corresponds to the "infinite tape" condition for Turing machines. To ease the development, we will restrict ourselves to a $B$-bit representation of numbers. Of course, $B$ depends on the interval where the functions have to be computed but, for every finite interval $I \subset \mathbb{N}$ assumed as the domain of any given T-computable function $f$, there is a $B$ such that a BFN with a $B$-bit representation can compute $f(n)$ for all $n \in I$.

As far as the neuron output function is concerned, we will follow the dominant trend in the MLP literature and assume the sigmoid function

$$\sigma(x) = \frac{1}{1 + \exp\bigl(-\beta(x - \theta)\bigr)}. \qquad (42)$$

The use of this function gives rise to some approximation issues. We assume a representation made up of 0 and 1 values, but the value of $\sigma(x)$ always lies in the open interval $(0, 1)$.² This produces a representation error, which propagates through the network, invalidating the results obtained.

² This problem cannot be avoided when dealing with finite-codomain functions. Had the range of the output function been the closed interval $[0, 1]$, its first derivative would have had to be 0 over a finite interval of the argument. But any finite interval with 0 derivative makes gradient descent methods, and thus the algorithm in Santini et al. (1991), inapplicable.

In any case, for any network with $M$ neurons, it is possible to adjust the slope $\beta$ of the output functions [eqn (42)] so that, while retaining the continuity of the function, the error is less than any arbitrarily set positive number. We assume that, for any network we build, the appropriate $\beta$ has been chosen to make the representation error negligible (i.e., less than an arbitrary value $\epsilon$). The output function with this characteristic will be referred to as $\sigma_\beta$. When the threshold value $\theta$ of eqn (42) needs to be represented, it will be added as a superscript, as in $\sigma_\beta^{(\theta)}$. We also define $\omega$ as an arbitrary real value such that $\sigma_\beta(\omega) > 1 - \epsilon$ and $\sigma_\beta(-\omega) < \epsilon$.
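As a numerical illustration (ours, under the parametrization of eqn (42) given above), a moderate slope already satisfies the two saturation conditions for a prescribed ε:

```python
import math

def sigma(x, beta, theta=0.0):
    # Logistic output function with slope beta and threshold theta, as in eqn (42).
    return 1.0 / (1.0 + math.exp(-beta * (x - theta)))

eps, omega = 1e-3, 1.0
beta = 1.1 * math.log((1.0 - eps) / eps) / omega   # slope chosen with a 10% margin
assert sigma(omega, beta) > 1.0 - eps              # "high" activations are within eps of 1
assert sigma(-omega, beta) < eps                   # "low" activations are within eps of 0
```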

In the following we will also use the notation $I^{n\times n}$ for the identity matrix and $e_i$ for the vector made up of all zeros, with a "1" in the $i$th position.

11.2. Some Basic Blocks

In this subsection we develop some specialized neural blocks that will be useful in the following to build more complex networks.

Some of these blocks only have to perform a static mapping from input to output. Due to the approximation theorem (Cotter, 1990; Hornik, 1991), a three-layer feedforward network can perform any required operation (subject to some regularity constraints) with any degree of accuracy.

Thanks to this theorem, we can use a feedforward network to build each of the following blocks, where the inputs and the outputs are supposed to be binary representations of integer numbers:
1. the successor network $\mathcal{S}_+$, such that $\forall x$, $x\,\mathcal{S}_+\,y \Rightarrow y = x'$;
2. the predecessor network $\mathcal{S}_-$, such that $\forall x > 0$, $x\,\mathcal{S}_-\,y \Rightarrow x = y'$;
3. the zero network $\mathcal{Z}$, such that $x\,\mathcal{Z}\,y \Rightarrow (y = 1 \Leftrightarrow x = 0)$;
4. the not zero network $\bar{\mathcal{Z}}$, such that $x\,\bar{\mathcal{Z}}\,y \Rightarrow (y = 1 \Leftrightarrow x \neq 0)$;
5. the unit network $\mathcal{U}$, such that $x\,\mathcal{U}\,y \Rightarrow y = x$.

We now use these feedforward blocks to build less straightforward basic tools.

The 0-enable block has $B + 1$ inputs and $B$ outputs. The first $B$ inputs carry an integer value $n$; the $(B+1)$th input carries a control signal $e$. If $e$ is zero, the output is equal to $n$; if $e$ is 1, the output is zero. The 0-enable network is described by

$$\mathcal{E}_0 = \bigl[\,2\omega I^{B\times B} \;\; -2\omega\mathbf{1}\,\bigr]^{(\omega)}. \qquad (43)$$

The 1-enable block is the dual of the 0-enable block. The only difference is that this block allows the $n$ value to be transferred to the output when the $e$ input is equal to 1:

$$\mathcal{E}_1 = \bigl[\,2\omega I^{B\times B} \;\; 2\omega\mathbf{1}\,\bigr]^{(3\omega)}. \qquad (44)$$
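The behavior of the two enable blocks can be checked directly, using a saturated sigmoid and weight and threshold choices consistent with the description above (a sketch of ours; the slope value is arbitrary):

```python
import math

def sigma(x, beta=50.0):
    return 1.0 / (1.0 + math.exp(-beta * x))

w = 1.0   # the saturation value omega

def enable0(bits, e):
    # 0-enable: output equals n when e = 0, and is all zeros when e = 1.
    return [round(sigma(2 * w * b - 2 * w * e - w)) for b in bits]

def enable1(bits, e):
    # 1-enable: output equals n when e = 1, and is all zeros when e = 0.
    return [round(sigma(2 * w * b + 2 * w * e - 3 * w)) for b in bits]

n = [1, 0, 1, 1]
assert enable0(n, 0) == n and enable0(n, 1) == [0, 0, 0, 0]
assert enable1(n, 1) == n and enable1(n, 0) == [0, 0, 0, 0]
```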

The pulse block has one input and one output. When the input steps to "1," the network produces a one-tick-wide output pulse, then returns to zero, and there it rests until the input has been reset to 0 and set to 1 again. This network is the neural version of the monostable logic circuit; it is represented in Figure 6, and its description is

(45)
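Behaviorally, the pulse block is an edge detector with a one-step memory. The sketch below (ours; the two-neuron layer and its weights are illustrative, not taken from eqn (45)) reproduces the described behavior: a sustained "1" at the input yields a single one-tick output pulse.

```python
import math

def sigma(x, beta=50.0):
    return 1.0 / (1.0 + math.exp(-beta * x))

w = 1.0

def pulse_run(u_sequence):
    m_prev, out = 0.0, []
    for u in u_sequence:
        p = sigma(2 * w * u - 3 * w * m_prev - w)   # fire iff u = 1 and the delayed input is 0
        m_prev = sigma(2 * w * u - w)               # one-step-delayed copy of the input
        out.append(round(p))
    return out

assert pulse_run([0, 1, 1, 1, 0, 1, 1]) == [0, 1, 0, 0, 0, 1, 0]
```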

The reverse net is a simple layer that, when a binary number is presented at its input, reverses all its bits:

$$\mathcal{R} = \bigl[\,-2\omega I^{B\times B}\,\bigr]^{(-\omega)}. \qquad (46)$$

The store block has $B + 1$ inputs and $B$ outputs. The first $B$ inputs contain the representation of an integer number $n$, whereas the $(B+1)$th carries a control input $e$. When $e$ is equal to 1, the input $n$ is transferred to the block's output. When $e$ is equal to 0, the value of $n$ currently in output is retained. The store network is represented in Figure 7 and described by

$$\bigl[\,\bigl(3\omega I^{B\times B}\;\; -3\omega I^{B\times B}\bigr) \;\big|\; 2\omega I^{B\times B}\,\bigr]. \qquad (47)$$
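Functionally, the store block is a B-bit latch. A behavioral sketch (ours), at the level of bits rather than of neurons and weights:

```python
def store_run(steps):
    """steps: sequence of (bits, e); returns the block's output after each step."""
    held, trace = [0, 0, 0, 0], []
    for bits, e in steps:
        if e:                        # e = 1: the input is transferred to the output
            held = list(bits)
        trace.append(list(held))     # e = 0: the current output value is retained
    return trace

steps = [([0, 1, 1, 0], 1), ([1, 1, 1, 1], 0), ([1, 0, 0, 1], 1), ([0, 0, 0, 0], 0)]
assert store_run(steps)[-1] == [1, 0, 0, 1]
```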

The back count block has $B + 2$ inputs: a $B$-bit integer number, a load signal, and a count input. It "loads" a value into its memory element when the load signal is active and then, for each pulse at the count signal, decrements

FIGURE 6. The network that sends a single pulse when activated.


FIGURE 7. Network that stores a value. The value to be stored is given in input to the enable blocks. When a pulse is applied, the value is moved to the output, and there it remains until a new pulse is sent to store a new value.

the value until it arrives at 0. The actions taken by the back count block after its output reaches 0 are undefined. Its description is

(48)

with $Q = \mathrm{block\;diag}(5\omega, -5\omega, 5\omega, \omega)$ and

$$A = \bigl[\,5\omega I^{B\times B}\;\; -5\omega I^{B\times B}\;\; 0\;\; \omega I^{B\times B}\;\; 0\;\; 0\,\bigr]. \qquad (49)$$

Note that a forward counter $\mathcal{C}_+$ can be obtained simply by substituting the block $\mathcal{S}_-$ in eqn (48) with a block $\mathcal{S}_+$.
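A behavioral model (ours) of the back count block and of the forward counter derived from it; the load/count interface follows the description above, while the internal weights of eqns (48)-(49) are abstracted away. Here the value is simply clamped at 0, although the block's behavior past 0 is left undefined in the text.

```python
class Counter:
    """Behavioral model of the back count block (step = -1) or forward counter (step = +1)."""

    def __init__(self, step=-1):
        self.step, self.value = step, 0

    def tick(self, n=0, load=0, count=0):
        if load:                                           # "load": store n in the memory element
            self.value = n
        elif count and (self.step > 0 or self.value > 0):  # one "count" pulse: one increment/decrement
            self.value += self.step
        return self.value

back = Counter(step=-1)
back.tick(n=3, load=1)
assert [back.tick(count=1) for _ in range(4)] == [2, 1, 0, 0]
```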

12. COMPUTING μ-RECURSIVE FUNCTIONS

In this section we will show that, for any μ-recursive function $f : \mathbb{N} \to \mathbb{N}$, there exists a BFN that computes $f(n)$ for all $n < M$, $M$ being an arbitrary integer number.

Let $F$ be the class of all BFN-computable functions. To prove our assertion we have to show that:
1. The functions $S(x)$, $U_i^n$, and $C^0$ are in $F$.
2. If $g \in F$, $h_i \in F$, $i = 1, \ldots, r$, and $f$ is obtained from $g$ by substitution of $h_1, \ldots, h_r$, then $f \in F$.
3. If $h \in F$, $g \in F$, and $f$ is obtained by induction from $g$ and $h$, then $f \in F$.
4. If $P$ is an $(n+1)$-ary predicate such that $Pxy \Leftrightarrow q(x, y) = 0$, $q \in F$, and $f(x) = \mu y\,Pxy$, then $f \in F$.

The first point can be worked out by feedforward neural networks. The successor $S(x)$ can be implemented (under the hypothesis of binary representation of integer numbers) by a suitable network of NAND circuits. The NAND circuit can be put in canonical form, and the layers of the canonical form can be mapped onto the layers of the feedforward neural network.
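For instance, a B-bit successor reduces to a ripple-carry increment, each stage of which uses only XOR and AND (and hence NAND) logic and therefore maps onto a fixed feedforward layer. A sketch of the arithmetic (ours; the canonical-form construction itself is not reproduced here):

```python
def successor_bits(bits):
    """Increment a B-bit number given as a list of 0/1 values, least significant bit first."""
    out, carry = [], 1              # adding 1: the initial carry is 1
    for b in bits:
        out.append(b ^ carry)       # sum bit = b XOR carry
        carry = b & carry           # carry   = b AND carry
    return out                      # the carry out of the last bit is dropped

def to_bits(n, B):
    return [(n >> i) & 1 for i in range(B)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

assert from_bits(successor_bits(to_bits(11, 5))) == 12
```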

The unit functions $U_i^n$ can be implemented by a feedforward layer such as

(50)

with

$$B = \bigl[\,0^{B\times B}\;\cdots\;0^{B\times B}\;\; 2\omega I^{B\times B}\;\; 0^{B\times B}\;\cdots\;0^{B\times B}\,\bigr], \qquad (51)$$

where the block $2\omega I^{B\times B}$ occupies the $i$th position and the zero blocks occupy positions $1, \ldots, i-1$ and $i+1, \ldots, n$.

The 0-ary constant can be computed by a neuron with all weights at $-\omega$.

To be inserted in the network architectures we design, these functions need to accept a "go" signal and to yield a "done" signal when the computation is finished. Let $N$ be a feedforward network that computes one of the elementary functions; $N$ can be extended so that it performs the same computation with the required control signals.

Now, let $\mathcal{H}_1, \ldots, \mathcal{H}_r$ be $r$ BFNs that compute the functions $h_1, \ldots, h_r$, and $\mathcal{G}$ the BFN that computes the function $g$. It is apparent that

$$\mathcal{G} \cdot \begin{pmatrix} \mathcal{H}_1 \\ \vdots \\ \mathcal{H}_r \end{pmatrix} \qquad (53)$$

computes the function $f$ obtained from $g$ by substitution of $h_1, \ldots, h_r$.

Let us now consider the induction scheme. This is quite a complex realization, and it may be better understood if decomposed into pieces. Let us begin by considering only the second of eqn (40), and assume $f(x, 0) = 0$.

We have a network $\mathcal{H}$ that computes the function $h$. The existence of such a network is guaranteed by point 2 above, because $h$ is supposed to be BFN computable.

FIGURE 8. Network that implements the induction scheme. This network assumes no initialization for the function obtained from the induction scheme. If f is the function obtained by induction on h, then it is assumed that f(0) = 0.


FIGURE 9. Network that implements the induction scheme. With initialization f(0) = g(x).

We claim that the network of Figure 8 solves the problem. This network is described by

(54)

In fact, let $\mathcal{N}$ be the network (54) and set $x_t = [\,x \mid n_t \mid d_t \mid e_t \mid h_{t-1}\,]$.

Consider the four blocks separately. If we issue the "$e$" signal to the enable blocks $\mathcal{E}_0$ and $\mathcal{E}_1$, then

$$x_t\,\mathcal{N}\,y_t \;\Rightarrow\; y_t = [\,x \mid n_t - 1 \mid 1 \mid h_{t-1}\,]. \qquad (55)$$

Therefore, $n_t$ runs from $n$ to 0, while successive values of $h$ are fed into the $\mathcal{H}$ block. Note that $n_t$ is not fed back and therefore does not enter the actual computation. It is used, in conjunction with the block $\mathcal{Z}$, only to detect the end of the computation and issue the done signal.

The insertion of the first of eqn (40) can be easily achieved by forcing, at the initial step, the value $g(x)$ into the last $\mathcal{U}$ block of Figure 8, that is, the feedback layer where the value of $f(n-1)$ is taken back. Suppose we have a network that computes $g$ (again, this is allowed by the assumption we make that $g$ is BFN computable) and uses the same "go-done" signal scheme used for $h$. We must design a network that takes as input $x$, $n$, and the "go" signal, and yields as output $x$, $n$, $g(x)$, and a "done" signal to state the end of the computation of $g$. If $\tilde{\mathcal{G}}$ is the network that performs this operation, then

$$\tilde{\mathcal{G}} = \begin{pmatrix} \mathcal{U} \\ \mathcal{G} \end{pmatrix}, \qquad (56)$$

where $\mathcal{G}$ is the network that computes $g$. To attach this to the induction network, we substitute the last $\mathcal{U}$ block in the first layer of Figure 8 with a slightly different block, made up of the single feedback layer

(57)

The result of this process is the network

(58)

which implements the induction scheme. This network is depicted in Figure 9.
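The operation of the networks of Figures 8 and 9 can be paraphrased as follows (a behavioral sketch of ours; h and g stand for the sub-networks computing h and g, and the explicit counter and zero test stand in for the corresponding blocks of the network):

```python
def induction_network(h, g=lambda *x: 0):
    """Behavioral model of Figures 8/9: f(x, 0) = g(x), f(x, y') = h(x, y, f(x, y))."""
    def f(x, n):
        value = g(x)              # initialization forced into the feedback layer (Figure 9)
        counter, k = n, 0         # the counter runs from n down to 0
        while counter > 0:        # the zero test raises "done" when the counter reaches 0
            value = h(x, k, value)
            counter, k = counter - 1, k + 1
        return value
    return f

# Example: f(x, 0) = 1, f(x, y') = (y + 1) * f(x, y), i.e., the factorial of the second argument.
fact = induction_network(h=lambda x, y, fxy: (y + 1) * fxy, g=lambda x: 1)
assert fact(0, 5) == 120
```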

FIGURE 10. Network that implements the μ operator.


We finally prove the closure of the set $F$ with respect to the application of the μ operator to regular functions. To show this, let $\mathcal{F}$ be the network that computes the (BFN-computable) function $f(x, y)$. The network

$$(\mathcal{C}_+\;\;0\;\;0)\cdot(\mathcal{S}_+)\,\bigl\{(\mathcal{U})\cdot\bigl((\mathcal{F}\cdot\mathcal{Z}\cdot\mathcal{R})\bigr)\bigr\}, \qquad (59)$$

depicted in Figure 10, computes $\mu y\,Pxy$. Please note that if there is no $y$ such that $Pxy$, the network runs forever. This case, however, has been explicitly excluded, because we impose that $P$ have a regular characteristic function.

Packing together the above observations, we have the proof of the following theorem.

THEOREM 12.1. For any Turing-computable function $f$ and for every integer number $M$, there exists a BFN $\mathcal{N}$ such that

$$\forall n < M : n\,\mathcal{N}\,f(n). \qquad (60)$$

13. CONCLUSIONS

In this paper, we have discussed some architectural properties of the BFN model.

In the first part of the paper, we built two different quotient sets in the set of BFN networks: the set $\mathcal{A}_n$ of architectures and the set $\mathcal{D}_n$ of dimensions. We have shown that these two partitions are orthogonal and that a couple $(a, s)$, $a \in \mathcal{A}_n$, $s \in \mathcal{D}_n$, completely specifies a network structure.

Both the sets $\mathcal{A}_n$ and $\mathcal{D}_n$ can be endowed with a lattice structure, and we have determined the properties of these structures.

We have also shown that the ordering in the architectures implies an ordering in the computing capacity of the networks.

In the second part, we have considered the problem of "how far" this increase in computing capacity can lead us. We have shown that the BFN model is as powerful as μ recursivity, which, in turn, has the same power as the Turing machine.

REFERENCES

Blum, E. K., & Li, L. K. (1991). Approximation theory and feedforward networks. Neural Networks, 4, 511-515.

Cotter, N. E. (1990). The Stone-Weierstrass theorem and its application to neural networks. IEEE Transactions on Neural Networks, 1(4), 290-295.

Del Bimbo, A., Landi, L., & Santini, S. (1992). Dynamic neural estimation for autonomous vehicles driving. In Proceedings of the 11th International Conference on Pattern Recognition, The Hague, The Netherlands.

Del Bimbo, A., Landi, L., & Santini, S. (1993). Determination of road directions using feedback neural networks. Signal Processing, 32(1-2), 147-160.

Fahlman, S. E., & Lebiere, C. (1989). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 524-532). San Mateo: Morgan Kaufmann.

Gödel, K. (1986). On formally undecidable propositions of Principia Mathematica and related systems I. In Kurt Gödel: Collected works. New York: Oxford University Press.

Grätzer, G. (1971). Lattice theory. A Series of Books in Mathematics. New York: W. H. Freeman and Company.

Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, pp. I-593-I-605.

Hermes, H. (1961 (1969)). Aufzählbarkeit, Entscheidbarkeit, Berechenbarkeit (English version: Enumerability, Decidability, Computability). New York: Springer-Verlag.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4, 251-257.

Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2), 239-242.

Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4. San Mateo, CA: Morgan Kaufmann.

Mozer, M. C., & Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1 (pp. 107-115). San Mateo: Morgan Kaufmann.

Murata, N., Yoshizawa, S., & Amari, S.-i. (1992). Network information criterion: Determining the number of hidden units for an artificial network model (Tech. Rep.). Department of Mathematical Engineering and Information Physics, University of Tokyo, Bunkyo-ku, Tokyo 113, Japan.

Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4-27.

Santini, S., & Del Bimbo, A. (1993). Block feedback neural networks are universal computers (Tech. Rep. TR 19/93). Dipartimento di Sistemi e Informatica, Università di Firenze.

Santini, S., Del Bimbo, A., & Jain, R. (1991). An algorithm for training neural networks with arbitrary feedback structure (Tech. Rep. TR 10/91). Dipartimento di Sistemi e Informatica, Università di Firenze.

Santini, S., Del Bimbo, A., & Jain, R. (1995). Block-structured recurrent neural networks. Neural Networks, 8(1), 135-147.

Schwarze, H., & Hertz, J. (1992). Generalization in fully connected committee machines (Tech. Rep. CONNECT). The Niels Bohr Institute and Nordita, Blegdamsvej 17, DK-2100 Copenhagen Ø, Denmark.