



An Efficient Mapping of Multilayer Perceptron with Backpropagation ANNs on Hypercubes

Q.M. Malluhi, M.A. Bayoumi, T.R.N. Rao Center for Advanced Computer Studies University of Southwestern Louisiana

Lafayette, LA 70504

Abstract

This paper proposes a parallel structure, the mesh-of-appendixed-trees (MAT), for efficient implementation of artificial neural networks (ANNs). Algorithms to implement both the recall and the training phases of the multilayer perceptron with backpropagation ANN model are provided. A recursive procedure for embedding the MAT structure into the hypercube topology is used as the basis for an efficient mapping technique to map ANN computations on general purpose hypercube massively parallel systems. In addition, based on the mapping scheme, a fast special purpose parallel architecture for ANNs is developed. The major advantage of our technique is high performance. Unlike the other techniques presented in the literature, which require O(N) time, where N is the size of the largest layer, our implementation requires only O(log N) time. Moreover, it allows the pipelining of more than one input pattern and thus further improves the performance.

I. Introduction

Several characteristics of a basic ANN model of computation favor massively parallel digital implementation of ANNs. These include highly parallel operations, simple processing units (neurons), small local memory per neuron (distributed memory), robustness, and fault tolerance to connection or neuron malfunctioning [Ghos89]. Therefore, a highly parallel computing system of thousands of simple processing elements is a typical target architecture for implementing ANNs.

The authors acknowledge the support of the National Science Foundation and the State of Louisiana grant NSF/LEQSF (1992-96)-ADP-04.

Many parallel digital special purpose array processors for neural networks have been proposed in the literature. Typical examples include the L-Neuro chip [Dura89], the Ring Array Processor (RAP) [Beck90, Morg92], the CNAPS system [Hamm90], the SYNAPSE neurocomputer [Rama92], the GCN RISC processor array [Hira90], and the reconfigurable ANN chip [Madr91].

Several mapping schemes have been reported to implement neural network algorithms on available parallel architectures. These mapping schemes fall into two general categories: heuristic mapping and algorithmic mapping [Lin91]. Heuristic mapping is a trial and error approach in which the implementation tailors the target machine and depends on familiarity with the neural network model. Examples of this category of mapping schemes are given in [Brow87, Zhan89, Wah90, Chu92]. Algorithmic mapping is a systematically derived technique to map a neural network algorithm onto a specific massively parallel architecture. Examples of algorithmic mapping schemes are: the implementation of neural networks on the ring systolic array [Kung88, Hwan89], the mapping of ANNs on SIMD arrays [Tomb88, Lin91], and the implementation of ANNs on the hypercube architecture [Kim89]. The above discussion is summarized in Figure 1.

In this paper, we present an algorithmic mapping technique to implement the multilayer feedforward (multilayer perceptron) with backpropagation learning ANN model (FFBP) on hypercube massively parallel machines. The mapping scheme is facilitated by the development of an architecture called the Mesh-of-Appendixed-Trees (MAT). Algorithms to implement both the recall and the learning phases of FFBP ANNs are provided. A major advantage of this technique is its high performance. Unlike almost all the other techniques presented in the literature, which require O(N) time, where N is the size of the largest layer, this mapping scheme takes only O(log N) time. Another important feature of this method is that it allows the pipelining of more than one input pattern and thus further improves the performance.



Figure 1: ANN implementation methods. (Digital ANN implementations divide into serial machines and parallel machines; parallel machines divide into special purpose and general purpose; mapping onto general purpose machines divides into algorithmic mapping and heuristic mapping.)

This paper is organized as follows. Section II briefly describes a general ANN model of computation with some emphasis on the FFBP model. It also provides some terminology to be used throughout the paper. Section III describes the MAT structure and explains the mapping of both the recall and the learning phases of an FFBP ANN on the MAT architecture. In addition, Section III discusses the issue of pipelining multiple patterns. Section IV describes how to migrate the MAT implementation to hypercube machines. It shows that the MAT can be optimally embedded into the hypercube topology. In Section V, a modified MAT is used as a special purpose fast ANN computer. Section VI compares our technique with other techniques proposed in the literature. Finally, Section VII draws the conclusions.

II. ANN model of computation

A basic ANN model of computation consists of a large number of neurons connected to each other by connection weights (see Figure 2). Each neuron, say neuron i, has an activation value a_i. Associated with each connection from neuron j to neuron i is a synaptic weight (or simply, a weight) w_ij. The ANN computation can be divided into two phases: the recall phase and the learning phase. The recall phase updates the outputs (activation values) of neurons based on the system dynamics to produce the derived ANN output as a response to an input (test pattern). The learning phase performs an iterative updating of the synaptic weights based upon the adopted learning algorithm. The weights are updated in a way that minimizes an error function measuring how good the ANN output is. In other words, the learning phase teaches the ANN to produce the desired outputs. In some ANN models, the weight values are predetermined; therefore, no learning phase is required.

In this paper we deal with multilayer feedforward neural networks with backpropagation learning, which are central to much of the work going on in the field nowadays. The following two subsections describe multilayer feedforward ANNs and the error backpropagation learning algorithm.

Figure 2: A basic ANN topology.

Figure 3: An L-layer feedforward network.

II.1. Multilayer feedforward networks

A multilayer (L-layer) feedforward (FF) network (multilayer perceptron) has the general form shown in Figure 3. There is a set of input terminals whose only role is to feed input patterns into the rest of the network. After this, there are zero or more intermediate layers followed by a final layer where the result of the computation is read off. The intermediate layers are called the hidden layers and the L-th layer (final layer) is referred to as the output layer. The network interconnection is such that each node (neuron) in a layer receives input from every node of the previous layer. This interconnection topology implies that for every layer, say layer l, there is a weight matrix W[l] of the synaptic weights for the links between layers l and l−1.

We will use the index between brackets to indicate the layer number and a superscript to denote the test pattern number. For example, w_ij[l] represents the element in the i-th row and the j-th column of the weight matrix W[l] of the l-th layer, and a_i^p[l] represents the activation value of the i-th neuron in the l-th layer for the p-th input pattern I^p. We will use N_i to denote the number of neurons in layer i. For notational convenience, the input terminals will be considered as layer number 0; thus, N_0 will represent the number of input terminals.
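For example, a network with N_0 = 4 input terminals, one hidden layer of N_1 = 3 neurons, and an output layer of N_2 = 2 neurons has two weight matrices, W[1] of size 3 × 4 and W[2] of size 2 × 3; the entry w_21[2] is then the weight of the link from neuron 1 of layer 1 to neuron 2 of layer 2.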

For input pattern $I^p = (I_1^p, I_2^p, \ldots, I_{N_0}^p)$, the system dynamics for the recall phase are given by

$$a_i^p[0] = I_i^p, \qquad a_i^p[l] = f\big(h_i^p[l]\big) = f\Big(\sum_j w_{ij}[l]\, a_j^p[l-1]\Big), \quad l = 1, 2, \ldots, L. \qquad (1)$$



Algorithm 1:

1. Apply the input, a[0] = I.
2. Propagate the signal forward to compute the output a[L] using a_i[l] = f(Σ_j w_ij[l] a_j[l−1]), for l = 1, 2, ..., L.
3. Compute the deltas for the output layer, δ_i[L] = f'(h_i[L])(t_i − a_i[L]).
4. Compute the deltas for the preceding layers by propagating the error backwards, δ_i[l−1] = f'(h_i[l−1]) Σ_j w_ji[l] δ_j[l], for l = L, L−1, ..., 2.
5. Adjust all weights according to Δw_ij[l] = η δ_i[l] a_j[l−1], w_ij^new[l] = w_ij^old[l] + Δw_ij[l].
6. Repeat from step 1 for the next input pattern.

Figure 4: Backpropagation learning algorithm.

Each neuron i computes the weighted sum h_i of its inputs and then applies a nonlinear function f(h_i), producing an activation value (output) a_i for this neuron. The function f is usually a sigmoid function given by f(x) = 1/(1 + e^{−βx}).
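As an illustration of equation (1) only (a minimal sketch, not taken from the paper), the recall phase can be written directly in NumPy. The helper names sigmoid and recall and the gain parameter beta are ours; W is assumed to be a list whose (l−1)-th entry plays the role of the paper's W[l].

```python
import numpy as np

def sigmoid(x, beta=1.0):
    # Assumed activation f(x) = 1 / (1 + exp(-beta * x)).
    return 1.0 / (1.0 + np.exp(-beta * x))

def recall(W, I, beta=1.0):
    """Recall phase of equation (1): a[0] = I, a[l] = f(W[l] a[l-1])."""
    a = [np.asarray(I, dtype=float)]   # a[0] = I
    for Wl in W:                       # layers l = 1, ..., L
        h = Wl @ a[-1]                 # h_i[l] = sum_j w_ij[l] * a_j[l-1]
        a.append(sigmoid(h, beta))     # a_i[l] = f(h_i[l])
    return a                           # activations of all layers
```

For the 4-3-2 example above, W would contain matrices of shapes (3, 4) and (2, 3).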

II.2. Backpropagation learning

For each input pattern I^p, there is a target (desired) output t^p. The backpropagation (BP) learning algorithm gives a prescription for changing the synaptic weights in any feedforward network to learn a training set of input-target pairs. This type of learning is usually referred to as "supervised learning" or "learning by teacher". The learning phase involves two steps. In the first step, the input is presented at the input terminals and is processed by the ANN according to the recall phase equations. In the second step, the produced output is compared to the target, an error measurement value is propagated backward (from the output layer to the first layer), and appropriate changes to the weights are made. The second step proceeds along the following iterative equations:

$$\delta_i^p[L] = f'\big(h_i^p[L]\big)\big(t_i^p - a_i^p[L]\big)$$
$$\delta_i^p[l-1] = f'\big(h_i^p[l-1]\big) \sum_j w_{ji}[l]\, \delta_j^p[l]$$
$$\Delta w_{ij}[l] = \eta\, \delta_i^p[l]\, a_j^p[l-1]$$
$$w_{ij}^{new}[l] = w_{ij}^{old}[l] + \Delta w_{ij}[l] \qquad (2)$$

The BP algorithm is summarized in Figure 4.
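The weight-update equations (2) translate into an equally short sketch (again ours, not the authors' code). It reuses the hypothetical recall helper above and assumes the logistic sigmoid, for which f'(h) = beta * f(h) * (1 − f(h)):

```python
import numpy as np

def train_step(W, I, target, eta=0.1, beta=1.0):
    """One forward/backward pass following equations (2); W is updated in place."""
    a = recall(W, I, beta)                  # forward part (recall phase)
    L = len(W)
    # Output-layer deltas: delta_i[L] = f'(h_i[L]) * (t_i - a_i[L]).
    delta = beta * a[L] * (1.0 - a[L]) * (np.asarray(target, dtype=float) - a[L])
    for l in range(L, 0, -1):               # back propagation, l = L, ..., 1
        grad = np.outer(delta, a[l - 1])    # delta_i[l] * a_j[l-1]
        if l > 1:                           # deltas of the preceding layer (old weights)
            delta = beta * a[l - 1] * (1.0 - a[l - 1]) * (W[l - 1].T @ delta)
        W[l - 1] += eta * grad              # Delta w_ij[l] = eta * delta_i[l] * a_j[l-1]
    return a[L]                             # network output for pattern I
```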

III. Mapping the feedforward with backpropagation ANN model onto the MAT architecture

As stated earlier, our main objective is to map the multilayer feedforward with backpropagation learning (FFBP) neural network model on hypercube massively parallel machines. As an intermediate step, in this section we develop a mapping on what we call the mesh-of-appendixed-trees (MAT) architecture. The mapping procedure is developed in stages. We start the discussion by showing how to map the recall phase of a single layer on a MAT. After that, doses of complexity are gradually added by discussing more complex situations and by showing how to enhance the performance of the given mapping algorithm through processing more than one pattern in a pipelined fashion. Thenceforth, in Section IV, we tailor the MAT implementation for the hypercube.

Subsection III.1 presents a method to map the processing of the recall phase on the MAT. Subsection III.2 describes the implementation of the BP learning phase. Finally, an enhanced version of the mapping procedure that allows parallel (pipelined) processing of more than one input pattern is presented in Subsection III.3.

III.1. Recall Phase

As a first step, consider the recall phase and concentrate only on the operations of one layer, say layer l. That is, we only consider the computation of the layer l activation values (a_i[l] | 1 ≤ i ≤ N_l) from the activation values (a_i[l−1] | 1 ≤ i ≤ N_{l−1}) of the preceding layer. According to equation (1) we have

$$a_i[l] = f\Big(\sum_{j=1}^{N_{l-1}} w_{ij}[l]\, a_j[l-1]\Big), \qquad 1 \le i \le N_l.$$

This equation hints that the operations involved in the computation are as follows:

1. Distribute a_j[l−1] to all the elements of column j in the weight matrix W[l]. This is done for all 1 ≤ j ≤ N_{l−1}.
2. Multiply a_j[l−1] and w_ij[l] for all 1 ≤ i ≤ N_l and 1 ≤ j ≤ N_{l−1}.
3. Sum the results of the multiplications of step 2 along each row of W[l] to compute the weighted sums (h_i[l] | 1 ≤ i ≤ N_l).
4. Apply the activation function f(h_i[l]) for all 1 ≤ i ≤ N_l.

To start with, we suppose that each weight w_ij[l] is stored in a distinct processor WP_ij (Weight Processor). In addition, we assume that each of the activation values a_j[l−1] of layer l−1 is stored in processor CAP_j (Column Appendix Processor). An output activation value of layer l, a_i[l], will be produced in processor RAP_i (Row Appendix Processor).

The aforementioned four steps constitute the skeleton of our implementation.


Figure 5: (a) General MAT structure; (b) a 4 × 4 MAT.

As we will see, step 1 will be performed in O(log N_l) time units using a tree-like structure. All the multiplications of step 2 can be done in parallel in the various weight processors; therefore, step 2 will only take one unit of time. The summation of step 3 will be computed in O(log N_{l−1}) time units in a tree-like fashion. Finally, the function f is applied at once to all the h_i values in the different row appendix processors.

Implied by the above discussion is the architecture shown in Figure 5. The topology of the architecture is a variation of the ordinary mesh-of-trees topology and is referred to as the Mesh-of-Appendixed-Trees (MAT). The N × M MAT, where N = 2^n and M = 2^m, is constructed from a grid of N × M processors. Each row of this grid constitutes a set of leaves for a complete binary tree called an ART (Appendixed Row Tree). Similarly, each column of this grid forms a set of leaves for a complete binary tree called an ACT (Appendixed Column Tree). For each ART/ACT, there is an appendix RAP/CAP connected to the root of the tree. Figure 5(a) demonstrates the general structure of a MAT. Figure 5(b) shows a 4 × 4 MAT. Each of the N ARTs has 2M nodes and each of the M ACTs has 2N nodes. Summing up and subtracting the grid size, since the grid processors are members of both ARTs and ACTs, we get that the total number of nodes in a MAT is N(2M) + M(2N) − NM = 3NM.

The algorithm to map the operations of layer l of an FF ANN is given in Figure 6. For this purpose we use an N_l × N_{l−1} MAT. We suppose that the matrix W[l] is already entered into the grid processors (weight processors) so that processor WP_ij stores w_ij[l] in its local memory. Moreover, the activation values of layer l−1 are assumed to be placed into the CAPs so that a_j[l−1] is kept in the local memory of processor CAP_j. Step 1 of Algorithm 2 (see Figure 6) takes log N_l + 1 time steps because the depth of the ACTs is log N_l + 1. Likewise, step 3 takes log N_{l−1} + 1. We assume that computing the product in step 2 takes a single unit of time. In addition, we assume that the computation of the function f in step 4 requires only one time unit. As a result, the total time for the computation of layer l is T_l = log N_l + log N_{l−1} + 4.
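To make the timing model concrete, the helper below (ours, for illustration only) evaluates T_l = log N_l + log N_{l−1} + 4 and the 3·N_l·N_{l−1} node count of an N_l × N_{l−1} MAT for every layer of a network with power-of-two layer sizes.

```python
import math

def layer_costs(sizes):
    """sizes = [N_0, N_1, ..., N_L], each a power of two.

    Returns (T_l, MAT node count) for each layer l = 1, ..., L, where
    T_l = log N_l + log N_{l-1} + 4 and an N_l x N_{l-1} MAT has
    3 * N_l * N_{l-1} nodes.
    """
    costs = []
    for l in range(1, len(sizes)):
        n_l, n_prev = sizes[l], sizes[l - 1]
        t_l = math.log2(n_l) + math.log2(n_prev) + 4
        costs.append((t_l, 3 * n_l * n_prev))
    return costs

# Example: a 256-128-256 network.
print(layer_costs([256, 128, 256]))   # [(19.0, 98304), (19.0, 98304)]
```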

Thus far, we have seen the implementation of one layer. Now we will generalize the above discussion to a multilayer network. We start at layer 1 and perform the operations of Algorithm 2 to compute (a_i[1] | 1 ≤ i ≤ N_1) from (a_i[0] | 1 ≤ i ≤ N_0). The RAPs will contain the result of the computation. In order to continue the processing for the second layer, we would need to place (a_i[1] | 1 ≤ i ≤ N_1) into the CAPs. This takes log N_0 + log N_1 + 2 time steps. This time can be saved by storing the transpose of W[2] rather than W[2] itself into the grid of WPs and repeating the operations of Algorithm 2 backwards, starting from the RAPs and getting the results (a_i[2] | 1 ≤ i ≤ N_2) in the CAPs. For the third layer, W[3] is stored in the WPs and we start at the CAPs and get the resultant activation values at the RAPs, as we did for layer 1. We continue this way, going back and forth from CAPs to RAPs and from RAPs to CAPs, until we reach the output layer L.

Therefore, if N is the size of the largest layer of the ANN, we use an N × N MAT. We initialize the local memories of the WPs by storing W[1], W^t[2], W[3], W^t[4], ... in the upper left corner of the WP grid. In other words, if l is odd, w_ij[l] is stored into the local memory of WP_ij; otherwise, w_ij[l] is kept into the local memory of WP_ji. The CAP local memories are initialized to hold the ANN input values (I_i | 1 ≤ i ≤ N_0). After the initialization is complete, we proceed according to Algorithm 2.

Taking L as a constant, Lemma 1 below shows that we can process the recall phase of an FF ANN in a time complexity which is logarithmic in the size N of the largest layer. This is achieved at the expense of using an N × N MAT of 3N² processing units.

Lemma 1: Algorithm 2 takes less than 2L(log N + 2) time steps.

Proof: Let T_recall be the total time for computing the output. We have

$$T_{recall} = \sum_{l=1}^{L} T_l = \sum_{l=1}^{L} \big(\log N_l + \log N_{l-1} + 4\big) \le 2L\log N + 4L = 2L(\log N + 2). \;\square$$



Algorithm 2: {RECALL PHASE}

for l = 1 to L do
    RECALL_LAYER(l);
endfor.

procedure RECALL_LAYER(l)
    if l is odd then
        for all 1 ≤ j ≤ N_{l−1} do parbegin
            CAP_j passes a_j[l−1] downward through ACT_j so that WP_ij, 1 ≤ i ≤ N_l, receives a_j[l−1]
        parend
        for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l−1} do parbegin
            WP_ij finds the product w_ij[l] a_j[l−1]
        parend
        for all 1 ≤ i ≤ N_l do parbegin
            ART_i is used to sum the product values of its leaves and the result h_i[l] is sent to RAP_i
        parend
        for all 1 ≤ i ≤ N_l do parbegin
            RAP_i applies the function f(h_i[l])
        parend
    else {l is even}
        Same as the steps when l is odd, but replace CAP, ACT, ART, RAP, and WP_ij respectively by RAP, ART, ACT, CAP, and WP_ji
    endif
endprocedure.

Figure 6: Algorithm for the recall phase of an L-layer FF ANN.
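The following sequential sketch (ours, not part of the paper) mirrors the data flow of Algorithm 2, including the storage of W[l] for odd l and of its transpose for even l described earlier; the row and column sums stand in for the ART/ACT reductions.

```python
import numpy as np

def mat_recall(W, I, f=lambda h: 1.0 / (1.0 + np.exp(-h))):
    """Sequential sketch of the data flow of Algorithm 2.

    W[l-1] plays the role of the paper's W[l].  For odd l the grid of
    weight processors holds W[l] and the layer result appears at the
    RAPs; for even l it holds the transpose of W[l] and the result
    appears at the CAPs, so activations never need to be routed back.
    """
    a_prev = np.asarray(I, dtype=float)          # layer-0 values, held by the CAPs
    for l, Wl in enumerate(W, start=1):
        if l % 2 == 1:
            grid = Wl                            # WP_ij holds w_ij[l]
            prod = grid * a_prev[np.newaxis, :]  # CAP_j sends a_j[l-1] down ACT_j
            h = prod.sum(axis=1)                 # ART_i sums its row; h_i[l] -> RAP_i
        else:
            grid = Wl.T                          # WP_ij holds w_ji[l] (W[l] transposed)
            prod = grid * a_prev[:, np.newaxis]  # RAP_i sends a_i[l-1] down ART_i
            h = prod.sum(axis=0)                 # ACT_j sums its column; h_j[l] -> CAP_j
        a_prev = f(h)                            # appendix processors apply f
    return a_prev                                # output-layer activations
```

Mathematically both branches compute W[l]·a[l−1]; the two branches only make explicit which trees broadcast, which trees reduce, and at which appendices the result lands.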

Hitherto, the reader might have wondered: why did we not use the ordinary mesh-of-trees architecture instead of the MAT architecture? This point will be elucidated when we look into the pipelining issues in Section III.3. Had we used the ordinary mesh-of-trees, the root of each tree (row or column tree) would have been responsible for both adding the values it receives from its children and then applying the activation function f on the result of the add operation. This creates a bottleneck in the pipeline and disrupts its smooth flow of computation.

III.2. Learning Phase

As can be noticed from Algorithm 1, the learning phase is composed of two parts: a forward propagation part and a back propagation part. The forward part is identical to the recall phase whose implementation is provided in Algorithm 2. However, some values in Algorithm 2 have to be given special care because they will be needed later during the BP part. When an appendix RAP_i/CAP_i receives a weighted sum h_i from the root of its tree ART_i/ACT_i, it should apply the function f(h_i) and then save h_i for future use (notice the use of h_i in steps 3 and 4 of Algorithm 1). Similarly, when a WP receives an activation value a_j it multiplies it by the corresponding weight and then saves it, because it will be used in the calculation of Δw in the BP part (see step 5 of Algorithm 1). The BP part is performed in a very similar manner to the forward propagation part, going the other way around, starting from the appendices RAP_i/CAP_i containing the computed values (a_i[L] | 1 ≤ i ≤ N_L) after the forward part and going backwards. Algorithm 3, shown in Figure 7, illustrates how to implement the learning phase on a MAT (or a hypercube, as will be proved later).

Lemma 2: Using Algorithm 3, the number of time steps required to train one pattern on a MAT is ≤ 4L(log N + 3).

Proof: Consider the time required to execute the procedure BP_LAYER. Recall that we are assuming that the computation of a function, a multiplication, or an addition takes a single unit of time. Thereby, steps a, b, c, and d require at most log N + 1, 4, log N + 1, and 2 time steps, respectively. The steps for l even are very similar to, and hence take the same amount of time as, steps a, b, c, and d. Therefore, the procedure BP_LAYER requires ≤ 2 log N + 8 time units. Using the result of Lemma 1 and adding up the times for steps 1, 2, and 3 of Algorithm 3 we get

$$T_{learning} = T_{recall} + 3 + (L-1)(2\log N + 8) \le 2L(\log N + 2) + 3 + (L-1)(2\log N + 8) = (4L-2)\log N + 12L - 5 \le 4L(\log N + 3). \;\square$$
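For a feel of the bounds in Lemmas 1 and 2 (an illustration, not from the paper), they can be tabulated for a few values of N, here with L = 3 layers:

```python
import math

def recall_bound(L, N):
    # Lemma 1: T_recall <= 2L(log N + 2).
    return 2 * L * (math.log2(N) + 2)

def learning_bound(L, N):
    # Lemma 2: T_learning <= 4L(log N + 3).
    return 4 * L * (math.log2(N) + 3)

for N in (64, 256, 1024):
    print(N, recall_bound(3, N), learning_bound(3, N))
# 64   48.0 108.0
# 256  60.0 132.0
# 1024 72.0 156.0
```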

III.3. Pipelining Patterns

In this subsection we introduce an improvement to the above algorithm by showing how to process more than one input pattern in parallel. This is done by pipelining up to 2 log N + 2 input patterns. Our ability to pipeline patterns depends on two assumptions: (1) links are full duplex, that is, links can carry data in both directions simultaneously; (2) a processor is able to receive and/or send data on all its channels (incident links) simultaneously.

When we pipeline input patterns we have to be careful not to assign more than one computation to a single processor at any moment of time. What we will actually be doing is to exploit processors of ARTs/ACTs which are broadcasting activation values downwards but performing no computation (see step 1 in Figure 6, for example). We will have another pattern being processed in the other direction. This, of course, increases processor utilization.



Algorithm 3: {LEARNING PHASE}

1. {Compute (a_i[L] | 1 ≤ i ≤ N_L).}
   for l = 1 to L do
       RECALL_LAYER(l);
   endfor.

2. {Compute the deltas for the output layer.}
   for all 1 ≤ i ≤ N_L do
       if L is odd then
           RAP_i computes δ_i[L] = f'(h_i[L])(t_i − a_i[L]).
       else {L is even}
           CAP_i computes δ_i[L] = f'(h_i[L])(t_i − a_i[L]).
       endif
   endfor.

3. {Back propagation.}
   for l = L to 2 do
       BP_LAYER(l);
   endfor.

procedure BP_LAYER(l)
   if l is odd then
       a. for all 1 ≤ i ≤ N_l do parbegin
              RAP_i passes δ_i[l] downward through ART_i so that WP_ij, 1 ≤ j ≤ N_{l−1}, receives δ_i[l]
          parend
       b. for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l−1} do parbegin
              WP_ij does the following:
                  finds the product w_ij[l] δ_i[l]
                  finds Δw_ij[l] = η δ_i[l] a_j[l−1] {notice that a_j[l−1] is saved in WP_ij during the forward propagation phase}
                  updates w_ij by w_ij^new[l] = w_ij^old[l] + Δw_ij[l]
          parend
       c. for all 1 ≤ j ≤ N_{l−1} do parbegin
              ACT_j sums the product values of its leaves and the result Σ_i w_ij[l] δ_i[l] is sent to CAP_j
          parend
       d. for all 1 ≤ j ≤ N_{l−1} do parbegin
              CAP_j computes δ_j[l−1] = f'(h_j[l−1]) Σ_i w_ij[l] δ_i[l] {notice that h_j[l−1] is saved in CAP_j during the forward propagation phase}
          parend
   else {l is even}
       Same as steps a, b, c, and d, but replace RAP, ART, ACT, CAP, and WP_ij respectively by CAP, ACT, ART, RAP, and WP_ji
   endif
endprocedure.

Figure 7: Algorithm for the learning phase of an L-layer FF ANN.
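As with the recall phase, the body of BP_LAYER for odd l can be sketched sequentially (our illustration; the argument names are ours, and the values saved during the forward pass are passed in explicitly). Steps a through d appear as comments.

```python
import numpy as np

def bp_layer(W_l, delta, a_prev, h_prev, eta, fprime):
    """One odd-l invocation of BP_LAYER (steps a-d), written sequentially.

    W_l    : the N_l x N_{l-1} matrix W[l], updated in place
    delta  : delta_i[l] held by the RAPs          (length N_l)
    a_prev : a_j[l-1] saved in the WPs            (length N_{l-1})
    h_prev : h_j[l-1] saved in the CAPs           (length N_{l-1})
    fprime : derivative f' of the activation function
    """
    prod = W_l * delta[:, np.newaxis]      # a, b: WP_ij forms w_ij[l] * delta_i[l]
    W_l += eta * np.outer(delta, a_prev)   # b: w_ij[l] += eta * delta_i[l] * a_j[l-1]
    back = prod.sum(axis=0)                # c: ACT_j sums its column -> CAP_j
    return fprime(h_prev) * back           # d: delta_j[l-1] = f'(h_j[l-1]) * sum_i ...
```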

Consider the following scenario. CAP_i passes a_i^1[0] downward to the root of ACT_i, starting a recall phase computation. In the second time step, a_i^1[0] will move further down in ACT_i. At that moment CAP_i sends a_i^2[0] of the second pattern to the root of ACT_i, starting the recall phase of the second pattern. In the following time step, we will have a_i^3[0] at the root (level 1), a_i^2[0] at level 2, and a_i^1[0] at level 3 of ACT_i. The process continues in this manner. However, there is a limit to the number of patterns that can be concurrently placed in the pipeline. This limit is governed by the constraint that no processor should ever be doing more than one computation at a time. Such a situation occurs when a_i^1[1] reaches WP_ij coming down ART_i (while performing the computations of layer 2) at the same time as a_j^p[0] reaches WP_ij while going down ACT_j (while performing the computation of the first layer for pattern p). In the very next time step WP_ij would be required to perform two computations: the first is to multiply a_i^1[1] by w_ji[2] and the second is to multiply a_j^p[0] by w_ij[1]. One can check that p = 2 log N + 3. Therefore, the maximum number of patterns that can be pipelined is 2 log N + 2.
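For concreteness (ours, not from the paper), the pipeline capacity of the full MAT as a function of the largest layer size N:

```python
import math

def max_pipelined_patterns(N):
    # At most 2*log N + 2 patterns can be in flight on the full MAT.
    return 2 * int(math.log2(N)) + 2

print([(N, max_pipelined_patterns(N)) for N in (64, 256, 1024)])
# [(64, 14), (256, 18), (1024, 22)]
```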

IV. Implementation on the hypercube

So far, we have seen the ANN implementation on MATs. However, there exists no actually built system with a MAT architecture. Nevertheless, we will show in this section how the above technique can be carried out on the hypercube, a topology with which many existing massively parallel machines have been built. When we try to implement the above mapping scheme on a hypercube machine, several questions arise: Can we embed a MAT into a hypercube structure? How can we choose the N² WPs such that each weight node WP_ij is a leaf of the i-th ART and the j-th ACT? If it is possible to embed a MAT into the hypercube, what is the minimum size hypercube required to embed an N × N MAT?

The first two questions are answered by Theorem 1 and its proof. The last question is answered in Lemma 3. The proof of Theorem 1 is an inductive proof that defines a recursive procedure to embed a MAT into a hypercube. The detailed proof is lengthy and requires introducing a lot of new notations and definitions. Therefore, and due to lack of space, only a very brief summary of the proof is given in this paper.

Theorem 1: An N × N MAT can be embedded into a hypercube of 4N² nodes.

Sketch of proof: Let N = 2^n. The proof is by induction on n. For n = 2, it can be shown that a 4 × 4 MAT can be embedded into a 6-HC (a 6-dimensional hypercube of 64 nodes). Suppose that the 2^n × 2^n MAT is embedded into the (2n + 2)-HC. The induction step consists of two parts. In the first part, the 2^n × 2^n MAT is duplicated and an automorphic transformation is applied on the nodes of the new (2n + 2)-HC containing the second MAT copy, in order to orient the MAT structure in such a way that, by deleting some edges from the two MATs and adding other edges from the two hypercubes, a 2^{n+1} × 2^n MAT results. The second part is similar to the first; however, it extends the resulting 2^{n+1} × 2^n MAT along the columns rather than the rows in order to obtain a 2^{n+1} × 2^{n+1} MAT. □

Lemma 3: The embedding of Theorem 1 is optimal, that is, the smallest size hypercube to embody an N × N MAT has 4N² nodes.

Proof: Because N is the number of leaves of a complete binary tree (ACT or ART), N is a power of 2. Let N = 2^n. The total number of nodes in an N × N MAT is 3N². Therefore, the minimum size of a hypercube to accommodate the MAT is

$$2^{\lceil \log 3N^2 \rceil} = 2^{\lceil \log(3 \cdot 2^{2n}) \rceil} = 2^{2n+2} = 4 \cdot 2^{2n} = 4N^2,$$

where ⌈x⌉ represents the smallest integer ≥ x. □
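A quick numerical check of Lemma 3 (an illustration only):

```python
import math

def min_hypercube_nodes(N):
    # Smallest hypercube holding the 3*N^2 MAT nodes: 2^ceil(log2(3*N^2)).
    return 2 ** math.ceil(math.log2(3 * N * N))

print([(N, min_hypercube_nodes(N), 4 * N * N) for N in (4, 16, 64)])
# [(4, 64, 64), (16, 1024, 1024), (64, 16384, 16384)]
```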

In order to execute Algorithm 2 on the hypercube, we first need to use the recursive mapping procedure explained in the proof of Theorem 1 to associate each MAT processor with a hypercube node. This can be done in a static manner at the very beginning and is known a priori for a certain system configuration. Thus, its cost should not be considered as part of the cost of the ANN computation.

V. Special purpose ANN computer

An alternative approach is to utilize the above technique as a sketch for constructing a special purpose parallel machine for fast ANN computation. This section considers the architecture of such a parallel computer, which is composed of three types of processing nodes: multipliers, adders, and activation function computers.

Each node of the MAT architecture is a whole processor. However, one can notice that weight processors are only used for multiplication, tree processors are merely used as adders, and CAPs and RAPs are used only for activation function computation (see Algorithms 2 and 3). Therefore, for a special purpose machine, a great deal of hardware can be saved by replacing each WP by a multiplier, each tree node by an adder, and the CAPs and RAPs by function computers.

Another improvement which saves circuit area and increases resource utilization is to replace the two sets of trees (ACTs and ARTs) by one. This can be done because, at any given moment while processing a pattern, either the ACTs or the ARTs are being used, but not both. Figure 8 shows a possible layout for a 4 × 4 MAT.

Figure 8: A 4 × 4 MAT with merged ACTs and ARTs. (Node types: multipliers, adders, activation function units, and switches.)

This layout was obtained by folding the structure of Figure 5(b) along the main diagonal of the array of weight processors. Thereby, the multiplier corresponding to WP_ij is placed next to that corresponding to WP_ji. Switches are used to choose whether the trees are being used as ACTs or as ARTs; this is done by selecting the leaves of the i-th tree to be either the i-th column or the i-th row of the grid of multipliers.

Finally, we should mention that by merging the ACTs and ARTs into one set of trees we reduce the maximum number of patterns that can be pipelined; this number is cut in half. Therefore, the utmost number of patterns that can be placed together in the pipeline is log N + 1.

VI. Evaluation and comparison with previous work

In this section we compare our mapping technique with three major techniques in the literature. The first of these was introduced by [Kung88, Hwan89]. It maps ANNs on the systolic ring architecture. Extensive work in the literature depends on this technique. The second technique [Sham90] is an extension of Kung's method in order to make it possible to pipeline M input patterns. The target machine for this mapping scheme is a two-dimensional SIMD processor array. The third technique [Lin91] implements ANNs on fine grain mesh-connected SIMD machines. The mapping is established based on a set of congestion-free routing procedures. We compare our technique with these techniques in terms of the number of processors used, the time for processing one pattern, the ability to pipeline patterns, and finally the time to process k input patterns in a pipelined manner. The comparison is furnished in the table of Figure 9.



Figure 9: Comparison with other mapping schemes. (Among the compared schemes: [Kung88] on a systolic ring uses O(N) processors, takes O(N) time per pattern, has no pipelining, and needs O(kN) time for k patterns; [Sham90] on a SIMD 2D mesh uses O(MN) processors, takes O(N) time per pattern, supports pipelining, and needs O(kN/M) time for k patterns.)

VII. Conclusions

This paper presents a high-performance scheme for implementing both the recall and the training phases of the FFBP ANN model. This high performance is substantiated by a time complexity of O(log N), where N is the number of neurons in the largest layer, and the ability to pipeline the processing of up to 2 log N + 2 input patterns.

The technique of this paper is used as a mapping technique to implement ANNs on general purpose hypercube-based massively parallel systems. In addition, it is employed to construct special purpose parallel hardware for fast ANN applications. The special purpose property is exploited to reduce the hardware and increase processor utilization.

References

[Beck90] Beck, J., "The Ring Array Processor (RAP): Hardware," International Computer Science Institute, Berkeley, CA, 1990.

[Brow87] Brown, J. R. and S. F. Vanable, "Artificial Neural Network on a SIMD Architecture," Proceedings of the 2nd Symposium on the Frontiers of Massively Parallel Computation, pp. 127-136, 1987.

[Chu92] Chu, L. C. and B. W. Wah, "Optimal Mapping of Neural-Network Learning on Message-Passing Multicomputers," Journal of Parallel and Distributed Computing, 14, pp. 319-339, 1992.

[Dura89] Duranton, M. and J. A. Sirat, "Learning on VLSI: A General Purpose Digital Neurochip," International Conference on Neural Networks, Washington, DC, 1989.

[Ghos89] Ghosh, J. and K. Hwang, "Mapping Neural Networks onto Message-Passing Multicomputers," Journal of Parallel and Distributed Computing, 6, pp. 291-330, 1989.

[Hamm90] Hammerstrom, D., "A VLSI Architecture for High-Performance, Low-Cost, On-Chip Learning," International Joint Conference on Neural Networks, Vol. 2, pp. 537-543, 1990.

[Hert91] Hertz, J., A. Krogh and R. G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, 1991.

[Hira90] Hiraiwa, A., et al., "A Two Level Pipeline RISC Processor Array for ANN," International Joint Conference on Neural Networks, pp. 137-140, 1990.

[Hwan89] Hwang, J. N. and Kung, S. Y., "Parallel Algorithms/Architectures for Neural Networks," Journal of VLSI Signal Processing, 1989.

[Kim89] Kim, K. and Kumar, K. P., "Efficient Implementation of Neural Networks on Hypercube SIMD Arrays," International Joint Conference on Neural Networks, 1989.

[Kung88] Kung, S. Y., "Parallel Architectures for Artificial Neural Nets," International Conference on Systolic Arrays, pp. 163-174, 1988.

[Lin91] Lin, W., V. K. Prasanna and K. W. Przytula, "Algorithmic Mapping of Neural Network Models onto Parallel SIMD Machines," IEEE Transactions on Computers, 1991.

[Madr91] Madraswala, T. H., et al., "A Reconfigurable ANN Architecture," International Symposium on Circuits and Systems, 1991.

[Morg92] Morgan, N., et al., "The Ring Array Processor: A Multiprocessing Peripheral for Connectionist Applications," Journal of Parallel and Distributed Computing, 14, pp. 248-259, 1992.

[Rama92] Ramacher, U., "SYNAPSE - A Neurocomputer That Synthesizes Neural Algorithms on a Parallel Systolic Engine," Journal of Parallel and Distributed Computing, 14, pp. 306-318, 1992.

[Sham90] Shams, S. and K. W. Przytula, "Mapping of Neural Networks onto Programmable Parallel Machines," IEEE International Symposium on Circuits and Systems, New Orleans, LA, May 1990.

[Tomb88] Tomboulian, S., "Overview and Extensions of a System for Routing Directed Graphs on SIMD Architectures," Frontiers of Massively Parallel Processing, 1988.

[Wah90] Wah, B. W. and L. Chu, "Efficient Mapping of Neural Networks on Multicomputers," Int. Conf. on Parallel Processing, Pennsylvania State Univ. Press, Vol. I, pp. 234-241, 1990.

[Zhan89] Zhang, X., et al., "An Efficient Implementation of the Back Propagation Algorithm on the Connection Machine CM-2," Neural Information Processing Systems 2, pp. 801-809, 1989.
