
An Application-Specific Array Architecture for Feedforward with Backpropagation ANNs

Q.M. Malluhi, M.A. Bayoumi, T.R.N. Rao
Center for Advanced Computer Studies

University of Southwestern Louisiana, Lafayette, LA 70504

Abstract

An application-specific array architecture for Artificial Neural Network (ANN) computation is proposed. This array is configured as a mesh-of-appendixed-trees (MAT). Algorithms to implement both the recall and the training phases of the multilayer feedforward with backpropagation ANN model are developed on MAT. The proposed MAT architecture requires only O(log N) time, while other reported techniques offer O(N) time, where N is the size of the largest layer. Besides the high-speed performance, pipelining of more than one input pattern can be achieved, which further improves the performance.

I. Introduction

A basic ANN computation model has several characteristics that favor massively parallel digital implementation. These include highly parallel operations, simple processing units (neurons), small local memory per neuron (distributed memory), robustness, and fault tolerance to connection or neuron malfunctioning [Ghos89]. Therefore, a highly parallel computing system of thousands of simple processing elements is a typical target architecture for implementing ANNs.

Several parallel digital special-purpose array processors for neural networks have been proposed in the literature. Typical examples include the LNeuro chip [Dura89], the Ring Array Processor (RAP) [Beck90, Morg92], the CNAPS system [Hamm90], the SYNAPSE neurocomputer [Rama92], the GCN RISC processor array [Hira90], and the reconfigurable ANN chip [Madr91]. A number of mapping schemes have been reported to implement neural network algorithms on available parallel architectures. Examples of these mapping schemes can be found in [Brow87, Tomb88, Kung88, Hwan89, Kim89, Wah90, Lin91, Chu92].

In this paper, we propose an application-specific parallel architecture on which we map the multilayer feedforward with backpropagation learning ANN model (FFBP). This architecture is called the Mesh-of-Appendixed-Trees (MAT). Algorithms to implement both the recall and the learning phases of FFBP ANNs are provided.

A major advantage of this technique is its high performance. Unlike almost all the other techniques presented in the literature, which require O(N) time, where N is the size of the largest layer, the proposed mapping scheme takes only O(log N) time. Another important feature of this method is that it allows the pipelining of more than one input pattern and thus improves the performance further.

Five different levels of parallelism can be identified for ANN computations: network level, training pattern level, layer level, neuron level and synapse level [Jord92].



Figure 1: A basic ANN topology.

For a particular implementation, the level (or levels) of parallelism used depends on the constraints imposed by both the ANN model and the computing environment. The network level is the highest-level, coarsest-grain parallelism, in which each processor processes a different network with different parameters. At the other end, the synapse level is the finest-grain parallelism, in which each synapse operation is mapped onto a distinct processor. Such fine-grain parallelism is suitable for systems with a very large number of simple processors. In other words, synapse-level parallelism is suitable for massively parallel processors. Examples of synapse-level implementations can be found in [Rose87, Krik90, Chin90, Dura89].

The ANN implementation method discussed in this paper uses training pattern, neuron and synapse level parallelism. It maximizes parallelism by unfolding the ANN computation into its smallest computational primitives and processing these primitives in parallel. Pipelining techniques are used to add training pattern parallelism.

This paper is organized as follows. Section II briefly describes a general ANN model of computation with some emphasis on the FFBP model. It also provides some terminology to be used throughout the paper. Section III describes the MAT structure and explains the mapping of both the recall and the learning phases of a FFBP ANN on the MAT architecture. In addition, Section III discusses the issue of pipelining multiple patterns. In Section IV, a modified MAT is used as a special purpose fast ANN computer. Section V compares our technique with other techniques proposed in the literature. Finally, Section VI draws the conclusions.

II. ANN model of computation

A basic ANN model of computation consists of a large number of neurons connected to each other by connection weights (see Figure 1). Each neuron, say neuron i, has an activation value a_i. Associated with each connection from neuron j to neuron i is a synaptic weight (or simply, a weight) w_ij. The ANN computation can be divided into two phases: a recall phase and a learning phase. The recall phase updates the outputs (activation values) of the neurons based on the system dynamics to produce the derived ANN output as a response to an input (test pattern). The learning phase performs an iterative updating of the synaptic weights based upon the adopted learning algorithm. The weights are updated in a way that minimizes an error function measuring how good the ANN output is. In other words, the learning phase teaches the ANN to produce the desired outputs. In some ANN models, the weight values are predetermined; therefore, no learning phase is required.

In this paper we deal with feedforward neural networks with backpropagation learning, which are central to much of the work going on in the field nowadays. The following two subsections describe multilayer feedforward ANNs and the error backpropagation learning algorithm.


Figure 2: An L-layer feedforward network.

II.1. Multilayer feedforward network

A multilayer (L-layer) feedforward (FF) network has the general form shown in Figure 2. There is a set of input terminals whose only role is to feed input patterns into the rest of the network. After this, there are zero or more intermediate layers, followed by a final layer where the result of the computation is read off. The intermediate layers are called the hidden layers, and the L-th layer (final layer) is referred to as the output layer. The network interconnection is such that each node (neuron) in a layer receives input from every node of the previous layer. This interconnection topology implies that for every layer, say layer l, there is a weight matrix W[l] of the synaptic weights for the links between layers l and l-1.

We will use an index between brackets to indicate the layer number and a superscript to denote the test pattern number. For example, w_ij[l] represents the element in the i-th row and the j-th column of the weight matrix W[l] of the l-th layer, and a_i^p[l] represents the activation value of the i-th neuron in the l-th layer for the p-th input pattern I^p. We will use N_i to denote the number of neurons in layer i. For notational convenience, the input terminals will be considered as layer number 0; thus, N_0 will represent the number of input terminals.

For input pattern I^p = (I_1^p, I_2^p, ..., I_{N_0}^p), the system dynamics for the recall phase is given by

a_i^p[l] = f(h_i^p[l]) = f( Σ_j w_ij[l] a_j^p[l-1] ),  for l = 1, 2, ..., L,  with a_j^p[0] = I_j^p.   (1)

Each neuron i computes the weighted sum h_i of its inputs and then applies a nonlinear function f(h_i), producing an activation value (output) a_i for this neuron. The function f is usually a sigmoid function given by f(x) = 1/(1 + e^(-βx)).
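As an illustration only (not part of the paper), the following minimal Python sketch computes one recall step for a single layer under these dynamics; the layer sizes, the random weights, and the gain β are arbitrary assumptions.

import numpy as np

def sigmoid(h, beta=1.0):
    # f(h) = 1 / (1 + exp(-beta*h)); beta is an assumed gain parameter
    return 1.0 / (1.0 + np.exp(-beta * h))

def recall_layer(W, a_prev):
    # h_i = sum_j w_ij[l] * a_j[l-1]; a_i[l] = f(h_i)
    h = W @ a_prev
    return sigmoid(h), h

# hypothetical sizes: N_l = 3 neurons fed by N_{l-1} = 4 activations
W = np.random.rand(3, 4)       # weight matrix W[l]
a_prev = np.random.rand(4)     # activations a[l-1]
a_l, h_l = recall_layer(W, a_prev)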

II.2. Backpropagation learning

For each input I^p, there is a target (desired) output t^p. The backpropagation (BP) learning

algorithm [Rume86] gives a prescription for changing the synaptic weights in any feedforward network so as to learn a training set of input-target pairs. This type of learning is usually referred to as "supervised learning" or "learning by teacher". The learning phase involves two steps. In the first step, the input is presented at the input terminals and is processed by the ANN according to the recall phase equations. In the second step, the produced output is compared to the target and an error measurement value is propagated backward (from the output layer to the first layer)


Algorithm 1:

1. Apply input, a[0] = I.
2. Propagate the signal forward to compute the output a[L] using
       a_i[l] = f( Σ_j w_ij[l] a_j[l-1] ),  for l = 1, 2, ..., L.
3. Compute the deltas for the output layer, δ_i[L] = f'(h_i[L])(t_i - a_i[L]).
4. Compute the deltas for the preceding layers by propagating the error backwards,
       δ_i[l-1] = f'(h_i[l-1]) Σ_j w_ji[l] δ_j[l],  for l = L, L-1, ..., 2.
5. Adjust all weights according to
       Δw_ij[l] = η δ_i[l] a_j[l-1],
       w_ij^new[l] = w_ij^old[l] + Δw_ij[l].
6. Repeat from step 1 for the next input pattern.

Figure 3: Backpropagation learning algorithm.

and appropriate changes of weights are made. The second step proceeds along the following iterative equations:

δ_i^p[L] = f'(h_i^p[L])(t_i^p - a_i^p[L])
δ_i^p[l-1] = f'(h_i^p[l-1]) Σ_j w_ji[l] δ_j^p[l]
Δw_ij[l] = η δ_i^p[l] a_j^p[l-1]
w_ij^new[l] = w_ij^old[l] + Δw_ij[l]

for l = L, L-1, ..., 2. The BP algorithm is summarized in Figure 3.
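To make the update rule concrete, here is a minimal NumPy sketch of one training iteration following Algorithm 1; it is our illustration, with the network sizes, the learning rate η, and the logistic activation assumed rather than taken from the paper.

import numpy as np

def f(h):                       # assumed logistic activation (beta = 1)
    return 1.0 / (1.0 + np.exp(-h))

def f_prime(h):                 # derivative of the logistic function
    s = f(h)
    return s * (1.0 - s)

def bp_iteration(W, I, t, eta=0.1):
    """One forward/backward pass of Algorithm 1.
    W is a list of L matrices; W[l-1] plays the role of W[l] and has shape (N_l, N_{l-1})."""
    L = len(W)
    a, h = [I], [None]                         # step 1: a[0] = I
    for l in range(1, L + 1):                  # step 2: forward propagation
        h.append(W[l - 1] @ a[l - 1])
        a.append(f(h[l]))
    delta = [None] * (L + 1)
    delta[L] = f_prime(h[L]) * (t - a[L])      # step 3: output-layer deltas
    for l in range(L, 1, -1):                  # step 4: backpropagate the error
        delta[l - 1] = f_prime(h[l - 1]) * (W[l - 1].T @ delta[l])
    for l in range(1, L + 1):                  # step 5: adjust all weights
        W[l - 1] += eta * np.outer(delta[l], a[l - 1])
    return a[L]

# hypothetical 2-layer network: 4 inputs, 3 hidden neurons, 2 outputs
W = [np.random.randn(3, 4), np.random.randn(2, 3)]
out = bp_iteration(W, np.random.rand(4), np.array([0.0, 1.0]))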

III. Mapping the feedforward with backpropagation ANN model onto the MAT architecture

The proposed Mesh-of-Appendixed-Trees (MAT) architecture is shown in Figure 4. The topology of the architecture is a variation of the ordinary mesh-of-trees topology [Leig92]. The N×M MAT, where N = 2^n and M = 2^m, is constructed from a grid of N×M processors. Each row of this grid constitutes a set of leaves for a complete binary tree called an ART (Appendixed Row Tree). Similarly, each column of this grid forms a set of leaves for a complete binary tree called an ACT (Appendixed Column Tree). For each ART/ACT, there is an appendix RAP/CAP connected to the root of the tree. Figure 4(a) demonstrates the general structure of a MAT. Figure 4(b) shows a 4×4 MAT. Each of the N ARTs has 2M nodes and each of the M ACTs has 2N nodes. Summing up and subtracting the grid size, since the grid processors are members of both ARTs and ACTs, we get the total number of nodes in a MAT as N(2M) + M(2N) - NM = 3NM.
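A small helper (ours, not from the paper) spells out this node-count arithmetic:

def mat_node_count(N, M):
    # N ARTs with 2M nodes each, plus M ACTs with 2N nodes each,
    # minus the N*M grid processors counted in both sets of trees
    return N * (2 * M) + M * (2 * N) - N * M   # = 3*N*M

assert mat_node_count(4, 4) == 3 * 4 * 4       # the 4x4 MAT of Figure 4(b) has 48 nodes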

Our objective in this section is to map the multilayer feedforward with backpropagation learning (FFBP) neural network model onto the MAT architecture. The mapping procedure is developed in stages. We start the discussion by showing how to map the recall phase of a single layer on a MAT. After that, doses of complexity are gradually added by discussing more complex situations and by showing how to enhance the performance of the given mapping algorithm through processing more than one pattern in a pipelined fashion.

Subsection III.1 presents a method to map the processing of the recall phase onto the MAT. Subsection III.2 describes the implementation of the BP learning phase. Finally, an enhanced version of the mapping procedure that allows parallel (pipelined) processing of more than one input pattern is presented in Subsection III.3.


Figure 4: (a) General MAT structure, (b) 4x4 MAT.

III.1. Recall phase

As a first step, consider the recall phase and concentrate only on the operations of one layer, say layer l. That is, we only consider the computation of the layer-l activation values (a_i[l] | 1 ≤ i ≤ N_l) from the activation values (a_i[l-1] | 1 ≤ i ≤ N_{l-1}) of the preceding layer. Equation (1) indicates that the operations involved in the computation are as follows:

1. Distribute a_j[l-1] to all the elements of column j of the weight matrix W[l]. This is done for all 1 ≤ j ≤ N_{l-1}.
2. Multiply a_j[l-1] and w_ij[l] for all 1 ≤ i ≤ N_l and 1 ≤ j ≤ N_{l-1}.
3. Sum the results of the multiplications of step 2 along each row of W[l] to compute the weighted sums (h_i[l] | 1 ≤ i ≤ N_l).
4. Apply the activation function f(h_i[l]) for all 1 ≤ i ≤ N_l.

To start with, we suppose that each weight w_ij[l] is stored in a distinct processor WP_ij (Weight Processor). In addition, we assume that each of the activation values a_j[l-1] of layer l-1 is stored in processor CAP_j (Column Appendix Processor). An output activation value of layer l, a_i[l], will be produced in processor RAP_i (Row Appendix Processor).

The aforementioned four steps constitute the skeleton of our implementation. As we will see, step 1 will be performed in O(log N_l) time units using a tree-like structure. All the multiplications of step 2 can be done in parallel in the various weight processors; therefore, step 2 will only take one unit of time. The summation of step 3 will be computed in O(log N_{l-1}) time units in a tree-like fashion. Finally, the function f is applied at once on all the h_i values in the different row appendix processors.
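The following sequential Python emulation (our sketch, not the parallel hardware itself) expresses the four steps above as matrix operations and notes which MAT nodes would perform each step concurrently.

import numpy as np

def recall_layer_on_mat(W_l, a_prev, f):
    """Sequential emulation of steps 1-4 on an N_l x N_{l-1} MAT."""
    N_l, N_lm1 = W_l.shape
    # Step 1: each CAP_j broadcasts a_j[l-1] down ACT_j to column j of the WP grid
    #         (log N_l + 1 parallel time steps on the hardware).
    A = np.tile(a_prev, (N_l, 1))
    # Step 2: every WP_ij forms the product w_ij[l] * a_j[l-1] (one time step).
    P = W_l * A
    # Step 3: ART_i sums row i of the products and delivers h_i[l] to RAP_i
    #         (log N_{l-1} + 1 parallel time steps).
    h = P.sum(axis=1)
    # Step 4: each RAP_i applies the activation function (one time step).
    return f(h), h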

Figure 5 illustrates the algorithm that maps the operations of layer l of a FF ANN. For this purpose we use an N_l × N_{l-1} MAT. We suppose that matrix W[l] is initially entered into the grid processors (weight processors) so that processor WP_ij stores w_ij[l] in its local memory.



Moreover, the activation values of layer l-1 are assumed to be placed into the CAPs so that a_j[l-1] is kept in the local memory of processor CAP_j. Step 1 of Algorithm 2 (see Figure 5) takes log N_l + 1 time steps because the depth of the ACTs is log N_l + 1. Likewise, step 3 takes log N_{l-1} + 1. We assume that computing the product in step 2 takes a single unit of time. In addition, we assume that the computation of the function f in step 4 requires only one time unit. As a result, the total time for the computation of layer l is T_l = log N_l + log N_{l-1} + 4.

Thus far, we have seen the implementation of one layer. We now generalize the above discussion to a multilayer network. We start at layer 1 and perform the operations of Algorithm 2 to compute (a_i[1] | 1 ≤ i ≤ N_1) from (a_i[0] | 1 ≤ i ≤ N_0). The RAPs will contain the result of the computation. In order to continue the processing for the second layer, we need to place (a_i[1] | 1 ≤ i ≤ N_1) into the CAPs. This takes log N_0 + log N_1 + 2 time steps. This time can be saved by storing the transpose of W[2], rather than W[2] itself, into the grid of WPs and repeating the operations of Algorithm 2 backwards, starting from the RAPs and getting the results (a_i[2] | 1 ≤ i ≤ N_2) in the CAPs. For the third layer, W[3] is stored in the WPs and we start at the CAPs and get the resultant activation values at the RAPs, as we did for layer 1. We continue this way, going back and forth from CAPs to RAPs and from RAPs to CAPs, until we reach the output layer L.

Therefore, if N is the size of the largest layer of the ANN, and N_e and N_d are the sizes of the largest even and odd layers respectively, we use an N_e × N_d MAT. We initialize the local memories of the WPs by storing W[1], W^T[2], W[3], W^T[4], ... in the upper left corner of the WP grid. In other words, if l is odd, w_ij[l] is stored in the local memory of WP_ij; otherwise, w_ij[l] is kept in the local memory of WP_ji. The CAPs' local memories are initialized with the ANN input values (I_i | 1 ≤ i ≤ N_0). After the initialization is complete, we proceed according to Algorithm 2 shown in Figure 5.
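As a sketch of this ping-pong scheme (ours; only the storage orientation and the direction of data flow are being illustrated, with an assumed activation function f), odd layers hold W[l] and flow from the CAPs to the RAPs, while even layers hold the transpose and flow back:

import numpy as np

def multilayer_recall(Ws, I, f):
    """Ws[l-1] stands for W[l] and has shape (N_l, N_{l-1})."""
    act = I                                    # layer-0 activations start in the CAPs
    for l, W in enumerate(Ws, start=1):
        stored = W if l % 2 == 1 else W.T      # grid holds W[l] for odd l, W[l]^T for even l
        if l % 2 == 1:
            act = f(stored @ act)              # CAP broadcast, ART sums, result lands in the RAPs
        else:
            act = f(stored.T @ act)            # RAP broadcast, ACT sums, result lands in the CAPs
    return act, ("RAP" if len(Ws) % 2 == 1 else "CAP")   # a[L] and where it ends up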

Taking L as a constant, Lemma 1 below shows that we can process the recall phase of a FF ANN in a time complexity that is logarithmic in the size N of the largest layer. This is achieved at the expense of using at most 3N^2 processing units, corresponding to an N × N MAT.


Lemma 1: Algorithm 2 takes at most 2L(log N + 2) time steps.

Proof: Let T_recall be the total time for computing the output. We have

T_recall = Σ_{l=1}^{L} T_l ≤ Σ_{l=1}^{L} (log N_e + log N_d + 4) ≤ 2L log N + 4L = 2L(log N + 2). □
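For a concrete feel for these quantities, a small helper (ours; it assumes the unit-cost model of the text and power-of-two layer sizes):

import math

def layer_time(N_l, N_lm1):
    # (log N_l + 1) broadcast + 1 multiply + (log N_{l-1} + 1) summation + 1 activation
    return math.log2(N_l) + math.log2(N_lm1) + 4

def recall_time(layer_sizes):
    # layer_sizes = [N_0, N_1, ..., N_L]; Lemma 1 bounds the total by 2L(log N + 2)
    L = len(layer_sizes) - 1
    N = max(layer_sizes)
    total = sum(layer_time(layer_sizes[l], layer_sizes[l - 1]) for l in range(1, L + 1))
    return total, 2 * L * (math.log2(N) + 2)

print(recall_time([256, 512, 256]))    # hypothetical 256-512-256 network: (42.0, 44.0)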

Hitherto, the reader might have wondered: why do we not use the ordinary mesh-of-trees architecture instead of the MAT architecture? This point will be elucidated when we look into the pipelining issues in Section III.3. Had we used the ordinary mesh-of-trees, the root of each tree (row or column tree) would have been responsible for both adding the values it receives from its children and then applying the activation function f to the result of the add operation. This creates a bottleneck in the pipeline and disrupts its smooth flow of computation.

III.2. Learning phase

Initially, the target values (t_i | 1 ≤ i ≤ N_L) are assumed to be stored in the local memories of the CAPs or the RAPs, depending on whether L is even or odd. We note from Algorithm 1 that the learning phase is composed of two parts: a forward propagation part and a back propagation part.


Algorithm 2: {RECALL PHASE}

for l = 1 to L do
    RECALL_LAYER(l);
endfor.

procedure RECALL_LAYER(l)
if l is odd then
    1. for all 1 ≤ j ≤ N_{l-1} do
       parbegin
           CAP_j passes a_j[l-1] downward through ACT_j so that WP_ij, 1 ≤ i ≤ N_l, receives a_j[l-1]
       parend
    2. for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
       parbegin
           WP_ij finds the product w_ij[l] a_j[l-1]
       parend
    3. for all 1 ≤ i ≤ N_l do
       parbegin
           ART_i is used to sum the product values of its leaves and the result h_i[l] is sent to RAP_i
       parend
    4. for all 1 ≤ i ≤ N_l do
       parbegin
           RAP_i applies the function f(h_i[l])
       parend
else {l is even}
    Same as the steps when l is odd but replace CAP_j, RAP_i, ACT_j, ART_i, and WP_ij by RAP_j, CAP_i, ART_j, ACT_i, and WP_ji respectively
endif
endprocedure.

Figure 5: Algorithm for the recall phase of an L-layer FF ANN.

The forward part is identical to the recall phase, whose implementation is provided in Algorithm 2. However, some values in Algorithm 2 have to be given special care because they will be needed later during the BP part. When an appendix CAP_i/RAP_i receives a weighted sum h_i from the root of its tree ACT_i/ART_i, it should apply the function f(h_i) and then save h_i for future use (notice the use of h_i in steps 3 and 4 of Algorithm 1). Similarly, when a WP receives an activation value, it multiplies it by the corresponding weight and then saves it, because it will be used in the calculation of Δw in the BP part (see step 5 of Algorithm 1). The BP part is performed in a manner very similar to the forward propagation part, going the other way around, starting from the appendices CAP_i/RAP_i containing the computed values (a_i[L] | 1 ≤ i ≤ N_L) after the forward part and going backwards. Algorithm 3, shown in Figure 6, illustrates how to implement the learning phase on a MAT.
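Mirroring the recall-phase sketch in Section III.1, the following sequential emulation (ours, not the paper's hardware) shows the data flow of one backward layer: deltas go down the row trees, the WPs multiply and update their weights locally, and the column trees sum the products; h[l-1] and a[l-1] are the values saved during the forward pass.

import numpy as np

def bp_layer_on_mat(W_l, delta_l, a_prev, h_prev, f_prime, eta=0.1):
    """Sequential emulation of steps a-d of BP_LAYER on an N_l x N_{l-1} grid."""
    N_l, N_lm1 = W_l.shape
    # Step a: RAP_i broadcasts delta_i[l] along ART_i to row i of the WP grid.
    D = np.tile(delta_l.reshape(-1, 1), (1, N_lm1))
    # Step b: each WP_ij forms w_ij[l]*delta_i[l], computes
    #         Delta_w_ij = eta*delta_i[l]*a_j[l-1], and updates its stored weight.
    P = W_l * D
    W_l += eta * np.outer(delta_l, a_prev)
    # Step c: ACT_j sums column j and sends sum_i w_ij[l]*delta_i[l] to CAP_j.
    s = P.sum(axis=0)
    # Step d: CAP_j computes delta_j[l-1] = f'(h_j[l-1]) * s_j.
    return f_prime(h_prev) * s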

Lemma 2: Using Algorithm 3, the number of time steps required to train one pattern on a MAT is at most 4L(log N + 3).

Proof: Consider the time required to execute the procedure BP_LAYER. Recall that we are assuming that a computation of the function f, a multiplication, or an addition takes a single unit of time.


Thereby, steps a, b, c and d require at most log N + 1, 4, log N + 1 and 2 time steps, respectively. The steps when l is even are very similar to, and hence take the same amount of time as, steps a, b, c and d. Therefore, the procedure BP_LAYER requires at most 2 log N + 8 time units. Using the result of Lemma 1 and adding up the times for steps 1, 2 and 3 of Algorithm 3, we get

T_learning = T_recall + 3 + (L - 1)(2 log N + 8)
          ≤ 2L(log N + 2) + 3 + (L - 1)(2 log N + 8)
          = (4L - 2) log N + 12L - 5 ≤ 4L(log N + 3). □
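A quick numeric check of the bound (ours; the layer count and layer size below are arbitrary examples):

import math

def learning_time_bound(N, L):
    # Lemma 2: training one pattern takes at most (4L-2)log N + 12L - 5 <= 4L(log N + 3) steps
    tight = (4 * L - 2) * math.log2(N) + 12 * L - 5
    simple = 4 * L * (math.log2(N) + 3)
    return tight, simple

print(learning_time_bound(1024, 3))    # (131.0, 156.0)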

III.3. Pipelining patterns

In this subsection, we introduce an improvement to the above algorithm by showing how to process more than one input pattern in parallel. This is done by pipelining up to 2 log N + 2 input patterns. Our ability to pipeline patterns depends on the following two assumptions: (1) links are full duplex, that is, links can carry data in both directions simultaneously; (2) a processor is able to receive and/or send data on all its channels (incident links) simultaneously.

When we pipeline input patterns, we have to be careful not to assign more than one computation to a single processor at any moment of time. What we will actually do is exploit processors of ARTs/ACTs which are broadcasting activation values downwards but performing no computation (see step 1 in Figure 5, for example). We will have another pattern being processed in the other direction. This, of course, increases processor utilization.

Consider the following scenario. CAP_j passes a_j^1[0] downward to the root of ACT_j, starting a recall phase computation. In the second time step, a_j^1[0] moves further down in ACT_j. At that moment, CAP_j sends a_j^2[0] of the second pattern to the root of ACT_j, starting the recall phase of the second pattern. In the following time step, we will have a_j^3[0] at the root (level 1), a_j^2[0] at level 2 and a_j^1[0] at level 3 of ACT_j. The process continues in this manner. However, there is a limit to the number of patterns that can be concurrently placed in the pipeline. This limit is governed by the constraint that no processor should ever be doing more than one computation at a time. Such a situation occurs when a_i^1[1] reaches WP_ij coming down ART_i (while performing the computations of layer 2) at the same time as a_j^p[0] reaches WP_ij while going down ACT_j (while performing the computation of the first layer for pattern p). In the very next time step, WP_ij is required to perform two computations: the first is to multiply a_i^1[1] by w_ji[2] and the second is to multiply a_j^p[0] by w_ij[1]. One can check that p = 2 log N + 3. Therefore, the maximum number of patterns that can be pipelined is 2 log N + 2.
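The pipelining capacity can be summarized in a small helper (ours); the merged-tree variant of Section IV halves the capacity to log N + 1:

import math

def max_pipelined_patterns(N, merged_trees=False):
    # 2*log N + 2 patterns with separate ACTs and ARTs;
    # log N + 1 when the two sets of trees are merged into one (Section IV)
    log_n = int(math.log2(N))
    return log_n + 1 if merged_trees else 2 * log_n + 2

print(max_pipelined_patterns(1024))                     # 22
print(max_pipelined_patterns(1024, merged_trees=True))  # 11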

IV. Special purpose ANN computer

This section discusses the architecture of a special purpose processor array for ANN computations. This array is composed of three types of processing nodes: multipliers, adders and activation function computers. Each node of the MAT architecture is a whole processor. However, one can notice that in Algorithm 2, weight processors are only used for multiplication, tree processors are merely used as adders, and CAPs and RAPs are only used for activation function computation. Therefore, for a special purpose machine, a great deal of hardware can be saved by replacing each WP by a multiplier, each tree node by an adder, and the CAPs and RAPs by function computers.
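Counting the node types of an N x M MAT under this specialization (our arithmetic, derived from the tree structure of Section III; not stated explicitly in the paper):

def hardware_budget(N, M):
    # one multiplier per weight processor, one adder per internal tree node,
    # one activation-function unit per appendix (RAP or CAP)
    multipliers = N * M                      # the WP grid
    adders = N * (M - 1) + M * (N - 1)       # internal nodes of the N ARTs and M ACTs
    f_units = N + M                          # the N RAPs and the M CAPs
    assert multipliers + adders + f_units == 3 * N * M   # consistent with the MAT node count
    return multipliers, adders, f_units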

Another improvement which saves circuit area and increases resource utilization is to replace the two sets of trees (ACTs and ARTs) by one. This can be done because, at a given moment of time while processing a pattern, either the ACTs or the ARTs are being used, but not both.


Algorithm 3: {LEARNING PHASE}

1. {Propagate the signal forward to compute (a_i[L] | 1 ≤ i ≤ N_L).}
   for l = 1 to L do
       RECALL_LAYER(l);
   endfor.

2. {Compute the deltas for the output layer.}
   for all 1 ≤ i ≤ N_L do
       if L is odd then
           RAP_i computes δ_i[L] = f'(h_i[L])(t_i - a_i[L]).
       else {L is even}
           CAP_i computes δ_i[L] = f'(h_i[L])(t_i - a_i[L]).
       endif
   endfor.

3. {Back propagation.}
   for l = L to 2 do
       BP_LAYER(l);
   endfor.

procedure BP_LAYER(l)
if l is odd then
    a. for all 1 ≤ i ≤ N_l do
       parbegin
           RAP_i passes δ_i[l] downward through ART_i so that WP_ij, 1 ≤ j ≤ N_{l-1}, receives δ_i[l]
       parend
    b. for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
       parbegin
           WP_ij does the following:
               finds the product w_ij[l] δ_i[l];
               finds Δw_ij[l] = η δ_i[l] a_j[l-1] {notice that a_j[l-1] is saved in WP_ij during the forward propagation phase};
               updates w_ij by w_ij^new[l] = w_ij^old[l] + Δw_ij[l]
       parend
    c. for all 1 ≤ j ≤ N_{l-1} do
       parbegin
           ACT_j is used to sum the product values of its leaves and the result Σ_i w_ij[l] δ_i[l] is sent to CAP_j
       parend
    d. for all 1 ≤ j ≤ N_{l-1} do
       parbegin
           CAP_j computes δ_j[l-1] = f'(h_j[l-1]) Σ_i w_ij[l] δ_i[l]. {notice that h_j[l-1] is saved in CAP_j during the forward propagation phase}
       parend
else {l is even}
    Same as steps a, b, c and d but replace CAP, RAP, ACT, ART, and WP_ij by RAP, CAP, ART, ACT, and WP_ji respectively
endif
endprocedure.

Figure 6: Algorithm for the learning phase of an L-layer FF ANN.



Figure 7: A 4×4 MAT with merged ACTs and ARTs (legend: multiplier, adder, activation function, switch).

Figure 7 shows a possible layout for a 4×4 MAT. This layout was obtained by folding the structure of Figure 4(b) along the main diagonal of the array of weight processors. Thereby, the multiplier corresponding to WP_ij is placed next to that corresponding to WP_ji. Switches are used to choose whether the trees are being used as ACTs or ARTs. This is done by selecting the leaves of the i-th tree to be either the i-th column or the i-th row of the grid of multipliers.

The architecture of Figure 7 is appropriate for executing the recall phase (Algorithm 2). However, if this special purpose computer is to support learning (Algorithm 3), more complex processors are required. In Algorithm 3, RAPs and CAPs are responsible for performing multiplication and subtraction, as well as activation function computation (see steps 2 and d). In addition, WPs need to update the weight values (see step b). Therefore, for a learning machine, Figure 7 should be modified by adding multipliers and subtracters to the appendices and adders to the WPs.

Finally, we should mention that by merging the ACTs and ARTs into one set of trees we reduce the maximum number of patterns that can be pipelined. In fact, this number is cut in half. Therefore, the utmost number of patterns that can be placed together in the pipeline is log N + 1.

V. Evaluation and comparison with previous work

In this section we compare our implementation technique with three major techniques in the literature. The first of these was introduced by [Kung88, Hwan89]. It maps ANNs onto the systolic ring architecture. Extensive work in the literature depends on this technique. The second technique [Sham90] is an extension of Kung's method that makes it possible to pipeline M input patterns. The target machine for this mapping scheme is a two-dimensional SIMD processor array. The third technique [Lin91] implements ANNs on fine-grain mesh-connected SIMD machines. The mapping is established based on a set of congestion-free routing procedures. We compare our technique with these techniques in terms of the number of processors used, the time for processing one pattern, the ability to pipeline patterns, and finally the time to process k input patterns in a pipelined manner. The comparison is furnished in the table of Figure 8.


Figure 8: Comparison with other mapping schemes.

VI. Conclusions

This paper presents a high-performance scheme for implementing the FFBP ANN model on a mesh-of-appendixed-trees based array processor. This high performance is substantiated by a time complexity of O(log N), where N is the number of neurons in the largest layer, and by the ability to pipeline the processing of up to 2 log N + 2 input patterns. We should note here that even though the technique developed in this paper is presented for the FFBP model, it can easily be adapted to implement several other ANN models, for example the Hopfield model [Hope82] and the Boltzmann machine [Hert91].

The technique of this paper is used to construct an application-specific architecture for ANN computations. The application-specific property is exploited to reduce the hardware and increase processor utilization. In addition, even though not discussed here, this technique can be utilized to map ANNs onto hypercube general-purpose massively parallel machines.

References

[Beck90] Beck, J., "The Ring Array Processor (RAP): Hardware," International Computer Science Institute, Berkeley, CA, 1990.

[Brow87] Brown, J. R. and S. F. Vanable, "Artificial Neural Network on a SIMD Architecture," Proceedings of the 2nd Symposium on the Frontiers of Massively Parallel Computation, pp. 127-136, 1987.

[Chin90] Chinn, G., et al., "Systolic Array Implementation of Neural Nets on the MasPar MP-1 Massively Parallel Processor," Int. Conference on Neural Networks, San Diego, CA, Vol. 2, pp. 169-173, 1990.

[Chu92] Chu, L. C. and B. W. Wah, "Optimal Mapping of Neural-Network Learning on Message-Passing Multicomputers," Journal of Parallel and Distributed Computing 14, pp. 319-339, 1992.

[Dura89] Duranton, M. and J. A. Sirat, "Learning on VLSI: A General Purpose Digital Neurochip," International Conference on Neural Networks, Washington, DC, 1989.

[Ghos89] Ghosh, J. and K. Hwang, "Mapping Neural Networks onto Message-Passing Multicomputers," Journal of Parallel and Distributed Computing 6, pp. 291-330, 1989.

[Hamm90] Hammerstrom, D., "A VLSI Architecture for High-Performance, Low-Cost, On-Chip Learning," International Joint Conference on Neural Networks, Vol. 2, pp. 537-543, 1990.

[Hert91] Hertz, J., A. Krogh and R. G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, 1991.

[Hira90] Hiraiwa, A., et al., "A Two Level Pipeline RISC Processor Array for ANN," International Joint Conference on Neural Networks, pp. 137-140, 1990.


[Hope82] Hopfield, J. J., "Neural Networks and Physical Systems with Emergent Collective Computational Abilities," Proceedings of the National Academy of Sciences, USA 79, pp. 2554-2558, 1982; reprinted in 1988.

[Hwan89] Hwang, J. N. and S. Y. Kung, "Parallel Algorithms/Architectures for Neural Networks," Journal of VLSI Signal Processing, 1989.

[Kim89] Kim, K. and K. P. Kumar, "Efficient Implementation of Neural Networks on Hypercube SIMD Arrays," International Joint Conference on Neural Networks, 1989.

[Krik90] Krikelis, A. and M. Grozinger, "Implementing Neural Networks with the Associative String Processor," Int. Workshop for Artificial Intelligence and Neural Networks, Oxford, 1990.

[Kung88] Kung, S. Y., "Parallel Architectures for Artificial Neural Nets," International Conference on Systolic Arrays, pp. 163-174, 1988.

[Leig92] Leighton, F. T., Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992.

[Lin91] Lin, W., V. K. Prasanna and K. W. Przytula, "Algorithmic Mapping of Neural Network Models onto Parallel SIMD Machines," IEEE Transactions on Computers, 1991.

[Madr91] Madraswala, T. H., et al., "A Reconfigurable ANN Architecture," International Symposium on Circuits and Systems, 1991.

[Morg92] Morgan, N., et al., "The Ring Array Processor: A Multiprocessing Peripheral for Connectionist Applications," Journal of Parallel and Distributed Computing 14, pp. 248-259, 1992.

[Rama92] Ramacher, U., "SYNAPSE - A Neurocomputer That Synthesizes Neural Algorithms on a Parallel Systolic Engine," Journal of Parallel and Distributed Computing 14, pp. 306-318, 1992.

[Rose87] Rosenberg, C. R. and G. Blelloch, "An Implementation of Network Learning on the Connection Machine," Proc. 10th Int. Conference on AI, Milan, Italy, pp. 325-340, 1987.

[Rume86] Rumelhart, D. E., G. E. Hinton and R. J. Williams, "Learning Representations by Back-Propagating Errors," Nature 323, pp. 533-536, 1986.

[Sham90] Shams, S. and W. Przytula, "Mapping of Neural Networks onto Programmable Parallel Machines," IEEE International Symposium on Circuits and Systems, New Orleans, LA, May 1990.

[Tomb88] Tomboulian, S., "Overview and Extensions of a System for Routing Directed Graphs on SIMD Architectures," Frontiers of Massively Parallel Processing, 1988.

[Wah90] Wah, B. W. and L. Chu, "Efficient Mapping of Neural Networks on Multicomputers," Int. Conf. on Parallel Processing, Pennsylvania State Univ. Press, Vol. I, pp. 234-241, 1990.

[Zhan89] Zhang, X., et al., "An Efficient Implementation of the Back-Propagation Algorithm on the Connection Machine CM-2," Neural Information Processing Systems 2, pp. 801-809, 1989.