
Compact yet efficient hardware implementation of artificial neural networks with customized topology

Nadia Nedjah a,*,1, Rodrigo Martins da Silva a,1, Luiza de Macedo Mourelle b,1

a Department of Electronics Engineering and Telecommunication, Faculty of Engineering, State University of Rio de Janeiro, Brazil
b Department of Systems Engineering and Computation, Faculty of Engineering, State University of Rio de Janeiro, Brazil

Article info

Keywords: Artificial neural networks; Hardware for neural networks; Topology; Sigmoid activation function; Parallelism; FPGA

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2012.02.085

* Corresponding author.
E-mail addresses: [email protected] (N. Nedjah), [email protected] (R.M. da Silva), [email protected] (L. de Macedo Mourelle).
1 The authors are grateful to FAPERJ and CNPq for their financial support.

Abstract

There are several neural network implementations using either software, hardware-based or hardware/software co-design approaches. This work proposes a hardware architecture to implement an artificial neural network (ANN), whose topology is the multilayer perceptron (MLP). In this paper, we explore the parallelism of neural networks and allow on-the-fly changes of the number of inputs, number of layers and number of neurons per layer of the net. This reconfigurability characteristic permits any application of ANNs to be implemented using the proposed hardware. In order to reduce the processing time spent in arithmetic computation, a real number is represented using a fraction of integers. In this way, the arithmetic is limited to integer operations, performed by fast combinational circuits. A simple state machine is required to control sums and products of fractions. The sigmoid is used as the activation function in the proposed implementation. It is approximated by polynomials, whose underlying computation requires only sums and products. A theorem is introduced and proven so as to cover the arithmetic strategy of the computation of the activation function. Thus, the arithmetic circuitry used to implement the neuron weighted sum is reused for computing the sigmoid. This resource sharing decreases drastically the total area of the system. After modeling and simulation for functionality validation, the proposed architecture was synthesized using reconfigurable hardware. The results are promising.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

An artificial neural network (ANN) is an attractive tool for solving problems related to pattern recognition (Sibai, Hosani, Naqbi, Dhanhani, & Shehhi, 2011), generalization (Ahmadi, Mattausch, Abedin, Saeidi, & Koide, 2011), prediction (Falavigna, 2012), function approximation, engineering (García-Fernández et al., 2004; Kashif, Saqib, & Zia, 2011), optimization and non-linear system behavior mapping (Nedjah, Silva, & Mourelle, 2012). When dealing with an ANN implementation, systems based on hardware are usually faster than software alternatives (Dias, Antunes, & Mota, 2004; Omondi & Rajapakse, 2008; Rojas, 1996; Zhu & Sutton, 2003; Zurada, 1992).

When a particular task does not require so much speed, a software-based neural network system can be sufficient and satisfactory in running the task, through a PC or a general-purpose processor. ANN systems based on software do not demand much design effort. On the other hand, ANNs provide an adequate research field for applying parallel computation and, of course, this parallelism can be best exploited in a hardware-based implementation (Chen, 2003; Lin & Lee, 2011; Omondi & Rajapakse, 2008). So, a hardware architecture can be devised in order to use the massive parallelism provided by the neuron-layer computation. Circuit components can be designed and adequately mapped to exploit details of both the arithmetic computation and the control process.

The hardware designed in this work can be used by any neural network application. It supports ANNs with different numbers of layers, neurons per layer and inputs. This is one of the proposed hardware's main features: flexibility through on-the-fly reconfigurability. To perform a multilayer perceptron (MLP) neural network, the hardware requires the following parameters:

(1) The number of inputs of the network. Let imax be this number.
(2) The number of layers of the network. Let lmax be this number.
(3) The number of neurons per layer. Let ni be this number, where i represents the ith layer: i = 1, 2, ..., lmax.
(4) If at least one neuron of a certain layer is going to operate with a bias, the parameter bias of such layer must be on.

A neural network usually has more than one layer. Nevertheless, the proposed hardware provides only one single physical layer, a hardware layer, which performs the entire computation due to all the layers of the ANN by reusing the neurons of this unique physical layer. In this context, the neural network layers are named virtual layers. Note that this is done without loss of performance, as the computations due to the ANN layers are data-dependent and thus need to be executed sequentially. This strategy reduces the designed circuit area. Note that there exists an overhead due to the control required to use a single physical layer. However, the time spent is minimal and so has very little impact on the overall performance of the ANN hardware. The time spent computing the weighted sum and activation function is far longer than that spent controlling the layers' computation. Moreover, the neurons of the physical layer operate in parallel to perform the required computation. For instance, weighted sums are computed by all hardware neurons at the same time, and so is the case for the computation of the activation function. Hence, the overall processing time of the ANN is also reduced.

Fig. 1. Binary representation of a fraction.
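To make the virtual-layer idea concrete, the following C sketch (a simplified software model under our own assumptions, not the VHDL of the actual design) mimics how a single physical layer of hardware neurons is reused to evaluate every layer of an MLP sequentially, while the neurons of that layer work in parallel on the same inputs. The constant NMAX, the function names and the array layout are illustrative.

```c
#include <math.h>
#include <stddef.h>

#define NMAX 8   /* neurons in the single physical layer (illustrative value) */

static double act(double v) { return 1.0 / (1.0 + exp(-v)); }  /* sigmoid */

/* Evaluate an MLP by reusing one "physical layer".
 * layers  - number of virtual layers; n[k] - neurons in virtual layer k
 * w[k][j] - weights of neuron j in layer k (index 0 holds the bias)
 * x       - network inputs; y - network outputs
 * Assumes inputs and every n[k] do not exceed NMAX.                       */
void mlp_forward(size_t layers, const size_t n[], size_t inputs,
                 double w[][NMAX][NMAX + 1], const double x[], double y[])
{
    double in[NMAX], out[NMAX];
    size_t k, j, i, fan_in = inputs;

    for (i = 0; i < inputs; i++) in[i] = x[i];

    for (k = 0; k < layers; k++) {                 /* virtual layers: sequential   */
        for (j = 0; j < n[k]; j++) {               /* neurons: parallel in hardware */
            double v = w[k][j][0];                 /* bias                          */
            for (i = 0; i < fan_in; i++)
                v += w[k][j][i + 1] * in[i];       /* weighted sum                  */
            out[j] = act(v);
        }
        for (j = 0; j < n[k]; j++) in[j] = out[j]; /* feed outputs back as inputs   */
        fan_in = n[k];
    }
    for (j = 0; j < n[layers - 1]; j++) y[j] = out[j];
}
```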

Whenever digital hardware is designed with similar circuit blocks, it reveals a feature which is very attractive for implementation in a Field Programmable Gate Array (Wolf, 2004). Since all hardware neurons are digital circuits that are literally identical, the synthesis of the hardware layer in an FPGA is easily and best achieved.

In this work, a real number is represented as a fraction of integers (Santi-Jones & Gu, 2008). Floating-point representation based on the IEEE-754 standard (Tanenbaum, 2007) is not used here. Mathematical operations, such as sums and multiplications, using floating-point numbers require specific routines and, in terms of hardware design, a circuit that needs a very large silicon area to be implemented. Long computing times are also demanded (Nedjah, Martins, Mourelle, & Carvalho, 2008).

This work aims at optimizing arithmetic computation for a compact yet efficient implementation of ANNs. A sum or multiplication of two fractions of integers can be split into simple operations over integer numbers. Integer-based operations are achieved using combinational circuits (Uyemura, 2002). Thus, less complex yet more efficient hardware is used and a simple finite state machine rules those integer operations.

A neuron weighted sum is a set of sums and multiplications of fractions. The activation function used in this work is the logistic sigmoid, which is approximated by quadratic polynomials. These are derived from a curve-fitting method: least mean squares. This is done in contrast with using a lookup table, which is known to compromise the result rendered by the neuron. The quadratic polynomial-based approximation yields a far more precise result.

Provided that the sigmoid is approximated by second-degree polynomials, only sums and multiplications are needed to get the final result. Thereby, the circuit, once designed to compute weighted sums, can be reused for computing the sigmoid function, without extra effort or cost. This strategy spared unnecessary circuitry, which would have led to area expansion, and was motivated by increasing the precision of the neuron's rendered results (Martins, Nedjah, & Mourelle, 2009).

This paper is organized as follows: First, in Section 3, we describe the data representation used in this design as well as most arithmetic operations. Then, in Section 4, we describe the sigmoid computation. Subsequently, in Section 5, the overall hardware architecture and controllers are presented. Therein, in Section 5.1, the hardware neuron layer is detailed, including a full description of the neuron circuit design in Section 5.1.1. We carry on presenting the control unit of the hardware layer in Section 5.2, and the clock generator in Section 5.3. Thereafter, in Section 6, we introduce the software system used to allow the implementation of the reconfigurability feature of the proposed architecture. After that, in Section 7, we report some simulation and synthesis results and discuss them. Finally, in Section 8, we draw some conclusions about the reported work and point out some future directions and improvements.

2. Related work

Research in the area of neural networks has been ongoing for over two decades now and hence there are many related works published, including several surveys (Lindsey & Lindblad, 1994; Moerland & Fiesler, 1997; Rojas, 1996). In Zhang and Pal (1992), the authors report on an efficient systolic implementation of ANNs. In Kung (1988) and Kung and Hwang (1989), the authors describe a novel scheme for designing special-purpose systolic ring architectures to simulate the feed-forward stage of artificial neural networks. In Kung (1988), the authors present interesting results on implementing the back-propagation algorithm on the CMU Warp. In Ferrucci (1994), the author describes a multiple-chip implementation of ANNs, using basic building blocks, such as multipliers and adders. In Nedjah, Martins, Mourelle, and Carvalho (2009), the authors also take advantage of the MAC (multiply-and-accumulate) hard cores implemented in the fabric of the FPGA to efficiently implement the sums and products that are necessary in the ANN underlying computations. In Beuchat, Haenni, and Sanchez (1998), the authors developed an FPGA platform, called RENCO – a REconfigurable Network COmputer. In Bade and Hutchings (1994) and Nedjah and Mourelle (2007), the authors report implementations of stochastic neural networks based on FPGAs. Both implementations result in a very compact circuit. In Zhang et al. (1990), the authors present an efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Botros and Abdul-Aziz (1994), the authors introduce a system for the feed-forward recall phase and implement it on an FPGA. In Linde, Nordstrom, and Taveniku (1992), the authors describe REMAP, which is an implementation of a whole neural computer using only FPGAs. In Gadea et al. (2000), the authors report on a pipelined implementation of an on-line back-propagation network using an FPGA. In Canas, Ortigosa, Ros, and Ortigosa (2008), the authors propose a hardware implementation of ANNs, where the activation function is discretized and stored in a lookup table.

Many analog hardware implementations of ANNs have also been reported in the literature. In general, such implementations are very fast, dense and low-power when compared to digital ones, but they come along with precision, data storage, robustness and learning problems, as shown in Holt and Baker (1991), Nedjah, Martins, and Mourelle (2011), and Choi, Ahn, and Lee (1996). They are an expensive and inflexible solution, as is any ASIC (Montalvo, Gyurcsik, & Paulos, 1997).

3. Numeric representation

A neural network operates with real numbers. Fixed-point representation implies a great accuracy loss. Floating-point notation (IEEE-754) offers good precision, but requires a considerable silicon area and a considerable time for arithmetic computing.

Searching for a speed and circuit area trade-off, the alternative chosen to represent a real number was the fractional fixed point, where a fraction of integers is used to represent a real number (Santi-Jones & Gu, 2008). This model is depicted in Fig. 1, showing the binary structure of a general fraction. This piece of data has 33 bits: a 16-bit numerator and a 17-bit denominator, whose most significant bit holds the algebraic sign of the fraction.

A real floating-point number converted to a fraction is shown in (1). Such a fraction in its binary structure is also shown in Fig. 2.

$$-0.003177124702144560 \approx \frac{12}{-3777} \quad (1)$$

Regarding real numbers a and b, Eqs. (2)–(4) display most of the arithmetic operations performed by the hardware. The neuron weighted sum and the sigmoid computation are based on sums or subtractions and multiplications of fractions:

$$a + b \approx \frac{N_a}{D_a} + \frac{N_b}{D_b} = \frac{N_a D_b + D_a N_b}{D_a D_b} \quad (2)$$

$$a - b \approx \frac{N_a}{D_a} - \frac{N_b}{D_b} = \frac{N_a D_b - D_a N_b}{D_a D_b} \quad (3)$$

$$a \times b \approx \frac{N_a}{D_a} \times \frac{N_b}{D_b} = \frac{N_a N_b}{D_a D_b} \quad (4)$$

There is an advantage in using fractions: a sum or a multiplication of two fractions is achieved through a mere sequence of integer operations, which require simple combinational circuits (Uyemura, 2002).

A sum of two fractions, for instance, demands 3 multiplications and 1 addition of integers: an unsophisticated finite-state machine is used to command the combinational computing sequence. Combinational adders and multipliers provide attractive response timing and are easily allocated on FPGAs (Wolf, 2004).
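A minimal C sketch of the fraction data type and of the operations in Eqs. (2)–(4) is given below. It uses wide intermediate integers instead of the 16/17-bit fields of Fig. 1, so the framing step of Section 3.1 is left out here; the names frac_t, frac_add, frac_sub and frac_mul are illustrative and not taken from the original design.

```c
#include <stdint.h>
#include <stdio.h>

/* Real number approximated by a fraction of integers, as in Fig. 1
 * (here with wide fields; the hardware uses a 16-bit N and a 17-bit signed D). */
typedef struct { int64_t num; int64_t den; } frac_t;

/* Eq. (2): a + b = (Na*Db + Da*Nb) / (Da*Db) */
static frac_t frac_add(frac_t a, frac_t b) {
    frac_t r = { a.num * b.den + a.den * b.num, a.den * b.den };
    return r;
}

/* Eq. (3): a - b = (Na*Db - Da*Nb) / (Da*Db) */
static frac_t frac_sub(frac_t a, frac_t b) {
    frac_t r = { a.num * b.den - a.den * b.num, a.den * b.den };
    return r;
}

/* Eq. (4): a * b = (Na*Nb) / (Da*Db) */
static frac_t frac_mul(frac_t a, frac_t b) {
    frac_t r = { a.num * b.num, a.den * b.den };
    return r;
}

int main(void) {
    frac_t a = { 12, -3777 };                       /* the example of Eq. (1) */
    frac_t b = { 1, 2 };
    frac_t s = frac_add(a, b), d = frac_sub(a, b), p = frac_mul(a, b);
    printf("a+b = %lld/%lld = %f\n", (long long)s.num, (long long)s.den,
           (double)s.num / (double)s.den);
    printf("a-b = %lld/%lld = %f\n", (long long)d.num, (long long)d.den,
           (double)d.num / (double)d.den);
    printf("a*b = %lld/%lld = %f\n", (long long)p.num, (long long)p.den,
           (double)p.num / (double)p.den);
    return 0;
}
```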

3.1. Adaptive number framing technique

A fraction that results from a sum or product of two other fractions might require a bit range that exceeds the width of the binary structure of Fig. 1. Repeated fraction operations would demand an unlimited number of bits to keep up with the highest result precision. This is certainly impracticable. One possible and common solution is truncation.

For instance, consider two fractions which fit into the binary structure of Fig. 1, such that their multiplication results in the fraction Num/Den = 450023/1279030. Neither the numerator nor the denominator would fit in the notation of Fig. 1, as 450023 > 65535 and 1279030 > 65535. Therefore, such a fraction must be adjusted to fit into the binary limitation imposed by the hardware implementation.

The multiplication of two fractions (each one in the form of Fig. 1) generates, in general, a fraction of 64 bits, whose numerator and denominator are of 32 bits each. The proposed hardware does not support this bit length and a framing technique must be thought of, aiming at minimizing the loss of accuracy.

An easy truncation or framing that could be done on 450023/1279030 would be to take only the least significant sixteen bits from the numerator and from the denominator, as shown in Fig. 3. This is termed direct framing (the simplest one).

Fig. 2. Example of a number represented using a fraction.

The framing depicted in Fig. 3, performed over the fraction 450023/1279030 = 0.35184..., yields 6/19 = 0.31578.... This technique implies a high precision loss. In this work, an adaptive framing technique is used, where successive one-bit right-shifts of the binary representations of both the numerator and the denominator are performed, until the fraction under consideration is framed into the width of the representation of Fig. 1, which is the fraction's default binary length.

Algorithm 1 describes the steps of the proposed framing technique. It performs the adjustment of Num/Den to produce N/D, which fits in the default structure used in this design. Note that Num is a natural number of a > 16 bits and Den ≠ 0 is an integer of b > 17 bits. Recall that the MSB of Den is the fraction sign bit. Numerator N has 16 bits and denominator D has 17 bits; the MSB of D is the sign bit of the resulting fraction. Note that the representable range is from −65535/1, through 0/1, up to +65535/1. Hence, when the framing operation reaches an all-zero denominator, the largest possible integer is used; see lines 10 and 11 of Algorithm 1.

Algorithm 1. Framing technique

Require: a; b; Num[a−1..0]; Den[b−1..0]
Ensure: N[15..0]; D[16..0]

auxN[a−1..0] ← Num[a−1..0];
auxD[b−2..0] ← Den[b−2..0];
repeat
  if Or(auxN[(a−1)..16]) or Or(auxD[(b−2)..16]) then
    EndFraming ← false;
    RightShift auxN[a−1..0];
    RightShift auxD[b−2..0];
    if Nor(auxD[b−2..0]) then
      EndFraming ← true;
      auxN[15..0] ← 1111..11;  // 2^16 − 1
      auxD[15..0] ← 0000..01;  // 1
    end if
  else
    EndFraming ← true;
  end if
until EndFraming;
N[15..0] ← auxN[15..0];
D[15..0] ← auxD[15..0];
D[16] ← Den[b−1];
return N[15..0], D[16..0]
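A C sketch of Algorithm 1 follows, assuming the 16-bit numerator / 17-bit signed denominator layout of Fig. 1; the function and variable names are ours, not taken from the original VHDL.

```c
#include <stdint.h>

/* Adaptive framing (Algorithm 1): right-shift numerator and denominator
 * together until both fit the default fraction width of Fig. 1
 * (16-bit numerator, 16-bit denominator magnitude plus a sign bit).     */
static void frame_fraction(uint64_t num, uint64_t den_mag, int den_sign,
                           uint16_t *n, uint16_t *d_mag, int *d_sign)
{
    while (num > 0xFFFFu || den_mag > 0xFFFFu) {
        num     >>= 1;               /* one-bit right shift = integer div 2 */
        den_mag >>= 1;
        if (den_mag == 0) {          /* saturation case of Algorithm 1:     */
            num = 0xFFFFu;           /*   largest representable integer     */
            den_mag = 1u;
            break;
        }
    }
    *n = (uint16_t)num;
    *d_mag = (uint16_t)den_mag;
    *d_sign = den_sign;              /* sign bit copied unchanged           */
}
```

For instance, frame_fraction(450023, 1279030, 0, ...) performs five one-bit shifts and returns 14063/39969, matching the adaptive framing example of Fig. 4.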

Adaptive framing of the fraction 450023/1279030 is illustrated in Fig. 4. In this fraction, both the numerator and denominator go through five right shifts of one bit in order to frame the fraction within the binary structure of Fig. 1, i.e. a numerator of 16 bits and a denominator of 17 bits, including a sign bit. This method is worthwhile because it minimizes the loss of accuracy. Note that the fraction 450023/1279030, which evaluates precisely to 0.3518471028, results in 0.3157894736 with direct standard framing, while it yields 14063/39969, which is equivalent to 0.3518476819, using adaptive framing. Note that an overall evaluation of the precision loss throughout the computational process depends on the specific composition of the bits that are being shifted out from the numerator and denominator.

Fig. 3. Direct framing.

Fig. 4. Adaptive framing.

3.2. Precision of the adaptive framing

The loss of precision that is occasioned by a single iteration of the framing procedure described in Algorithm 1 depends on the bit that is shifted out from the numerator and the corresponding bit shifted out from the denominator. Considering one framing iteration, there are 4 possible cases, as described below.

(1) Both the numerator and denominator are even, i.e. Num = 2N and Den = 2D. Thus, there is no loss of precision, as explained in (5), wherein E00 is the error introduced by right-shifting both the numerator and denominator:

$$E_{00} = \frac{2N}{2D} - \frac{N}{D} = 0 \quad (5)$$

(2) Both the numerator and denominator are odd, i.e. Num = 2N + 1 and Den = 2D + 1. Thus, there is a loss of precision, as explained in (6), wherein E11 is the error introduced by right-shifting both the numerator and denominator:

$$E_{11} = \frac{2N + 1}{2D + 1} - \frac{N}{D} = \frac{D - N}{D(2D + 1)} \quad (6)$$

(3) The numerator is even, i.e. Num = 2N, and the denominator is odd, i.e. Den = 2D + 1. Thus, there is a loss of precision, as explained in (7), wherein E01 is the error introduced by right-shifting both the numerator and denominator:

$$E_{01} = \frac{2N}{2D + 1} - \frac{N}{D} = \frac{-N}{D(2D + 1)} \quad (7)$$

(4) The numerator is odd, i.e. Num = 2N + 1, and the denominator is even, i.e. Den = 2D. Thus, there is a loss of precision, as explained in (8), wherein E10 is the error introduced by right-shifting both the numerator and denominator:

$$E_{10} = \frac{2N + 1}{2D} - \frac{N}{D} = \frac{1}{2D} \quad (8)$$

Therefore, assuming that 0s and 1s are evenly distributed in a binary representation, the average error introduced by a single shift can be evaluated as shown in (9).

$$E_{avg} = \frac{4D - 4N + 1}{8D(2D + 1)} \quad (9)$$
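The four error cases (5)–(8) can be checked numerically with a short C test; the base values N and D below are arbitrary and only serve as an illustration.

```c
#include <stdio.h>

/* Error of one right shift of numerator and denominator:
 * Num/Den - (Num>>1)/(Den>>1), compared against Eqs. (5)-(8). */
int main(void) {
    long N = 123, D = 457;                        /* arbitrary base values */
    long num[4] = { 2*N, 2*N + 1, 2*N,     2*N + 1 };
    long den[4] = { 2*D, 2*D + 1, 2*D + 1, 2*D     };
    double expected[4] = {
        0.0,                                      /* E00, Eq. (5) */
        (double)(D - N) / (D * (2.0*D + 1)),      /* E11, Eq. (6) */
        (double)(-N)    / (D * (2.0*D + 1)),      /* E01, Eq. (7) */
        1.0 / (2.0 * D)                           /* E10, Eq. (8) */
    };
    for (int i = 0; i < 4; i++) {
        double err = (double)num[i] / (double)den[i]
                   - (double)(num[i] >> 1) / (double)(den[i] >> 1);
        printf("case %d: measured %.9f  expected %.9f\n", i, err, expected[i]);
    }
    return 0;
}
```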

The proposed adaptive framing technique provides adequate accuracy for the purpose of this work. The proposed hardware has to be equipped with a right-shifter to perform the fraction adjustment. A one-bit right-shift is an integer divide-by-2 operator. This property is also useful in the sigmoid computation, as the strategy used for computing the sigmoid was forged in order to take advantage of the shifter already included in the hardware. This will be explained in the next section.

4. Activation function

There are many commonly used functions which map the neuron output. The most common are the ramp function, as described in (10), the hyperbolic tangent, as defined in (11), and the sigmoid, as described in (12). The curves of these three activation functions are shown in Fig. 5.

$$y = \varphi(v) = \begin{cases} 1 & \text{if } v \geq b \\ \dfrac{v - a}{b - a} & \text{if } a < v < b \\ 0 & \text{if } v \leq a \end{cases} \quad (10)$$

$$\varphi(v) = b \cdot \frac{e^{av} - e^{-av}}{e^{av} + e^{-av}} \quad (11)$$

wherein a ≠ 0 and b ≠ 0.

$$\varphi(v) = \frac{1}{1 + e^{-av}} \quad (12)$$

The sigmoid is widely used in multilayer-perceptron neural networks (Haykin, 1999). The hardware presented in this paper computes the sigmoid of (13), where parameter a is set to 1 and v is the neuron weighted sum (including bias). Initially, through least mean squares, 3 quadratic polynomials are obtained, which fit the curve e^{−v}. Each polynomial approximates e^{−v} in a certain range of the domain v, as illustrated in Fig. 6 and defined in (14).

$$\varphi(v) = \frac{1}{1 + e^{-v}} \quad (13)$$

$$\exp(-v) \cong \begin{cases} P_{00}(v) & \text{if } v \in [0,2) \\ P_{01}(v) & \text{if } v \in [2,4) \\ P_{10}(v) & \text{if } v \in [4,8) \\ P_{11}(v) = 0 & \text{if } v \in [8,+\infty) \end{cases} \quad (14)$$

In Fig. 6, the approximation method generates the quadratic polynomials of (15) for exp(−v), wherein f_{[x,y)}(v) denotes the fitting polynomial on v ∈ [x,y):

$$\exp(-v) \cong \begin{cases} f_{[0,2)}(v) = 0.1987234v^2 - 0.8072780v + 0.9748092 \\ f_{[2,4)}(v) = 0.0268943v^2 - 0.2168304v + 0.4580097 \\ f_{[4,8)}(v) = 0.0016564v^2 - 0.0235651v + 0.0840553 \\ f_{[8,+\infty)}(v) = 0 \end{cases} \quad (15)$$
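As a reference for how coefficients such as those in (15) can be obtained, the following C sketch fits a quadratic to samples of e^{−v} on [0,2) by ordinary least squares (normal equations solved by Gaussian elimination). It is an offline helper under our own assumptions (uniform sampling, this sample count), so the coefficients it prints may differ slightly from those reported in (15).

```c
#include <math.h>
#include <stdio.h>

/* Least-squares fit of c0 + c1*v + c2*v^2 to e^{-v} sampled on [lo, hi). */
static void fit_quadratic(double lo, double hi, int samples, double c[3])
{
    double A[3][4] = {{0}};                 /* normal equations, augmented */
    for (int k = 0; k < samples; k++) {
        double v = lo + (hi - lo) * k / samples;
        double phi[3] = { 1.0, v, v * v };
        double y = exp(-v);
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) A[i][j] += phi[i] * phi[j];
            A[i][3] += phi[i] * y;
        }
    }
    for (int i = 0; i < 3; i++)             /* Gaussian elimination        */
        for (int r = i + 1; r < 3; r++) {
            double f = A[r][i] / A[i][i];
            for (int j = i; j < 4; j++) A[r][j] -= f * A[i][j];
        }
    for (int i = 2; i >= 0; i--) {          /* back substitution           */
        c[i] = A[i][3];
        for (int j = i + 1; j < 3; j++) c[i] -= A[i][j] * c[j];
        c[i] /= A[i][i];
    }
}

int main(void) {
    double c[3];
    fit_quadratic(0.0, 2.0, 1000, c);
    /* Coefficients comparable to f_[0,2)(v) in (15): c2*v^2 + c1*v + c0 */
    printf("c2 = %.7f  c1 = %.7f  c0 = %.7f\n", c[2], c[1], c[0]);
    return 0;
}
```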

We know that the ANN hardware operates with fractions, whose representation is depicted in Fig. 1. So, it is better to express the polynomials of (15) in the hardware representation. This is shown in (16):

$$\exp(-v) \cong \begin{cases} f_{[0,2)}(v) \approx \dfrac{12858}{64703}\left(\dfrac{N_v}{D_v}\right)^2 + \dfrac{11691}{-14482}\left(\dfrac{N_v}{D_v}\right) + \dfrac{56072}{57521} \\[6pt] f_{[2,4)}(v) \approx \dfrac{1046}{38893}\left(\dfrac{N_v}{D_v}\right)^2 + \dfrac{13883}{-64027}\left(\dfrac{N_v}{D_v}\right) + \dfrac{12560}{27423} \\[6pt] f_{[4,8)}(v) \approx \dfrac{63}{38032}\left(\dfrac{N_v}{D_v}\right)^2 + \dfrac{581}{-24655}\left(\dfrac{N_v}{D_v}\right) + \dfrac{456}{5425} \\[6pt] f_{[8,+\infty)}(v) \approx \dfrac{0}{1} \end{cases} \quad (16)$$

As an example, P01(v) refers to the polynomial which fits e^{−v} for v ∈ [2,4). This hardware deals only with binary structures which represent fractions. Thus, P01(v) is best expressed as in (17):

$$P_{01}\left(\frac{N_v}{D_v}\right) = \frac{N_v}{D_v}\left(\frac{1046}{38893} \cdot \frac{N_v}{D_v} + \frac{13883}{-64027}\right) + \frac{12560}{27423} \quad (17)$$

The weighted sum parameter is factored out to save one multiplication. Using polynomials, e^{−v} is computed by performing only multiplications and sums of real numbers, which are the basic operations of the neuron weighted sum. So, the weighted sum digital circuit is reused to compute e^{−v}.

Fig. 5. Common activation functions.

Fig. 6. Curve-fitting: quadratic polynomials and e^{−v}, for v ≥ 0.

The achieved precision is proportional to the highest degree of the exploited polynomial. Nevertheless, with higher degree polynomials, the required computation becomes more complex and thus the response time becomes longer. A second-degree polynomial provides reasonable accuracy on each range of (14), and yet does not slow down the hardware. The error imposed by this approximation was plotted and the result is shown in Fig. 7.

The domain ranges in (14) were chosen based on the fact that the borderline values of each range are powers of 2. A right-shifter (discussed previously) is used to frame a fraction into the binary structure of Fig. 1. During the activation function computation, the same right-shift register is reused in the selection of the adequate polynomial to compute e^{−v}, taking into account a certain range of v ≥ 0. In order to explain the polynomial selection procedure, non-negative weighted sums are considered initially: v ≥ 0.

Let Nv/Dv be the fraction notation of a neuron weighted sum. The hardware first checks whether Nv/Dv < 2. This comparison is equivalent to Nv/2 < Dv, since Nv/Dv is non-negative. The one-bit right-shifter is responsible for performing Nv div 2 and a combinational comparator provides the boolean result of Nv div 2 < Dv.

If Nv is odd, then Nv div 2 ≠ Nv/2. Nonetheless, Theorem 2 ensures that when Nv div 2 < Dv it immediately follows that Nv/2 < Dv, and vice versa.

If Nv div 2 < Dv holds, then P00 is the selected polynomial, because the weighted sum v ∈ [0,2). Otherwise, the hardware checks Nv/Dv < 4, which has the same result as the comparison Nv/4 < Dv. Two right-shifts are required to obtain Nv div 4 and another comparison is done: Nv div 4 < Dv. Theorem 2 still ensures that Nv div 4 < Dv ⟺ Nv/4 < Dv. If Nv div 4 < Dv holds, then P01 is the selected polynomial, as the weighted sum v ∈ [2,4). Otherwise, further shifts are performed until the adequate polynomial is reached, taking into account the corresponding range of v.
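The range test described above can be sketched in C as follows: the index of the polynomial in (14) is found by comparing shifted copies of Nv against Dv, i.e. exactly the integer operations (shift and compare) already available in the neuron. The function name and return convention are illustrative.

```c
#include <stdint.h>

/* Select the polynomial of Eq. (14) for a non-negative weighted sum Nv/Dv.
 * Returns 0 for [0,2), 1 for [2,4), 2 for [4,8) and 3 for [8,+inf).       */
static int select_polynomial(uint32_t Nv, uint32_t Dv)
{
    if ((Nv >> 1) < Dv) return 0;   /* Nv/Dv < 2 (Theorem 2: div == exact) */
    if ((Nv >> 2) < Dv) return 1;   /* Nv/Dv < 4                           */
    if ((Nv >> 3) < Dv) return 2;   /* Nv/Dv < 8                           */
    return 3;                       /* Nv/Dv >= 8, exp(-v) taken as 0      */
}
```

In the hardware, the three shifted values are produced by the single right-shift register through successive one-bit shifts, rather than by three separate shifters as the sketch might suggest.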

For the sake of clarity, before proving Theorem 2, we first prove Theorem 1, which establishes that, for P odd, P div 2 = P/2 − 0.5 < Q is equivalent to P/2 < Q.

Theorem 1. ∀P, Q ∈ ℕ*, wherein P is odd, we have P div 2 < Q ⟺ P/2 < Q.

Proof. P is odd, so it follows that P div 2 = P/2 − 1/2. Therefore, P div 2 < Q is equivalent to P/2 − 1/2 < Q, as P div 2 = P/2 − 0.5.

$$P \text{ div } 2 = \frac{P}{2} - \frac{1}{2} < Q \iff \frac{P}{2} < Q + \frac{1}{2} \iff P < 2Q + 1 \quad (18)$$

$$\frac{P}{2} < Q \iff P < 2Q \quad (19)$$

Fig. 7. Error introduced by the activation function approximation.

Considering (18) and (19), we need to prove that, for P odd, we always have P < 2Q + 1 ⟺ P < 2Q. Assuming Q ≠ 0, we need to prove the following: P < 2Q + 1 → P < 2Q and P < 2Q → P < 2Q + 1.

(1) P < 2Q + 1 → P < 2Q: Since Q ≠ 0, 2Q + 1 is odd. For Q = 1, we have P < 3 → P < 2. As P is odd, P < 3 ensures that P < 2, since the largest odd integer smaller than 3 is P = 1. For Q > 1, the largest odd integer smaller than 2Q + 1 is P = 2Q + 1 − 2. Hence, for P = 2Q + 1 − 2 = 2Q − 1, we have P = 2Q − 1 < 2Q. Finally, for any P such that P < 2Q − 1, we have P < 2Q, as P < 2Q − 1 < 2Q. Therefore, P < 2Q + 1 → P < 2Q holds.

(2) P < 2Q → P < 2Q + 1: It is clear that, if P < 2Q, then P < 2Q + 1, as P < 2Q < 2Q + 1. This proves that P < 2Q → P < 2Q + 1 holds. □

Theorem 2. ∀P, Q ∈ ℕ*, we have P div 2^s < Q ⟺ P/2^s < Q, wherein s ∈ ℕ*.

Proof. The result of P div 2^s can be formulated in terms of P/2^s as shown in (20), wherein P mod 2^s is the remainder of the integer division of P by 2^s.

$$P \text{ div } 2^s = \frac{P}{2^s} - \frac{P \bmod 2^s}{2^s} \quad (20)$$

(1) For s = 1, Eq. (20) can be reduced to P div 2 = P/2 − (P mod 2)/2. In this case, for P even, we have P div 2 = P/2 − 0/2 = P/2 and, thus, P div 2 = P/2 < Q ⟺ P/2 < Q, ∀P, Q ∈ ℕ*. For P odd, Theorem 1 ensures that P div 2 = P/2 − 1/2 < Q ⟺ P/2 < Q, ∀P, Q ∈ ℕ*.

(2) For s > 1, we need to prove that, ∀P, Q, s ∈ ℕ*, wherein s > 1, Eq. (21) holds.

$$P \text{ div } 2^s = \frac{P}{2^s} - \frac{P \bmod 2^s}{2^s} < Q \iff \frac{P}{2^s} < Q \quad (21)$$

Using P div 2^s = P/2^s − (P mod 2^s)/2^s = c, where c ∈ ℕ, we have (22).

$$P = 2^s c + P \bmod 2^s \quad (22)$$

Comparing (22) to (21), the equivalence defined in (21) can be expressed as shown in (23).

$$c < Q \iff \frac{2^s c + P \bmod 2^s}{2^s} < Q \quad (23)$$


∀P, Q, s ∈ ℕ*, wherein s > 1 and c ∈ ℕ. We know that P mod 2^s ∈ {0, 1, 2, ..., 2^s − 1}, i.e. 0 ≤ P mod 2^s < 2^s. Using d = P mod 2^s, we have d ∈ ℕ and 0 ≤ d < 2^s. Replacing d in (23), we then have (24):

$$c < Q \iff \frac{2^s c + d}{2^s} < Q \quad (24)$$

∀P, Q, s ∈ ℕ*, where s > 1, c ∈ ℕ and d ∈ ℕ with 0 ≤ d = P mod 2^s < 2^s.

Observing the comparison operations in (24), we get (25) and (26).

$$c < Q \iff Q - c > 0 \quad (25)$$

$$\frac{2^s c + d}{2^s} < Q \iff d < 2^s(Q - c) \iff Q - c > \frac{d}{2^s} \quad (26)$$

Based on (25) and (26), the required proof, i.e. of the equivalence in (24), is completed if the propositions Q − c > 0 → Q − c > d/2^s and Q − c > d/2^s → Q − c > 0 hold, with the premises ∀P, Q, s ∈ ℕ*, where s > 1, c ∈ ℕ and d ∈ ℕ with 0 ≤ d = P mod 2^s < 2^s.

(a) Q − c > 0 → Q − c > d/2^s: Q is a non-zero natural and c is natural. So, Q − c is an integer. As d is a natural such that 0 ≤ d < 2^s, it follows that 0 ≤ d/2^s < 1. If Q − c is positive, then Q − c ≥ 1 and, therefore, Q − c > d/2^s, since 0 ≤ d/2^s < 1. So, Q − c > 0 ensures that Q − c > d/2^s, i.e. Q − c > 0 → Q − c > d/2^s holds.

(b) Q − c > d/2^s → Q − c > 0: Assuming 0 ≤ d/2^s < 1, it is clear that Q − c > d/2^s ensures Q − c > 0. Any integer Q − c larger than d/2^s, with d ∈ {0, 1, 2, ..., 2^s − 1}, is with no doubt positive (Q − c > 0). It follows that Q − c > d/2^s → Q − c > 0 holds. □
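Theorem 2 can also be checked exhaustively for small operands with a few lines of C; this is only a sanity check of the statement over a bounded range, not part of the proof.

```c
#include <stdio.h>

/* Exhaustive check of Theorem 2 for small P, Q, s:
 * (P div 2^s < Q)  <=>  (P/2^s < Q), where the right-hand side is tested
 * exactly as P < Q * 2^s.                                                */
int main(void) {
    for (int s = 1; s <= 8; s++)
        for (long P = 1; P <= 2000; P++)
            for (long Q = 1; Q <= 50; Q++) {
                int lhs = (P >> s) < Q;             /* integer division     */
                int rhs = P < Q * (1L << s);        /* exact rational test  */
                if (lhs != rhs) {
                    printf("counterexample: P=%ld Q=%ld s=%d\n", P, Q, s);
                    return 1;
                }
            }
    printf("Theorem 2 holds on the tested range.\n");
    return 0;
}
```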

Once the suitable polynomial is selected from (14), the hardware computes it and returns the resulting fraction Nfe/Dfe, which represents the fraction value of e^{−v}, for v ≥ 0. Thus, e^{−v} ≅ Nfe/Dfe, where v ≥ 0. Replacing Nfe/Dfe in the sigmoid of (13), we easily arrive at (28).

$$\varphi(v) \cong \frac{1}{1 + \frac{N_{fe}}{D_{fe}}} \quad \text{if } v \geq 0 \quad (27)$$

$$\varphi(v) \cong \frac{D_{fe}}{D_{fe} + N_{fe}} \quad \text{if } v \geq 0 \quad (28)$$

Whenever the weighted sum is negative, v < 0, a sigmoid property can be used: φ(v) = 1 − φ(−v), ∀v ∈ ℝ. This way, through (28), φ(v) is finally solved for v < 0, as shown in (29) and (30).

$$\varphi(v) \cong 1 - \frac{D_{fe}}{D_{fe} + N_{fe}} \quad \text{if } v < 0 \quad (29)$$

$$\varphi(v) \cong \frac{N_{fe}}{D_{fe} + N_{fe}} \quad \text{if } v < 0 \quad (30)$$

Note that, using the same approximation, we can exploit any of the commonly used activation functions that are based on the exponential function, such as the hyperbolic tangent, described earlier. The nice properties of the exponential function would be taken advantage of so as to reduce the necessary overall computation. The ramp function can also be easily used. It does not require any approximation. Its underlying computation requires a multiplication followed by an addition, in the general case. Two comparisons are needed to determine whether this computation is necessary; otherwise, either constant 0 or 1 is used instead. In order to accommodate a new activation function, the controlling sequence of this stage within the hardware must be slightly readjusted.
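Putting (14)–(30) together, a floating-point C model of the approximated sigmoid looks as follows. It uses the real-valued coefficients of (15) rather than the fraction pairs of (16), and the function names are ours; the hardware performs the same steps with fraction arithmetic and the shift-based range selection described above.

```c
#include <math.h>
#include <stdio.h>

/* exp(-v) approximated by the piecewise quadratics of Eq. (15), v >= 0. */
static double exp_neg_approx(double v)
{
    if (v < 2.0) return (0.1987234 * v - 0.8072780) * v + 0.9748092;
    if (v < 4.0) return (0.0268943 * v - 0.2168304) * v + 0.4580097;
    if (v < 8.0) return (0.0016564 * v - 0.0235651) * v + 0.0840553;
    return 0.0;
}

/* Sigmoid of Eq. (13) built from the approximation, using
 * phi(v) = 1 - phi(-v) for negative weighted sums (Eqs. (28)-(30)). */
static double sigmoid_approx(double v)
{
    double e = exp_neg_approx(v >= 0.0 ? v : -v);   /* e^{-|v|}          */
    double phi_pos = 1.0 / (1.0 + e);               /* Eq. (28)          */
    return v >= 0.0 ? phi_pos : 1.0 - phi_pos;      /* Eqs. (29)-(30)    */
}

int main(void) {
    for (double v = -6.0; v <= 6.0; v += 2.0)
        printf("v=%5.1f  approx=%.6f  exact=%.6f\n",
               v, sigmoid_approx(v), 1.0 / (1.0 + exp(-v)));
    return 0;
}
```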

5. Hardware architecture

The proposed architecture consists of two subsystems: the load and control system (LCS) and the ANN computing hardware (ANNCH). Component LCS loads and stores the data needed for a neural network application. The ANNCH includes the digital circuits that implement the neurons' hardware, which comprises logic and arithmetic computations as well as the underlying control flow. The block diagram of the overall hardware is depicted in Fig. 8.

The load and control system LCS, in Fig. 8, includes three memories: one for ANN inputs, another for weights and biases, and a third memory for the polynomial coefficients that allow computing the neuron activation function. Since the hardware is able to adapt itself to different MLP topologies, the number of inputs, weights and biases may be altered on-the-fly, enabling the reconfigurability feature of the hardware to execute different ANN-based applications.

The hardware synthesis in FPGA requires the exact sizing of the LCS memories. This will be explained in detail via an illustrative example in Section 6. The weight (and bias) memory is sized as in (31),

$$(i_{max} + 1)n + n(n + 1)(l_{max} - 1), \quad (31)$$

where imax is the maximum number of inputs and n is the maximum number of neurons per layer that the hardware is able to support. This is the actual number of neurons in the physical layer. Parameter lmax is the maximum number of layers the actual ANN configuration can include so as to be implemented in the proposed hardware. The activation function memory stores nine coefficients: three for each polynomial. Note that the proposed ANN hardware can accommodate any ANN topology that exploits at most n neurons in any of its layers. The sole modification that is required for different topologies consists of the size of the three data memories managed by component LCS. Their contents are provided by a software system.

LCS also controls the ANN application operation in ANNCH. A 33-bit data bus enables the necessary data flow to the ANNCH unit. A control bus is also available and establishes the communication between LCS and ANNCH. The LCS is the master controller of the data bus. It sends the required data upon ANNCH's requests.

In Fig. 8, ANNALU is the ANNCH arithmetic and logic unit and includes the digital neurons, whereby the weighted sums and sigmoid are computed. Neurons within the same layer of such an ANN application are computed by the hardware in parallel.

The control unit, ANNCU in Fig. 8, commands the digital neuron arithmetic only. This is executed by ANNALU and includes the weighted sum and activation function computation. A clock generator synchronizes the communication between ANNCU and ANNALU. The LCS, as master, triggers ANNCU, and the latter leads the whole computation corresponding to the current layer until the neuron outputs are ready. Afterwards, LCS restarts ANNCU to compute another ANN application layer; this process goes on until the neural network outputs are available on the yi data buses. In the following, i.e. Sections 5.1 and 5.2, we describe the architecture and operation of ANNALU and ANNCU, respectively.

5.1. Hardware layer: ANNALU

As mentioned previously, the hardware architecture is designed with only one physical layer. This is a set of digital hardware neurons that work in parallel and define the ANNALU, as shown in Fig. 9. Whenever an ANN is performed, the physical layer is reused for computing all the layers of the application neural network. If the hardware layer has nmax neurons, then the number of neurons in any of the ANN layers must not exceed nmax.

Fig. 8. Overall hardware architecture.

For instance, assuming that the kth layer of such an ANN application has 3 neurons, the LCS activates only hardware neurons 1, 2 and 3 of Fig. 9 to compute the kth layer. Next, when the (k + 1)th layer is going to be computed, outputs y1, y2 and y3 from neurons 1, 2 and 3, respectively, are fed back through registers Regyi, buffers and multiplexers. For instance, assuming that the (k + 1)th layer has two neurons, then only hardware neurons 1 and 2 would be switched on. Still in Fig. 9, neural network inputs flow via the same data bus and are sent to the digital neurons through x1, ..., xn. Registers Regwi store the weights and biases, also provided by the LCS via the data bus.

When computing a certain ANN application layer, the input is shared with all hardware neurons. However, each neuron has its own weight (for the same input), as shown in Fig. 10 for a three-neuron-layer example. The first product of Fig. 10 is computed at the same time for all hardware neurons, and this is done for all the remaining terms of the weighted sum and also for the activation function computation.

5.1.1. Hardware neuron model

The neuron architecture is illustrated in Fig. 11. As seen previously, the weighted sum and sigmoid computations require only sums and multiplications of fractions. These operations, in turn, consist of other simpler operations: multiplications and sums of integers, which are performed by the combinational circuits MULTIPLIER and ADDER, respectively, as Fig. 11 shows.

The two shift registers (ShiftReg1 and ShiftReg2) are included so as to adjust the fraction that results from a multiplication of two other fractions. This adjustment refers to the adaptive framing explained in Section 3.1. ShiftReg1 shifts the numerator while ShiftReg2 shifts the denominator of such a fraction.

In Fig. 11, ShiftReg3 is another shift register, which shifts the numerator of the fraction that results from an addition of two other fractions. ShiftReg2 is also used for shifting the denominator of such a fraction. The neuron weighted sum is accumulated in ShiftReg3, used for the numerator, and Reg4, used for the denominator.

The combinational circuits TwosComple1 and TwosComple2, when needed, perform the two's complement (Tanenbaum, 2007) on signed integers stored in ShiftReg1 and ShiftReg2, respectively.

During an addition of two fractions, there is a sum or subtraction of two signed integers: one stored in ShiftReg1 and the other in ShiftReg2.

A combinational comparator is necessary to select the adequate polynomial for computing the sigmoid function, as explained in Section 4. The arithmetic signs of Xi and Wi are processed by component ASPU (Arithmetic Signal Processing Unit) of Fig. 12 in order to predict the sign of the fraction obtained from multiplying or adding two given fractions. The ASPU also decides if an Adder operand will go through the two's complement (TwosComple1 and/or TwosComple2 of Fig. 11). The Adder component is responsible for the additions of the neuron's weighted sum. RShiftReg3 and Reg4 are assigned to accumulate the weighted sum as a fraction: the numerator in RShiftReg3 and the denominator in Reg4.

In Fig. 12, component NTC (Negative Transition Controller) releases the negative transition of Clk1 to the flip-flop D whenever signal E is set. ASPU works in parallel with the multiplications and additions of fractions during the weighted sum and activation function computation. All neurons of Fig. 9 work in parallel. While the computation of a given layer is being performed, the one related to the next layer is being initialized. As soon as the computation of all layers has been completed, ANNALU sends a signal to the ANNCU, informing it that the output of the neural network is available. The latter then informs the software sub-system that the whole computation is done.

During the weighted sum, for instance, register Reg1 is used for storing a neuron input, and registers Reg2 and Reg3 store, respectively, the numerator and denominator of the synaptic weight related to that input. Fig. 11 shows the fraction sign bit from Reg1 (input data) and the fraction sign bit from Reg3 (denominator of a weight) injected into the ASPU unit.

5.2. Control unit: ANNCU

A general view of the control unit ANNCU is depicted in Fig. 13. Processing an application neural network starts by obtaining the required data via the control block DI. The finite state machine (FSM) within this block is responsible for requesting from component LCS the inputs xi of the net. These data are then stored into specific registers in ANNALU. The state machine within block WSC controls the computation of the weighted sum through fraction sums and products, using the components provided in the neuron hardware described earlier.

Fig. 9. Hardware layer: ANNALU.

Once the weighted sum is yielded, the state machine within block AFC initiates the control of the same components available in the hardware layer so as to compute the activation functions of the neurons. Once this is done, the computation due to the current neuron layer has been achieved. If the current layer is not the last in the net, then the FSM implemented in block AFC sends the control back to that implemented by block DI to iterate the process once more. Otherwise, the output results are ready and, therefore, the control unit enters state End and waits for a reset trigger.

As shown in Fig. 13, the control unit ANNCU includes two FSMs: primary and secondary. The primary FSM consists of blocks DI, WSC and AFC as well as states Start and End. The secondary FSM is defined solely by the control within block LSW, which is responsible for preparing the synaptic weights for the next layer, if any, while and in parallel with the computation of the weighted sums due to the current layer. The current layer synaptic weights are stored in registers within the neuron hardware, allowing for this kind of parallelism. (Details of the FSM descriptions can be found in Martins (2010). These were not included here as this kind of detailed description is not necessary for understanding the overall behavior of the control unit.)

Fig. 10. Arithmetics of a three-neuron layer.

Fig. 11. Neuron architecture.

5.3. Clock generator

The macro architecture of Fig. 8 includes a clock generator, which yields two signals, Clk1 and Clk2. Both these signals implement the synchronization of the general activities of the ANNCH, which also involves both included units: ANNALU and ANNCU. The time diagram of the clock signals is shown in Fig. 14.


Fig. 12. Arithmetic signal processing unit, used to predict the sign of the fraction obtained from a sum or product of two other fractions.

Fig. 13. Control flow within ANNCU.


Component ANNCU consists mainly of a state machine that controls the operations performed by ANNALU. The state transitions in this machine occur during the positive transitions of clock signal Clk1. The time period tH1 is the time required for the stabilization of the output signals. The right-shift register used in the neuron micro-architecture operates in synchrony with the negative transition of clock signal Clk2. Note that during one clock cycle with respect to Clk1, four shifting operations may take place. This accelerates the time spent in shifting operations. Shifting operations are required so as to frame the results of additions and multiplications within the fractional data representation. The use of the faster clock Clk2 is only possible thanks to the reduced time of a shifting operation with respect to that required by the more complex operations performed by ANNALU.


Fig. 14. Clock signals used by the hardware sub-system ANNCH.


6. Software system

The software system is capable of resetting the hardware system at any moment. We view the software system as the master and the hardware as the slave component. The software system is a simple C program that, after setting the topology parameters, loops to feed the hardware, via the data bus, with the inputs, weights and biases required by the net's underlying computation. The software also defines, via the control bus, the topology of the network, i.e. the number of layers and the number of neurons per layer included in the hardware. Note that by a simple recompilation of the software system, one can accomplish the reconfigurability of the neural network topology and/or data settings and input. However, this is only possible when the number of required neurons in any layer of the new ANN either coincides with or is less than that of those already implemented in the physical layer of the existing hardware. Otherwise, the whole system, including the hardware, needs to be implemented from the start.

The first task that must be executed by the software system consists of setting up the following network parameters:

(1) The number of inputs. Recall that this number is represented by imax.
(2) The number of layers. Recall that this number is represented by lmax.
(3) The number of neurons per layer. Let ni be the number of neurons for layer i, with i = 1, 2, ..., lmax.
(4) The existence or not of a bias in each of the layers. If there exists at least one neuron, located in layer i, that makes use of a given bias, then the parameter biasi is set, i.e. biasi = 1; otherwise, biasi = 0.

Fig. 15. Example of MLP neural network with six neurons.

Two data memories are used to store the inputs and the synaptic weights as well as the biases. Fig. 15 shows a typical example of an MLP neural network, which is used to show how the software subsystem acts. The ANN has imax = 2 inputs and lmax = 3 layers, whereby there are n1 = 3 neurons in the first layer, n2 = 2 in the second and n3 = 1 in the third and last layer. In this example, only the second neuron of the first layer has a non-zero bias.

Tables 1 and 2 represent the memory of synaptic weights and biases, configured by the software system with respect to the neural network example of Fig. 15. The memory presents a limit on the available entries, which corresponds to the maximal number of neurons per layer that the hardware supports. It is assumed that the ANN hardware can accommodate an ANN of at most seven neurons per layer, i.e. nmax = 7, at most five inputs, i.e. imax = 5, and, as laid out in Tables 1 and 2, at most four layers, i.e. lmax = 4. Any ANN of any topology that requires at most seven neurons per layer and five inputs can be accommodated with no extra effort. For the example of the ANN in Fig. 15, the layer that has the most neurons is the first one, with n1 = 3 ≤ nmax = 7 neurons. Therefore, this ANN can be accommodated with no new hardware configuration. Note that this is simply an illustrative example with illustrative setting parameters, such as a maximum of 7 neurons per layer and a maximum of 5 inputs. The data are used to configure the weight and bias memory presented in Tables 1 and 2.

The main strategy of the proposed hardware implementation consists of the usage of a unique physical layer that performs all the computations involved in the several layers of the ANN. Using a feed-forward process, the physical layer is able to process all the required computation. Note that, on the one hand, this does not deteriorate the performance of the computational process, as a subsequent layer needs to wait for the result obtained by the previous one to start its underlying computation.


Table 1
Memory for weights and biases of the first layer for the ANN of Fig. 15.

1st Layer, 1st Neuron (11):   address 1: w(11)1;  2: w(11)2;  3: w(11)0 = 0;  4–6: X
1st Layer, 2nd Neuron (12):   address 7: w(12)1;  8: w(12)2;  9: w(12)0 = b12;  10–12: X
1st Layer, 3rd Neuron (13):   address 13: w(13)1;  14: w(13)2;  15: w(13)0 = 0;  16–18: X
1st Layer, 4th–7th Neurons (14–17):   addresses 19–42: X

Table 2
Memory for weights and biases of the hidden and output layers for the ANN of Fig. 15.

2nd Layer, 1st Neuron (21):   address 43: w(21)1;  44: w(21)2;  45: w(21)3;  46: w(21)0 = 0;  47–50: X
2nd Layer, 2nd Neuron (22):   address 51: w(22)1;  52: w(22)2;  53: w(22)3;  54: w(22)0 = 0;  55–58: X
2nd Layer, 3rd–7th Neurons:   addresses 59–98: X
3rd Layer, 1st Neuron (31):   address 99: w(31)1;  100: w(31)2;  101: w(31)0 = 0;  102–106: X
3rd Layer, 2nd–7th Neurons:   addresses 107–154: X
4th Layer, 1st–7th Neurons:   addresses 155–210: X

Fig. 16. The MicroBlaze processor connected to the ANN hardware co-processor.


On the other hand, the silicon area required for the ANN hardware implementation will be reduced to a minimum.

In order to understand the sizing of the weight and bias memory for a given ANN, let imax be the number of inputs, lmax the number of layers and ni = n the number of neurons per layer, whereby i = 1, 2, ..., lmax. Let us also consider that all lmax × n neurons have a bias. Each neuron of the first layer receives imax inputs and so requires imax weights augmented by one bias. Therefore, the total number of data items (weights and biases) that the memory must accommodate for the first layer of the ANN is defined in (32). For instance, in the case of the previously used example, this part of the memory must have 42 words.

$$n \times i_{max} + 1 \times n = n(i_{max} + 1) \quad (32)$$

On the other hand, each neuron that does not belong to the first layer, i.e. either in the hidden layers (i = 2, 3, ..., lmax − 1) or in the output layer (i = lmax), receives at most n inputs augmented with one bias. Each of these inputs is the output of a previous-layer neuron. Therefore, the total number of weights and biases required in the considered layers is defined as in (33). For instance, in the case of our example, this part of the memory must have 168 words.

$$(n \times n + 1 \times n)(l_{max} - 1) = n(n + 1)(l_{max} - 1) \quad (33)$$

So, the whole memory must have a total size defined in (34). For the example, the size sums up to 210 words.

$$n(i_{max} + 1) + n(n + 1)(l_{max} - 1) \quad (34)$$
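The sizing rule of (32)–(34) translates directly into a small C helper. Applied to the configuration assumed for the illustrative example (n = 7 neurons per layer, imax = 5 inputs, lmax = 4 layers, our reading of Tables 1 and 2), it returns the 210 words mentioned above.

```c
#include <stdio.h>

/* Size of the weight-and-bias memory, Eq. (34):
 *   n*(imax + 1)          words for the first layer    (Eq. (32))
 *   n*(n + 1)*(lmax - 1)  words for the other layers   (Eq. (33)) */
static unsigned weight_memory_words(unsigned n, unsigned imax, unsigned lmax)
{
    return n * (imax + 1) + n * (n + 1) * (lmax - 1);
}

int main(void) {
    /* Configuration assumed for Tables 1 and 2 of the example. */
    printf("%u words\n", weight_memory_words(7, 5, 4));   /* prints 210 */
    return 0;
}
```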

Tables 1 and 2 show how the synaptic weights and biases of the ANN in Fig. 15 are organized within the available memory space. Observe that the tables present a total of 210 entries and that the data are organized in well-defined groups. Each group includes the weights and bias, if used, of one neuron. The bias is always stored at the position next to that used to store the last weight of the neuron at hand. The non-used positions, if any, are shown with the don't-care symbol X, so that whichever data is saved in those positions will not affect the performance of the system as a whole. The number of available entries per neuron in the input layer is imax + 1 = 6, but the ANN of Fig. 15 uses only three of them: two weights, as the network has two inputs, and one bias. In the subsequent layers (second and third), however, the neurons have n + 1 = 8 entries available, but the ANN requires only n1 + 1 = 4 and n2 + 1 = 3 of those positions for the neurons in the second and third layer, respectively. In the example, there is no fourth layer.

7. Performance results

The architecture is entirely described in VHDL. It was simulated in ModelSim XE 6.3c. The detailed VHDL code can be found in Martins (2010). Simulation snapshots, accompanied by detailed explanations of the computations done throughout a layer of the ANN hardware, are also given in Martins (2010).

The VHDL specification of the proposed ANN hardware was synthesized for the Xilinx Virtex-5 XC5VFX70T FPGA, through the Xilinx XST synthesis tool (ISE Design Suite 11.1). This synthesis tool allows for the integration of software/hardware co-designs. The ANN hardware sub-system was implemented through automatic synthesis and the software sub-system was implemented in the C language and executed by a MicroBlaze processor. The MicroBlaze (Xilinx, 2008) is a RISC microprocessor IP from Xilinx, which can be synthesized in reconfigurable devices. Although we had access to an embedded PowerPC core, only the MicroBlaze offers a low-latency point-to-point communication link, known as Fast Simplex Link (FSL) (Xilinx, 2009), which allows the connection of a component, identified as a co-processor, to the microprocessor. Therefore, the MicroBlaze is connected to the ANN hardware co-processor through an FSL channel and provides the necessary input data, as shown in Fig. 16. The C program, executed by the MicroBlaze, provides the parameter and coefficient settings for the three LCS memories via the from-MicroBlaze FSL FIFO, then after a while receives the results, which are sent by the co-processor via the to-MicroBlaze FSL FIFO. The C program is also responsible for printing the received results on the terminal via the available UART. The timer is used to measure the number of elapsed cycles during the ANN hardware operation.
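A skeleton of the MicroBlaze-side C program could look like the sketch below. The fsl_put()/fsl_get() wrappers, the word layout and the parameter ordering are illustrative assumptions made here; the actual program uses the Xilinx FSL primitives and the memory maps described in Section 6.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for the FSL channel primitives.  On the real
 * MicroBlaze these would be replaced by the Xilinx FSL access macros;
 * they are stubbed here so that the sketch compiles on a host machine. */
static void     fsl_put(uint32_t word) { printf("-> 0x%08x\n", (unsigned)word); }
static uint32_t fsl_get(void)          { return 0; /* placeholder result */ }

/* Send the topology, the polynomial coefficients and the weight/bias and
 * input memories to the ANN co-processor, then collect the outputs.      */
static void run_ann(uint32_t imax, uint32_t lmax, const uint32_t *neurons,
                    const uint32_t *coeffs,  uint32_t n_coeffs,
                    const uint32_t *weights, uint32_t n_weights,
                    const uint32_t *inputs,  uint32_t n_inputs,
                    uint32_t *outputs,       uint32_t n_outputs)
{
    uint32_t k;

    fsl_put(imax);                                 /* topology parameters    */
    fsl_put(lmax);
    for (k = 0; k < lmax; k++)      fsl_put(neurons[k]);

    for (k = 0; k < n_coeffs; k++)  fsl_put(coeffs[k]);   /* sigmoid memory  */
    for (k = 0; k < n_weights; k++) fsl_put(weights[k]);  /* weights, biases */
    for (k = 0; k < n_inputs; k++)  fsl_put(inputs[k]);   /* ANN inputs      */

    for (k = 0; k < n_outputs; k++) outputs[k] = fsl_get();  /* results      */
}
```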

In Table 3, we report the hardware area required by a single neuron for different numbers of inputs, as well as that needed to implement the whole network, considering the maximum number of inputs allowed (i = imax), the maximum number of neurons per layer (n = nmax) and the maximum number of layers (l = lmax). The required area is given in terms of slices. Virtex-5 FPGA slices are organized differently from those of previous generations: each Virtex-5 slice contains four 6-input lookup tables and four flip-flops. We compare these figures to those imposed by the binary-radix straightforward design and the stochastic computing-based design (Nedjah & Mourelle, 2007), and by a previous MAC-based design reported in Nedjah et al. (2009). In Table 4, we show the network delay in comparison with those imposed by the implementations reported in Nedjah and Mourelle (2007) and Nedjah et al. (2009).


Table 4
Time delay of the network and performance factor for different numbers of inputs and net sizes.

                 Net delay (ns)                   Performance factor (×10⁻³)
i    n    l      BIN     STO     MAC     FFP      BIN     STO     MAC     FFP
2    6    3      3.45    5.85    3.67    3.33     2.49    21.37   68.12   53.25
4    9    5      4.92    7.09    5.11    4.71     0.96    11.75   24.46   28.13
8   13    7     11.32   19.87   11.79    9.05     0.20     2.55    8.48   13.38

BIN: binary-radix based (Nedjah & Mourelle, 2007). STO: stochastic (Nedjah & Mourelle, 2007). MAC: MAC-based floating-point (Nedjah et al., 2009). FFP: fraction-based, proposed here.



The FPGA used has 11,200 slices. Therefore, as a rough approximation given the data presented in Table 3, we can expect to implement ANNs with hundreds of neurons per layer and a virtually unlimited number of layers. However, only the actual mapping of a hardware with such a number of neurons would provide exact figures, as the required area depends on how the synthesis tool optimizes the hardware resources via sharing vs. duplication.

From the performance results given in Tables 3 and 4, it can be observed that in the proposed architecture both the area and the computation time are reduced, and so the performance factor, defined as 1/(area × time), is improved, as depicted in the chart of Fig. 17. The comparison indicates that the trade-off between area and time improves for the proposed implementation as the net size increases.

As a testbed application, we use an MLP that implements word recognition of a given speech signal. The neural network requires 220 input nodes and returns 10 results at its output nodes. The network input consists of 10 vectors of 22 components obtained after preprocessing the speech signal. The output nodes correspond to 10 recognizable words extracted from a multi-speaker database (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989). After testing different architectures (Canas, Ortigosa, Diaz, & Ortega, 2003), the best classification results, which achieve a 96.83% correct classification rate in a speaker-independent scheme, were obtained using 24 nodes in a single hidden layer with full feed-forward connectivity. Thus, the MLP has two layers of 24 and 10 neurons, respectively, as taken from Canas et al. (2008).

The FPGA implementation of the testbed MLP was first reported in Canas et al. (2008), wherein the activation function is implemented using a lookup table after discretization and the handled data are all of 8 bits. The application uses 220 input data. Three alternative architectures have been investigated, depending on how the required memories, which store the input data, the weights and biases, as well as the lookup tables used to implement the activation functions, are implemented. The implementation alternatives are distributed memory (DRAM) or embedded memory blocks (BRAM). In the former, the memory resources available within the configurable logic blocks (CLBs) are used, while in the latter dedicated blocks of RAM are used. Note that, in general, the delay due to a memory access in DRAM is much shorter than that in BRAM.

Table 3
Area requirements of one neuron and of the network for different numbers of inputs and net sizes.

                 Neuron area (#Slices)            Net area (#Slices)
i    n    l      BIN     STO     MAC     FFP      BIN     STO     MAC     FFP
2    6    3      116       8       4       6       98      21       8      11
4    9    5      212      12       8       9      574      75      25      41
8   13    7      436      20      11      15      780     421      57     129

BIN: binary-radix based (Nedjah & Mourelle, 2007). STO: stochastic (Nedjah & Mourelle, 2007). MAC: MAC-based floating-point (Nedjah et al., 2009). FFP: fraction-based, proposed here.


Table 5 shows the number of slices required to implement the described MLP, the number of clock cycles necessary to accomplish a whole net computation through all the layers, together with the corresponding minimal clock period. The product of the number of clock cycles by the duration of one cycle defines the evaluation time, also given in Table 5. These figures are reported for three different implementation alternatives: in alternative (a), only distributed RAM is used for the whole design; in alternative (b), the weights associated with the synaptic signals, together with the biases, are stored in BRAM, while the remaining data are stored in distributed RAM; in alternative (c), BRAMs are used to implement all the required memories. In this testbed MLP, the input data memory requires a total of 220 entries, the weight and bias memory needs to accommodate 220 × 24 + 24 × 10 = 5520 words, and the activation function table, in our case, requires only 9 fractions. In the implementation reported in Canas et al. (2008), however, this last memory requires many more entries; the authors did not specify how many entries they used to digitize the sigmoid activation function. It is worthwhile noting that a memory entry in our design is of 32 bits, while in Canas et al. (2008) it is of 8 bits only. Also, the design of Canas et al. (2008) was mapped onto a Virtex-E 2000 FPGA device, a family of FPGAs based on 4-input LUTs.
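Since the evaluation time in Table 5 is just the product of the cycle count by the clock period, the following C lines reproduce the Time column for the proposed design from the other two columns; the numeric inputs are the ones quoted in Table 5, and the snippet is only a sketch of that arithmetic.

    #include <stdio.h>

    int main(void)
    {
        /* Proposed design, alternatives (a), (b) and (c) of Table 5 */
        const double clock_ns[] = { 49.341, 51.003, 54.677 };
        const int    cycles     = 356;
        const char  *option[]   = { "(a)", "(b)", "(c)" };
        int k;

        for (k = 0; k < 3; k++) {
            double time_us = cycles * clock_ns[k] / 1000.0;   /* ns -> microseconds */
            printf("%s: %d cycles x %.3f ns = %.3f us\n",
                   option[k], cycles, clock_ns[k], time_us);  /* 17.565, 18.157, 19.465 */
        }
        return 0;
    }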

The chart of Fig. 18 illustrates the comparison of the performance factor achieved by the proposed design with that obtained by the design using a lookup table for the activation function, as reported in Canas et al. (2008). The chart shows that our design wins for every one of the investigated alternatives for the implementation of the data memories. Our design always requires less hardware area, owing to the re-use of the circuitry by both the weighted sum and the activation function computation. Furthermore, although the proposed design requires more clock cycles to complete the needed computation, it does so at a higher operating frequency.

Table 5
Performance comparison of the proposed design with a design that uses a lookup table as activation function.

Design     Option   #Slices   #BRAM   Clock (ns)   #Cycles   Time (μs)   Performance factor
Canas*     (a)       6321       0      58.162        282      16.402      2.59
           (b)       4411      24      59.774        282      16.856      4.73
           (c)       4270      36      64.838        282      18.284      3.80
Proposed   (a)       3712       0      49.341        356      17.565      6.05
           (b)       2988      13      51.003        356      18.157      4.25
           (c)       2192      20      54.677        356      19.465      8.80

* The results reproduced here are also available in Canas et al. (2008).


Fig. 17. Comparison of the performance factor yielded by the neural network hardware proposed here and those reported in the literature.

[Fig. 18: bar chart of the performance factor (y axis, from 0 to 10) versus the memory implementation alternatives (a), (b) and (c) (x axis), with one series for the Canas design and one for the proposed design.]

Fig. 18. Comparison of the performance factor for the testbed application.


8. Conclusion

In this paper, we presented a novel hardware architecture for processing an artificial neural network whose topology configuration can be changed on-the-fly without any extra effort. Particular care was taken to implement the arithmetic and computation models efficiently. Furthermore, the model minimizes the required silicon area, as it uses a single physical layer and re-uses it, via feedback, to perform the computation executed by all the layers of the net. This is done without deteriorating the neural network inference time. The IEEE Standard for Floating-Point Arithmetic (IEEE 754) was not used. Instead, the search for simple arithmetic circuits requiring less silicon area motivated the use of fractions to represent real numbers.

The model was specified in VHDL and simulated to validate its functionality. We also synthesized the system to evaluate its time and area requirements. The performance of the proposed design was then compared to that of three similar implementations: the binary-radix straightforward design, the stochastic computing-based design and the MAC-based implementation with floating-point operations. Furthermore, the design performance was compared to that of a similar design that uses a lookup table to implement the activation function. The proposed design proved superior in many aspects.

The next stage of this work is to implement the adjustments required to accommodate learning techniques, so that the hardware becomes able to infer the weights from the training data set presented by the user.

Acknowledgments

We are grateful to FAPERJ (Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, http://www.faperj.br) and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, http://www.cnpq.br) for their continuous financial support.

References

Ahmadi, A., Mattausch, H. J., Abedin, M. A., Saeidi, M., & Koide, T. (2011). An associative memory-based learning model with an efficient hardware implementation in FPGA. Expert Systems with Applications, 38(4), 3499–3513.

Bade, S. L., & Hutchings, B. L. (1994). FPGA-based stochastic neural networks. In IEEE workshop on FPGAs for custom computing machines (pp. 189–198). Los Alamitos: IEEE.

Beuchat, J.-L., Haenni, J.-O., & Sanchez, E. (1998). Hardware reconfigurable neural networks. In 5th reconfigurable architectures workshop, Orlando, Florida.

Botros, N. M., & Abdul-Aziz, M. (1994). Hardware implementation of an artificial neural network using field programmable gate arrays. IEEE Transactions on Industrial Electronics, 41, 665–667.

Canas, A., Ortigosa, E. M., Diaz, A. F., & Ortega, J. (2003). XMLP: A feed-forward neural network with two-dimensional layers and partial connectivity. Lecture Notes in Computer Science, 2687, 89–96.

Canas, A., Ortigosa, E. M., Ros, E., & Ortigosa, P. M. (2008). FPGA implementation of a fully and partially connected MLP – Application to automatic speech recognition. In A. R. Omondi & J. C. Rajapakse (Eds.), FPGA implementations of neural networks. Springer.

Chen, C. (2003). Fuzzy logic and neural network handbook. New York: McGraw-Hill.

Choi, Y. K., Ahn, K. H., & Lee, S.-Y. (1996). Effects of multiplier output offsets on on-chip learning for analog neuro-chips. Neural Processing Letters, 4, 1–8.

Dias, F. M., Antunes, A., & Mota, A. M. (2004). Artificial neural networks: A review of commercial hardware. Engineering Applications of Artificial Intelligence, 17(8), 945–952.

Falavigna, G. (2012). Financial ratings with scarce information: A neural network approach. Expert Systems with Applications, 39(2), 1784–1792.

Ferrucci, A. T. (1994). A field programmable gate array implementation of self adapting and scalable connectionist network. Ph.D. thesis, University of California, Santa Cruz, California.

Gadea, R., Ballester, F., Mocholi, A., & Cerda, J. (2000). Artificial neural network implementation on a single FPGA of a pipelined on-line backpropagation. In Proceedings of the 13th international symposium on system synthesis. Los Alamitos: IEEE.

García-Fernández, I., Martin-Guerrero, J. D., Pla-Castells, M., Soria-Olivas, E., Martinez-Dura, R. J., & Munoz-Mari, J. (2004). Crane collision modelling using a neural network approach. Expert Systems with Applications, 27(3), 341–348.

Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). New Jersey: Prentice Hall International.

Holt, J. L., & Baker, T. E. (1991). Backpropagation simulations using limited precision calculations. In Proceedings of the international joint conference on neural networks (Vol. 2, pp. 121–126).

Kashif, S., Saqib, M. A., & Zia, S. (2011). Implementing the induction-motor drive with four-switch inverter: An application of neural networks. Expert Systems with Applications, 38(9), 11137–11148.

Kung, H. T. (1988). How we got 17 million connections per second. In International conference on neural networks (Vol. 2, pp. 143–150).

Kung, S. Y. (1988). Parallel architectures for artificial neural networks. In International conference on systolic arrays (pp. 163–174).

Kung, S. Y., & Hwang, J. N. (1989). A unified systolic architecture for artificial neural networks. Journal of Parallel and Distributed Computing, 6, 358–387.

Linde, A., Nordstrom, T., & Taveniku, M. (1992). Using FPGAs to implement a reconfigurable highly parallel computer. In Selected papers from the second international workshop on field programmable logic and applications (pp. 199–210). Berlin: Springer-Verlag.

Lindsey, C. S., & Lindblad, T. (1994). Review of hardware neural networks: A user's perspective. In 3rd workshop on neural networks: From biology to high energy physics.

Lin, C. J., & Lee, C. Y. (2011). Implementation of a neuro-fuzzy network with on-chip learning and its application. Expert Systems with Applications, 38(1), 673–681.

Martins, R. S. (2010). Implementação em hardware de redes neurais artificiais com topologia configurável. M.Sc. dissertation, Post-graduate Program of Electronics Engineering, State University of Rio de Janeiro – UERJ, UERJ/REDE SIRIUS/CTCB/S586.

Martins, R. S., Nedjah, N., & Mourelle, L. M. (2009). Reconfigurable MAC-based architecture for parallel hardware implementation on FPGAs of artificial neural networks using fractional fixed point representation. In Proceedings of ICANN09, LNCS (Vol. 5164, pp. 475–484). Berlin: Springer.

Moerland, P. D., & Fiesler, E. (1997). Neural network adaptations to hardware implementations. In Handbook of neural computation. New York: Oxford University Press.

Montalvo, A., Gyurcsik, R., & Paulos, J. (1997). Towards a general-purpose analog VLSI neural network with on-chip learning. IEEE Transactions on Neural Networks, 8(2), 413–423.

Nedjah, N., Martins, R. S., & Mourelle, L. M. (2011). Analog hardware implementations of artificial neural networks. Journal of Circuits, Systems, and Computers, 20(3), 349–373.

Nedjah, N., Martins, R. S., Mourelle, L. M., & Carvalho, M. V. S. (2008). Reconfigurable MAC-based architecture for parallel hardware implementation on FPGAs of artificial neural networks. In Proceedings of ICANN08, LNCS (Vol. 5768, pp. 169–178). Berlin: Springer.

Nedjah, N., Martins, R. S., Mourelle, L. M., & Carvalho, M. V. S. (2009). Dynamic MAC-based architecture of artificial neural networks suitable for hardware implementation on FPGAs. Neurocomputing, 72(10–12), 2171–2179.

Nedjah, N., & Mourelle, L. M. (2007). Reconfigurable hardware for neural networks: Binary versus stochastic. Journal of Neural Computing and Applications, 72(12). London: Springer.

Nedjah, N., Silva, M. V. C., & Mourelle, L. M. (2012). Preference-based multi-objective evolutionary algorithms for power-aware application mapping on NoC platforms. Expert Systems with Applications, 39(3), 2771–2782.

Omondi, A. R., & Rajapakse, J. C. (2008). FPGA implementations of neural networks. Berlin: Springer.

Rojas, R. (1996). Neural networks. Berlin: Springer-Verlag.

Santi-Jones, P., & Gu, D. (2008). Fractional fixed point neural networks: An introduction. UK: Department of Computer Science, University of Essex.

Sibai, F. N., Hosani, H. I., Naqbi, R. M., Dhanhani, S., & Shehhi, S. (2011). Iris recognition using artificial neural networks. Expert Systems with Applications, 38(5), 5940–5946.

Tanenbaum, A. S. (2007). Structured computer organization (5th ed.). New Jersey: Prentice Hall PTR.

Uyemura, J. P. (2002). Introduction to VLSI circuits and systems. New York: Wiley.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339.

Wolf, W. (2004). FPGA-based system design. New Jersey: Prentice Hall PTR.

Xilinx (2008). MicroBlaze processor reference guide. <http://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf>.

Xilinx (2009). Fast simplex link v2.11b. <http://www.xilinx.com/support/documentation/ip/documentation/fsl_v20.pdf>.

Zhang, X., et al. (1990). An efficient implementation of the back propagation algorithm on the Connection Machine. Advances in Neural Information Processing Systems, 2, 801–809.

Zhang, D., & Pal, S. K. (1992). Neural networks and systolic array design. Singapore: World Scientific Company.

Zhu, J., & Sutton, P. (2003). FPGA implementations of neural networks – A survey of a decade of progress. In Field programmable logic and applications, Lecture Notes in Computer Science (Vol. 2778). Berlin: Springer.

Zurada, J. M. (1992). Introduction to artificial neural systems (1st ed.). USA: West Group.