
Close Approximations of Sigmoid Functions

by Sum of Steps for

VLSI Implementation of Neural Networks①

Valeriu Beiu†,‡, Jan Peperstraete†, Joos Vandewalle and Rudy Lauwereins†,②

† Katholieke Universiteit Leuven, Department of Electrical Engineering, Division ESAT, Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium

‡ on leave of absence from “Politehnica” University of Bucharest, Department of Computer Science, Spl. Independentei 313, 77206 Bucharest, România

Abstract – This paper shows that there are simple and accurate ways to compute a sigmoid nonlinearity and its derivative in digital hardware by a sum of steps, and that threshold gate implementations of such algorithms are area-efficient when compared to other known methods.

§1. OVERVIEW

The paper starts by describing classical solutions for the digital hardware implementation of the nonlinear activation functions used by artificial neurons, the accent falling on sigmoid nonlinearities. Recent results from the literature are mentioned and briefly compared (§2. Classical Solutions). But even if approximation techniques are used, the computations involved are quite complex. That is why we introduce a particular sigmoid function (§3. A Particular Sigmoid Function). It is not very difficult to show that this particular sigmoid function is equivalent to the classical sigmoid function if the amplification factor (gain) is changed by a constant. As this constant can be used to multiply all the incoming weights, the input-output behavior of one artificial neuron will be the same. The change in weights is not relevant, as they will anyhow be represented in a fixed-point format, limited by the accuracy of the technology.

We prove that the exact computation of the particular sigmoid function can be done without complex operations, as it always gives rise to periodic binary numbers satisfying a very simple rule (§4. Mathematical Considerations). Before investigating a hardware implementation for the particular sigmoid function, we shall also prove that it cannot be efficiently computed by a systolic or semi-systolic system.

The same idea will then be applied to another sigmoid function, the hyperbolic tangent, where a similar but more complex rule can be deduced (§5. Other Sigmoid Functions). Two more sigmoid functions are introduced to the reader: the fast sigmoid and the error function. Approximating all these functions by modifying the gain of the particular sigmoid function can be an alternative to finding dedicated algorithms.

The Scientific Annals, Section: Informatics, vol. 40 (XXXX), no. 1, 1994.


① This research work was partly carried out in the framework of a Concerted Action Project of the Flemish Community, entitled: “Applicable Neural Networks”, and partly supported by a Doctoral Scholarship Grant offered to V. Beiu by KULeuven. The scientific responsibility is assumed by the authors.

② Senior Research Assistant of the Belgian National Fund for Scientific Research.


As the features of the sigmoid function are important in learning methods based on gradient descent, we present several results which link the previous ones with those encountered while learning (§6. Relations with Learning). Some learning algorithms require the precise estimation of the sigmoid derivative – back-propagation included. We will show that the periodic structure of the rational numbers representing the derivatives of the particular sigmoid function and the hyperbolic tangent is much more intricate in both cases.

Going into details, an example of a parallel implementation for the particular sigmoid function follows (§7. Example of Implementation). It compares well with the results obtained from an artificial neural network trained to learn the same function. It is clear that a designed solution for this problem is, at least for the moment, much better than a “trained” one. This is in no case a disadvantage, as the design should be done only once, and the know-how can then be crafted into the layout of a dedicated circuit.

Conclusions and some open questions which are left for further research end the paper.

§2. CLASSICAL SOLUTIONS

One of the difficult problems encountered when implementing neural networks is the nonlinearity used after the weighted summation of the inputs. There are three main nonlinear activation functions: (i) threshold (hard limiter); (ii) linear; (iii) sigmoid. We should also keep in mind that the sigmoid nonlinearity can be realized in many different ways, of which two are widespread: (i) hyperbolic tangent; (ii) classical sigmoid.

For analog implementations the following established methods exist:
● the most simple comparator can be used for the threshold function;
● a follower with threshold can easily implement the linear transfer function;
● the nonlinearity of a simple device (transistor, diode, ...) is normally used to match one of the common sigmoid nonlinearities.

None of these is complicated, and from this point of view analog solutions seem to have an advantage, their only problem being accuracy. From the published literature, it appears that this is not as difficult a problem as the precision required by the weights (it is known that 4 ÷ 6 bits are needed, while present-day analog technologies can achieve up to 8 bits of precision). In spite of this, there is one disadvantage: the shape of the sigmoid can hardly be controlled by means of any parameter, as it relies on the nonlinearity of an elementary device. We should also mention the difficulties raised by the computation of the derivative of the sigmoid for the learning phase.

For digital techniques, only the threshold nonlinearity can be matched exactly in hardware (being a comparison); the linear or sigmoidal nonlinearities have to be approximated up to a certain accuracy. As the real power of neural networks stems from their nonlinearity, we will be interested only in the most complex form: the sigmoid kind of nonlinearity. The classical digital solutions to the sigmoid nonlinearity, which have been either investigated or even implemented, fall into several directions. As a clear cut we can mention two main trends: (i) look-up tables (also ROM tables) [Nigri 1991a, Nigri 1991b]; (ii) summing a truncated Taylor series expansion. The second trend (expansion into a series) can be further divided into several sub-classes, by the way the approximation has been made:

● sum of steps approximation [Alippi 1990a, Alippi 1990b, Alippi 1991a, Beiu 1992a];
● piece-wise linear approximations [Alippi 1991b, Krikelis 1991, Myers 1989, Siggelkow 1991, Spaanenburg 1991];
● combination of the first two;
● other dedicated approximations [Pesulima 1990, Saucier 1990].

Table 1 surveys some of the recently published results dealing with digital implementations of the sigmoid function. It becomes clear that the look-up table approach falls short of the aim of having a “good approximation” (performance) in a “small area” on the final chip (price). The second approach (Taylor series expansion) gives a better performance/price tradeoff over look-up tables. This is also proven by the numerous articles exploring that line.

For a global description and analysis of sigmoid activation functions, a general class of functions has been proposed [Alippi 1991a]:

f(x, k, b, T, c) = k + c / (1 + b·e^{Tx}),   ∀ x ∈ IR.   (1)


It has four parameters: k ∈ IR, b ∈ IR+, and T, c ∈ IR \ {0}, and three of the most used sigmoid nonlinearities belong to this class:

● the classical sigmoid function (obtained by choosing k = 0, c = b = 1, and T = −1)

f(x) = 1 / (1 + e^{−x});   (2)

● the thermodynamic-like function (for k = 0, c = b = 1, and T = −1/T′)

t(x, T′) = 1 / (1 + e^{−x/T′});   (3)

● the hyperbolic tangent (by tuning k = 1, c = −2, b = 1, and T = 2)

h(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x}).   (4)

A linear transformation of (1) [Alippi 1991a], which leads to a two-parameter set of activation functions:

F(X, c) = (c/2) · (1 − e^{X}) / (1 + e^{X})   (5)

is shown in figure 1.

| Who | Where | Result(s) | Remark(s) |
|---|---|---|---|
| Alippi | TR-CS, Univ. College London, 1991 | Relations between convergence and precision. | Introduces a general class of nonlinear functions. |
| Alippi, Bonfanti, Storti-Gajani | TR-EE, Polytech. of Milano, 1990 | Approximations of the classical sigmoidal function. Hardware. | Sum of 1 ÷ 5 steps. Error ≤ ±13.1% with 5 steps. |
| Alippi, Bonfanti, Storti-Gajani | Proc. Micro-Neuro’90, 165-170.4, 1990 | Approximations of the classical sigmoidal function. Hardware. | Sum of 1 ÷ 5 steps (see previous article). |
| Alippi, Storti-Gajani | Proc. ISCAS’91, 1505-1508, 1991 | Approximations of the classical sigmoidal function. Hardware. | Piece-wise by the set of points (±n, 1/2^{n+1}). |
| Höhfeld, Fahlman | Proc. Micro-Neuro’91, 1-8, 1991 | Probabilistic weight updates (down to 4 bits). | Needed precision for the sigmoid: 4 ÷ 6 bits. |
| Krikelis | Proc. ICASSP’91, 1057-1060, 1991 | Approximation of the classical sigmoid function. | Piece-wise linearization in [−4, 4], error ≥ ±5.07%. |
| Myers, Hutchinson | Electronics Lett. 25(24), 1662-1663, 1989 | Approximation of an A-law sigmoid-like function. Hardware. | 7-segment piece-wise. Error ≥ ±4.89% for [−8, 8] (≥ ±12% for the derivative). |
| Nigri | TR-CS, Univ. College London, 1991 | Precision required for back-propagation. Hardware. | Look-up table for 8 bits; “exact” only for [−2, 2]. |
| Pesulima, Pandya, Shankar | Proc. IJCNN’90, vol. II, 187-190, 1990 | Good approximation of the classical sigmoid by two exponentials. Hardware. | Two-exponential approximation. Error ≥ ±2.45% for [−8, 8]. |
| Saucier, Ouali | Proc. IJCNN’90, vol. II, 557-561, 1990 | Silicon compilation. | Quadratic approximation of the sigmoid (Taylor series). |
| Siggelkow, Nijhuis, Neußer, Spaanenburg | Proc. ICANN’91, 697-702, 1991 | Analyzes accuracy; shows that a problem-dependent synthesis is required. | Piece-wise linearization with 5 segments. No hardware suggested. |
| Spaanenburg, Hoefflinger, Neußer, Nijhuis, Siggelkow | Proc. Micro-Neuro’91, 281-289, 1991 | Bit-serial approximation of binary logarithmic computations (problem-dependent complex parameters). | Piece-wise linearization of the sigmoid with 5 segments → 4 ÷ 6 bits. No hardware suggested. |

Table 1. Recently published results about digital implementations of the sigmoid nonlinearity.


§3. A PARTICULAR SIGMOID FUNCTION

3.1. Definitions

The sigmoid function to be studied is the classical one (2), but having an amplification factor G (gain), which makes it look like the thermodynamic one (3):

f(x) = 1 / (1 + e^{−Gx}),   ∀ x ∈ IR, G ∈ IR+,   (3′)

which will be further called “sigmoid”. Firstly we will be interested in finding an efficient way to implement this function digitally, and secondly in showing a simple way of mapping the algorithm onto a network of threshold gates. We shall also suggest how to extend these results to other sigmoid functions.

There are two starting ideas:
● substitute e (the base of the natural logarithms) by 2, which seems quite natural at first sight, as we are using the binary system in computers;
● replace x by n, as we are not going to use a floating-point representation, but a fixed-point one; in the case of a fixed-point representation, by a proper choice of a scaling factor K = 2^{k} we can have only integers, and it has been shown that one can replace each of the real weights by integers [Raghavan 1988].

So, including the gain G in the scaling factor K, the particular sigmoid function is:

f*(n) = 1 / (1 + 2^{−n}),   ∀ n ∈ ZZ.   (6)

It can easily be proved that the difference between eq. 6 and eq. 3′ is only a constant factor, since e^{−z} = 2^{−z/ln 2}, so the gain has to be changed to include this multiplicative factor:

f*(n) = 1 / (1 + 2^{−n}) = 1 / (1 + 2^{−z/ln 2}).   (7)

It can be shown that eq. 3′ and eq. 7 are equivalent to one another:

f*(n) = 1 / (1 + 2^{−z/ln 2}) = 1 / (1 + e^{−z}) = 1 / (1 + e^{−G·(z/G)}) = 1 / (1 + e^{−Gx}) = f(x).   (8)
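As a quick numerical illustration of eq. 8 (a minimal sketch of ours, not part of the paper; the helper names and the test grid are arbitrary), the base-2 form of eq. 6 coincides with the classical sigmoid once the exponent absorbs the factor ln 2:

```python
import math

def f_classical(x, G=1.0):
    """Classical sigmoid with gain G (eq. 3')."""
    return 1.0 / (1.0 + math.exp(-G * x))

def f_particular(n):
    """Particular sigmoid, base 2 (eq. 6)."""
    return 1.0 / (1.0 + 2.0 ** (-n))

# eq. 8: with z = n * ln 2 the two forms coincide
for n in range(-8, 9):
    z = n * math.log(2.0)
    assert abs(f_particular(n) - f_classical(z)) < 1e-12
print("f*(n) = f(z) for z = n ln 2, n = -8..8")
```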

Figure 1. A general two-parameter set of activation functions: F(X, c).


3.2. Error estimation

Before going into details, an important thing we should clarify is an estimation of the error due to the fact that a limited precision has to be used. The integer representation of n, obtained by scaling a fixed-point representation x with a predetermined factor K, can be done by: (i) truncation, or (ii) rounding. Of equal importance is the way x is represented: (i) either sign and magnitude, or (ii) two's complement, just concentrating on the most common digital representations. Combining them we have four possible cases. As from the literature one can expect to limit the inputs to the [−8, 8] interval, a 4-bit representation has been considered.

If one uses “sign-and-magnitude” and truncates the magnitude, the results will be like the ones presented in figure 2. The error has been computed as the difference between the continuous function and the quantized one, and is δ ∈ [−0.16, 0.16] (figure 2b).

Next a “two's complement” representation has been used for n. By truncating it to four bits, the particular sigmoid function has the shape drawn in figure 3a. In this case the differences between the continuous and the discrete version are δ ∈ [0, 0.16] (figure 3b). It is interesting to see that just by changing the way we represent the inputs, we can reduce this error interval by half.

If we round the input to four bits, we will have the same result both for a “sign-and-magnitude” representation and for a “two's-complement” representation. Figure 4a shows the shape of the continuous and the discrete version of the particular sigmoid function, while figure 4b displays the difference between them, which now falls in the range [−0.081, 0.081].

These results can be enhanced (to ±4.61%) if a piece-wise linearization is done between two adjacent exact points. Still, for most applications a ±8.1% error is acceptable, a rule-of-thumb being that a ±10% error is the maximum acceptable [Alippi 1990a, Alippi 1991a, Baker 1988, Höhfeld 1991, Holt 1990, Holt 1991, Nigri 1991].
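A small numerical sketch of this error analysis (our own illustration, not from the paper; the quantizers simply map x to an integer n, mirroring the 4-bit cases above, and the exact extrema depend on how ties at half-integers are treated) reproduces the rough size of the error bands reported for figures 2-4:

```python
import math

def f(x):
    """Continuous particular sigmoid 1 / (1 + 2^-x)."""
    return 1.0 / (1.0 + 2.0 ** (-x))

quantizers = {
    "trunc (sign-magnitude)": lambda x: math.trunc(x),   # truncate the magnitude
    "trunc (two's complement)": math.floor,              # truncate in two's complement
    "round": lambda x: math.floor(x + 0.5),              # round to nearest integer
}

xs = [i / 1000.0 for i in range(-8000, 8001)]
for name, q in quantizers.items():
    errs = [f(x) - f(q(x)) for x in xs]    # f*(n) = f(n) for integer n
    print(f"{name:26s} delta in [{min(errs):+.3f}, {max(errs):+.3f}]")
```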

Figure 2. The particular sigmoid function with n in “sign-and-magnitude” representation: (a) the continuous sigmoid function f(x) as a thin line, and the particular sigmoid function f*(n) as bold stairs; (b) the error δ = f(x) − f*(n).

Figure 3. The particular sigmoid function with n in “two's complement” representation: (a) f(x) as a thin line, and f*(n) as bold stairs; (b) the error δ = f(x) − f*(n).

Figure 4. The particular sigmoid function if the input value n has been rounded: (a) f(x) as a thin line, and f*(n) as bold stairs; (b) the difference (error) δ = f(x) − f*(n).


§4. MATHEMATICAL CONSIDERATIONS

This part is dedicated to showing that, fortunately, a simple way of computing the previously defined particular sigmoid function exists. The simple periodic binary representation of the function allows us to completely bypass the complex arithmetical operations from eq. 2, and to define a dedicated algorithm for computing the particular sigmoid function.

Before trying to implement it in hardware, we shall analyze which of the two classical possibilities, serial or parallel, to choose. We shall prove that unfortunately no efficient serial solution exists, as practically we will have to wait until all the bits of the input have been received. This will force us to put forth a parallel implementation.

Although in this section binary notation (base 2) will be frequently used, we shall scarcely mark that by a subscript, as we do not want to complicate notations. The round parentheses ( ) are used to show that we deal with a periodic number.

4.1. Computing the particular sigmoid function

Computing the particular sigmoid function (eq. 6) does not seem trivial at first sight: exponential, division, but ... . The special case n = 0 is easy, leading to f*(0) = 1/2 = 0.5_{10} = 0.100...0_{2}. For any other case we can apply the following:

Theorem 1. For ∀ n ∈ ZZ∘ (i.e. ZZ \ {0}), the particular sigmoid function is represented by the simple periodic binary number

0.( s̄ s̄ … s̄ s s … s )

formed by n copies of s̄ followed by n copies of s, where s is the sign of n.

Proof. As n ∈ ZZ∘, we can distinguish two cases:
■ n ∈ IN (n > 0), with s = 0 (positive sign), and
■ n ∈ ZZ \ IN (n < 0), with s = 1 (negative sign),
and analyze each case separately (clearly s̄ = 1 − s).

In the first case n ∈ IN (n > 0):

f*(n) = 2^{n} / (1 + 2^{n}) = 2^{n}·(2^{n} − 1) / (2^{2n} − 1) = ((2^{n} − 1) / 2^{n}) · 1 / (1 − 2^{−2n}) = ((2^{n} − 1) / 2^{n}) · Σ_{k=0}^{∞} 1 / 2^{2kn} = Σ_{k=0}^{∞} (2^{n} − 1) / 2^{(2k+1)n} = Σ_{k=0}^{∞} Σ_{i=1}^{n} 1 / 2^{2kn+i} = 0.( 11...11 00...00 ).   (9)

This concludes half of our proof. The second case n ∈ ZZ \ IN (n < 0) can either be tackled in a similar way, or in a much simpler one. It can be easily proven that f*(n) has (0, 1/2) as symmetry point, so it can be foreseen that the value of f*(n) for negative n will be the complement of the function's value at the positive symmetric point (i.e. |n|). This results from: f*(m) = 1/2 + v, f*(−m) = 1/2 − v, and f*(−m) = 1 − f*(m), meaning that f*(n) is the binary complement of f*(−n), and can be proven as:

1 / (1 + 2^{n}) + 1 / (1 + 2^{−n}) = 1 / (1 + 2^{n}) + 2^{n} / (1 + 2^{n}) = 1

leading to:

f*(n) = 0.00...00 11...11 00...00 11...11 … = 0.( 00...00 11...11 )   (10)

which concludes the proof. ❏

Several comments are in order here to clarify any doubts. The result is such that one need not compute anything at all, the algorithm being very simple:

❝ Generate a sign-dependent sequence of n zeros and n ones ❞ .

Corollary. The following formula can be used to directly determine the value of the i-th bit s_i of the result, i = 1, 2, ..., 2n (s is the sign):

s_i = ( ⌊i/n⌋ + s̄ ) mod 2   (11)

where ⌊x⌋ is the integer part of x (floor).
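A short sketch of this rule (ours, not from the paper; the helper names are arbitrary) generates one period of f*(n) and checks it against the exact value 1/(1 + 2^{−n}):

```python
from fractions import Fraction

def period_bits(n):
    """One period of f*(n): |n| copies of (1 - s) followed by |n| copies of s,
    where s = 0 for n > 0 and s = 1 for n < 0 (Theorem 1)."""
    s = 0 if n > 0 else 1
    return [1 - s] * abs(n) + [s] * abs(n)

def value_of_periodic(bits):
    """Exact value of the purely periodic binary fraction 0.(bits)."""
    p = int("".join(map(str, bits)), 2)
    return Fraction(p, 2 ** len(bits) - 1)

for n in [-4, -2, -1, 1, 2, 3, 5]:
    assert value_of_periodic(period_bits(n)) == 1 / (1 + Fraction(2) ** (-n))
print("Theorem 1 verified for the sample values of n")
```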

4.2. Inefficiency of any serial implementation

As the result of the algorithm for computing the particular sigmoid function is formed by two sequences of n bits each, it is normal to have in mind a possible serial solution. One advantage of such a solution would be the reduction of the communication lines to just one line.

Theorem 2. The particular sigmoid function cannot be time-efficiently generated by a serial circuit like, for example, a systolic or a semi-systolic one.

Proof. If we suppose that the bits of n (n_{m−1}, ..., n_2, n_1, n_0) arrive starting with the least significant bit (n_0), we will generate at the output only ones until the first n_i = 1 arrives (we neglect the trivial case n = 0). If the following bits n_{i+1}, n_{i+2}, ... are all zeros, but n_{i+k} = 1, we will know how many ones to generate after n_i only after n_{i+k} has arrived. If n_{i+k} is the most significant bit (n_{m−1}), then we have to know all the bits of n. Similarly, for the case when the bits of n are coming in reverse order (the most significant bit first), suppose that the most significant bit is 1, then 00...0, and the least significant bit is 1. This concludes the proof. ❏

37

Page 8: Close Approximations of Sigmoid Functions by Sum of Steps for€¦ · Close Approximations of Sigmoid Functions by Sum of Steps for VLSI Implementation of Neural Networks① Valeriu

The two counterexamples proving Theorem 2 are extreme cases, but it is clear that we will have to give up finding a “good” serial solution. As things stand, the balance is:

● either accept a “long” delay to absorb all the bits of n (which are 4 for the [−8, 8] interval), but reduce the connections to just one line, or
● process the bits of n in parallel, but have more connections (equal to the number of bits, i.e. 4 !).

As n is represented on only a few bits, both solutions can be taken into consideration:
● the serial one, due to the fact that the delay will not be too long,
● the parallel one, due to the fact that the burden of several more lines will not be too high.

To conclude, we advocate the parallel solution with the argument that if about the same area will be occupied on the final chip, we will still be able to deliver answers at a higher rate.

§5. OTHER SIGMOID FUNCTIONS

5.1. The hyperbolic tangent

Another important and quite widespread nonlinear activation function is the hyperbolic tangent (eq. 4):

h(x) = tanh(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x}),

which has the advantage of having a symmetric output interval [−1, 1]. Some articles claim that the convergence of some learning algorithms can be sped up by using the hyperbolic tangent. Nevertheless, all of them are based only on simple [Akiyama 1989, Gao 1991, Xu 1991], or more elaborate [Hopfield 1990, Savran 1991] examples.

We will be interested in extending the results of Theorem 1, about the particular sigmoid function, to the hyperbolic tangent. First we rewrite eq. 4 such as to have a gain term G:

tanh(Gx) = (e^{Gx} − e^{−Gx}) / (e^{Gx} + e^{−Gx}).   (12)

It is clear that a quantized version of the hyperbolic tangent h* : ZZ → Ql will be:

h*(n) = (2^{n} − 2^{−n}) / (2^{n} + 2^{−n}),   ∀ n ∈ ZZ.   (13)

Theorem 3. The particular hyperbolic tangent is represented in binary by:

0.11...11( 00...00 11...11 )

for n ∈ IN, having 2n − 1 leading ones, followed by a period of 4n bits: 2n zeros and 2n ones.

Proof. The function can be rewritten:

h*(n) = (2^{n} − 2^{−n}) / (2^{n} + 2^{−n}) = (1 − 2^{−2n}) / (1 + 2^{−2n}) = 1 / (1 + 2^{−2n}) − 2^{−2n} / (1 + 2^{−2n}) = 1 / (1 + 2^{−m}) − 2^{−m} · 1 / (1 + 2^{−m})   (14)

with m = 2n. But we have proved that 1 / (1 + 2^{−m}) = 0.( 11...11 00...00 ), and by substitution, eq. 14 becomes:

h*(m/2) = 0.( 11...11 00...00 ) − 2^{−m} × 0.( 11...11 00...00 ).   (15)

As multiplying by 2^{−m} is equivalent to shifting m positions to the right:

h*(m/2) = 0.( 11...11 00...00 ) − 0.00...00( 11...11 00...00 ) = 0.11...1( 00...0 11...1 )   (16)

with m − 1 leading ones, followed by a period having length 2m: m zeros and m ones. The proof is concluded by the fact that m = 2n. ❏

An error analysis similar to the one performed for the classical sigmoid function would reveal that the errors between the particular hyperbolic tangent h*(n) and its continuous version tanh(x) fall in [−0.33, 0.33] when using 4 bits and rounding (figure 5). As the error is too large to be acceptable, we should increase the number of bits for representing n.

Still another simple solution which uses the particular sigmoid function exists.

Lemma 1. The particular sigmoid function can be used to exactly match the particular hyperbolic tangent.

Proof. The two functions are linked by:

h*(n) = (2^{n} − 2^{−n}) / (2^{n} + 2^{−n}) = 1 + (2^{n} − 2^{−n}) / (2^{n} + 2^{−n}) − 1 = 2·2^{n} / (2^{n} + 2^{−n}) − 1 = 2 / (1 + 2^{−2n}) − 1 = 2 f*(2n) − 1.

This reduces to:
■ shift the input value n one position to the left,
■ compute the particular sigmoid function,
■ shift the result one position to the left,
■ subtract one (decrement) to obtain the particular hyperbolic tangent.

The proof is concluded. ❏

Figure 5. The particular hyperbolic tangent: (a) thin line for the continuous version, and stairs for the discrete version; (b) the error introduced by quantization.


5.2. The fast sigmoid

Another sigmoid function is the so-called “fast sigmoid” [Georgiou 1992]:

fs(x) = x / (1 + |x|),   (17)

which can be seen in figure 6a. Here |x| is the modulus of x. A very simple approximation using the particular sigmoid function is 2f(x) − 1. It leads to δ ∈ [−0.17, 0.17]. Unfortunately, when quantizing, the errors grow too large to be acceptable: δ ∈ [−0.32, 0.32] (figure 6b).

But the fast sigmoid belongs to Ql, which implies that it has a periodic binary representation. This can lead to crafting a dedicated algorithm for the fast sigmoid.
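A quick check of the continuous approximation error (our own sketch; the sampling grid is arbitrary) confirms the roughly ±0.17 band quoted above:

```python
def fast_sigmoid(x):
    """fs(x) = x / (1 + |x|) (eq. 17)."""
    return x / (1.0 + abs(x))

def approx(x):
    """2 f(x) - 1, with f the particular sigmoid 1/(1+2^-x)."""
    return 2.0 / (1.0 + 2.0 ** (-x)) - 1.0

xs = [i / 100.0 for i in range(-800, 801)]
deltas = [fast_sigmoid(x) - approx(x) for x in xs]
print(f"continuous error in [{min(deltas):+.3f}, {max(deltas):+.3f}]")  # roughly +/- 0.17
```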

5.3. The error function

Another sigmoid function in use is the error function erf (see figure 7):

erf(x) = (2/√π) ∫_{0}^{x} e^{−t²} dt.

No period appears in this case, but a simple solution is to take the gain G = 3.47 and use the particular sigmoid function. This is equivalent to 2f(3.47x) − 1 = 2/(1 + 2^{−3.47x}) − 1, which has a very low error δ ∈ [−0.0189, 0.0189] (for the continuous case). When quantizing, the errors increase from 1.89% to 51%, due to the steepness of erf. It becomes clear that a good digital approximation of erf can be reached only by taking into account additional bits from the fractional part of x.

Figure 6. (a) The fast sigmoid x/(1 + |x|) (thin line), and a quantized version of its simple approximation 2f(n) − 1 = 2/(1 + 2^{−n}) − 1 (stairs); (b) the error δ between them.

Figure 7. (a) The erf function as a thin line, and one possible approximation 2f(3.47x) − 1 = 2/(1 + 2^{−3.47x}) − 1 as small circles; (b) the error after quantization.
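In the continuous case, the G = 3.47 gain quoted above can be checked directly (a sketch of ours using Python's math.erf; the sampling grid is arbitrary):

```python
import math

def approx_erf(x, G=3.47):
    """2 f(Gx) - 1, with f the particular sigmoid 1/(1+2^-x)."""
    return 2.0 / (1.0 + 2.0 ** (-G * x)) - 1.0

xs = [i / 100.0 for i in range(-200, 201)]
worst = max(abs(math.erf(x) - approx_erf(x)) for x in xs)
print(f"max |erf(x) - (2f(3.47x) - 1)| over [-2, 2]: {worst:.4f}")  # about 0.019
```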


§6. RELATIONS WITH LEARNING

One interesting, and not always very obvious, aspect of learning is that it is strongly related to the sigmoid nonlinearity used. It is true that this claim is not valid for any learning algorithm. But for the most used learning algorithm in present-day applications – back-propagation – the derivatives of the sigmoid function are needed when back-propagating the errors [Rumelhart 1986].

6.1. The derivative of the particular sigmoid function

Suppose that we have to compute the first derivative of the sigmoid function (that being the case for back-propagation).

Lemma 2. The computation of the first derivative of the particular sigmoid function leads to squaring the particular sigmoid function. This is also true for the particular hyperbolic tangent.

Proof. By simply computing the derivative, the claim follows. ❏

Lemma 2 shows that the next step is to find an algorithm for computing f²(x). We shall rule out multiplication, as we want a very simple solution and it is known that f′(x) = f(x)[1 − f(x)] [Alippi 1991a, Krikelis 1991, Myers 1989]. We will accept an approximate result, if the accuracy is in accordance with the expected values.

An immediate approximation we can use is:

f²(x) = [1 / (1 + 2^{−x})]² = 1 / (2^{−2x} + 2·2^{−x} + 1) ≅ 1 / (2^{−x+1} + 1) = f(x − 1).   (18)

Unfortunately the solution is far from being good (see figure 8). It is known that while the required precision for the execution phase of an artificial neural network is not very high, the one for the training phase (learning) has to be high. As the error given by this approximation (eq. 18) is 8.33% in the worst case, we will try to find a better algorithm.
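The worst case of eq. 18 quoted above is easy to check (a small sketch of ours, looking at non-negative inputs):

```python
def f(x):
    """Particular sigmoid 1 / (1 + 2^-x)."""
    return 1.0 / (1.0 + 2.0 ** (-x))

# worst case of eq. 18 quoted in the text: at x = 0, f(x)^2 = 1/4 while f(x-1) = 1/3
print(abs(f(0.0) ** 2 - f(-1.0)))                   # 0.08333... = 1/12
xs = [i / 100.0 for i in range(0, 801)]
print(max(abs(f(x) ** 2 - f(x - 1)) for x in xs))   # the gap only shrinks for x > 0
```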

6.2. Squaring the particular sigmoid function

We will do even better than approximating the square of the particular sigmoid, by showing how to exactly compute the value of this multiplication. We prove that in three steps:

● first, one period of f*(n) will be used as multiplier, leading to a finite sum (Lemma 3);
● this result will be used to compute the infinite sum (Theorem 4);
● the length of the period of the product will then be determined (Theorem 5).

Figure 8. (a) The derivative f′(x) as a thin line and the straightforward approximation f(x − 1) as small circles; (b) the error δ between them.


Lemma 3. The particular sigmoid function and the particular hyperbolic tangent differ by a multiplicative constant, having a value equal to that of one period of the particular sigmoid function (0.11...1 00...0):

f*(n) × 0.11...1 00...0 = h*(n/2).

Proof. Based on elementary mathematics, 0.11...11 00...0 = 1 − 2^{−n}. So:

f*(n) × 0.11...11 00...0 = (1 / (1 + 2^{−n})) × (1 − 2^{−n}) = (1 − 2^{−n}) / (1 + 2^{−n}) = (1 − 2^{−2·(n/2)}) / (1 + 2^{−2·(n/2)}) = h*(n/2),

which concludes the proof. ❏

Corollary. A finite sum of shifted particular sigmoid functions equals the particular hyperbolic tangent.

Before computing the infinite sum of h*(n/2), we explain two notations we shall use:

● p_κ(n) represents the real value of the κ-th sub-period of 2n bits, after summation;
● π(ρ) is a function returning the 2n least significant bits of ρ (binary representation): π(ρ) = ρ − ⌊ρ / 2^{2n}⌋ × 2^{2n}.

As f*(n) has a periodic binary representation, its square will also have a periodic binary representation. The period of [f*(n)]² will have a length of 2Κn, being a sequence of Κ sub-periods of length 2n. The following theorem can be stated about the value of each p_κ(n), for κ = 1, …, Κ:

Theorem 4. The value of the κ-th sub-period, from the period of [f*(n)]², can be computed only from the values of two adjacent sub-periods as:

p_κ(n) = π( κ·(2^{n} − 1)² + ⌊ (κ+1)·(2^{n} − 1)² / 2^{2n} ⌋ )

by taking the last 2n bits using the π function.

Proof. The proof is based on adding the value of the κ-th sub-period with the carry from the (κ+1)-th sub-period (i.e. the next sub-period). This can be seen from the alignment, each shifted by 2n positions to the right, of the h*(n/2) terms representing the partial sums of products:

0.  11...10 00...01   11...10 00...01   11...10 00...01   11...10 00...01  …  +
0.                    11...10 00...01   11...10 00...01   11...10 00...01  …  +
0.                                      11...10 00...01   11...10 00...01  …  +
0.                                                        11...10 00...01  …
──────────────────────────────────────────────────────────────────────────
0.      p_1(n)            p_2(n)            p_3(n)            p_4(n)      …

42

Page 13: Close Approximations of Sigmoid Functions by Sum of Steps for€¦ · Close Approximations of Sigmoid Functions by Sum of Steps for VLSI Implementation of Neural Networks① Valeriu

As a hypothesis we shall consider that the carry generated in sub-period κ+1 is killed in sub-period κ. This can be understood intuitively, as each sub-period has a sequence of n zeros which will kill the carry [Mead 1980]. Each sub-period p_κ(n) contains κ times 11...10 00...01 (eq. 16), but we have to add the carry generated in p_{κ+1}(n):

p_κ(n) = π( κ·(2^{n} − 1)² + CARRY(κ+1) ).

The carry can be computed as:

CARRY(κ+1) = ⌊ (κ+1) × (11...10 00...01) / 2^{2n} ⌋.   (19)

Because the value 11...10 00...01 of one sub-period is:

11...10 00...01 = 2^{2n−1} + … + 2^{n+1} + 1 = 2^{2n} − 2^{n+1} + 1 = (2^{n} − 1)²,   (20)

by substitution, the proof is concluded. ❏

Due to the hypothesis, the proof of Theorem 4 is correct if and only if the carry generated in sub-period κ+1 is killed in sub-period κ. Still, in sub-period κ another carry is generated. The question “Why will the carry generated in a sub-period be killed in the next sub-period?” still remains. At first sight this does not seem true, as for any n there will be a sub-period having κ large enough (the sum we compute is infinite, so κ grows unlimited) such that the carry generated will propagate over the n zeros. What really happens is that the length of the period is 2Κn, with Κ = 2^{n} + 1.

Theorem 5. The period of the binary representation of [f*(n)]² is 2Κn, with Κ the smallest solution of the diophantine equation:

Κ·(2^{n} − 1)² + ⌊ (Κ+1)·(2^{n} − 1)² / 2^{2n} ⌋ = i · 2^{2n},

which is Κ = 2^{n} + 1.

Proof. As the position of the binary point is irrelevant, we shall consider that we have shifted it by 2n bits to the left (this is equivalent to shifting all the binary numbers by 2n bits to the right). Because we add 11...10 00...01 to itself in each sub-period, the least significant 1 will do a counting and the sub-periods will differ until the moment when one sub-period “resets” itself. Meanwhile, the more significant half of the sub-period will also lead to a sequence of zeros, as 11...10 = 1 00...00 − 00...10 (the first part is a carry, while the second part contains just a 1, which again will do a counting). So, it is certain that a particular sub-period p*_κ(n) will be identical to a sequence of zeros. Taking into account that the result starts with a sequence of zeros (due to the artificial shifts done at the beginning of this proof), we have a recurrence of sub-periods.

To prove that these sub-periods delimit the long resulting period, we have to show that:

■ between two “adjacent” sub-periods of zeros there are no equal sub-periods;
■ after a sub-period of zeros, subsequent sub-periods start to repeat.

Both of these follow from the fact that in each sub-period we are doing a counting. Being a counting process, the resulting sub-periods cannot be equal, thus proving that between two “adjacent” sub-periods of zeros there are no equal sub-periods. The sub-periods following a sub-period of zeros will start “counting” again, so they will repeat themselves. This proves that p_Κ(n) = p*_κ(n) = 00…00. But the sequence 00…00 is a multiple of 2^{2n}, so p_Κ(n) = p*_κ(n) = 00…00 = π(i · 2^{2n}), concluding half of the proof.

Proving now that Κ = 2^{n} + 1 can be done based on the counting argument: the fact that we do a counting implies a repetition after 2^{n} sub-periods. But all these sub-periods have a value p_κ(n) ≠ 00…00 (this is possible as the length of the sub-periods is 2n), so we have to add one more for p_Κ(n) = 00…00. ❏

Corollary. The multiplication of the particular sigmoid function by its period is a periodic binary number having the same period length: 2n bits.

Corollary. An infinite sum of properly shifted particular hyperbolic tangents equals the particular sigmoid function squared.

6.3. A simple algorithm for the derivative

Based on these successive developments we are able to establish an algorithm and make some comments with respect to precision. The result is that one should use Theorem 4 to determine the value of one sub-period and stop when enough sub-periods have been computed (their number being given by Theorem 5). This is still complicated, and multiplication is necessary in the computation of each sub-period. The only advantage is that now the length of the numbers to be multiplied has been reduced to that of one sub-period: 2n bits.

Fortunately a better result can be obtained. First we make the observation that the length 2Κn, with Κ = 2^{n} + 1, grows very quickly (exponentially). It is known that a precision of around 16 bits is required for the learning phase. For n = 2 we already have a period of 20 bits. This makes it possible to compute just several sub-periods. Using Theorem 4, the sequence of bits which forms the result has been determined and is p_1(n), p_2(n), p_3(n), p_4(n), …, or, in detail:

0. 11...1110 00...0010 11...1100 00...0100 11...1010 00...0110 11...1000 00...1000.

The algorithm can be stated as:

❝ For any n > 2 the following sequence of bits: n − 1 ones, n − 1 zeros, 10, n − 2 ones, n − 1 zeros, 100, n − 3 ones, 01, n − 2 zeros, 110, n − 3 ones, n − 1 zeros, 1000, chopped to 16 bits, will represent f²(n) ❞.

A plot can be seen in figure 9. The errors are only due to quantizing, and are in the range [−0.031, 0.031] (still, 3% is better than the 8.3% given by the approximation of eq. 18). The result can be improved only if we consider more than 16 samples for quantizing, which is equivalent to a piece-wise linearization between adjacent points.
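The exact binary expansion of [f*(n)]² can be explored with rational arithmetic (a sketch of ours: it measures the repeating part of the expansion, prints it next to 2Κn for Κ = 2^{n} + 1, and chops the expansion to 16 fractional bits as the algorithm above does):

```python
from fractions import Fraction

def f_star_squared(n):
    """[f*(n)]^2 = (2^n / (2^n + 1))^2, exactly."""
    return Fraction(2 ** n, 2 ** n + 1) ** 2

def binary_period_length(q):
    """Length of the repeating part of the binary expansion of a rational in (0,1)
    whose denominator is odd (multiplicative order of 2 modulo the denominator)."""
    d = q.denominator
    k, r = 1, 2 % d
    while r != 1:
        r = (r * 2) % d
        k += 1
    return k

def first_bits(q, m=16):
    """The first m fractional bits of q."""
    return format(int(q * 2 ** m), f"0{m}b")

for n in range(1, 5):
    K = 2 ** n + 1
    print(n, binary_period_length(f_star_squared(n)), 2 * K * n, first_bits(f_star_squared(n)))
```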

Figure 9. The derivative of the particular sigmoid: (a) continuous function as a solid line, and discrete version as bold stairs; (b) the difference (errors) between them.


§7. EXAMPLE OF IMPLEMENTATION

We will be interested in optimizing certain parameters, of which the most important are:
● the solution should have a number of threshold gates as small as possible;
● each threshold gate should have a number of inputs (fan-in) as small as possible;
● the values of the threshold gates' weights and thresholds should be integers with a range as small as possible.

We shall investigate two completely different ways of designing: (i) a crafted solution, and (ii) a solution obtained by learning and enhanced after that. Both solutions will be designed for only the [0, 8] interval, as the output for negative inputs can be obtained simply by negation (one's complement), as has been proven in Theorem 1. Still, if needed, a more elaborate or precise solution can be designed in a similar way.

7.1. Designed solution

For designing the circuit, we use well known design methodologies for digital circuits [Hill 1974] and for threshold circuits [Hu 1965, Muroga 1971, Muroga 1979, Sheng 1969]. We start from the truth table of the function f*(n), which will have 3 input bits n = x2x1x0 and will be represented on only 7 output bits f*(x2x1x0) = 0.y1y2y3y4y5y6y7. The truth table is:

| n | f*(n) | x2 x1 x0 | 0. y1 y2 y3 y4 y5 y6 y7 |
|---|---|---|---|
| 0 | 0.5000 | 0 0 0 | 1 0 0 0 0 0 0 |
| 1 | 0.6666 | 0 0 1 | 1 0 1 0 1 0 1 |
| 2 | 0.8000 | 0 1 0 | 1 1 0 0 1 1 0 |
| 3 | 0.8888 | 0 1 1 | 1 1 1 0 0 0 1 |
| 4 | 0.9411 | 1 0 0 | 1 1 1 1 0 0 0 |
| 5 | 0.9696 | 1 0 1 | 1 1 1 1 1 0 0 |
| 6 | 0.9846 | 1 1 0 | 1 1 1 1 1 1 0 |
| 7 | 0.9922 | 1 1 1 | 1 1 1 1 1 1 1 |

Out of the seven output functions of three variables, only five have to be computed. Of the other two, one is constant (y1 = 1) and the other is equal to one of the input variables (y4 = x2).
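The table itself can be regenerated by truncating the binary expansion of f*(n) = 2^{n}/(2^{n} + 1) to seven fractional bits (a small sketch of ours):

```python
def output_bits(n, m=7):
    """First m fractional bits of f*(n) = 1/(1+2^-n) = 2^n/(2^n+1)."""
    num, den = 2 ** n, 2 ** n + 1
    return format(num * 2 ** m // den, f"0{m}b")

for n in range(8):
    print(n, format(n, "03b"), output_bits(n))   # reproduces the truth table above
```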

For the logical operators AND and OR, and for ease of notation, we use × (·) and +, instead of /\ and \/, or ∩ and ∪, while for the threshold gates we use the following notation [Sheng 1969]: ⟨w1·x1 + w2·x2 + … + wn·xn⟩_T, where all the weights (wi, i = 1, 2, …, n) are integers (wi ∈ ZZ) and only the value of the threshold is not an integer. We shall also consider T = θ + 1/2, thus simplifying the more encumbering notation ⟨w1·x1 + w2·x2 + … + wn·xn⟩_{θ : θ+1}, which shows explicitly that T ∈ (θ, θ + 1). The output functions are:

y2 = x2 + x1 = ⟨x2 + x1⟩_{0.5}
y3 = x2 + x0 = ⟨x2 + x0⟩_{0.5}
y6 = x2·x1 + x1·x̄0 = x1·(x2 + x̄0) = ⟨x2 + 2x1 − x0⟩_{1.5}
y7 = x̄2·x0 + x1·x0 = x0·(x1 + x̄2) = ⟨−x2 + x1 + 2x0⟩_{1.5}
y5 = x1·x̄0 + x2·x0 + x̄1·x0   or   y5 = x1·x̄0 + x2·x1 + x̄1·x0

The last function is not linearly separable. Since we do not know how to split the input variables so as to get the minimum multi-threshold gate solution, we shall analyze all the alternatives. This is possible as the number of input variables is small. If we choose x2 as one of the two sub-sets (the left drawing), an exclusive-OR has to be implemented, thus leading to a three-layered circuit (it is known that a XOR cannot be realized by only one threshold gate). If we choose x1 (the middle drawing) or x0 (the right drawing), we can implement y5 in two layers. From the two drawings representing these cases (middle and right), it seems that both solutions are perfectly equivalent. And that is true if we look only at minimizing y5. But we are interested in the global minimization of seven functions (y1, y2, …, y7). That is why a closer look will reveal that one solution is better than the other. This results from the fact that y6 ⊂ y5. Thus the decomposition along x1 (the middle drawing) will lead to the global minimum:

y5 = x1·(x2 + x̄0) + x̄1·x0 = y6 + x̄1·x0 = ⟨ y6 + ⟨−x1 + x0⟩_{0.5} ⟩_{0.5}

If we decompose along x0 (the right drawing), we obtain y5 = x0·(x2 + x̄1) + x̄0·x1, which will need one more threshold gate for the first part of the function. The final solution can be seen in figure 10, where each threshold gate has been represented by its Karnaugh map.

At this point we can compute the “cost” of this implementation. The cost is associated with the area on the final chip and can be computed in different ways: (i) size (number of gates) [Beiu 1992b]; (ii) number of connections; (iii) sum of all the weights; (iv) sum of all the weights and thresholds. As the weights and the thresholds have to be physically realized on the final chip, the last “cost” comes closer to the real area than the others [Beiu 1992a, Beiu 1993]:

SIZE_Designed = 6;   CONN_Designed = 14;   WCON_Designed = 16;   W&TH_Designed = 24.

Figure 10. Designed solution for the particular sigmoid function, for three input bits and seven output bits. Solid lines represent positive connections, while tinted ones represent negative connections.


7.2. Trained solutions

A completely different approach for designing the circuit has also been considered: training a three-layer neural network to learn the completely specified function. NeuralWorks Professional [*** 1989 – Klimasauskas et al.] has been used for the simulations. The training set was represented by the values from the truth table and the learning rule was back-propagation.

The number of threshold gates (artificial neurons) in the hidden layer has been taken as 2 at the beginning, and successively incremented to 3, 4, and 5 (from the designed solution we knew that 5 is enough: 2 functions and 3 copies of the inputs). To discern among the simulated networks, we shall denote each one by three numbers: the threshold gates in the input, hidden and output layers. The results have been:

● the 3-2-7 network starts to oscillate almost immediately (after less than 100 epochs);
● the 3-3-7 network was not able to learn the training set exactly (even after 250,000 epochs !); it is difficult to state if it oscillates or not;
● the 3-4-7 network was able to learn the training set perfectly, but needs on the order of 10,000–20,000 epochs to achieve perfection (see figure 11);
● the 3-5-7 network was able to learn the training set perfectly, in a reasonable amount of time (only after 1,000–5,000 epochs); as one threshold gate from the hidden layer was permanently inhibited, a pruning technique could have removed it.

Pruning 3-5-7 we have obtained another solution with 4 threshold gates in the hidden layer. We identify it as 3-5[4]-7, showing that it comes from pruning the 3-5-7 network. A pruning of weights has also been performed to remove “zero” weights. The weights have been computed as floating-point numbers by back-propagation. When learning was finished, the real weights were converted to integers. The weights connecting the hidden to the output layer were small for both solutions (in the ±4 range), but large between the input and the hidden layer (in the ±200 range for 3-4-7, and in the ±400 range for 3-5[4]-7). This fact can be explained by the way the back-propagation algorithm works (propagating the errors backwards). We have tried to limit the weights while learning, but unfortunately back-propagation fails to converge if the weights are limited to less than ±200. A solution we have investigated was to reduce the incoming weights to each threshold gate by dividing all the weights by their greatest common divisor. More precisely, we have allowed each weight to vary within ±ε of its own value and the greatest common divisor between these resulting sets has been determined. The range of weights has been reduced by this method to ±36 for 3-4-7, and to ±400 for 3-5[4]-7.

Finally, an interesting thing we have observed was that the functions learned by the hidden threshold gates were different in the two networks that did converge. The 3-4-7 network has learned the weights shown in figure 12a, while the 3-5[4]-7 network has learned the solution presented in figure 12b.

The values for the different cost functions have been computed, and can be seen in Table 2.

Figure 11. The 3-4-7 network: (a) while the learning process is still adjusting the weights; and (b) at the end of the learning phase, when the RMS error is zero for all the data from the training set.

Using the algorithm, the general set of activation functions presented in figure 1 becomes “rougher”. As an example, the three-dimensional thermodynamic-like function computed using floating point can be seen in figure 13a, while in figure 13b we have rounded it to four bits.

Figure 13. The three-dimensional thermodynamic-like function: (a) floating point simulation; (b) simulation using only four bits of precision.

|             | 3-5[4]-7 | 3-5[4]-7 reduced | 3-4-7 | 3-4-7 reduced | ROM table    | Designed |
|-------------|----------|------------------|-------|---------------|--------------|----------|
| SIZE        | 10       | 10               | 10    | 10            |              | 6        |
| CONN        | 33       | 33               | 32    | 32            |              | 14       |
| WCON        | 2207     | 921              | 892   | 179           | minimum area | 16       |
| W&TH (area) | 3085     | 1338             | 1354  | 258           | 24 × 5 = 80  | 24       |

Table 2. Different costs for the analysed solutions (occupied area).

Figure 12. (a) The 3-4-7 solution is close to the designed solution; and (b) the 3-5[4]-7 solution: only one threshold gate from the hidden layer has an equivalent gate in the designed solution. Weights connect to the corresponding threshold gate from the lower level (threshold values are on the left of each Karnaugh map).


§8. CONCLUSIONS AND OPEN PROBLEMS

In this paper we have:
● presented known solutions for hardware implementations of the sigmoid functions;
● introduced a particular sigmoid function, and analyzed its errors for different binary representations (the particular sigmoid function achieves the lowest error among all the known sum-of-steps approximations);
● defined and proved the correctness of a simple algorithm to compute this function;
● proved that no efficient serial implementation for this function can be found;
● extended these results to the hyperbolic tangent;
● suggested how other sigmoid-shaped functions can be approximated by the particular sigmoid function (modifying the gain);
● defined and proved a more complicated algorithm for computing the derivative of the particular sigmoid function – needed in the learning phase (without multiplications);
● detailed a small example seen from two very different angles: craft-worked and trained.

Further research on the same line should concentrate on:
● finding a simple way of rounding the last bit of the result for the particular sigmoid;
● trying to develop algorithm(s) for a more precise approximation, like piece-wise linearization.

Final remarks on the problem of finding good sum-of-steps approximation algorithms:
● each sigmoid has to be analyzed individually;
● there are common denominators for some sigmoid functions;
● in principle, approximative dedicated algorithms can be found, but
● the algorithms are not always simple enough for a digital hardware implementation.

Other areas to be investigated more deeply should be the basic choice of numerical representation. From a first point of view, a logarithmic representation, which drastically simplifies multiplication, could be considered (as some authors have already done [Spaanenburg 1991]). Also more exact representations could be thought of [Beiu 1985]. For each one, algorithm(s) for computing a sigmoid function have to be found, besides efficient algorithms for multiplication and addition.

References

Akiyama 1989: Y. Akiyama, A. Yamashita, M. Kajiura, and H. Aiso. Combinatorial Optimization with Gaussian Machines. Proc. IJCNN'89, IEEE Press, vol. I, pp. 533-540, 1989.
Alippi 1990a: C. Alippi, S. Bonfanti, and G. Storti-Gajani. Some Simple Bounds for Approximations of Sigmoidal Functions in Layered Neural Nets. Report n. 90-022, Dipartimento di Elettronica, Politecnico di Milano, pp. 1-25, 1990.
Alippi 1990b: C. Alippi, S. Bonfanti, and G. Storti-Gajani. Approximating Sigmoidal Functions for VLSI Implementation of Neural Nets. Proc. IWMNN'90, Dortmund, Germany, Univ. of Dortmund, pp. 165-170.4, June 1990.


Alippi 1991a: C. Alippi. Weight Representation and Network Complexity Reductions in the Digital VLSI Implementation of Neural Nets. Res. Note RN/91/22, Dept. of Comp. Sci., Univ. College London, February 1991 – submitted to IEEE Trans. on Neural Networks.
Alippi 1991b: C. Alippi, and G. Storti-Gajani. Simple Approximation of Sigmoidal Functions: Realistic Design of Digital Neural Networks Capable of Learning. Proc. ISCAS'91, Singapore, IEEE Press, pp. 1505-1508, June 1991.
Baker 1988: T. Baker, and D. Hammerstrom. Modifications to Artificial Neural Network Models for Digital Hardware Implementation. Tech. Rep. CS/E 88-035, Dept. Comp. Sci. and Eng., Oregon Graduate Center, 1988.
Beiu 1985: V. Beiu, and A. Neagu. Algorithm for Adding Universal Represented Real Numbers. Proc. INFO-IASI'85, vol. II, pp. 760-768, 1985.
Beiu 1992a: V. Beiu, J.A. Peperstraete, and R. Lauwereins. Using Threshold Gates to Implement Sigmoid Nonlinearity. Proc. ICANN'92, Elsevier Science Publishers, Amsterdam, vol. II, pp. 1447-1450, 1992.
Beiu 1992b: V. Beiu, J.A. Peperstraete, and R. Lauwereins. Algorithms for Fan-In Reduction. Proc. NeuroNimes'92, EC2, pp. 543-554, 1992.
Beiu 1993: V. Beiu, J. Peperstraete, J. Vandewalle, and R. Lauwereins. Efficient Decomposition of Comparison and Its Applications. In M. Verleysen (ed.): Proc. of the European Symposium on Artificial Neural Networks ESANN'93, Brussels, D facto, pp. 45-50, April 1993.
Gao 1991: K. Gao, M.O. Ahmad, and M.N.S. Swamy. Neural LS Estimator with a Non-Quadratic Energy Function. Proc. of the Intl. Conf. on Acoustics, Speech, and Signal Processing ICASSP'91, Toronto, Canada, IEEE Press, vol. II, pp. 1041-1044, May 1991.
Georgiou 1992: G.M. Georgiou. Parallel Distributed Processing in the Complex Domain. Ph.D. dissertation, Tulane, 1992.
Hill 1974: F.J. Hill, and G.R. Peterson. Introduction to Switching Theory and Logical Design (Chapter 17: "Combinational Functions with Special Properties", and Chapter 18: "Threshold Logic"). John Wiley and Sons, New York, 1974.
Höhfeld 1991a: M. Höhfeld, and S.E. Fahlman. Learning with Limited Numerical Precision Using the Cascade-Correlation Algorithm. Tech. Rep. CMU-CS-91-130, Sch. of Comp. Sci., Carnegie Mellon, May 1991.
Höhfeld 1991b: M. Höhfeld, and S.E. Fahlman. Probabilistic Rounding in Neural Network with Limited Precision. Proc. MICRO-NEURO'91, Kyrill & Method Verlag, pp. 1-8, 1991.
Holt 1990: J.L. Holt, and T.E. Baker. Back Propagation Simulations Using Limited Precision Calculations. Tech. Rep., Univ. of Washington, Seattle, 1990.
Holt 1991: J.L. Holt, and J.-N. Hwang. Finite Precision Error Analysis of Neural Network Hardware Implementations. Tech. Rep., Univ. of Washington, Seattle, 1991.
Hopfield 1990: J.J. Hopfield. The Effectiveness of Analogue 'Neural Network' Hardware. Network, 1, pp. 27-40, 1990.
Hu 1965: S. Hu. Threshold Logic. Univ. of California Press, Berkeley and Los Angeles, 1965.
Krikelis 1991: A. Krikelis. A Novel Massively Parallel Associative Processing Architecture for the Implementation of Artificial Neural Networks. Proc. of the Intl. Conf. on Acoustics, Speech, and Signal Processing ICASSP'91, Toronto, Canada, IEEE Press, vol. II, pp. 1057-1060, May 1991.
Mead 1980: C.A. Mead, and L. Conway. Introduction to VLSI Systems. Addison-Wesley, Reading, Massachusetts, 1980.
Muroga 1971: S. Muroga. Threshold Logic and Its Applications. John Wiley & Sons, New York, 1971.
Muroga 1979: S. Muroga. Logic Design and Switching Theory (Chapter 5: Advanced Simplification Techniques and Basic Properties of Gates). John Wiley & Sons, New York, 1979.
Myers 1989: D.J. Myers, and R.A. Hutchinson. Efficient Implementation of Piecewise Linear Activation Function for Digital VLSI Neural Networks. Electronics Letters, 25(24), pp. 1662-1663, 1989.
Nigri 1991a: M.E. Nigri, P. Treleaven, and M. Vellasco. Silicon Compilation of Neural Networks. Proc. CompEuro'91, IEEE Press, pp. 541-546, 1991.
Nigri 1991b: M.E. Nigri. Hardware Emulation of Back-Propagation Neural Networks. Res. Note RN/91/21, Dept. of Comp. Sci., Univ. College London, February 1991.
Pesulima 1990: E.E. Pesulima, A.S. Pandya, and R. Shankar. Digital Implementation Issues of Stochastic Neural Networks. In M. Caudill (ed.): Proc. IJCNN-90-WASH-DC, vol. II, pp. 187-190, 1990.
Raghavan 1988: P. Raghavan. Learning in Threshold Networks: A Computational Model and Applications. Tech. Rep. RC 13859, IBM Res., July 1988.
Rumelhart 1986: D.E. Rumelhart, J.L. McClelland, and the PDP Research Group (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. I: Foundations. A Bradford Book, The MIT Press, Cambridge, MA, 1986.
Saucier 1990: G. Saucier, and J. Ouali. Silicon Compiler for Neuron ASICs. Proc. IJCNN'90, IEEE Press, vol. II, pp. 557-561, 1990.
Savran 1991: M.E. Savran, and Ö. Morgül. On the Associative Memory Design for the Hopfield Neural Network. Proc. IJCNN'91, IEEE Press, Singapore, vol. II, pp. 1166-1171, November 1991.
Sheng 1969: C.L. Sheng. Threshold Logic. Academic Press, New York, 1969.
Siggelkow 1991: A. Siggelkow, J. Nijhuis, S. Neußer, and B. Spaanenburg. Influence of Hardware Characteristics on the Performance of a Neural System. Proc. ICANN'91, Elsevier Science Publishers, Amsterdam, pp. 697-702, 1991.
Spaanenburg 1991: L. Spaanenburg, B. Hoefflinger, S. Neusser, J.A.G. Nijhuis, and A. Siggelkow. A Multiplier-Less Digital Neural Network. Proc. MICRO-NEURO'91, Kirill & Method Verlag, pp. 281-289, 1991.
Xu 1991: L. Xu. Least MSE Reconstruction for Self-Organization: Further Theoretical and Experimental Studies on One Layer Nets. Proc. IJCNN'91, IEEE Press, Singapore, vol. III, pp. 2368-2373, November 1991.
*** 1989: C. Klimasauskas, J. Guiver, and G. Pelton; with significant contributions from K. Kleinberg, P. DuBose, and D. Stork. NeuralWorks Professional – Neural Computing, User's Guide, and Designer Pack. NeuralWare, Pittsburgh, PA, 1989.
