A Non-Sigmoidal Activation Function for Feedforward Artificial Neural Networks

A Non-Sigmoidal Activation Function for Feedforward Artificial Neural Networks

Pravin Chandra, Udayan Ghose and Apoorvi Sood

Abstract-For a single hidden layer feedforward artificial neural network to possess the universal approximation property, it is sufficient that the hidden layer nodes activation functions are continuous non-polynomial function. It is not

required that the activation function be a sigmoidal function. In

this paper a simple continuous, bounded, non-constant, differentiable, non-sigmoid and non-polynomial function is proposed, for usage as the activation function at hidden layer nodes. The proposed activation function does require the computation of an exponential function, and thus is computationally less intensive as compared to either the log-sigmoid or the hyperbolic tangent function. On a set of 10 function approximation tasks we demonstrate the efficiency and efficacy of the usage of the proposed activation functions. The results obtained allow us to assert that, at least on the 10 function approximation tasks, the results demonstrate that in equal epochs of training, the

networks using the proposed activation function reach deeper minima of the error functional and also generalize better in most of the cases, and statistically are as good as if not better than networks using the logistic function as the activation function at the hidden nodes.

I. INTRODUCTION

Artificial Neural Networks (ANN) are inspired by the computational paradigm of the biological neural networks. Thus, the initial thrust was to develop a structure based on interconnection of simple computational units I. The simplest node used for approximating the behaviour of the biological neuron is a threshold device also called the McCulloch-Pitts node [1]. The net input to a node for inputs [Xl, X2, ... , xnV is given by:

n

Z = L WiXi +() i=l

(1)

where Wi is the strength of connection for the Xi input and () is the bias of the node2• The bias decides at what point the transition of the node state from dormant to exited takes place. This can be seen from the output of the node which is:

(2)

P. Chandra and Udayan Ghose are faculty at the University School of Information and Communication Technology, Guru Gobind Singh Indraprastha University, Dwarka, Sector 16C, New Delhi (INDIA) - 110078 (email:P. Chandra: [email protected]. in, [email protected]; U. Ghose: [email protected]. in, g"[email protected];. A. Sood is a Ph.D. scholar at the University School of lnfonnation and Communication Technology, Guru Gobind Singh lndraprastha University, Dwarka, Sector l6C, New Delhi (INDIA) - 110078 (email: [email protected])

I Generally this unit is called a node or neuron while the interconnection strengths are called weights.

2By a abuse of terminology, the collection of weights and biases together are also referred to as weights.

978-1-4799-1959-8/15/$31.00 @2015 IEEE

This output function for the node is discontinuous. A continuous variant / extension of this function, which is also bounded, differentiable and monotonically increasing, is the sigmoidal class of functions. A sigmoidal function may be defined as [2]:

Definition 1. A sigmoidal function 0' ( . ) is a map 0' : R ----+ R, where R is the set of all real numbers, having the following limits for the argument to the function (x) tending to ±oo:

lim O' (X) = (3 x-+ <X!

(3)

lim O' (x) = 0:; 0: < (3 x---+ - 00

(4)

The generally observed values for 0: and (3 are 0: E { -1,0} and (3 = 1.

The most commonly used sigmoidal functions are the hyperbolic tangent function, and the log-sigmoid or the logistic function which is defined as:

1 O' (x) - (5)

1 +e-x

The nodes using a specific non-linearity as its output function is arranged in a structure which is broadly classified as a feedforward architecture. The estimation of the weights and/or biases of this Feedforward ANN (FFANN) is estimated by using a non-linear optimization technique (in general) to minimize the error between the desired and the obtained response to the FFANN. These methods range from first-order methods( only using the gradient of the error functional) to second-order methods (using the explicit Hessian or some estimate of the Hessian of the error functional). For example, the classical error Backpropagation method [3], [4], [5] and the Resilient Backpropagation (RPROP) [6], [7], [8] methods are gradient based (first-order) methods, while the Levenberg-Marquardt [9] and the conjugate gradient [10], [11] training algorithms may be classified as second-order methods. By themselves these training mechanism do not impose any condition on the activation functions or the nonlinearity used as the output function of the nodes'. The only additional algorithmic requirement is that these functions should be differentiable [2].

The initial set of results about the universal approximation property (UAP)3 was obtained in context to hidden (layer) nodes that were sigmoidal in nature, under a variety of conditions [12], [13], [14], [15], [16]. Stinchcombe and

3The property that a FFANN, with at least one hidden layer, with nonlinear nodes in the hidden layer can approximate any function arbitrarily well, provided there are sufficient number of hidden nodes.

y

Fig. I. The schematic diagram of a single hidden layer FFANN.

White (1990) showed that the UAP for a FFANN depends not on the sigmoidality of the non-linearity at the hidden nodes for FFANNs, but on the feedforward structure; and the properties of the sigmoidality of the hidden nodes are not as crucial for UAP [17]. Hornik (1991) established that the UAP exists for a FFANN if the non-linearity (henceforth called the activation) function at the hidden layer nodes is any continuous, non-constant and bounded function [18]. Leshno [19] established that the UAP is obtained if the activation function of the hidden nodes is not a polynomial. See [20] for a survey of these UAP results. Thus, from these results (obtained after 1989), we may impose the following conditions on an activation function, for the networks using them at the hidden layer nodes, as - The activation function must be a continuous, non-constant and bounded non-polynomial function4•

These set of results, that have expanded the potential classes of function that can be used as activation function, have not been reflected in the empirical works reported. Some of the results have been reported using polynomial activations [21], [22], but as the result in [19] established, these networks do not possess the UAP. In [23] Hermite polynomials are used in the constructive approach to neural networks. Most of the research on activation functions role in training of FFANNs have concentrated on sigmoidal activations [24], [25], [26]. The activation functions used in the learning algorithms for FFANN training, play an important role in determining the speed of training [27], [24]. In this paper we use a simple non-sigmoidal function that is continuous, differentiable, bounded, non-constant and non-polynomial and study its efficiency and efficacy as an activation function for hidden nodes in a single hidden layer FFANN over 10 function approximation tasks.

The paper is organized as follows: Section II describes the FFANN architecture and the activation function used. Section III describes the design of the experiments. Section IV presents the the results while conclusions are presented in Section V.

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 .... - .. "'....-dcr(x) I dx

;,;; ...... 0.1 , "

25 -4 -3 -2 -1 0

Fig. 2. The function <:r(x) (5) and its derivative (8).

4

II. FFANN ARCHITECTURE AND ACTIVATION

FUNCTION

The schematic diagram of a single hidden layer FFANN is shown in Fig. l. The numbers of inputs to the network is I, the inputs are labeled as Xi where i E {I, 2, ... , I} , the number of hidden nodes in the single hidden layer is H, the weight connecting the ith input with the jth hidden node is Wji where i E {I, 2, . . . ,I} and j E {I, 2, . . . ,H}, the threshold of the jth node is {}j with j E {I, 2, ... ,H} and the connection strength between the output of the jth hidden node and the output node is aj where j E {I, 2, . . . ,H} while '/ is the threshold or the bias of the output node. With this structure the net input to the jth hidden node is:

I

nj = L WjiXi + {}j; j E {I, 2, . . . , H} (6) i=l

If ¢(.) represents an arbitrary activation function used at the hidden layer nodes, we may write the net output of the network as:

H

Y = L a j ¢ ( nj) + '/

j=l (7)

The activation functions used in this work as a base to compare the experimental results is the logistic function (5). The derivative of this function is:

(T'(x) =

d(T(x) = (T(x) (1 - (T(x)) (8)

dx This function (T( x) and its derivative is shown in Fig. 2.

The proposed activation is:

1 'Ij;(x) =

x2 + 1; x E R (9)

while its derivative is:

'Ij;'(x) =

d'lj;(x) = -2x('Ij;(X))2 (10)

dx The function 'Ij;(.) and its derivative is shown in Fig. 3. This function is seen to be continuous, bounded, non-constant, non-sigmoidal and a non-polynomial function. The shape of the function in Fig. 3 is similar in shape to the bell

4 [ncidentally, all sigmoidal functions reported belong to this class, but are not the only type of function in this class.

-0.6

-O·!ls -4 -3 -2 -1 2 3 4 S

Fig. 3. The function 4;( X) (9) and its derivative (10).

shaped functions used in Gaussian Potential Function Networks (GPFN) [28] or the Gaussian functions used in Radial Basis Function (RBF) Networks [29]. But, in contrast to the proposed network activation function where the input to the node / activation function is given by (6) as a scalar product of the inputs with the associated weights, in the GPFN or RBF networks the input to the activation function is a essentially the distance between the input vectors and the mean of the input vectors (called the center of the radial basis function) [28], [29]. Thus, the proposed activation is not equivalent to a radial basis function. It can easily be seen that the first as well as the second derivative of the proposed function is bounded. This implies that the training algorithms based on the gradient (first order methods) and / or methods based on (an estimate of the) Hessian (second order methods) can be used for training the networks using the proposed function as activation functions.

III. EXPERIMENT DESIGN

The following functions are used to construct 10 function approximation tasks:

1)

2)

3)

4)

One dimensional input function taken from MATLAB sample file humps.m.

h(x) 1

(x - 0.3)2 + 0.01 1

+ - 6 (x - 0.9)2 + 0.04

where x E (0,1).

(11)

Two dimensional input function taken from MATLAB sample file peaks.m.

h(x,y) 3(1 - x)2e(-x2-(Y+l)2)

10(x/5 _ x 3 _ y5)e(-x2_y2)

(1/3)e(-(X+l)2-y2) (12)

where x E (-3,3) and y E (-3,3). Two dimensional input function from [30], [31], [32].

h(x,y) = sin(x y)

where x E (-2,2) and y E (-2,2). Two dimensional input function from [32].

(13)

[30], [31],

(14)

where x E (-1,1) and y E (-1,1). 5) Two dimensional input function from [33], [31],

[32].

6)

7)

8)

9)

1.3356(1.5(1 - x) + e2x-1 sin(37r(x - 0.6)2) + e3(y-05) sin( 47r(Y - 0.9)2) )(15)

where x E (0,1) and y E (0,1). Two dimensional input function from [33], [31], [32].

1.9(1.35 + eX sin(13(x - 0.6)2)

xe-y sin(7y)) (16)

where x E (0,1) and y E (0,1). Two dimensional input function from [33], [31], [32].

h(x,y) = 42.659(0.l+x(0.05+x4-lOx2y2+5y4)) (17)

where x E (-0.5,0.5) and y E (-0.5,0.5). Two dimensional input function from [31], [32].

f ( ) - 1 + sin(2x + 3y)

8 x,y - 3.5 +sin(x - y)

where x E (-2,2) and y E (-2,2).

(18)

Four dimensional input function from [31], [32].

f9(Xl,X2,X3,X4) 4(Xl - 0.5)(X4 - 0.5)

x sin(27rV xi + x�) (19)

where x E (-1,1) and y E (-1,1). 10) Six dimensional input function from [31], [32].

flO(-) 10sin(7rXlx2) + 20(X3 - 0.5)2 + + 10x4 + 5X5 + OX6 (20)

where x E (-1,1) and y E (-1,1).

For each of the above enumerated function approximation problems, a set of 900 points are generated from the input domain of the functions and the corresponding outputs generated to form the data set used for learning , out of this 200 tuples are used for training the FFANN (and called the training data set) while the the rest of the 700 data points form the test data set.

The architecture of the FFANNs used is decided by varying the number of hidden nodes in the single hidden layer until a satisfactory solution is reached. The architectural parameters and the data sizes of the training and test data sets is reflected in Table I.

The data set used for training and testing the networks are all scaled in the range [-1, 1], and the error measured on the training data set and the test data set by the mean squared error (MSE) over the data set:

1 p 2 E =

2P L (t(k) - y(k») k=l

(21)

TABLE I. DATA AND NETWORK SIZE SUMMARY FOR TASKS 11(1: NUMBER OF INPUTS, H: NUMBER OF HIDDEN NODES, 0: NUMBER OF

OUTPUTS).

Sr No I Data-set I I I H I 0 I Train Set Size I Test Set Size I. h I 8 I 200 700 2. h 2 10 I 200 700 3. h 2 9 I 200 700 4. i4 2 10 I 200 700 5. is 2 to I 200 700 6. i6 2 12 I 200 700 7. h 2 15 I 200 700 8. is 2 15 I 200 700 9. i9 4 6 I 200 700 10. ho 6 5 I 200 700

where P is the number of data pointsltuples in the training/test data set. t(k) and y(k) represent the desired output and the obtained output from the FFANN for the kth input.

For each of the learning tasks an ensemble of fifty initial weight configurations are generated by using uniform random values in the range (-1,1). For each task, the initial weight ensemble is used to construct two FFANN ensembles where the initial weights are equal but the activation functions used differ. One ensemble uses the activation function 0' ( . ) (5) while the other uses the activation function 1.j;(.) (9), as the activation function at the hidden layer nodes. Thus, in all 10 x 2 x 50 = 1000 networks are trained. Each network is trained for 1000 epochs, on the training data set.

The networks are trained using a variant of the resilient backpropagation algorithm called the resilient backpropagation algorithm with improved weight-backtracking (iRPROP2) [8]. This algorithm implements an improved mechanism for weight update reversal in the case it obtains a higher error on a weight update. It shares the property of the resilient backpropagation algorithm (RPROP) as given in [6] of being a fast first-order algorithm and its space and time complexity scales linearly with the number of parameters to be optimized. The minima of the error functional achieved by this algorithm is comparable to second order methods like Levenberg Marquardt based or scaled conjugate algorithm (for example, see the results reported in [26]), for learning tasks.

The experiments were done using Matlab version 2013a on a 64-bit Intel i7 based Microsoft Windows 7 system with 6GB RAM.

We report the ensemble average of the error metric (MMSE), the standard deviation of the ensemble networks' error (STD), the median of the ensemble error values (MeMSE), the minimum MSE achieved by a network in the ensemble (MIN), and the maximum MSE achieved by a network in the ensemble (MAX); for both the training and the test data set. We also count the number of problems in which the MMSE value for ensemble using the activation function 0' (5) is smaller than the MMSE for the ensemble corresponding to the usage of the non-sigmoidal activation 1.j; (9) and vice versa, for both the training and the test set. We use the I-sided Student's t-test [34] to find the number of problems in which the MMSE corresponding to an ensemble using the activation function 0' has statistically significant smaller value than the ensemble characterized by the activation function 1.j; (and vice-versa), over both the

training and the test data set. The t-test is performed at a significance level of a: = 0.05. Similarly we use the I-tailed Wilcoxon rank sum test [34] to find the number of problems in which the MeMSE corresponding to an ensemble using the activation function 0' has statistically significant smaller value than the ensemble characterized by the activation function 1.j; (and vice-versa), over both the training and the test data set. The Wilcoxon rank sum test is performed at a significance level of a: = 0.05

IV. RESULTS

Due to the volume of data obtained in the experiment process, we report the summary of the results only. First the training data set results are presented followed by the test data set results.

A. Training Data Set Results

The summary of the results for the data set used for training is presented in Table II, III and IV. From the obtained results we may infer the following:

1) On inspection of the values for MMSE, for all the tasks, we find, that in all cases, the MMSE obtained for the ensemble using the function 1.j;(.) (9) (hereafter, we will call this the ensemble identified by 1.j;) has a lower value as compared to the ensemble using 0' ( . ) (5) (hereafter called the ensemble identified by 0'). There is substantial decrease in the value of the MMSE in almost all cases (except flO) for the ensemble identified by 1.j; as compared to the MMSE observed for the ensemble identified by 0'.

2) The values of MMSE reported in Table II allows us to only claim that the MMSE of the ensemble identified by 1.j; is smaller than the ensemble identified by 0'. It does not allow us to conclude whether this difference is statistically significant or not. To check for statistical significance of this difference, we perform a I-tail t-test to test the alternative hypothesis that the MMSE of the ensemble identified by 1.j; is smaller than the MMSE of the ensemble identified by 0', at a significance level a: = 0.05. We report the values of the indicator h and the probability of observing the alternative hypothesis. The value h = 1 indicates that the alternative hypothesis holds with p being the probability of observing the alternative hypothesis (this corresponds to 1- q, where q is the probability of observing a test statistics as extreme as, or more extreme, than the observed value under null hypothesis). The obtained results are tabulated in Table III. From the Table we may infer that in 9 tasks, the ensemble identified by 1.j; has a statistically significant lower value of the MMSE as compared to the ensemble identified by 0' (for II - f9), while for the function flO approximation the ensemble identified by 1.j; has a MMSE that is not statistically significantly different than the MMSE of the ensemble identified by 0' at the a: = 0.05 significance level.

3) In no case it is found that ensemble identified by 0' has a statistically significant lower MMSE value

as compared to the ensemble identified by 1/J, at a significance level of a = 0.05. There is substantial decrease in the value of the MeMSE in almost all cases (except fIo) for the ensemble identified by 1/J as compared to the MeMSE observed for the ensemble identified by (7.

4) For all tasks, we observe (Table II) that the MeMSE value of the ensemble identified by 1/J has a lower MeMSE value as compared to the ensemble identified by (7. To check for statistical significance of this difference, we perform a I-tail Wilcoxon rank sum test to test the alternative hypothesis that the MeMSE of the ensemble identified by 1/J is smaller than the MeMSE of the ensemble identified by (7, at a significance level a = 0.05. We report the values of the indicator h and the probability of observing the alternative hypothesis. The value h = 1 indicates that the alternative hypothesis holds with p being the probability of observing the alternative hypothesis (this corresponds to 1- q, where q is the probability of observing a test statistics as extreme as, or more extreme, than the observed value under null hypothesis). The obtained results are tabulated in Table IV. From the Table we may infer that in 9 tasks, the ensemble identified by 1/J has a statistically significant lower value of the MeMSE as compared to the ensemble identified by (7 (for fI - f9), while for the function fIo approximation the ensemble identified by 1/J has a MeMSE that is not statistically significantly different than the MMSE of the ensemble identified by (7 at the a = 0.05 significance level.

5) In no case it is found that ensemble identified by (7 has a statistically significant lower MeMSE value as compared to the ensemble identified by 1/J, at a significance level of a = 0.05.

6) In all tasks the MIN value is lower for the ensemble identified by 1/J as compared to the MIN value for the ensemble identified by (7.

7) The MAX value for the ensemble identified by 1/J is lower than the MAX value for the ensemble identified by (7 in 7 cases out of lO, except for the approximation of 13, f9 , and flO.

8) From these results we assert that the single hidden layer FFANN using the proposed non-sigmoidal and non-polynomial activation function 1/J (9) is as good as if not better than the standard logistic function (7 (5) for at least the function approximation tasks reported herein, for training of FFANNs.

B. Test Data Set Results

The summary of the results for the test data set is presented in Table V, VI and VII. The trend observed for this data set is similar to the results obtained for training data set. From the obtained results we may infer the following:

1) On inspection of the values for MMSE, for all the tasks, we find, that in the first 9 cases, the MMSE obtained for the ensemble using the function 1/J(.) (9) has a lower value as compared to the ensemble using (7( . ) . Only for the case of flO the MMSE of

TABLE II. SUMMARY OF NETWORK TRAINING DATA SET RESULTS FOR ACTIVATION FUNCTIONS aO AND '!j;O- ALL VALUES OF THE

STATISTICS ARE REPORTED X 10-3.

I Sf No I Task I Staristics II II-I -:-' i��e�

;-----1 -

I

I

I

I

I

I

I I I I I I I

I MMSE 0-09906 0-01525 STD 0-10685 0-05379

I h MeMSE 0-06067 0-00505 MAX 0 59192 0 44641 MIN 0 00945 0 00007 MMSE 8043208 4-95639 STD 2-95004 1-77278

2 I h MeMSE 8-18854 4-50126 MAX 24 43177 17 96598 MIN 6 42596 4 26325 MMSE 9-25466 6044859 STD 1-94706 6-37810

3 I h MeMSE 8-95572 6-18062 MAX 29 17462 74 04550 MIN 11 42923 3 27322 MMSE 1-03489 0-70685 STD 0045614 0043669

4 I /4 MeMSE 0 99236 0 55511 MAX 3 83818 2 66605 MIN 0 80597 0 40116 MMSE 1-36400 0-98380 STD 0-56410 0040786

5 I /5 MeMSE 1-16883 0-90764 MAX 4 01896 2 95166 MIN 0 65119 0 20847 MMSE 6-08839 2-04028 STD 3-82751 1-75875

6 I /0 MeMSE 5-86031 1048792 MAX 20 98215 10 54470 MIN 2 45989 1 03209 MMSE 4-22492 1-56735 STD 1-62919 0-65710

7

I h MeMSE 4 03360 1 39689

MAX 16 55252 11 46991 MIN 5 02941 2 06328

I MMSE 19-50338 2-91343 STD 20-09074 1-05437

8

I /s MeMSE 12 04759 2 89007

MAX 72 74960 10 59133 MIN 6 75155 2 18388

I MMSE 10-91532 9-37167 STD 2-39587 2043751

9

I /9 MeMSE 10 18132 9 15631

MAX 29 91510 33 33348 MIN 15 48426 8 88274

I MMSE 1-26066 1-16436 STD 0-76609 0-87291

10

I ho MeMSE 0 85646 0 79360

MAX 5 16208 9 24205 MIN 1 08167 0 92260

the ensemble identified by 1/J has a slightly higher value as compared to the MMSE of the ensemble identified by (7.

2) There is substantial decrease in the value of the MMSE in almost all cases (except flO) for the ensemble identified by 1/J as compared to the MMSE observed for the ensemble identified by (7.

3) The values of MMSE reported in Table V allows us to only claim that the MMSE of the ensemble identified by 1/J is smaller than the ensemble identified by (7 in 9 cases. It does not allow us to conclude whether this difference is statistically significant or not. Also, in the case of fIo though the MMSE of the ensemble identified by 1/J has a higher value as compared to the MMSE of the ensemble identified by (7, we cannot make a comment on

TABLE lIl. t--TEST RESULTS TO TEST THE ALTERNATIVE HYPOTHESIS THAT THE MMSE OF THE ENSEMBLE IDENTIFIED BY 'IjJ IS

SMALLER THAN THE MMSE OF THE ENSEMBLE IDENTIFIED BY u, FOR

TRAINING DATA SET, AT A SIGNIFICANCE LEVEL OF a = 0.05.

I Sr No I Task I Statistics II h II p I h MMSE I 1 00000 2 h MMSE I 1 00000 3 Is MMSE I 0·99816 4 /. MMSE I 0·99980 5 Is MMSE I 0 99990 6 /0 MMSE I 1 00000 7 h MMSE I 1 00000 8 /s MMSE I 1·00000 9 /9 MMSE I 0·99906 10 ho MMSE I 0 72050

TABLE IV. WILCOXON RANK SUM TEST RESULTS TO TEST THE ALTERNATIVE HYPOTHESIS THAT THE MEMSE OF THE ENSEMBLE

IDENTIFIED BY 'IjJ IS SMALLER THAN THE MEMSE OF THE ENSEMBLE

IDENTIFIED BY u, FOR TRAINING DATA SET, AT A SIGNIFICANCE LEVEL

OF a = 0.05.


the statistical significance on the basis of only the MMSE value(s). Thus, similarly to the case of training data set result, we perform a I-tailed t-test. The obtained results are tabulated in Table VI. From the Table we may infer that in 9 tasks, the ensemble identified by 1j; has a statistically significant lower value of the MMSE as compared to the ensemble identified by IJ (for II - f9), while for the function flO approximation the ensemble identified by 1j; has a MMSE that is not statistically significantly different than the MMSE of the ensemble identified by IJ at the a = 0.05 significance level.

4) In no case it is found that ensemble identified by IJ has a statistically significant lower MMSE value as compared to the ensemble identified by 1j;, at a significance level of a = 0.05. There is substantial decrease in the value of the MeMSE in almost all cases (except lIo) for the ensemble identified by 1j; as compared to the MeMSE observed for the ensemble identified by IJ.

5) For 9 tasks, we observe (Table II) that the MeMSE value of the ensemble identified by 1j; has a lower MeMSE value as compared to the ensemble identified by IJ. To check for statistical significance of this difference, we perform a I-tail Wilcoxon rank sum test, similar to the case of training data set result, at a significance level a = 0.05. The obtained results are tabulated in Table VII. From the Table we may infer that in 9 tasks, the ensemble identified by 1j; has a statistically significant lower value of the MeMSE as compared to the ensemble identified by IJ (for II - f9), while for the function

TABLE V. SUMMARY OF NETWORK TEST DATA SET RESULTS FOR ACTIVATION FUNCTIONS u( - ) AND 'IjJ( - ) . ALL VALUES OF THE

STATISTICS ARE REPORTED X 10-3.

I Sr No I Task I Staristics II -

I

I

I

I

I

I

I I I I I I I

I MMSE 0·11361 0·01780 STD 0·12021 0·06277

I h MeMSE 0·06839 0·00589 MAX 0 59192 0 44641 MIN 0 00945 0 00007 MMSE 14·08138 9·79514 STD 4·40695 3·07308

2 I h MeMSE 13·89667 8·92949 MAX 24 43177 17 96598 MIN 6 42596 4 26325 MMSE 19·27182 12·51660 STD 3·75258 9·55968

3 I Is MeMSE 19·66443 12·18528 MAX 29 17462 74 04550 MIN 11 42923 3 27322 MMSE 1·67130 1·13991 STD 0·61593 0·59033

4 I /4 MeMSE 1 64902 0 98643 MAX 3 83818 2 66605 MIN 0 80597 0 40116 MMSE 1·77957 1·29828 STD 0·85407 0·54291

5 I Is MeMSE 1·39119 1·20357 MAX 4 01896 2 95166 MIN 0 65119 0 20847 MMSE 7-47084 2·68920 STD 4·39678 2·06216

6 I /0 MeMSE 7·21906 2·04259 MAX 20 98215 10 54470 MIN 2 45989 1 03209 MMSE 11·71642 5·14925 STD 2·89113 2·17155

7

I h MeMSE 12 02735 4 52178

MAX 16 55252 11 46991 MIN 5 02941 2 06328

I MMSE 22·03466 5-41058 STD 15·35974 1·71369

8

I /s MeMSE 16 88683 5 05453

MAX 72 74960 10 59133 MIN 6 75155 2 18388

I MMSE 24·07256 21·90770 STD 3-45818 4·87731

9

I /9 MeMSE 23 78736 2177558

MAX 29 91510 33 33348 MIN 15 48426 8 88274

I MMSE 2·26039 2·30840 STD 1-42010 1·62316

10

I ho MeMSE 1 42869 1 70628

MAX 5 16208 9 24205 MIN 1 08167 0 92260

flO approximation the ensemble identified by 1j; has a MeMSE that is not statistically significantly different than the MMSE of the ensemble identified by IJ at the a = 0.05 significance level.

6) In no case it is found that ensemble identified by IJ has a statistically significant lower MeMSE value as compared to the ensemble identified by 1j;, at a significance level of a = 0.05.

7) In all tasks the MIN value is lower for the ensemble identified by 1j; as compared to the MIN value for the ensemble identified by IJ.

8) The MAX value for the ensemble identified by 1j; is lower than the MAX value for the ensemble identified by IJ in 7 cases out of 10, except for the approximation of 13, f9, and flO.

9) From these results we assert that the single hidden

TABLE VI. t--TEST RESULTS TO TEST THE ALTERNATIVE HYPOTHESIS THAT THE MMSE OF THE ENSEMBLE IDENTIFIED BY 'IjJ IS

SMALLER THAN THE MMSE OF THE ENSEMBLE IDENTIFIED BY rJ, FOR

TEST DATA SET, AT A SIGNIFICANCE LEVEL OF Q = 0.05.


TABLE VII. WILCOXON RANK SUM TEST RESULTS TO TEST THE

ALTERNATIVE HYPOTHESIS THAT THE MEMSE OF THE ENSEMBLE IDENTIFIED BY 'IjJ IS SMALLER THAN THE MEMSE OF THE ENSEMBLE

IDENTIFIED BY rJ, FOR TEST DATA SET, AT A SIGNIFICANCE LEVEL OF

Q = 0.05.

I Sr No I Task I Starisrics II h II p I h MMSE I 1 00000 2 h MMSE I 1·00000 3 Is MMSE I 1·00000 4 /. MMSE I 0 99999 5 Is MMSE I 0 99786 6 /6 MMSE I 1 00000 7 h MMSE I 1·00000 8 /s MMSE I 1·00000 9 /9 MMSE I 0 99693 10 ho MMSE I 0 16984

layer FFANN using the proposed non-sigmoidal and non-polynomial activation function 1./J (9) is as good as if not better than the standard logistic function (J (5) for at least the function approximation tasks reported herein, for test data sets.

V. CONCLUSION

The possession of the universal approximation property for single hidden layer FFANNs was established for continuous, bounded and non-constant activation functions at the hidden layer nodes in [18]. This result was extended in [19] wherein it was established that any continuous function that is not a polynomial can be used as an activation function at the hidden layer nodes of a single hidden layer FFANN, and the FFANN would posses the UAP. Based on these result a simple continuous, bounded, non-constant, differentiable, non-sigmoid and non-polynomial function is proposed in this work, for usage as the activation function at hidden layer nodes. The efficiency and efficacy of the usage of this function as an activation function at the hidden layer nodes of a single hidden layer FFANN is demonstrated in this work, over a set of lO function approximation tasks. These networks are statistically as good as if not better than the networks using the logistic function as the activation in terms of achieving the training error values or generalization error values. Since activation functions have an important role to play in determining the speed of training of FFANNs [27], [24], [25], it is conjectured that the usage of new activations which are non-sigmoidal in nature may also be shown to have a beneficial consequence for finding a fast training mechanism. Moreover, the proposed function does not involve the calculation of an exponential term, and thus is

computationally less intensive than the calculation of either the logsigmoid activation function or the hyperbolic tangent activation function.

REFERENCES

[I] W. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.

[2] P. Chandra and Y Singh, "Feedforward sigmoidal networks -equicontinuity and fault-tolerance," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1350-1366,2004.

[3] D. E. Rumelhart, G. E. Hinton, and R. 1. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, Oct. 1986.

[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Process

ing: Volume I: Foundations, D. E. Rumelhart, 1. L. McClelland, and The PDP Research Group, Eds. Cambridge: MIT Press, 1987, pp. 318-362.

[5] --, "Learning representations by back-propagating errors," in Neu

rocomputing: foundations of research, J. A. Anderson and E. Rosenfeld, Eds. Cambridge, MA, USA: MIT Press, 1988, pp. 696-699.

[6] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proc. of IEEE conference on Neural Networks, vol. I, San Francisco, 2010, pp. 586-591.

[7] M. Riedmiller, "Advanced supervised learning in multi-layer perceptrons from backpropagation to adaptive learning algorithms," Computer Standards & Interfaces, vol. 16, no. 3, pp. 265 - 278, 1994.

[8] C. Igel and M. Husken, "Empirical evaluation of the improved rprop learning algorithms," Neurocomputing, vol. 50, no. 0, pp. 105 - 123, 2003.

[9] M. T. Hagan and M. B. Menhaj, "Training feed forward networks with the Marquardt algorithm," IEEE Transactions on Neural Networks, vol. 5, pp. 989-993, 1994.

[10] M. Moller, "A scaled conjugate gradient algorithm for fast supervised learning," Aarhus University Computer Science Department, Aarhus, Denmark, Tech. Rep. Technical Report PB-339" 1990.

[11] R. Battiti, "First and second order methods for learning: btetween steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141-166, 1992.

[12] A. R. Gallant and H. White, "There exists a neural network that does not make avoidable mistakes," in Proceedings of the Second International Joint Conference on Neural Networks, vol. 1, 1988, pp. 593-606.

[13] S. M. Carroll and B. W. Dickinson, "Construction of neural networks using the Radon transform," in Proc. of the IJCNN, vol. 1, 1989, pp. 607-611.

[14] G. Cybenko, "Approximation by superposition of a sigmoidal function," Mathematics of Control, Signal and Systems, vol. 5, pp. 233-243, 1989.

[15] K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, pp. 183-192, 1989.

[16] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.

[17] M. Stinchcombe and H. White, "Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions," in Neural Networks, 1989. IJCNN., International Joint Conference on, 1989, pp. 613-617 vol. I.

[18] K. Hornik, "Approximation capabilities of multilayer feed forward networks," Neural Networks, vol. 4, no. 2, pp. 251 - 257, 1991.

[19] M. Leshno, V. Y Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a non-polynomial activation function can approximate any function," Neural Networks, vol. 6, pp. 861-867, 1993.

[20] A. Pinkus, "Approximation theory of the MLP model in neural networks," Acta NlImerica, vol. 8, pp. 143-195, 1999.

[21] S. Guarnieri, F. Piazza, and A. Uncini, "Multilayer feed forward networks with adaptive spline activation function," Nellral Networks, IEEE Transactions on, vol. 10, no. 3, pp. 672-683, May 1999.

[22] M. Solazzi and A. Uncini, "Artificial neural networks with adaptive multidimensional spline activation functions," in Nellral Networks, 2000. UCNN 2000, Proceedings of the IEEE-INNS-ENNS Interna

tional Joint Conference on, vol. 3, 2000, pp. 471-476 vol.3.

[23] J.-N. Hwang, S.-R. Lay, M. Maechler, R. Martin, and J. Schimert, "Regression modeling in back-propagation and projection pursuit learning," Nellral Networks, IEEE Transactions on, vol. 5, no. 3, pp. 342-353, May 1994.

[24] W. Ouch and N. Jankowski, "Survey of neural network transfer functions," Neural Computing Surveys, vol. 2, pp. 163-212, 1999.

[25] P. Chandra, "Sigmoidal function classes for feedforward artificial neural networks," Nellral Processing Letters, vol. 18, no. 3, pp. 205-215,2003.

[26] S. S. Sodhi and P. Chandra, "Bi-modal derivative activation function for sigmoidal feed forward networks," Nellrocompllting, vol. 143, no. 0, pp. 182 - 196,2014.

[27] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller, "Efficient backprop," in Nellral Networks: Tricks of the trade, ser. LNCS:1524, G. B. Orr and K.-R. MUlier, Eds. Berlin: Springer, 1998, pp. 9-50.

[28] S. Lee and R. M. Kil, "A gaussian potential function network with hierarchically self-organizing learning," Nellral Networks, vol. 4, no. 2, pp. 207 - 224, 1991.

[29] S. Haykin, Nellral Networks: A Comprehensive Foundation. New Jersey: Prentice Hall, Inc. , 1999.

[30] L. Breiman, "The PI method for estimating multivariate functions from noisy data," Technometrics, vol. 3, no. 2, pp. 125-160, 1991.

[31] V. Cherkassky, D. Gehring, and F. MUlier, "Comparison of adaptive methods for function estimation from samples," IEEE Transactions

on Nellral Networks, vol. 7, no. 4, pp. 969-984, 1996.

[32] V. Cherkassky and F. MUlier, Learning from Data - Concepts, Theory

and Methods. New York: John Wiley, 1998.

[33] M. Maechler, O. Martin, J. Schimert, M. Csoppenszky, and J. Hwang, "Projection pursuit learning networks for regression," in Proc. of the 2nd International IEEE Conference on Tools for Artificial Intelligence, 1990, pp. 350-358.

[34] J. O. Gibbons and S. Chakraborti, Nonparametric Statistical Infer

ence. New York: Marcel Dekker, Inc., 2003.

Documents

A Non-Sigmoidal Activation Function for Feedforward Artificial Neural Networks