Soft Comput (2006) 10:257–263. DOI 10.1007/s00500-005-0479-7
ORIGINAL PAPER
R. N. Yadav · Prem K. Kalra · Joseph John
Neural network learning with generalized-mean based neuron model
Published online: 27 April 2005. © Springer-Verlag 2005
Abstract Advances in the biophysics of computation and in neurocomputing models have brought to the foreground the importance of the dendritic structure of the neuron. These structures are assumed to be the basic computational units of the neuron, capable of realizing various mathematical operations. Well-structured higher-order neurons have shown improved computational power and generalization ability. However, these models are difficult to train because of a combinatorial explosion of higher-order terms as the number of inputs to the neuron increases. In this paper we present a neural network using a new neuron architecture, the generalized mean neuron (GMN) model. This neuron model consists of an aggregation function based on the generalized mean of all the inputs applied to it. The resulting neuron model has the same number of parameters as the existing multilayer perceptron (MLP) model but improved computational power. The capability of this model has been tested on classification and time series prediction problems.
Keywords Generalized-mean neuron · Classification · Function approximation · Multilayer perceptrons
1 Introduction
An artificial neuron is a mathematical model for a biological neuron that can approximate its functional capabilities. The major issue in artificial neuron models is the description of single-neuron computation and of the interaction among neurons under applied input signals. In the literature, various neuron models [1–9,17] and their application to solving
R. N. Yadav (✉) · P. K. Kalra · J. John
ACES-107, Department of Electrical Engineering, Indian Institute of Technology Kanpur, India
E-mail: [email protected]
Tel.: +91-512-2597007
Fax: +91-512-2590063
R. N. Yadav
Department of Electronics and Communication Engineering, Maulana Azad National Institute of Technology Bhopal, India
linear and nonlinear problems have been presented. The function of neurons has been clearly explained by the author in [15]. The McCulloch-Pitts [1] neuron model initiated the use of summing units as the neuron model, while neglecting all possible nonlinear capabilities of the single neuron and the role of dendrites in information processing in the neural system. This model makes several drastic simplifications: it allows binary {0, 1} states only, operates under a discrete-time assumption, and assumes synchrony of operation of all neurons in a larger network. Its aggregation function is, in a sense, a weighted mean of all the inputs applied to it. We propose a simple model for the generalized mean neuron (GMN) with a well-defined training procedure based on standard back-propagation. The proposed GMN considers a weighted generalized mean of all inputs in the space. This ensures that the McCulloch-Pitts model is a special case of the proposed neuron model.
In Section 2 we discuss the motivation, physical architecture and mathematical representation of the proposed neuron. Section 3 presents the architecture and learning rule for a multilayer feedforward neural network based on the GMN. Section 4 discusses the performance of the neural network using the proposed neuron model on two typical pattern recognition problems: classification and function approximation. We solve the channel equalization, Pima Indians diabetes and synthetic two-class problems using a GMN based network and compare it with a multilayer perceptron (MLP) based network, which requires more parameters and longer training time. Similarly, the Mackey-Glass, Box-Jenkins gas furnace and HCL internet incoming traffic datasets are used to demonstrate the function approximation capabilities of the proposed neuron model. Conclusions are presented in Section 5.
2 Generalized mean based neuron model
Neuron modeling concerns relating the function of a neuron to its structure on the basis of its operation. As the name suggests, the proposed neuron model is based on the concept of the generalized mean [18] of the input signals. The generalized
Table 1 Forms of the generalized mean as the value of r changes

r     M_r                                    Operation
∞     max(x_j)                               Maximum
2     ((1/N) \sum_{j=1}^{N} x_j^2)^{1/2}     Root mean square
1     (1/N) \sum_{j=1}^{N} x_j               Arithmetic mean
0     (\prod_{j=1}^{N} x_j)^{1/N}            Geometric mean
−1    ((1/N) \sum_{j=1}^{N} 1/x_j)^{-1}      Harmonic mean
−∞    min(x_j)                               Minimum
mean (GM) of N input signals x_j (j = 1, 2, ..., N, N ∈ I) can be given as

GM = \left( \frac{1}{N} \sum_{j=1}^{N} x_j^r \right)^{1/r}    (1)
where r (r ∈ R) is a generalization parameter which yields the various means (arithmetic, geometric and harmonic) depending upon its value. It also gives the Max and Min operators when r tends to +∞ and −∞ respectively. Table 1 shows the different operations attained by the generalized mean operator as the value of r changes. Inspired by the flexibility of the above equation, the aggregation function of the GMN can be defined as
y(x_j, w_j) = \left( \sum_{j=1}^{N} w_j x_j^r + w_0 \right)^{1/r}    (2)
where w_j is the adaptive parameter corresponding to each x_j and w_0 is the threshold of the neuron. From equation (2) we find that

y(x_j, w_j) = \sum_{j=1}^{N} w_j x_j + w_0 \quad \text{for } r = 1    (3)
which is the output of the McCulloch-Pitts model. Thus the perceptron model is a special case of the proposed generalized mean based neuron model. The physical architecture of the proposed neuron model is the same as that of the perceptron model. For r = 0, equation (2) can be modified as follows. Let x_j (j = 1, 2, ..., N) ∈ ℝ⁺ and w_j (j = 1, 2, ..., N) ∈ ℝ⁺ such that w_1 + w_2 + ... + w_N = 1. For r ≠ 0 the weighted generalized mean of x_1, x_2, ..., x_N is given as
y(x, w) = (w_1 x_1^r + w_2 x_2^r + \cdots + w_N x_N^r)^{1/r}.    (4)
Using the Taylor series expansion e^t = 1 + t + O(t^2), where O(t^2) is Landau notation [22] for terms of order t^2 and higher, we can write x_j^r as x_j^r = e^{r \log x_j} = 1 + r \log x_j + O(r^2). Substituting this into the definition of y(x, w) in equation (4), we get
y(x, w) = [w_1(1 + r \log x_1) + w_2(1 + r \log x_2) + \cdots + w_N(1 + r \log x_N) + O(r^2)]^{1/r}
        = [1 + r(w_1 \log x_1 + w_2 \log x_2 + \cdots + w_N \log x_N) + O(r^2)]^{1/r}
        = [1 + r \log(x_1^{w_1} x_2^{w_2} \cdots x_N^{w_N}) + O(r^2)]^{1/r}
        = \exp\left[ \frac{1}{r} \log\{1 + r \log(x_1^{w_1} x_2^{w_2} \cdots x_N^{w_N}) + O(r^2)\} \right]    (5)
Now, using the Taylor series log(1 + t) = t + O(t^2), we get

y(x, w) = \exp\left[ \frac{1}{r} \{ r \log(x_1^{w_1} x_2^{w_2} \cdots x_N^{w_N}) + O(r^2) \} \right]
        = \exp[ \log(x_1^{w_1} x_2^{w_2} \cdots x_N^{w_N}) + O(r) ]    (6)
Now, taking the limit r → 0, we find

y(x, w) = \exp[ \log(x_1^{w_1} x_2^{w_2} \cdots x_N^{w_N}) ]
        = x_1^{w_1} x_2^{w_2} \cdots x_N^{w_N}    (7)
which is the most general type of multiplicative unit given in [3], whose function approximation capability is proved there. This means that the multiplicative neuron unit given in [3] is a special case of the proposed GMN model.
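The special cases above (r = 1 recovering the weighted sum of equation (3), r → 0 recovering the product unit of equation (7)) can be checked numerically. A minimal sketch in Python; the function name, tolerances and the w0 = 0 convention in the limiting case are illustrative choices, not from the paper:

```python
import math

def gmn_aggregate(x, w, w0, r, eps=1e-9):
    """Generalized-mean aggregation y = (sum_j w_j * x_j**r + w0)**(1/r), eq. (2).

    For |r| < eps the r -> 0 limit is used: the product unit x_1^w_1 ... x_N^w_N
    of eq. (7), which assumes positive inputs, weights summing to 1, and w0 = 0.
    """
    if abs(r) < eps:
        return math.exp(sum(wj * math.log(xj) for wj, xj in zip(w, x)))
    return (sum(wj * xj ** r for wj, xj in zip(w, x)) + w0) ** (1.0 / r)

x = [0.3, 0.5, 0.8]
w = [0.2, 0.3, 0.5]  # positive weights summing to 1

# r = 1 reduces to the McCulloch-Pitts weighted sum of eq. (3)
weighted_sum = sum(wj * xj for wj, xj in zip(w, x)) + 0.1
assert abs(gmn_aggregate(x, w, 0.1, r=1.0) - weighted_sum) < 1e-12

# small r approaches the multiplicative unit of eq. (7)
prod_unit = x[0] ** w[0] * x[1] ** w[1] * x[2] ** w[2]
assert abs(gmn_aggregate(x, w, 0.0, r=1e-6) - prod_unit) < 1e-4
```

The second assertion illustrates the derivation numerically: as r shrinks, the generalized mean converges to the weighted geometric mean at rate O(r).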
3 Multilayer feedforward network using GMN model
3.1 Network architecture and description
Let us consider a feedforward multilayer architecture of a network in which M hidden layer neurons receive N inputs, as shown in Fig. 1. The input and output vectors of the network are X = [x_1 x_2 ... x_N]^T and Y = [y_1 y_2 ... y_K]^T respectively. If w_{ij} is the weight that connects the ith neuron with the jth input, the activation value of the ith neuron can be given as
net_i = \left( \sum_{j=1}^{N} w_{ij} x_j^r + w_{0i} \right)^{1/r} \quad \text{for } i = 1, 2, ..., M    (8)
where w_{0i} is the bias of the ith neuron in the hidden layer. The nonlinear transformation performed by each of the M neurons in the network is given as
y_i = f(net_i) \quad \text{for } i = 1, 2, ..., M    (9)
where f denotes a nonlinear function (the sigmoid function in this case). Similarly, the output of the kth neuron in the output layer can be given as
y_k = f(net_k) \quad \text{for } k = 1, 2, ..., K    (10)
where

net_k = \left[ \sum_{i=1}^{M} w_{ki} y_i^r + w_{0k} \right]^{1/r} \quad \text{for } k = 1, 2, ..., K    (11)
Fig. 1 Multilayer feedforward network using GMN model
where w_{ki} is the weight that connects the ith neuron of the hidden layer to the kth neuron of the output layer, and w_{0k} is the bias of the corresponding output layer neuron. The value of the generalization parameter r is, for simplicity, taken to be the same for every neuron in our simulations.
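The forward pass of equations (8)-(11) can be sketched as follows. The weight ranges are illustrative, and positive inputs are assumed so that the fractional powers stay real (the experiments below normalize all data to [0.1, 0.9]):

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gmn_layer(x, W, b, r):
    """One GMN layer: net_i = (sum_j W[i][j] * x_j**r + b[i])**(1/r) of eqs. (8)/(11),
    followed by the sigmoid nonlinearity of eqs. (9)/(10)."""
    out = []
    for Wi, bi in zip(W, b):
        net = sum(wij * xj ** r for wij, xj in zip(Wi, x)) + bi
        out.append(sigmoid(net ** (1.0 / r)))
    return out

def gmn_network(x, W1, b1, W2, b2, r):
    """N -> M -> K feedforward GMN network of Fig. 1, same r in every neuron."""
    return gmn_layer(gmn_layer(x, W1, b1, r), W2, b2, r)

random.seed(0)
N, M, K, r = 4, 3, 1, 1.2
W1 = [[random.uniform(0.1, 0.5) for _ in range(N)] for _ in range(M)]
b1 = [0.1] * M
W2 = [[random.uniform(0.1, 0.5) for _ in range(M)] for _ in range(K)]
b2 = [0.1] * K

y = gmn_network([0.2, 0.4, 0.6, 0.8], W1, b1, W2, b2, r)
assert len(y) == K and 0.0 < y[0] < 1.0
```

With r = 1 this reduces to an ordinary MLP forward pass, which is why the two models have the same parameter count.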
3.2 Learning rule
We describe an error backpropagation based learning rule for the network using the proposed GMN model. The simplicity of the learning method makes the model convenient to use in different situations, unlike the higher-order neuron model [12], which is difficult to train and is susceptible to a combinatorial explosion of terms.
A simple gradient descent rule, using a mean-squared error function, is described by the following set of equations.

Output layer: From equation (10) we have

y_k = f(net_k) = \frac{1}{1 + e^{-net_k}}.    (12)
The mean-squared error (MSE) is given as

E_{MSE} = \frac{1}{2PK} \sum_{k=1}^{K} \sum_{p=1}^{P} \left( y_k^p - y_{dk}^p \right)^2,    (13)
where y_k^p and y_{dk}^p are the actual and desired values of the kth neuron in the output layer for the pth pattern, respectively, and P is the number of training patterns in the input space. The weight update rules are defined by equations (14)–(17).
\Delta w_{ki} = -\eta \frac{\partial E}{\partial w_{ki}} = \frac{1}{PK} \cdot \frac{\delta\, y_i^r\, net_k^{1-r}}{r}    (14)

\Delta w_{0k} = -\eta \frac{\partial E}{\partial w_{0k}} = \frac{1}{PK} \cdot \frac{\delta\, net_k^{1-r}}{r}    (15)

w_{ki}^{new} = w_{ki}^{old} + \Delta w_{ki}    (16)

w_{0k}^{new} = w_{0k}^{old} + \Delta w_{0k},    (17)
where \delta = \eta\, y_k (y_k - y_{dk})(1 - y_k) and \eta (\eta \in [0, 1]) is the learning rate.
Hidden layer: Now, from equations (8) and (9), we can define the update rules for the weights w_{ij} and w_{0i} by equations (18) and (19).
\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = \frac{1}{PK} \cdot \frac{\delta (1 - y_i)\, y_i^r\, net_i^{1-r}\, x_j^{r-1} \left[ \sum_{k=1}^{K} net_k^{1-r} w_{ki} \right]}{r}    (18)

\Delta w_{0i} = -\eta \frac{\partial E}{\partial w_{0i}} = \frac{1}{PK} \cdot \frac{\delta (1 - y_i)\, y_i^r\, net_i^{1-r} \left[ \sum_{k=1}^{K} net_k^{1-r} w_{ki} \right]}{r}.    (19)
The new weights w_{ij}^{new} and w_{0i}^{new} can be determined according to equations (16) and (17).
The learning rate η can either be adapted with epochs or fixed to a small number based on heuristics. This learning method is used to train the network in the next section to solve some well-known benchmark problems in both classification and function approximation.
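As a runnable illustration of gradient-descent training of a single GMN unit, the sketch below uses finite-difference gradients in place of the closed-form updates (14)-(19), so it stays self-contained. The toy dataset, learning rate, and the positivity guard on net are assumptions for the sketch, not details from the paper:

```python
import math

def forward(x, params, r):
    # Single GMN unit with sigmoid output: y = f((sum_j w_j x_j**r + w0)**(1/r))
    *w, w0 = params
    net = sum(wj * xj ** r for wj, xj in zip(w, x)) + w0
    net = max(net, 1e-9)  # guard: keep the base positive for the fractional power
    return 1.0 / (1.0 + math.exp(-(net ** (1.0 / r))))

def mse(params, data, r):
    # Mean-squared error in the spirit of eq. (13)
    return sum((forward(x, params, r) - yd) ** 2 for x, yd in data) / (2 * len(data))

def train(data, r=1.2, eta=0.3, epochs=300, h=1e-6):
    # Plain gradient descent; finite-difference gradients stand in for eqs. (14)-(19)
    params = [0.3, 0.3, 0.1]  # [w1, w2, w0]
    for _ in range(epochs):
        base = mse(params, data, r)
        grads = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += h
            grads.append((mse(bumped, data, r) - base) / h)
        params = [p - eta * g for p, g in zip(params, grads)]
    return params

data = [([0.1, 0.9], 0.1), ([0.9, 0.1], 0.9), ([0.8, 0.2], 0.9), ([0.2, 0.8], 0.1)]
before = mse([0.3, 0.3, 0.1], data, 1.2)
after = mse(train(data), data, 1.2)
assert after < before
```

Numeric gradients are a convenient correctness check against the analytic updates: both should drive the error of equation (13) downward on the same data.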
4 Results and discussions
We discuss some important problems arising in machine learning that can be broadly categorized as classification or function approximation. Detailed experiments and comparison with the existing multilayer network (MLN) topology suggest that networks using the proposed neuron model achieve better results with fewer computations. In all the problems we discuss, the datasets have been pre-processed by normalizing them between 0.1 and 0.9. In all simulations, the results reported are averages of ten runs over the range of reported learning rates. All multilayer networks reported are
trained using the standard gradient descent learning algorithm. The network topology reported is of the form
n × h1 × · · · × hk × o
where n is the number of input nodes, h_i is the number of nodes in the ith hidden layer (for i = 1, ..., k) and o is the number of output nodes. Along with the training and testing errors for each network topology, we also use some statistical measures, namely covariance, correlation and Akaike's information criterion (AIC) [20,21], for testing the capability of the proposed neuron model. The AIC, defined by equation (20), evaluates the goodness of fit of a model based on the MSE for the training data and the number of estimated parameters.
AIC = −2 log (maximum likelihood) + 2L (20)
where L is the number of independently estimated parameters. If the output errors are statistically independent of each other and follow a normal distribution with zero mean and constant variance, equation (20) can be written as
AIC = −2Pk log(σ²) + 2L    (21)
where P is the number of training data, k the number of output units and σ² the maximum likelihood estimate of the MSE. The model which minimizes the AIC is optimal in the minimal averaging loss sense, i.e. it minimizes the expected discrepancy [16]. In all simulations we take the absolute values of the estimated MSE and outputs to avoid complex computations.
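The pre-processing used throughout this section, normalizing each dataset into [0.1, 0.9], is a plain min-max scaling; a minimal sketch (the constant-input fallback is an illustrative choice):

```python
def normalize(values, lo=0.1, hi=0.9):
    """Min-max scale a sequence into [lo, hi], as done for every dataset here."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # degenerate constant input: map everything to the interval midpoint
        return [(lo + hi) / 2.0] * len(values)
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]

data = [3.0, 7.0, 11.0]
scaled = normalize(data)
assert all(abs(a - b) < 1e-12 for a, b in zip(scaled, [0.1, 0.5, 0.9]))
```

Keeping inputs strictly positive in this way also keeps the fractional powers x_j^r of the GMN aggregation real-valued.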
4.1 Classification
We discuss the results of some popular classification prob-lems using the GMN model: channel equalization, syntheticdata and Pima Indians diabetes data.
4.1.1 Channel equalization
Band-limited communication channels driven at high data rates often display inter-symbol interference (ISI). Nonlinear channel equalization [13] is a popular problem in communication systems: it recovers an estimate of s(t − τ), denoted ŝ(t − τ), given the present and past channel outputs y(t), with τ the equalizer delay. Thus, the channel output vector
y(t) = [y(t), y(t − 1), . . . , y(t − m + 1)] , (22)
is used to compute ŝ(t − τ), where m is the equalizer order. Considering the fact that s(t − τ) is binary, the problem is essentially a classification task. We consider a nonlinear model that uses equations (23) and (24), where the equalizer delay and order are both two.
o = s(i) + 0.5s(i − 1) (23)
x(i − 2) = o − 0.9o3 . (24)
The data generated is then subjected to a 10 dB noise level.Fig. 2 shows the plot of two classes (zero and one) in the
Fig. 2 A channel equalization problem with two classes
space y(t) and y(t − 1). Five hundred points were taken fortraining and the system was tested with 4500 points.
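A sketch of the data generation implied by equations (23) and (24). The ±1 symbol alphabet, the additive Gaussian noise model for the 10 dB level, and the label mapping are assumptions the paper does not spell out:

```python
import math
import random

def make_channel_data(n, snr_db=10.0, seed=1):
    """Generate (features, label) pairs from the channel model of eqs. (23)-(24):
    o = s(i) + 0.5 s(i-1), then x(i-2) = o - 0.9 o**3, followed by additive
    Gaussian noise at the given SNR. Symbols s(i) are drawn from {-1, +1}
    (an assumption); the label is the delayed symbol mapped to {0, 1}."""
    rng = random.Random(seed)
    s = [rng.choice((-1.0, 1.0)) for _ in range(n + 2)]
    x = []
    for i in range(1, n + 2):
        o = s[i] + 0.5 * s[i - 1]
        x.append(o - 0.9 * o ** 3)
    # scale the noise so the signal-to-noise ratio is snr_db (assumed Gaussian)
    power = sum(v * v for v in x) / len(x)
    sigma = math.sqrt(power / 10 ** (snr_db / 10.0))
    x = [v + rng.gauss(0.0, sigma) for v in x]
    # feature vector [y(t), y(t-1)] as in eq. (22); label = delayed symbol
    return [([x[t], x[t - 1]], 1 if s[t - 1] > 0 else 0)
            for t in range(1, len(x))]

data = make_channel_data(500)
assert len(data) == 500
assert all(label in (0, 1) for _, label in data)
```

Training pairs built this way match the setup of Fig. 2: each point lives in the (y(t), y(t − 1)) plane with a binary class label.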
The performance on the channel equalization problem is given in Table 2. A similar network using the GMN model proves better for nonlinear channel equalization than the multilayer network and attains a bit error rate (BER) of 1.38%.
4.1.2 Diabetes – Pima Indians
The famous diabetes dataset of Pima Indian women is used as a benchmark for classifier systems. The idea is to predict the presence of diabetes using seven variables: number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, body mass index (weight/height²), diabetes pedigree, and age. In [14], the author provides an analysis of the dataset, which has a total of 532 diabetes records. Out of the total 532, 200 are used for training and 332 for testing, with about 33% of the total dataset having diabetes. Table 3 shows the performance of the GMN based network as compared to a multilayer network. The GMN based network in this case shows slightly improved performance on the training and testing sets with fewer parameters.
4.1.3 Synthetic two-class problem
This is a ‘realistic’ problem from Ripley [19] that is used to illustrate how methods work. There are two features and two classes, each class having a bimodal distribution. The class distributions were chosen to allow a best-possible error rate of about 8% and are in fact equal mixtures of two distributions; the component normal distributions have a common covariance matrix. The GMN based multilayer network was trained using 250 data samples and was tested with 1000 samples. The performance of this network was compared with a similar network using MLPs, and it was observed that the GMN based network performs better than the network
Table 2 Comparison of performance for the channel equalization problem between the GMN based network and a standard multilayer network

Method           Structure  Learning rate  Training error (%)  Testing error (%)  AIC      Epochs
GMNs (r = 1.2)   2 × 3 × 1  0.4            1                   1.38               −9.1582  151
MLPs             2 × 3 × 1  0.1            1.2                 1.91               −8.8398  29
Table 3 Comparison of performance for the Pima Indians diabetes dataset between the GMN based network and a standard multilayer network

Method           Structure  Learning rate  Training error (%)  Testing error (%)  AIC    Epochs
GMNs (r = 0.9)   7 × 3 × 1  0.5            20.5                12.03              −2.13  855
MLPs             7 × 4 × 1  0.2            21                  12.22              −2.08  300
Table 4 Comparison of performance for the synthetic two-class data between the GMN based network and a standard multilayer network

Method           Structure  Learning rate  Training error (%)  Testing error (%)  AIC     Epochs
GMNs (r = 1.2)   2 × 5 × 1  0.1            14                  8.64               −13.24  214
MLPs             2 × 5 × 1  0.1            19.2                13.84              −12.84  335
of perceptrons. Table 4 shows the performance of the GMNbased network as compared to a multilayer network.
4.2 Function approximation
We evaluate the capabilities of the proposed GMN model on the following problems:

(1) Mackey–Glass time series dataset
(2) Short-term internet incoming traffic prediction
(3) Box–Jenkins gas furnace dataset
The Mackey–Glass and Box–Jenkins datasets are benchmark problems and are popularly used to evaluate a proposed learning method. We also investigate short-term internet incoming traffic prediction using the HCL-infinet internet traffic dataset.
4.2.1 Mackey–Glass
The Mackey–Glass (MG) time series [11] represents a model for white blood cell production in leukemia patients and has nonlinear oscillations. The MG delay-difference equation is given by equation (25):

y(t + 1) = (1 - b)\, y(t) + \frac{a\, y(t - \tau)}{1 + y^{10}(t - \tau)}    (25)
where a = 0.2, b = 0.1, and τ = 17. The time delay τ is a source of complications in the nature of the time series. The objective of the modeling is to predict the value of the time series based on four previous values: the measurements y(t), y(t − 6), y(t − 12) and y(t − 18) are used to predict y(t + 1). Training is performed on 250 samples and the model is tested on 200 time instants after training. A mean squared error of 7.06 × 10⁻⁶ was achieved on training the model for 3598 epochs. Figure 3 shows the training and prediction results.
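Equation (25) can be iterated directly to produce the training series; the constant initial history y0 is an assumption, since the paper does not state its initial conditions:

```python
def mackey_glass(n, a=0.2, b=0.1, tau=17, y0=1.2):
    """Iterate the delay-difference equation (25):
    y(t+1) = (1 - b) y(t) + a y(t - tau) / (1 + y(t - tau)**10).
    The constant history y(t) = y0 for t <= 0 is an assumed initial condition."""
    y = [y0] * (tau + 1) + [0.0] * n
    for t in range(tau, tau + n):
        y[t + 1] = (1 - b) * y[t] + a * y[t - tau] / (1 + y[t - tau] ** 10)
    return y[tau + 1:]  # drop the assumed history, keep n generated values

series = mackey_glass(450)
assert len(series) == 450
assert all(0.0 < v < 2.0 for v in series)
```

From such a series, the input vectors [y(t), y(t − 6), y(t − 12), y(t − 18)] and targets y(t + 1) described above can be sliced out directly.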
In Table 5, the performance of the GMN model based network is compared with a multilayer network with one hidden
Fig. 3 Long term prediction results for the Mackey–Glass time seriesdataset using the proposed neuron model
layer having five nodes, trained using gradient descent. The performance of the GMN model based network is definitely better than that of the multilayer network in this case, even though it has fewer parameters.
4.2.2 HCL–infinet internet traffic
Short-term internet traffic data was supplied by HCL-infinet (a leading Indian ISP). The weekly internet traffic graph with a 30-min average is shown in Fig. 4.
The solid gray graph shows the incoming traffic while the black line graph represents the outgoing traffic. All values are reported in bits per second. We propose a model for predicting the internet traffic using previous values. Three measurements y(t), y(t − 1) and y(t − 2) are used to predict y(t + 1) for the incoming internet traffic. For the incoming traffic, 150 training samples were taken and the model was tested on
Fig. 4 Weekly Graph (30 Min. Average) of the internet traffic for the HCL–infinet router at Delhi, INDIA
Table 5 Comparison of performance for the Mackey–Glass time series dataset between the GMN network and a standard multilayer network, both trained using the gradient descent method

                 GMNs (r = 0.75)  MLPs
Topology         4 × 3 × 1        4 × 5 × 1
Epochs           3598             6637
Training error   7.06 × 10⁻⁶      6.14 × 10⁻⁴
Testing error    2.19 × 10⁻⁴      3.02 × 10⁻⁴
Covariance       2.5 × 10⁻⁵       7.21 × 10⁻⁵
Correlation      0.9970           0.9934
AIC              −11.4477         −7.1422
150 samples. Figure 5 shows the prediction results for the incoming internet traffic data. The performance is compared with the multilayer network in Table 6.
4.2.3 Box–Jenkins gas furnace
The Box–Jenkins gas furnace dataset [10] reports the furnace input as the gas flow rate u(t) and the furnace output y(t) as the CO2 concentration. In this gas furnace, air and methane were combined in order to obtain a mixture of gases containing CO2. We model the furnace output y(t + 1) as a function of the previous output y(t) and input u(t − 3). The training and testing results of the GMN based network and
Fig. 5 Testing result on the HCL-infinet MRTG incoming internet band-width usage data
Table 6 Comparison of performance for the incoming internet bandwidth usage of the HCL-infinet router data between the GMN based network and a standard multilayer network, both trained using the gradient descent method

                 GMNs (r = 1.2)  MLPs
Topology         4 × 3 × 1       4 × 5 × 1
Epochs           4916            10000
Training error   2.11 × 10⁻⁴     5.50 × 10⁻³
Testing error    4.2 × 10⁻³      2.9 × 10⁻³
Covariance       1.53 × 10⁻⁵     6.90 × 10⁻⁶
Correlation      0.8822          0.9076
AIC              −11.4477        −7.1422
the MLP network are shown in Fig. 6. Table 7 shows the detailed comparison of these networks.
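The lagged regressor used in this experiment, predicting y(t + 1) from y(t) and u(t − 3), can be built with a small helper (an illustrative sketch, not the authors' code):

```python
def make_pairs(y, u, lag_u=3):
    """Build (input, target) pairs modeling y(t+1) from y(t) and u(t - lag_u),
    as in the Box-Jenkins experiment above."""
    pairs = []
    for t in range(lag_u, len(y) - 1):
        pairs.append(([y[t], u[t - lag_u]], y[t + 1]))
    return pairs

# tiny synthetic check: y ramps by 1, u by 10, so the pairs are easy to verify
y = [float(i) for i in range(10)]
u = [float(10 * i) for i in range(10)]
pairs = make_pairs(y, u)
assert pairs[0] == ([3.0, 0.0], 4.0)
assert len(pairs) == 6
```

The same pattern (with different lags) covers the Mackey–Glass and internet-traffic regressors as well.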
5 Conclusions
This paper presents a new approach towards the conceptualization of a neuron model with better learning and generalization capabilities. The idea was motivated by nonlinear activities in the brain, which have been modeled here by the most basic of all nonlinearities. While this is not the first instance in which a new neuron has been proposed as a potent model, this work provides a simpler and more generalized method to implement the model so that it can be used without the hassles of
Fig. 6 Performance result on the Box–Jenkins dataset
Table 7 Comparison of performance for the Box–Jenkins gas furnace dataset between the GMN model and a standard multilayer network, both trained using the gradient descent method

                 GMNs (r = 0.9)  MLPs
Topology         2 × 3 × 1       2 × 5 × 1
Epochs           389             4000
Training error   1.802 × 10⁻⁶    3.841 × 10⁻⁴
Testing error    9.22 × 10⁻⁴     0.0010
Covariance       1.46 × 10⁻⁴     2.27 × 10⁻⁵
Correlation      0.9894          0.9856
AIC              −13.0532        −7.5846
possible combinatorial explosions, as in higher-order neurons. The simulation results show that the proposed GMN model outperforms the existing perceptron model.
Acknowledgements We would like to thank P.V. Ramadas, HCL-Infinet, for providing the internet traffic data and Prof. D.H. Ballard, University of Rochester, USA, for useful discussions related to the proposed neuron model.
References
1. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133
2. Koch C (1999) Biophysics of computation: information processingin single neurons. Oxford University Press, New York
3. Schmitt M (2001) On the complexity of computing and learningwith multiplicative neural networks. Neural Comput 14:241–301
4. Shin Y, Ghosh J (1995) Ridge polynomial networks. IEEE Trans Neural Netw 6:610–622
5. Zhang CN, Zhao M, Wang M (2000) Logic operations based on single neuron rational model. IEEE Trans Neural Netw 11:739–747
6. Basu M, Ho TK (1999) Learning behavior of single neuron classifiers on linearly separable or nonseparable inputs. IEEE IJCNN'99 2:1259–1264
7. Labib R (1999) New single neuron structure for solving nonlinear problems. IEEE IJCNN'99 1:617–620
8. Iyoda EM, Nobuhara H, Hirota K (2003) A Solution for the N-bit parity problem using a single translated multiplicative neuron.Neural Processing Lett 18:233–238
9. Hoppensteadt F, Izhikevich E (2001) Canonical neuron models.In: Arbib MA (ed), Brain theory and neural networks. MIT Press,Cambridge
10. Box GEP, Jenkins GM, Reinsel GC (1994) Time series analysis: forecasting and control. Prentice Hall, Englewood Cliffs
11. Mackey M, Glass L (1977) Oscillation and chaos in physiological control systems. Science 197:287–289
12. Guler M, Sahin E (1994) A new higher-order binary-input neural unit: learning and generalizing effectively via using minimal number of monomials. In: Proceedings of the Third Turkish Symposium on Artificial Intelligence and Neural Networks, pp 51–60
13. Proakis JG (2001) Digital communications. McGraw Hill Interna-tional, Singapore
14. Ripley BD (1996) Pattern recognition and neural networks. Cam-bridge University Press, Cambridge
15. Schreiner K (2001) Neuron function: the mystery persists. IEEE Intell Syst 16:4–7
16. Murata N, Yoshizawa S, Amari S (1994) Network information criterion: determining the number of hidden units for an artificial neural network model. IEEE Trans Neural Netw 5:865–872
17. Plate TA (2000) Randomly connected sigma-pi neurons can form associator networks. Network: Comput Neural Syst 11:321–332
18. Piegat A (2001) Fuzzy modeling and control. Physica-Verlag, Hei-delberg, New York
19. Ripley BD (1994) Neural networks and related methods of classification. J Roy Stat Soc Ser B 56:409–456
20. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control AC-19:716–723
21. Fogel DB (1991) An information criterion for optimal neural network selection. IEEE Trans Neural Netw 2:490–497
22. Hardy GH, Wright EM (1979) Some notations. In: An introduction to the theory of numbers, 5th edn. Clarendon Press, Oxford, pp 7–8