ANALYSIS AND DESIGN OF THE MULTI-LAYER PERCEPTRON USING
POLYNOMIAL BASIS FUNCTIONS
The members of the committee approve the doctoral
dissertation of Mu-Song Chen
Michael T. Manry, Supervising Professor
Kai S. Yeung
Venkat Devarajan
Jonathan Bredow
Daniel S. Levine
Dean of the Graduate School
ANALYSIS AND DESIGN OF THE MULTI-LAYER PERCEPTRON USING
POLYNOMIAL BASIS FUNCTIONS
by
MU-SONG CHEN
Presented to the Faculty of the Graduate School of
The University of Texas at Arlington in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT ARLINGTON
DECEMBER 1991
ACKNOWLEDGEMENTS
I would like to express my deepest appreciation to my supervising professor, Dr.
Michael T. Manry, for his support, encouragement, and guidance. Without his constant
encouragement and willingness to meet at odd hours, I would not have been able
to complete my dissertation. I would also like to thank the other members of my
dissertation committee, Dr. Yeung, Dr. Devarajan, Dr. Levine, and Dr. Bredow, for
providing constructive suggestions.
I also owe a great deal to the members of the Image Processing and Neural
Networks Lab including Kamyar Rohani and Steve Apollo, for helping me with the
software tools on the school computers.
Finally, I wish to thank my parents, who always believed in higher education for
their children, for their support and encouragement. I am also forever grateful for the
sacrifices they made in supporting me while I was miles and miles away from home.
November 7, 1991
ABSTRACT
ANALYSIS AND DESIGN OF THE MULTI-LAYER PERCEPTRON USING
POLYNOMIAL BASIS FUNCTIONS
Publication No._________
Mu-Song Chen, Ph.D.
The University of Texas at Arlington, 1991
Supervising Professor: Michael T. Manry
In this dissertation, the theory of polynomial basis functions is developed as a
means for the design and analysis of multi-layer perceptron (MLP) neural networks.
Methods and algorithms are presented for designing the MLP network system using
polynomial models. The theory enables us to develop an approximation theorem for the
MLP network, to map an existing N-dimensional polynomial function to the MLP network
(forward mapping), and to construct polynomial discriminant functions from an existing
MLP network (inverse mapping).
There are several advantages associated with forward and inverse mappings. The
forward mapping allows us to determine the minimum required network topology and to
initialize the network with small errors. The inverse mapping allows us to prune the
useless units in the MLP network, determine the complexity of the conventional
implementation of the network, and find the polynomial approximation of the network
output.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF ACRONYMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
NOMENCLATURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
CHAPTER
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Nonlinear Modelling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 MLP Neural Network Model and Its Problems. . . . . . . . . . . . . . . . . . 5
1.3 The Scope of the Dissertation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2. NETWORK BASIS FUNCTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Orthogonal Basis Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Radial Basis Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Polynomial Basis Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3. MLP APPROXIMATION THEOREMS . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 Polynomial Approximating of Functions. . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Approximating Functions of One Variable. . . . . . . . . . . . . . . . . 24
3.1.2 Approximating Functions of Many Variables. . . . . . . . . . . . . . . 26
3.2 Realization of Multi-Input Products. . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Completion of the Proof. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4. FORWARD MAPPINGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1 Complete Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.1 Multi-Layer Complete Networks . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.2 Experimental Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.3 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Compact Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 A Lower Bound on the Number of Weights. . . . . . . . . . . . . . . 44
4.2.2 Construction of Compact Networks with Monomial Activation . . 47
4.2.2.1 Compact Mapping with Block Approach. . . . . . . . . . . . . . 49
4.2.2.2 Compact Mapping with Group Approach. . . . . . . . . . . . . . 52
4.2.3 Conversion of Monomial Activation to Analytic Activation. . . . . 54
4.2.4 Sparse Second Degree Compact Network . . . . . . . . . . . . . . . . . 56
4.2.5 Experimental Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.6 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5. INVERSE MAPPINGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Polynomial Network Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Calculation of the Condensed Network Model. . . . . . . . . . . . . . . . . . 71
5.3 Network Pruning Using the Condensed Network Model. . . . . . . . . . . 75
5.4 Calculation of the Exhaustive Network Model. . . . . . . . . . . . . . . . . . 82
5.5 Experiments with the Exhaustive Network Model. . . . . . . . . . . . . . . 84
5.5.1 MLP Neural Network Filter . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.2 MLP Neural Network Classifier . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.3 Experiments with Quadratic Discriminants. . . . . . . . . . . . . . . . . 87
6. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
APPENDIX A. BACK-PROPAGATION LEARNING ALGORITHM . . . . . . 93
APPENDIX B. REALIZATION OF MONOMIAL AND TWO-INPUT
PRODUCT SUBNETS. . . . . . . . . . . . . . . . . . . . . . . . . . . 97
B.1 Find X0 for the Second Degree Taylor Series. . . . . . . . . . . . . . . . . . 98
B.2 Conditions for Mapping Accuracy of the Truncated Taylor Series. . . . 99
B.3 Monomial Subnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
B.4 Two-Input Product Subnet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
APPENDIX C. FEATURE DATA SET . . . . . . . . . . . . . . . . . . . . . . . . . . 108
REFERENCES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
LIST OF FIGURES
Figure 1.1 The Multi-Layer Perceptron Network . . . . . . . . . . . . . . . . . . . . . . 6
Figure 1.2 Artificial Neuron Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Figure 1.3 Four Representative Activation Functions. . . . . . . . . . . . . . . . . . . . . 9
Figure 1.4 Block Diagram of the Proposed Tasks. . . . . . . . . . . . . . . . . . . . . . . 12
Figure 2.1 MLP Network Representation via Radial Basis Functions . . . . . . . . . 17
Figure 2.2 MLP Network Representation via Polynomial Basis Functions . . . . . . 21
Figure 3.1 Monomial Subnet with Multiple Inputs. . . . . . . . . . . . . . . . . . . . . . . 29
Figure 3.2 Construction of f(x) by Subnet Approaches . . . . . . . . . . . . . . . . . . 31
Figure 4.1 Block Diagram of Forward Mapping. . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 4.2 Subnet Approach for Mapping a 4-Input Second Degree Polynomial . . 35
Figure 4.3 The 4-Input Second Degree Complete Network . . . . . . . . . . . . . . . 35
Figure 4.4 The Multi-Layer Complete Network for Realizing a Function with N =
2 and P = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 4.5 The Multi-Layer Complete Network for Realizing Product x1 x2 ⋯ x7 . . 40
Figure 4.6 Training Results of a Complete Network (16-136-4) with CHEF Data . 42
Figure 4.7 Training Results of a Complete Network (16-136-4) with LPTF Data . . 43
Figure 4.8 Compact Mapping with Block Approach. . . . . . . . . . . . . . . . . . . . . . 48
Figure 4.9 Flowchart of Iterative Conjugate-Gradient Method. . . . . . . . . . . . . . . 51
Figure 4.10 Compact Mapping with Group Approach. . . . . . . . . . . . . . . . . . . . . 53
Figure 4.11 Conversion of a kth Degree Monomial Activation to Sigmoid
Activation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 4.12 Replace Monomial Activation (2nd Degree) by Sigmoid Activation . . 57
Figure 4.13 Efficient Compact Mapping of Second Degree Function. . . . . . . . . . 59
Figure 4.14 Training Results of a Compact Network with RWEF Data . . . . . . . . 61
Figure 4.15 Group Approach for Realizing All x1 Terms (Second Degree Case) . . 63
Figure 4.16 Group Approach for Realizing All x2 Terms (Second Degree Case) . . 63
Figure 4.17 Group Approach for Realizing All x3 Terms (Second Degree Case) . . 64
Figure 4.18 Group Approach for Realizing All x4 Terms (Second Degree Case) . . 64
Figure 4.19 Group Approach for Realizing All x5 Terms (Second Degree Case) . . 65
Figure 4.20 Group Approach for Realizing All x6 Terms (Second Degree Case) . . 65
Figure 4.21 Block Approach for Realizing All Terms of a Second Degree
Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 5.1 Block Diagram of Inverse Mappings. . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 5.2 Decide Unit Degree p(i) for the ith Unit . . . . . . . . . . . . . . . . . . . . 73
Figure 5.3 The Shaded Unit i is Ready to be Removed . . . . . . . . . . . . . . . . . . 77
Figure 5.4 Pattern Classifier Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 5.5 The Network After Pruning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 5.6 Results of Karnin’s Analysis (Shaded Units and Dark Lines are
Candidates for Removal). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Figure 5.7 3-Input Median Filter Network with Layer Structure 3-10-1 and
Maximum Degree p(i) = 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Figure A.1 Backpropagate the Error Signals from Output Layer. . . . . . . . . . . . . 96
Figure B.1 Monomial Subnet x^k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Figure B.2 Monomial Subnet x^2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Figure B.3 Product Subnet x1x2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Figure C.1 Examples of Geometric Shapes. . . . . . . . . . . . . . . . . . . . . . . . . . . 110
LIST OF TABLES
Table 4.1 Subnet Approach for Realizing a Function with N = 2 and P = 3 . . . . . 36
Table 4.2 Comparisons of Single-Layer and Multi-Layer Complete Networks . . . . 41
Table 4.3 Classification Results of Gaussian Classifier and Complete Network . . . 42
Table 4.4 Classification Error Percentages for Gaussian Classifier and Second
Degree Compact Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 4.5 Mean and STD Deviation for the Coefficients in Different Functions . . 61
Table 4.6 Results for Mapping a 4th Degree Function to Compact Networks,
Using Group and Block Approaches . . . . . . . . . . . . . . . . . . . . . . . . 62
Table 4.7 Compare the Required Hidden Units with the Theoretical Results . . . . . 66
Table 4.8 Using Block Approach to Approximate Function 1/(x1x2) . . . . . . . . . . 68
Table 5.1 Truth Table of the Exclusive-OR Problem. . . . . . . . . . . . . . . . . . . . . 74
Table 5.2 Approximation of the Exclusive-OR MLP Network (Layer Structure :
2-1-1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Table 5.3 Truth Table of the Parity-Check Problem. . . . . . . . . . . . . . . . . . . . . . 74
Table 5.4 Approximation of the Parity-Check MLP Network (Layer Structure :
3-10-1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Table 5.5 Weights in Figure 5.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Table 5.6 Input Weight Sensitivities in Figure 5.4. . . . . . . . . . . . . . . . . . . . . . . 79
Table 5.7 Output Weight Sensitivities in Figure 5.4. . . . . . . . . . . . . . . . . . . . . . 80
Table 5.8 Degree of Each Unit for 3-Input Median Filter When T = 0.01 . . . . . . 86
Table 5.9 Output Coefficients of the Approximating Polynomial for 3-Input
Median Filter Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Table 5.10 Error % for Gaussian, Quadratic Discriminants and MLP Network . . . 88
Table 5.11 Analysis of Similarity Between Gaussian and Quadratic
Discriminants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Table 5.12 Training of the Quadratic Discriminants. . . . . . . . . . . . . . . . . . . . . . 90
Table C.1 Shape Features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
LIST OF ACRONYMS
MLP Multi-Layer Perceptron
BP Back-Propagation
PNN Probabilistic Neural Network
RBF Radial Basis Function
PBF Polynomial Basis Function
UB Upper Bound Number of Hidden Units
LB Lower Bound Number of Hidden Units
MSE Mean Square Error
CHEF Circular Harmonic Expansion Features
RDF Radius Features
LPTF Log-Polar Transform Features
RWEF Ring-Wedge Energy Features
FMDF Fourier-Mellin Descriptor Features
NOMENCLATURE
N dimension of f(·) or number of input units of the MLP network
x N-dimensional input vector [x1, x2, ..., xN]^T
di(x) the ith discriminant function
p(i) the probability of occurrence for the ith class
mi the mean vector for class i
Covi covariance matrix for class i
Dis(·) distance measure function
Xnet(i) net input of the ith hidden unit
φ(i) polynomial basis function for the ith hidden unit
w(i,j) connection weight from unit j to unit i
θ(j) threshold of the jth hidden unit
G(·) analytic activation output
E(W) error function for the MLP network
Nh number of hidden units
f(·) the desired network output function
Nv number of training vectors (patterns)
Ωi(·) the ith radial basis function
αi connection weight from the ith hidden unit to the output unit in the RBF network
X0(i) expansion point for Taylor series of the ith hidden unit
A(i,j) Taylor series coefficient of the jth power term in the ith unit
E(i,ξ) the remainder coefficient for Taylor series approximation of the ith hidden unit's activation output
p(i) degree of polynomial expansion for the ith unit's activation output
M(i) convergence radius for the ith hidden unit's power series
X vector of all one-term polynomials with degrees 0 through P
L dimension of X
Af coefficient vector of X
P the output degree of the MLP networks
wo(i) output weight connecting to the ith class
C output coefficient matrix of the MLP network
Ns total number of sampling points
ϕi(x) orthogonal polynomial basis function with a single variable
Ψi(x) orthogonal polynomial basis function with multiple variables
u(x) weighting function for orthogonal polynomial basis functions
fk(x) an N-dimensional polynomial with terms of degree k only
<φ(i)> the average output activation of the ith hidden unit
D(i,j) polynomial coefficient of the jth power term in the ith hidden unit
d(i,n) the coefficient vector for Xnet^n(i)
E(p(i)) mean square error between the actual activation output and its polynomial approximation with degree p(i) in the ith hidden unit
R(p(i)) relative mean square error of E(p(i))
T threshold for R(p(i))
Sij weight sensitivities between hidden units i and j
NIT total number of iterations
p(J) the maximum output degree in hidden layer J
T(p)(i) desired output of the ith class for the pth pattern
O(p)(i) output of the ith output unit for the pth pattern
η learning factor in the BP algorithm
Nc number of output classes
CHAPTER 1
INTRODUCTION
Real world systems for processing signals are commonly classified according to
several criteria. Among these criteria are (1) the purpose of the system, such as control,
communication, signal processing etc., and (2) the mathematical characteristics of the
system such as linearity or non-linearity, time variance or invariance etc. The
mathematical characteristics of the system are determined by modelling its behavior for
different classes of inputs. The mathematical complexity of the model depends on how
much is known about the process being studied and on the purpose of the modelling
exercise. In preliminary studies of systems, the models are often assumed to be linear in
the parameters. Such models will be referred to as linear models. In the early
development of signal processing, linear systems were the primary tools. Their
mathematical simplicity and the existence of some desirable properties made them easy
to design and implement. However, more realistic and accurate models are often
nonlinear in the parameters; these will be referred to as nonlinear models. In this chapter, we briefly
discuss some conventional nonlinear modelling techniques, which are applied in systems
for filtering and classification. Then we proceed to introduce multi-layer perceptron neural
networks as an alternative approach for nonlinear modelling.
1.1 Nonlinear Modelling
Nonlinear systems are often applied in filtering and classification. For example,
it is well known that in detection and estimation problems, nonlinear filters arise in the
case when the signal and noise joint densities are not Gaussian, and when the noise is not
independent of the signal. A possible way to describe the input-output relationship in a
nonlinear filter is to use a discrete Volterra series representation. The Volterra series can
be viewed as a Taylor series with memory [1]
$$y(n) = h_0 + \sum_{k=1}^{N} H_k[x(n)] \qquad (1.1)$$

where x(n) and y(n) denote the input and output, respectively, and

$$H_k[x(n)] = \sum_{i_1=0}^{N-1} \sum_{i_2=0}^{N-1} \cdots \sum_{i_k=0}^{N-1} h_k(i_1, i_2, \ldots, i_k)\, x(n-i_1)\, x(n-i_2) \cdots x(n-i_k) \qquad (1.2)$$

The Volterra filter is general enough to model many of the classical nonlinear filters,
including order statistic [2]-[4] and morphological filters [5]. By adding higher-order
terms to the Volterra series, its modelling accuracy can be improved. Filters based on the
Volterra series and another representation, called the Wiener series [1], will be referred to as
polynomial filters. Unfortunately, the design of polynomial filters often requires
knowledge of the higher order statistics of the input signal. Filters of higher than second
degree are usually impractical because the number of coefficients is too large.
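As an illustration of Eqs. (1.1)-(1.2), the sketch below evaluates a Volterra filter truncated at second order; the memory length, kernel values, and input sequence are placeholders chosen only for the example, not values from the dissertation. Even at second order the kernel h2 already has N² entries, which hints at why higher-degree polynomial filters become impractical.

```python
import numpy as np

# Minimal sketch of a truncated (second-order) Volterra filter, Eqs. (1.1)-(1.2).
# The kernels h0, h1, h2 and the memory length are illustrative placeholders.
def volterra_output(x, n, h0, h1, h2):
    """y(n) = h0 + sum_i h1(i) x(n-i) + sum_{i1,i2} h2(i1,i2) x(n-i1) x(n-i2)."""
    N = len(h1)                                   # memory length
    xs = np.array([x[n - i] for i in range(N)])   # delayed samples x(n), x(n-1), ...
    linear = h1 @ xs                              # first-order (linear) term
    quadratic = xs @ h2 @ xs                      # second-order term
    return h0 + linear + quadratic

rng = np.random.default_rng(0)
x = rng.normal(size=100)                          # example input sequence
h0, h1 = 0.1, np.array([0.5, 0.25, 0.1])
h2 = 0.05 * np.eye(3)                             # diagonal second-order kernel
print(volterra_output(x, n=10, h0=h0, h1=h1, h2=h2))
```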
Polynomials also arise in classification problems. They are members of the class
of parametric classifiers, which means that the feature vectors to be classified are assumed
to have probability densities with only a few parameters. The rth order polynomial
discriminant function can be expressed as

$$d_i(\mathbf{x}) = \omega_{i1}\, g_1(\mathbf{x}) + \omega_{i2}\, g_2(\mathbf{x}) + \cdots + \omega_{ik}\, g_k(\mathbf{x}) + \omega_{i,k+1} \qquad (1.3)$$

where x is an N-dimensional vector, x = [x1, x2, ..., xN]^T, and T denotes transpose. In
general, ω_ij is referred to as a weight and g_j(x) is of polynomial form

$$g_j(\mathbf{x}) = x_{k_1}^{n_1}\, x_{k_2}^{n_2} \cdots x_{k_r}^{n_r}, \qquad n_1, n_2, \ldots, n_r = 0 \text{ or } 1, \qquad k_1, k_2, \ldots, k_r = 1, \ldots, N \qquad (1.4)$$

In the case of r = 2, d_i(x) is called a quadratic discriminant function

$$d_i(\mathbf{x}) = \sum_{j=1}^{N} \omega_{jj}\, x_j^2 + \sum_{j=1}^{N-1} \sum_{m=j+1}^{N} \omega_{jm}\, x_j x_m + \sum_{j=1}^{N} \omega_j\, x_j + \omega_{k+1} \qquad (1.5)$$

and k = ½N(N + 3). The quadratic discriminant function is useful because it is the
nonlinear discriminant which is most easily designed. It arises from the Bayes classifier,
when the feature vectors have a Gaussian joint probability density [6]-[7]. For normally
distributed patterns x, the optimal Bayes decision function of the ith class can be found
as

$$d_i(\mathbf{x}) = \ln p(i) - \frac{1}{2}\ln \left| \mathrm{Cov}_i \right| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathrm{Cov}_i^{-1} (\mathbf{x} - \mathbf{m}_i) \qquad (1.6)$$

p(i), m_i and Cov_i denote the probability of occurrence, mean vector and covariance matrix
for the ith class, respectively. The performance of the Bayes classifier is often sub-optimal
because the distribution of the patterns is often non-Gaussian or even discrete-valued, the
class statistics are estimated from a finite number of training or example vectors, and the
covariance matrix inversions can be ill-conditioned. Specht [8] has presented a
probabilistic neural network (PNN), which has been approximated via Taylor series to
yield the Padaline [9]. The PNN is based on the Bayes strategy of finding the network
output in a polynomial form. Gabor [10] designed a machine consisting of a polynomial
classifier, together with a training algorithm. The training algorithm optimizes the output
by successive adjustment of the coefficients until the output errors are small. The
problems with polynomial discriminant functions are the same as the problems with
conventional polynomial filters. Discriminants of degree greater than two are rarely used
because of the large storage requirements and the large number of operations necessary
to generate outputs. For trainable discriminants, such as those of Gabor [10], the training
process is also time consuming.
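A minimal numerical sketch of the Gaussian discriminant of Eq. (1.6) follows; the class statistics and test vector below are placeholders, standing in for estimates that would normally come from training data.

```python
import numpy as np

# Sketch of the Bayes-Gaussian discriminant d_i(x) of Eq. (1.6).
# p_i, m_i, and cov_i would normally be estimated from training vectors of class i.
def gaussian_discriminant(x, p_i, m_i, cov_i):
    diff = x - m_i
    return (np.log(p_i)
            - 0.5 * np.log(np.linalg.det(cov_i))
            - 0.5 * diff @ np.linalg.solve(cov_i, diff))

# Example with two synthetic classes in N = 2 dimensions.
x = np.array([0.4, -0.2])
scores = [gaussian_discriminant(x, 0.5, np.array([0.0, 0.0]), np.eye(2)),
          gaussian_discriminant(x, 0.5, np.array([1.0, 1.0]), np.eye(2))]
print("assigned class:", int(np.argmax(scores)))
```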
The other method for classification is to apply a nonparametric model (no
assumption is made about the underlying data distribution), such as the nearest neighbor
classifier [11]-[12]. Given a set of reference vectors R1, R2, ..., RM associated with
classes Ω1, Ω2, ..., ΩM, the rule of the nearest neighbor classifier assigns a pattern x to
the class of its nearest neighbor. Thus Ri is the nearest neighbor to x if

$$\mathrm{Dis}(\mathbf{R}_i, \mathbf{x}) = \min_{k = 1, 2, \ldots, M} \mathrm{Dis}(\mathbf{R}_k, \mathbf{x}) \qquad (1.7)$$

where Dis(·) is any distance measure defined over the pattern space. The nearest
neighbor classifier approximates the minimum error Bayes classifier as the number of
reference vectors becomes large. However, at the same time, the computational
complexity of the nearest neighbor classifier increases. Also, for a small number of the
reference vectors, the nearest neighbor classifier is not optimal with respect to the training
data.
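The decision rule of Eq. (1.7) reduces to a distance comparison against all reference vectors; a short sketch with a Euclidean choice for Dis(·) and made-up reference vectors:

```python
import numpy as np

# Nearest neighbor rule of Eq. (1.7) with a Euclidean distance measure.
# R is an M x N array of reference vectors, labels holds their class indices.
def nearest_neighbor(x, R, labels):
    distances = np.linalg.norm(R - x, axis=1)     # Dis(R_k, x) for k = 1..M
    return labels[int(np.argmin(distances))]      # class of the nearest reference

R = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 0.1]])   # example reference vectors
labels = np.array([0, 1, 1])
print(nearest_neighbor(np.array([0.8, 0.2]), R, labels))   # -> 1
```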
1.2 MLP Neural Network Model and Its Problems
In 1943, McCulloch and Pitts [13] proposed a mathematical model of the neuron.
These abstract nerve cells provided the basis for a formal calculus of brain activity. In
1958, Rosenblatt [14] presented the Rosenblatt perceptron, which was the most successful
neural network system of that time. It was an elementary visual system which could be
taught to recognize a limited class of patterns. This perceptron model is the foundation
for many other forms of artificial neural networks [15].
Definition : A perceptron is a device which computes a weighted sum of its inputs, and
puts this sum through a special function, called the activation, to produce
the output. The activation function can be linear or nonlinear.
A network of linear perceptrons has serious computational limitations. For example, a
linear perceptron is incapable of yielding a discriminant that will solve the exclusive-or
and parity-check problems. That is, the linear perceptron cannot automatically learn the
discriminant that will classify the even and odd patterns. This limitation of the linear
perceptron network is overcome by adding layers of nonlinear perceptrons. The resulting
network is often called the multi-layer perceptron neural network.
The MLP networks are feedforward networks with one or more layers of units
between the input and output nodes. The networks are termed feedforward because signals
flow only in the forward direction, starting with the inputs, and no connections feed back
to previous or current layers. A typical MLP network is shown in
Figure 1.1. The input layer contains dummy units which simply distribute the inputs to
the network. The output layer units are equivalent to the network discriminant functions.
Between them are the hidden layers. In the following, we introduce the MLP
network's structure.

Figure 1.1 The Multi-Layer Perceptron Network.

Consider the ith hidden unit in Figure 1.1; the inputs to this unit are the weighted
outputs from all previous layers. The net input, Xnet(i), for the ith unit is formulated as
follows

$$X_{net}(i) = \sum_{j} w(i,j)\,\varphi(j) + \theta(i) \qquad (1.8)$$

φ(j) is the activation output of the jth unit and θ(i) is a variable bias with a similar function
to a threshold. w(i,j) is the weight connecting the jth unit to the ith unit, and the
summation is over all units feeding into the ith unit. Figure 1.2 shows a model that
implements the idea.
Figure 1.2 Artificial Neuron Model.

As shown in Figure 1.2, G is an activation function which performs a transformation of
the net input and decides the output level of the ith unit, φ(i). G can be a linear function
(Figure 1.3(a)) such that

$$\varphi(i) = K \cdot X_{net}(i), \qquad K \text{ is a constant} \qquad (1.9)$$

or a squaring function (Figure 1.3(b))

$$\varphi(i) = X_{net}^{2}(i) \qquad (1.10)$$

or a threshold function (Figure 1.3(c))

$$\varphi(i) = \begin{cases} 1, & \text{if } X_{net}(i) \ge T \\ 0, & \text{otherwise} \end{cases} \qquad (1.11)$$

However, a function that more accurately simulates the nonlinear transfer characteristics
of the biological neuron and permits more general network functions is shown in
Figure 1.3(d). This is the most commonly used activation function and is called the
sigmoid function. This function is expressed mathematically as

$$\varphi(i) = \frac{1}{1 + e^{-X_{net}(i)}} \qquad (1.12)$$

This sigmoid activation has the feature of being nondecreasing and differentiable, and its
range is 0 ≤ φ(i) ≤ 1.
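Putting Eqs. (1.8) and (1.12) together, the output of a single hidden unit is a sigmoid of an affine combination of the outputs feeding it; a minimal sketch with arbitrary example weights and inputs:

```python
import numpy as np

# One hidden unit of an MLP: net input of Eq. (1.8) followed by the sigmoid of Eq. (1.12).
def unit_output(phi_prev, w, theta):
    x_net = w @ phi_prev + theta          # Xnet(i) = sum_j w(i,j) phi(j) + theta(i)
    return 1.0 / (1.0 + np.exp(-x_net))   # phi(i), bounded between 0 and 1

phi_prev = np.array([0.2, -1.0, 0.5])     # outputs of the units feeding unit i
w = np.array([0.4, -0.3, 0.8])            # example weights w(i, j)
print(unit_output(phi_prev, w, theta=0.1))
```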
There are several potential advantages of the MLP neural networks. Unlike the
polynomial filter or Gaussian classifier, no assumption is made about the underlying data
distribution when designing the MLP networks, so the data statistics do not need to be
estimated [16]. Second, the parallel structure of the MLP network makes it realizable on
parallel computers. Third, the MLP network exhibits a great degree of robustness or fault
tolerance because of built-in redundancy; damage to a few nodes or links thus need not
impair overall performance significantly. Fourth, the MLP network can form any
unbounded decision region in the space spanned by the inputs. Such regions include
convex polygons and unbounded convex regions [17]. Finally, the MLP networks have
a strong capability for function approximation.

Figure 1.3 Four Representative Activation Functions.

Additional major characteristics of the MLP network are its abilities to learn and
generalize. Learning can be viewed as producing a surface in multidimensional space that
fits the set of training data in some best sense. Generalization is learning which is limited
such that when new patterns are input into the network, they are processed (filtered or
classified) almost as well as the training input patterns were. If an MLP network has
enough free parameters, it can learn without generalization. In other words, it can
implement a multi-dimensional Lagrange interpolation [18] of the training data, but can
perform very poorly when applied to additional patterns. In Appendix A, we demonstrate
a training method for the MLP networks which is based upon the iterative steepest
descent algorithm. This is the so-called back-propagation (BP) learning algorithm [19].
Although the MLP network, with BP learning, has been used for many pattern
classification and filtering problems, it has many drawbacks. These include the following.
First, it is often unclear which network topology is required for the solution of a given
problem. For a given task, the number of hidden layers and hidden units are varied until
satisfactory results are obtained. Two methods have been proposed as remedies. One
consists of starting with a small network and expanding it [20]-[21]; the other consists of
starting with a large network and pruning it to a smaller size [22]-[23]. In both cases, the
training time is increased significantly over regular BP learning. Therefore, it is desirable
to find new techniques for determining the network topology.
Second, large scale MLP networks have extremely low training rates when BP is
used, since the networks are highly nonlinear in the weights and thresholds. The network
might become trapped in a local minimum of the error function being minimized.
Convergence to local minima can be caused by an improper setting of the initial weights
and thresholds, but there is presently no reliable method for initializing the network.
Third, there is no simple, effective theory which explains the behavior
and mapping capabilities of the MLP network. Although several researchers [24]-[27]
have demonstrated that sufficiently complex multilayer networks are capable of arbitrarily
accurate approximations to an arbitrary mapping, there is no rule to determine the optimal
number of hidden units Nh required for the solution of a given problem. In addition, the
methods to find a set of weights and thresholds to approximate the given function are still
unclear.
1.3 The Scope of the Dissertation
In this dissertation, we introduce polynomial basis functions (PBFs) and use them
to analyze and design the MLP neural networks which have analytic activation functions.
The basic tasks are (1) to develop a PBF model for the MLP network, (2) to find methods
for mapping polynomial filters and discriminants to the MLP network (forward mapping),
and (3) to find methods for calculating the PBF model from an existing MLP network
(inverse mapping).
be possible to freely convert many continuous polynomial functions from an
N-dimensional power series representation to an N-input MLP network, and vice versa.
This dissertation is divided into 6 chapters. In Chapter 2, three basis function
representations of the MLP networks are presented. First, the orthogonal and radial basis
function (RBF) networks are reviewed. Then, polynomial basis functions are introduced.
The approximation theorems for the MLP networks are proved based on the PBF model.
This is the main topic in Chapter 3. In Chapter 4, the mapping theorems are developed
and used to find practical methods for mapping polynomial functions to the MLP network.
Two kinds of networks, complete and compact networks, are demonstrated for forward
mappings. In Chapter 5, techniques are described for modelling an existing MLP network
by finite degree polynomial functions. Finally, the conclusions are given in Chapter 6.
The BP algorithm, the construction of monomial and product subnets, and several feature data
sets are presented in Appendices A, B and C, respectively.
Figure 1.4 Block Diagram of the Proposed Tasks.
CHAPTER 2
NETWORK BASIS FUNCTIONS
In recent years, several researchers have developed methods to analyze the
behavior of the MLP neural networks, in order to compare them to conventional
classifiers and filters. Nerrand [28] has trained recurrent neural networks to perform
nonlinear adaptive filtering and modelling. Klimasauskas [29] and Anderson [30]
discussed the use of the MLP networks for noise filtering and compared them with linear
Wiener filtering. Their technique is experimental, and has not led to an increased
theoretical understanding of the MLP network. Gallinari [31] has compared a linear MLP
network to a conventional discriminant analysis method which utilizes projections of the
input vectors onto optimal subspaces. Asoh [32] performed a regression analysis on the
MLP networks having nonlinear hidden units. The drawback to his analysis is that it
consists principally of empirical observations. Toshio [33] presents a multiple logistic
model to find the weights of the MLP network, which is based upon a maximum
likelihood method. Unfortunately, statistical network design methods, such as his, require
prior knowledge or information about training data characteristics.
Several basis vector approaches have been used to study the MLP networks. Fujita
[34] proposed to use output state vectors of hidden units as internal representations of the
MLP networks. His technique, called the Orthogonal Complement Method, allows one to
estimate the necessary number of hidden units from the dimension of the subspace
spanned by the input state vectors. His approach is useful for designing binary output
systems. Unfortunately, the required number of computations increases exponentially with
the network size. Sandberg [35] proved a universal approximation theorem for radial
basis function (RBF) networks. However, RBF networks have only one hidden layer, and
have a very specific type of activation function. Therefore, analyses of RBF networks are
of limited applicability to the more general MLP networks. In this chapter, our goal is to
introduce a polynomial basis vector representation for the MLP networks. First, we
review the orthogonal basis network and the RBF network. Then we introduce the more
general concept of polynomial basis functions (PBFs).
2.1 Orthogonal Basis Functions
Qian and Lee [36] have designed the MLP networks using a set of orthogonal
basis functions. That is, the network output is expressed as

$$\hat{f}(\mathbf{x}) = \sum_{i} \Omega_i\, \Psi_i(\mathbf{x}) \qquad (2.1)$$

and Ω is the weight vector. If f(x) is the desired output function, then the best
approximation is obtained by the least mean square minimization of the error function

$$E = \frac{1}{2} \sum_{j=1}^{N_v} \left[ f(\mathbf{x}^{(j)}) - \hat{f}(\mathbf{x}^{(j)}) \right]^2 \qquad (2.2)$$

Nv is the number of training vectors. The summation in Eq. (2.2) can be
approximated by an integration when the training set contains a large number of vectors,

$$E \approx \frac{1}{2} \int \left[ f(\mathbf{x}) - \hat{f}(\mathbf{x}) \right]^2 p(\mathbf{x})\, d\mathbf{x} = \frac{1}{2} \left\langle \left[ f(\mathbf{x}) - \hat{f}(\mathbf{x}) \right]^2 \right\rangle \qquad (2.3)$$

Here, p(x) is the probability distribution of x. It can be proved [37]-[38] that Ω can be
found as Ω = R⁻¹⟨f(x)Ψ(x)⟩, where R is the correlation matrix of the Ψi(x)'s. Inversion of the
R matrix is practical if R is an orthogonal matrix or identity matrix. Then the weight
vector is Ω = ⟨f(x)Ψ(x)⟩. The problem left is how to find a set of orthogonal basis
functions. For the one-dimensional case, Qian [36] performed a variable change, dµ =
p(x)dx, such that

$$\int \Psi_i(x)\,\Psi_j(x)\, p(x)\, dx = \delta_{ij} \;\;\Rightarrow\;\; \int \Psi_i(\mu)\,\Psi_j(\mu)\, d\mu = \delta_{ij} \qquad (2.4)$$

There are several sets of orthogonal basis functions Ψ(µ) available for finite support, such
as Ψ(µ) = 1, cos(πµ), cos(2πµ), cos(3πµ), ..., which are defined on [0,1]. However, their
results are difficult to extend to higher-dimensional problems.
2.2 Radial Basis Functions
A RBF network [35],[39]-[42] can be regarded as a single hidden layer MLP
network, in which the output is a linear function of the hidden unit outputs. In the RBF
network, the network output function f(x) for the input vector x is represented by

$$f(\mathbf{x}) = \alpha_0 + \sum_{i=1}^{N_h} \alpha_i\, \Omega_i(\|\mathbf{x} - \mathbf{x}_i\|) \qquad (2.5)$$

where α0 is an additive bias and αi (i ≥ 1) represents a weight from the ith hidden unit
to the output. By clustering training patterns x into Nh clusters, xi will be taken as the
mean vector of each cluster and is known as the RBF center. Typically Ωi(·) is chosen
as a Gaussian function,

$$\Omega_i(\|\mathbf{x} - \mathbf{x}_i\|) = \exp\!\left( -\sum_{j} \frac{(x_j - x_{ij})^2}{2\sigma_{ij}^2} \right) \qquad (2.6)$$

In Eq. (2.6), ‖·‖ denotes a norm which is usually taken to be Euclidean. The σij's are the
elements of a covariance matrix, which is taken to be diagonal. The representation of the
MLP network via radial basis functions is shown in Figure 2.1.
There are three steps in the design of the RBF network. First, we pick a
representative set of training vectors. Second, one hidden unit is chosen for each training
vector. Finally, we find the best set of output weights to approximate the desired output.
Because of the linear dependence of the network output on the weights in the RBF
expansion of Eq. (2.5), a global minimum exists in the error function for the RBF
network. Therefore, the adjustable output weights, αi, can be determined using the linear
least squares method. This is an important advantage of this approach.
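Because the output of Eq. (2.5) is linear in the weights αi, the final design step reduces to an ordinary linear least squares problem; a sketch follows, with Gaussian basis functions, arbitrary widths, and a crude choice of centers, all of which are placeholders for this illustration.

```python
import numpy as np

# Sketch of the RBF design: Gaussian hidden units (Eq. 2.6 with a shared width)
# and output weights found by linear least squares, using the linearity of Eq. (2.5).
def rbf_design(X, d, centers, sigma):
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-dist2 / (2.0 * sigma ** 2))           # hidden unit outputs
    Phi = np.hstack([np.ones((len(X), 1)), Phi])        # prepend bias alpha_0
    alpha, *_ = np.linalg.lstsq(Phi, d, rcond=None)     # least squares output weights
    return alpha

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))                    # training vectors
d = X[:, 0] * X[:, 1]                                   # desired outputs (example)
centers = X[::10]                                       # crude choice of RBF centers
print(rbf_design(X, d, centers, sigma=0.5))
```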
The RBF network represented by Eq. (2.5) has many useful qualities, including
fast learning and ease of design. However, for certain classes of problems, the RBF
approach may not be a good strategy. First, the number of RBF centers (hidden units) is
much greater than the number of hidden units used in an MLP network designed from the
same training data. There is usually considerable redundancy in the network's hidden
units. Second, RBF networks work only when the centers are well chosen. In practice the
centers are often arbitrarily selected from sampling data [43]. Such a mechanism is clearly
unsatisfactory. Third, the large number of centers required results in long training time
for the output weights. Here we suggest another set of basis functions for modelling the
MLP networks.
Figure 2.1 MLP Network Representation via Radial Basis Functions.
2.3 Polynomial Basis Functions
The goal here is to present a PBF model for the MLP neural networks. The PBF
model solves the problems of the RBF network. The advantages are
(1). The PBF model, applied to single and multiple hidden layer networks, is general
enough to describe both the MLP and RBF networks.
(2). The PBF model leads to the approximation theorems for the MLP networks.
(3). The PBF model leads to straightforward mappings between the MLP networks
and conventional filtering and classification algorithms.
(4). The PBF model results in finding the polynomial approximation of an existing
MLP network.
As with the RBF approach, we have one polynomial basis function for each hidden
unit. Assume that the hidden unit activations are analytic functions, such as sigmoid
functions. Then the activation of the ith unit in the network can be modeled as a power
series with integer degree p(i) [44],

$$\varphi(i) = \sum_{j=0}^{p(i)} A(i,j)\,(X_{net}(i) - X_0(i))^j + E(i,\xi)\,(X_{net}(i) - X_0(i))^{p(i)+1} \qquad (2.7)$$

for

$$\left| X_{net}(i) - X_0(i) \right| \le M(i) \qquad (2.8)$$

The net input, Xnet(i), of the ith unit is

$$X_{net}(i) = \sum_{k} \varphi(k)\, w(i,k) + \theta(i) \qquad (2.9)$$

where index k is for all the hidden units or input units feeding the ith hidden unit from
the previous layers. For input units, p(i) = 1 and φ(i) = xi, where xi denotes the ith
component of an N-dimensional input vector x = [x1, x2, ..., xN]^T. φ(k), k > N, is the
activation output of the (k−N)th hidden unit, w(i,k) is the synaptic weight between them,
and θ(i) is an additive bias. A(i,j) is the Taylor series coefficient of the jth power term
in the ith unit,

$$A(i,j) = \frac{G^{(j)}(X_0(i))}{j!} \qquad (2.10)$$

where G^(j)(·) is the jth derivative of the analytic activation function. E(i,ξ) is the
remainder term (ξ is somewhere between Xnet(i) and the expansion point X0(i)). Good
choices for M(i) (radius of convergence), p(i), and X0(i) allow accurate approximations
for a wide variety of activation functions.
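For the sigmoid activation, the coefficients A(i,j) of Eq. (2.10) can be generated symbolically; a minimal sketch follows, where the expansion point X0 and the degree p are arbitrary choices made only for the example.

```python
import sympy as sp

# Taylor coefficients A(i,j) = G^(j)(X0)/j! of Eq. (2.10) for a sigmoid unit,
# and the degree-p polynomial activation model of Eq. (2.7) (remainder dropped).
x = sp.Symbol('x')
G = 1 / (1 + sp.exp(-x))                       # sigmoid activation, Eq. (1.12)

def taylor_coefficients(X0, p):
    return [float(sp.diff(G, x, j).subs(x, X0)) / float(sp.factorial(j))
            for j in range(p + 1)]

def activation_model(x_net, X0, coeffs):
    return sum(a * (x_net - X0) ** j for j, a in enumerate(coeffs))

A = taylor_coefficients(X0=0.0, p=5)           # 0.5, 0.25, 0.0, -1/48, ...
print(A)
print(activation_model(0.3, 0.0, A), 1.0 / (1.0 + sp.exp(-0.3)))
```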
The output of the MLP network can be characterized as the weighted sum of the
polynomial basis vectors (Figure 2.2)

$$f(\mathbf{x}) = \sum_{i=1}^{N_u} w_o(i)\,\varphi(i) \qquad (2.11)$$

where Nu = 1 + N + Nh and Nh denotes the number of hidden units in the network.
Equations (2.7), (2.9) and (2.11), with the degrees p(i), form a condensed model of the
MLP network, in which the output is a weighted sum of compositions of polynomials.

Substituting Eq. (2.9) into Eq. (2.7), and multiplying out the compositions, the
activation of the ith unit can be written as

$$\varphi(i) \approx C(i)\,\mathbf{X} \qquad (2.12)$$

C(i) is a coefficient vector and

$$\mathbf{X} = [X_{01}, X_{11}, \ldots, X_{12}, X_{22}, \ldots, X_{31}, \ldots]^T \qquad (2.13)$$

The vector X has elements Xkm, which denote the mth one-term polynomial of degree k
in the variables xj for j = 1 to N. For example, X01 = 1, X1m = xm, and X2m denotes the
terms x1², x2², x1x2, x1x3, etc. Given the dimension N of the input vector, the number of
degree-k terms is (k+N−1)!/(k!(N−1)!) and the total number of terms in X is

$$L = \sum_{k=0}^{P} \frac{(k+N-1)!}{k!\,(N-1)!} \qquad (2.14)$$

where P is the highest degree in X.
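Eq. (2.14) is a sum of binomial coefficients and is easy to evaluate; a small sketch, with N and P chosen to match examples used elsewhere in the dissertation:

```python
from math import comb

# Total number of one-term polynomials of degree 0..P in N variables, Eq. (2.14):
# L = sum_k (k+N-1)! / (k! (N-1)!) = sum_k C(k+N-1, k).
def num_terms(N, P):
    return sum(comb(k + N - 1, k) for k in range(P + 1))

print(num_terms(2, 3))    # N = 2, P = 3  -> 10 terms (the ten terms of Table 4.1)
print(num_terms(4, 2))    # N = 4, P = 2  -> 15 terms
```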
Definition : A polynomial basis function is an N-variable polynomial which
approximates a hidden unit's activation output. The N variables are the
network inputs.
Substituting Eq. (2.12) into Eq. (2.11),

$$f(\mathbf{x}) = \mathbf{W}_o^T\, \mathbf{C}\, \mathbf{X} \qquad (2.15)$$

Wo is an Nu by 1 output weight vector. The C matrix is an Nu by L coefficient matrix

$$\mathbf{C} = \begin{bmatrix} C(1,1) & C(1,2) & C(1,3) & \cdots & C(1,L) \\ C(2,1) & C(2,2) & C(2,3) & \cdots & C(2,L) \\ \vdots & & & & \vdots \\ C(N_u,1) & C(N_u,2) & C(N_u,3) & \cdots & C(N_u,L) \end{bmatrix} \qquad (2.16)$$

The ith row of the C matrix is the vector C(i) in Eq. (2.12). Since L, the column
dimension of the C matrix, can be very large, the calculation and storage of the
C matrix can be prohibitive; therefore, we call Eq. (2.15) an exhaustive polynomial basis
function model of f(x).
Figure 2.2 MLP Network Representation via Polynomial Basis Functions.
CHAPTER 3
MLP APPROXIMATION THEOREMS
The approximation capabilities of the MLP networks have been proven by many
investigators [24]-[27]. They have demonstrated that sufficiently complex MLP
networks are capable of accurate approximations to arbitrary continuous mappings over
a bounded compact set. A mathematical result of Kolmogorov [45] has been interpreted
as saying that for any continuous mapping, there exists a three-layer MLP network which
realizes it. These results indicate that the MLP network provides a very powerful tool for
realizing nonlinear mappings for filtering, control, and pattern classification.
Unfortunately, these investigators have not
(1). Given procedures for determining the number of hidden units, Nh, required for the
solution of a given problem,
(2). Given a technique for finding the network weights, or
(3). Given simple proofs of the approximation capabilities.
A solution to problem (1) is critical. If the number of units in a hidden layer is
too large (over-determined case), the network can memorize the training data and perform
poorly at generalization tasks. If the number of units is too small (under-determined case),
recall accuracy will suffer and the network may fail to extract the desired relationship
from the training data. In this chapter, we give an approximation theorem and propose a
constructive proof for it, which uses the concept of the polynomial basis functions. This
proof solves the problem of choosing the network structure in the design of the MLP networks.
First, the theorem is stated as follows.
Theorem 3.1: Any continuous function defined over a bounded compact set can be
approximated using an MLP neural network with hidden units having the
activation function

$$G(X_{net}(i)) = \sum_{j=0}^{p(i)} A(i,j)\,(X_{net}(i) - X_0(i))^j + E(i,\xi)\,(X_{net}(i) - X_0(i))^{p(i)+1} \qquad (3.1)$$

where all terms A(i,j) are nonzero for j between 2 and p(i).
Proof : The proof consists of three steps. In section 3.1, we review the Weierstrass
approximation theorem and multi-dimensional orthonormal polynomials. The Weierstrass
theorem shows the existence of multivariate approximating polynomials, and the
orthonormal polynomials provide a concrete method for generating an approximating
polynomial. In section 3.2, we show that each term in the approximating polynomial,
which is a multi-input product, can be realized by a subnet in an MLP network.
3.1 Polynomial Approximating of Functions
As the first step of the proof, we briefly review polynomial approximations for
single variable functions, and then review the multivariate case.
3.1.1 Approximating Functions of One Variable
According to the Weierstrass approximation theorem [46], "Any bounded function
F(x) can be uniformly approximated over a closed interval [a,b] by a polynomial f(x),

$$\left| F(x) - f(x) \right| \le \epsilon, \qquad a \le x \le b \qquad (3.2)$$

where ε is a positive real number", the approximating polynomial f(x) can be found as
a weighted sum of Pth degree orthogonal polynomial basis functions [47]

$$f(x) \approx \sum_{n=0}^{P} c_n\, \varphi_n(x) \qquad (3.3)$$

where

$$\varphi_n(x) = \sum_{i=0}^{n} a_i\, x^i \qquad (3.4)$$

for i = 0, 1, 2, .... The approximation mean square error can be written as

$$Err = \sum_{j=1}^{N_s} u(x^{(j)}) \left[ F(x^{(j)}) - f(x^{(j)}) \right]^2 \qquad (3.5)$$

where Ns is the total number of sampling points of F(x) and the superscript j denotes different
sampling points. u(x) is the weighting function associated with the kind of
orthogonal polynomials being used, such that

$$\int_{-\infty}^{\infty} u(x)\,\varphi_i(x)\,\varphi_j(x)\, dx = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases} \qquad (3.6)$$

As an example, the first five Legendre polynomials [48] are

$$\varphi_0(x) = 1, \quad \varphi_1(x) = x, \quad \varphi_2(x) = \tfrac{3}{2}x^2 - \tfrac{1}{2}, \quad \varphi_3(x) = \tfrac{5}{2}x^3 - \tfrac{3}{2}x, \quad \varphi_4(x) = \tfrac{35}{8}x^4 - \tfrac{15}{4}x^2 + \tfrac{3}{8} \qquad (3.7)$$

and the weighting function is

$$u(x) = \begin{cases} 1, & -1 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (3.8)$$

The mean square error can be minimized in the sense of the least square error
approach [49] and the coefficients cn are calculated as

$$c_n = \frac{\displaystyle\sum_{j=1}^{N_s} u(x^{(j)})\,\varphi_n(x^{(j)})\,F(x^{(j)})}{\displaystyle\sum_{j=1}^{N_s} u(x^{(j)})\,\varphi_n^2(x^{(j)})} \qquad (3.9)$$

As a consequence of the Weierstrass theorem, the set of polynomials {φi(x)} is complete
in the sense that for any continuous function F(x), Err tends to zero when P → ∞.
As a second example, we use the Laguerre polynomials [49] (the weighting function
u(x) is e^(−x) for 0 ≤ x < ∞) to approximate the function F(x) = e^(−2x), for x ∈ [0,∞). The
resulting approximating function is

$$F(x) \approx f(x) = \tfrac{1}{3}\varphi_0(x) + \tfrac{2}{9}\varphi_1(x) + \tfrac{2}{27}\varphi_2(x) + \tfrac{4}{243}\varphi_3(x) \qquad (3.10)$$

where the φi(x)'s are the Laguerre polynomials

$$\varphi_0(x) = 1, \quad \varphi_1(x) = 1 - x, \quad \varphi_2(x) = x^2 - 4x + 2, \quad \varphi_3(x) = -x^3 + 9x^2 - 18x + 6 \qquad (3.11)$$
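The truncated expansion of Eqs. (3.10)-(3.11) can be checked numerically; the sketch below compares the degree-3 approximation with F(x) = e^(−2x) at a few sample points (accuracy is only modest here because the series is cut off after four terms).

```python
import numpy as np

# Degree-3 Laguerre approximation of F(x) = exp(-2x), Eqs. (3.10)-(3.11).
phi = [lambda x: 1.0,
       lambda x: 1.0 - x,
       lambda x: x**2 - 4.0*x + 2.0,
       lambda x: -x**3 + 9.0*x**2 - 18.0*x + 6.0]
c = [1.0/3.0, 2.0/9.0, 2.0/27.0, 4.0/243.0]        # expansion coefficients

def f_approx(x):
    return sum(cn * pn(x) for cn, pn in zip(c, phi))

for x in [0.0, 0.5, 1.0, 2.0]:
    print(x, f_approx(x), np.exp(-2.0 * x))         # truncated series vs. F(x)
```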
3.1.2 Approximating Functions of Many Variables
Extension of the single variable function to the multiple variables case is
straightforward. The Stone-Weierstrass theorem [46] says that continuous multivariate
functions can be approximated by weighted combinations of continuous univariate
functions. Suppose that we have a complete system of orthonormal functions of one
variable, φ0(x), φ1(x), φ2(x), ..., over a bounded interval a ≤ x ≤ b. Then a complete
system of orthonormal functions of N variables, x1, x2, ..., xN, may be constructed by
taking N-tuples (products) of functions from the one-variable set and substituting the
variables, x1, x2, ..., xN, in the arguments. For instance, suppose that we want to construct
five Legendre orthogonal functions of three variables (N = 3). From the above discussion
we have

$$\begin{aligned}
\Psi_0(x_1,x_2,x_3) &= \varphi_0(x_1)\varphi_0(x_2)\varphi_0(x_3) = 1 \\
\Psi_1(x_1,x_2,x_3) &= \varphi_0(x_1)\varphi_0(x_2)\varphi_1(x_3) = x_3 \\
\Psi_2(x_1,x_2,x_3) &= \varphi_0(x_1)\varphi_1(x_2)\varphi_0(x_3) = x_2 \\
\Psi_3(x_1,x_2,x_3) &= \varphi_1(x_1)\varphi_0(x_2)\varphi_0(x_3) = x_1 \\
\Psi_4(x_1,x_2,x_3) &= \varphi_0(x_1)\varphi_1(x_2)\varphi_1(x_3) = x_2 x_3
\end{aligned} \qquad (3.12)$$

If the original functions are orthonormal in the interval a ≤ x ≤ b, the resulting N-variable
functions, Ψ0(x), Ψ1(x), ..., ΨN(x), are orthonormal over the hypercube a ≤ xi ≤ b, i =
1, 2, ..., N [47], i.e.

$$\int_{x_1=a}^{b} \int_{x_2=a}^{b} \cdots \int_{x_N=a}^{b} u(\mathbf{x})\,\Psi_i(\mathbf{x})\,\Psi_j(\mathbf{x})\, d\mathbf{x} = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases} \qquad (3.13)$$

where

$$u(\mathbf{x}) = u(x_1, x_2, \ldots, x_N), \qquad \Psi_i(\mathbf{x}) = \Psi_i(x_1, x_2, \ldots, x_N) \qquad (3.14)$$

After we set up the basis functions Ψ0(x), Ψ1(x), ..., ΨN(x), the corresponding
coefficient ci for Ψi(x) can be found by using Eq. (3.9).
3.2 Realization of Multi-Input Products
In the previous section, we gave a review of N-dimensional approximating
polynomials and demonstrated how they can be constructed from sets of one-dimensional
orthonormal polynomials. Assume that such an approximating polynomial with dimension
N and degree P has been found, and is represented as

$$f(\mathbf{x}) = \mathbf{A}_f^T\, \mathbf{X} \qquad (3.15)$$

where Af is a coefficient vector with dimension L (from Eq. (2.14)). Comparing Eq.
(2.15) with Eq. (3.15), the output weight vector can be found by inverting the C matrix.
That is,

$$\mathbf{W}_o^T\, \mathbf{C}\, \mathbf{X} = \mathbf{A}_f^T\, \mathbf{X} \qquad (3.16)$$

and

$$\mathbf{W}_o^T = \mathbf{A}_f^T\, \mathbf{C}^{-1} \qquad (3.17)$$

if the C matrix has rank L and is made to be square. To prove that a C matrix of rank L
exists, we show that a multi-input product (and therefore elements of X) can be
constructed with an MLP network having one hidden layer. After showing Rank[C] = L,
making the C matrix square is then simply a matter of discarding linearly dependent rows.
In Appendix B, methods are given for designing the MLP networks which
approximate monomial functions and products of two inputs. In the following, we show
that a product of k terms can be generated using an element which realizes x^k and from
elements which realize the product of (k−1) terms. It can then be shown by induction that
products of k variables can be generated in one hidden layer.

Let g(x1, x2, ..., xN) = (x1 + x2 + ... + xN)^N, which is realizable using a monomial
subnet (Figure 3.1).

Lemma 3.1 : Let h(x1, x2, ..., xN) denote the function g(x1, x2, ..., xN) with one or more
variables, xi, replaced by one. Then all terms in g are present in h,
except terms with the variable xi, which are reduced in degree by at least
1.
For example,

$$g(x_1,x_2) = x_1^2 + x_2^2 + 2x_1x_2, \qquad h(x_1,x_2) = g(1,x_2) = 1 + x_2^2 + 2x_2 \qquad (3.18)$$

Since the x2² term does not have an x1, which has been set to 1, it is present in g(x1,x2)
and g(1,x2). The next theorem shows that products of N inputs can be constructed using
the monomial of degree N and products of (N−1) terms.
Figure 3.1 Monomial Subnet with Multiple Inputs.
Theorem 3.2: Let hi(x1, x2, ..., xN), i = 1 to N−1, represent the sum of all functions
g(x1, x2, ..., xN) having i variables set to 1. Then the function

$$g(x_1,x_2,\ldots,x_N) + \sum_{i=1}^{N-1} (-1)^i\, h_i(x_1,x_2,\ldots,x_N) = K\, x_1 x_2 \cdots x_N + \text{terms of degree } N-1 \text{ or less} \qquad (3.19)$$

where K is a non-zero constant.

Proof : Here we use Lemma 3.1 repeatedly. First, every term in g having degree N
and N−1 variables (one variable is squared and one other is absent) can be removed as

$$g(x_1,x_2,\ldots,x_N) - h_1(x_1,x_2,\ldots,x_N) \qquad (3.20)$$

However, this operation subtracts all terms having degree N in N−2 variables N−2 times
instead of the required 1 time. This is corrected by adding back h2(x1, x2, ..., xN).
However, this adds back terms of degree N with N−3 variables too many times.
Continuing this process, we get the result through induction.

For example, Eq. (3.19) is written for the 3-input case as

$$(x_1+x_2+x_3)^3 - (1+x_1+x_2)^3 - (1+x_2+x_3)^3 - (1+x_1+x_3)^3 + (1+1+x_1)^3 + (1+1+x_2)^3 + (1+1+x_3)^3 = 6x_1x_2x_3 - 6x_1x_2 - 6x_1x_3 - 6x_2x_3 + 6x_1 + 6x_2 + 6x_3 + 21 \qquad (3.21)$$

Thus, the 3-input product, x1x2x3, can be realized using the third degree monomial and 2-input product subnets.
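Theorem 3.2 can be checked symbolically for the 3-input case; the following sketch expands the left side of Eq. (3.21) and confirms that all cubic terms other than x1x2x3 cancel.

```python
import sympy as sp

# Symbolic check of Eq. (3.21): the alternating sum of monomial subnets leaves
# 6*x1*x2*x3 plus terms of degree 2 or less (Theorem 3.2 with N = 3).
x1, x2, x3 = sp.symbols('x1 x2 x3')
g = (x1 + x2 + x3)**3
h1 = (1 + x1 + x2)**3 + (1 + x2 + x3)**3 + (1 + x1 + x3)**3   # one variable set to 1
h2 = (1 + 1 + x1)**3 + (1 + 1 + x2)**3 + (1 + 1 + x3)**3       # two variables set to 1
print(sp.expand(g - h1 + h2))
# -> 6*x1*x2*x3 - 6*x1*x2 - 6*x1*x3 - 6*x2*x3 + 6*x1 + 6*x2 + 6*x3 + 21  (up to ordering)
```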
Corollary : The operation in Theorem 3.2 requires a number of hidden units equal to

$$N_1(N) = 1 + \sum_{k=1}^{N-1} \binom{N}{k} \qquad (3.22)$$

Theorem 3.3: The product of N bounded inputs can be realized in an MLP network
having one hidden layer of units having the activation of Eq. (3.1). The
required number of hidden units is

$$N_u(N) = N_1(N) + \sum_{k=2}^{N-1} N_2(k,N)\, N_u(k) \qquad (3.23)$$

where N2(k,N) = (k+N−1)!/(k!(N−1)!).

Proof : There are N2(k,N) terms of degree k which can be constructed using a set of N
inputs. Nu(N) then equals N1(N) plus Nu(k) units for each possible term of degree k, for k
= 2 to N−1.
3.3 Completion of the Proof
From Theorem 3.3, each 2nd or higher degree term in X can be closely
approximated by a product subnet composed of several monomial subnets. The function
f(x) in Eq. (3.15) is then approximated by the MLP network, by taking the weighted sum
of the outputs of all the subnets, as shown in Figure 3.2. Therefore the proof is complete.
Figure 3.2 Construction of f(x) by Subnet Approaches.
CHAPTER 4
FORWARD MAPPINGS
In the previous chapter, we have proved the approximation theorem (Theorem 3.1),
which states that it is possible to map N-input degree-P polynomials to the MLP network.
In this chapter we discuss practical methods for performing such mappings. There are
several advantages associated with such forward mappings. First, the mappings provide
good initial weights for the MLP network. The BP learning algorithm can then be used
to improve upon this initial solution. Second, the mapping approach leads to specific
network topologies. A block diagram of the forward mapping methodology is shown in
Figure 4.1.
Assume that we want to map a function of N variables to an MLP network.
Following Figure 4.1, the first step is to obtain an N-variable polynomial expansion of the
function, using orthonormal polynomials. Using the subnet approach of Appendix B or
the mapping theorems which are developed later, one or more terms of the polynomial
expansion is realized as a subnet. Redundant units, and the corresponding linearly
dependent PBFs, are removed. The final network may then be improved through BP
learning. In this chapter, two approaches for performing this forward mapping are
discussed.
4.1 Complete Networks
In this section, our goal is to describe a simple approach for mapping a given
Figure 4.1 Block Diagram of Forward Mapping.
polynomial function to a complete network. The complete network is defined as follows.
Definition : A complete network of degree P and N inputs is an MLP network which has
L hidden units, with no redundant units (the rank of the output coefficient
matrix C is L).
In principle, complete networks with one hidden layer can be designed by following the
procedure in the proof of Theorem 3.1. This involves the construction of one product
subnet for each term of second or higher degree in X. The redundant, linearly dependent,
hidden units are removed so that Nh = L − N − 1.
Theorem 4.1: A complete network with UB = (L−N−1) hidden units is capable of
approximating any N-input, degree-P polynomial f(x) = Af^T X, where Af is
a coefficient vector with dimension L as in Eq. (2.14). Here UB denotes
upper bound.
Proof : The constructive proof in Chapter 3 shows that each term of X can be closely
represented by one subnet. However, the C matrix formed by the expansion of all the
subnets has exactly rank L. This implies that some rows of the C matrix are linearly
dependent on others and can be discarded until there are L independent vectors left. In
this case, each term is completely described by one PBF (or one hidden unit) and the
function f(x) is a unique linear combination of a set of linearly independent PBFs. Of
these L rows in the C matrix, N of them model the input units, which contribute to the
final output through direct connections. One row models the effects of thresholds. This
leaves only (L−N−1) units to represent the approximating function f(x).
Assume that we are to implement a quadratic function of N variables or features,
as is used in the Bayes Gaussian classifier [6]-[7]. Each second order product of the N
features must be realized. These include N squared terms and N(N−1)/2 cross products.
From Appendix B, squares can be realized by 1-1-1 subnets and the cross products can
be realized as 2-3-1 subnets. This requires Nh = N + 3N(N−1)/2 units. Since each
product subnet generates redundant monomial subnets, the redundant monomial subnets
can be removed. This results in Nh = N(N+1)/2. Note that the number of hidden units
equals the number of second degree terms in the quadratic polynomial of N variables. An
example of mapping the quadratic polynomial of 4 variables, before and after removing
redundant units, is shown in Figure 4.2 and Figure 4.3.
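A quick arithmetic check of these counts: for a 16-feature quadratic function (the size of the data sets used in Section 4.1.2), N(N+1)/2 agrees with the L − N − 1 bound of Theorem 4.1.

```python
from math import comb

N = 16                                          # number of features (inputs)
direct = N * (N + 1) // 2                       # squares plus cross products
L = sum(comb(k + N - 1, k) for k in range(3))   # Eq. (2.14) with P = 2
print(direct, L - N - 1)                        # both 136, the UB used in Section 4.1.2
```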
It is possible to extend this process of removing redundant units to higher degree
polynomials. Another example shown here is to realize the function

$$f(\mathbf{x}) = a x_1^3 + b x_1^2 x_2 + c x_1 x_2^2 + d x_2^3 + e x_1^2 + g x_1 x_2 + h x_2^2 \qquad (4.1)$$

as a single hidden layer MLP network, using the developed theorem. Table 4.1 lists the
required hidden units for each term, from Theorem 3.3. In total, there are 41 hidden units
required to map f(x) by this subnet approach.

Figure 4.2 Subnet Approach for Mapping a 4-Input Second Degree Polynomial.
Figure 4.3 The 4-Input Second Degree Complete Network.
Table 4.1 Subnet Approach for Realizing a Function with N = 2 and P = 3.

Terms      Required Units      Terms      Required Units
x1^3       2                   x1x2       3
x1^2 x2    16                  x2^2       1
x1 x2^2    16                  x1         0
x2^3       2                   x2         0
x1^2       1                   constant   0

Since the product subnet for x1x2 generates redundant monomial subnets (x1^2 and x2^2), the
monomial subnets for x1^2 and x2^2 can be removed. In addition, product terms like x1^2 x2 and
x1 x2^2 can be realized from the existing monomial subnets (from the product subnet) and direct
connections from the inputs. Finally, 7 hidden units are necessary for mapping the function
f(x). The number of hidden units equals the number of second degree terms (3 terms) and
third degree terms (4 terms) in Eq. (4.1).
Lemma 4.1 : A complete network of degree P and N inputs is capable of realizing any
number of additional polynomials (with the same N and P), without an
increase in the number of hidden units.
Proof : From the proof of Theorem 4.1, complete networks already have one hidden unit
for each term in X (second or higher degree) of the function. Therefore, new functions
are easily realized by connecting the existing (L−N−1) hidden units, N inputs, and a bias
term directly to the new output node.
In summary, the design of complete networks having one hidden layer proceeds
in the following steps (a numerical sketch of Step 4 follows the list):
Step 1. Given the polynomial function to be mapped, calculate UB, the required
number of hidden units.
Step 2. Initialize input weights for each hidden unit (Appendix B).
Step 3. Eliminate redundant units.
Step 4. Calculate output weights using Eq. (3.17).
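Step 4 is a linear algebra problem once the C matrix of the pruned network is known; below is a small sketch in which a made-up square, full-rank C stands in for the coefficient matrix that would actually come from the PBF expansion of the subnets.

```python
import numpy as np

# Step 4 of the complete-network design: output weights from Eq. (3.17),
# Wo^T = Af^T C^(-1), solved here as C^T Wo = Af without forming the inverse.
def output_weights(C, Af):
    return np.linalg.solve(C.T, Af)

rng = np.random.default_rng(2)
L = 6
C = rng.normal(size=(L, L))            # stand-in square coefficient matrix of rank L
Af = rng.normal(size=L)                # coefficients of the target polynomial f(x)
Wo = output_weights(C, Af)
print(np.allclose(Wo @ C, Af))         # check: Wo^T C = Af^T
```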
The final complete network is far more efficient than the original network
composed of subnets, because it requires one hidden unit for each of the L terms in X.
However, there are two drawbacks to the complete network. First, a network must be
designed with far more than (L−N−1) hidden units, and then pruned of its linearly
dependent units. Second, some units generate very high degree products, which
result in large weight values. We propose a multi-layer complete network which requires
the same number of hidden units.
4.1.1 Multi-Layer Complete Networks
In the design of the multi-layer complete network, we assume that each subnet is
used to implement a squaring x² (1 hidden unit required) or product xixj (3 hidden units
required) operation. The layers are numbered starting with n = 1 for the input layer. For
the second and higher layers (n ≥ 2), the hidden units generate terms of degree k, where
k falls between (1 + 2^(n−2)) and 2^(n−1). Define ⌈·⌉ as a ceiling function, e.g. ⌈1.1⌉ = 2 and ⌈2.9⌉
= 3. We state the following lemmas to realize a polynomial function using the multi-layer
complete network.
Lemma 4.2 : Given a functionf(x) with maximum degreeP, it can be realized in the
multi-layer complete networkwith log2P hidden layers.
As an example, the multi-layer complete network which realizes the function

    f(x) = A0 + (x1 + x2) + (x1 + x2)^2 + (x1 + x2)^3 + (x1 + x2)^4        (4.2)

having two hidden layers, is shown in Figure 4.4. The UB number of hidden units for an N-dimensional, degree-P function is the same for both single-layer and multi-layer complete networks. However, if f(x) has some missing terms, and therefore a sparse Af vector, multi-hidden-layer topologies are sometimes more efficient. As a more extreme example, when we design a single product term of N inputs, the multi-layer complete network is more efficient than the single-layer complete network. Several lemmas are developed next to give the required hidden layers and units for product terms.
Lemma 4.3 : An N-variable product (all variables having first degree only) can be realized with ceil(log2 N) hidden layers.

Lemma 4.4 : The number of hidden units in each hidden layer in Lemma 4.3 is

    Nh(j) = 3 ceil( N / 2^j - 0.5 ) ,    j = 1, 2, ..., ceil(log2 N)        (4.3)

The example for a 7-input product term, x1 x2 ... x7, is constructed as shown in Figure 4.5.

Figure 4.4 The Multi-Layer Complete Network for Realizing a Function with N = 2 and P = 4.

As a result of Lemmas 4.3 and 4.4, comparisons for realizing a single product term using single- and multi-layer networks are listed in Table 4.2. Table 4.2 reveals that the multi-layer network does have an advantage in this case.
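The multi-layer column of Table 4.2 follows directly from Lemmas 4.3 and 4.4. A minimal sketch (Python, for illustration only) evaluates Eq. (4.3); the single-layer counts come from Theorem 3.3 of the previous chapter and are not reproduced here:

    import math

    def multilayer_product_units(N):
        """Hidden units per layer for an N-input product, per Eq. (4.3)."""
        layers = math.ceil(math.log2(N))
        per_layer = [3 * math.ceil(N / 2**j - 0.5) for j in range(1, layers + 1)]
        return per_layer, sum(per_layer), layers

    for N in range(3, 9):
        print(N, multilayer_product_units(N))
    # Reproduces the multi-layer column of Table 4.2, e.g.
    # N = 7 -> (9, 6, 3) units in 3 layers, 18 units total; N = 8 -> 21 units in 3 layers.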
Lemma 4.5 : An N-variable product in which m of the variables have degrees (n1, n2, ..., nm) can be realized with 1 + ceil(log2 K) hidden layers, where

    K = m + (N - m)/2 ,        when N - m is even
    K = m + (N - m + 1)/2 ,    when N - m is odd        (4.4)

Lemma 4.6 : The number of hidden units in each hidden layer in Lemma 4.5 is

    Nh(j) = Σ_{i=1}^{m} (ni - 1) + 3 ceil( (N - m)/2 - 0.5 ) ,    when j = 1
    Nh(j) = 3 ceil( K / 2^{j-1} - 0.5 ) ,                         when j = 2, 3, ..., ceil(log2 K)        (4.5)

where K is the same as in Eq. (4.4).

Figure 4.5 The Multi-Layer Complete Network for Realizing the Product x1 x2 ... x7.
4.1.2 Experimental Results

Our algorithms for designing the single-layer complete network have been tested on two examples. As a first step, Gaussian discriminants were designed from the shape feature data sets. The resulting polynomial discriminant functions were then mapped into complete networks. For the CHEF and LPTF type shape features, the resulting networks had 16 inputs and UB = 136 hidden units. In Table 4.3, the classification error percentages for both the Gaussian classifier and the complete network (after mapping) are listed. The performances are very similar, as one would expect. After the mapping is completed, the complete networks were trained using the BP algorithm. For comparison, networks having the same topology, but initialized with random weights and then trained, were also tested on the same data sets. From Figure 4.6 and Figure 4.7, complete networks with mapped weights outperform the same networks with random initial weights.

Table 4.2 Comparisons of Single-Layer and Multi-Layer Complete Networks.

    Number of Inputs N    Single-Layer      Multi-Layer
            3             16 Units          6 Units / 2 Layers
            4             65 Units          9 Units / 2 Layers
            5             246 Units         12 Units / 3 Layers
            6             917 Units         15 Units / 3 Layers
            7             3,424 Units       18 Units / 3 Layers
            8             12,861 Units      21 Units / 3 Layers
4.1.3 Summary

Several advantages associated with the complete network are summarized here. First, the mappings provide good initial weights for the MLP network. The BP algorithm can then be used to improve upon this set of initial weights. Second, the mapping approach leads to specific network topologies. Third, an upper bound on the required number of hidden units can be derived.

Table 4.3 Classification Results of the Gaussian Classifier and the Complete Network.

    Shape Features     Gaussian Classifier     Complete Network
    CHEF               3.00 %                  3.125 %
    LPTF               1.875 %                 1.75 %

Figure 4.6 Training Results of a Complete Network (16-136-4) with CHEF Data.
Figure 4.7 Training Results of a Complete Network (16-136-4) with LPTF Data.

It is desirable for the number of hidden units to be kept as low as possible. One problem in the design of the complete network is that the number of hidden units increases explosively as the number of inputs and the degree P increase. For example, if the given function has 23 inputs and P = 7, the total number of hidden units required for the mapping is 2,035,776. Large complete networks of high degree are certainly not feasible. A similar but more efficient mapping algorithm is presented in the next section.
4.2 Compact Networks

In complete networks, each 2nd or higher degree term of the vector X for the approximating polynomial is realized by one hidden unit (or one PBF). However, the number of free parameters (weights and thresholds) in the resulting network is far greater than the number of hidden units. In this section, we describe methods for designing "compact networks", in which the number of weights and the dimension of X are more in line. In the following sections, output weights are defined as the weights connecting the input and hidden units to the output units. Hidden weights are weights from the input units to the hidden units, or weights between hidden layers. First, we give a definition of the compact network.

Definition : A compact network (Rank[C] < L) is an MLP network in which each hidden unit realizes many of the terms in Eq. (3.15), resulting in fewer hidden units than the complete network.

Before we discuss practical methods for constructing compact networks, we find a lower bound on the number of hidden units required.
4.2.1 A Lower Bound on the Number of Weights

A theorem which specifies a lower bound on the number of hidden weights and thresholds is stated in the following.

Theorem 4.2 : An MLP network capable of realizing a continuous N-dimensional, degree-P function f(x) must have at least Nt free parameters (hidden weights and thresholds of hidden units) such that Nt >= L.

Proof : Assume an MLP network with N inputs, Nh hidden units and Nw hidden weights. There are Nt free parameters to be determined, with Nt = Nw + Nh, where Nh is the total number of threshold values for the Nh hidden units. From Eq. (2.15), the output of the MLP network is expressed in matrix form as Wo^T C X, which approximates the function f(x). Our purpose here is to approximate the C matrix by a first degree function of the hidden weights and thresholds, i.e.

    C = C0 + C1        (4.6)

C0 and C1 are Nu x L matrices with elements c0( ) and c1( ); c0(i,j) is a scalar and c1(i,j) is taken to be a first degree function of the hidden weights and thresholds. That is,

    c1(i,j) = h(i,j) δw = Σ_{k=1}^{Nt} h(i,j,k) δw(k)        (4.7)

δw is an Nt by 1 vector with δw = Wx - Wx0. Here Wx (with elements wx( )) is the vector of hidden weights and thresholds, and Wx0 is the corresponding vector of expansion points for the Taylor series; h(i,j,k) is the first degree Taylor series coefficient in the vector h(i,j). Substituting this approximation for C into Eq. (2.15), Eq. (2.15) can be rewritten as

    Wo^T C1 = Af^T - Wo^T C0        (4.8)

Wo^T C1 is a 1 x L vector whose nth element is expressed as a weighted summation of the δw( ),

    Σ_{v=1}^{Nu} wo(v) c1(v,n) = Σ_{k=1}^{Nt} [ Σ_{v=1}^{Nu} wo(v) h(v,n,k) ] δw(k) = Σ_{k=1}^{Nt} r(k,n) δw(k)        (4.9)

Then Wo^T C1 is identical to δw^T multiplied by a matrix R, where R is an Nt x L matrix with elements r(k,n). This results in

    δw^T R = Af^T - Wo^T C0        (4.10)

From Eq. (4.10), there are two cases in which a set of nontrivial solutions for the δw( ) exists. In the first case, if L = Nt (the number of terms in X equals the number of hidden weights plus thresholds) and Rank[R] = L, the δw( ) can be solved for by inverting the R matrix. Second, if Nt > L and Rank[R] = L, then there are infinitely many solutions for the δw( ). For both cases, the condition for nontrivial solutions of the δw( ) is Nt >= L.
Once the δw( ) are found, the wx( ) can be determined from the corresponding wx0( ). Although the initial solutions for Wx are only approximate, they can be improved further by iterative methods. As a special case of Theorem 4.2, the lower bound on the number of hidden units needed for a fully-connected MLP network to map a function f(x) is given in the following corollary.

Corollary : The lower bound on the number of hidden units for mapping a continuous N-input, degree-P function f(x) into a fully-connected MLP network is

    LB = ceil( (UB + N + 1) / (N + 1) )        (4.11)

where ceil( ) is the ceiling function.

Proof : For a fully-connected MLP network with Nh hidden units, there are Nt = N Nh + Nh free parameters. According to Theorem 4.2, Nt must be greater than or equal to L. Thus, the lower bound for Nh is

    Nh = Nt / (N + 1) >= (UB + N + 1) / (N + 1) ,    so    LB = ceil( (UB + N + 1) / (N + 1) )        (4.12)
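The bounds are easy to evaluate numerically. A small sketch (Python, illustration only) computes L, UB and LB, assuming L = C(N+P, P) is the number of monomials of degree at most P in N variables and UB = L - N - 1 is the number of second or higher degree terms; these assumptions reproduce the figures quoted earlier (136 hidden units for the 16-input quadratic networks, and 2,035,776 for N = 23, P = 7):

    from math import comb, ceil

    def term_counts(N, P):
        L  = comb(N + P, P)          # dimension of X
        UB = L - N - 1               # complete-network units (degree >= 2 terms)
        LB = ceil((UB + N + 1) / (N + 1))   # corollary, Eq. (4.11)
        return L, UB, LB

    print(term_counts(16, 2))   # -> (153, 136, 9)
    print(term_counts(23, 7))   # -> UB = 2,035,776, as quoted in Section 4.1.3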
Based on Theorem 4.2 and its corollary, we concentrate on the application of the single hidden layer MLP network with fully-connected weights. In this case, the free parameters (the hidden weights) are simply the input weights, from input units to hidden units. In addition, the thresholds of the hidden units are not taken into account.
4.2.2 Construction of Compact Networks with Monomial Activations

The goal here is to present practical methods for realizing compact networks using the monomial activation. In the latter part of the chapter, we consider how to use the sigmoid activation to design compact networks. The monomial activation is

    G(Xnet(i)) = Xnet^k(i)        (4.13)

where k is an integer greater than or equal to 2. The first design step is to divide f(x) up into blocks of equal-degree terms, as in Figure 4.8. That is,

    f(x) = A0 + f1(x) + f2(x) + f3(x) + ... + fP(x)        (4.14)

where fk(x) contains all terms of degree k. The kth degree block is expressed as a sum of products of the inputs,

    fk(x) = Σ_{i1=1}^{N} Σ_{i2=i1}^{N} ... Σ_{iN=iN-1}^{N} ak(i1,i2,...,iN) x1^{qk1(i1)} x2^{qk2(i2)} ... xN^{qkN(iN)}        (4.15)

and

    qk1(i1) + qk2(i2) + ... + qkN(iN) = k        (4.16)

where k >= qkm( ) >= 0 for m = 1, 2, ..., N.

Figure 4.8 Compact Mapping with the Block Approach.

Note that in Eq. (4.14), A0 can be a bias term in the output unit, and f1(x) (the first degree part) can easily be realized by connecting the inputs directly to the output and assigning the coefficients of the first degree terms to those weights. Thus no hidden unit is needed for mapping f1(x). We will focus on finding the LB number of hidden units, Nh(k), required for realizing each fk(x) with k >= 2. In the following, two approaches to compact mapping are discussed.
4.2.2.1 Compact Mapping with the Block Approach

The block approach realizes all the terms in each fk(x), k >= 2, at the same time. Using the monomial activation (Eq. (4.13)), the network output for the kth degree block is

    fk(x) ≈ Σ_{m=1}^{Nh(k)} φ(m) wo(m) = Σ_{m=1}^{Nh(k)} [ Σ_{i=1}^{N} wx(m,i) xi ]^k wo(m)        (4.17)

where wx(m,i) is the weight from the ith input unit to the mth hidden unit. Carrying out the multiplication, Eq. (4.17) can be rewritten as

    fk(x) ≈ Σ_{i1=1}^{N} Σ_{i2=i1}^{N} ... Σ_{iN=iN-1}^{N} bk(i1,i2,...,iN) x1^{qk1(i1)} x2^{qk2(i2)} ... xN^{qkN(iN)}        (4.18)

and

    bk(i1,i2,...,iN) = [ k! / ( qk1(i1)! qk2(i2)! ... qkN(iN)! ) ] Σ_{m=1}^{Nh(k)} wx^{qk1(i1)}(m,1) wx^{qk2(i2)}(m,2) ... wx^{qkN(iN)}(m,N) wo(m)        (4.19)

Initializing Wo with small random numbers, our purpose is to find a set of input weights such that the mean square error between the ak( ) and the bk( ) (Eq. (4.15) and Eq. (4.18)) is minimized. This involves solving a set of nonlinear equations. Whenever the gradient is available, a generalization of the conjugate-gradient method can be applied to minimize the nonlinear functions. Define the mean square error, Ek(Wx), as
    Ek(Wx) = Σ_{i1=1}^{N} Σ_{i2=i1}^{N} ... Σ_{iN=iN-1}^{N} [ ak(i1,i2,...,iN) - bk(i1,i2,...,iN) ]^2        (4.20)

Substituting bk( ) into Eq. (4.20) and taking the derivative of Ek(Wx) with respect to each input weight, the gradients, gmj, are found as

    gmj = ∂Ek(Wx) / ∂wx(m,j)
        = -2 Σ_{i1=1}^{N} ... Σ_{iN=iN-1}^{N} [ ak(i1,...,iN) - bk(i1,...,iN) ] ∂bk(i1,...,iN)/∂wx(m,j)
        = -2 Σ_{i1=1}^{N} ... Σ_{iN=iN-1}^{N} Q [ ak(i1,...,iN) - bk(i1,...,iN) ] wx^{qk1(i1)}(m,1) ... qkj(ij) wx^{qkj(ij)-1}(m,j) ... wx^{qkN(iN)}(m,N) wo(m)        (4.21)

for 1 <= m <= Nh(k) and 1 <= j <= N, where
    Q = k! / ( qk1(i1)! qk2(i2)! ... qkN(iN)! )        (4.22)

The basic conjugate-gradient iteration of Fletcher and Reeves [50]-[51] has the form

    Wx^n ← Wx^{n-1} + z d^{n-1}        (4.23)

where superscripts denote the iteration number, z is chosen to minimize Ek(Wx), and d^{n-1} is the direction vector at the (n-1)th iteration,

    d^n = -g^n + µ^n d^{n-1}        (4.24)

and

    µ^n = (g^n)^T g^n / [ (g^{n-1})^T g^{n-1} ]        (4.25)

A flowchart of this method is shown in Figure 4.9.

Figure 4.9 Flowchart of the Iterative Conjugate-Gradient Method.

Rather than using an arbitrary z, we find the zeros of the derivative of Ek(Wx - z d),

    ∂Ek(Wx - z d) / ∂z = Σ_i ci z^i = 0        (4.26)

where each ci is a function of the coefficients bk( ) and of the input and output weights. It is important to note that the conjugate-gradient algorithm must be restarted periodically in order to guarantee superlinear convergence [52]. The usual recommendation [53] is to restart after every Nh(k) N iterations. Thus, we set µ^n = 0 whenever n is divisible by Nh(k) N. Observe that the BP algorithm using steepest descent has the same form as Eq. (4.23), except that for the BP algorithm z is taken to be a constant, the learning rate. There is convincing theoretical and empirical evidence that conjugate gradients converge faster than steepest descent.
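A minimal sketch of the Fletcher-Reeves iteration with periodic restarts is given below (Python, illustration only). The closed-form line search of Eq. (4.26) is replaced here by a simple backtracking search, and the toy objective is a least-squares surrogate rather than the true Ek(Wx); only the update rules of Eqs. (4.23)-(4.25) are taken from the text.

    import numpy as np

    def conjugate_gradient_fr(f, grad, w0, n_iter=200, restart=None, z0=1.0):
        """Fletcher-Reeves CG with backtracking line search and periodic restart."""
        w = np.asarray(w0, dtype=float)
        g = grad(w)
        d = -g
        for n in range(1, n_iter + 1):
            gd, fw, z = g @ d, f(w), z0
            while f(w + z * d) > fw + 1e-4 * z * gd and z > 1e-12:
                z *= 0.5                      # backtracking step-length search
            w = w + z * d
            g_new = grad(w)
            if restart and n % restart == 0:
                mu = 0.0                      # periodic restart: pure steepest descent step
            else:
                mu = (g_new @ g_new) / (g @ g + 1e-30)   # Eq. (4.25)
            d = -g_new + mu * d               # Eq. (4.24)
            g = g_new
        return w

    # toy quadratic surrogate: drive a weight vector to a target in least squares
    target = np.array([1.0, -2.0, 0.5, 3.0])
    f    = lambda w: np.sum((w - target) ** 2)
    grad = lambda w: 2.0 * (w - target)
    print(np.round(conjugate_gradient_fr(f, grad, np.zeros(4), n_iter=50, restart=8), 4))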
The block approach has the advantage of fast design since the coefficients in each
degree term are realized simultaneously. However, problems arise whenever the
distribution of the input coefficients has large deviations. In this case, finding the global
minimum of Ek(Wx) generally requires too many iterations.
4.2.2.2 Compact Mapping with the Group Approach

Whenever the global minimum of Ek(Wx) is difficult to find, or the N or P of the given function is large, we recommend an alternative. Instead of realizing all the terms in each degree block at once, we divide the terms in each degree block into N groups (the group approach) and realize all xj product terms sequentially, for j from 1 to N. The general concept is illustrated in Figure 4.10. As an example, if a function has 3 inputs (N = 3) and only third degree terms, there are 10 terms after the expansion. These 10 terms can be divided into 3 groups:

    Group 1 : x1^3, x1^2 x2, x1^2 x3, x1 x2^2, x1 x2 x3, x1 x3^2
    Group 2 : x2^3, x2^2 x3, x2 x3^2
    Group 3 : x3^3        (4.27)

Figure 4.10 Compact Mapping with the Group Approach.

The theorems and corollary developed in the previous section apply to the group approach as well. Assume that the connections in each group are fully connected. Then in group j (which realizes all xj product terms) there are Nj = N - j + 1 inputs. In each group, the error function of Eq. (4.20) simplifies to

    Ek^{(j)}(Wx) = Σ_{im} [ ak(...,im,...) - bk(...,im,...) ]^2        (4.28)

where the sum is taken over the terms in group j. The iterative method discussed in the previous section for finding a set of input weights can then be applied again within each group.
To calculate the minimum number of hidden units required for all degree-k terms, we first need the number of hidden units in each group. From Eq. (2.14), the number of product terms, Npj, in group j (with degree k) is

    Npj = (k + Nj - 1)! / [ k! (Nj - 1)! ] - (k + Nj - 2)! / [ k! (Nj - 2)! ]        (4.29)

Thus, the minimum number of hidden units in group j is ceil(Npj / Nj). Finally, the total number of required hidden units for a kth degree block is

    Nh(k) = Σ_{j=1}^{N-1} ceil( Npj / Nj ) + 1        (4.30)

Several demonstrations comparing the two approaches are given later. From the experimental results, the group approach does show advantages in realizing the desired coefficients in the network, especially for difficult data.
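The counting of Eqs. (4.29) and (4.30) is illustrated by the short sketch below (Python, illustration only; the function name is hypothetical). Note that these are theoretical minimums; the unit counts actually used in the experiments of Section 4.2.5 may be larger.

    from math import comb, ceil

    def group_unit_count(N, k):
        """Group sizes Npj (Eq. (4.29)) and minimum hidden units Nh(k) (Eq. (4.30))."""
        sizes, units = [], []
        for j in range(1, N):              # groups 1 .. N-1; group N holds only x_N^k
            Nj  = N - j + 1
            Npj = comb(k + Nj - 1, k) - comb(k + Nj - 2, k)
            sizes.append(Npj)
            units.append(ceil(Npj / Nj))
        return sizes, sum(units) + 1       # the "+1" covers the last group

    print(group_unit_count(3, 3))   # sizes (6, 3) as in Eq. (4.27); minimum Nh(3) = 5
    print(group_unit_count(7, 2))   # one unit per group -> minimum Nh(2) = 7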
4.2.3 Conversion of Monomial Activations to Analytic Activations

Since the monomial activation is not bounded, the output easily saturates when the BP algorithm is used for training. Therefore, it is natural to find a method for converting compact monomial networks to compact analytic-activation networks. The conversion process begins with the units having the highest degree monomial activation (degree P). If there are nP monomial units with Pth degree monomial activation, each of them can be replaced by a sigmoid unit (Appendix B) with highest degree p(i) = P. Figure 4.11 shows the basic idea.

Figure 4.11 Conversion of a kth Degree Monomial Activation to a Sigmoid Activation.

This first substitution generates the Pth degree terms accurately but also generates extra terms of degrees (P-1), (P-2), .... However, these unwanted terms can be subtracted from the rest of the polynomial outputs; the coefficients for terms of degrees (P-1), (P-2), ... of the original output are recalculated. The conversion process then continues with the (P-1)th degree monomial activations, and so on, until the monomial activations of degree 2 are reached. From the above discussion, the conversion procedure simply multiplies the original set of weights by scaling factors and adds offsets, and does not change the number of hidden units required.
Here we show an example of converting a network with second degree monomial activations to sigmoid activations. After the replacement, the number of sigmoidal units is the same as the original number of squaring units. As shown in Figure 4.12, the Si are chosen such that the output of the sigmoidal squaring subnet is equivalent to that of the squaring unit. Consider the ith unit feeding the mth output unit. If there are N inputs feeding the ith unit, the weights and thresholds are compensated according to the following rules:

    θ(m) ← θ(m) - wo(i) ( S4 + S5 θ(i) )
    θ(i) ← S1 θ(i) + S2
    wx(m,j) ← wx(m,j) - S5 wx(i,j) wo(i) ,    j = 1, 2, ..., N
    wx(i,j) ← S1 wx(i,j) ,                    j = 1, 2, ..., N
    wo(i) ← S3 wo(i)        (4.31)

The result for the second degree case can be extended to higher degree functions without difficulty.
Figure 4.12 Replacing a Monomial Activation (2nd Degree) by a Sigmoid Activation.

4.2.4 Sparse Second Degree Compact Networks

Second degree polynomial functions have found wide use in pattern analysis and signal processing. The Bayes-Gaussian discriminant function [6]-[7] is popular because of its simple form and ease of design from training data. In this subsection, a special treatment of the second degree compact network is discussed. Rather than using the conjugate-gradient method to find a set of input weights for mapping a given second degree function, we derive closed forms for mapping the given coefficients to the network weights. We utilize the group approach for all xj product terms in the second degree block. For example, consider a second degree function f(x1,x2,x3),

    f(x1,x2,x3) = a1 x1^2 + a2 x2^2 + a3 x3^2 + a4 x1 x2 + a5 x1 x3 + a6 x2 x3        (5.1)

It requires at least six free parameters (weights) to map these six terms. A special kind of sparse connection of the compact network is shown in Figure 4.13 for this example. The desired coefficients can be realized by solving a set of equations step by step. From Figure 4.13, the outputs of the three units are
    wo(1) [ x1 wx(1,1) + x2 wx(1,2) + x3 wx(1,3) ]^2
    wo(2) [ x2 wx(2,2) + x3 wx(2,3) ]^2
    wo(3) [ x3 wx(3,3) ]^2        (5.2)

Taking the summation at the output unit and comparing the resulting output with f(x), 6 equations can be generated, in the following computational order:

    wo(1) wx^2(1,1) = a1
    2 wo(1) wx(1,1) wx(1,2) = a4
    2 wo(1) wx(1,1) wx(1,3) = a5
    wo(1) wx^2(1,2) + wo(2) wx^2(2,2) = a2
    2 wo(1) wx(1,2) wx(1,3) + 2 wo(2) wx(2,2) wx(2,3) = a6
    wo(1) wx^2(1,3) + wo(2) wx^2(2,3) + wo(3) wx^2(3,3) = a3        (5.3)

As long as a1, a2 and a3 are not all zero, the input weights associated with each unit can be solved for sequentially by

    wx(1,1) = sqrt( a1 / wo(1) ) ,    wx(1,2) = a4 / [ 2 wo(1) wx(1,1) ] ,    wx(1,3) = a5 / [ 2 wo(1) wx(1,1) ]        (5.4)

    wx(2,2) = sqrt( [ a2 - wo(1) wx^2(1,2) ] / wo(2) ) ,    wx(2,3) = [ a6 - 2 wo(1) wx(1,2) wx(1,3) ] / [ 2 wo(2) wx(2,2) ]        (5.5)

    wx(3,3) = sqrt( [ a3 - wo(1) wx^2(1,3) - wo(2) wx^2(2,3) ] / wo(3) )        (5.6)

Note that wo(1), wo(2) and wo(3) are chosen, in the specified order, to make the square-root arguments in Eqs. (5.4) to (5.6) positive. These results are easily extended to quadratic functions with N inputs.

Figure 4.13 Efficient Compact Mapping of a Second Degree Function.
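The closed-form mapping above can be checked numerically. The sketch below (Python, illustration only; the coefficient values are arbitrary) builds the weights from Eqs. (5.4)-(5.6) and verifies that the three-unit network reproduces the quadratic exactly:

    import numpy as np

    def map_quadratic_3input(a, wo=(1.0, 1.0, 1.0)):
        """Input weights of Eqs. (5.4)-(5.6); a = (a1,...,a6) as in Eq. (5.1)."""
        a1, a2, a3, a4, a5, a6 = a
        w1, w2, w3 = wo
        wx = np.zeros((3, 3))
        wx[0, 0] = np.sqrt(a1 / w1)
        wx[0, 1] = a4 / (2 * w1 * wx[0, 0])
        wx[0, 2] = a5 / (2 * w1 * wx[0, 0])
        wx[1, 1] = np.sqrt((a2 - w1 * wx[0, 1] ** 2) / w2)
        wx[1, 2] = (a6 - 2 * w1 * wx[0, 1] * wx[0, 2]) / (2 * w2 * wx[1, 1])
        wx[2, 2] = np.sqrt((a3 - w1 * wx[0, 2] ** 2 - w2 * wx[1, 2] ** 2) / w3)
        return wx

    def network_output(x, wx, wo=(1.0, 1.0, 1.0)):
        u1 = wo[0] * (x @ wx[0]) ** 2                          # unit 1 sees x1, x2, x3
        u2 = wo[1] * (x[1] * wx[1, 1] + x[2] * wx[1, 2]) ** 2  # unit 2 sees x2, x3
        u3 = wo[2] * (x[2] * wx[2, 2]) ** 2                    # unit 3 sees x3 only
        return u1 + u2 + u3

    a  = (2.0, 3.0, 4.0, 0.5, -0.4, 0.8)
    wx = map_quadratic_3input(a)
    x  = np.array([0.3, -1.2, 0.7])
    f_true = (a[0]*x[0]**2 + a[1]*x[1]**2 + a[2]*x[2]**2
              + a[3]*x[0]*x[1] + a[4]*x[0]*x[2] + a[5]*x[1]*x[2])
    print(np.isclose(network_output(x, wx), f_true))   # True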
4.2.5 Experimental Results

Our algorithms for designing the compact network have been tested on several second and higher degree functions.

First, an example comparing the classification performance of the second degree compact network and the Gaussian classifier is presented. The training data for both approaches consist of the CHEF, FMDF, RWEF, LPTF and RDF feature sets; each has 16 inputs and 4 output classes (ellipse, triangle, quadrilateral, and pentagon), and each class has 200 patterns (see Appendix C for details). The classification error percentages for both the Gaussian classifier and the sparse-connection compact network are listed in Table 4.4.

Table 4.4 Classification Error Percentages for the Gaussian Classifier and the Second Degree Compact Network.

    Feature Data    Gaussian Classifier    Sparse-Connection Compact Network
    CHEF            3.00 %                 3.25 %
    FMDF            0.375 %                0.375 %
    RWEF            2.875 %                2.75 %
    LPTF            1.875 %                1.875 %
    RDF             0.25 %                 0.25 %

In Table 4.4, the misclassification percentages for the Gaussian classifier and the compact network are almost the same. The sparse-connection second degree compact network thus provides another method for designing the Gaussian classifier. To apply the BP algorithm, the monomial units are replaced with sigmoid units. Finally, the equivalent network is compared with the same network initialized with random weights. In Figure 4.14, the training results for the RWEF data are shown after 20 iterations. It is obvious that the network with mapped weights outperforms the same network with random weights.
The second example maps a set of functions to the network using the group and block approaches, respectively. The first function has 7 inputs and second degree terms only. The second has 5 inputs and third degree terms only, and the third has 3 inputs and fourth degree terms only. The desired coefficients for each case have Gaussian distributions, with the means and standard deviations shown in Table 4.5.

Figure 4.14 Training Results of a Compact Network with RWEF Data.

Table 4.5 Mean and Standard Deviation of the Coefficients of the Different Functions.

                           Mean    Standard Deviation
    7-Input, 2nd Degree     10            5
    5-Input, 3rd Degree      5            4
    3-Input, 4th Degree      2            1.5

The iterative conjugate-gradient method is applied to find a set of input weights such that the relative mean square error between the network output coefficients and the desired coefficients is below 1 x 10^-5. The training results using the conjugate-gradient method for the 2nd degree function are displayed in Figure 4.15 through Figure 4.21. Since there are so many 2nd and 3rd degree terms (28 and 35), we only show the mapping results for the 4th degree terms, in Table 4.6.

Figure 4.15 Group Approach for Realizing All x1 Terms (Second Degree Case).
Table 4.6 Results for Mapping a 4th Degree Function to Compact Networks, Using the Group and Block Approaches.

    4th-Degree Terms    Desired Coeff.    Group        Block
    x1^4                 1.784729          1.784731     1.784729
    x1^3 x2              2.064153          2.064143     2.064159
    x1^3 x3             -3.872893         -3.872891    -3.872889
    x1^2 x2^2            1.847609          1.847592     1.847579
    x1^2 x2 x3           1.177845          1.177817     1.177823
    x1^2 x3^2           -0.817769         -0.817781    -0.817769
    x1 x2^3              2.289803          2.289832     2.289833
    x1 x2^2 x3           0.6076442         0.6076502    0.6076581
    x1 x2 x3^2           1.089452          1.089462     1.089451
    x1 x3^3              3.560644          3.560627     3.560652
    x2^4                 2.269974          2.270015     2.269949
    x2^3 x3             -2.443908         -2.443928    -2.443920
    x2^2 x3^2            3.450646          3.450626     3.450643
    x2 x3^3              0.5177643         0.5177612    0.5177647
    x3^4                 2.423806          2.423808     2.423806

Figure 4.16 Group Approach for Realizing All x2 Terms (Second Degree Case).
Figure 4.17 Group Approach for Realizing All x3 Terms (Second Degree Case).
Figure 4.18 Group Approach for Realizing All x4 Terms (Second Degree Case).
Figure 4.19 Group Approach for Realizing All x5 Terms (Second Degree Case).
We also estimated the theoretical UB and LB values and compared them to the experimental results. As shown in Table 4.7, the required number of hidden units is always bounded between LB and UB.

Figure 4.20 Group Approach for Realizing All x6 Terms (Second Degree Case).
Figure 4.21 Block Approach for Realizing All Terms of a Second Degree Function.
The last example maps a 2-dimensional non-integer degree function to the network, using the block approach. According to the multidimensional Lagrange interpolation formula [18], the 2-dimensional function 1/(x1 x2) can be expressed as a polynomial with integer degrees,

    f(x) = 1/(x1 x2)
         ≈ 4.694 - 3.25 x1 - 3.25 x2                         (constant and 1st degree block)
           + 0.722 x1^2 + 2.25 x1 x2 + 0.722 x2^2            (2nd degree block)
           - 0.5 x1^2 x2 - 0.5 x1 x2^2                       (3rd degree block)
           + 0.111 x1^2 x2^2                                 (4th degree block)        (5.7)

for x1 in [1,2] and x2 in [1,2].

Table 4.7 Comparison of the Required Hidden Units with the Theoretical Results.

                           LB    Group Approach    Block Approach    UB
    7-Input, 2nd Degree     4           7                 7          28
    5-Input, 3rd Degree     7          16                 8          35
    3-Input, 4th Degree     5          11                15          15

The approximating function is then realized in the compact network by dividing the function into 4 blocks. There are 2 units for the 2nd degree block, 3 units for the 3rd degree block and 3 units for the 4th degree block. In total, 8 hidden units (with monomial activations) approximate the function well. Note that this number is greater than LB = 6 and less than UB = 12. In Table 4.8, we show the desired coefficients and the mapping results for comparison.

From the examples we have demonstrated, our algorithms successfully realize given functions in compact networks and find the number of units required to perform the mapping.
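A quick numerical check of the polynomial in Eq. (5.7) is given below (Python, illustration only). The signs used follow the reading adopted above, which matches the Lagrange interpolant of 1/x at the nodes 1, 1.5 and 2 in each variable:

    import numpy as np

    def poly(x1, x2):
        return (4.694 - 3.25*x1 - 3.25*x2
                + 0.722*x1**2 + 2.25*x1*x2 + 0.722*x2**2
                - 0.5*x1**2*x2 - 0.5*x1*x2**2
                + 0.111*x1**2*x2**2)

    g = np.linspace(1.0, 2.0, 21)
    X1, X2 = np.meshgrid(g, g)
    err = np.abs(poly(X1, X2) - 1.0/(X1*X2))
    print(err.max())    # maximum error on the grid is on the order of 1e-2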
4.2.6 Summary

In this chapter, we developed two kinds of MLP networks based on the PBF model by observing the rank of the C matrix. If the MLP network has a C matrix with rank L and no redundant units, the result is a complete network, because each term of the desired function is represented by exactly one PBF (one hidden unit). Once the complete network is constructed, any additional function can be realized in the same network without adding extra hidden units. If the C matrix has rank less than L, the network is called a compact network, because each PBF represents several terms of the desired function, giving a more compact network topology.
Table 4.8 Using the Block Approach to Approximate the Function 1/(x1 x2).

    2nd-Degree Terms    Desired Coeff.    Mapping Results
    x1^2                0.7222            0.7222
    x1 x2               2.25              2.249
    x2^2                0.7222            0.7222

    3rd-Degree Terms    Desired Coeff.    Mapping Results
    x1^3                0.0              -1.729 x 10^-6
    x1^2 x2            -0.5              -0.499
    x1 x2^2            -0.5              -0.500
    x2^3                0.0              -1.490 x 10^-8

    4th-Degree Terms    Desired Coeff.    Mapping Results
    x1^4                0.0               2.691 x 10^-4
    x1^3 x2             0.0               8.211 x 10^-5
    x1^2 x2^2           0.111             0.111
    x1 x2^3             0.0              -6.373 x 10^-5
    x2^4                0.0               2.980 x 10^-8

CHAPTER 5

INVERSE MAPPINGS

In this chapter, a straightforward method is described for generating a polynomial basis function model of a trained MLP neural network. First, we model each hidden unit activation as a polynomial function of its net input. Then a method is developed for finding the PBF for each hidden unit. The output polynomial discriminant can be calculated from the PBFs. Preliminary methods for eliminating linearly dependent PBFs are given. The processing sequence is illustrated in Figure 5.1.

Figure 5.1 Block Diagram of Inverse Mappings.
5.1 Polynomial Network Models

In this section we develop two different PBF models of the MLP network. Given the network weights, topology, and the training data, the first step is to model each hidden unit activation as a polynomial function of its net input. As before, Xnet^(k)(i) and φ^(k)(i) denote the net input and activation output, respectively, of the ith unit for the kth training pattern. From the analyticity of the activation, we can model each hidden unit's output as a power series of finite integer degree p(i) in the variable Xnet^(k)(i). This leads to

    φ^(k)(i) ≈ Σ_{n=0}^{p(i)} A(i,n) [ Xnet^(k)(i) - X0(i) ]^n = Σ_{n=0}^{p(i)} D(i,n) [ Xnet^(k)(i) ]^n        (5.1)

and the network output is

    f(x) = Σ_{i=1}^{Nu} wo(i) φ^(k)(i)        (5.2)

In Eq. (5.1), X0(i) is defined as the total net input of the ith unit divided by the total number of training vectors Nv, and the D(i,n) are the coefficients after the polynomial expansion. This process continues until every hidden unit's output has the same form as Eq. (5.1). This is referred to as the condensed network model.

In the condensed model of Eq. (5.1), each PBF is a composition of polynomials. The exhaustive PBF model is obtained by multiplying out these compositions for each hidden unit. Then each PBF can be expressed as an inner product of a coefficient vector with X, i.e.

    φ^(k)(i) ≈ C(i) X        (5.3)

C(i) is a function of the network weights and thresholds of the hidden units. Although the exhaustive network model is not easily obtained, especially when the unit degree p(i) or the number of network inputs, N, is large, it provides a compact form from which to derive polynomial approximations of the network. Applications of both models are discussed in detail in the following sections. First, we calculate the coefficients D(i,n) in Eq. (5.1) for the condensed network model.
5.2 Calculation of the Condensed Network Model

There are several methods for determining the power series coefficients D(i,n) in Eq. (5.1). We have chosen a method which is optimal in a least-mean-squares sense. Given the desired degree p(i) of the polynomial approximating the activation function, we minimize the mean square error

    E(p(i)) = Σ_{k=1}^{Nv} [ φ^(k)(i) - Σ_{n=0}^{p(i)} D(i,n) ( Xnet^(k)(i) )^n ]^2        (5.4)

with respect to the D(i,n) coefficients, over the Nv training patterns. Setting the derivatives to zero gives a linear system for the D(i,n):

    Σ_k 1 D(i,0)                   + Σ_k Xnet^(k) D(i,1)           + ... + Σ_k (Xnet^(k))^{p(i)} D(i,p(i))      = Σ_k φ^(k)
    Σ_k Xnet^(k) D(i,0)            + Σ_k (Xnet^(k))^2 D(i,1)       + ... + Σ_k (Xnet^(k))^{p(i)+1} D(i,p(i))    = Σ_k φ^(k) Xnet^(k)
        ...
    Σ_k (Xnet^(k))^{p(i)} D(i,0)   + ...                                 + Σ_k (Xnet^(k))^{2p(i)} D(i,p(i))     = Σ_k φ^(k) (Xnet^(k))^{p(i)}        (5.5)

It can be shown [54] that the resulting (p(i)+1) x (p(i)+1) square matrix is nonsingular as long as the Xnet^(k)(i) are distinct.
It is also desirable to have an algorithm for choosing the polynomial degree of each hidden unit's activation function automatically. We propose to measure the relative mean square error, R(p(i)), over all training vectors and use it to determine the unit degree p(i). The relative MSE is defined as E(p(i)) divided by the sample variance of the ith hidden unit's output, measured over all training patterns, i.e.

    R(p(i)) = Σ_{k=1}^{Nv} [ φ^(k)(i) - Σ_{n=0}^{p(i)} D(i,n) ( Xnet^(k)(i) )^n ]^2 / Σ_{k=1}^{Nv} [ φ^(k)(i) - <φ(i)> ]^2        (5.6)

where <φ(i)> denotes the average output activation of the ith unit. For each hidden unit, the corresponding R(p(i)) is a measure of the approximation accuracy. For a user-chosen threshold T, which represents a desired or maximum acceptable value of R(p(i)), the degree p(i) is increased until the condition R(p(i)) <= T is satisfied. Figure 5.2 gives the details. It is obvious that smaller values of T correspond to more accurate approximations. Also, the degree of each unit reveals the relative importance of the unit in the approximation.
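A compact sketch of the fit of Eq. (5.5) and the degree-selection rule of Eq. (5.6) is given below (Python, illustration only; the sampled net-input range and threshold are arbitrary assumptions):

    import numpy as np

    def fit_activation_poly(x_net, phi, degree):
        """Least-squares coefficients D(i,0..degree), i.e. the solution of Eq. (5.5)."""
        V = np.vander(x_net, degree + 1, increasing=True)   # columns 1, x, x^2, ...
        D, *_ = np.linalg.lstsq(V, phi, rcond=None)
        return D

    def relative_mse(x_net, phi, D):
        """R(p) of Eq. (5.6): fit error divided by the sample variance of phi."""
        V = np.vander(x_net, len(D), increasing=True)
        resid = phi - V @ D
        return np.sum(resid**2) / np.sum((phi - phi.mean())**2)

    def choose_degree(x_net, phi, T=1e-5, p_max=5):
        """Increase the unit degree until R(p) <= T, as in Figure 5.2."""
        for p in range(1, p_max + 1):
            D = fit_activation_poly(x_net, phi, p)
            if relative_mse(x_net, phi, D) <= T:
                return p, D
        return p_max, D

    # demo: a sigmoid sampled over a narrow net-input range is nearly quadratic
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
    x_net = np.linspace(1.0, 1.9, 200)
    p, D = choose_degree(x_net, sigmoid(x_net), T=1e-5)
    print(p, np.round(D, 5))   # selects a low degree (p = 2) for this range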
We present two examples to show the effectiveness of our approach for modelling the hidden units. As the first example, consider the exclusive-or MLP network having two inputs, one hidden unit, and one output unit. The network was trained by the BP algorithm with sigmoid activations until the MSE (mean square error) was below 0.00574, with no misclassifications. Its hidden unit's output was then approximated by a p(i)-degree polynomial for p(i) = 1, 2, and 3. The truth table of the exclusive-or problem and the MSE, R(p(i)) and error % of these approximations are shown in Table 5.1 and Table 5.2, respectively. A second example, approximating a parity-check MLP network, is given in Table 5.3 and Table 5.4. In both cases, T is set to 1 x 10^-8.

Figure 5.2 Deciding the Unit Degree p(i) for the ith Unit.

Table 5.1 Truth Table of the Exclusive-OR Problem.

    x1   x2   Desired Output   Class #
    0    0         0              1
    0    1         1              2
    1    0         1              2
    1    1         0              1

It is clear from Table 5.2 and Table 5.4 that the least squares approach works well for a second or higher degree polynomial. As seen in the tables, we can determine the hidden unit's degree by observing the behavior of the MSE or the relative MSE as the degree is changed. In general, high degree units are more important than units with low degree.

Table 5.2 Approximation of the Exclusive-OR MLP Network (Layer Structure : 2-1-1).

    Degree p(i)    MSE         R(p(i))           Error %
    1              0.502305    0.3152            25.00 %
    2              0.005742    5.178 x 10^-9     0.000 %
    3              0.005742    6.745 x 10^-10    0.000 %

Table 5.3 Truth Table of the Parity-Check Problem.

    x1   x2   x3   Desired Output   Class #
    0    0    0         0              1
    0    0    1         1              2
    0    1    0         1              2
    0    1    1         0              1
    1    0    0         1              2
    1    0    1         0              1
    1    1    0         0              1
    1    1    1         1              2

Table 5.4 Approximation of the Parity-Check MLP Network (Layer Structure : 3-10-1).

                            MSE        Error %
    Sigmoid Activation      0.008988   0.000 %
    Maximum Degree = 1      1.00405    62.50 %
    Maximum Degree = 2      1.00413    62.50 %
    Maximum Degree = 3      0.009419   0.000 %
5.3 Network Pruning Using the Condensed Network Model

We note that a trained MLP network does not necessarily make effective use of all of its hidden units. Pruning useless units simplifies the network topology without affecting the performance, and it reduces the computational complexity. Existing techniques [55]-[56] start the network with a large size and then remove hidden units that have little effect on the classification error. The units are ordered according to the amount by which the classification error changes when the unit is removed; the unit with the smallest effect is removed, or pruned. These methods are not optimal in that, for a given classification error, a smaller network than that found by these algorithms may exist. In this section, we use the condensed network model to detect the presence of linearly dependent basis vectors by observing the degree of each PBF, which is a polynomial in its unit's net input. If a PBF has low degree, such as 0 or 1, the unit is useless. If a PBF has a unique degree term, then the unit is important. In the following, we state the relevant properties of the condensed network model, followed by examples.

Property 5.1 : Units whose PBFs have degree 0 or 1 are linearly dependent.

This can be shown as follows. According to Eq. (5.1), the activation output of the ith unit with degree p(i) = 1 is

    φ(i) = D(i,0) + D(i,1) Xnet(i)        (5.7)

The ith unit can be eliminated without changing the network output. If the mth unit feeds the ith unit and the ith unit feeds the jth unit (Figure 5.3), then the weights feeding the jth unit and the threshold θ(j) of the jth unit are compensated as

    θ(j) ← θ(j) + w(j,i) [ D(i,0) + D(i,1) θ(i) ]        (5.8)
    w(j,m) ← w(j,m) + D(i,1) w(i,m) w(j,i)        (5.9)

For the case of a constant output, where p(i) = 0, the threshold of the jth unit, which the ith unit feeds, is updated as

    θ(j) ← θ(j) + w(j,i) D(i,0)        (5.10)

and unit i can then be removed.
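The compensation rules of Eqs. (5.8)-(5.9) are simple enough to verify in a few lines (Python, illustration only; the weight values are arbitrary):

    import numpy as np

    def absorb_linear_unit(D0, D1, w_in, theta_i, w_next, theta_next):
        """Fold a degree-1 hidden unit (phi ~ D0 + D1*Xnet) into the next layer.
        w_in[m] is w(i,m); w_next[j] is the weight w(j,i) to each downstream unit j."""
        for j in range(len(w_next)):
            theta_next[j] += w_next[j] * (D0 + D1 * theta_i)   # Eq. (5.8)
        w_direct = np.outer(w_next, D1 * w_in)                 # Eq. (5.9): increments to w(j,m)
        return w_direct, theta_next

    # tiny demo: one linear hidden unit between two single-unit "layers"
    w_in, theta_i = np.array([0.4, -0.3]), 0.2
    D0, D1 = 0.05, 0.9
    w_next, theta_next = np.array([1.5]), np.array([0.1])
    w_direct, theta_next = absorb_linear_unit(D0, D1, w_in, theta_i, w_next, theta_next)

    x = np.array([0.7, -1.1])
    before = 1.5 * (D0 + D1 * (w_in @ x + theta_i)) + 0.1   # contribution with unit i present
    after  = w_direct[0] @ x + theta_next[0]                # contribution after removal
    print(np.isclose(before, after))   # True: the unit can be removed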
In the following example, we apply Property 5.1 to detect the useless units in an existing network. Our results are also compared with Karnin's method [22], which estimates the sensitivity of the error function to the exclusion of each connection. The training data we generated had 4 input features and 2 classes. There were 200 input vectors for each class, drawn from a joint Gaussian probability density; the training vectors of the two classes had the same covariance matrix, so the optimal discriminant function is a first-degree polynomial [6]. An MLP neural network having the 4-5-2 structure, with no direct connections from input to output (Figure 5.4), was trained using this data set. After 100 iterations, the mean square error was reduced to 0.003436. We applied the methods developed in Section 5.2, using a maximum degree of p(i) = 2 for each unit. However, for T = 0.005 the unit degree of every hidden unit was automatically reduced to 1, while still closely approximating the original analytic activation. Thus all hidden units can be removed and replaced by direct connections, using Eq. (5.8) and Eq. (5.9). Finally, the network structure becomes 4-2 (only the input and output layers remain). Table 5.5 lists the weights shown in Figure 5.5.

Figure 5.3 The Shaded Unit i is Ready to be Removed.
Figure 5.4 Pattern Classifier Network.
Figure 5.5 The Network After Pruning.

Table 5.5 Weights in Figure 5.5.

    w(i,j)      1        2       3       4
    G        -12.42    -0.67    5.83    4.38
    H         12.42     0.70   -5.90   -4.36

For comparison, Karnin's pruning method was also applied to the same network. In his approach, weight sensitivities between layers are calculated as

    Sij = Σ_{k=1}^{NIT} [ Δw^(k)(i,j) ]^2 w^(F)(i,j) / [ η ( w^(F)(i,j) - w^(I)(i,j) ) ]        (5.11)

The superscripts (I) and (F) denote the initial and final connection weights (before and after the training process) between units i and j, NIT is the total number of iterations, and η is the learning factor. The sensitivities we obtained are listed in Table 5.6 and Table 5.7.
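For reference, Eq. (5.11) can be evaluated directly from the weight increments logged during BP training, as in the sketch below (Python, illustration only; the logged increments shown are hypothetical):

    import numpy as np

    def karnin_sensitivity(delta_w, w_initial, w_final, eta):
        """Karnin's sensitivity estimate, Eq. (5.11), for a single connection."""
        delta_w = np.asarray(delta_w)
        return np.sum(delta_w**2) * w_final / (eta * (w_final - w_initial) + 1e-30)

    increments = [0.05, 0.02, -0.01, 0.015]     # hypothetical per-iteration updates
    w0 = 0.10
    wF = w0 + sum(increments)
    print(karnin_sensitivity(increments, w0, wF, eta=0.1))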
From Table 5.6 and Table 5.7, we see that the input and output weights of hidden units c and e have smaller sensitivities (refer to Figure 5.6). Thus, hidden units c and e can be removed, and the final network would have topology 4-3-2. Comparing Figure 5.5 with Figure 5.6, it is obvious that our pruning method results in a more compact network whenever the original network is effectively linear.

Table 5.6 Input Weight Sensitivities in Figure 5.4.

    Sij      1         2         3         4
    a      0.0235    0.0189    0.0191    0.0246
    b      0.0091   -0.0218    0.0037    0.0060
    c      0.0005    0.0009   -0.0001    0.0008
    d      0.0248    0.0139    0.0165    0.0160
    e      0.0003   -0.0009    0.0001    0.0000

Table 5.7 Output Weight Sensitivities in Figure 5.4.

    Sij      a        b        c        d        e
    G      0.487    0.496    0.213    0.167    0.162
    H      0.522    0.402    0.102    0.150    0.174
Property 5.2 : Units whose outputs across the training set mimic the outputs of other units can be removed.

To illustrate, if one unit m gives approximately the same output as unit n for the entire training set, it can be discarded. To remove unit m without changing the solution, at each unit j on the next layer, w(j,n) is updated as

    w(j,n) ← w(j,n) + w(j,m)        (5.12)

Figure 5.6 Results of Karnin's Analysis (Shaded Units and Dark Lines are Candidates for Removal).

Property 5.3 : PBFs which have unique degree terms are linearly independent of the other PBFs.

For example, if one PBF has a 5th degree term but all others are of 4th degree or less, the 5th degree PBF is linearly independent of the others. Property 5.3 can considerably simplify our analysis of the basis functions.

Property 5.4 : If a subset of the coefficient matrix, of dimension Nh by Ls with Nh <= Ls < L, has linearly independent rows, then the complete coefficient matrix has linearly independent rows.

In other words, the linear independence of the rows of the coefficient matrix can be investigated by examining a very small number of its columns. Although this is only a sufficient condition for detecting the linear independence of the basis functions, it is extremely useful when Nh is much smaller than L.
5.4 Calculation of the Exhaustive Network Model

The condensed form of the PBF gives a simple way to examine the unit degree as a function of its net input. However, it is sometimes useful to find the exhaustive network model. The first step is to find the network output degree P. If there are H hidden layers in the network, then

    P = Π_{J=1}^{H} p(J)        (5.13)

where p(J) denotes the maximum degree in the Jth hidden layer, i.e.

    p(J) = Max_{i ∈ Layer J} p(i)        (5.14)

Given the degree P, we can find L, the dimension of X, as in Eq. (2.13).

The next step after finding P and L is a method for evaluating the network PBFs. Rewrite the net input of the ith unit as an inner product,

    Xnet(i) = Σ_k C(k) X w(i,k) + θ(i) = d(i) X        (5.15)

Each element of the vector d(i) is the weighted sum of PBF coefficients from the previous layers. Using Eq. (5.15) in Eq. (5.1), we get

    φ(i) ≈ Σ_{n=0}^{p(i)} D(i,n) ( d(i) X )^n        (5.16)

Defining d(i,n) as the coefficient vector for Xnet^n(i), so that (d(i)X)^n = d(i,n)X, it is obvious that d(i,1) = d(i). Therefore,

    Xnet^2(i) = ( d(i) X )^2 = X^T d^T(i) d(i) X = d(i,2) X        (5.17)

By extension of Eq. (5.17), the nth degree term in Eq. (5.16) can be formed through polynomial multiplication as

    ( d(i) X )^n = X^T d^T(i,n-1) d(i,1) X = d(i,n) X        (5.18)

Finally, the output of each hidden unit can be written as an inner product of X with a coefficient vector,

    φ(i) ≈ [ Σ_{n=0}^{p(i)} D(i,n) d(i,n) ] X        (5.19)

In Eq. (5.19), the quantity inside the brackets is the ith row vector in Eq. (2.16), or C(i) in Eq. (2.12). Continuing the process until the last hidden layer is reached, the MLP network output can be expressed as a linear combination of PBFs, as in Eq. (2.11).
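The polynomial multiplication of Eqs. (5.16)-(5.19) can be sketched with a simple dictionary representation, where each PBF is a map from exponent tuples to coefficients (Python, illustration only; the sample net input and D(i,n) values are arbitrary):

    def poly_mul(p, q):
        r = {}
        for e1, c1 in p.items():
            for e2, c2 in q.items():
                e = tuple(a + b for a, b in zip(e1, e2))
                r[e] = r.get(e, 0.0) + c1 * c2
        return r

    def poly_axpy(acc, alpha, p):
        for e, c in p.items():
            acc[e] = acc.get(e, 0.0) + alpha * c
        return acc

    def unit_pbf(d_i, D_i):
        """phi(i) ~ sum_n D(i,n) * (d(i) . X)^n, with d(i) given over the monomial basis."""
        one = {tuple([0] * len(next(iter(d_i)))): 1.0}
        pbf, power = {}, one
        for Dn in D_i:
            pbf = poly_axpy(pbf, Dn, power)
            power = poly_mul(power, d_i)      # next power of the net input
        return pbf

    # demo, N = 2: net input 0.5 + x1 - 2*x2, quadratic activation model
    d_i = {(0, 0): 0.5, (1, 0): 1.0, (0, 1): -2.0}
    D_i = [0.1, 0.3, 0.05]                    # D(i,0), D(i,1), D(i,2)
    print(unit_pbf(d_i, D_i))                 # e.g. the x1*x2 coefficient is 0.05*2*1*(-2) = -0.2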
Before demonstrating design examples that find polynomial approximations of the MLP network, we outline the practical procedure in the following steps:

Step 1. Choose a proper network size (number of layers and number of hidden units per layer) and train the network with the BP algorithm until the MSE falls below some acceptable value.
Step 2. Approximate the activation output of each unit by a polynomial of some maximum finite degree.
Step 3. Choose the threshold value T for the relative MSE, R(p(i)). Decrease the degree of the polynomial for each unit as long as the condition R(p(i)) <= T remains satisfied.
Step 4. Remove units whose output degree is 0 or 1, using Property 5.1.
Step 5. Use the exhaustive network model to find the approximating polynomial for the network output.

Following these procedures, we illustrate several examples to test the effectiveness of our methods.
5.5 Experiments with the Exhaustive Network Model

In this section, the goal is to use the exhaustive network model to find polynomial approximations of the MLP network's output. Several design examples, such as MLP filters and MLP classifiers, are demonstrated.

5.5.1 MLP Neural Network Filter

One example demonstrates the design of a nonlinear filter. The example attempts to design a 3-input median filter network, having topology 3-10-1. The training data are uniformly distributed between 0 and 1, and the desired output is the median of the 3 inputs. In total, there are 1000 patterns. Again, the maximum allowed value for p(i) was 5. In Figure 5.7, we show the error % and the required hidden units for different thresholds. As shown in Figure 5.7, the model network deviates significantly from the sigmoid network until T <= 0.01, and the number of hidden units having degree greater than 1 increases. The value of the network degree P here was 5, and we can say that training has succeeded. Clearly, this median filter network, and all of its hidden units, are performing a nonlinear operation.

In Table 5.8, the degree of each unit is listed, and the coefficients of all 56 terms are shown in Table 5.9.

Table 5.8 Degree of Each Unit of the 3-Input Median Filter When T = 0.01.

    Unit     1   2   3   4   5   6   7   8   9   10
    Degree   2   2   2   2   2   3   5   3   3   2
5.5.2 MLP Neural Network Classifiers

Figure 5.7 3-Input Median Filter Network with Layer Structure 3-10-1 and Maximum Degree p(i) = 5.

Table 5.9 Output Coefficients of the Approximating Polynomial for the 3-Input Median Filter Network.

    Terms      Coeff.     Terms         Coeff.     Terms         Coeff.     Terms            Coeff.
    Const.     -0.014     x1^2 x3        1.230     x2^3 x3        0.008     x1^3 x2 x3        1.826
    x1          0.271     x1 x2 x3      -4.865     x1^2 x3^2     -0.002     x1^2 x2^2 x3     -5.283
    x2          0.457     x2^2 x3        2.305     x1 x2 x3^2     0.006     x1 x2^3 x3        6.794
    x3          0.301     x1 x3^2        1.203     x2^2 x3^2     -0.006     x2^4 x3          -3.277
    x1^2        0.010     x2 x3^2        0.173     x1 x3^3       -0.001     x1^3 x3^2        -0.440
    x1 x2       0.011     x3^3          -0.475     x2 x3^3        0.002     x1^2 x2 x3^2      2.548
    x2^2        0.022     x1^4          -0.000     x3^4          -0.000     x1 x2^2 x3^2     -4.915
    x1 x3      -0.042     x1^3 x2        0.002     x1^5          -0.051     x2^3 x3^2         3.160
    x2 x3      -0.044     x1^2 x2^2     -0.007     x1^4 x2        0.491     x1^2 x3^3        -0.409
    x3^2        0.021     x1 x2^3        0.009     x1^3 x2^2     -1.893     x1 x2 x3^3        1.580
    x1^3       -0.397     x2^4          -0.004     x1^2 x2^3      3.652     x2^2 x3^3        -1.524
    x1^2 x2     0.009     x1^3 x3       -0.001     x1 x2^4       -3.523     x1 x3^4          -0.190
    x1 x2^2     2.375     x1^2 x2 x3     0.007     x2^5           1.359     x2 x3^4           0.367
    x2^3       -1.558     x1 x2^2 x3    -0.013     x1^4 x3       -0.237     x3^5             -0.035

For the case of nonlinear classifiers, the two-input exclusive-or and three-input parity-check problems (from Section 5.2) are examined first. The threshold T chosen for these examples is 1.0 x 10^-8, to ensure the best approximation of the original activation function. For the exclusive-or network, the hidden unit is approximated by a polynomial of degree 2. The polynomial approximation, for the net function of the output unit, is

    f(x1,x2) = -2.74 + 0.91 x1 + 0.78 x2 + 4.65 x1^2 - 11.19 x1 x2 + 4.73 x2^2        (5.20)

Using the fact that xi^n = xi for binary inputs, we can rewrite Eq. (5.20) as

    f(x1,x2) = -2.74 + 5.56 x1 + 5.51 x2 - 11.19 x1 x2        (5.21)

It is obvious that f(x1,x2) < 0 when patterns are from class 1 and f(x1,x2) > 0 for patterns in class 2. For the parity-check problem, the BP algorithm is used on a network with 3-10-1 topology until the MSE <= 0.0089, and each hidden unit is modelled with degree p(i) = 3. Finally, the polynomial approximation for the net function (xi^n = xi) of the output unit is

    f(x1,x2,x3) = -2.764 + 5.825 x1 + 5.886 x2 + 5.736 x3 - 11.919 x1 x2 - 11.915 x1 x3 - 11.952 x2 x3 + 23.932 x1 x2 x3        (5.22)
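The signs in Eqs. (5.20)-(5.22) follow the reading adopted above; they can be checked against the truth tables of Tables 5.1 and 5.3 with a short sketch (Python, illustration only):

    from itertools import product

    f_xor    = lambda x1, x2:     -2.74  + 5.56*x1 + 5.51*x2 - 11.19*x1*x2
    f_parity = lambda x1, x2, x3: (-2.764 + 5.825*x1 + 5.886*x2 + 5.736*x3
                                   - 11.919*x1*x2 - 11.915*x1*x3 - 11.952*x2*x3
                                   + 23.932*x1*x2*x3)

    for x in product((0, 1), repeat=2):
        print(x, f_xor(*x) > 0, (x[0] ^ x[1]) == 1)     # sign of f matches the XOR class
    for x in product((0, 1), repeat=3):
        print(x, f_parity(*x) > 0, (sum(x) % 2) == 1)   # sign of f matches the parity class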
5.5.3 Experiments with Quadratic Discriminants

In this subsection, an example is shown in which quadratic discriminant functions are designed using the conventional approximate Bayesian technique and via second degree approximations to the MLP neural networks. The N-input second degree Gaussian discriminant for the ith class, Gi(x), and the quadratic approximation to the ith MLP discriminant, Qi(x), can be written respectively as

    Gi(x) = θi + Σ_{j=1}^{N} wi(j) xj + Σ_{k=1}^{N} Σ_{l=1}^{k} Wi(k,l) xk xl        (5.23)

    Qi(x) = θ'i + Σ_{j=1}^{N} w'i(j) xj + Σ_{k=1}^{N} Σ_{l=1}^{k} W'i(k,l) xk xl        (5.24)

The coefficients Wi( ), wi( ), θi in Eq. (5.23) are found from the covariance matrix and mean vector of each class [6]. The coefficients W'i( ), w'i( ), θ'i in Eq. (5.24) are found using the technique of the previous sections.

The training data used for both approaches are geometric shape feature data. The classification error percentages for the MLP network (with topology 16-20-4) and for each discriminant are shown in Table 5.10. It is obvious that Gi(x) and Qi(x) have quite similar classification performance.

Table 5.10 Error % for the Gaussian and Quadratic Discriminants and the MLP Network.

            Gaussian Discriminant Gi(x)    MLP Network    Quadratic Discriminant Qi(x)
    RWEF            2.50 %                   2.50 %              5.88 %
    CHEF            3.00 %                   1.63 %              2.88 %
    FMDF            0.25 %                   1.25 %              2.38 %

The Gaussian discriminant attempts to minimize the probability of error, whereas the design of the MLP network tries to minimize the mean square error between the network outputs and the desired values. Thus, the error percentages of the two differ slightly. From the table, the quadratic discriminant has higher error percentages than the sigmoid MLP network because the modelling polynomial has too low a degree.
In order to compare Gi(x) and Qi(x) further, we can measure the relative mean square error between the corresponding two-dimensional coefficient arrays of the Gaussian and quadratic discriminants. The Ei for each class is of the form

    Ei = Σ_{k=1}^{N} Σ_{l=1}^{k} [ Wi(k,l) - si W'i(k,l) ]^2 / Σ_{k=1}^{N} Σ_{l=1}^{k} Wi^2(k,l)        (5.25)

The constant si is found so as to minimize Ei. Another way to measure the similarity is to calculate the correlation coefficient Ri for each class [57]:

    Ri = Σ_{j=1}^{Nv} ( Gi(x^(j)) - Gi_bar ) ( Qi(x^(j)) - Qi_bar ) / sqrt( Σ_{j=1}^{Nv} ( Gi(x^(j)) - Gi_bar )^2  Σ_{j=1}^{Nv} ( Qi(x^(j)) - Qi_bar )^2 )        (5.26)

where

    Gi_bar = (1/Nv) Σ_{j=1}^{Nv} Gi(x^(j)) ,    Qi_bar = (1/Nv) Σ_{j=1}^{Nv} Qi(x^(j))        (5.27)

and x^(j) denotes the jth pattern. It is apparent that if Ei is small or Ri is large, the two discriminants are more similar. Table 5.11 gives the results of our analysis. In spite of the results of Wan [58], in which a Bayesian interpretation is given for MLP networks, we see that the discriminant functions from the two approaches can differ significantly. Finally, we show the learning capability of the quadratic discriminants of Eq. (5.24). As the results in Table 5.12 show, the training takes only 50 iterations, and both the mean square errors and the error percentages decrease.

Table 5.11 Analysis of Similarity Between the Gaussian and Quadratic Discriminants.

             Class 1         Class 2         Class 3         Class 4
             E1     R1       E2     R2       E3     R3       E4     R4
    RWEF    .449   .559     .999   .589     .601   .354     .390   .701
    CHEF    .229   .723     .442   .372     .691   .148     .399   .743
    FMDF    .880   .887     .599   .455     .616   .257     .886   .658

Table 5.12 Training of the Quadratic Discriminants.

    Shape Feature    Before Training    After Training
    RWEF             5.88 %             3.75 %
    CHEF             2.88 %             2.38 %
    FMDF             2.38 %             1.50 %
CHAPTER 6

CONCLUSIONS

In this dissertation, we presented a PBF model for the analysis and design of MLP neural networks. Applications of the PBF model were given for forward and inverse mappings, respectively. In the following, we summarize the major aspects of our work.

(1). The PBF model leads to approximation theorems for the MLP networks. A constructive proof for realizing each term of the desired function was shown which utilizes the PBFs, and the required number of hidden units was determined.

(2). The PBF model leads to straightforward mappings between MLP networks and conventional filtering and classification algorithms. Given an N-dimensional finite degree polynomial function, two different kinds of networks, complete and compact networks, were developed. The upper bound (UB) and lower bound (LB) on the number of required hidden units were also derived. The forward mapping allows us to determine the required network topology for a given task. The network can then be improved through BP learning.

(3). Given a trained MLP neural network, both condensed and exhaustive PBF models can be found. In the condensed network model, the PBF for each unit is a function of its net input. The condensed network model is useful for determining the network degree and for network pruning. The pruning methods based on this model are shown to be more efficient than existing techniques. The exhaustive network model, which is a polynomial discriminant function, is found by multiplying out the condensed model.
APPENDIX A

BACK-PROPAGATION LEARNING ALGORITHM

MLP networks are most often designed using the Back-Propagation (BP) learning rule or its variants [19]. Basically, the BP algorithm is a gradient descent technique. Its objective is to adjust the network weights so that application of a set of inputs produces the desired set of outputs. Learning in the network is equivalent to minimizing the sum of the squared errors between the desired and actual network outputs with respect to these weights. Each input vector is paired with a target vector, T^(p)(i), representing the desired output of the ith unit for pattern p. The total mean square error, E(W), at the outputs of the network is

    E(W) = (1/2) Σ_{p=1}^{Nv} Σ_{i=1}^{Nc} [ T^(p)(i) - O^(p)(i) ]^2        (A.1)

where O^(p)(i) is the actual output of the ith output unit and the summation is performed over all Nc output units and Nv patterns.

Before starting the training process, all of the weights must be initialized to small random values; large values could saturate the network. For each training pattern p, the direction of steepest descent in parameter space is determined by the partial derivative of E(W) with respect to each weight (or threshold),

    Δw^(p)(i,j) ∼ -∂E(W) / ∂w^(p)(i,j)        (A.2)

Then the weights are updated as

    w^(p)(i,j) = w^(p-1)(i,j) + Δw^(p)(i,j)        (A.3)

In general, Eq. (A.2) is rewritten as

    Δw^(p)(i,j) = η δ^(p)(i) O^(p)(j)        (A.4)

η is called the learning rate, and δ^(p)(i), which propagates the error signals backward through the network, is defined as

    δ^(p)(i) = -∂E(W) / ∂Xnet^(p)(i)        (A.5)

Essentially, the determination of δ^(p)(i) is a recursive process which starts at the output layer and works backwards to the first hidden layer (refer to Figure A.1). The δ^(p)(i) is given by

    δ^(p)(i) = G'(Xnet^(p)(i)) ( T^(p)(i) - O^(p)(i) ) ,              for unit i in the output layer
    δ^(p)(i) = G'(Xnet^(p)(i)) Σ_k δ^(p)(k) w^(p)(k,i) ,              for unit i in the hidden layers        (A.6)

and

    G'(Xnet^(p)(i)) = ∂φ(i) / ∂Xnet^(p)(i)        (A.7)

Finally, all the weights are updated according to Equations (A.3) and (A.4).

In summary, the BP algorithm for training MLP networks is as follows:

Step 1. Initialize the weights and thresholds between all units.
Step 2. Present an input and the desired outputs.
Step 3. Calculate the actual outputs.
Step 4. Adapt the weights backward.
Step 5. Repeat by going to Step 2.

Figure A.1 Backpropagation of the Error Signals from the Output Layer.
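A minimal batch-mode sketch of the algorithm is given below (Python, illustration only). The dissertation's networks use pattern-mode updates and different topologies; this is merely a compact illustration of Eqs. (A.1)-(A.7) with sigmoid activations and thresholds folded into bias inputs.

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def train_bp(X, T, n_hidden=4, eta=0.5, n_epochs=5000, seed=0):
        rng = np.random.default_rng(seed)
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])            # bias column
        W1 = rng.normal(0, 0.3, (Xb.shape[1], n_hidden))          # input -> hidden
        W2 = rng.normal(0, 0.3, (n_hidden + 1, T.shape[1]))       # hidden -> output
        for _ in range(n_epochs):
            H  = sigmoid(Xb @ W1)
            Hb = np.hstack([H, np.ones((H.shape[0], 1))])
            O  = sigmoid(Hb @ W2)
            delta_o = (T - O) * O * (1 - O)                        # Eq. (A.6), output layer
            delta_h = (delta_o @ W2[:-1].T) * H * (1 - H)          # Eq. (A.6), hidden layer
            W2 += eta * Hb.T @ delta_o                             # Eqs. (A.3)-(A.4)
            W1 += eta * Xb.T @ delta_h
        return W1, W2, 0.5 * np.sum((T - O) ** 2)                  # Eq. (A.1)

    # exclusive-or data, as in Table 5.1
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)
    W1, W2, mse = train_bp(X, T)
    print(mse)   # usually near zero; convergence depends on the initial weights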
APPENDIX B

REALIZATION OF MONOMIAL AND TWO-INPUT PRODUCT SUBNETS

In this Appendix, methods to realize monomial and two-input product subnets are presented. An iterative method for finding the expansion point of the second degree Taylor series is discussed first, and the accuracy of the truncated Taylor series is then reviewed. Example squaring and product subnets are also presented.

B.1 Finding X0 for the Second Degree Taylor Series

Our goal here is to closely approximate the sigmoid activation output of Eq. (1.12) by a power series with p(i) = 2. The method is quite straightforward. As a first step, we need to decide on a specific point of expansion, X0(i), and a range of convergence, M(i), for the ith hidden unit. Since the sigmoid function is differentiable, we can find the (p(i)+1)th derivative of the sigmoid activation and estimate an upper bound on the remainder term Rp(i)+1( ) and the radius of convergence on both sides of X0(i), according to the following formula [59]:

    Rp(i)+1(Xnet(i)) = G^(p(i)+1)(ξ) ( Xnet(i) - X0(i) )^{p(i)+1} / (p(i)+1)!        (B.1)

G^(p(i)+1)(ξ) is the (p(i)+1)th derivative of the sigmoid activation function, and ξ lies somewhere between Xnet(i) and X0(i). The choice of the expansion point is crucial. It must ensure that (1) φ(i) can be approximately expressed by the first three terms of the Taylor series, and (2) the ratio of the maximum remainder to the third term of the Taylor series is as small as possible; that is, the truncation error is bounded explicitly. Another important factor in the Taylor series approximation is the radius of convergence, chosen such that |Xnet(i) - X0(i)| <= M(i). This constraint helps us decide the initial weights and threshold of each hidden unit. We suggest an iterative method for finding X0(i) and M(i). Initially, we have a rough estimate that X0(i) is about 1.5 in this case. Thus we can start with an initial guess for X0(i) in [0.8, 1.8] and use numerical analysis to get the best operating point. The iterative process at the kth step is:

(1). X0^(k)(i) ← X0^(k)(i) + ΔX0(i), where ΔX0(i) = 0.05.
(2). Taking ΔM^(k)_right(i) to be 1, find the remainder term using Eq. (B.1). Calculate the ratio of Rp(i)+1( ) to the term of degree p(i). If the ratio is less than some typical value (say δ = 0.0001), then the right-side radius of convergence at the kth step is M^(k)_right(i) ← ΔM^(k)_right(i). Otherwise, decrease ΔM^(k)_right(i) by δ and repeat (2).
(3). Repeat (2) for the left-side region to find M^(k)_left(i).

From our experimental results, the optimal value of X0(i) is chosen to be the value associated with the maximum radius of convergence. For the case of p(i) = 2, the result for X0(i) is 1.45 and the radius of convergence is 0.505.
B.2 Conditions for Mapping Accuracy of the Truncated Taylor Series

In the previous section, we showed a method for finding the expansion point and radius of convergence for each unit. It is also desirable to find conditions which allow us to neglect the remainder terms (truncation error) after the Taylor series expansion.

Theorem B1 : The ratio of the remainder's magnitude to that of the p(i) degree term in Eq. (2.7) approaches zero as the radius of convergence for that unit approaches zero.

Proof : According to Eq. (2.7), the ratio of the remainder to the p(i) degree term is

    E(i,ξ) ( Xnet(i) - X0(i) )^{p(i)+1} / [ A(i,p(i)) ( Xnet(i) - X0(i) )^{p(i)} ] = K ( Xnet(i) - X0(i) )        (B.2)

where K is a finite constant. For bounded inputs, and therefore bounded activation outputs |φ(j)| <= φ^(m)(j), and the constraint |Xnet(j) - X0(j)| <= M(j), we set up the following equations:

    Σ_j w(i,j) φ^(m)(j) + θ(i) = X0(i) + M(i)
    -Σ_j w(i,j) φ^(m)(j) + θ(i) = X0(i) - M(i)        (B.3)

For simplicity, we assume that all the weights w(i,j) are equal to each other. Then the weights feeding the ith hidden unit and its threshold value are

    w(i,j) = M(i) / Σ_j φ^(m)(j) ,    θ(i) = X0(i)        (B.4)

Substituting Eq. (B.4) into the net input of the ith hidden unit (Eq. (2.9)),

    Xnet(i) = M(i) Σ_j φ(j) / Σ_j φ^(m)(j) + θ(i)        (B.5)

The ratio in Eq. (B.2) becomes

    K M(i) Σ_j φ(j) / Σ_j φ^(m)(j)        (B.6)

When M(i) → 0, the ratio of the remainder to the p(i) degree term becomes insignificant.

From the results of Theorem B1, we develop methods to realize monomial and two-input product subnets.
B.3 Monomial Subnet
In designing monomial subnets, we must decide the number of hidden units and
control the mapping accuracy.
Theorem B2 : The monomial function,xk, for a bounded input signalx, x ≤ x(m), can
be realized with one processing layer, having (k-1) hidden units, with
arbitrarily small errors (Figure B.1).
Proof : The hidden units are numbered i = 2 to k and, for the ith hidden unit, p(i) = i.
From Eq. (2.7), x^i is observed at the output of the ith hidden unit, along with unwanted
terms such as x^(i-1), x^(i-2), and so on, which need to be subtracted out. The zero degree and
first degree terms can be generated using a bias term and the connection from the input
layer. Thus the total number of hidden units required is (k−1). As in the proof of Theorem
B1, the input weight and threshold for the ith hidden unit are

$$w_x(i,1) = \frac{M(i)}{x^{(m)}}, \qquad \theta(i) = X_0(i) \qquad (B.7)$$

for i = 2, 3, ..., k. Then

$$X_{net}(i) - X_0(i) = x\,w_x(i,1) \qquad (B.8)$$

and clearly Eq. (2.7) becomes

$$\phi(i) = \sum_{j=0}^{p(i)} A(i,j)\,w_x^{\,j}(i,1)\,x^{j} + E(i,\xi)\,w_x^{\,i+1}(i,1)\,x^{\,i+1} \qquad (B.9)$$

Figure B.1 Monomial Subnet x^k.

Normalize Eq. (B.9) to get the normalized activation φ(i) as

$$\phi(i) = \sum_{j=0}^{p(i)} d(i,j)\,x^{j} + E_n(i) \qquad (B.10)$$

with

$$d(i,j) = \frac{A(i,j)\,M^{\,j-i}(i)}{A(i,i)\,(x^{(m)})^{\,j-i}}, \qquad E_n(i) = \frac{E(i,\xi)\,M(i)}{A(i,i)\,x^{(m)}}\,x^{\,i+1} \qquad (B.11)$$

After normalization, the coefficient of the highest degree term in each hidden unit's output is
one. Assume that x^k is formed by connecting the φ(i)'s to an output node through weights
wo(i), 2 ≤ i ≤ k. The net input of the output node is the total summation of all (k−1) hidden
units' outputs

$$\sum_{j=2}^{k} \sum_{m=j}^{k} w_o(m)\,d(m,j)\,x^{j} + \sum_{m=2}^{k} w_o(m)\,E_n(m) \qquad (B.12)$$

From Eq. (B.11), d(m,m) = 1. By picking wo(k) = 1 and

$$w_o(i) = -\sum_{j=i+1}^{k} w_o(j)\,d(j,i), \qquad i = 2, \ldots, k-1 \qquad (B.13)$$

then Eq. (B.12) becomes

$$x^{k} + \sum_{m=2}^{k} w_o(m)\,E_n(m) \qquad (B.14)$$

As stated in Theorem B1, En(m) is proportional to M(m). Clearly En(m) can be made
arbitrarily small by making M(m) small. Therefore the estimated errors can be made
arbitrarily small.
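The output weights of Eq. (B.13) are obtained by a simple back-substitution, working from the highest-degree hidden unit downward. A minimal Python sketch is given below; it assumes the normalized coefficients d(i,j) of Eq. (B.11) have already been tabulated in a dictionary, and it uses the minus sign of Eq. (B.13), which is what makes the unwanted degree-i terms cancel.

    def output_weights(k, d):
        # Back-substitution of Eq. (B.13).
        # d maps (i, j) -> d(i, j) for hidden units i = 2..k and j <= i.
        # wo[k] = 1, and the remaining weights are chosen so that the
        # unwanted terms of degree 2..k-1 cancel in Eq. (B.12).
        wo = {k: 1.0}
        for i in range(k - 1, 1, -1):
            wo[i] = -sum(wo[j] * d[(j, i)] for j in range(i + 1, k + 1))
        return wo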
In summary, the conditions to construct x^k with arbitrarily small error are

$$M(k) \rightarrow 0, \qquad M(i) = M(j) \quad \text{for } i \neq j \qquad (B.15)$$

As an example, we design a subnet for x^2. According to Theorem B2, there is only
one hidden unit required. Rewriting Eq. (B.10) for the case of p(i) = 2, the activation
output is
$$\phi(2) = \frac{A(2,0)}{A(2,2)\,w_x^{2}(2,1)} + \frac{A(2,1)}{A(2,2)\,w_x(2,1)}\,x + x^{2} + \frac{E(2,\xi)\,w_x(2,1)}{A(2,2)}\,x^{3}$$
$$= \frac{A(2,0)\,(x^{(m)})^{2}}{A(2,2)\,M^{2}(2)} + \frac{A(2,1)\,x^{(m)}}{A(2,2)\,M(2)}\,x + x^{2} + \frac{E(2,\xi)\,M(2)}{A(2,2)\,x^{(m)}}\,x^{3}$$
$$= \lambda_0 + \lambda_1\,x + x^{2} + Err \qquad (B.16)$$

By choosing M(2) small and the weights as shown in Figure B.2, x^2 can be observed at
the output unit with arbitrarily small errors. In Figure B.2, λ0 and λ1 are as given in Eq. (B.16) and

$$\lambda_2 = \frac{M(2)}{x^{(m)}}, \qquad \lambda_3 = \frac{(x^{(m)})^{2}}{A(2,2)\,M^{2}(2)} \qquad (B.17)$$
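The x^2 example is small enough to verify numerically. The sketch below is an illustration only, assuming a standard sigmoid activation, the bound x^(m) = 1, and an M(2) chosen for convenience; it builds the single hidden unit of Figure B.2 from Eqs. (B.7), (B.16) and (B.17) and checks that λ3·σ(λ2·x + θ) − λ0 − λ1·x tracks x^2 over the input range, with an error that shrinks as M(2) is reduced.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def square_subnet(x_max=1.0, M=0.05, X0=1.45):
        # Weights of the x**2 subnet of Figure B.2 (sigmoid activation assumed).
        s = sigmoid(X0)
        A0, A1, A2 = s, s * (1 - s), s * (1 - s) * (1 - 2 * s) / 2.0
        w = M / x_max                               # Eq. (B.7): lambda_2
        lam0 = A0 * x_max ** 2 / (A2 * M ** 2)      # Eq. (B.16)
        lam1 = A1 * x_max / (A2 * M)                # Eq. (B.16)
        lam3 = x_max ** 2 / (A2 * M ** 2)           # Eq. (B.17)
        return w, X0, lam0, lam1, lam3

    w, theta, lam0, lam1, lam3 = square_subnet(M=0.05)
    x = np.linspace(-1.0, 1.0, 201)
    estimate = lam3 * sigmoid(w * x + theta) - lam0 - lam1 * x
    print("max |estimate - x^2| :", np.max(np.abs(estimate - x ** 2)))  # shrinks with M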
B.4 Two-Input Product Subnet
In the discussion of monomial subnets, we proved that x^k can easily be
approximated with arbitrarily small error. Here, the realization of 2-input multipliers is
discussed. The result can be extended to a multi-input multiplier.
A 2-input multiplier subnet with bounded inputs, having 3 hidden units in one
processing layer, can be constructed starting with a monomial subnet for x^2 (replace x by
x1 + x2). Therefore, (x1 + x2)^2 = (x1^2 + 2x1x2 + x2^2) can be realized with arbitrarily small errors.

Figure B.2 Monomial Subnet x^2.

The unwanted x1^2 and x2^2 terms can be generated with parallel squaring subnets and
subtracted off, yielding x1x2 only, as shown in Figure B.3. The method
to initialize the product subnet is described as follows :
Given bounded inputs, x1^(n) ≤ x1 ≤ x1^(m) and x2^(n) ≤ x2 ≤ x2^(m), the net input Xnet(2),
with the constraint |Xnet(2) − X0(2)| ≤ M(2), can be written as

$$w_x(2,1)\,x_1^{(m)} + w_x(2,2)\,x_2^{(m)} + \theta(2) = X_0(2) + M(2) \qquad (B.18)$$

$$w_x(2,1)\,x_1^{(n)} + w_x(2,2)\,x_2^{(n)} + \theta(2) = X_0(2) - M(2) \qquad (B.19)$$

If the net output of the product unit (hidden unit 2 in Figure B.3) is normalized with the
coefficient of x1x2, the remainder can be rewritten as

$$\frac{E(2,\xi)\,(X_{net}(2) - X_0(2))^{3}}{2\,w_x(2,1)\,w_x(2,2)} \qquad (B.20)$$
From Eq. (B.20), minimizing the remainder term is equivalent to maximizing the product
wx(2,1)wx(2,2). By subtracting Eq. (B.19) from Eq. (B.18), we get

$$w_x(2,1)\,r_1 + w_x(2,2)\,r_2 = 2\,M(i) \qquad (B.21)$$

where r1 = x1^(m) − x1^(n) and r2 = x2^(m) − x2^(n). Taking the derivative of wx(2,1)wx(2,2) with respect
to either wx(2,1) or wx(2,2), the input weights and threshold which minimize the remainder
term can be found as

$$w_x(2,1) = \frac{M(i)}{r_1}, \qquad w_x(2,2) = \frac{M(i)}{r_2} \qquad (B.22)$$

$$\theta(i) = X_0(i) + M(i)\left(1 - \frac{x_1^{(m)}}{r_1} - \frac{x_2^{(m)}}{r_2}\right) \qquad (B.23)$$

After calculating the input weights, the output weights can be initialized in the same way
as in the monomial subnet. An example of a two-input product subnet with specific
weights and thresholds is shown in the following.
θ(1) = θ(2) = θ(3) = 1.451
wx(2,1) = wx(2,2) = 0.253
wx(1,1) = wx(3,2) = 0.505
λ0 = 200.753,  λ1 = 164.375
λ2 = 10.582,   λ3 = 41.094        (B.24)
Figure B.3 Product Subnet x1x2.
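To see the product construction work end to end, the following sketch squares x1 + x2, x1 and x2 with three copies of the x^2 subnet above and recovers x1·x2 from the identity x1x2 = ((x1+x2)^2 − x1^2 − x2^2)/2. It is an illustration only: it assumes sigmoid activations, symmetric input ranges x1, x2 in [−1, 1], and a small radius M chosen for accuracy rather than the specific values of Eq. (B.24).

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def square_estimator(u_max, M=0.02, X0=1.45):
        # Return a function u -> u**2 realized by one sigmoid unit (Figure B.2);
        # the constant and linear corrections are folded into the output layer.
        s = sigmoid(X0)
        A0, A1, A2 = s, s * (1 - s), s * (1 - s) * (1 - 2 * s) / 2.0
        w = M / u_max
        lam3 = 1.0 / (A2 * w ** 2)
        return lambda u: lam3 * (sigmoid(w * u + X0) - A0 - A1 * w * u)

    # Three hidden units: one for (x1 + x2)**2 and two parallel squaring subnets.
    sq_sum = square_estimator(u_max=2.0)    # x1 + x2 ranges over [-2, 2]
    sq_1   = square_estimator(u_max=1.0)
    sq_2   = square_estimator(u_max=1.0)

    x1, x2 = np.meshgrid(np.linspace(-1, 1, 41), np.linspace(-1, 1, 41))
    product = 0.5 * (sq_sum(x1 + x2) - sq_1(x1) - sq_2(x2))
    print("max |product - x1*x2| :", np.max(np.abs(product - x1 * x2)))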
APPENDIX C
FEATURE DATA SET
Our goal in this Appendix is to enumerate several different kinds of feature sets
which are used in this dissertation to test the classification capabilities of the MLP
networks. The features were calculated from four classes of geometric shapes. The four
primary geometric shapes are ellipse, triangle, quadrilateral, and pentagon. Several
example shape images are shown in Figure C.1. Each shape image consists of a matrix
of size 64 x 64, and each element in the matrix represents a binary-valued pixel in the
image. The feature types are Circular Harmonic Expansion (CHEF) [60], Fourier-Mellin
Descriptor (FMDF) [61], Ring-Wedge Energy (RWEF) [62], Log-Polar Transform (LPTF)
[63] and Radius Feature (RDF) [64]. In Table C.1, we summarize the different shape
feature sets.
Table C.1 Shape Features.
Shape Features # of Inputs # of Classes # of Patterns per Class
CHEF 16 4 200
FMDF 16 4 200
RWEF 16 4 200
LPTF 16 4 200
RDF 16 4 200
Figure C.1 Examples of Geometric Shapes.
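Each feature set in Table C.1 therefore amounts to a collection of 16-dimensional pattern vectors with one of four class labels and 200 patterns per class. A hypothetical layout for presenting such a set to an MLP trainer is sketched below; the file name, file format, and helper names are illustrative assumptions, not the formats used in this dissertation.

    import numpy as np

    N_INPUTS, N_CLASSES, N_PER_CLASS = 16, 4, 200   # Table C.1

    def load_feature_set(path):
        # Hypothetical loader: one whitespace-delimited row per pattern,
        # 16 feature values followed by a class index in {0, 1, 2, 3}.
        data = np.loadtxt(path)
        features, labels = data[:, :N_INPUTS], data[:, N_INPUTS].astype(int)
        assert features.shape == (N_CLASSES * N_PER_CLASS, N_INPUTS)
        # One-hot targets, one output per shape class (ellipse, triangle,
        # quadrilateral, pentagon), as is usual for an MLP classifier.
        targets = np.eye(N_CLASSES)[labels]
        return features, targets

    # features, targets = load_feature_set("chef.dat")   # e.g. the CHEF set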
REFERENCES
[1] M. Schetzen, The Volterra and Wiener Theories of Nonlinear Systems, Wiley, 1980.
[2] N. Gallagher and G. Wise, "A Theoretical Analysis of the Properties of Median Filters," IEEE Trans. on Acoust., Speech, Signal Proc., Vol. ASSP-29, pp. 1136-1141, Dec. 1981.
[3] G. Arce and N. Gallagher, "State Description for the Root-Signal Set of Median Filters," IEEE Trans. on Acoust., Speech, Signal Proc., Vol. ASSP-30, pp. 894-902, Dec. 1982.
[4] J. Fitch, E. Coyle and N. Gallagher, "Median Filtering by Threshold Decomposition," IEEE Trans. on Acoust., Speech, Signal Proc., Vol. ASSP-32, pp. 1183-1188, Dec. 1984.
[5] P. Maragos and R. Schafer, "Morphological Filters - Part I: Their Set-Theoretic Analysis and Relations to Linear Shift-Invariant Filters," IEEE Trans. on Acoust., Speech, Signal Proc., Vol. ASSP-35, pp. 1153-1169, Aug. 1987.
[6] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[7] K. Fukunaga. Introduction to Statistical Pattern Recognition. New York: Academic Press, 1972.
[8] D.F. Specht, "Probabilistic neural networks and the polynomial Adaline as complementary techniques for classification," IEEE Trans. Neural Networks, 1(1):111-121, Mar. 1990.
[9] M. Caudill, "The polynomial Adaline algorithm," Comput. Lang., Dec. 1988.
[10] D. Gabor et al., "A universal nonlinear filter, predictor and simulator which optimizes itself by a learning process," Proc. Inst. Elec. Eng., Vol. 108B, pp. 422-438, 1961.
[11] T. Cover and P. Hart, "Nearest Neighbor Pattern Classification," IEEE Trans. Information Theory, IT-13, pp. 21-27, 1967.
[12] O.J. Murphy, "Nearest neighbor pattern classification perceptrons," Proc. IEEE, 78(10):1595-1598, Oct. 1990.
[13] W.S. McCulloch and W.H. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics, Vol. 5, pp. 115-133, 1943.
[14] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, Vol. 65, pp. 386-408, 1958.
[15] W.Y. Huang and R.P. Lippmann, "Neural net and traditional classifiers," in D. Anderson, editor, Neural Info. Processing Syst., pp. 387-396, American Institute of Physics, New York, 1988.
[16] R.P. Lippmann, "Pattern classification using neural networks," IEEE Commun. Mag., Vol. 27, pp. 47-64, 1989.
[17] R.P. Lippmann, "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, pp. 4-22, April 1987.
[18] J.F. Steffensen. Interpolation. Chelsea Publishing Company, New York, 1950.
[19] D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning internal representations by error propagation," in D.E. Rumelhart and J.L. McClelland (Eds.), Parallel Distributed Processing, Vol. I, Cambridge, Massachusetts: The MIT Press, 1986.
[20] S. Knerr, L. Personnaz and G. Dreyfus, "Single-layer Learning Revisited: A Stepwise Procedure for Building and Training a Neural Network," NATO Workshop on Neurocomputing, Les Arcs, France, Feb. 1989.
[21] M. Mezard and J.P. Nadal, "Learning in Feedforward Layered Networks: the Tiling Algorithm," J. Phys. A, 22, pp. 2191-2203, 1989.
[22] Ehud D. Karnin, "A Simple Procedure for Pruning Back-Propagation Trained Neural Networks," IEEE Trans. on Neural Networks, Vol. 1, No. 2, 1990.
[23] M.C. Mozer and P. Smolensky, "Skeletonization: A technique for trimming the fat from a network via relevance assessment," in Advances in Neural Information Processing Systems I, D.S. Touretzky, Ed., Morgan Kaufmann, pp. 107-115, 1989.
[24] G. Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Math. Control, Signals, Syst., Vol. 2, pp. 303-314, 1989.
[25] A. Lapedes and R. Farber, "Nonlinear signal processing using neural networks: prediction and system modeling," Los Alamos National Laboratory, Los Alamos, N.M., TR LA-UR-87-2662, 1987.
[26] Robert Hecht-Nielsen, "Theory of the backpropagation neural network," in Proceedings of the International Joint Conference on Neural Networks, Vol. I, pp. 593-605, Washington D.C., June 1989.
[27] Maxwell Stinchcombe and Halbert White, "Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions," in Proceedings of the IJCNN, Vol. I, pp. 613-617, Washington D.C., June 1989.
[28] O. Nerrand, P. Roussel-Ragot, L. Personnaz and G. Dreyfus, "Neural Network Training Schemes for Non-linear Adaptive Filtering and Modelling," in Proceedings of the IJCNN, Vol. I, pp. 61-66, 1991.
[29] C. Klimasauskas, "Neural Nets and Noise Filtering," Dr. Dobb's Journal, pp. 32, Jan. 1989.
[30] Brooke Anderson and Don Montgomery, "A Method for Noise Filtering with Feed-forward Neural Networks: Analysis and Comparison with Low-pass and Optimal Filtering," in Proceedings of the IJCNN, Vol. I, pp. 209-214, 1990.
[31] P. Gallinari, S. Thiria and F. Fogelman Soulie, "Multilayer Perceptrons and data analysis," Proceedings of the IJCNN, Vol. I, pp. 391-399, 1988.
[32] H. Asoh and N. Otsu, "Nonlinear data analysis and multilayer perceptrons," in Proceedings of the IJCNN, Vol. II, pp. 411-415, 1989.
[33] Toshio Irino and Hideki Kawahara, "A Method for Designing Neural Networks Using Nonlinear Multivariate Analysis: Application to Speaker-Independent Vowel Recognition," Neural Computation, Vol. 2, No. 3, pp. 386-397, 1990.
[34] Osamu Fujita, "A Method for Designing the Internal Representation of Neural Networks," in Proceedings of the IJCNN, Vol. III, pp. 149-154, 1990.
[35] J. Park and I.W. Sandberg, "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, Vol. 3, No. 2, pp. 246-257, 1991.
[36] S. Qian, Y.C. Lee, R.D. Jones, C.W. Barnes and K. Lee, "Function Approximation with an Orthogonal Basis Net," in Proceedings of the IJCNN, Vol. III, pp. 605-619, 1990.
[37] Wilson J. Rugh. Nonlinear System Theory: The Volterra/Wiener Approach. The Johns Hopkins Univ. Press, 1981.
[38] Martin Schetzen, "Nonlinear System Modelling Based on the Wiener Theory," Proceedings of the IEEE, Vol. 69, No. 12, 1981.
[39] S. Chen, S.A. Billings and P.M. Grant, "Non-linear system identification using neural networks," Int. J. Control, Vol. 51, pp. 1191-1214, 1990.
[40] D.S. Broomhead and D. Lowe, "Multivariable Functional Interpolation and Adaptive Networks," Complex Systems, 2, pp. 321-355, 1988.
[41] M.J.D. Powell, "Radial basis functions for multi-variable interpolation: A review," IMA Conference on Algorithms for the Approximation of Functions and Data, RMCS Shrivenham, UK, 1985.
[42] C.A. Micchelli, "Interpolation of Scattered Data: Distance Matrices and Conditionally Positive Definite Functions," Constructive Approximation, 2, pp. 11-22, 1986.
[43] S. Chen, C.F.N. Cowan and P.M. Grant, "Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks," IEEE Trans. on Neural Networks, Vol. 2, No. 2, pp. 302-309, March 1991.
[44] Mu-Song Chen and M.T. Manry, "Back-Propagation Representation Theorem Using Power Series," in Proceedings of the IJCNN, Vol. I, pp. 643-648, 1990.
[45] A.N. Kolmogorov, "On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition," Dokl. Akad. Nauk USSR, pp. 953-956, 1957.
[46] R.R. Goldberg. Methods of Real Analysis. John Wiley and Sons, New York, 1976.
[47] R. Courant and D. Hilbert. Methods of Mathematical Physics, Vol. 1. Interscience Publishers, New York, 1955.
[48] D. Quintin Peasley. Coefficients of Associated Legendre Functions. Washington: National Aeronautics and Space Administration, 1976.
[49] J.T. Tou and R.C. Gonzalez. Pattern Recognition Principles. Addison-Wesley Publishing Company, 1981.
[50] R. Fletcher and C.M. Reeves, "Function minimization by conjugate gradients," Computer J., Vol. 7, pp. 149-154, 1964.
[51] R. Fletcher, "Conjugate direction methods," in Numerical Methods for Unconstrained Optimization, W. Murray, Ed., London and New York: Academic Press, pp. 73-86, 1972.
[52] H. Crowder and P. Wolfe, "Linear convergence of the conjugate gradient method," IBM J. Res. Develop., Vol. 16, pp. 431-433, 1972.
[53] J. Kowalik and M.R. Osborne. Methods for Unconstrained Optimization Problems. New York: American Elsevier, 1968.
[54] Richard L. Burden and J. Douglas Faires. Numerical Analysis, third edition, 1985.
[55] A. Bjorck, "Solving linear least squares problems by Gram-Schmidt orthogonalization," Nordisk Tidskr. Information-Behandling, Vol. 7, pp. 1-21, 1967.
[56] G. Golub, "Numerical methods for solving linear least squares problems," Numerische Mathematik, Vol. 7, pp. 206-216, 1965.
[57] A. Papoulis. Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill, 1965.
[58] E.A. Wan, "Neural Network Classification," IEEE Trans. on Neural Networks, Vol. 1, No. 4, 1990.
[59] Kreyszig, Erwin. Advanced Engineering Mathematics, Fourth Edition. New York: Wiley and Sons, 1979.
[60] Y-N Hsu and H.H. Arsenault, "Pattern discrimination by multiple circular harmonic components," Applied Optics, Vol. 23, pp. 841-844, 1984.
[61] Y. Sheng and H.H. Arsenault, "Experiments on pattern recognition using invariant Fourier-Mellin descriptors," Optical Society of America, Vol. 3, pp. 771-776, 1986.
[62] N. George, S. Wang and D.L. Venable, "Pattern recognition using ring-wedge detector and neural-network software," SPIE Vol. 1134, Optical Pattern Recognition II, pp. 96-106, 1989.
[63] D. Casasent and D. Psaltis, "Position, rotation, and scale invariant optical correlation," Applied Optics, Vol. 15, 1976.
[64] H.C. Yau, "Transform-based shape recognition employing neural networks," Ph.D. Dissertation, Univ. of Texas at Arlington, 1990.