ANALYSIS AND DESIGN OF THE MULTI-LAYER PERCEPTRON USING
POLYNOMIAL BASIS FUNCTIONS
The members of the committee approve the doctoral
dissertation of Mu-Song Chen
Michael T. Manry, Supervising Professor
Kai S. Yeung
Venkat Devarajan
Jonathan Bredow
Daniel S. Levine
Dean of the Graduate School
ANALYSIS AND DESIGN OF THE MULTI-LAYER PERCEPTRON USING
POLYNOMIAL BASIS FUNCTIONS
by
MU-SONG CHEN
Presented to the Faculty of the Graduate School of
The University of Texas at Arlington in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT ARLINGTON
DECEMBER 1991
ACKNOWLEDGEMENTS
I would like to express my deepest appreciation to my supervising professor, Dr.
Michael T. Manry, for his support, encouragement, and guidance. Without his constant
encouragement and willingness to meet at odd hours, I would not have been able
to complete my dissertation. I would also like to thank the other members of my
dissertation committee, Dr. Yeung, Dr. Devarajan, Dr. Levine, and Dr. Bredow, for
providing constructive suggestions.
I also owe a great deal to the members of the Image Processing and Neural
Networks Lab including Kamyar Rohani and Steve Apollo, for helping me with the
software tools on the school computers.
Finally, I wish to thank my parents, who always believed in higher education for
their children, for their support and encouragement. I am also forever grateful for the
sacrifices they made in supporting me while I was miles and miles away from home.
November 7, 1991
ABSTRACT
ANALYSIS AND DESIGN OF THE MULTI-LAYER PERCEPTRON USING
POLYNOMIAL BASIS FUNCTIONS
Publication No._________
Mu-Song Chen, Ph.D.
The University of Texas at Arlington, 1991
Supervising Professor: Michael T. Manry
In this dissertation, the theory of polynomial basis functions is developed as a
means for the design and analysis of multi-layer perceptron (MLP) neural networks.
Methods and algorithms are presented for designing the MLP network system using
polynomial models. The theory enables us to develop an approximation theorem for the
MLP network, to map an existing N-dimensional polynomial function to the MLP network
(forward mapping), and to construct polynomial discriminant functions from an existing
MLP network (inverse mapping).
There are several advantages associated with forward and inverse mappings. The
forward mapping allows us to determine the minimum required network topology and to
initialize the network with small errors. The inverse mapping allows us to prune the
useless units in the MLP network, determine the complexity of the conventional
implementation of the network, and find the polynomial approximation of the network
output.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF ACRONYMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
NOMENCLATURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
CHAPTER
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Nonlinear Modelling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 MLP Neural Network Model and Its Problems. . . . . . . . . . . . . . . . . . 5
1.3 The Scope of the Dissertation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2. NETWORK BASIS FUNCTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Orthogonal Basis Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Radial Basis Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Polynomial Basis Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3. MLP APPROXIMATION THEOREMS . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 Polynomial Approximating of Functions. . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Approximating Functions of One Variable. . . . . . . . . . . . . . . . . 24
3.1.2 Approximating Functions of Many Variables. . . . . . . . . . . . . . . 26
3.2 Realization of Multi-Input Products. . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Completion of the Proof. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4. FORWARD MAPPINGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1 Complete Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.1 Multi-Layer Complete Networks . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.2 Experimental Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.3 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Compact Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 A Lower Bound on the Number of Weights. . . . . . . . . . . . . . . 44
4.2.2 Construction of Compact Networks with Monomial Activation . . 47
4.2.2.1 Compact Mapping with Block Approach. . . . . . . . . . . . . . 49
4.2.2.2 Compact Mapping with Group Approach. . . . . . . . . . . . . . 52
4.2.3 Conversion of Monomial Activation to Analytic Activation. . . . . 54
4.2.4 Sparse Second Degree Compact Network . . . . . . . . . . . . . . . . . 56
4.2.5 Experimental Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.6 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5. INVERSE MAPPINGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Polynomial Network Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Calculation of the Condensed Network Model. . . . . . . . . . . . . . . . . . 71
5.3 Network Pruning Using the Condensed Network Model. . . . . . . . . . . 75
5.4 Calculation of the Exhaustive Network Model. . . . . . . . . . . . . . . . . . 82
5.5 Experiments with the Exhaustive Network Model. . . . . . . . . . . . . . . 84
5.5.1 MLP Neural Network Filter . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.2 MLP Neural Network Classifier . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.3 Experiments with Quadratic Discriminants. . . . . . . . . . . . . . . . . 87
6. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
APPENDIX A. BACK-PROPAGATION LEARNING ALGORITHM . . . . . . 93
APPENDIX B. REALIZATION OF MONOMIAL AND TWO-INPUT
PRODUCT SUBNETS. . . . . . . . . . . . . . . . . . . . . . . . . . . 97
B.1 Find X0 for the Second Degree Taylor Series. . . . . . . . . . . . . . . . . . 98
B.2 Conditions for Mapping Accuracy of the Truncated Taylor Series. . . . 99
B.3 Monomial Subnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
B.4 Two-Input Product Subnet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
APPENDIX C. FEATURE DATA SET . . . . . . . . . . . . . . . . . . . . . . . . . . 108
REFERENCES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
LIST OF FIGURES
Figure 1.1 The Multi-Layer Perceptron Network . . . . . . . . . . . . . . . . . . . . . . 6
Figure 1.2 Artificial Neuron Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Figure 1.3 Four Representative Activation Functions. . . . . . . . . . . . . . . . . . . . . 9
Figure 1.4 Block Diagram of the Proposed Tasks. . . . . . . . . . . . . . . . . . . . . . . 12
Figure 2.1 MLP Network Representation via Radial Basis Functions . . . . . . . . . 17
Figure 2.2 MLP Network Representation via Polynomial Basis Functions . . . . . . 21
Figure 3.1 Monomial Subnet with Multiple Inputs. . . . . . . . . . . . . . . . . . . . . . . 29
Figure 3.2 Construction of f(x) by Subnet Approaches . . . . . . . . . . . . . . . . . . 31
Figure 4.1 Block Diagram of Forward Mapping. . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 4.2 Subnet Approach for Mapping a 4-Input Second Degree Polynomial . . 35
Figure 4.3 The 4-Input Second Degree Complete Network . . . . . . . . . . . . . . . 35
Figure 4.4 The Multi-Layer Complete Network for Realizing a Function with N =
2 and P = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 4.5 The Multi-Layer Complete Network for Realizing Product x1 x2 ⋯ x7 . . 40
Figure 4.6 Training Results of a Complete Network (16-136-4) with CHEF Data . 42
Figure 4.7 Training Results of a Complete Network (16-136-4) with LPTF Data . . 43
Figure 4.8 Compact Mapping with Block Approach. . . . . . . . . . . . . . . . . . . . . . 48
Figure 4.9 Flowchart of Iterative Conjugate-Gradient Method. . . . . . . . . . . . . . . 51
Figure 4.10 Compact Mapping with Group Approach. . . . . . . . . . . . . . . . . . . . . 53
Figure 4.11 Conversion of a kth Degree Monomial Activation to Sigmoid
Activation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 4.12 Replace Monomial Activation (2nd Degree) by Sigmoid Activation . . 57
Figure 4.13 Efficient Compact Mapping of Second Degree Function. . . . . . . . . . 59
Figure 4.14 Training Results of a Compact Network with RWEF Data . . . . . . . . 61
Figure 4.15 Group Approach for Realizing All x1 Terms (Second Degree Case) . . 63
Figure 4.16 Group Approach for Realizing All x2 Terms (Second Degree Case) . . 63
Figure 4.17 Group Approach for Realizing All x3 Terms (Second Degree Case) . . 64
Figure 4.18 Group Approach for Realizing All x4 Terms (Second Degree Case) . . 64
Figure 4.19 Group Approach for Realizing All x5 Terms (Second Degree Case) . . 65
Figure 4.20 Group Approach for Realizing All x6 Terms (Second Degree Case) . . 65
Figure 4.21 Block Approach for Realizing All Terms of a Second Degree
Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 5.1 Block Diagram of Inverse Mappings. . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 5.2 Decide Unit Degree p(i) for the ith Unit . . . . . . . . . . . . . . . . . . . . 73
Figure 5.3 The Shaded Unit i is Ready to be Removed . . . . . . . . . . . . . . . . . . 77
Figure 5.4 Pattern Classifier Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 5.5 The Network After Pruning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 5.6 Results of Karnin’s Analysis (Shaded Units and Dark Lines are
Candidates for Removal). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Figure 5.7 3-Input Median Filter Network with Layer Structure 3-10-1 and
Maximum Degree p(i) = 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Figure A.1 Backpropagate the Error Signals from Output Layer. . . . . . . . . . . . . 96
Figure B.1 Monomial Subnet x^k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Figure B.2 Monomial Subnet x^2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Figure B.3 Product Subnet x1x2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Figure C.1 Examples of Geometric Shapes. . . . . . . . . . . . . . . . . . . . . . . . . . . 110
LIST OF TABLES
Table 4.1 Subnet Approach for Realizing a Function with N = 2 and P = 3 . . . . . 36
Table 4.2 Comparisons of Single-Layer and Multi-Layer Complete Networks . . . . 41
Table 4.3 Classification Results of Gaussian Classifier and Complete Network . . . 42
Table 4.4 Classification Error Percentages for Gaussian Classifier and Second
Degree Compact Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 4.5 Mean and STD Deviation for the Coefficients in Different Functions . . 61
Table 4.6 Results for Mapping a 4th Degree Function to Compact Networks,
Using Group and Block Approaches . . . . . . . . . . . . . . . . . . . . . . . . 62
Table 4.7 Compare the Required Hidden Units with the Theoretical Results . . . . . 66
Table 4.8 Using Block Approach to Approximate Function 1/(x1x2) . . . . . . . . . . 68
Table 5.1 Truth Table of the Exclusive-OR Problem. . . . . . . . . . . . . . . . . . . . . 74
Table 5.2 Approximation of the Exclusive-OR MLP Network (Layer Structure :
2-1-1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Table 5.3 Truth Table of the Parity-Check Problem. . . . . . . . . . . . . . . . . . . . . . 74
Table 5.4 Approximation of the Parity-Check MLP Network (Layer Structure :
3-10-1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Table 5.5 Weights in Figure 5.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Table 5.6 Input Weight Sensitivities in Figure 5.4. . . . . . . . . . . . . . . . . . . . . . . 79
Table 5.7 Output Weight Sensitivities in Figure 5.4. . . . . . . . . . . . . . . . . . . . . . 80
Table 5.8 Degree of Each Unit for 3-Input Median Filter When T = 0.01 . . . . . . 86
Table 5.9 Output Coefficients of the Approximating Polynomial for 3-Input
Median Filter Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Table 5.10 Error % for Gaussian, Quadratic Discriminants and MLP Network . . . 88
Table 5.11 Analysis of Similarity Between Gaussian and Quadratic
Discriminants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Table 5.12 Training of the Quadratic Discriminants. . . . . . . . . . . . . . . . . . . . . . 90
Table C.1 Shape Features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
LIST OF ACRONYMS
MLP Multi-Layer Perceptron
BP Back-Propagation
PNN Probabilistic Neural Network
RBF Radial Basis Function
PBF Polynomial Basis Function
UB Upper Bound Number of Hidden Units
LB Lower Bound Number of Hidden Units
MSE Mean Square Error
CHEF Circular Harmonic Expansion Features
RDF Radius Features
LPTF Log-Polar Transform Features
RWEF Ring-Wedge Energy Features
FMDF Fourier-Mellin Descriptor Features
NOMENCLATURE
N dimension of f(·) or number of input units of the MLP network
x N-dimensional input vector [x1, x2, ..., xN]^T
di(x) the ith discriminant function
p(i) the probability of occurrence for the ith class
mi the mean vector for class i
Covi covariance matrix for class i
Dis(·) distance measure function
Xnet(i) net input of the ith hidden unit
φ(i) polynomial basis function for the ith hidden unit
w(i,j) connection weight from unit j to unit i
θ(j) threshold of the jth hidden unit
G(·) analytic activation output
E(W) error function for the MLP network
Nh number of hidden units
f(·) the desired network output function
Nv number of training vectors (patterns)
Ωi(·) the ith radial basis function
αi connection weight from the ith hidden unit to the output unit in the RBF network
X0(i) expansion point for Taylor series of the ith hidden unit
A(i,j) Taylor series coefficient of the jth power term in the ith unit
E(i,ξ) the remainder coefficient for Taylor series approximation of the ith hidden unit's activation output
p(i) degree of polynomial expansion for the ith unit's activation output
M(i) convergence radius for the ith hidden unit's power series
X vector of all one-term polynomials with degrees 0 through P
L dimension of X
Af coefficient vector of X
P the output degree of the MLP networks
wo(i) output weight connecting to the ith class
C output coefficient matrix of the MLP network
Ns total number of sampling points
ϕi(x) orthogonal polynomial basis function with a single variable
Ψi(x) orthogonal polynomial basis function with multiple variables
u(x) weighting function for orthogonal polynomial basis functions
fk(x) an N-dimensional polynomial with terms of degree k only
<φ(i)> the average output activation of the ith hidden unit
D(i,j) polynomial coefficient of the jth power term in the ith hidden unit
d(i,n) the coefficient vector for Xnet^n(i)
E(p(i)) mean square error between the actual activation output and its polynomial approximation with degree p(i) in the ith hidden unit
R(p(i)) relative mean square error of E(p(i))
T threshold for R(p(i))
Sij weight sensitivities between hidden units i and j
NIT total number of iterations
p(J) the maximum output degree in hidden layer J
T(p)(i) desired output of the ith class for the pth pattern
O(p)(i) output of the ith output unit for the pth pattern
η learning factor in the BP algorithm
Nc number of output classes
CHAPTER 1
INTRODUCTION
Real world systems for processing signals are commonly classified according to
several criteria. Among these criteria are (1) the purpose of the system, such as control,
communication, signal processing etc., and (2) the mathematical characteristics of the
system such as linearity or non-linearity, time variance or invariance etc. The
mathematical characteristics of the system are determined by modelling its behavior for
different classes of inputs. The mathematical complexity of the model depends on how
much is known about the process being studied and on the purpose of the modelling
exercise. In preliminary studies of systems, the models are often assumed to be linear in
the parameters. Such models will be referred to as linear models. In the early
development of signal processing, linear systems were the primary tools. Their
mathematical simplicity and the existence of some desirable properties made them easy
to design and implement. However, more realistic and accurate models are often
nonlinear in the parameters; these will be referred to as nonlinear models. In this chapter, we briefly
discuss some conventional nonlinear modelling techniques, which are applied in systems
for filtering and classification. Then we proceed to introduce multi-layer perceptron neural
networks as an alternative approach for nonlinear modelling.
1.1 Nonlinear Modelling
Nonlinear systems are often applied in filtering and classification. For example,
it is well known that in detection and estimation problems, nonlinear filters arise in the
case when the signal and noise joint densities are not Gaussian, and when the noise is not
independent of the signal. A possible way to describe the input-output relationship in a
nonlinear filter is to use a discrete Volterra series representation. The Volterra series can
be viewed as a Taylor series with memory [1]
$$y(n) = h_0 + \sum_{k=1}^{N} H_k[x(n)] \qquad (1.1)$$

where x(n) and y(n) denote the input and output, respectively, and

$$H_k[x(n)] = \sum_{i_1=0}^{N-1} \sum_{i_2=0}^{N-1} \cdots \sum_{i_k=0}^{N-1} h_k(i_1, i_2, \ldots, i_k)\, x(n-i_1)\, x(n-i_2) \cdots x(n-i_k) \qquad (1.2)$$

The Volterra filter is general enough to model many of the classical nonlinear filters,
including order statistic [2]-[4] and morphological filters [5]. By adding higher-order
terms to the Volterra series, its modelling accuracy can be improved. Filters based on the
Volterra series and another representation, called the Wiener series [1], will be referred to as
polynomial filters. Unfortunately, the design of polynomial filters often requires
knowledge of the higher order statistics of the input signal. Filters of higher than second
degree are usually impractical because the number of coefficients is too large.
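As an illustration of Eqs. (1.1)-(1.2), the sketch below evaluates a Volterra filter truncated at second order; the memory length, kernel values, and input sequence are placeholders chosen only for the example, not values from the dissertation. Even at second order the kernel h2 already has N² entries, which hints at why higher-degree polynomial filters become impractical.

```python
import numpy as np

# Minimal sketch of a truncated (second-order) Volterra filter, Eqs. (1.1)-(1.2).
# The kernels h0, h1, h2 and the memory length are illustrative placeholders.
def volterra_output(x, n, h0, h1, h2):
    """y(n) = h0 + sum_i h1(i) x(n-i) + sum_{i1,i2} h2(i1,i2) x(n-i1) x(n-i2)."""
    N = len(h1)                                   # memory length
    xs = np.array([x[n - i] for i in range(N)])   # delayed samples x(n), x(n-1), ...
    linear = h1 @ xs                              # first-order (linear) term
    quadratic = xs @ h2 @ xs                      # second-order term
    return h0 + linear + quadratic

rng = np.random.default_rng(0)
x = rng.normal(size=100)                          # example input sequence
h0, h1 = 0.1, np.array([0.5, 0.25, 0.1])
h2 = 0.05 * np.eye(3)                             # diagonal second-order kernel
print(volterra_output(x, n=10, h0=h0, h1=h1, h2=h2))
```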
Polynomials also arise in classification problems. They are members of the class
of parametric classifiers, which means that the feature vectors to be classified are assumed
to have probability densities with only a few parameters. The rth order polynomial
discriminant function can be expressed as

$$d_i(\mathbf{x}) = \omega_{i1}\, g_1(\mathbf{x}) + \omega_{i2}\, g_2(\mathbf{x}) + \cdots + \omega_{ik}\, g_k(\mathbf{x}) + \omega_{i,k+1} \qquad (1.3)$$

where x is an N-dimensional vector, x = [x1, x2, ..., xN]^T, and T denotes transpose. In
general, ω_ij is referred to as a weight and g_j(x) is of polynomial form

$$g_j(\mathbf{x}) = x_{k_1}^{n_1}\, x_{k_2}^{n_2} \cdots x_{k_r}^{n_r}, \qquad n_1, n_2, \ldots, n_r = 0 \text{ or } 1, \qquad k_1, k_2, \ldots, k_r = 1, \ldots, N \qquad (1.4)$$

In the case of r = 2, d_i(x) is called a quadratic discriminant function

$$d_i(\mathbf{x}) = \sum_{j=1}^{N} \omega_{jj}\, x_j^2 + \sum_{j=1}^{N-1} \sum_{m=j+1}^{N} \omega_{jm}\, x_j x_m + \sum_{j=1}^{N} \omega_j\, x_j + \omega_{k+1} \qquad (1.5)$$

and k = ½N(N + 3). The quadratic discriminant function is useful because it is the
nonlinear discriminant which is most easily designed. It arises from the Bayes classifier,
when the feature vectors have a Gaussian joint probability density [6]-[7]. For normally
distributed patterns x, the optimal Bayes decision function of the ith class can be found
as

$$d_i(\mathbf{x}) = \ln p(i) - \frac{1}{2}\ln \left| \mathrm{Cov}_i \right| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathrm{Cov}_i^{-1} (\mathbf{x} - \mathbf{m}_i) \qquad (1.6)$$

p(i), m_i and Cov_i denote the probability of occurrence, mean vector and covariance matrix
for the ith class, respectively. The performance of the Bayes classifier is often sub-optimal
because the distribution of the patterns is often non-Gaussian or even discrete-valued, the
class statistics are estimated from a finite number of training or example vectors, and the
covariance matrix inversions can be ill-conditioned. Specht [8] has presented a
probabilistic neural network (PNN), which has been approximated via Taylor series to
yield the Padaline [9]. The PNN is based on the Bayes strategy of finding the network
output in a polynomial form. Gabor [10] designed a machine consisting of a polynomial
classifier, together with a training algorithm. The training algorithm optimizes the output
by successive adjustment of the coefficients until the output errors are small. The
problems with polynomial discriminant functions are the same as the problems with
conventional polynomial filters. Discriminants of degree greater than two are rarely used
because of the large storage requirements and the large number of operations necessary
to generate outputs. For trainable discriminants, such as those of Gabor [10], the training
process is also time consuming.
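A minimal numerical sketch of the Gaussian discriminant of Eq. (1.6) follows; the class statistics and test vector below are placeholders, standing in for estimates that would normally come from training data.

```python
import numpy as np

# Sketch of the Bayes-Gaussian discriminant d_i(x) of Eq. (1.6).
# p_i, m_i, and cov_i would normally be estimated from training vectors of class i.
def gaussian_discriminant(x, p_i, m_i, cov_i):
    diff = x - m_i
    return (np.log(p_i)
            - 0.5 * np.log(np.linalg.det(cov_i))
            - 0.5 * diff @ np.linalg.solve(cov_i, diff))

# Example with two synthetic classes in N = 2 dimensions.
x = np.array([0.4, -0.2])
scores = [gaussian_discriminant(x, 0.5, np.array([0.0, 0.0]), np.eye(2)),
          gaussian_discriminant(x, 0.5, np.array([1.0, 1.0]), np.eye(2))]
print("assigned class:", int(np.argmax(scores)))
```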
The other method for classification is to apply a nonparametric model (no
assumption is made about the underlying data distribution), such as the nearest neighbor
classifier [11]-[12]. Given a set of reference vectors R1, R2, ..., RM associated with
classes Ω1, Ω2, ..., ΩM, the rule of the nearest neighbor classifier assigns a pattern x to
the class of its nearest neighbor. Thus Ri is the nearest neighbor to x if

$$\mathrm{Dis}(\mathbf{R}_i, \mathbf{x}) = \min_{k = 1, 2, \ldots, M} \mathrm{Dis}(\mathbf{R}_k, \mathbf{x}) \qquad (1.7)$$

where Dis(·) is any distance measure defined over the pattern space. The nearest
neighbor classifier approximates the minimum error Bayes classifier as the number of
reference vectors becomes large. However, at the same time, the computational
complexity of the nearest neighbor classifier increases. Also, for a small number of the
reference vectors, the nearest neighbor classifier is not optimal with respect to the training
data.
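The decision rule of Eq. (1.7) reduces to a distance comparison against all reference vectors; a short sketch with a Euclidean choice for Dis(·) and made-up reference vectors:

```python
import numpy as np

# Nearest neighbor rule of Eq. (1.7) with a Euclidean distance measure.
# R is an M x N array of reference vectors, labels holds their class indices.
def nearest_neighbor(x, R, labels):
    distances = np.linalg.norm(R - x, axis=1)     # Dis(R_k, x) for k = 1..M
    return labels[int(np.argmin(distances))]      # class of the nearest reference

R = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 0.1]])   # example reference vectors
labels = np.array([0, 1, 1])
print(nearest_neighbor(np.array([0.8, 0.2]), R, labels))   # -> 1
```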
1.2 MLP Neural Network Model and Its Problems
In 1943, McCulloch and Pitts [13] proposed a mathematical model of the neuron.
These abstract nerve cells provided the basis for a formal calculus of brain activity. In
1958, Rosenblatt [14] presented the Rosenblatt perceptron, which was the most successful
neural network system of that time. It was an elementary visual system which could be
taught to recognize a limited class of patterns. This perceptron model is the foundation
for many other forms of artificial neural networks [15].
Definition : A perceptron is a device which computes a weighted sum of its inputs, and
puts this sum through a special function, called the activation, to produce
the output. The activation function can be linear or nonlinear.
A network of linear perceptrons has serious computational limitations. For example, a
linear perceptron is incapable of yielding a discriminant that will solve the exclusive-or
and parity-check problems. That is, the linear perceptron cannot automatically learn the
discriminant that will classify the even and odd patterns. This limitation of the linear
perceptron network is overcome by adding layers of nonlinear perceptrons. The resulting
network is often called the multi-layer perceptron neural network.
The MLP networks are feedforward networks with one or more layers of units
between the input and output nodes. The networks are termed feedforward because signals
flow only in the forward direction, starting with the inputs, and no connections feed back
to previous or current layers. A typical MLP network is shown in
Figure 1.1. The input layer contains dummy units which simply distribute the inputs to
the network. The output layer units are equivalent to the network discriminant functions.
Between them are the hidden layers. In the following, we introduce the MLP
network's structure.

Figure 1.1 The Multi-Layer Perceptron Network.

Consider the ith hidden unit in Figure 1.1; the inputs to this unit are the weighted
outputs from all previous layers. The net input, Xnet(i), for the ith unit is formulated as
follows

$$X_{net}(i) = \sum_{j} w(i,j)\,\varphi(j) + \theta(i) \qquad (1.8)$$

φ(j) is the activation output of the jth unit and θ(i) is a variable bias with a similar function
to a threshold. w(i,j) is the weight connecting the jth unit to the ith unit, and the
summation is over all units feeding into the ith unit. Figure 1.2 shows a model that
implements the idea.
Figure 1.2 Artificial Neuron Model.

As shown in Figure 1.2, G is an activation function which performs a transformation of
the net input and decides the output level of the ith unit, φ(i). G can be a linear function
(Figure 1.3(a)) such that

$$\varphi(i) = K \cdot X_{net}(i), \qquad K \text{ is a constant} \qquad (1.9)$$

or a squaring function (Figure 1.3(b))

$$\varphi(i) = X_{net}^{2}(i) \qquad (1.10)$$

or a threshold function (Figure 1.3(c))

$$\varphi(i) = \begin{cases} 1, & \text{if } X_{net}(i) \ge T \\ 0, & \text{otherwise} \end{cases} \qquad (1.11)$$

However, a function that more accurately simulates the nonlinear transfer characteristics
of the biological neuron and permits more general network functions is shown in
Figure 1.3(d). This is the most commonly used activation function and is called the
sigmoid function. This function is expressed mathematically as

$$\varphi(i) = \frac{1}{1 + e^{-X_{net}(i)}} \qquad (1.12)$$

This sigmoid activation has the feature of being nondecreasing and differentiable, and its
range is 0 ≤ φ(i) ≤ 1.
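Putting Eqs. (1.8) and (1.12) together, the output of a single hidden unit is a sigmoid of an affine combination of the outputs feeding it; a minimal sketch with arbitrary example weights and inputs:

```python
import numpy as np

# One hidden unit of an MLP: net input of Eq. (1.8) followed by the sigmoid of Eq. (1.12).
def unit_output(phi_prev, w, theta):
    x_net = w @ phi_prev + theta          # Xnet(i) = sum_j w(i,j) phi(j) + theta(i)
    return 1.0 / (1.0 + np.exp(-x_net))   # phi(i), bounded between 0 and 1

phi_prev = np.array([0.2, -1.0, 0.5])     # outputs of the units feeding unit i
w = np.array([0.4, -0.3, 0.8])            # example weights w(i, j)
print(unit_output(phi_prev, w, theta=0.1))
```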
There are several potential advantages of the MLP neural networks. Unlike the
polynomial filter or Gaussian classifier, no assumption is made about the underlying data
distribution when designing the MLP networks, so the data statistics do not need to be
estimated [16]. Second, the parallel structure of the MLP network makes it realizable on
parallel computers. Third, the MLP network exhibits a great degree of robustness or fault
tolerance because of built-in redundancy; damage to a few nodes or links thus need not
impair overall performance significantly. Fourth, the MLP network can form any
unbounded decision region in the space spanned by the inputs. Such regions include
convex polygons and unbounded convex regions [17]. Finally, the MLP networks have
a strong capability for function approximation.

Figure 1.3 Four Representative Activation Functions.

Additional major characteristics of the MLP network are its abilities to learn and
generalize. Learning can be viewed as producing a surface in multidimensional space that
fits the set of training data in some best sense. Generalization is learning which is limited
such that when new patterns are input into the network, they are processed (filtered or
classified) almost as well as the training input patterns were. If an MLP network has
enough free parameters, it can learn without generalization. In other words, it can
implement a multi-dimensional Lagrange interpolation [18] of the training data, but can
perform very poorly when applied to additional patterns. In Appendix A, we demonstrate
a training method for the MLP networks which is based upon the iterative steepest
descent algorithm. This is the so-called back-propagation (BP) learning algorithm [19].
Although the MLP network, with BP learning, has been used for many pattern
classification and filtering problems, it has many drawbacks. These include the following.
First, it is often unclear which network topology is required for the solution of a given
problem. For a given task, the number of hidden layers and hidden units are varied until
satisfactory results are obtained. Two methods have been proposed as remedies. One
consists of starting with a small network and expanding it [20]-[21]; the other consists of
starting with a large network and pruning it to a smaller size [22]-[23]. In both cases, the
training time is increased significantly over regular BP learning. Therefore, it is desirable
to find new techniques for determining the network topology.
Second, large scale MLP networks have extremely low training rates when BP is
used, since the networks are highly nonlinear in the weights and thresholds. The network
might become trapped in a local minimum of the error function being minimized.
Convergence to local minima can be caused by an improper setting of the initial weights
and thresholds, but there is presently no reliable method for initializing the network.
Third, there is no simple, effective theory which explains the behavior
and mapping capabilities of the MLP network. Although several researchers [24]-[27]
have demonstrated that sufficiently complex multilayer networks are capable of arbitrarily
accurate approximations to an arbitrary mapping, there is no rule to determine the optimal
number of hidden units Nh required for the solution of a given problem. In addition, the
methods to find a set of weights and thresholds to approximate the given function are still
unclear.
1.3 The Scope of the Dissertation
In this dissertation, we introduce polynomial basis functions (PBFs) and use them
to analyze and design the MLP neural networks which have analytic activation functions.
The basic tasks are (1) to develop a PBF model for the MLP network, (2) to find methods
for mapping polynomial filters and discriminants to the MLP network (forward mapping),
and (3) to find methods for calculating the PBF model from an existing MLP network
(inverse mapping).
be possible to freely convert many continuous polynomial functions from an
N-dimensional power series representation to an N-input MLP network, and vice versa.
This dissertation is divided into 6 chapters. In Chapter 2, three basis function
representations of the MLP networks are presented. First, the orthogonal and radial basis
function (RBF) networks are reviewed. Then, polynomial basis functions are introduced.
The approximation theorems for the MLP networks are proved based on the PBF model.
This is the main topic in Chapter 3. In Chapter 4, the mapping theorems are developed
and used to find practical methods for mapping polynomial functions to the MLP network.
Two kinds of networks, complete and compact networks, are demonstrated for forward
mappings. In Chapter 5, techniques are described for modelling an existing MLP network
by finite degree polynomial functions. Finally, the conclusions are given in Chapter 6.
The BP algorithm, the construction of monomial and product subnets, and several feature data
sets are presented in Appendices A, B and C, respectively.
Figure 1.4 Block Diagram of the Proposed Tasks.
CHAPTER 2
NETWORK BASIS FUNCTIONS
In recent years, several researchers have developed methods to analyze the
behavior of the MLP neural networks, in order to compare them to conventional
classifiers and filters. Nerrand [28] has trained recurrent neural networks to perform
nonlinear adaptive filtering and modelling. Klimasauskas [29] and Anderson [30]
discussed the use of the MLP networks for noise filtering and compared them with linear
Wiener filtering. Their technique is experimental, and has not led to an increased
theoretical understanding of the MLP network. Gallinari [31] has compared a linear MLP
network to a conventional discriminant analysis method which utilizes projections of the
input vectors onto optimal subspaces. Asoh [32] performed a regression analysis on the
MLP networks having nonlinear hidden units. The drawback to his analysis is that it
consists principally of empirical observations. Toshio [33] presents a multiple logistic
model to find the weights of the MLP network, which is based upon a maximum
likelihood method. Unfortunately, statistical network design methods, such as his, require
prior knowledge or information about training data characteristics.
Several basis vector approaches have been used to study the MLP networks. Fujita
[34] proposed to use output state vectors of hidden units as internal representations of the
MLP networks. His technique, called the Orthogonal Complement Method, allows one to
estimate the necessary number of hidden units from the dimension of the subspace
spanned by the input state vectors. His approach is useful for designing binary output
systems. Unfortunately, the required number of computations increases exponentially with
the network size. Sandberg [35] proved a universal approximation theorem for radial
basis function (RBF) networks. However, RBF networks have only one hidden layer, and
have a very specific type of activation function. Therefore, analyses of RBF networks are
of limited applicability to the more general MLP networks. In this chapter, our goal is to
introduce a polynomial basis vector representation for the MLP networks. First, we
review the orthogonal basis network and the RBF network. Then we introduce the more
general concept of polynomial basis functions (PBFs).
2.1 Orthogonal Basis Functions
Qian and Lee [36] have designed the MLP networks using a set of orthogonal
basis functions. That is, the network output is expressed as

$$\hat{f}(\mathbf{x}) = \sum_{i} \Omega_i\, \Psi_i(\mathbf{x}) \qquad (2.1)$$

and Ω is the weight vector. If f(x) is the desired output function, then the best
approximation is obtained by the least mean square minimization of the error function

$$E = \frac{1}{2} \sum_{j=1}^{N_v} \left[ f(\mathbf{x}^{(j)}) - \hat{f}(\mathbf{x}^{(j)}) \right]^2 \qquad (2.2)$$

Nv is the number of training vectors. The summation in Eq. (2.2) can be
approximated by an integration when the training set contains a large number of vectors,

$$E \approx \frac{1}{2} \int \left[ f(\mathbf{x}) - \hat{f}(\mathbf{x}) \right]^2 p(\mathbf{x})\, d\mathbf{x} = \frac{1}{2} \left\langle \left[ f(\mathbf{x}) - \hat{f}(\mathbf{x}) \right]^2 \right\rangle \qquad (2.3)$$

Here, p(x) is the probability distribution of x. It can be proved [37]-[38] that Ω can be
found as Ω = R⁻¹⟨f(x)Ψ(x)⟩, where R is the correlation matrix of the Ψi(x)'s. Inversion of the
R matrix is practical if R is an orthogonal matrix or identity matrix. Then the weight
vector is Ω = ⟨f(x)Ψ(x)⟩. The problem left is how to find a set of orthogonal basis
functions. For the one-dimensional case, Qian [36] performed a variable change, dµ =
p(x)dx, such that

$$\int \Psi_i(x)\,\Psi_j(x)\, p(x)\, dx = \delta_{ij} \;\;\Rightarrow\;\; \int \Psi_i(\mu)\,\Psi_j(\mu)\, d\mu = \delta_{ij} \qquad (2.4)$$

There are several sets of orthogonal basis functions Ψ(µ) available for finite support, such
as Ψ(µ) = 1, cos(πµ), cos(2πµ), cos(3πµ), ..., which are defined on [0,1]. However, their
results are difficult to extend to higher-dimensional problems.
2.2 Radial Basis Functions
A RBF network [35],[39]-[42] can be regarded as a single hidden layer MLP
network, in which the output is a linear function of the hidden unit outputs. In the RBF
network, the network output function f(x) for the input vector x is represented by

$$f(\mathbf{x}) = \alpha_0 + \sum_{i=1}^{N_h} \alpha_i\, \Omega_i(\|\mathbf{x} - \mathbf{x}_i\|) \qquad (2.5)$$

where α0 is an additive bias and αi (i ≥ 1) represents a weight from the ith hidden unit
to the output. By clustering training patterns x into Nh clusters, xi will be taken as the
mean vector of each cluster and is known as the RBF center. Typically Ωi(·) is chosen
as a Gaussian function,

$$\Omega_i(\|\mathbf{x} - \mathbf{x}_i\|) = \exp\!\left( -\sum_{j} \frac{(x_j - x_{ij})^2}{2\sigma_{ij}^2} \right) \qquad (2.6)$$

In Eq. (2.6), ‖·‖ denotes a norm which is usually taken to be Euclidean. The σij's are the
elements of a covariance matrix, which is taken to be diagonal. The representation of the
MLP network via radial basis functions is shown in Figure 2.1.
There are three steps in the design of the RBF network. First, we pick a
representative set of training vectors. Second, one hidden unit is chosen for each training
vector. Finally, we find the best set of output weights to approximate the desired output.
Because of the linear dependence of the network output on the weights in the RBF
expansion of Eq. (2.5), a global minimum exists in the error function for the RBF
network. Therefore, the adjustable output weights, αi, can be determined using the linear
least squares method. This is an important advantage of this approach.
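Because the output of Eq. (2.5) is linear in the weights αi, the final design step reduces to an ordinary linear least squares problem; a sketch follows, with Gaussian basis functions, arbitrary widths, and a crude choice of centers, all of which are placeholders for this illustration.

```python
import numpy as np

# Sketch of the RBF design: Gaussian hidden units (Eq. 2.6 with a shared width)
# and output weights found by linear least squares, using the linearity of Eq. (2.5).
def rbf_design(X, d, centers, sigma):
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-dist2 / (2.0 * sigma ** 2))           # hidden unit outputs
    Phi = np.hstack([np.ones((len(X), 1)), Phi])        # prepend bias alpha_0
    alpha, *_ = np.linalg.lstsq(Phi, d, rcond=None)     # least squares output weights
    return alpha

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))                    # training vectors
d = X[:, 0] * X[:, 1]                                   # desired outputs (example)
centers = X[::10]                                       # crude choice of RBF centers
print(rbf_design(X, d, centers, sigma=0.5))
```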
The RBF network represented by Eq. (2.5) has many useful qualities, including
fast learning and ease of design. However, for certain classes of problems, the RBF
approach may not be a good strategy. First, the number of RBF centers (hidden units) is
much greater than the number of hidden units used in an MLP network designed from the
same training data. There is usually considerable redundancy in the network's hidden
units. Second, RBF networks work only when the centers are well chosen. In practice the
centers are often arbitrarily selected from sampling data [43]. Such a mechanism is clearly
unsatisfactory. Third, the large number of centers required results in long training time
for the output weights. Here we suggest another set of basis functions for modelling the
MLP networks.
Figure 2.1 MLP Network Representation via Radial Basis Functions.
2.3 Polynomial Basis Functions
The goal here is to present a PBF model for the MLP neural networks. The PBF
model solves the problems of the RBF network. The advantages are
(1). The PBF model, applied to single and multiple hidden layer networks, is general
enough to describe both the MLP and RBF networks.
(2). The PBF model leads to the approximation theorems for the MLP networks.
(3). The PBF model leads to straightforward mappings between the MLP networks
and conventional filtering and classification algorithms.
(4). The PBF model results in finding the polynomial approximation of an existing
MLP network.
As with the RBF approach, we have one polynomial basis function for each hidden
unit. Assume that the hidden unit activations are analytic functions, such as sigmoid
functions. Then the activation of the ith unit in the network can be modeled as a power
series with integer degree p(i) [44],

$$\varphi(i) = \sum_{j=0}^{p(i)} A(i,j)\,(X_{net}(i) - X_0(i))^j + E(i,\xi)\,(X_{net}(i) - X_0(i))^{p(i)+1} \qquad (2.7)$$

for

$$\left| X_{net}(i) - X_0(i) \right| \le M(i) \qquad (2.8)$$

The net input, Xnet(i), of the ith unit is

$$X_{net}(i) = \sum_{k} \varphi(k)\, w(i,k) + \theta(i) \qquad (2.9)$$

where index k is for all the hidden units or input units feeding the ith hidden unit from
the previous layers. For input units, p(i) = 1 and φ(i) = xi, where xi denotes the ith
component of an N-dimensional input vector x = [x1, x2, ..., xN]^T. φ(k), k > N, is the
activation output of the (k−N)th hidden unit, w(i,k) is the synaptic weight between them,
and θ(i) is an additive bias. A(i,j) is the Taylor series coefficient of the jth power term
in the ith unit,

$$A(i,j) = \frac{G^{(j)}(X_0(i))}{j!} \qquad (2.10)$$

where G^(j)(·) is the jth derivative of the analytic activation function. E(i,ξ) is the
remainder term (ξ is somewhere between Xnet(i) and the expansion point X0(i)). Good
choices for M(i) (radius of convergence), p(i), and X0(i) allow accurate approximations
for a wide variety of activation functions.
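For the sigmoid activation, the coefficients A(i,j) of Eq. (2.10) can be generated symbolically; a minimal sketch follows, where the expansion point X0 and the degree p are arbitrary choices made only for the example.

```python
import sympy as sp

# Taylor coefficients A(i,j) = G^(j)(X0)/j! of Eq. (2.10) for a sigmoid unit,
# and the degree-p polynomial activation model of Eq. (2.7) (remainder dropped).
x = sp.Symbol('x')
G = 1 / (1 + sp.exp(-x))                       # sigmoid activation, Eq. (1.12)

def taylor_coefficients(X0, p):
    return [float(sp.diff(G, x, j).subs(x, X0)) / float(sp.factorial(j))
            for j in range(p + 1)]

def activation_model(x_net, X0, coeffs):
    return sum(a * (x_net - X0) ** j for j, a in enumerate(coeffs))

A = taylor_coefficients(X0=0.0, p=5)           # 0.5, 0.25, 0.0, -1/48, ...
print(A)
print(activation_model(0.3, 0.0, A), 1.0 / (1.0 + sp.exp(-0.3)))
```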
The output of the MLP network can be characterized as the weighted sum of the
polynomial basis vectors (Figure 2.2)

$$f(\mathbf{x}) = \sum_{i=1}^{N_u} w_o(i)\,\varphi(i) \qquad (2.11)$$

where Nu = 1 + N + Nh and Nh denotes the number of hidden units in the network.
Equations (2.7), (2.9) and (2.11), with the degrees p(i), form a condensed model of the
MLP network, in which the output is a weighted sum of compositions of polynomials.

Substituting Eq. (2.9) into Eq. (2.7), and multiplying out the compositions, the
activation of the ith unit can be written as

$$\varphi(i) \approx C(i)\,\mathbf{X} \qquad (2.12)$$

C(i) is a coefficient vector and

$$\mathbf{X} = [X_{01}, X_{11}, \ldots, X_{12}, X_{22}, \ldots, X_{31}, \ldots]^T \qquad (2.13)$$

The vector X has elements Xkm, which denote the mth one-term polynomial of degree k
in the variables xj for j = 1 to N. For example, X01 = 1, X1m = xm, and X2m denotes the
terms x1², x2², x1x2, x1x3, etc. Given the dimension N of the input vector, the number of
degree-k terms is (k+N−1)!/(k!(N−1)!) and the total number of terms in X is

$$L = \sum_{k=0}^{P} \frac{(k+N-1)!}{k!\,(N-1)!} \qquad (2.14)$$

where P is the highest degree in X.
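Eq. (2.14) is a sum of binomial coefficients and is easy to evaluate; a small sketch, with N and P chosen to match examples used elsewhere in the dissertation:

```python
from math import comb

# Total number of one-term polynomials of degree 0..P in N variables, Eq. (2.14):
# L = sum_k (k+N-1)! / (k! (N-1)!) = sum_k C(k+N-1, k).
def num_terms(N, P):
    return sum(comb(k + N - 1, k) for k in range(P + 1))

print(num_terms(2, 3))    # N = 2, P = 3  -> 10 terms (the ten terms of Table 4.1)
print(num_terms(4, 2))    # N = 4, P = 2  -> 15 terms
```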
Definition : A polynomial basis function is an N-variable polynomial which
approximates a hidden unit's activation output. The N variables are the
network inputs.
Substituting Eq. (2.12) into Eq. (2.11),

$$f(\mathbf{x}) = \mathbf{W}_o^T\, \mathbf{C}\, \mathbf{X} \qquad (2.15)$$

Wo is an Nu by 1 output weight vector. The C matrix is an Nu by L coefficient matrix

$$\mathbf{C} = \begin{bmatrix} C(1,1) & C(1,2) & C(1,3) & \cdots & C(1,L) \\ C(2,1) & C(2,2) & C(2,3) & \cdots & C(2,L) \\ \vdots & & & & \vdots \\ C(N_u,1) & C(N_u,2) & C(N_u,3) & \cdots & C(N_u,L) \end{bmatrix} \qquad (2.16)$$

The ith row of the C matrix is the vector C(i) in Eq. (2.12). Since L, the column
dimension of the C matrix, can be very large, the calculation and storage of the
C matrix can be prohibitive; therefore, we call Eq. (2.15) an exhaustive polynomial basis
function model of f(x).
Figure 2.2 MLP Network Representation via Polynomial Basis Functions.
CHAPTER 3
MLP APPROXIMATION THEOREMS
The approximation capabilities of the MLP networks have been proven by many
investigators [24]-[27]. They have demonstrated that sufficiently complex MLP
networks are capable of accurate approximations to arbitrary continuous mappings over
a bounded compact set. A mathematical result of Kolmogorov [45] has been interpreted
as saying that for any continuous mapping, there exists a three-layer MLP network which
realizes it. These results indicate that the MLP network provides a very powerful tool for
realizing nonlinear mappings for filtering, control, and pattern classification.
Unfortunately, these investigators have not
(1). Given procedures for determining the number of hidden units, Nh, required for the
solution of a given problem,
(2). Given a technique for finding the network weights, or
(3). Given simple proofs of the approximation capabilities.
A solution to problem (1) is critical. If the number of units in a hidden layer is
too large (over-determined case), the network can memorize the training data and perform
poorly at generalization tasks. If the number of units is too small (under-determined case),
recall accuracy will suffer and the network may fail to extract the desired relationship
from the training data. In this chapter, we give an approximation theorem and propose a
constructive proof for it, which uses the concept of the polynomial basis functions. This
proof solves the problem of choosing the network structure in the design of the MLP networks.
First, the theorem is stated as follows.
Theorem 3.1: Any continuous function defined over a bounded compact set can be
approximated using an MLP neural network with hidden units having the
activation function

$$G(X_{net}(i)) = \sum_{j=0}^{p(i)} A(i,j)\,(X_{net}(i) - X_0(i))^j + E(i,\xi)\,(X_{net}(i) - X_0(i))^{p(i)+1} \qquad (3.1)$$

where all terms A(i,j) are nonzero for j between 2 and p(i).
Proof : The proof consists of three steps. In section 3.1, we review the Weierstrass
approximation theorem and multi-dimensional orthonormal polynomials. The Weierstrass
theorem shows the existence of multivariate approximating polynomials, and the
orthonormal polynomials provide a concrete method for generating an approximating
polynomial. In section 3.2, we show that each term in the approximating polynomial,
which is a multi-input product, can be realized by a subnet in an MLP network.
3.1 Polynomial Approximating of Functions
As the first step of the proof, we briefly review polynomial approximations for
single variable functions, and then review the multivariate case.
3.1.1 Approximating Functions of One Variable
According to the Weierstrass approximation theorem [46], "Any bounded function
F(x) can be uniformly approximated over a closed interval [a,b] by a polynomial f(x),

$$\left| F(x) - f(x) \right| \le \epsilon, \qquad a \le x \le b \qquad (3.2)$$

where ε is a positive real number", the approximating polynomial f(x) can be found as
a weighted sum of Pth degree orthogonal polynomial basis functions [47]

$$f(x) \approx \sum_{n=0}^{P} c_n\, \varphi_n(x) \qquad (3.3)$$

where

$$\varphi_n(x) = \sum_{i=0}^{n} a_i\, x^i \qquad (3.4)$$

for i = 0, 1, 2, .... The approximation mean square error can be written as

$$Err = \sum_{j=1}^{N_s} u(x^{(j)}) \left[ F(x^{(j)}) - f(x^{(j)}) \right]^2 \qquad (3.5)$$

where Ns is the total number of sampling points of F(x) and the superscript j denotes different
sampling points. u(x) is the weighting function associated with the kind of
orthogonal polynomials being used, such that

$$\int_{-\infty}^{\infty} u(x)\,\varphi_i(x)\,\varphi_j(x)\, dx = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases} \qquad (3.6)$$

As an example, the first five Legendre polynomials [48] are

$$\varphi_0(x) = 1, \quad \varphi_1(x) = x, \quad \varphi_2(x) = \tfrac{3}{2}x^2 - \tfrac{1}{2}, \quad \varphi_3(x) = \tfrac{5}{2}x^3 - \tfrac{3}{2}x, \quad \varphi_4(x) = \tfrac{35}{8}x^4 - \tfrac{15}{4}x^2 + \tfrac{3}{8} \qquad (3.7)$$

and the weighting function is

$$u(x) = \begin{cases} 1, & -1 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (3.8)$$

The mean square error can be minimized in the sense of the least square error
approach [49] and the coefficients cn are calculated as

$$c_n = \frac{\displaystyle\sum_{j=1}^{N_s} u(x^{(j)})\,\varphi_n(x^{(j)})\,F(x^{(j)})}{\displaystyle\sum_{j=1}^{N_s} u(x^{(j)})\,\varphi_n^2(x^{(j)})} \qquad (3.9)$$

As a consequence of the Weierstrass theorem, the set of polynomials {φi(x)} is complete
in the sense that for any continuous function F(x), Err tends to zero when P → ∞.
As a second example, we use the Laguerre polynomials [49] (the weighting function
u(x) is e^(−x) for 0 ≤ x < ∞) to approximate the function F(x) = e^(−2x), for x ∈ [0,∞). The
resulting approximating function is

$$F(x) \approx f(x) = \tfrac{1}{3}\varphi_0(x) + \tfrac{2}{9}\varphi_1(x) + \tfrac{2}{27}\varphi_2(x) + \tfrac{4}{243}\varphi_3(x) \qquad (3.10)$$

where the φi(x)'s are the Laguerre polynomials

$$\varphi_0(x) = 1, \quad \varphi_1(x) = 1 - x, \quad \varphi_2(x) = x^2 - 4x + 2, \quad \varphi_3(x) = -x^3 + 9x^2 - 18x + 6 \qquad (3.11)$$
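The truncated expansion of Eqs. (3.10)-(3.11) can be checked numerically; the sketch below compares the degree-3 approximation with F(x) = e^(−2x) at a few sample points (accuracy is only modest here because the series is cut off after four terms).

```python
import numpy as np

# Degree-3 Laguerre approximation of F(x) = exp(-2x), Eqs. (3.10)-(3.11).
phi = [lambda x: 1.0,
       lambda x: 1.0 - x,
       lambda x: x**2 - 4.0*x + 2.0,
       lambda x: -x**3 + 9.0*x**2 - 18.0*x + 6.0]
c = [1.0/3.0, 2.0/9.0, 2.0/27.0, 4.0/243.0]        # expansion coefficients

def f_approx(x):
    return sum(cn * pn(x) for cn, pn in zip(c, phi))

for x in [0.0, 0.5, 1.0, 2.0]:
    print(x, f_approx(x), np.exp(-2.0 * x))         # truncated series vs. F(x)
```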
3.1.2 Approximating Functions of Many Variables
Extension of the single variable function to the multiple variables case is
straightforward. The Stone-Weierstrass theorem [46] says that continuous multivariate
functions can be approximated by weighted combinations of continuous univariate
functions. Suppose that we have a complete system of orthonormal functions of one
variable, φ0(x), φ1(x), φ2(x), ..., over a bounded interval a ≤ x ≤ b. Then a complete
system of orthonormal functions of N variables, x1, x2, ..., xN, may be constructed by
taking N-tuples (products) of functions from the one-variable set and substituting the
variables, x1, x2, ..., xN, in the arguments. For instance, suppose that we want to construct
five Legendre orthogonal functions of three variables (N = 3). From the above discussion
we have

$$\begin{aligned}
\Psi_0(x_1,x_2,x_3) &= \varphi_0(x_1)\varphi_0(x_2)\varphi_0(x_3) = 1 \\
\Psi_1(x_1,x_2,x_3) &= \varphi_0(x_1)\varphi_0(x_2)\varphi_1(x_3) = x_3 \\
\Psi_2(x_1,x_2,x_3) &= \varphi_0(x_1)\varphi_1(x_2)\varphi_0(x_3) = x_2 \\
\Psi_3(x_1,x_2,x_3) &= \varphi_1(x_1)\varphi_0(x_2)\varphi_0(x_3) = x_1 \\
\Psi_4(x_1,x_2,x_3) &= \varphi_0(x_1)\varphi_1(x_2)\varphi_1(x_3) = x_2 x_3
\end{aligned} \qquad (3.12)$$

If the original functions are orthonormal in the interval a ≤ x ≤ b, the resulting N-variable
functions, Ψ0(x), Ψ1(x), ..., ΨN(x), are orthonormal over the hypercube a ≤ xi ≤ b, i =
1, 2, ..., N [47], i.e.

$$\int_{x_1=a}^{b} \int_{x_2=a}^{b} \cdots \int_{x_N=a}^{b} u(\mathbf{x})\,\Psi_i(\mathbf{x})\,\Psi_j(\mathbf{x})\, d\mathbf{x} = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases} \qquad (3.13)$$

where

$$u(\mathbf{x}) = u(x_1, x_2, \ldots, x_N), \qquad \Psi_i(\mathbf{x}) = \Psi_i(x_1, x_2, \ldots, x_N) \qquad (3.14)$$

After we set up the basis functions Ψ0(x), Ψ1(x), ..., ΨN(x), the corresponding
coefficient ci for Ψi(x) can be found by using Eq. (3.9).
3.2 Realization of Multi-Input Products
In the previous section, we gave a review of N-dimensional approximating
polynomials and demonstrated how they can be constructed from sets of one-dimensional
orthonormal polynomials. Assume that such an approximating polynomial with dimension
N and degree P has been found, and is represented as

$$f(\mathbf{x}) = \mathbf{A}_f^T\, \mathbf{X} \qquad (3.15)$$

where Af is a coefficient vector with dimension L (from Eq. (2.14)). Comparing Eq.
(2.15) with Eq. (3.15), the output weight vector can be found by inverting the C matrix.
That is,

$$\mathbf{W}_o^T\, \mathbf{C}\, \mathbf{X} = \mathbf{A}_f^T\, \mathbf{X} \qquad (3.16)$$

and

$$\mathbf{W}_o^T = \mathbf{A}_f^T\, \mathbf{C}^{-1} \qquad (3.17)$$

if the C matrix has rank L and is made to be square. To prove that a C matrix of rank L
exists, we show that a multi-input product (and therefore elements of X) can be
constructed with an MLP network having one hidden layer. After showing Rank[C] = L,
making the C matrix square is then simply a matter of discarding linearly dependent rows.
In Appendix B, methods are given for designing the MLP networks which
approximate monomial functions and products of two inputs. In the following, we show
that a product of k terms can be generated using an element which realizes x^k and from
elements which realize the product of (k−1) terms. It can then be shown by induction that
products of k variables can be generated in one hidden layer.

Let g(x1, x2, ..., xN) = (x1 + x2 + ... + xN)^N, which is realizable using a monomial
subnet (Figure 3.1).

Lemma 3.1 : Let h(x1, x2, ..., xN) denote the function g(x1, x2, ..., xN) with one or more
variables, xi, replaced by one. Then all terms in g are present in h,
except terms with the variable xi, which are reduced in degree by at least
1.
For example,

$$g(x_1,x_2) = x_1^2 + x_2^2 + 2x_1x_2, \qquad h(x_1,x_2) = g(1,x_2) = 1 + x_2^2 + 2x_2 \qquad (3.18)$$

Since the x2² term does not have an x1, which has been set to 1, it is present in g(x1,x2)
and g(1,x2). The next theorem shows that products of N inputs can be constructed using
the monomial of degree N and products of (N−1) terms.
Figure 3.1 Monomial Subnet with Multiple Inputs.
Theorem 3.2: Let hi(x1, x2, ..., xN), i = 1 to N−1, represent the sum of all functions
g(x1, x2, ..., xN) having i variables set to 1. Then the function

$$g(x_1,x_2,\ldots,x_N) + \sum_{i=1}^{N-1} (-1)^i\, h_i(x_1,x_2,\ldots,x_N) = K\, x_1 x_2 \cdots x_N + \text{terms of degree } N-1 \text{ or less} \qquad (3.19)$$

where K is a non-zero constant.

Proof : Here we use Lemma 3.1 repeatedly. First, every term in g having degree N
and N−1 variables (one variable is squared and one other is absent) can be removed as

$$g(x_1,x_2,\ldots,x_N) - h_1(x_1,x_2,\ldots,x_N) \qquad (3.20)$$

However, this operation subtracts all terms having degree N in N−2 variables N−2 times
instead of the required 1 time. This is corrected by adding back h2(x1, x2, ..., xN).
However, this adds back terms of degree N with N−3 variables too many times.
Continuing this process, we get the result through induction.

For example, Eq. (3.19) is written for the 3-input case as

$$(x_1+x_2+x_3)^3 - (1+x_1+x_2)^3 - (1+x_2+x_3)^3 - (1+x_1+x_3)^3 + (1+1+x_1)^3 + (1+1+x_2)^3 + (1+1+x_3)^3 = 6x_1x_2x_3 - 6x_1x_2 - 6x_1x_3 - 6x_2x_3 + 6x_1 + 6x_2 + 6x_3 + 21 \qquad (3.21)$$

Thus, the 3-input product, x1x2x3, can be realized using the third degree monomial and 2-input product subnets.
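Theorem 3.2 can be checked symbolically for the 3-input case; the following sketch expands the left side of Eq. (3.21) and confirms that all cubic terms other than x1x2x3 cancel.

```python
import sympy as sp

# Symbolic check of Eq. (3.21): the alternating sum of monomial subnets leaves
# 6*x1*x2*x3 plus terms of degree 2 or less (Theorem 3.2 with N = 3).
x1, x2, x3 = sp.symbols('x1 x2 x3')
g = (x1 + x2 + x3)**3
h1 = (1 + x1 + x2)**3 + (1 + x2 + x3)**3 + (1 + x1 + x3)**3   # one variable set to 1
h2 = (1 + 1 + x1)**3 + (1 + 1 + x2)**3 + (1 + 1 + x3)**3       # two variables set to 1
print(sp.expand(g - h1 + h2))
# -> 6*x1*x2*x3 - 6*x1*x2 - 6*x1*x3 - 6*x2*x3 + 6*x1 + 6*x2 + 6*x3 + 21  (up to ordering)
```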
Corollary : The operation in Theorem 3.2 requires a number of hidden units equal to

$$N_1(N) = 1 + \sum_{k=1}^{N-1} \binom{N}{k} \qquad (3.22)$$

Theorem 3.3: The product of N bounded inputs can be realized in an MLP network
having one hidden layer of units having the activation of Eq. (3.1). The
required number of hidden units is

$$N_u(N) = N_1(N) + \sum_{k=2}^{N-1} N_2(k,N)\, N_u(k) \qquad (3.23)$$

where N2(k,N) = (k+N−1)!/(k!(N−1)!).

Proof : There are N2(k,N) terms of degree k which can be constructed using a set of N
inputs. Nu(N) then equals N1(N) plus Nu(k) units for each possible term of degree k, for k
= 2 to N−1.
3.3 Completion of the Proof
From Theorem 3.3, each 2nd or higher degree term in X can be closely
approximated by a product subnet composed of several monomial subnets. The function
f(x) in Eq. (3.15) is then approximated by the MLP network, by taking the weighted sum
of the outputs of all the subnets, as shown in Figure 3.2. Therefore the proof is complete.
Figure 3.2 Construction of f(x) by Subnet Approaches.
CHAPTER 4
FORWARD MAPPINGS
In the previous chapter, we have proved the approximation theorem (Theorem 3.1),
which states that it is possible to map N-input degree-P polynomials to the MLP network.
In this chapter we discuss practical methods for performing such mappings. There are
several advantages associated with such forward mappings. First, the mappings provide
good initial weights for the MLP network. The BP learning algorithm can then be used
to improve upon this initial solution. Second, the mapping approach leads to specific
network topologies. A block diagram of the forward mapping methodology is shown in
Figure 4.1.
Assume that we want to map a function of N variables to an MLP network.
Following Figure 4.1, the first step is to obtain an N-variable polynomial expansion of the
function, using orthonormal polynomials. Using the subnet approach of Appendix B or
the mapping theorems which are developed later, one or more terms of the polynomial
expansion is realized as a subnet. Redundant units, and the corresponding linearly
dependent PBFs, are removed. The final network may then be improved through BP
learning. In this chapter, two approaches for performing this forward mapping are
discussed.
4.1 Complete Networks
In this section, our goal is to describe a simple approach for mapping a given
Figure 4.1 Block Diagram of Forward Mapping.
polynomial function to a complete network. The complete network is defined as follows.
Definition : A complete network of degree P and N inputs is an MLP network which has
L hidden units, with no redundant units (the rank of the output coefficient
matrix C is L).
In principle, complete networks with one hidden layer can be designed by following the
procedure in the proof of Theorem 3.1. This involves the construction of one product
subnet for each term of second or higher degree in X. The redundant, linearly dependent,
hidden units are removed so that Nh = L − N − 1.
Theorem 4.1: A complete network with UB = (L−N−1) hidden units is capable of
approximating any N-input, degree-P polynomial f(x) = Af^T X, where Af is
a coefficient vector with dimension L as in Eq. (2.14). Here UB denotes
upper bound.
Proof : The constructive proof in Chapter 3 shows that each term of X can be closely
represented by one subnet. However, the C matrix formed by the expansion of all the
subnets has exactly rank L. This implies that some rows of the C matrix are linearly
dependent on others and can be discarded until there are L independent vectors left. In
this case, each term is completely described by one PBF (or one hidden unit) and the
function f(x) is a unique linear combination of a set of linearly independent PBFs. Of
these L rows in the C matrix, N of them model the input units, which contribute to the
final output through direct connections. One row models the effects of thresholds. This
leaves only (L−N−1) units to represent the approximating function f(x).
Assume that we are to implement a quadratic function of N variables or features,
as is used in the Bayes Gaussian classifier [6]-[7]. Each second order product of the N
features must be realized. These include N squared terms and N(N−1)/2 cross products.
From Appendix B, squares can be realized by 1-1-1 subnets and the cross products can
be realized as 2-3-1 subnets. This requires Nh = N + 3N(N−1)/2 units. Since each
product subnet generates redundant monomial subnets, the redundant monomial subnets
can be removed. This results in Nh = N(N+1)/2. Note that the number of hidden units
equals the number of second degree terms in the quadratic polynomial of N variables. An
example of mapping the quadratic polynomial of 4 variables, before and after removing
redundant units, is shown in Figure 4.2 and Figure 4.3.
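A quick arithmetic check of these counts: for a 16-feature quadratic function (the size of the data sets used in Section 4.1.2), N(N+1)/2 agrees with the L − N − 1 bound of Theorem 4.1.

```python
from math import comb

N = 16                                          # number of features (inputs)
direct = N * (N + 1) // 2                       # squares plus cross products
L = sum(comb(k + N - 1, k) for k in range(3))   # Eq. (2.14) with P = 2
print(direct, L - N - 1)                        # both 136, the UB used in Section 4.1.2
```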
It is possible to extend this process of removing redundant units to higher degree
polynomials. Another example shown here is to realize the function

$$f(\mathbf{x}) = a x_1^3 + b x_1^2 x_2 + c x_1 x_2^2 + d x_2^3 + e x_1^2 + g x_1 x_2 + h x_2^2 \qquad (4.1)$$

as a single hidden layer MLP network, using the developed theorem. Table 4.1 lists the
required hidden units for each term, from Theorem 3.3. In total, there are 41 hidden units
required to map f(x) by this subnet approach.

Figure 4.2 Subnet Approach for Mapping a 4-Input Second Degree Polynomial.
Figure 4.3 The 4-Input Second Degree Complete Network.
Table 4.1 Subnet Approach for Realizing a Function with N = 2 and P = 3.

Terms      Required Units      Terms      Required Units
x1^3       2                   x1x2       3
x1^2 x2    16                  x2^2       1
x1 x2^2    16                  x1         0
x2^3       2                   x2         0
x1^2       1                   constant   0

Since the product subnet for x1x2 generates redundant monomial subnets (x1^2 and x2^2), the
monomial subnets for x1^2 and x2^2 can be removed. In addition, product terms like x1^2 x2 and
x1 x2^2 can be realized from the existing monomial subnets (from the product subnet) and direct
connections from the inputs. Finally, 7 hidden units are necessary for mapping the function
f(x). The number of hidden units equals the number of second degree terms (3 terms) and
third degree terms (4 terms) in Eq. (4.1).
Lemma 4.1 : A complete network of degree P and N inputs is capable of realizing any
number of additional polynomials (with the same N and P), without an
increase in the number of hidden units.
Proof : From the proof of Theorem 4.1, complete networks already have one hidden unit
for each term in X (second or higher degree) of the function. Therefore, new functions
are easily realized by connecting the existing (L−N−1) hidden units, N inputs, and a bias
term directly to the new output node.
In summary, the design of complete networks having one hidden layer proceeds
in the following steps (a numerical sketch of Step 4 follows the list):
Step 1. Given the polynomial function to be mapped, calculate UB, the required
number of hidden units.
Step 2. Initialize input weights for each hidden unit (Appendix B).
Step 3. Eliminate redundant units.
Step 4. Calculate output weights using Eq. (3.17).
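Step 4 is a linear algebra problem once the C matrix of the pruned network is known; below is a small sketch in which a made-up square, full-rank C stands in for the coefficient matrix that would actually come from the PBF expansion of the subnets.

```python
import numpy as np

# Step 4 of the complete-network design: output weights from Eq. (3.17),
# Wo^T = Af^T C^(-1), solved here as C^T Wo = Af without forming the inverse.
def output_weights(C, Af):
    return np.linalg.solve(C.T, Af)

rng = np.random.default_rng(2)
L = 6
C = rng.normal(size=(L, L))            # stand-in square coefficient matrix of rank L
Af = rng.normal(size=L)                # coefficients of the target polynomial f(x)
Wo = output_weights(C, Af)
print(np.allclose(Wo @ C, Af))         # check: Wo^T C = Af^T
```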
The final complete network is far more efficient than the original network
composed of subnets, because it requires one hidden unit for each of the L terms in X.
However, there are two drawbacks to the complete network. First, a network must be
designed with far more than (L−N−1) hidden units, and then pruned of its linearly
dependent units. Second, some units generate very high degree products, which
result in large weight values. We propose a multi-layer complete network which requires
the same number of hidden units.
4.1.1 Multi-Layer Complete Networks
In the design of the multi-layer complete network, we assume that each subnet is
used to implement a squaring x² (1 hidden unit required) or product xixj (3 hidden units
required) operation. The layers are numbered starting with n = 1 for the input layer. For
the second and higher layers (n ≥ 2), the hidden units generate terms of degree k, where
k falls between (1 + 2^(n−2)) and 2^(n−1). Define ⌈·⌉ as a ceiling function, e.g. ⌈1.1⌉ = 2 and ⌈2.9⌉
= 3. We state the following lemmas to realize a polynomial function using the multi-layer
complete network.
Lemma 4.2 : Given a functionf(x) with maximum degreeP, it can be realized in the
multi-layer complete networkwith log2P hidden layers.
As an example, the multi-layer complete network which realizes the function

    f(x) = A0 + (x1 + x2) + (x1 + x2)^2 + (x1 + x2)^3 + (x1 + x2)^4        (4.2)

having two hidden layers, is shown in Figure 4.4. The UB number of hidden units for an N-dimensional, degree-P function is the same for both single-layer and multi-layer complete networks. However, if f(x) has some missing terms, and therefore a sparse Af vector, multi-hidden-layer topologies are sometimes more efficient. As a more extreme example, when we design a single product term of N inputs, the multi-layer complete network is more efficient than the single-layer complete network. Several lemmas are developed next to give the required hidden layers and units for product terms.
Lemma 4.3 : An N-variable product (all variables having first degree only) can be realized with ceil(log2 N) hidden layers.

Lemma 4.4 : The number of hidden units in each hidden layer in Lemma 4.3 is

    Nh(j) = 3 ceil( N / 2^j - 0.5 ) ,    j = 1, 2, ..., ceil(log2 N)        (4.3)

The example for a 7-input product term, x1 x2 ... x7, is constructed as shown in Figure 4.5.

Figure 4.4 The Multi-Layer Complete Network for Realizing a Function with N = 2 and P = 4.

As a result of Lemmas 4.3 and 4.4, comparisons for realizing a single product term using single- and multi-layer networks are listed in Table 4.2. Table 4.2 reveals that the multi-layer network does have an advantage in this case.
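The multi-layer column of Table 4.2 follows directly from Lemmas 4.3 and 4.4. A minimal sketch (Python, for illustration only) evaluates Eq. (4.3); the single-layer counts come from Theorem 3.3 of the previous chapter and are not reproduced here:

    import math

    def multilayer_product_units(N):
        """Hidden units per layer for an N-input product, per Eq. (4.3)."""
        layers = math.ceil(math.log2(N))
        per_layer = [3 * math.ceil(N / 2**j - 0.5) for j in range(1, layers + 1)]
        return per_layer, sum(per_layer), layers

    for N in range(3, 9):
        print(N, multilayer_product_units(N))
    # Reproduces the multi-layer column of Table 4.2, e.g.
    # N = 7 -> (9, 6, 3) units in 3 layers, 18 units total; N = 8 -> 21 units in 3 layers.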
Lemma 4.5 : An N-variable product in which m of the variables have degrees (n1, n2, ..., nm) can be realized with 1 + ceil(log2 K) hidden layers, where

    K = m + (N - m)/2 ,        when N - m is even
    K = m + (N - m + 1)/2 ,    when N - m is odd        (4.4)

Lemma 4.6 : The number of hidden units in each hidden layer in Lemma 4.5 is

    Nh(j) = Σ_{i=1}^{m} (ni - 1) + 3 ceil( (N - m)/2 - 0.5 ) ,    when j = 1
    Nh(j) = 3 ceil( K / 2^{j-1} - 0.5 ) ,                         when j = 2, 3, ..., ceil(log2 K)        (4.5)

where K is the same as in Eq. (4.4).

Figure 4.5 The Multi-Layer Complete Network for Realizing the Product x1 x2 ... x7.
4.1.2 Experimental Results

Our algorithms for designing the single-layer complete network have been tested on two examples. As a first step, Gaussian discriminants were designed from the shape feature data sets. The resulting polynomial discriminant functions were then mapped into complete networks. For the CHEF and LPTF type shape features, the resulting networks had 16 inputs and UB = 136 hidden units. In Table 4.3, the classification error percentages for both the Gaussian classifier and the complete network (after mapping) are listed. The performances are very similar, as one would expect. After the mapping is completed, the complete networks were trained using the BP algorithm. For comparison, networks having the same topology, but initialized with random weights and then trained, were also tested on the same data sets. From Figure 4.6 and Figure 4.7, complete networks with mapped weights outperform the same networks with random initial weights.

Table 4.2 Comparisons of Single-Layer and Multi-Layer Complete Networks.

    Number of Inputs N    Single-Layer      Multi-Layer
            3             16 Units          6 Units / 2 Layers
            4             65 Units          9 Units / 2 Layers
            5             246 Units         12 Units / 3 Layers
            6             917 Units         15 Units / 3 Layers
            7             3,424 Units       18 Units / 3 Layers
            8             12,861 Units      21 Units / 3 Layers
4.1.3 Summary

Several advantages associated with the complete network are summarized here. First, the mappings provide good initial weights for the MLP network. The BP algorithm can then be used to improve upon this set of initial weights. Second, the mapping approach leads to specific network topologies. Third, an upper bound on the required number of hidden units can be derived.

Table 4.3 Classification Results of the Gaussian Classifier and the Complete Network.

    Shape Features     Gaussian Classifier     Complete Network
    CHEF               3.00 %                  3.125 %
    LPTF               1.875 %                 1.75 %

Figure 4.6 Training Results of a Complete Network (16-136-4) with CHEF Data.
Figure 4.7 Training Results of a Complete Network (16-136-4) with LPTF Data.

It is desirable for the number of hidden units to be kept as low as possible. One problem in the design of the complete network is that the number of hidden units increases explosively as the number of inputs and the degree P increase. For example, if the given function has 23 inputs and P = 7, the total number of hidden units required for the mapping is 2,035,776. Large complete networks of high degree are certainly not feasible. A similar but more efficient mapping algorithm is presented in the next section.
4.2 Compact Networks

In complete networks, each 2nd or higher degree term of the vector X for the approximating polynomial is realized by one hidden unit (or one PBF). However, the number of free parameters (weights and thresholds) in the resulting network is far greater than the number of hidden units. In this section, we describe methods for designing "compact networks", in which the number of weights and the dimension of X are more in line. In the following sections, output weights are defined as the weights connecting the input and hidden units to the output units. Hidden weights are weights from the input units to the hidden units, or weights between hidden layers. First, we give a definition of the compact network.

Definition : A compact network (Rank[C] < L) is an MLP network in which each hidden unit realizes many of the terms in Eq. (3.15), resulting in fewer hidden units than the complete network.

Before we discuss practical methods for constructing compact networks, we find a lower bound on the number of hidden units required.
4.2.1 A Lower Bound on the Number of Weights

A theorem which specifies a lower bound on the number of hidden weights and thresholds is stated in the following.

Theorem 4.2 : An MLP network capable of realizing a continuous N-dimensional, degree-P function f(x) must have at least Nt free parameters (hidden weights and thresholds of hidden units) such that Nt >= L.

Proof : Assume an MLP network with N inputs, Nh hidden units and Nw hidden weights. There are Nt free parameters to be determined, with Nt = Nw + Nh, where Nh is the total number of threshold values for the Nh hidden units. From Eq. (2.15), the output of the MLP network is expressed in matrix form as Wo^T C X, which approximates the function f(x). Our purpose here is to approximate the C matrix by a first degree function of the hidden weights and thresholds, i.e.

    C = C0 + C1        (4.6)

C0 and C1 are Nu x L matrices with elements c0( ) and c1( ); c0(i,j) is a scalar and c1(i,j) is taken to be a first degree function of the hidden weights and thresholds. That is,

    c1(i,j) = h(i,j) δw = Σ_{k=1}^{Nt} h(i,j,k) δw(k)        (4.7)

δw is an Nt by 1 vector with δw = Wx - Wx0. Here Wx (with elements wx( )) is the vector of hidden weights and thresholds, and Wx0 is the corresponding vector of expansion points for the Taylor series; h(i,j,k) is the first degree Taylor series coefficient in the vector h(i,j). Substituting this approximation for C into Eq. (2.15), Eq. (2.15) can be rewritten as

    Wo^T C1 = Af^T - Wo^T C0        (4.8)

Wo^T C1 is a 1 x L vector whose nth element is expressed as a weighted summation of the δw( ),

    Σ_{v=1}^{Nu} wo(v) c1(v,n) = Σ_{k=1}^{Nt} [ Σ_{v=1}^{Nu} wo(v) h(v,n,k) ] δw(k) = Σ_{k=1}^{Nt} r(k,n) δw(k)        (4.9)

Then Wo^T C1 is identical to δw^T multiplied by a matrix R, where R is an Nt x L matrix with elements r(k,n). This results in

    δw^T R = Af^T - Wo^T C0        (4.10)

From Eq. (4.10), there are two cases in which a set of nontrivial solutions for the δw( ) exists. In the first case, if L = Nt (the number of terms in X equals the number of hidden weights plus thresholds) and Rank[R] = L, the δw( ) can be solved for by inverting the R matrix. Second, if Nt > L and Rank[R] = L, then there are infinitely many solutions for the δw( ). For both cases, the condition for nontrivial solutions of the δw( ) is Nt >= L.
Once the δw( ) are found, the wx( ) can be determined from the corresponding wx0( ). Although the initial solutions for Wx are only approximate, they can be improved further by iterative methods. As a special case of Theorem 4.2, the lower bound on the number of hidden units needed for a fully-connected MLP network to map a function f(x) is given in the following corollary.

Corollary : The lower bound on the number of hidden units for mapping a continuous N-input, degree-P function f(x) into a fully-connected MLP network is

    LB = ceil( (UB + N + 1) / (N + 1) )        (4.11)

where ceil( ) is the ceiling function.

Proof : For a fully-connected MLP network with Nh hidden units, there are Nt = N Nh + Nh free parameters. According to Theorem 4.2, Nt must be greater than or equal to L. Thus, the lower bound for Nh is

    Nh = Nt / (N + 1) >= (UB + N + 1) / (N + 1) ,    so    LB = ceil( (UB + N + 1) / (N + 1) )        (4.12)
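The bounds are easy to evaluate numerically. A small sketch (Python, illustration only) computes L, UB and LB, assuming L = C(N+P, P) is the number of monomials of degree at most P in N variables and UB = L - N - 1 is the number of second or higher degree terms; these assumptions reproduce the figures quoted earlier (136 hidden units for the 16-input quadratic networks, and 2,035,776 for N = 23, P = 7):

    from math import comb, ceil

    def term_counts(N, P):
        L  = comb(N + P, P)          # dimension of X
        UB = L - N - 1               # complete-network units (degree >= 2 terms)
        LB = ceil((UB + N + 1) / (N + 1))   # corollary, Eq. (4.11)
        return L, UB, LB

    print(term_counts(16, 2))   # -> (153, 136, 9)
    print(term_counts(23, 7))   # -> UB = 2,035,776, as quoted in Section 4.1.3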
Based on Theorem 4.2 and its corollary, we concentrate on the application of the single hidden layer MLP network with fully-connected weights. In this case, the free parameters (the hidden weights) are simply the input weights, from input units to hidden units. In addition, the thresholds of the hidden units are not taken into account.
4.2.2 Construction of Compact Networks with Monomial Activations

The goal here is to present practical methods for realizing compact networks using the monomial activation. In the latter part of the chapter, we consider how to use the sigmoid activation to design compact networks. The monomial activation is

    G(Xnet(i)) = Xnet^k(i)        (4.13)

where k is an integer greater than or equal to 2. The first design step is to divide f(x) up into blocks of equal-degree terms, as in Figure 4.8. That is,

    f(x) = A0 + f1(x) + f2(x) + f3(x) + ... + fP(x)        (4.14)

where fk(x) contains all terms of degree k. The kth degree block is expressed as a sum of products of the inputs,

    fk(x) = Σ_{i1=1}^{N} Σ_{i2=i1}^{N} ... Σ_{iN=iN-1}^{N} ak(i1,i2,...,iN) x1^{qk1(i1)} x2^{qk2(i2)} ... xN^{qkN(iN)}        (4.15)

and

    qk1(i1) + qk2(i2) + ... + qkN(iN) = k        (4.16)

where k >= qkm( ) >= 0 for m = 1, 2, ..., N.

Figure 4.8 Compact Mapping with the Block Approach.

Note that in Eq. (4.14), A0 can be a bias term in the output unit, and f1(x) (the first degree part) can easily be realized by connecting the inputs directly to the output and assigning the coefficients of the first degree terms to those weights. Thus no hidden unit is needed for mapping f1(x). We will focus on finding the LB number of hidden units, Nh(k), required for realizing each fk(x) with k >= 2. In the following, two approaches to compact mapping are discussed.
4.2.2.1 Compact Mapping with the Block Approach

The block approach realizes all the terms in each fk(x), k >= 2, at the same time. Using the monomial activation (Eq. (4.13)), the network output for the kth degree block is

    fk(x) ≈ Σ_{m=1}^{Nh(k)} φ(m) wo(m) = Σ_{m=1}^{Nh(k)} [ Σ_{i=1}^{N} wx(m,i) xi ]^k wo(m)        (4.17)

where wx(m,i) is the weight from the ith input unit to the mth hidden unit. Carrying out the multiplication, Eq. (4.17) can be rewritten as

    fk(x) ≈ Σ_{i1=1}^{N} Σ_{i2=i1}^{N} ... Σ_{iN=iN-1}^{N} bk(i1,i2,...,iN) x1^{qk1(i1)} x2^{qk2(i2)} ... xN^{qkN(iN)}        (4.18)

and

    bk(i1,i2,...,iN) = [ k! / ( qk1(i1)! qk2(i2)! ... qkN(iN)! ) ] Σ_{m=1}^{Nh(k)} wx^{qk1(i1)}(m,1) wx^{qk2(i2)}(m,2) ... wx^{qkN(iN)}(m,N) wo(m)        (4.19)

Initializing Wo with small random numbers, our purpose is to find a set of input weights such that the mean square error between the ak( ) and the bk( ) (Eq. (4.15) and Eq. (4.18)) is minimized. This involves solving a set of nonlinear equations. Whenever the gradient is available, a generalization of the conjugate-gradient method can be applied to minimize the nonlinear functions. Define the mean square error, Ek(Wx), as
    Ek(Wx) = Σ_{i1=1}^{N} Σ_{i2=i1}^{N} ... Σ_{iN=iN-1}^{N} [ ak(i1,i2,...,iN) - bk(i1,i2,...,iN) ]^2        (4.20)

Substituting bk( ) into Eq. (4.20) and taking the derivative of Ek(Wx) with respect to each input weight, the gradients, gmj, are found as

    gmj = ∂Ek(Wx) / ∂wx(m,j)
        = -2 Σ_{i1=1}^{N} ... Σ_{iN=iN-1}^{N} [ ak(i1,...,iN) - bk(i1,...,iN) ] ∂bk(i1,...,iN)/∂wx(m,j)
        = -2 Σ_{i1=1}^{N} ... Σ_{iN=iN-1}^{N} Q [ ak(i1,...,iN) - bk(i1,...,iN) ] wx^{qk1(i1)}(m,1) ... qkj(ij) wx^{qkj(ij)-1}(m,j) ... wx^{qkN(iN)}(m,N) wo(m)        (4.21)

for 1 <= m <= Nh(k) and 1 <= j <= N, where
    Q = k! / ( qk1(i1)! qk2(i2)! ... qkN(iN)! )        (4.22)

The basic conjugate-gradient iteration of Fletcher and Reeves [50]-[51] has the form

    Wx^n ← Wx^{n-1} + z d^{n-1}        (4.23)

where superscripts denote the iteration number, z is chosen to minimize Ek(Wx), and d^{n-1} is the direction vector at the (n-1)th iteration,

    d^n = -g^n + µ^n d^{n-1}        (4.24)

and

    µ^n = (g^n)^T g^n / [ (g^{n-1})^T g^{n-1} ]        (4.25)

A flowchart of this method is shown in Figure 4.9.

Figure 4.9 Flowchart of the Iterative Conjugate-Gradient Method.

Rather than using an arbitrary z, we find the zeros of the derivative of Ek(Wx - z d),

    ∂Ek(Wx - z d) / ∂z = Σ_i ci z^i = 0        (4.26)

where each ci is a function of the coefficients bk( ) and of the input and output weights. It is important to note that the conjugate-gradient algorithm must be restarted periodically in order to guarantee superlinear convergence [52]. The usual recommendation [53] is to restart after every Nh(k) N iterations. Thus, we set µ^n = 0 whenever n is divisible by Nh(k) N. Observe that the BP algorithm using steepest descent has the same form as Eq. (4.23), except that for the BP algorithm z is taken to be a constant, the learning rate. There is convincing theoretical and empirical evidence that conjugate gradients converge faster than steepest descent.
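A minimal sketch of the Fletcher-Reeves iteration with periodic restarts is given below (Python, illustration only). The closed-form line search of Eq. (4.26) is replaced here by a simple backtracking search, and the toy objective is a least-squares surrogate rather than the true Ek(Wx); only the update rules of Eqs. (4.23)-(4.25) are taken from the text.

    import numpy as np

    def conjugate_gradient_fr(f, grad, w0, n_iter=200, restart=None, z0=1.0):
        """Fletcher-Reeves CG with backtracking line search and periodic restart."""
        w = np.asarray(w0, dtype=float)
        g = grad(w)
        d = -g
        for n in range(1, n_iter + 1):
            gd, fw, z = g @ d, f(w), z0
            while f(w + z * d) > fw + 1e-4 * z * gd and z > 1e-12:
                z *= 0.5                      # backtracking step-length search
            w = w + z * d
            g_new = grad(w)
            if restart and n % restart == 0:
                mu = 0.0                      # periodic restart: pure steepest descent step
            else:
                mu = (g_new @ g_new) / (g @ g + 1e-30)   # Eq. (4.25)
            d = -g_new + mu * d               # Eq. (4.24)
            g = g_new
        return w

    # toy quadratic surrogate: drive a weight vector to a target in least squares
    target = np.array([1.0, -2.0, 0.5, 3.0])
    f    = lambda w: np.sum((w - target) ** 2)
    grad = lambda w: 2.0 * (w - target)
    print(np.round(conjugate_gradient_fr(f, grad, np.zeros(4), n_iter=50, restart=8), 4))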
The block approach has the advantage of fast design since the coefficients in each
degree term are realized simultaneously. However, problems arise whenever the
distribution of the input coefficients has large deviations. In this case, finding the global
minimum of Ek(Wx) generally requires too many iterations.
4.2.2.2 Compact Mapping with the Group Approach

Whenever the global minimum of Ek(Wx) is difficult to find, or the N or P of the given function is large, we recommend an alternative. Instead of realizing all the terms in each degree block at once, we divide the terms in each degree block into N groups (the group approach) and realize all xj product terms sequentially, for j from 1 to N. The general concept is illustrated in Figure 4.10. As an example, if a function has 3 inputs (N = 3) and only third degree terms, there are 10 terms after the expansion. These 10 terms can be divided into 3 groups:

    Group 1 : x1^3, x1^2 x2, x1^2 x3, x1 x2^2, x1 x2 x3, x1 x3^2
    Group 2 : x2^3, x2^2 x3, x2 x3^2
    Group 3 : x3^3        (4.27)

Figure 4.10 Compact Mapping with the Group Approach.

The theorems and corollary developed in the previous section apply to the group approach as well. Assume that the connections in each group are fully connected. Then in group j (which realizes all xj product terms) there are Nj = N - j + 1 inputs. In each group, the error function of Eq. (4.20) simplifies to

    Ek^{(j)}(Wx) = Σ_{im} [ ak(...,im,...) - bk(...,im,...) ]^2        (4.28)

where the sum is taken over the terms in group j. The iterative method discussed in the previous section for finding a set of input weights can then be applied again within each group.
To calculate the minimum number of hidden units required for all degree-k terms, we first need the number of hidden units in each group. From Eq. (2.14), the number of product terms, Npj, in group j (with degree k) is

    Npj = (k + Nj - 1)! / [ k! (Nj - 1)! ] - (k + Nj - 2)! / [ k! (Nj - 2)! ]        (4.29)

Thus, the minimum number of hidden units in group j is ceil(Npj / Nj). Finally, the total number of required hidden units for a kth degree block is

    Nh(k) = Σ_{j=1}^{N-1} ceil( Npj / Nj ) + 1        (4.30)

Several demonstrations comparing the two approaches are given later. From the experimental results, the group approach does show advantages in realizing the desired coefficients in the network, especially for difficult data.
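The counting of Eqs. (4.29) and (4.30) is illustrated by the short sketch below (Python, illustration only; the function name is hypothetical). Note that these are theoretical minimums; the unit counts actually used in the experiments of Section 4.2.5 may be larger.

    from math import comb, ceil

    def group_unit_count(N, k):
        """Group sizes Npj (Eq. (4.29)) and minimum hidden units Nh(k) (Eq. (4.30))."""
        sizes, units = [], []
        for j in range(1, N):              # groups 1 .. N-1; group N holds only x_N^k
            Nj  = N - j + 1
            Npj = comb(k + Nj - 1, k) - comb(k + Nj - 2, k)
            sizes.append(Npj)
            units.append(ceil(Npj / Nj))
        return sizes, sum(units) + 1       # the "+1" covers the last group

    print(group_unit_count(3, 3))   # sizes (6, 3) as in Eq. (4.27); minimum Nh(3) = 5
    print(group_unit_count(7, 2))   # one unit per group -> minimum Nh(2) = 7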
4.2.3 Conversion of Monomial Activations to Analytic Activations

Since the monomial activation is not bounded, the output easily saturates when the BP algorithm is used for training. Therefore, it is natural to find a method for converting compact monomial networks to compact analytic-activation networks. The conversion process begins with the units having the highest degree monomial activation (degree P). If there are nP monomial units with Pth degree monomial activation, each of them can be replaced by a sigmoid unit (Appendix B) with highest degree p(i) = P. Figure 4.11 shows the basic idea.

Figure 4.11 Conversion of a kth Degree Monomial Activation to a Sigmoid Activation.

This first substitution generates the Pth degree terms accurately but also generates extra terms of degrees (P-1), (P-2), .... However, these unwanted terms can be subtracted from the rest of the polynomial outputs; the coefficients for terms of degrees (P-1), (P-2), ... of the original output are recalculated. The conversion process then continues with the (P-1)th degree monomial activations, and so on, until the monomial activations of degree 2 are reached. From the above discussion, the conversion procedure simply multiplies the original set of weights by scaling factors and adds offsets, and does not change the number of hidden units required.
Here we show an example of converting a network with second degree monomial activations to sigmoid activations. After the replacement, the number of sigmoidal units is the same as the original number of squaring units. As shown in Figure 4.12, the Si are chosen such that the output of the sigmoidal squaring subnet is equivalent to that of the squaring unit. Consider the ith unit feeding the mth output unit. If there are N inputs feeding the ith unit, the weights and thresholds are compensated according to the following rules:

    θ(m) ← θ(m) - wo(i) ( S4 + S5 θ(i) )
    θ(i) ← S1 θ(i) + S2
    wx(m,j) ← wx(m,j) - S5 wx(i,j) wo(i) ,    j = 1, 2, ..., N
    wx(i,j) ← S1 wx(i,j) ,                    j = 1, 2, ..., N
    wo(i) ← S3 wo(i)        (4.31)

The result for the second degree case can be extended to higher degree functions without difficulty.
Figure 4.12 Replacing a Monomial Activation (2nd Degree) by a Sigmoid Activation.

4.2.4 Sparse Second Degree Compact Networks

Second degree polynomial functions have found wide use in pattern analysis and signal processing. The Bayes-Gaussian discriminant function [6]-[7] is popular because of its simple form and ease of design from training data. In this subsection, a special treatment of the second degree compact network is discussed. Rather than using the conjugate-gradient method to find a set of input weights for mapping a given second degree function, we derive closed forms for mapping the given coefficients to the network weights. We utilize the group approach for all xj product terms in the second degree block. For example, consider a second degree function f(x1,x2,x3),

    f(x1,x2,x3) = a1 x1^2 + a2 x2^2 + a3 x3^2 + a4 x1 x2 + a5 x1 x3 + a6 x2 x3        (5.1)

It requires at least six free parameters (weights) to map these six terms. A special kind of sparse connection of the compact network is shown in Figure 4.13 for this example. The desired coefficients can be realized by solving a set of equations step by step. From Figure 4.13, the outputs of the three units are
    wo(1) [ x1 wx(1,1) + x2 wx(1,2) + x3 wx(1,3) ]^2
    wo(2) [ x2 wx(2,2) + x3 wx(2,3) ]^2
    wo(3) [ x3 wx(3,3) ]^2        (5.2)

Taking the summation at the output unit and comparing the resulting output with f(x), 6 equations can be generated, in the following computational order:

    wo(1) wx^2(1,1) = a1
    2 wo(1) wx(1,1) wx(1,2) = a4
    2 wo(1) wx(1,1) wx(1,3) = a5
    wo(1) wx^2(1,2) + wo(2) wx^2(2,2) = a2
    2 wo(1) wx(1,2) wx(1,3) + 2 wo(2) wx(2,2) wx(2,3) = a6
    wo(1) wx^2(1,3) + wo(2) wx^2(2,3) + wo(3) wx^2(3,3) = a3        (5.3)

As long as a1, a2 and a3 are not all zero, the input weights associated with each unit can be solved for sequentially by

    wx(1,1) = sqrt( a1 / wo(1) ) ,    wx(1,2) = a4 / [ 2 wo(1) wx(1,1) ] ,    wx(1,3) = a5 / [ 2 wo(1) wx(1,1) ]        (5.4)

    wx(2,2) = sqrt( [ a2 - wo(1) wx^2(1,2) ] / wo(2) ) ,    wx(2,3) = [ a6 - 2 wo(1) wx(1,2) wx(1,3) ] / [ 2 wo(2) wx(2,2) ]        (5.5)

    wx(3,3) = sqrt( [ a3 - wo(1) wx^2(1,3) - wo(2) wx^2(2,3) ] / wo(3) )        (5.6)

Note that wo(1), wo(2) and wo(3) are chosen, in the specified order, to make the square-root arguments in Eqs. (5.4) to (5.6) positive. These results are easily extended to quadratic functions with N inputs.

Figure 4.13 Efficient Compact Mapping of a Second Degree Function.
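The closed-form mapping above can be checked numerically. The sketch below (Python, illustration only; the coefficient values are arbitrary) builds the weights from Eqs. (5.4)-(5.6) and verifies that the three-unit network reproduces the quadratic exactly:

    import numpy as np

    def map_quadratic_3input(a, wo=(1.0, 1.0, 1.0)):
        """Input weights of Eqs. (5.4)-(5.6); a = (a1,...,a6) as in Eq. (5.1)."""
        a1, a2, a3, a4, a5, a6 = a
        w1, w2, w3 = wo
        wx = np.zeros((3, 3))
        wx[0, 0] = np.sqrt(a1 / w1)
        wx[0, 1] = a4 / (2 * w1 * wx[0, 0])
        wx[0, 2] = a5 / (2 * w1 * wx[0, 0])
        wx[1, 1] = np.sqrt((a2 - w1 * wx[0, 1] ** 2) / w2)
        wx[1, 2] = (a6 - 2 * w1 * wx[0, 1] * wx[0, 2]) / (2 * w2 * wx[1, 1])
        wx[2, 2] = np.sqrt((a3 - w1 * wx[0, 2] ** 2 - w2 * wx[1, 2] ** 2) / w3)
        return wx

    def network_output(x, wx, wo=(1.0, 1.0, 1.0)):
        u1 = wo[0] * (x @ wx[0]) ** 2                          # unit 1 sees x1, x2, x3
        u2 = wo[1] * (x[1] * wx[1, 1] + x[2] * wx[1, 2]) ** 2  # unit 2 sees x2, x3
        u3 = wo[2] * (x[2] * wx[2, 2]) ** 2                    # unit 3 sees x3 only
        return u1 + u2 + u3

    a  = (2.0, 3.0, 4.0, 0.5, -0.4, 0.8)
    wx = map_quadratic_3input(a)
    x  = np.array([0.3, -1.2, 0.7])
    f_true = (a[0]*x[0]**2 + a[1]*x[1]**2 + a[2]*x[2]**2
              + a[3]*x[0]*x[1] + a[4]*x[0]*x[2] + a[5]*x[1]*x[2])
    print(np.isclose(network_output(x, wx), f_true))   # True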
4.2.5 Experimental Results

Our algorithms for designing the compact network have been tested on several second and higher degree functions.

First, an example comparing the classification performance of the second degree compact network and the Gaussian classifier is presented. The training data for both approaches consist of the CHEF, FMDF, RWEF, LPTF and RDF feature sets; each has 16 inputs and 4 output classes (ellipse, triangle, quadrilateral, and pentagon), and each class has 200 patterns (see Appendix C for details). The classification error percentages for both the Gaussian classifier and the sparse-connection compact network are listed in Table 4.4.

Table 4.4 Classification Error Percentages for the Gaussian Classifier and the Second Degree Compact Network.

    Feature Data    Gaussian Classifier    Sparse-Connection Compact Network
    CHEF            3.00 %                 3.25 %
    FMDF            0.375 %                0.375 %
    RWEF            2.875 %                2.75 %
    LPTF            1.875 %                1.875 %
    RDF             0.25 %                 0.25 %

In Table 4.4, the misclassification percentages for the Gaussian classifier and the compact network are almost the same. The sparse-connection second degree compact network thus provides another method for designing the Gaussian classifier. To apply the BP algorithm, the monomial units are replaced with sigmoid units. Finally, the equivalent network is compared with the same network initialized with random weights. In Figure 4.14, the training results for the RWEF data are shown after 20 iterations. It is obvious that the network with mapped weights outperforms the same network with random weights.
The second example maps a set of functions to the network using the group and block approaches, respectively. The first function has 7 inputs and second degree terms only. The second has 5 inputs and third degree terms only, and the third has 3 inputs and fourth degree terms only. The desired coefficients for each case have Gaussian distributions, with the means and standard deviations shown in Table 4.5.

Figure 4.14 Training Results of a Compact Network with RWEF Data.

Table 4.5 Mean and Standard Deviation of the Coefficients of the Different Functions.

                           Mean    Standard Deviation
    7-Input, 2nd Degree     10            5
    5-Input, 3rd Degree      5            4
    3-Input, 4th Degree      2            1.5

The iterative conjugate-gradient method is applied to find a set of input weights such that the relative mean square error between the network output coefficients and the desired coefficients is below 1 x 10^-5. The training results using the conjugate-gradient method for the 2nd degree function are displayed in Figure 4.15 through Figure 4.21. Since there are so many 2nd and 3rd degree terms (28 and 35), we only show the mapping results for the 4th degree terms, in Table 4.6.

Figure 4.15 Group Approach for Realizing All x1 Terms (Second Degree Case).
Table 4.6 Results for Mapping a 4th Degree Function to Compact Networks, Using the Group and Block Approaches.

    4th-Degree Terms    Desired Coeff.    Group        Block
    x1^4                 1.784729          1.784731     1.784729
    x1^3 x2              2.064153          2.064143     2.064159
    x1^3 x3             -3.872893         -3.872891    -3.872889
    x1^2 x2^2            1.847609          1.847592     1.847579
    x1^2 x2 x3           1.177845          1.177817     1.177823
    x1^2 x3^2           -0.817769         -0.817781    -0.817769
    x1 x2^3              2.289803          2.289832     2.289833
    x1 x2^2 x3           0.6076442         0.6076502    0.6076581
    x1 x2 x3^2           1.089452          1.089462     1.089451
    x1 x3^3              3.560644          3.560627     3.560652
    x2^4                 2.269974          2.270015     2.269949
    x2^3 x3             -2.443908         -2.443928    -2.443920
    x2^2 x3^2            3.450646          3.450626     3.450643
    x2 x3^3              0.5177643         0.5177612    0.5177647
    x3^4                 2.423806          2.423808     2.423806

Figure 4.16 Group Approach for Realizing All x2 Terms (Second Degree Case).
Figure 4.17 Group Approach for Realizing All x3 Terms (Second Degree Case).
Figure 4.18 Group Approach for Realizing All x4 Terms (Second Degree Case).
Figure 4.19 Group Approach for Realizing All x5 Terms (Second Degree Case).
We also estimated the theoretical UB and LB values and compared them to the experimental results. As shown in Table 4.7, the required number of hidden units is always bounded between LB and UB.

Figure 4.20 Group Approach for Realizing All x6 Terms (Second Degree Case).
Figure 4.21 Block Approach for Realizing All Terms of a Second Degree Function.
The last example maps a 2-dimensional non-integer degree function to the network, using the block approach. According to the multidimensional Lagrange interpolation formula [18], the 2-dimensional function 1/(x1 x2) can be expressed as a polynomial with integer degrees,

    f(x) = 1/(x1 x2)
         ≈ 4.694 - 3.25 x1 - 3.25 x2                         (constant and 1st degree block)
           + 0.722 x1^2 + 2.25 x1 x2 + 0.722 x2^2            (2nd degree block)
           - 0.5 x1^2 x2 - 0.5 x1 x2^2                       (3rd degree block)
           + 0.111 x1^2 x2^2                                 (4th degree block)        (5.7)

for x1 in [1,2] and x2 in [1,2].

Table 4.7 Comparison of the Required Hidden Units with the Theoretical Results.

                           LB    Group Approach    Block Approach    UB
    7-Input, 2nd Degree     4           7                 7          28
    5-Input, 3rd Degree     7          16                 8          35
    3-Input, 4th Degree     5          11                15          15

The approximating function is then realized in the compact network by dividing the function into 4 blocks. There are 2 units for the 2nd degree block, 3 units for the 3rd degree block and 3 units for the 4th degree block. In total, 8 hidden units (with monomial activations) approximate the function well. Note that this number is greater than LB = 6 and less than UB = 12. In Table 4.8, we show the desired coefficients and the mapping results for comparison.

From the examples we have demonstrated, our algorithms successfully realize given functions in compact networks and find the number of units required to perform the mapping.
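A quick numerical check of the polynomial in Eq. (5.7) is given below (Python, illustration only). The signs used follow the reading adopted above, which matches the Lagrange interpolant of 1/x at the nodes 1, 1.5 and 2 in each variable:

    import numpy as np

    def poly(x1, x2):
        return (4.694 - 3.25*x1 - 3.25*x2
                + 0.722*x1**2 + 2.25*x1*x2 + 0.722*x2**2
                - 0.5*x1**2*x2 - 0.5*x1*x2**2
                + 0.111*x1**2*x2**2)

    g = np.linspace(1.0, 2.0, 21)
    X1, X2 = np.meshgrid(g, g)
    err = np.abs(poly(X1, X2) - 1.0/(X1*X2))
    print(err.max())    # maximum error on the grid is on the order of 1e-2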
4.2.6 Summary

In this chapter, we developed two kinds of MLP networks based on the PBF model by observing the rank of the C matrix. If the MLP network has a C matrix with rank L and no redundant units, the result is a complete network, because each term of the desired function is represented by exactly one PBF (one hidden unit). Once the complete network is constructed, any additional function can be realized in the same network without adding extra hidden units. If the C matrix has rank less than L, the network is called a compact network, because each PBF represents several terms of the desired function, giving a more compact network topology.
Table 4.8 Using the Block Approach to Approximate the Function 1/(x1 x2).

    2nd-Degree Terms    Desired Coeff.    Mapping Results
    x1^2                0.7222            0.7222
    x1 x2               2.25              2.249
    x2^2                0.7222            0.7222

    3rd-Degree Terms    Desired Coeff.    Mapping Results
    x1^3                0.0              -1.729 x 10^-6
    x1^2 x2            -0.5              -0.499
    x1 x2^2            -0.5              -0.500
    x2^3                0.0              -1.490 x 10^-8

    4th-Degree Terms    Desired Coeff.    Mapping Results
    x1^4                0.0               2.691 x 10^-4
    x1^3 x2             0.0               8.211 x 10^-5
    x1^2 x2^2           0.111             0.111
    x1 x2^3             0.0              -6.373 x 10^-5
    x2^4                0.0               2.980 x 10^-8

CHAPTER 5

INVERSE MAPPINGS

In this chapter, a straightforward method is described for generating a polynomial basis function model of a trained MLP neural network. First, we model each hidden unit activation as a polynomial function of its net input. Then a method is developed for finding the PBF for each hidden unit. The output polynomial discriminant can be calculated from the PBFs. Preliminary methods for eliminating linearly dependent PBFs are given. The processing sequence is illustrated in Figure 5.1.

Figure 5.1 Block Diagram of Inverse Mappings.
5.1 Polynomial Network Models

In this section we develop two different PBF models of the MLP network. Given the network weights, topology, and the training data, the first step is to model each hidden unit activation as a polynomial function of its net input. As before, Xnet^(k)(i) and φ^(k)(i) denote the net input and activation output, respectively, of the ith unit for the kth training pattern. From the analyticity of the activation, we can model each hidden unit's output as a power series of finite integer degree p(i) in the variable Xnet^(k)(i). This leads to

    φ^(k)(i) ≈ Σ_{n=0}^{p(i)} A(i,n) [ Xnet^(k)(i) - X0(i) ]^n = Σ_{n=0}^{p(i)} D(i,n) [ Xnet^(k)(i) ]^n        (5.1)

and the network output is

    f(x) = Σ_{i=1}^{Nu} wo(i) φ^(k)(i)        (5.2)

In Eq. (5.1), X0(i) is defined as the total net input of the ith unit divided by the total number of training vectors Nv, and the D(i,n) are the coefficients after the polynomial expansion. This process continues until every hidden unit's output has the same form as Eq. (5.1). This is referred to as the condensed network model.

In the condensed model of Eq. (5.1), each PBF is a composition of polynomials. The exhaustive PBF model is obtained by multiplying out these compositions for each hidden unit. Then each PBF can be expressed as an inner product of a coefficient vector with X, i.e.

    φ^(k)(i) ≈ C(i) X        (5.3)

C(i) is a function of the network weights and thresholds of the hidden units. Although the exhaustive network model is not easily obtained, especially when the unit degree p(i) or the number of network inputs, N, is large, it provides a compact form from which to derive polynomial approximations of the network. Applications of both models are discussed in detail in the following sections. First, we calculate the coefficients D(i,n) in Eq. (5.1) for the condensed network model.
5.2 Calculation of the Condensed Network Model

There are several methods for determining the power series coefficients D(i,n) in Eq. (5.1). We have chosen a method which is optimal in a least-mean-squares sense. Given the desired degree p(i) of the polynomial approximating the activation function, we minimize the mean square error

    E(p(i)) = Σ_{k=1}^{Nv} [ φ^(k)(i) - Σ_{n=0}^{p(i)} D(i,n) ( Xnet^(k)(i) )^n ]^2        (5.4)

with respect to the D(i,n) coefficients, over the Nv training patterns. Setting the derivatives to zero gives a linear system for the D(i,n):

    Σ_k 1 D(i,0)                   + Σ_k Xnet^(k) D(i,1)           + ... + Σ_k (Xnet^(k))^{p(i)} D(i,p(i))      = Σ_k φ^(k)
    Σ_k Xnet^(k) D(i,0)            + Σ_k (Xnet^(k))^2 D(i,1)       + ... + Σ_k (Xnet^(k))^{p(i)+1} D(i,p(i))    = Σ_k φ^(k) Xnet^(k)
        ...
    Σ_k (Xnet^(k))^{p(i)} D(i,0)   + ...                                 + Σ_k (Xnet^(k))^{2p(i)} D(i,p(i))     = Σ_k φ^(k) (Xnet^(k))^{p(i)}        (5.5)

It can be shown [54] that the resulting (p(i)+1) x (p(i)+1) square matrix is nonsingular as long as the Xnet^(k)(i) are distinct.
It is also desirable to have an algorithm for choosing the polynomial degree of each hidden unit's activation function automatically. We propose to measure the relative mean square error, R(p(i)), over all training vectors and use it to determine the unit degree p(i). The relative MSE is defined as E(p(i)) divided by the sample variance of the ith hidden unit's output, measured over all training patterns, i.e.

    R(p(i)) = Σ_{k=1}^{Nv} [ φ^(k)(i) - Σ_{n=0}^{p(i)} D(i,n) ( Xnet^(k)(i) )^n ]^2 / Σ_{k=1}^{Nv} [ φ^(k)(i) - <φ(i)> ]^2        (5.6)

where <φ(i)> denotes the average output activation of the ith unit. For each hidden unit, the corresponding R(p(i)) is a measure of the approximation accuracy. For a user-chosen threshold T, which represents a desired or maximum acceptable value of R(p(i)), the degree p(i) is increased until the condition R(p(i)) <= T is satisfied. Figure 5.2 gives the details. It is obvious that smaller values of T correspond to more accurate approximations. Also, the degree of each unit reveals the relative importance of the unit in the approximation.
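A compact sketch of the fit of Eq. (5.5) and the degree-selection rule of Eq. (5.6) is given below (Python, illustration only; the sampled net-input range and threshold are arbitrary assumptions):

    import numpy as np

    def fit_activation_poly(x_net, phi, degree):
        """Least-squares coefficients D(i,0..degree), i.e. the solution of Eq. (5.5)."""
        V = np.vander(x_net, degree + 1, increasing=True)   # columns 1, x, x^2, ...
        D, *_ = np.linalg.lstsq(V, phi, rcond=None)
        return D

    def relative_mse(x_net, phi, D):
        """R(p) of Eq. (5.6): fit error divided by the sample variance of phi."""
        V = np.vander(x_net, len(D), increasing=True)
        resid = phi - V @ D
        return np.sum(resid**2) / np.sum((phi - phi.mean())**2)

    def choose_degree(x_net, phi, T=1e-5, p_max=5):
        """Increase the unit degree until R(p) <= T, as in Figure 5.2."""
        for p in range(1, p_max + 1):
            D = fit_activation_poly(x_net, phi, p)
            if relative_mse(x_net, phi, D) <= T:
                return p, D
        return p_max, D

    # demo: a sigmoid sampled over a narrow net-input range is nearly quadratic
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
    x_net = np.linspace(1.0, 1.9, 200)
    p, D = choose_degree(x_net, sigmoid(x_net), T=1e-5)
    print(p, np.round(D, 5))   # selects a low degree (p = 2) for this range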
We present two examples to show the effectiveness of our approach for modelling the hidden units. As the first example, consider the exclusive-or MLP network having two inputs, one hidden unit, and one output unit. The network was trained by the BP algorithm with sigmoid activations until the MSE (mean square error) was below 0.00574, with no misclassifications. Its hidden unit's output was then approximated by a p(i)-degree polynomial for p(i) = 1, 2, and 3. The truth table of the exclusive-or problem and the MSE, R(p(i)) and error % of these approximations are shown in Table 5.1 and Table 5.2, respectively. A second example, approximating a parity-check MLP network, is given in Table 5.3 and Table 5.4. In both cases, T is set to 1 x 10^-8.

Figure 5.2 Deciding the Unit Degree p(i) for the ith Unit.

Table 5.1 Truth Table of the Exclusive-OR Problem.

    x1   x2   Desired Output   Class #
    0    0         0              1
    0    1         1              2
    1    0         1              2
    1    1         0              1

It is clear from Table 5.2 and Table 5.4 that the least squares approach works well for a second or higher degree polynomial. As seen in the tables, we can determine the hidden unit's degree by observing the behavior of the MSE or the relative MSE as the degree is changed. In general, high degree units are more important than units with low degree.

Table 5.2 Approximation of the Exclusive-OR MLP Network (Layer Structure : 2-1-1).

    Degree p(i)    MSE         R(p(i))           Error %
    1              0.502305    0.3152            25.00 %
    2              0.005742    5.178 x 10^-9     0.000 %
    3              0.005742    6.745 x 10^-10    0.000 %

Table 5.3 Truth Table of the Parity-Check Problem.

    x1   x2   x3   Desired Output   Class #
    0    0    0         0              1
    0    0    1         1              2
    0    1    0         1              2
    0    1    1         0              1
    1    0    0         1              2
    1    0    1         0              1
    1    1    0         0              1
    1    1    1         1              2

Table 5.4 Approximation of the Parity-Check MLP Network (Layer Structure : 3-10-1).

                            MSE        Error %
    Sigmoid Activation      0.008988   0.000 %
    Maximum Degree = 1      1.00405    62.50 %
    Maximum Degree = 2      1.00413    62.50 %
    Maximum Degree = 3      0.009419   0.000 %
5.3 Network Pruning Using the Condensed Network Model

We note that a trained MLP network does not necessarily make effective use of all of its hidden units. Pruning useless units simplifies the network topology without affecting the performance, and it reduces the computational complexity. Existing techniques [55]-[56] start the network with a large size and then remove hidden units that have little effect on the classification error. The units are ordered according to the amount by which the classification error changes when the unit is removed; the unit with the smallest effect is removed, or pruned. These methods are not optimal in that, for a given classification error, a smaller network than that found by these algorithms may exist. In this section, we use the condensed network model to detect the presence of linearly dependent basis vectors by observing the degree of each PBF, which is a polynomial in its unit's net input. If a PBF has low degree, such as 0 or 1, the unit is useless. If a PBF has a unique degree term, then the unit is important. In the following, we state the relevant properties of the condensed network model, followed by examples.

Property 5.1 : Units whose PBFs have degree 0 or 1 are linearly dependent.

This can be shown as follows. According to Eq. (5.1), the activation output of the ith unit with degree p(i) = 1 is

    φ(i) = D(i,0) + D(i,1) Xnet(i)        (5.7)

The ith unit can be eliminated without changing the network output. If the mth unit feeds the ith unit and the ith unit feeds the jth unit (Figure 5.3), then the weights feeding the jth unit and the threshold θ(j) of the jth unit are compensated as

    θ(j) ← θ(j) + w(j,i) [ D(i,0) + D(i,1) θ(i) ]        (5.8)
    w(j,m) ← w(j,m) + D(i,1) w(i,m) w(j,i)        (5.9)

For the case of a constant output, where p(i) = 0, the threshold of the jth unit, which the ith unit feeds, is updated as

    θ(j) ← θ(j) + w(j,i) D(i,0)        (5.10)

and unit i can then be removed.
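The compensation rules of Eqs. (5.8)-(5.9) are simple enough to verify in a few lines (Python, illustration only; the weight values are arbitrary):

    import numpy as np

    def absorb_linear_unit(D0, D1, w_in, theta_i, w_next, theta_next):
        """Fold a degree-1 hidden unit (phi ~ D0 + D1*Xnet) into the next layer.
        w_in[m] is w(i,m); w_next[j] is the weight w(j,i) to each downstream unit j."""
        for j in range(len(w_next)):
            theta_next[j] += w_next[j] * (D0 + D1 * theta_i)   # Eq. (5.8)
        w_direct = np.outer(w_next, D1 * w_in)                 # Eq. (5.9): increments to w(j,m)
        return w_direct, theta_next

    # tiny demo: one linear hidden unit between two single-unit "layers"
    w_in, theta_i = np.array([0.4, -0.3]), 0.2
    D0, D1 = 0.05, 0.9
    w_next, theta_next = np.array([1.5]), np.array([0.1])
    w_direct, theta_next = absorb_linear_unit(D0, D1, w_in, theta_i, w_next, theta_next)

    x = np.array([0.7, -1.1])
    before = 1.5 * (D0 + D1 * (w_in @ x + theta_i)) + 0.1   # contribution with unit i present
    after  = w_direct[0] @ x + theta_next[0]                # contribution after removal
    print(np.isclose(before, after))   # True: the unit can be removed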
In the following example, we apply Property 5.1 to detect the useless units in an existing network. Our results are also compared with Karnin's method [22], which estimates the sensitivity of the error function to the exclusion of each connection. The training data we generated had 4 input features and 2 classes. There were 200 input vectors for each class, drawn from a joint Gaussian probability density; the training vectors of the two classes had the same covariance matrix, so the optimal discriminant function is a first-degree polynomial [6]. An MLP neural network having the 4-5-2 structure, with no direct connections from input to output (Figure 5.4), was trained using this data set. After 100 iterations, the mean square error was reduced to 0.003436. We applied the methods developed in Section 5.2, using a maximum degree of p(i) = 2 for each unit. However, for T = 0.005 the unit degree of every hidden unit was automatically reduced to 1, while still closely approximating the original analytic activation. Thus all hidden units can be removed and replaced by direct connections, using Eq. (5.8) and Eq. (5.9). Finally, the network structure becomes 4-2 (only the input and output layers remain). Table 5.5 lists the weights shown in Figure 5.5.

Figure 5.3 The Shaded Unit i is Ready to be Removed.
Figure 5.4 Pattern Classifier Network.
Figure 5.5 The Network After Pruning.

Table 5.5 Weights in Figure 5.5.

    w(i,j)      1        2       3       4
    G        -12.42    -0.67    5.83    4.38
    H         12.42     0.70   -5.90   -4.36

For comparison, Karnin's pruning method was also applied to the same network. In his approach, weight sensitivities between layers are calculated as

    Sij = Σ_{k=1}^{NIT} [ Δw^(k)(i,j) ]^2 w^(F)(i,j) / [ η ( w^(F)(i,j) - w^(I)(i,j) ) ]        (5.11)

The superscripts (I) and (F) denote the initial and final connection weights (before and after the training process) between units i and j, NIT is the total number of iterations, and η is the learning factor. The sensitivities we obtained are listed in Table 5.6 and Table 5.7.
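For reference, Eq. (5.11) can be evaluated directly from the weight increments logged during BP training, as in the sketch below (Python, illustration only; the logged increments shown are hypothetical):

    import numpy as np

    def karnin_sensitivity(delta_w, w_initial, w_final, eta):
        """Karnin's sensitivity estimate, Eq. (5.11), for a single connection."""
        delta_w = np.asarray(delta_w)
        return np.sum(delta_w**2) * w_final / (eta * (w_final - w_initial) + 1e-30)

    increments = [0.05, 0.02, -0.01, 0.015]     # hypothetical per-iteration updates
    w0 = 0.10
    wF = w0 + sum(increments)
    print(karnin_sensitivity(increments, w0, wF, eta=0.1))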
From Table 5.6 and Table 5.7, we see that the input and output weights of hidden units c and e have smaller sensitivities (refer to Figure 5.6). Thus, hidden units c and e can be removed, and the final network would have topology 4-3-2. Comparing Figure 5.5 with Figure 5.6, it is obvious that our pruning method results in a more compact network whenever the original network is effectively linear.

Table 5.6 Input Weight Sensitivities in Figure 5.4.

    Sij      1         2         3         4
    a      0.0235    0.0189    0.0191    0.0246
    b      0.0091   -0.0218    0.0037    0.0060
    c      0.0005    0.0009   -0.0001    0.0008
    d      0.0248    0.0139    0.0165    0.0160
    e      0.0003   -0.0009    0.0001    0.0000

Table 5.7 Output Weight Sensitivities in Figure 5.4.

    Sij      a        b        c        d        e
    G      0.487    0.496    0.213    0.167    0.162
    H      0.522    0.402    0.102    0.150    0.174
Property 5.2 : Units whose outputs across the training set mimic the outputs of other units can be removed.

To illustrate, if one unit m gives approximately the same output as unit n for the entire training set, it can be discarded. To remove unit m without changing the solution, at each unit j on the next layer, w(j,n) is updated as

    w(j,n) ← w(j,n) + w(j,m)        (5.12)

Figure 5.6 Results of Karnin's Analysis (Shaded Units and Dark Lines are Candidates for Removal).

Property 5.3 : PBFs which have unique degree terms are linearly independent of the other PBFs.

For example, if one PBF has a 5th degree term but all others are of 4th degree or less, the 5th degree PBF is linearly independent of the others. Property 5.3 can considerably simplify our analysis of the basis functions.

Property 5.4 : If a subset of the coefficient matrix, of dimension Nh by Ls with Nh <= Ls < L, has linearly independent rows, then the complete coefficient matrix has linearly independent rows.

In other words, the linear independence of the rows of the coefficient matrix can be investigated by examining a very small number of its columns. Although this is only a sufficient condition for detecting the linear independence of the basis functions, it is extremely useful when Nh is much smaller than L.
5.4 Calculation of the Exhaustive Network Model

The condensed form of the PBF gives a simple way to examine the unit degree as a function of its net input. However, it is sometimes useful to find the exhaustive network model. The first step is to find the network output degree P. If there are H hidden layers in the network, then

    P = Π_{J=1}^{H} p(J)        (5.13)

where p(J) denotes the maximum degree in the Jth hidden layer, i.e.

    p(J) = Max_{i ∈ Layer J} p(i)        (5.14)

Given the degree P, we can find L, the dimension of X, as in Eq. (2.13).

The next step after finding P and L is a method for evaluating the network PBFs. Rewrite the net input of the ith unit as an inner product,

    Xnet(i) = Σ_k C(k) X w(i,k) + θ(i) = d(i) X        (5.15)

Each element of the vector d(i) is the weighted sum of PBF coefficients from the previous layers. Using Eq. (5.15) in Eq. (5.1), we get

    φ(i) ≈ Σ_{n=0}^{p(i)} D(i,n) ( d(i) X )^n        (5.16)

Defining d(i,n) as the coefficient vector for Xnet^n(i), so that (d(i)X)^n = d(i,n)X, it is obvious that d(i,1) = d(i). Therefore,

    Xnet^2(i) = ( d(i) X )^2 = X^T d^T(i) d(i) X = d(i,2) X        (5.17)

By extension of Eq. (5.17), the nth degree term in Eq. (5.16) can be formed through polynomial multiplication as

    ( d(i) X )^n = X^T d^T(i,n-1) d(i,1) X = d(i,n) X        (5.18)

Finally, the output of each hidden unit can be written as an inner product of X with a coefficient vector,

    φ(i) ≈ [ Σ_{n=0}^{p(i)} D(i,n) d(i,n) ] X        (5.19)

In Eq. (5.19), the quantity inside the brackets is the ith row vector in Eq. (2.16), or C(i) in Eq. (2.12). Continuing the process until the last hidden layer is reached, the MLP network output can be expressed as a linear combination of PBFs, as in Eq. (2.11).
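The polynomial multiplication of Eqs. (5.16)-(5.19) can be sketched with a simple dictionary representation, where each PBF is a map from exponent tuples to coefficients (Python, illustration only; the sample net input and D(i,n) values are arbitrary):

    def poly_mul(p, q):
        r = {}
        for e1, c1 in p.items():
            for e2, c2 in q.items():
                e = tuple(a + b for a, b in zip(e1, e2))
                r[e] = r.get(e, 0.0) + c1 * c2
        return r

    def poly_axpy(acc, alpha, p):
        for e, c in p.items():
            acc[e] = acc.get(e, 0.0) + alpha * c
        return acc

    def unit_pbf(d_i, D_i):
        """phi(i) ~ sum_n D(i,n) * (d(i) . X)^n, with d(i) given over the monomial basis."""
        one = {tuple([0] * len(next(iter(d_i)))): 1.0}
        pbf, power = {}, one
        for Dn in D_i:
            pbf = poly_axpy(pbf, Dn, power)
            power = poly_mul(power, d_i)      # next power of the net input
        return pbf

    # demo, N = 2: net input 0.5 + x1 - 2*x2, quadratic activation model
    d_i = {(0, 0): 0.5, (1, 0): 1.0, (0, 1): -2.0}
    D_i = [0.1, 0.3, 0.05]                    # D(i,0), D(i,1), D(i,2)
    print(unit_pbf(d_i, D_i))                 # e.g. the x1*x2 coefficient is 0.05*2*1*(-2) = -0.2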
Before demonstrating design examples that find polynomial approximations of the MLP network, we outline the practical procedure in the following steps:

Step 1. Choose a proper network size (number of layers and number of hidden units per layer) and train the network with the BP algorithm until the MSE falls below some acceptable value.
Step 2. Approximate the activation output of each unit by a polynomial of some maximum finite degree.
Step 3. Choose the threshold value T for the relative MSE, R(p(i)). Decrease the degree of the polynomial for each unit as long as the condition R(p(i)) <= T remains satisfied.
Step 4. Remove units whose output degree is 0 or 1, using Property 5.1.
Step 5. Use the exhaustive network model to find the approximating polynomial for the network output.

Following these procedures, we illustrate several examples to test the effectiveness of our methods.
5.5 Experiments with the Exhaustive Network Model

In this section, the goal is to use the exhaustive network model to find polynomial approximations of the MLP network's output. Several design examples, such as MLP filters and MLP classifiers, are demonstrated.

5.5.1 MLP Neural Network Filter

One example demonstrates the design of a nonlinear filter. The example attempts to design a 3-input median filter network, having topology 3-10-1. The training data are uniformly distributed between 0 and 1, and the desired output is the median of the 3 inputs. In total, there are 1000 patterns. Again, the maximum allowed value for p(i) was 5. In Figure 5.7, we show the error % and the required hidden units for different thresholds. As shown in Figure 5.7, the model network deviates significantly from the sigmoid network until T <= 0.01, and the number of hidden units having degree greater than 1 increases. The value of the network degree P here was 5, and we can say that training has succeeded. Clearly, this median filter network, and all of its hidden units, are performing a nonlinear operation.

In Table 5.8, the degree of each unit is listed, and the coefficients of all 56 terms are shown in Table 5.9.

Table 5.8 Degree of Each Unit of the 3-Input Median Filter When T = 0.01.

    Unit     1   2   3   4   5   6   7   8   9   10
    Degree   2   2   2   2   2   3   5   3   3   2
5.5.2 MLP Neural Network Classifiers

Figure 5.7 3-Input Median Filter Network with Layer Structure 3-10-1 and Maximum Degree p(i) = 5.

Table 5.9 Output Coefficients of the Approximating Polynomial for the 3-Input Median Filter Network.

    Terms      Coeff.     Terms         Coeff.     Terms         Coeff.     Terms            Coeff.
    Const.     -0.014     x1^2 x3        1.230     x2^3 x3        0.008     x1^3 x2 x3        1.826
    x1          0.271     x1 x2 x3      -4.865     x1^2 x3^2     -0.002     x1^2 x2^2 x3     -5.283
    x2          0.457     x2^2 x3        2.305     x1 x2 x3^2     0.006     x1 x2^3 x3        6.794
    x3          0.301     x1 x3^2        1.203     x2^2 x3^2     -0.006     x2^4 x3          -3.277
    x1^2        0.010     x2 x3^2        0.173     x1 x3^3       -0.001     x1^3 x3^2        -0.440
    x1 x2       0.011     x3^3          -0.475     x2 x3^3        0.002     x1^2 x2 x3^2      2.548
    x2^2        0.022     x1^4          -0.000     x3^4          -0.000     x1 x2^2 x3^2     -4.915
    x1 x3      -0.042     x1^3 x2        0.002     x1^5          -0.051     x2^3 x3^2         3.160
    x2 x3      -0.044     x1^2 x2^2     -0.007     x1^4 x2        0.491     x1^2 x3^3        -0.409
    x3^2        0.021     x1 x2^3        0.009     x1^3 x2^2     -1.893     x1 x2 x3^3        1.580
    x1^3       -0.397     x2^4          -0.004     x1^2 x2^3      3.652     x2^2 x3^3        -1.524
    x1^2 x2     0.009     x1^3 x3       -0.001     x1 x2^4       -3.523     x1 x3^4          -0.190
    x1 x2^2     2.375     x1^2 x2 x3     0.007     x2^5           1.359     x2 x3^4           0.367
    x2^3       -1.558     x1 x2^2 x3    -0.013     x1^4 x3       -0.237     x3^5             -0.035

For the case of nonlinear classifiers, the two-input exclusive-or and three-input parity-check problems (from Section 5.2) are examined first. The threshold T chosen for these examples is 1.0 x 10^-8, to ensure the best approximation of the original activation function. For the exclusive-or network, the hidden unit is approximated by a polynomial of degree 2. The polynomial approximation, for the net function of the output unit, is

    f(x1,x2) = -2.74 + 0.91 x1 + 0.78 x2 + 4.65 x1^2 - 11.19 x1 x2 + 4.73 x2^2        (5.20)

Using the fact that xi^n = xi for binary inputs, we can rewrite Eq. (5.20) as

    f(x1,x2) = -2.74 + 5.56 x1 + 5.51 x2 - 11.19 x1 x2        (5.21)

It is obvious that f(x1,x2) < 0 when patterns are from class 1 and f(x1,x2) > 0 for patterns in class 2. For the parity-check problem, the BP algorithm is used on a network with 3-10-1 topology until the MSE <= 0.0089, and each hidden unit is modelled with degree p(i) = 3. Finally, the polynomial approximation for the net function (xi^n = xi) of the output unit is

    f(x1,x2,x3) = -2.764 + 5.825 x1 + 5.886 x2 + 5.736 x3 - 11.919 x1 x2 - 11.915 x1 x3 - 11.952 x2 x3 + 23.932 x1 x2 x3        (5.22)
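The signs in Eqs. (5.20)-(5.22) follow the reading adopted above; they can be checked against the truth tables of Tables 5.1 and 5.3 with a short sketch (Python, illustration only):

    from itertools import product

    f_xor    = lambda x1, x2:     -2.74  + 5.56*x1 + 5.51*x2 - 11.19*x1*x2
    f_parity = lambda x1, x2, x3: (-2.764 + 5.825*x1 + 5.886*x2 + 5.736*x3
                                   - 11.919*x1*x2 - 11.915*x1*x3 - 11.952*x2*x3
                                   + 23.932*x1*x2*x3)

    for x in product((0, 1), repeat=2):
        print(x, f_xor(*x) > 0, (x[0] ^ x[1]) == 1)     # sign of f matches the XOR class
    for x in product((0, 1), repeat=3):
        print(x, f_parity(*x) > 0, (sum(x) % 2) == 1)   # sign of f matches the parity class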
5.5.3 Experiments with Quadratic Discriminants

In this subsection, an example is shown in which quadratic discriminant functions are designed using the conventional approximate Bayesian technique and via second degree approximations to the MLP neural networks. The N-input second degree Gaussian discriminant for the ith class, Gi(x), and the quadratic approximation to the ith MLP discriminant, Qi(x), can be written respectively as

    Gi(x) = θi + Σ_{j=1}^{N} wi(j) xj + Σ_{k=1}^{N} Σ_{l=1}^{k} Wi(k,l) xk xl        (5.23)

    Qi(x) = θ'i + Σ_{j=1}^{N} w'i(j) xj + Σ_{k=1}^{N} Σ_{l=1}^{k} W'i(k,l) xk xl        (5.24)

The coefficients Wi( ), wi( ), θi in Eq. (5.23) are found from the covariance matrix and mean vector of each class [6]. The coefficients W'i( ), w'i( ), θ'i in Eq. (5.24) are found using the technique of the previous sections.

The training data used for both approaches are geometric shape feature data. The classification error percentages for the MLP network (with topology 16-20-4) and for each discriminant are shown in Table 5.10. It is obvious that Gi(x) and Qi(x) have quite similar classification performance.

Table 5.10 Error % for the Gaussian and Quadratic Discriminants and the MLP Network.

            Gaussian Discriminant Gi(x)    MLP Network    Quadratic Discriminant Qi(x)
    RWEF            2.50 %                   2.50 %              5.88 %
    CHEF            3.00 %                   1.63 %              2.88 %
    FMDF            0.25 %                   1.25 %              2.38 %

The Gaussian discriminant attempts to minimize the probability of error, whereas the design of the MLP network tries to minimize the mean square error between the network outputs and the desired values. Thus, the error percentages of the two differ slightly. From the table, the quadratic discriminant has higher error percentages than the sigmoid MLP network because the modelling polynomial has too low a degree.
In order to compare Gi(x) and Qi(x) further, we can measure the relative mean square error between the corresponding two-dimensional coefficient arrays of the Gaussian and quadratic discriminants. The Ei for each class is of the form

    Ei = Σ_{k=1}^{N} Σ_{l=1}^{k} [ Wi(k,l) - si W'i(k,l) ]^2 / Σ_{k=1}^{N} Σ_{l=1}^{k} Wi^2(k,l)        (5.25)

The constant si is found so as to minimize Ei. Another way to measure the similarity is to calculate the correlation coefficient Ri for each class [57]:

    Ri = Σ_{j=1}^{Nv} ( Gi(x^(j)) - Gi_bar ) ( Qi(x^(j)) - Qi_bar ) / sqrt( Σ_{j=1}^{Nv} ( Gi(x^(j)) - Gi_bar )^2  Σ_{j=1}^{Nv} ( Qi(x^(j)) - Qi_bar )^2 )        (5.26)

where

    Gi_bar = (1/Nv) Σ_{j=1}^{Nv} Gi(x^(j)) ,    Qi_bar = (1/Nv) Σ_{j=1}^{Nv} Qi(x^(j))        (5.27)

and x^(j) denotes the jth pattern. It is apparent that if Ei is small or Ri is large, the two discriminants are more similar. Table 5.11 gives the results of our analysis. In spite of the results of Wan [58], in which a Bayesian interpretation is given for MLP networks, we see that the discriminant functions from the two approaches can differ significantly. Finally, we show the learning capability of the quadratic discriminants of Eq. (5.24). As the results in Table 5.12 show, the training takes only 50 iterations, and both the mean square errors and the error percentages decrease.

Table 5.11 Analysis of Similarity Between the Gaussian and Quadratic Discriminants.

             Class 1         Class 2         Class 3         Class 4
             E1     R1       E2     R2       E3     R3       E4     R4
    RWEF    .449   .559     .999   .589     .601   .354     .390   .701
    CHEF    .229   .723     .442   .372     .691   .148     .399   .743
    FMDF    .880   .887     .599   .455     .616   .257     .886   .658

Table 5.12 Training of the Quadratic Discriminants.

    Shape Feature    Before Training    After Training
    RWEF             5.88 %             3.75 %
    CHEF             2.88 %             2.38 %
    FMDF             2.38 %             1.50 %
CHAPTER 6

CONCLUSIONS

In this dissertation, we presented a PBF model for the analysis and design of MLP neural networks. Applications of the PBF model were given for forward and inverse mappings, respectively. In the following, we summarize the major aspects of our work.

(1). The PBF model leads to approximation theorems for the MLP networks. A constructive proof for realizing each term of the desired function was shown which utilizes the PBFs, and the required number of hidden units was determined.

(2). The PBF model leads to straightforward mappings between MLP networks and conventional filtering and classification algorithms. Given an N-dimensional finite degree polynomial function, two different kinds of networks, complete and compact networks, were developed. The upper bound (UB) and lower bound (LB) on the number of required hidden units were also derived. The forward mapping allows us to determine the required network topology for a given task. The network can then be improved through BP learning.

(3). Given a trained MLP neural network, both condensed and exhaustive PBF models can be found. In the condensed network model, the PBF for each unit is a function of its net input. The condensed network model is useful for determining the network degree and for network pruning. The pruning methods based on this model are shown to be more efficient than existing techniques. The exhaustive network model, which is a polynomial discriminant function, is found by multiplying out the condensed model.
APPENDIX A

BACK-PROPAGATION LEARNING ALGORITHM

MLP networks are most often designed using the Back-Propagation (BP) learning rule or its variants [19]. Basically, the BP algorithm is a gradient descent technique. Its objective is to adjust the network weights so that application of a set of inputs produces the desired set of outputs. Learning in the network is equivalent to minimizing the sum of the squared errors between the desired and actual network outputs with respect to these weights. Each input vector is paired with a target vector, T^(p)(i), representing the desired output of the ith unit for pattern p. The total mean square error, E(W), at the outputs of the network is

    E(W) = (1/2) Σ_{p=1}^{Nv} Σ_{i=1}^{Nc} [ T^(p)(i) - O^(p)(i) ]^2        (A.1)

where O^(p)(i) is the actual output of the ith output unit and the summation is performed over all Nc output units and Nv patterns.

Before starting the training process, all of the weights must be initialized to small random values; large values could saturate the network. For each training pattern p, the direction of steepest descent in parameter space is determined by the partial derivative of E(W) with respect to each weight (or threshold),

    Δw^(p)(i,j) ∼ -∂E(W) / ∂w^(p)(i,j)        (A.2)

Then the weights are updated as

    w^(p)(i,j) = w^(p-1)(i,j) + Δw^(p)(i,j)        (A.3)

In general, Eq. (A.2) is rewritten as

    Δw^(p)(i,j) = η δ^(p)(i) O^(p)(j)        (A.4)

η is called the learning rate, and δ^(p)(i), which propagates the error signals backward through the network, is defined as

    δ^(p)(i) = -∂E(W) / ∂Xnet^(p)(i)        (A.5)

Essentially, the determination of δ^(p)(i) is a recursive process which starts at the output layer and works backwards to the first hidden layer (refer to Figure A.1). The δ^(p)(i) is given by

    δ^(p)(i) = G'(Xnet^(p)(i)) ( T^(p)(i) - O^(p)(i) ) ,              for unit i in the output layer
    δ^(p)(i) = G'(Xnet^(p)(i)) Σ_k δ^(p)(k) w^(p)(k,i) ,              for unit i in the hidden layers        (A.6)

and

    G'(Xnet^(p)(i)) = ∂φ(i) / ∂Xnet^(p)(i)        (A.7)

Finally, all the weights are updated according to Equations (A.3) and (A.4).

In summary, the BP algorithm for training MLP networks is as follows:

Step 1. Initialize the weights and thresholds between all units.
Step 2. Present an input and the desired outputs.
Step 3. Calculate the actual outputs.
Step 4. Adapt the weights backward.
Step 5. Repeat by going to Step 2.

Figure A.1 Backpropagation of the Error Signals from the Output Layer.
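A minimal batch-mode sketch of the algorithm is given below (Python, illustration only). The dissertation's networks use pattern-mode updates and different topologies; this is merely a compact illustration of Eqs. (A.1)-(A.7) with sigmoid activations and thresholds folded into bias inputs.

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def train_bp(X, T, n_hidden=4, eta=0.5, n_epochs=5000, seed=0):
        rng = np.random.default_rng(seed)
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])            # bias column
        W1 = rng.normal(0, 0.3, (Xb.shape[1], n_hidden))          # input -> hidden
        W2 = rng.normal(0, 0.3, (n_hidden + 1, T.shape[1]))       # hidden -> output
        for _ in range(n_epochs):
            H  = sigmoid(Xb @ W1)
            Hb = np.hstack([H, np.ones((H.shape[0], 1))])
            O  = sigmoid(Hb @ W2)
            delta_o = (T - O) * O * (1 - O)                        # Eq. (A.6), output layer
            delta_h = (delta_o @ W2[:-1].T) * H * (1 - H)          # Eq. (A.6), hidden layer
            W2 += eta * Hb.T @ delta_o                             # Eqs. (A.3)-(A.4)
            W1 += eta * Xb.T @ delta_h
        return W1, W2, 0.5 * np.sum((T - O) ** 2)                  # Eq. (A.1)

    # exclusive-or data, as in Table 5.1
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)
    W1, W2, mse = train_bp(X, T)
    print(mse)   # usually near zero; convergence depends on the initial weights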
APPENDIX B

REALIZATION OF MONOMIAL AND TWO-INPUT PRODUCT SUBNETS

In this Appendix, methods to realize monomial and two-input product subnets are presented. An iterative method for finding the expansion point of the second degree Taylor series is discussed first, and the accuracy of the truncated Taylor series is then reviewed. Example squaring and product subnets are also presented.

B.1 Finding X0 for the Second Degree Taylor Series

Our goal here is to closely approximate the sigmoid activation output of Eq. (1.12) by a power series with p(i) = 2. The method is quite straightforward. As a first step, we need to decide on a specific point of expansion, X0(i), and a range of convergence, M(i), for the ith hidden unit. Since the sigmoid function is differentiable, we can find the (p(i)+1)th derivative of the sigmoid activation and estimate an upper bound on the remainder term Rp(i)+1( ) and the radius of convergence on both sides of X0(i), according to the following formula [59]:

    Rp(i)+1(Xnet(i)) = G^(p(i)+1)(ξ) ( Xnet(i) - X0(i) )^{p(i)+1} / (p(i)+1)!        (B.1)

G^(p(i)+1)(ξ) is the (p(i)+1)th derivative of the sigmoid activation function, and ξ lies somewhere between Xnet(i) and X0(i). The choice of the expansion point is crucial. It must ensure that (1) φ(i) can be approximately expressed by the first three terms of the Taylor series, and (2) the ratio of the maximum remainder to the third term of the Taylor series is as small as possible; that is, the truncation error is bounded explicitly. Another important factor in the Taylor series approximation is the radius of convergence, chosen such that |Xnet(i) - X0(i)| <= M(i). This constraint helps us decide the initial weights and threshold of each hidden unit. We suggest an iterative method for finding X0(i) and M(i). Initially, we have a rough estimate that X0(i) is about 1.5 in this case. Thus we can start with an initial guess for X0(i) in [0.8, 1.8] and use numerical analysis to get the best operating point. The iterative process at the kth step is:

(1). X0^(k)(i) ← X0^(k)(i) + ΔX0(i), where ΔX0(i) = 0.05.
(2). Taking ΔM^(k)_right(i) to be 1, find the remainder term using Eq. (B.1). Calculate the ratio of Rp(i)+1( ) to the term of degree p(i). If the ratio is less than some typical value (say δ = 0.0001), then the right-side radius of convergence at the kth step is M^(k)_right(i) ← ΔM^(k)_right(i). Otherwise, decrease ΔM^(k)_right(i) by δ and repeat (2).
(3). Repeat (2) for the left-side region to find M^(k)_left(i).

From our experimental results, the optimal value of X0(i) is chosen to be the value associated with the maximum radius of convergence. For the case of p(i) = 2, the result for X0(i) is 1.45 and the radius of convergence is 0.505.
B.2 Conditions for Mapping Accuracy of the Truncated Taylor Series

In the previous section, we showed a method for finding the expansion point and radius of convergence for each unit. It is also desirable to find conditions which allow us to neglect the remainder terms (truncation error) after the Taylor series expansion.

Theorem B1 : The ratio of the remainder's magnitude to that of the p(i) degree term in Eq. (2.7) approaches zero as the radius of convergence for that unit approaches zero.

Proof : According to Eq. (2.7), the ratio of the remainder to the p(i) degree term is

    E(i,ξ) ( Xnet(i) - X0(i) )^{p(i)+1} / [ A(i,p(i)) ( Xnet(i) - X0(i) )^{p(i)} ] = K ( Xnet(i) - X0(i) )        (B.2)

where K is a finite constant. For bounded inputs, and therefore bounded activation outputs |φ(j)| <= φ^(m)(j), and the constraint |Xnet(j) - X0(j)| <= M(j), we set up the following equations:

    Σ_j w(i,j) φ^(m)(j) + θ(i) = X0(i) + M(i)
    -Σ_j w(i,j) φ^(m)(j) + θ(i) = X0(i) - M(i)        (B.3)

For simplicity, we assume that all the weights w(i,j) are equal to each other. Then the weights feeding the ith hidden unit and its threshold value are

    w(i,j) = M(i) / Σ_j φ^(m)(j) ,    θ(i) = X0(i)        (B.4)

Substituting Eq. (B.4) into the net input of the ith hidden unit (Eq. (2.9)),

    Xnet(i) = M(i) Σ_j φ(j) / Σ_j φ^(m)(j) + θ(i)        (B.5)

The ratio in Eq. (B.2) becomes

    K M(i) Σ_j φ(j) / Σ_j φ^(m)(j)        (B.6)

When M(i) → 0, the ratio of the remainder to the p(i) degree term becomes insignificant.

From the results of Theorem B1, we develop methods to realize monomial and two-input product subnets.
B.3 Monomial Subnet
In designing monomial subnets, we must decide the number of hidden units and
control the mapping accuracy.
Theorem B2 : The monomial function,xk, for a bounded input signalx, x ≤ x(m), can
be realized with one processing layer, having (k-1) hidden units, with
arbitrarily small errors (Figure B.1).
Proof : The hidden units are numbered i = 2 to k and, for the ith hidden unit, p(i) = i.
From Eq. (2.7), x^i is observed at the output of the ith hidden unit, along with unwanted
terms such as x^(i-1), x^(i-2), and so on, which need to be subtracted out. The zero degree and
first degree terms can be generated using a bias term and the connection from the input
layer. Thus the total number of hidden units required is (k−1). As in the proof of Theorem
B1, the input weight and threshold for the ith hidden unit are

$$w_x(i,1) = \frac{M(i)}{x^{(m)}}, \qquad \theta(i) = X_0(i) \qquad (B.7)$$

for i = 2, 3, ..., k. Then

$$X_{net}(i) - X_0(i) = x\,w_x(i,1) \qquad (B.8)$$

and clearly Eq. (2.7) becomes

$$\phi(i) = \sum_{j=0}^{p(i)} A(i,j)\,w_x^{\,j}(i,1)\,x^{j} + E(i,\xi)\,w_x^{\,i+1}(i,1)\,x^{\,i+1} \qquad (B.9)$$

Figure B.1 Monomial Subnet x^k.

Normalize Eq. (B.9) to get the normalized activation φ(i) as

$$\phi(i) = \sum_{j=0}^{p(i)} d(i,j)\,x^{j} + E_n(i) \qquad (B.10)$$

with

$$d(i,j) = \frac{A(i,j)\,M^{\,j-i}(i)}{A(i,i)\,(x^{(m)})^{\,j-i}}, \qquad E_n(i) = \frac{E(i,\xi)\,M(i)}{A(i,i)\,x^{(m)}}\,x^{\,i+1} \qquad (B.11)$$

After normalization, the coefficient of the highest degree term in each hidden unit's output is
one. Assume that x^k is formed by connecting the φ(i)'s to an output node through weights
wo(i), 2 ≤ i ≤ k. The net input of the output node is the total summation of all (k−1) hidden
units' outputs

$$\sum_{j=2}^{k} \sum_{m=j}^{k} w_o(m)\,d(m,j)\,x^{j} + \sum_{m=2}^{k} w_o(m)\,E_n(m) \qquad (B.12)$$

From Eq. (B.11), d(m,m) = 1. By picking wo(k) = 1 and

$$w_o(i) = -\sum_{j=i+1}^{k} w_o(j)\,d(j,i), \qquad i = 2, \ldots, k-1 \qquad (B.13)$$

then Eq. (B.12) becomes

$$x^{k} + \sum_{m=2}^{k} w_o(m)\,E_n(m) \qquad (B.14)$$

As stated in Theorem B1, En(m) is proportional to M(m). Clearly En(m) can be made
arbitrarily small by making M(m) small. Therefore the estimated errors can be made
arbitrarily small.
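The output weights of Eq. (B.13) are obtained by a simple back-substitution, working from the highest-degree hidden unit downward. A minimal Python sketch is given below; it assumes the normalized coefficients d(i,j) of Eq. (B.11) have already been tabulated in a dictionary, and it uses the minus sign of Eq. (B.13), which is what makes the unwanted degree-i terms cancel.

    def output_weights(k, d):
        # Back-substitution of Eq. (B.13).
        # d maps (i, j) -> d(i, j) for hidden units i = 2..k and j <= i.
        # wo[k] = 1, and the remaining weights are chosen so that the
        # unwanted terms of degree 2..k-1 cancel in Eq. (B.12).
        wo = {k: 1.0}
        for i in range(k - 1, 1, -1):
            wo[i] = -sum(wo[j] * d[(j, i)] for j in range(i + 1, k + 1))
        return wo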
In summary, the conditions to construct x^k with arbitrarily small error are

$$M(k) \rightarrow 0, \qquad M(i) = M(j) \quad \text{for } i \neq j \qquad (B.15)$$

As an example, we design a subnet for x^2. According to Theorem B2, there is only
one hidden unit required. Rewriting Eq. (B.10) for the case of p(i) = 2, the activation
output is
$$\phi(2) = \frac{A(2,0)}{A(2,2)\,w_x^{2}(2,1)} + \frac{A(2,1)}{A(2,2)\,w_x(2,1)}\,x + x^{2} + \frac{E(2,\xi)\,w_x(2,1)}{A(2,2)}\,x^{3}$$
$$= \frac{A(2,0)\,(x^{(m)})^{2}}{A(2,2)\,M^{2}(2)} + \frac{A(2,1)\,x^{(m)}}{A(2,2)\,M(2)}\,x + x^{2} + \frac{E(2,\xi)\,M(2)}{A(2,2)\,x^{(m)}}\,x^{3}$$
$$= \lambda_0 + \lambda_1\,x + x^{2} + Err \qquad (B.16)$$

By choosing M(2) small and the weights as shown in Figure B.2, x^2 can be observed at
the output unit with arbitrarily small errors. In Figure B.2, λ0 and λ1 are as given in Eq. (B.16) and

$$\lambda_2 = \frac{M(2)}{x^{(m)}}, \qquad \lambda_3 = \frac{(x^{(m)})^{2}}{A(2,2)\,M^{2}(2)} \qquad (B.17)$$
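The x^2 example is small enough to verify numerically. The sketch below is an illustration only, assuming a standard sigmoid activation, the bound x^(m) = 1, and an M(2) chosen for convenience; it builds the single hidden unit of Figure B.2 from Eqs. (B.7), (B.16) and (B.17) and checks that λ3·σ(λ2·x + θ) − λ0 − λ1·x tracks x^2 over the input range, with an error that shrinks as M(2) is reduced.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def square_subnet(x_max=1.0, M=0.05, X0=1.45):
        # Weights of the x**2 subnet of Figure B.2 (sigmoid activation assumed).
        s = sigmoid(X0)
        A0, A1, A2 = s, s * (1 - s), s * (1 - s) * (1 - 2 * s) / 2.0
        w = M / x_max                               # Eq. (B.7): lambda_2
        lam0 = A0 * x_max ** 2 / (A2 * M ** 2)      # Eq. (B.16)
        lam1 = A1 * x_max / (A2 * M)                # Eq. (B.16)
        lam3 = x_max ** 2 / (A2 * M ** 2)           # Eq. (B.17)
        return w, X0, lam0, lam1, lam3

    w, theta, lam0, lam1, lam3 = square_subnet(M=0.05)
    x = np.linspace(-1.0, 1.0, 201)
    estimate = lam3 * sigmoid(w * x + theta) - lam0 - lam1 * x
    print("max |estimate - x^2| :", np.max(np.abs(estimate - x ** 2)))  # shrinks with M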
B.4 Two-Input Product Subnet
In the discussion of monomial subnets, we proved that x^k can easily be
approximated with arbitrarily small error. Here, the realization of 2-input multipliers is
discussed. The result can be extended to a multi-input multiplier.
A 2-input multiplier subnet with bounded inputs, having 3 hidden units in one
processing layer, can be constructed starting with a monomial subnet for x^2 (replace x by
x1 + x2). Therefore, (x1 + x2)^2 = (x1^2 + 2x1x2 + x2^2) can be realized with arbitrarily small errors.

Figure B.2 Monomial Subnet x^2.

The unwanted x1^2 and x2^2 terms can be generated with parallel squaring subnets and
subtracted off, yielding x1x2 only, as shown in Figure B.3. The method
to initialize the product subnet is described as follows :
Given bounded inputs, x1^(n) ≤ x1 ≤ x1^(m) and x2^(n) ≤ x2 ≤ x2^(m), the net input Xnet(2),
with the constraint |Xnet(2) − X0(2)| ≤ M(2), can be written as

$$w_x(2,1)\,x_1^{(m)} + w_x(2,2)\,x_2^{(m)} + \theta(2) = X_0(2) + M(2) \qquad (B.18)$$

$$w_x(2,1)\,x_1^{(n)} + w_x(2,2)\,x_2^{(n)} + \theta(2) = X_0(2) - M(2) \qquad (B.19)$$

If the net output of the product unit (hidden unit 2 in Figure B.3) is normalized with the
coefficient of x1x2, the remainder can be rewritten as

$$\frac{E(2,\xi)\,(X_{net}(2) - X_0(2))^{3}}{2\,w_x(2,1)\,w_x(2,2)} \qquad (B.20)$$
From Eq. (B.20), minimizing the remainder term is equivalent to maximizing the product
wx(2,1)wx(2,2). By subtracting Eq. (B.19) from Eq. (B.18), we get

$$w_x(2,1)\,r_1 + w_x(2,2)\,r_2 = 2\,M(i) \qquad (B.21)$$

where r1 = x1^(m) − x1^(n) and r2 = x2^(m) − x2^(n). Taking the derivative of wx(2,1)wx(2,2) with respect
to either wx(2,1) or wx(2,2), the input weights and threshold which minimize the remainder
term can be found as

$$w_x(2,1) = \frac{M(i)}{r_1}, \qquad w_x(2,2) = \frac{M(i)}{r_2} \qquad (B.22)$$

$$\theta(i) = X_0(i) + M(i)\left(1 - \frac{x_1^{(m)}}{r_1} - \frac{x_2^{(m)}}{r_2}\right) \qquad (B.23)$$

After calculating the input weights, the output weights can be initialized in the same way
as in the monomial subnet. An example of a two-input product subnet with specific
weights and thresholds is shown in the following.
θ(1) = θ(2) = θ(3) = 1.451
wx(2,1) = wx(2,2) = 0.253
wx(1,1) = wx(3,2) = 0.505
λ0 = 200.753,  λ1 = 164.375
λ2 = 10.582,   λ3 = 41.094        (B.24)
Figure B.3 Product Subnet x1x2.
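To see the product construction work end to end, the following sketch squares x1 + x2, x1 and x2 with three copies of the x^2 subnet above and recovers x1·x2 from the identity x1x2 = ((x1+x2)^2 − x1^2 − x2^2)/2. It is an illustration only: it assumes sigmoid activations, symmetric input ranges x1, x2 in [−1, 1], and a small radius M chosen for accuracy rather than the specific values of Eq. (B.24).

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def square_estimator(u_max, M=0.02, X0=1.45):
        # Return a function u -> u**2 realized by one sigmoid unit (Figure B.2);
        # the constant and linear corrections are folded into the output layer.
        s = sigmoid(X0)
        A0, A1, A2 = s, s * (1 - s), s * (1 - s) * (1 - 2 * s) / 2.0
        w = M / u_max
        lam3 = 1.0 / (A2 * w ** 2)
        return lambda u: lam3 * (sigmoid(w * u + X0) - A0 - A1 * w * u)

    # Three hidden units: one for (x1 + x2)**2 and two parallel squaring subnets.
    sq_sum = square_estimator(u_max=2.0)    # x1 + x2 ranges over [-2, 2]
    sq_1   = square_estimator(u_max=1.0)
    sq_2   = square_estimator(u_max=1.0)

    x1, x2 = np.meshgrid(np.linspace(-1, 1, 41), np.linspace(-1, 1, 41))
    product = 0.5 * (sq_sum(x1 + x2) - sq_1(x1) - sq_2(x2))
    print("max |product - x1*x2| :", np.max(np.abs(product - x1 * x2)))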
APPENDIX C
FEATURE DATA SET
Our goal in this Appendix is to enumerate several different kinds of feature sets
which are used in this dissertation to test the classification capabilities of the MLP
networks. The features were calculated from four classes of geometric shapes. The four
primary geometric shapes are ellipse, triangle, quadrilateral, and pentagon. Several
example shape images are shown in Figure C.1. Each shape image consists of a matrix
of size 64 x 64, and each element in the matrix represents a binary-valued pixel in the
image. The feature types are Circular Harmonic Expansion (CHEF) [60], Fourier-Mellin
Descriptor (FMDF) [61], Ring-Wedge Energy (RWEF) [62], Log-Polar Transform (LPTF)
[63] and Radius Feature (RDF) [64]. In Table C.1, we summarize the different shape
feature sets.
Table C.1 Shape Features.
Shape Features # of Inputs # of Classes # of Patterns per Class
CHEF 16 4 200
FMDF 16 4 200
RWEF 16 4 200
LPTF 16 4 200
RDF 16 4 200
Figure C.1 Examples of Geometric Shapes.
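Each feature set in Table C.1 therefore amounts to a collection of 16-dimensional pattern vectors with one of four class labels and 200 patterns per class. A hypothetical layout for presenting such a set to an MLP trainer is sketched below; the file name, file format, and helper names are illustrative assumptions, not the formats used in this dissertation.

    import numpy as np

    N_INPUTS, N_CLASSES, N_PER_CLASS = 16, 4, 200   # Table C.1

    def load_feature_set(path):
        # Hypothetical loader: one whitespace-delimited row per pattern,
        # 16 feature values followed by a class index in {0, 1, 2, 3}.
        data = np.loadtxt(path)
        features, labels = data[:, :N_INPUTS], data[:, N_INPUTS].astype(int)
        assert features.shape == (N_CLASSES * N_PER_CLASS, N_INPUTS)
        # One-hot targets, one output per shape class (ellipse, triangle,
        # quadrilateral, pentagon), as is usual for an MLP classifier.
        targets = np.eye(N_CLASSES)[labels]
        return features, targets

    # features, targets = load_feature_set("chef.dat")   # e.g. the CHEF set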
REFERENCES
[1] M. Schetzen, The Volterra and Wiener Theories of Nonlinear Systems, Wiley, 1980.
[2] N. Gallagher and G. Wise, "A Theoretical Analysis of the Properties of Median Filters," IEEE Trans. on Acoust., Speech, Signal Proc., Vol. ASSP-29, pp. 1136-1141, Dec. 1981.
[3] G. Arce and N. Gallagher, "State Description for the Root-Signal Set of Median Filters," IEEE Trans. on Acoust., Speech, Signal Proc., Vol. ASSP-30, pp. 894-902, Dec. 1982.
[4] J. Fitch, E. Coyle and N. Gallagher, "Median Filtering by Threshold Decomposition," IEEE Trans. on Acoust., Speech, Signal Proc., Vol. ASSP-32, pp. 1183-1188, Dec. 1984.
[5] P. Maragos and R. Schafer, "Morphological Filters - Part I: Their Set-Theoretic Analysis and Relations to Linear Shift-Invariant Filters," IEEE Trans. on Acoust., Speech, Signal Proc., Vol. ASSP-35, pp. 1153-1169, Aug. 1987.
[6] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[7] K. Fukunaga. Introduction to Statistical Pattern Recognition. New York: Academic Press, 1972.
[8] D.F. Specht, "Probabilistic neural networks and the polynomial Adaline as complementary techniques for classification," IEEE Trans. Neural Networks, 1(1):111-121, Mar. 1990.
[9] M. Caudill, "The polynomial Adaline algorithm," Comput. Lang., Dec. 1988.
[10] D. Gabor et al., "A universal nonlinear filter, predictor and simulator which optimizes itself by a learning process," Proc. Inst. Elec. Eng., Vol. 108B, pp. 422-438, 1961.
[11] T. Cover and P. Hart, "Nearest Neighbor Pattern Classification," IEEE Trans. Information Theory, IT-13, pp. 21-27, 1967.
[12] O.J. Murphy, "Nearest neighbor pattern classification perceptrons," Proc. IEEE, 78(10):1595-1598, Oct. 1990.
[13] W.S. McCulloch and W.H. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics, Vol. 5, pp. 115-133, 1943.
[14] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, Vol. 65, pp. 386-408, 1958.
[15] W.Y. Huang and R.P. Lippmann, "Neural net and traditional classifiers," in D. Anderson, editor, Neural Info. Processing Syst., pp. 387-396, American Institute of Physics, New York, 1988.
[16] R.P. Lippmann, "Pattern classification using neural networks," IEEE Commun. Mag., Vol. 27, pp. 47-64, 1989.
[17] R.P. Lippmann, "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, pp. 4-22, April 1987.
[18] J.F. Steffensen. Interpolation. Chelsea Publishing Company, New York, 1950.
[19] D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning internal representations by error propagation," in D.E. Rumelhart and J.L. McClelland (Eds.), Parallel Distributed Processing, Vol. I, Cambridge, Massachusetts: The MIT Press, 1986.
[20] S. Knerr, L. Personnaz and G. Dreyfus, "Single-layer Learning Revisited: A Stepwise Procedure for Building and Training a Neural Network," NATO Workshop on Neurocomputing, Les Arcs, France, Feb. 1989.
[21] M. Mezard and J.P. Nadal, "Learning in Feedforward Layered Networks: the Tiling Algorithm," J. Phys. A, 22, pp. 2191-2203, 1989.
[22] Ehud D. Karnin, "A Simple Procedure for Pruning Back-Propagation Trained Neural Networks," IEEE Trans. on Neural Networks, Vol. 1, No. 2, 1990.
[23] M.C. Mozer and P. Smolensky, "Skeletonization: A technique for trimming the fat from a network via relevance assessment," in Advances in Neural Information Processing Systems I, D.S. Touretzky, Ed., Morgan Kaufmann, pp. 107-115, 1989.
[24] G. Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Math. Control, Signals, Syst., Vol. 2, pp. 303-314, 1989.
[25] A. Lapedes and R. Farber, "Nonlinear signal processing using neural networks: prediction and system modeling," Los Alamos National Laboratory, Los Alamos, N.M., TR LA-UR-87-2662, 1987.
[26] Robert Hecht-Nielsen, "Theory of the backpropagation neural network," in Proceedings of the International Joint Conference on Neural Networks, Vol. I, pp. 593-605, Washington D.C., June 1989.
[27] Maxwell Stinchcombe and Halbert White, "Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions," in Proceedings of the IJCNN, Vol. I, pp. 613-617, Washington D.C., June 1989.
[28] O. Nerrand, P. Roussel-Ragot, L. Personnaz and G. Dreyfus, "Neural Network Training Schemes for Non-linear Adaptive Filtering and Modelling," in Proceedings of the IJCNN, Vol. I, pp. 61-66, 1991.
[29] C. Klimasauskas, "Neural Nets and Noise Filtering," Dr. Dobb's Journal, pp. 32, Jan. 1989.
[30] Brooke Anderson and Don Montgomery, "A Method for Noise Filtering with Feed-forward Neural Networks: Analysis and Comparison with Low-pass and Optimal Filtering," in Proceedings of the IJCNN, Vol. I, pp. 209-214, 1990.
[31] P. Gallinari, S. Thiria and F. Fogelman Soulie, "Multilayer Perceptrons and data analysis," Proceedings of the IJCNN, Vol. I, pp. 391-399, 1988.
[32] H. Asoh and N. Otsu, "Nonlinear data analysis and multilayer perceptrons," in Proceedings of the IJCNN, Vol. II, pp. 411-415, 1989.
[33] Toshio Irino and Hideki Kawahara, "A Method for Designing Neural Networks Using Nonlinear Multivariate Analysis: Application to Speaker-Independent Vowel Recognition," Neural Computation, Vol. 2, No. 3, pp. 386-397, 1990.
[34] Osamu Fujita, "A Method for Designing the Internal Representation of Neural Networks," in Proceedings of the IJCNN, Vol. III, pp. 149-154, 1990.
[35] J. Park and I.W. Sandberg, "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, Vol. 3, No. 2, pp. 246-257, 1991.
[36] S. Qian, Y.C. Lee, R.D. Jones, C.W. Barnes and K. Lee, "Function Approximation with an Orthogonal Basis Net," in Proceedings of the IJCNN, Vol. III, pp. 605-619, 1990.
[37] Wilson J. Rugh. Nonlinear System Theory: The Volterra/Wiener Approach. The Johns Hopkins Univ. Press, 1981.
[38] Martin Schetzen, "Nonlinear System Modelling Based on the Wiener Theory," Proceedings of the IEEE, Vol. 69, No. 12, 1981.
[39] S. Chen, S.A. Billings and P.M. Grant, "Non-linear system identification using neural networks," Int. J. Control, Vol. 51, pp. 1191-1214, 1990.
[40] D.S. Broomhead and D. Lowe, "Multivariable Functional Interpolation and Adaptive Networks," Complex Systems, 2, pp. 321-355, 1988.
[41] M.J.D. Powell, "Radial basis functions for multi-variable interpolation: A review," IMA Conference on Algorithms for the Approximation of Functions and Data, RMCS Shrivenham, UK, 1985.
[42] C.A. Micchelli, "Interpolation of Scattered Data: Distance Matrices and Conditionally Positive Definite Functions," Constructive Approximation, 2, pp. 11-22, 1986.
[43] S. Chen, C.F.N. Cowan and P.M. Grant, "Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks," IEEE Trans. on Neural Networks, Vol. 2, No. 2, pp. 302-309, March 1991.
[44] Mu-Song Chen and M.T. Manry, "Back-Propagation Representation Theorem Using Power Series," in Proceedings of the IJCNN, Vol. I, pp. 643-648, 1990.
[45] A.N. Kolmogorov, "On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition," Dokl. Akad. Nauk USSR, pp. 953-956, 1957.
[46] R.R. Goldberg. Methods of Real Analysis. John Wiley and Sons, New York, 1976.
[47] R. Courant and D. Hilbert. Methods of Mathematical Physics, Vol. 1. Interscience Publishers, New York, 1955.
[48] D. Quintin Peasley. Coefficients of Associated Legendre Functions. Washington: National Aeronautics and Space Administration, 1976.
[49] J.T. Tou and R.C. Gonzalez. Pattern Recognition Principles. Addison-Wesley Publishing Company, 1981.
[50] R. Fletcher and C.M. Reeves, "Function minimization by conjugate gradients," Computer J., Vol. 7, pp. 149-154, 1964.
[51] R. Fletcher, "Conjugate direction methods," in Numerical Methods for Unconstrained Optimization, W. Murray, Ed., London and New York: Academic Press, pp. 73-86, 1972.
[52] H. Crowder and P. Wolfe, "Linear convergence of the conjugate gradient method," IBM J. Res. Develop., Vol. 16, pp. 431-433, 1972.
[53] J. Kowalik and M.R. Osborne. Methods for Unconstrained Optimization Problems. New York: American Elsevier, 1968.
[54] Richard L. Burden and J. Douglas Faires. Numerical Analysis, third edition, 1985.
[55] A. Bjorck, "Solving linear least squares problems by Gram-Schmidt orthogonalization," Nordisk Tidskr. Information-Behandling, Vol. 7, pp. 1-21, 1967.
[56] G. Golub, "Numerical methods for solving linear least squares problems," Numerische Mathematik, Vol. 7, pp. 206-216, 1965.
[57] A. Papoulis. Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill, 1965.
[58] E.A. Wan, "Neural Network Classification," IEEE Trans. on Neural Networks, Vol. 1, No. 4, 1990.
[59] Kreyszig, Erwin. Advanced Engineering Mathematics, Fourth Edition. New York: Wiley and Sons, 1979.
[60] Y-N Hsu and H.H. Arsenault, "Pattern discrimination by multiple circular harmonic components," Applied Optics, Vol. 23, pp. 841-844, 1984.
[61] Y. Sheng and H.H. Arsenault, "Experiments on pattern recognition using invariant Fourier-Mellin descriptors," Optical Society of America, Vol. 3, pp. 771-776, 1986.
[62] N. George, S. Wang and D.L. Venable, "Pattern recognition using ring-wedge detector and neural-network software," SPIE Vol. 1134, Optical Pattern Recognition II, pp. 96-106, 1989.
[63] D. Casasent and D. Psaltis, "Position, rotation, and scale invariant optical correlation," Applied Optics, Vol. 15, 1976.
[64] H.C. Yau, "Transform-based shape recognition employing neural networks," Ph.D. Dissertation, Univ. of Texas at Arlington, 1990.