Supervised Learning: Multilayer Networks II
Extensive research has been conducted, aimed at developing improved supervised learning algorithms for feedforward networks.
Madalines
A "Madaline" is a combination of many Adalines, trained by an algorithm that follows the "Minimum Disturbance" principle. Learning algorithms for Madalines have gone through three stages of development: MRI (Madaline Rule I), MRII, and MRIII.
Figure: Madaline I network with two Adalines whose outputs are combined by a fixed OR gate; inputs x1 and x2.
"Madaline I" has an adaptive hidden layer and a fixed logic device (AND, OR, or majority gate) at the outer layer. The goal of its training procedure is to decrease network error (at each input presentation) by making the smallest possible perturbation to the weights of some Adaline.
Note that an Adaline's output must change in sign in order to have any effect on the network output. If the net input into an Adaline is large, reversing its output requires a large change in its incoming weights. Therefore, the algorithm attempts to find an existing Adaline (hidden node) whose net input is smallest in magnitude, and whose output reversal would reduce network error. This output reversal is accomplished by applying the LMS algorithm to the weights on connections leading into that Adaline.
The number of Adalines whose outputs must be reversed depends on the choice of the outermost node function. If the outermost node computes the OR of its inputs, then the network output can be changed from 0 to 1 by modifying a single Adaline, but changing the network output from 1 to 0 requires modifying all the Adalines whose current output is 1.
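A minimal Python sketch of this minimum-disturbance step for Adalines feeding a fixed OR gate (as in the figure above); the function name, the 0/1 output convention, and the learning rate are illustrative assumptions rather than the textbook's exact procedure.

    import numpy as np

    def madaline1_step(W, x, target, lr=0.1):
        # W: (K, d) weight matrix, one row per Adaline; x: input of length d
        # (a bias can be a constant component of x); target: desired output in {0, 1}
        nets = W @ x                          # net input of each Adaline
        outs = (nets > 0).astype(int)         # Adaline outputs in {0, 1}
        y = int(outs.any())                   # fixed OR gate at the output
        if y == target:
            return W                          # already correct: disturb nothing
        if target == 1:
            # flipping one Adaline suffices; pick the smallest |net input|
            candidates = [int(np.argmin(np.abs(nets)))]
        else:
            # OR must become 0: every currently active Adaline must be reversed
            candidates = [k for k in range(W.shape[0]) if outs[k] == 1]
        for k in candidates:
            desired = 1.0 if target == 1 else -1.0
            # LMS (Widrow-Hoff) step pushing the net input toward the desired sign
            W[k] = W[k] + lr * (desired - nets[k]) * x
        return W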
The Madaline II architecture uses Adalines with modifiable weights at the output layer of the network, instead of fixed logic devices. The weights are initialized to small random values, and training patterns are repeatedly presented. The algorithm successively modifies weights in earlier layers before later ones, using the LMS or alpha-LMS rule.
Figure: Madaline II network with one hidden layer of Adalines.

Adalines whose weights have already been adjusted may be revisited when new input patterns are presented. It is possible to adjust the weights of several Adalines simultaneously. Since the node functions are not differentiable, there can be no explicit gradient descent. The algorithm may repeatedly cycle through the same sequence of states.
Madaline III networks are feedforward, with sigmoid node functions. Weights of all nodes are adapted in each iteration, using the MRIII training algorithm. Although the approach is equivalent to backpropagation, each weight change involves considerably more computation.
Algorithm MRIII:

    repeat
        Present a training pattern i to the network;
        Compute node outputs;
        for h = 1 to the number of hidden layers, do
            for each node k in layer h, do
                Let E be the current network error;
                Let the network error be E_k when a small quantity epsilon is added
                    to the kth node's net input;
                Adjust the weights on connections to the kth node using
                    Delta w = -eta i (E_k^2 - E^2)/epsilon,
                or equivalently (since E_k^2 - E^2 is approximately 2E(E_k - E))
                    Delta w = -2 eta i E (E_k - E)/epsilon;
            end-for;
        end-for;
    until network error is satisfactory or an upper bound on the number of
        iterations is reached.

Figure: MRIII training algorithm for Madaline III.
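A small Python sketch of one MRIII-style pass over a single pattern, for a network with one hidden layer of sigmoidal units and a single sigmoidal output node; the dictionary layout, the tanh node functions, and the constants are assumptions made for illustration.

    import numpy as np

    def mr3_pattern_update(weights, x, d, eps=1e-4, lr=0.05):
        # weights['hidden']: (H, n) hidden-layer weights; weights['out']: (H,)
        def forward(perturb=None):
            nets = weights['hidden'] @ x
            if perturb is not None:
                k, delta = perturb
                nets = nets.copy()
                nets[k] += delta              # add a small quantity to node k's net input
            z = np.tanh(nets)
            return np.tanh(weights['out'] @ z)

        E = (d - forward()) ** 2              # current squared network error
        for k in range(weights['hidden'].shape[0]):
            Ek = (d - forward((k, eps))) ** 2 # error after perturbing node k
            # finite-difference estimate of the error derivative drives the update
            weights['hidden'][k] -= lr * x * (Ek - E) / eps
        return weights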
Adaptive Multilayer Networks
Small networks are more desirable than large networks that perform the same task because:

1. training is faster;
2. the number of parameters in the system is smaller;
3. fewer training samples are needed;
4. the system is more likely to generalize well for new test samples.
There are three approaches to building a network of "optimal" size:

1. A large network may be built and then "pruned" by eliminating nodes and connections that can be considered unimportant.

2. Starting with a very small network, the size of the network is repeatedly increased by small increments until performance is satisfactory.

3. A sufficiently large network is trained, and unimportant connections and nodes are then pruned, following which new nodes with random weights are re-introduced and the network is retrained. This process is continued until a network of acceptable size and performance is obtained, or further pruning attempts are unsuccessful.
These methods are based on sound heuristics, but are not guaranteed to result in a network of optimal size.
Network pruning algorithms train a large network and then repeatedly "prune" it until both the size and the performance of the network are satisfactory. Pruning algorithms use penalties to help determine whether some of the nodes and links in a trained network can be removed, leading to a network with better generalization properties. A generic algorithm for such methods is described below.
Algorithm Network-Pruning:

    Train a network large enough for the current problem;
    repeat
        Find a node or connection whose removal does not penalize
            performance beyond a predetermined level;
        Delete this node or connection;
        (Optional) Retrain the resulting network;
    until further pruning degrades performance excessively.

Figure: Generic network pruning algorithm.
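A greedy Python sketch of this generic loop, using the simplest criterion (removing small-magnitude weights, the first of the procedures listed below); eval_error and tol stand for a validation-error routine and an acceptable error level, and retraining between deletions is omitted.

    import numpy as np

    def prune_by_magnitude(weights, eval_error, tol):
        # weights: dict of numpy arrays; eval_error(weights) -> validation error
        while True:
            best = None                       # smallest-magnitude nonzero weight
            for name, W in weights.items():
                nz = np.nonzero(W)
                if nz[0].size == 0:
                    continue
                idx = min(zip(*nz), key=lambda ij: abs(W[ij]))
                if best is None or abs(W[idx]) < abs(weights[best[0]][best[1]]):
                    best = (name, idx)
            if best is None:
                return weights                # nothing left to prune
            name, idx = best
            saved = weights[name][idx]
            weights[name][idx] = 0.0          # tentative deletion
            if eval_error(weights) > tol:     # removal degrades performance too much
                weights[name][idx] = saved
                return weights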
The following are some pruning procedures:

1. Connections associated with weights of small magnitude may be eliminated from the trained network. Nodes whose associated connections have small-magnitude weights may also be pruned.

2. Connections whose existence does not significantly affect network outputs (or error) may be pruned. These may be detected by examining the change in network output when a connection weight is changed to 0, or by testing whether the derivative of the output with respect to the weight is negligible, i.e., whether the network outputs change little when the weight is modified.

3. Input nodes can be pruned if the resulting change in network output is negligible. This results in a reduction of the relevant input dimensionality, by detecting which network inputs are unnecessary for output computation.
4. LeCun, Denker, and Solla identify the weights that can be pruned from the network by examining the second derivatives of the error function. Note that

    Delta E = (dE/dw) Delta w + (1/2) E'' (Delta w)^2,  where E'' = d/dw (dE/dw).

If a network has already been trained until a local minimum of E has been reached, using an algorithm such as backpropagation, then dE/dw is approximately 0 and the above equation simplifies to

    Delta E = (1/2) E'' (Delta w)^2.

Pruning a connection corresponds to changing the connection weight from w to 0, i.e., Delta w = -w. The connection is pruned if

    Delta E = (1/2) E'' (Delta w)^2 = E'' w^2 / 2

is small enough. The pruning criterion hence examines whether E'' w^2 / 2 is below some threshold.
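A Python sketch of the resulting pruning criterion; the second derivative E'' is estimated here by finite differences, whereas LeCun, Denker, and Solla compute it analytically with a backpropagation-like pass, so error_fn and the step size h are illustrative assumptions.

    import numpy as np

    def obd_saliencies(weights, error_fn, h=1e-3):
        # weights: flat array of trained weights; error_fn(w) -> training error
        E0 = error_fn(weights)
        saliency = np.empty_like(weights)
        for i in range(weights.size):
            wp = weights.copy(); wp[i] += h
            wm = weights.copy(); wm[i] -= h
            second = (error_fn(wp) - 2.0 * E0 + error_fn(wm)) / h ** 2   # E''
            saliency[i] = 0.5 * second * weights[i] ** 2                 # E'' w^2 / 2
        return saliency   # prune weights whose saliency falls below a threshold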
Marchand's algorithm obtains an "optimal" size network for classification problems by repeatedly adding a perceptron node to the hidden layer.

Let T_k^+ and T_k^- represent the training samples of the two classes that remain to be correctly classified at the kth iteration. At each iteration, a node's weights are trained such that either |T_{k+1}^+| < |T_k^+| or |T_{k+1}^-| < |T_k^-|, and (T_{k+1}^+ U T_{k+1}^-) is a proper subset of (T_k^+ U T_k^-), ensuring that the algorithm terminates eventually, at the mth step, when either T_m^+ or T_m^- is empty.

Eventually, we have two sets of hidden units (H^+ and H^-) that together give the correct answers on all samples of each class. The connection from the kth hidden node to the output node carries the weight

    w_k = 1/2^k if the kth node belongs to H^+, and w_k = -1/2^k if the kth node belongs to H^-.

This weight assignment scheme ensures that nodes added later do not modify the correct results obtained in earlier iterations of the algorithm.
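A Python sketch of just the output-combination step, assuming the hidden perceptrons and their H+/H- membership have already been determined; the 0/1 output convention and the threshold at zero are assumptions. Because the weights decay geometrically, the combined weight of all nodes added after position k is smaller in magnitude than the weight at position k, which is what prevents later nodes from overriding earlier ones.

    def marchand_output(hidden_outputs, signs):
        # hidden_outputs: 0/1 outputs of the hidden nodes, in the order added
        # signs: +1 for nodes in H+, -1 for nodes in H-
        total = 0.0
        for k, (out, sign) in enumerate(zip(hidden_outputs, signs), start=1):
            total += sign * out / 2 ** k      # weight +-(1/2^k) for the kth node
        return 1 if total > 0 else 0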
The upstart algorithm develops additional "subnets" for samples misclassified in earlier iterations. Each new node is trained using the pocket algorithm. If any samples are misclassified by a node, two subnets are introduced between the input nodes and the other nodes in the network. Each subnet attempts to correctly separate the samples of one class misclassified by the "parent" node from all samples of the other class for which the parent node was invoked. This is a simpler task than the one confronted by the parent node, ensuring that the recursive algorithm terminates. Large-magnitude weights are attached to the connections from the new nodes to the parent node, in such a way that neither of the new nodes can affect the performance of the parent node on samples correctly classified by the parent alone.

The upstart algorithm can be extended to introduce a small network module at a time, instead of a single node, as in the "block-start" algorithm.
The neural tree algorithm develops a decision tree, in which the output of a node (in {0, 1}) determines whether the final answer is to come from its left subtree or its right subtree. Each node is trained using the LMS or pocket algorithm. A node is a leaf node (new subtrees are not required) if all or most of the corresponding training samples belong to the same class.
Cascade correlation: This algorithm has two important features: (1) cascade architecture development, and (2) correlation learning. Moreover, the architecture is not strictly feedforward.

In the cascade architecture development process, new single-node hidden layers are successively added to a steadily growing layered neural network until performance is judged adequate. Each node may employ a nonlinear node function such as the hyperbolic tangent, whose output lies in the closed interval [-1.0, +1.0].
Fahlman and Lebiere suggest using the Quickprop learning algorithm. When a node is added, its input weights are trained first. Then all the weights on the connections to the output layer are trained, leaving the other weights unchanged.
Weights to each new hidden node are trained to maximize the covariance of the new node's output with the current network error, i.e., to maximize

    S(w_new) = sum over k = 1..K of | sum over p = 1..P of (x_new,p - xbar_new)(E_k,p - Ebar_k) |,

where w_new is the vector of weights to the new node from all the pre-existing input and hidden units, x_new,p is the output of the new node for the pth input sample, E_k,p is the error of the kth output node for the pth sample before the new node is added, and xbar_new and Ebar_k are averages over the training set.
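A Python sketch of this quantity for a single candidate node; the array shapes and names are assumptions, and the candidate's input weights would be adjusted (e.g., by gradient ascent with Quickprop) to make S as large as possible.

    import numpy as np

    def candidate_correlation(x_new, E):
        # x_new: (P,) candidate outputs over the P training patterns
        # E: (P, K) errors of the K output nodes before the candidate is added
        xc = x_new - x_new.mean()             # x_new,p - mean(x_new)
        Ec = E - E.mean(axis=0)               # E_k,p - mean(E_k)
        return np.abs(xc @ Ec).sum()          # sum over k of |covariance with error k|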
Figure (modified from the text): Cascade correlation example. Starting from a network with inputs x and y and a single output node, single-node hidden layers (first hidden node, then second hidden node) are added one at a time; solid lines indicate the connections currently being modified.
The tiling algorithm develops a multilayer feedforward network, successively adding new hidden layers into which flow the outputs of the previous layer. Within each layer, a "master" node is first trained using the pocket algorithm, and then "ancillary" nodes are successively added to separate samples of different classes that are not yet separated by the nodes currently in that layer. Each layer is constructed such that, for every pair of training samples that belong to different classes, some node in that layer produces different outputs.
Prediction Networks
Prediction problems constitute a special subclass of function approximation problems, in which the values of variables need to be determined from their values at previous instants. Two classes of neural networks have been used for prediction tasks: recurrent networks and feedforward networks.
Recurrent Networks
Recurrent neural networks contain connections from output nodes to hidden layer and/or input layer nodes, and they allow interconnections between nodes of the same layer, particularly between nodes of hidden layers.
Rumelhart, Hinton, and Williams's (1986) training procedure is essentially the same as the backpropagation algorithm. They view recurrent networks as feedforward networks with a large number of layers. Each layer is thought of as representing a time delay in the network. Using this approach, the fully connected neural network with three nodes (shown below) is considered equivalent to a feedforward neural network with k hidden layers, for some value of k. Weights in different layers are constrained to be identical, to capture the structure of the recurrent network: w^(k,k+1)_{i,j} = w^(k+1,k+2)_{i,j}.
Figure: (a) Fully connected recurrent neural network with three nodes; (b) its equivalent feedforward version for Rumelhart's training procedure, with one layer per time step (t = 0, 1, ..., k, k+1) and weights such as w_{1,1} and w_{3,1} repeated identically in every layer.
Figure: Recurrent network with hidden nodes, to which Williams and Zipser's algorithm can be applied.
Williams and Zipser proposed another training procedure, for a recurrent network with hidden nodes. The net input to the kth node consists of inputs from other nodes as well as external inputs. The output of each node depends on the outputs of the other nodes at the previous instant: o_k(t+1) = f(net_k(t)).
Algorithm Recurrent Network Training (Williams and Zipser):

    Assume randomly chosen weights, t = 0, and do_k/dw_{i,j} = 0 for each i, j, k;
    while MSE is unsatisfactory and computational bounds are not exceeded, do
        Modify the weights:
            Delta w_{i,j}(t) = eta * sum over k in U of (d_k(t) - o_k(t)) do_k(t)/dw_{i,j},
        where U is the set of nodes with a specified target value d_k(t);
        Update the sensitivities:
            do_k(t+1)/dw_{i,j} = f'(net_k(t)) [ sum over l in U of w_{k,l} dz_l(t)/dw_{i,j}
                                                + delta_{i,k} z_j(t) ];
        Increment t;
    end-while.

Figure: Williams and Zipser's recurrent network training algorithm.
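A compact Python sketch of one step of this procedure for a fully recurrent layer with tanh node functions; the names, array shapes, and learning rate are illustrative assumptions, and the sensitivity recursion here sums over all node outputs. P[k, i, j] stores the derivative of o_k with respect to w_{i,j}.

    import numpy as np

    def rtrl_step(W, P, z_prev, x_t, d_t, targets, lr=0.05):
        # W: (N, N+M) weights from N node outputs and M external inputs to the N nodes
        # P: (N, N, N+M) sensitivities do_k/dw_ij; z_prev: (N,) node outputs o(t)
        # x_t: (M,) external input; d_t: (N,) targets, used only at indices in `targets`
        N = W.shape[0]
        z = np.concatenate([z_prev, x_t])     # z_l(t): node outputs, then external inputs
        o = np.tanh(W @ z)                    # o_k(t+1) = f(net_k(t))
        fprime = 1.0 - o ** 2                 # f'(net_k(t)) for tanh

        # weight change: lr * sum_k (d_k(t) - o_k(t)) * do_k(t)/dw_ij
        err = np.zeros(N)
        err[targets] = d_t[targets] - z_prev[targets]
        dW = lr * np.einsum('k,kij->ij', err, P)

        # sensitivity recursion for do_k(t+1)/dw_ij
        P_new = np.einsum('kl,lij->kij', W[:, :N], P)
        for i in range(N):
            P_new[i, i, :] += z               # the delta_{i,k} * z_j(t) term
        P_new *= fprime[:, None, None]

        return W + dW, P_new, o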
Feedforward networks for forecasting

The generic network model consists of a preliminary preprocessing component that transforms an external input vector x(t) into a preprocessed vector x~(t). The feedforward network is trained to compute the desired output value for a specific input x~(t).
Figure: Generic neural network model for prediction. A preprocessor transforms the external input x(t) into x~(t), which the neural network predictor maps to the predicted value.
Tapped delay-line neural network (TDNN): Consider a prediction task in which x(t) is to be predicted from x(t-1) and x(t-2). In this simple case, x at time t consists of a single input x(t), and x~ at time t consists of the vector (x(t-1), x(t-2)) supplied as input to the feedforward network. For this example, preprocessing consists merely of storing past values of the variable and supplying them to the network along with the latest value.
Many preprocessing transformations for prediction problems can be described as

    x~_i(t) = sum over tau = 1..t of c_i(t - tau) x(tau),

where c is a known "kernel" function. For example, in "exponential trace memories", c_i(j) = (1 - mu_i) mu_i^j, where mu_i lies in (-1, 1). The kernel function for "discrete-time gamma memories" is

    c_i(j) = (j choose l_i) (1 - mu_i)^(l_i + 1) mu_i^(j - l_i)  if j >= l_i,  and 0 otherwise,

where the delay l_i is a non-negative integer and mu_i lies in [0, 1].
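Two small Python sketches of such preprocessors, one tapped delay line and one exponential trace memory; the delay tuple and the value of mu are illustrative choices.

    import numpy as np

    def tapped_delay_vector(series, t, delays=(1, 2)):
        # x~(t) = (x(t-1), x(t-2), ...): just store and replay past values
        return np.array([series[t - d] for d in delays])

    def exponential_trace(series, t, mu=0.7):
        # x~(t) = sum_j (1 - mu) * mu**j * x(t - j), the exponential trace kernel
        js = np.arange(t)                     # j = 0, ..., t-1
        kernel = (1.0 - mu) * mu ** js        # c(j) for the exponential trace
        past = np.asarray(series)[t::-1][:t]  # x(t), x(t-1), ..., x(1)
        return float(kernel @ past)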
Radial Basis Functions (RBFs)
A function is radially symmetric (or is an RBF) if its output depends on the distance of the input vector from a stored vector specific to that function. Neural networks whose node functions are radially symmetric are referred to as RBF-nets. RBF nodes compute a non-increasing function phi of distance, with phi(u1) >= phi(u2) whenever u1 < u2. The Gaussian,

    phi_g(u) = e^(-(u/c)^2),

is the most widely used RBF.
RBF-nets are generally called upon for use in function approximation problems, particularly for interpolation. In many function approximation problems, we need to determine the behavior of the function at a new input, given the behavior of the function at the training samples. Such problems are often solved by linear interpolation.
Figure: A new input x_0 surrounded by four training samples x_1, ..., x_4; D_j is the Euclidean distance between x_j and x_0, and f(x_j) is the observed output at x_j. The four nearest observations can be used for interpolation at x_0, giving

    ( f(x_1) D_1^-1 + f(x_2) D_2^-1 + f(x_3) D_3^-1 + f(x_4) D_4^-1 ) / ( D_1^-1 + D_2^-1 + D_3^-1 + D_4^-1 ).
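A Python sketch of this inverse-distance interpolation; RBF-nets generalize it by replacing 1/D with a radial basis function of D. It assumes x0 does not coincide with any observed input.

    import numpy as np

    def inverse_distance_interpolation(x0, xs, fs):
        # xs: (n, dim) observed inputs; fs: (n,) observed outputs f(x_j)
        D = np.linalg.norm(xs - x0, axis=1)   # distances D_j
        w = 1.0 / D                           # inverse-distance weights
        return float(w @ fs / w.sum())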
If nodes in an RBF-net, with one radial basis node function phi_1 (such as the inverse-distance function phi_1(D) = 1/D used above) centered at each of the training samples x_p, p = 1, ..., P, surround the new input x, the output of the network at x is proportional to

    (1/P) sum over p of d_p phi_1(||x - x_p||),

where d_p = f(x_p) is the output corresponding to x_p, the pth pattern. To avoid the problem of deciding which of the training samples surround the new input, we may use the output

    (1/P) sum over p = 1..P of d_p phi(||x - x_p||),

where d_1, ..., d_P are the outputs corresponding to the entire set of training samples, and phi is any RBF.
For the network size to be reasonably small, we cannot have one node representing each x_p. Hence similar training samples are clustered together, and the output is

    o = (1/N) sum over i = 1..N of d_i phi(||mu_i - x||),

where N is the number of clusters, mu_i is the center of the ith cluster, and d_i is the desired mean output of all samples of the ith cluster.
Training involves learning the values of the weights w_1, ..., w_N and the centers mu_1, ..., mu_N, minimizing

    E = sum over p = 1..P of E_p = sum over p = 1..P of (d_p - o_p)^2.

Gradient descent suggests the following update rules:

    Delta w_i = eta_i (d_p - o_p) phi(||x_p - mu_i||)  and
    Delta mu_{i,j} = eta_{i,j} w_i (d_p - o_p) R'(||x_p - mu_i||^2) (x_{p,j} - mu_{i,j}),

where R is defined for convenience such that R(z^2) = phi(z). Note that eta_i and eta_{i,j} may differ for each w_i and mu_{i,j}.

This learning algorithm requires considerable computation. A faster alternative, "partially off-line training", consists of two steps (sketched in code after this list):

1. Some clustering procedure is used to estimate the centroid mu_i and the spread of each cluster.

2. Using one node per cluster (with fixed mu_i), gradient descent on E is invoked to find the weights w_i.
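A Python sketch of this two-step procedure, with a few k-means-style iterations for step 1 and gradient descent on the output weights for step 2; the Gaussian basis function, fixed width, and all constants are illustrative assumptions.

    import numpy as np

    def train_rbf_offline(X, d, n_clusters=5, iters=200, lr=0.01, width=1.0, seed=0):
        # X: (P, dim) training inputs; d: (P,) desired outputs
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), n_clusters, replace=False)].astype(float)

        for _ in range(10):                                   # step 1: clustering
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
            for i in range(n_clusters):
                if np.any(labels == i):
                    centers[i] = X[labels == i].mean(axis=0)

        def phi(A):                                           # Gaussian RBF activations
            d2 = ((A[:, None] - centers) ** 2).sum(-1)
            return np.exp(-d2 / width ** 2)

        Phi = phi(X)
        w = np.zeros(n_clusters)
        for _ in range(iters):                                # step 2: fixed centers, learn w
            o = Phi @ w
            w += lr * Phi.T @ (d - o)                         # Delta w_i ~ (d_p - o_p) * phi
        return centers, w, phi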
Polynomial Networks
Many practical problems require computing or approximating functions that are polynomials of the input variables, a task that can require many nodes and extensive training if familiar node functions (sigmoids, Gaussians, etc.) are employed.

Networks whose node functions allow them to directly compute polynomials and functions of polynomials are referred to as "polynomial networks". The following are some examples.
1. Higher-order networks contain no hidden layers, but have nodes referred to as "higher-order processing units" that perform complex computations. A higher-order node applies a nonlinear function f to a polynomial in the input variables, so that the node output is

    f( w_0 + sum over j1 of w_{j1} i_{j1} + ... + sum over j1,j2,...,jk of w_{j1,j2,...,jk} i_{j1} i_{j2} ... i_{jk} ).

The LMS training algorithm can be applied to train such a network. However, the number of weights required increases rapidly with the input dimensionality n and the polynomial degree k.
2. Sigma-pi networks contain "sigma-pi units" that apply nonlinear functions to sums of weighted products of input variables:

    f( w_0 + ... + sum over j1 < j2 < j3 of w_{j1,j2,j3} i_{j1} i_{j2} i_{j3} + ... ).

This model does not allow terms containing higher powers of the input variables; therefore sigma-pi networks are not universal approximators.
3. "Product units" are nodes that compute products of the form (product over j = 1..n of i_j^(p_{j,i})), where each exponent p_{j,i} is an integer whose value is to be determined. Networks may contain these in addition to some "ordinary" nodes that apply sigmoids or step functions to their net input. The learning algorithm must determine both the weights and the exponents p_{j,i} for product units, in addition to determining the weights for ordinary nodes. Backpropagation training may be used, but learning is slow.
4. Networks with the functional link architecture conduct linear regression, estimating the weights in sum over j of w_j phi_j(i), where each "basis function" phi_j is chosen from a predefined set of components (such as linear functions, products, and sinusoidal functions).
5. Pi-sigma networks contain a hidden layer in which each hidden node computes w_{k,0} + sum over j of w_{k,j} i_j, and the output node computes f( product over k of ( w_{k,0} + sum over j of w_{k,j} i_j ) ). Weights between the input layer and the hidden layer are adapted during the training process, and can be trained quickly using the LMS algorithm (see the sketch after this list).
6. Ridge-polynomial networks consist of components that generalize pi-sigma networks, and are trained using an adaptive network construction algorithm. These networks have universal approximation capabilities.
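A Python sketch of the pi-sigma forward pass and a gradient-based (LMS-style) update of the input-to-hidden weights; the tanh output function, the weight layout, and the learning rate are illustrative assumptions.

    import numpy as np

    def pi_sigma_forward(W, x, f=np.tanh):
        # W: (K, n+1); row k holds (w_k0, w_k1, ..., w_kn)
        s = W[:, 0] + W[:, 1:] @ x            # hidden "sigma" units
        return f(np.prod(s)), s               # output node: f of the product

    def pi_sigma_update(W, x, d, lr=0.01, f=np.tanh):
        y, s = pi_sigma_forward(W, x, f)
        err = d - y
        fprime = 1.0 - y ** 2                 # derivative of tanh at the output
        prod = np.prod(s)
        for k in range(W.shape[0]):
            grad_sk = prod / s[k]             # d(prod)/ds_k, assuming s_k != 0
            W[k, 0] += lr * err * fprime * grad_sk
            W[k, 1:] += lr * err * fprime * grad_sk * x
        return W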
Regularization
Regularization involves optimizing a function E + lambda |P|, where E is the original cost (or error) function, P is a "stabilizer" that incorporates a priori problem-specific requirements or constraints, and lambda is a constant that controls the relative importance of E and P. Regularization can be implemented explicitly or implicitly.
Examples of explicit regularization:

- Introduction of P = sum over j of w_j^2 into the cost function penalizes large-magnitude weights.
- A weight decay term may be introduced into the generalized delta rule, so that the weight change

    Delta w = -eta (dE/dw) - delta w

depends on the negative gradient of the mean square error as well as on the current weight. This favors the development of networks with smaller weight magnitudes (a sketch follows this list).
- "Smoothing" penalties may be introduced into the error function, e.g., terms proportional to |d^2 E / dw_i^2| and other higher-order derivatives of the error function. Such penalties prevent a network's output function from having very high curvature, and may prevent a network from over-specializing on the training data to account for outliers in the data.
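A one-line Python sketch of the weight-decay form of the delta rule described above; names and constants are illustrative.

    def delta_rule_with_weight_decay(w, grad_E, lr=0.1, decay=1e-3):
        # w, grad_E: arrays holding the weights and dE/dw for the current pattern;
        # the decay term corresponds to the stabilizer P = sum_j w_j**2
        return w - lr * grad_E - decay * w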
Examples of implicit regularization:

- Training data may be perturbed artificially, using Gaussian noise. This is equivalent to using a stabilizer that imposes a smoothness constraint on the derivatives of the squared error function with respect to the input variables.

- Adding random noise to the weights imposes a smoothness constraint on the derivatives of the squared error function with respect to the weights.

- RBF-nets constitute a special case of networks that accomplish regularization.