
Chapter 4

Supervised Learning: Multilayer Networks II

Extensive research has been conducted aimed at developing improved supervised learning algorithms for feedforward networks.

4.1 Madalines

A "Madaline" is a combination of many Adalines, trained using an algorithm that follows the "Minimum Disturbance" principle. Learning algorithms for Madalines have gone through three stages of development: MRI (Madaline Rule I), MRII, and MRIII.

Figure 4.1: Madaline I network with two Adalines, whose outputs are combined by a fixed OR gate; inputs x1, x2.

"Madaline I" has an adaptive hidden layer and a fixed logic device (AND/OR/majority gate) at the outer layer. The goal of its training procedure is to decrease network error at each input presentation by making the smallest possible perturbation to the weights of some Adaline.

Note that an Adaline's output must change in sign in order to have any effect on the network output. If the net input into an Adaline is large, its output reversal requires a large change in its incoming weights. Therefore, the algorithm attempts to find an existing Adaline (hidden node) whose net input is smallest in magnitude and whose output reversal would reduce network error. This output reversal is accomplished by applying the LMS algorithm to the weights on connections leading into that Adaline.

The number of Adalines for which output reversal is necessary depends on the choice of the outermost node function. If the outermost node computes the OR of its inputs, then the network output can be changed from 0 to 1 by modifying a single Adaline, but changing the network output from 1 to 0 requires modifying all the Adalines whose current output is 1.
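The minimum-disturbance selection can be illustrated with a short sketch. The following Python fragment is only an illustration of the idea, not the original MRI procedure; the weight-matrix layout and the network_error callable are assumptions made for this example.

    import numpy as np

    def minimum_disturbance_step(adalines, x, target, network_error, eta=0.1):
        """Try reversing the Adaline whose net input is smallest in magnitude;
        keep the (LMS-style) change only if the overall Madaline error drops.
        `adalines` is a (num_adalines, dim) weight matrix, `x` the current input,
        `network_error(weights, x, target)` the error of the whole network."""
        nets = adalines @ x                              # net input of each Adaline
        base_error = network_error(adalines, x, target)
        for k in np.argsort(np.abs(nets)):               # smallest |net| first
            desired = 1.0 if nets[k] <= 0 else -1.0      # opposite sign of the current net
            trial = adalines.copy()
            trial[k] += eta * (desired - nets[k]) * x    # LMS step toward the reversal
            if network_error(trial, x, target) < base_error:
                return trial                             # accept the smallest useful disturbance
        return adalines                                  # no single reversal helps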


The Madaline II architecture uses Adalines with modifiable weights at the output layer of the network, instead of fixed logic devices.

The weights are initialized to small random values, and training patterns are repeatedly presented. The algorithm successively modifies weights in earlier layers before later ones, using the LMS or α-LMS rule.

Figure 4.2: Madaline II network with one hidden layer of Adalines.

Adalines whose weights have already been adjusted may be revisited when new input patterns are presented. It is possible to adjust the weights of several Adalines simultaneously. Since the node functions are not differentiable, there can be no explicit gradient descent. The algorithm may repeatedly cycle through the same sequence of states.


Madaline III networks are feedforward, with sigmoid node functions. Weights of all nodes are adapted in each iteration, using the MRIII training algorithm. Although the approach is equivalent to backpropagation, each weight change involves considerably more computation.

Algorithm MRIII:

    repeat
        Present a training pattern i to the network;
        Compute node outputs;
        for h = 1 to the number of hidden layers, do
            for each node k in layer h, do
                Let E be the current network error;
                Let the network error be E_k when a small quantity ε is added
                    to the k-th node's net input;
                Adjust the weights on connections to the k-th node using
                    Δw = -η i (E_k² - E²)/ε   or   Δw = -2 η i E (E_k - E)/ε;
            end-for;
        end-for;
    until network error is satisfactory or
        an upper bound on the number of iterations is reached.

Figure 4.3: MRIII training algorithm for Madaline III.
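As a concrete illustration of this perturbation-based update, the following NumPy sketch adjusts the incoming weights of a single hidden node. The forward_error interface and the parameter values are assumptions made for this example, not part of the MRIII specification.

    import numpy as np

    def mriii_update_node(W_in, node_k, x, forward_error, eta=0.05, eps=1e-3):
        """One MRIII-style weight change for hidden node `node_k`.
        `W_in` holds the incoming weights of the hidden nodes (one row per node);
        `forward_error(W, k, delta)` returns the network error when `delta` is
        added to node k's net input (delta = 0 gives the unperturbed error E)."""
        E = forward_error(W_in, node_k, 0.0)      # current network error
        E_k = forward_error(W_in, node_k, eps)    # error with the perturbed net input
        # Perturbation-based estimate of the gradient of E^2 w.r.t. the net input,
        # distributed over the incoming weights in proportion to the input vector x.
        delta_w = -eta * x * (E_k**2 - E**2) / eps
        W_new = W_in.copy()
        W_new[node_k] += delta_w
        return W_new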


4.2 Adaptive Multilayer Networks

Small networks are more desirable than large networks that perform the same task, because:

1. training is faster;
2. the number of parameters in the system is smaller;
3. fewer training samples are needed;
4. the system is more likely to generalize well for new test samples.

There are three approaches to building a network of "optimal" size:

1. A large network may be built and then "pruned" by eliminating nodes and connections that can be considered unimportant.

2. Starting with a very small network, the size of the network is repeatedly increased by small increments until performance is satisfactory.

3. A sufficiently large network is trained, and unimportant connections and nodes are then pruned, following which new nodes with random weights are re-introduced and the network is retrained. This process is continued until a network of acceptable size and performance is obtained, or further pruning attempts become unsuccessful.

These methods are based on sound heuristics, but are not guaranteed to result in a network of optimal size.


Network pruning algorithms train a large network and then repeatedly "prune" it until the size and performance of the network are both satisfactory. Pruning algorithms use penalties to help determine whether some of the nodes and links in a trained network can be removed, leading to a network with better generalization properties. A generic algorithm for such methods is described below.

Algorithm Network-Pruning:

    Train a network large enough for the current problem;
    repeat
        - Find a node or connection whose removal does not penalize
          performance beyond a predetermined level;
        - Delete this node or connection;
        - (Optional) Retrain the resulting network;
    until further pruning degrades performance excessively.

Figure 4.4: Generic network pruning algorithm.
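A compact sketch of this loop in Python, assuming an evaluate/retrain interface and magnitude-based candidate selection (both are choices made for this illustration, not prescribed by the generic algorithm):

    def prune_network(weights, evaluate, tolerance=0.02, retrain=None):
        """Generic pruning loop sketch.
        `weights` is a dict {connection_id: value}; `evaluate(weights)` returns
        the network error; `retrain(weights)`, if given, retrains the survivors."""
        base_error = evaluate(weights)
        while len(weights) > 1:
            # Candidate: the smallest-magnitude remaining connection.
            candidate = min(weights, key=lambda c: abs(weights[c]))
            trial = {c: w for c, w in weights.items() if c != candidate}
            if evaluate(trial) > base_error + tolerance:
                break                        # further pruning degrades performance too much
            weights = trial                  # accept the deletion
            if retrain is not None:
                weights = retrain(weights)   # optional retraining step
                base_error = evaluate(weights)
        return weights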

The following are some pruning procedures:

1. Connections associated with weights of small magnitude may be eliminated from the trained network. Nodes all of whose associated connections have small-magnitude weights may also be pruned.

2. Connections whose existence does not significantly affect network outputs (or error) may be pruned. These may be detected by examining the change in network output when a connection weight is changed to 0, or by testing whether ∂o/∂w is negligible, i.e., whether the network outputs change little when a weight is modified.

3. Input nodes can be pruned if the resulting change in network output is negligible. This results in a reduction of the relevant input dimensionality, by detecting which network inputs are unnecessary for output computation.

4. Le Cun, Denker, and Solla identify the weights that can be pruned from the network by examining the second derivatives of the error function. Note that

    ΔE ≈ (∂E/∂w) Δw + (E''/2)(Δw)²,   where E'' = ∂²E/∂w².

If a network has already been trained until a local minimum of E has been reached, using an algorithm such as backpropagation, then ∂E/∂w ≈ 0 and the above equation simplifies to

    ΔE ≈ (E''/2)(Δw)².

Pruning a connection corresponds to changing the connection weight from w to 0, i.e., Δw = -w. The connection is pruned if

    ΔE ≈ (E''/2)(Δw)² = E'' w²/2

is small enough. The pruning criterion hence examines whether E'' w²/2 is below some threshold.
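A small sketch of this saliency test, using a finite-difference estimate of E''; the error_fn interface and the finite-difference approximation are assumptions made for this illustration:

    def obd_prunable(error_fn, weights, index, threshold, h=1e-4):
        """Return True if the saliency E'' * w**2 / 2 of weights[index] is below
        `threshold`. `weights` is a NumPy array of trained weights; `error_fn(weights)`
        gives the network error; E'' is estimated by a central finite difference."""
        w = weights[index]
        bumped_up, bumped_dn = weights.copy(), weights.copy()
        bumped_up[index] = w + h
        bumped_dn[index] = w - h
        # Second derivative of E with respect to this single weight.
        e_dd = (error_fn(bumped_up) - 2 * error_fn(weights) + error_fn(bumped_dn)) / h**2
        saliency = 0.5 * e_dd * w**2     # estimated increase in E if w is set to 0
        return saliency < threshold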

Marchand's algorithm obtains an "optimal" size network for classification problems by repeatedly adding a perceptron node to the hidden layer.

Let T⁺_k and T⁻_k represent the training samples of the two classes that remain to be correctly classified at the k-th iteration. At each iteration, a node's weights are trained such that either |T⁺_(k+1)| < |T⁺_k| or |T⁻_(k+1)| < |T⁻_k|, and (T⁺_(k+1) ∪ T⁻_(k+1)) ⊂ (T⁺_k ∪ T⁻_k), ensuring that the algorithm eventually terminates at some m-th step, when either T⁺_m or T⁻_m is empty.

Eventually, we have two sets of hidden units (H⁺ and H⁻) that together give the correct answers on all samples of each class. The connection from the k-th hidden node to the output node carries the weight

    w_k =  (1/2)^k   if the k-th node belongs to H⁺,
    w_k = -(1/2)^k   if the k-th node belongs to H⁻.

The above weight assignment scheme ensures that nodes added later do not modify the correct results obtained in earlier iterations of the algorithm.


The Upstart algorithm develops additional "subnets" for samples misclassified in earlier iterations. Each new node is trained using the pocket algorithm. If any samples are misclassified by a node, two subnets are introduced between the input nodes and other nodes in the network. Each subnet attempts to correctly separate the samples of one class misclassified by the "parent" node from all samples of the other class for which the parent node was invoked. This is a simpler task than the one confronted by the parent node, ensuring that the recursive algorithm terminates. Large-magnitude weights are attached to the connections from the new nodes to the parent node, in such a way that neither of the new nodes can affect the performance of the parent node on samples correctly classified by the parent alone.

The upstart algorithm can be extended to introduce a small network module at a time, instead of a single node, as in the "block-start" algorithm.


The Neural Tree algorithm develops a decision tree in which the output of a node (∈ {0, 1}) determines whether the final answer is to come from its left subtree or its right subtree. Each node is trained using the LMS or pocket algorithm. A node is a leaf node (new subtrees are not required) if all or most of the corresponding training samples belong to the same class.


Cascade correlation: This algorithm has two important features: (1) the cascade architecture development, and (2) correlation learning. Moreover, the architecture is not strictly feedforward.

In the cascade architecture development process, new single-node hidden layers are successively added to a steadily growing layered neural network until performance is judged adequate. Each node may employ a nonlinear node function such as the hyperbolic tangent, whose output lies in the closed interval [-1.0, +1.0].

Fahlman and Lebiere suggest using the Quickprop learning algorithm. When a node is added, its input weights are trained first. Then all the weights on the connections to the output layer are trained, while leaving the other weights unchanged.

Weights to each new hidden node are trained to maximize the covariance of the new node's output with the current network error, i.e., to maximize

    S(w_new) = Σ_{k=1..K} | Σ_{p=1..P} (x_new,p − x̄_new)(E_k,p − Ē_k) |,

where w_new is the vector of weights to the new node from all the pre-existing input and hidden units, x_new,p is the output of the new node for the p-th input, E_k,p is the error of the k-th output node for the p-th sample before the new node is added, and x̄_new and Ē_k are averages over the training set.
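The quantity S can be computed directly from the candidate node's outputs and the residual output errors. A NumPy sketch, where the array shapes and names are assumptions made for this illustration:

    import numpy as np

    def candidate_score(x_new, residual_errors):
        """Cascade-correlation covariance score.
        x_new:           shape (P,)    outputs of the candidate node on the P patterns
        residual_errors: shape (P, K)  errors of the K output nodes before the node is added
        Returns S = sum_k | sum_p (x_new,p - mean(x_new)) * (E_k,p - mean(E_k)) |."""
        x_centered = x_new - x_new.mean()
        e_centered = residual_errors - residual_errors.mean(axis=0)
        return np.abs(x_centered @ e_centered).sum()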

Figure 4.5 (modified from a figure in the text): Cascade correlation example. Successive panels show the network growing from the input nodes x and y and a bias input to an output node, then adding a first and a second single-node hidden layer; solid lines indicate connections currently being modified.

The Tiling algorithm develops a multilayer feedforward network, successively adding new hidden layers into which flow the outputs of the previous layer. Within each layer, a "master" node is first trained using the pocket algorithm, and then "ancillary" nodes are successively added to separate samples of different classes that are not yet separated by the nodes currently in that layer. Each layer is constructed such that, for every pair of training samples that belong to different classes, some node in that layer produces different outputs.


4.3 Prediction Networks

Prediction problems constitute a special subclass of function approximation problems, in which the values of variables need to be determined from their values at previous instants. Two classes of neural networks have been used for prediction tasks: recurrent networks and feedforward networks.

Recurrent Networks

Recurrent neural networks contain connections from output nodes to hidden-layer and/or input-layer nodes, and they allow interconnections between nodes of the same layer, particularly between the nodes of hidden layers.

Rumelhart, Hinton, and Williams's training procedure is essentially the same as the backpropagation algorithm. They view recurrent networks as feedforward networks with a large number of layers. Each layer is thought of as representing a time delay in the network. Using this approach, the fully connected neural network with three nodes is considered equivalent to a feedforward neural network with k hidden layers, for some value of k. Weights in different layers are constrained to be identical, to capture the structure of the recurrent network: w_{i,j}^(ℓ+1, ℓ) = w_{i,j}^(ℓ+2, ℓ+1) for every layer ℓ.

Figure 4.6: (a) Fully connected recurrent neural network with 3 nodes; (b) the equivalent feedforward version for Rumelhart's training procedure, unfolded over time steps 0, 1, ..., k, k+1, with the same weights (e.g., w_{1,1}, w_{1,3}, w_{3,1}) repeated between consecutive layers.

Figure 4.7: Recurrent network with hidden nodes, to which Williams and Zipser's algorithm can be applied.

Williams and Zipser proposed another training procedure, for a recurrent network with hidden nodes. The net input to the k-th node consists of inputs from other nodes as well as external inputs. The output of each node depends on the outputs of other nodes at the previous instant: o_k(t+1) = f(net_k(t)).

Algorithm Recurrent-Network-Training (Williams and Zipser):

    Assume randomly chosen weights; let t = 0 and ∂o_k(0)/∂w_{i,j} = 0 for each i, j, k;
    while MSE is unsatisfactory and computational bounds are not exceeded, do
        Modify the weights:
            Δw_{i,j}(t) = η Σ_{k∈U} (d_k(t) − o_k(t)) ∂o_k(t)/∂w_{i,j},
        where U is the set of nodes with a specified target value d_k(t), and
            ∂o_k(t+1)/∂w_{i,j} = f'(net_k(t)) [ Σ_{ℓ∈U} w_{k,ℓ} ∂z_ℓ(t)/∂w_{i,j} + δ_{i,k} z_j(t) ];
        Increment t;
    end-while.

Figure 4.8: Williams and Zipser's recurrent network training algorithm.
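A compact NumPy sketch of this update (the tanh node function, the variable names, and the NaN convention for "no target specified" are assumptions made for this example):

    import numpy as np

    def rtrl_step(W, o_prev, x_t, d_t, p, eta=0.1):
        """One step of real-time recurrent learning for a fully recurrent net.
        W:      (n, n + m) weights; each node sees all n node outputs and m inputs
        o_prev: (n,)       node outputs o(t)
        x_t:    (m,)       external inputs at time t
        d_t:    (n,)       targets d(t) (np.nan where no target is specified)
        p:      (n, n, n + m) sensitivities p[k, i, j] = d o_k(t) / d w_{i,j}
        Returns the updated weights, the new outputs o(t+1), and the new sensitivities."""
        n = W.shape[0]
        z = np.concatenate([o_prev, x_t])    # node outputs first, then external inputs
        o = np.tanh(W @ z)                   # o_k(t+1) = f(net_k(t)) with f = tanh
        f_prime = 1.0 - o**2

        # Weight change: eta * sum_k (d_k(t) - o_k(t)) * p_k, over nodes with targets.
        e = np.where(np.isnan(d_t), 0.0, d_t - o_prev)
        delta_W = eta * np.tensordot(e, p, axes=1)

        # Sensitivity recursion: p_k <- f'(net_k) [ sum_l W[k,l] p_l + delta_{ik} z_j ].
        p_new = np.tensordot(W[:, :n], p, axes=1)       # sum over the node-output part only
        p_new[np.arange(n), np.arange(n), :] += z       # the delta_{ik} z_j(t) term
        p_new *= f_prime[:, None, None]

        return W + delta_W, o, p_new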

Feedforward networks for forecasting

The generic network model consists of a preliminary preprocessing component that transforms an external input vector x(t) into a preprocessed vector x̄(t). The feedforward network is trained to compute the desired output value for a specific input x̄(t).

Figure 4.9: Generic neural network model for prediction: a preprocessor transforms x(t) into x̄(t), which the neural network maps to the predicted value.

Tapped delay-line neural network (TDNN): Consider a prediction task where x(t) is to be predicted from x(t−1) and x(t−2). In this simple case, x at time t consists of a single input x(t), and x̄ at time t consists of the vector (x(t−1), x(t−2)) supplied as input to the feedforward network. For this example, preprocessing consists merely of storing past values of the variable and supplying them to the network along with the latest value.

Many preprocessing transformations for prediction problems can be described as

    x̄_i(t) = Σ_{τ=1..t} c_i(t − τ) x(τ),

where c is a known "kernel" function. For example, in "exponential trace memories", c_i(j) = (1 − μ_i) μ_i^j, where μ_i ∈ (−1, 1). The kernel function for "discrete-time gamma memories" is

    c_i(j) = C(j, l_i) (1 − μ_i)^(l_i + 1) μ_i^(j − l_i)   if j ≥ l_i,   and 0 otherwise,

where C(j, l_i) is the binomial coefficient, the delay l_i is a non-negative integer, and 0 ≤ μ_i ≤ 1.
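A short sketch of these two memory kernels; the function names and the use of a finite stored history are choices made for this illustration:

    import numpy as np
    from math import comb

    def exponential_trace(x_history, mu):
        """x_bar(t) = sum_j (1 - mu) * mu**j * x(t - j), an exponentially decaying
        memory of past values. x_history[0] is the most recent value x(t)."""
        j = np.arange(len(x_history))
        return float(np.sum((1.0 - mu) * mu**j * np.asarray(x_history)))

    def gamma_memory(x_history, mu, l):
        """Discrete-time gamma memory with integer delay l:
        c(j) = C(j, l) * (1 - mu)**(l + 1) * mu**(j - l) for j >= l, else 0."""
        total = 0.0
        for j, x in enumerate(x_history):     # j is the age of the sample
            if j >= l:
                total += comb(j, l) * (1.0 - mu)**(l + 1) * mu**(j - l) * x
        return total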


4.4 Radial Basis Functions (RBF)

A function is radially symmetric (or is an RBF) if its output depends on the distance of the input vector from a stored vector specific to that function. Neural networks whose node functions are radially symmetric functions are referred to as RBF-nets. RBF nodes compute a non-increasing function φ of distance, with φ(u₁) ≥ φ(u₂) whenever u₁ < u₂. The Gaussian,

    φ_g(u) = exp(−(u/c)²),

is the most widely used RBF.

RBF-nets are generally called upon for use in function approximation problems, particularly for interpolation. In many function approximation problems, we need to determine the behavior of the function at a new input, given the behavior of the function at the training samples. Such problems are often solved by linear interpolation.

Figure 4.10: D_j is the Euclidean distance between x_j and x₀; a label such as (x_j → 0) indicates that f(x_j) = 0. The four nearest observations can be used for interpolation at x₀, giving the inverse-distance-weighted estimate

    ( f(x₁) D₁⁻¹ + f(x₂) D₂⁻¹ + f(x₃) D₃⁻¹ + f(x₄) D₄⁻¹ ) / ( D₁⁻¹ + D₂⁻¹ + D₃⁻¹ + D₄⁻¹ ).


If nodes in an RBF-net with linear radial basis node functions (φ(D) = D), corresponding to each of the training samples x_p, p = 1, ..., P, surround the new input x, the output of the network at x is proportional to

    Σ_{p=1..P} d_p φ⁻¹(‖x − x_p‖),

where d_p = f(x_p) is the output corresponding to x_p, the p-th pattern. To avoid the problem of deciding which of the training samples surround the new input, we may use the output

    (1/P) Σ_{p=1..P} d_p φ(‖x − x_p‖),

where d_1, ..., d_P are the outputs corresponding to the entire set of training samples, and φ is any RBF.

For the network size to be reasonably small, we cannot have one node to represent each x_p. Hence similar training samples are clustered together, and the output is

    o = (1/N) Σ_{i=1..N} w_i φ(‖μ_i − x‖),

where N is the number of clusters, μ_i is the center of the i-th cluster, and w_i is the desired mean output of all samples of the i-th cluster.

Training involves learning the values of w_1, ..., w_N and μ_1, ..., μ_N, minimizing

    E = Σ_{p=1..P} E_p = Σ_{p=1..P} (d_p − o_p)².

Gradient descent suggests the following update rules:

    Δw_i = η_i (d_p − o_p) φ(‖x_p − μ_i‖),   and
    Δμ_{i,j} = η_{i,j} w_i (d_p − o_p) R'(‖x_p − μ_i‖²) · 2 (x_{p,j} − μ_{i,j}),

where R is defined for convenience such that R(z²) = φ(z). Note that η_i and η_{i,j} may differ for each w_i and μ_{i,j}.

This learning algorithm requires considerable computation. A faster alternative, "partially off-line training", consists of two steps (a sketch follows this list):

1. Some clustering procedure is used to estimate the centroid μ_i and the spread of each cluster.

2. Using one node per cluster (with fixed μ_i), gradient descent on E is invoked to find the weights w_i.
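The two-step procedure can be sketched as follows; the Gaussian width c, the learning rate, and the assumption that cluster centers come from some external clustering step are all illustrative choices:

    import numpy as np

    def gaussian_rbf(dist, c=1.0):
        """phi_g(u) = exp(-(u/c)**2), the most widely used RBF."""
        return np.exp(-(dist / c) ** 2)

    def train_rbf_weights(X, d, centers, c=1.0, eta=0.1, epochs=100):
        """'Partially off-line' RBF training sketch: the centers mu_i, an (N, dim)
        array, are fixed (step 1, assumed done by clustering); only the output
        weights w_i are fitted by gradient descent on E = sum_p (d_p - o_p)**2 (step 2)."""
        N = len(centers)
        w = np.zeros(N)
        for _ in range(epochs):
            for x_p, d_p in zip(X, d):
                phi = gaussian_rbf(np.linalg.norm(centers - x_p, axis=1), c)
                o_p = (phi @ w) / N               # o = (1/N) sum_i w_i phi(||mu_i - x||)
                w += eta * (d_p - o_p) * phi / N  # delta_w_i proportional to (d_p - o_p) phi_i
        return w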


4.5 Polynomial Networks

Many practical problems require computing or approximating functions that are polynomials of the input variables, a task that can require many nodes and extensive training if familiar node functions (sigmoids, Gaussians, etc.) are employed. Networks whose node functions allow them to directly compute polynomials and functions of polynomials are referred to as "polynomial networks". The following are some examples.

1. Higher-order networks contain no hidden layers, but have nodes referred to as "higher-order processing units" that perform complex computations. A higher-order node applies a nonlinear function f to a polynomial in the input variables, and the resulting node output is

    f( w₀ + Σ_{j₁} w_{j₁} i_{j₁} + ... + Σ_{j₁,j₂,...,j_k} w_{j₁,j₂,...,j_k} i_{j₁} i_{j₂} ... i_{j_k} ).

The LMS training algorithm can be applied to train such a network. However, the number of weights required increases rapidly with the input dimensionality n and the polynomial degree k. (A sketch of a second-order unit appears after this list.)

2. Sigma-pi networks contain "sigma-pi units" that apply nonlinear functions to sums of weighted products of input variables:

    f( w₀ + Σ_{j₁} w_{j₁} i_{j₁} + Σ_{j₁<j₂} w_{j₁,j₂} i_{j₁} i_{j₂} + Σ_{j₁<j₂<j₃} w_{j₁,j₂,j₃} i_{j₁} i_{j₂} i_{j₃} + ... ),

where the indices within each product are distinct. This model does not allow terms with higher powers of input variables; therefore sigma-pi networks are not universal approximators.

3. "Product units" are nodes that compute products Π_{j=1..n} i_j^(p_{j,i}), where each p_{j,i} is an integer whose value is to be determined. Networks may contain these in addition to some "ordinary" nodes that apply sigmoids or step functions to their net input. The learning algorithm must determine both the weights and the exponents p_{j,i} for the product units, in addition to determining the weights for ordinary nodes. Backpropagation training may be used, but learning is slow.

4. Networks with functional-link architecture conduct linear regression, estimating the weights in Σ_j w_j φ_j(i), where each "basis function" φ_j is chosen from a predefined set of components (such as linear functions, products, and sinusoidal functions).

5. Pi-sigma networks contain a hidden layer, in which each hidden node computes w_{k,0} + Σ_j w_{k,j} i_j, and the output node computes f( Π_k (w_{k,0} + Σ_j w_{k,j} i_j) ). Weights between the input layer and the hidden layer are adapted during the training process, and can be trained quickly using the LMS algorithm. (A pi-sigma sketch appears after this list.)

6. Ridge-polynomial networks consist of components that generalize pi-sigma networks, and are trained using an adaptive network construction algorithm. These networks have universal approximation capabilities.
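The following sketch illustrates two of the node types above: a second-order higher-order unit (item 1) and a pi-sigma unit (item 5). The tanh choice for f and the weight layouts are assumptions made only for illustration.

    import numpy as np
    from itertools import combinations_with_replacement

    def higher_order_unit(i, w0, w1, w2, f=np.tanh):
        """Second-order (k = 2) higher-order processing unit:
        output = f( w0 + sum_j w1[j] i[j] + sum_{j1<=j2} w2[(j1, j2)] i[j1] i[j2] )."""
        s = w0 + w1 @ i
        for (j1, j2), w in w2.items():      # w2 maps index pairs to weights
            s += w * i[j1] * i[j2]
        return f(s)

    def pi_sigma_output(i, W, w0, f=np.tanh):
        """Pi-sigma unit: each hidden node computes a linear sum w0[k] + W[k] @ i,
        and the output node applies f to the product of these sums."""
        h = w0 + W @ i                      # the "sigma" (summing) hidden nodes
        return f(np.prod(h))                # the "pi" (product) output node

    # Example with 3 inputs: the second-order unit already needs 1 + 3 + 6 weights,
    # illustrating how the parameter count grows with input dimensionality and degree.
    i = np.array([0.5, -1.0, 2.0])
    w2 = {pair: 0.1 for pair in combinations_with_replacement(range(3), 2)}
    print(higher_order_unit(i, w0=0.0, w1=np.zeros(3), w2=w2))
    print(pi_sigma_output(i, W=np.ones((2, 3)) * 0.2, w0=np.zeros(2)))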


4.6 Regularization

Regularization involves optimizing a function E + λ|P|², where E is the original cost (or error) function, P is a "stabilizer" that incorporates a priori problem-specific requirements or constraints, and λ is a constant that controls the relative importance of E and P. Regularization can be implemented explicitly or implicitly.

Examples of explicit regularization:

- Introduction of P = Σ_j w_j² into the cost function penalizes large-magnitude weights.

- A weight-decay term may be introduced into the generalized delta rule, so that the weight change

    Δw = -η (∂E/∂w) - λ w

  depends on the negative gradient of the mean square error as well as on the current weight. This favors the development of networks with smaller weight magnitudes. (A sketch of this update follows the list of explicit examples.)

- "Smoothing" penalties may be introduced into the error function, e.g., terms proportional to |∂²E/∂w_i²| and other higher-order derivatives of the error function. Such penalties prevent a network's output function from having very high curvature, and may prevent a network from over-specializing on the training data to account for outliers in the data.
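A minimal sketch of the weight-decay update above; the quadratic error and the constants are placeholders chosen only for illustration:

    import numpy as np

    def weight_decay_step(w, grad_E, eta=0.1, lam=1e-3):
        """One generalized-delta-rule step with weight decay:
        delta_w = -eta * dE/dw - lam * w, which follows the negative gradient
        of the error while also shrinking the weights toward zero."""
        return w - eta * grad_E(w) - lam * w

    # Tiny usage example with a quadratic error E(w) = ||w - target||^2.
    target = np.array([1.0, -2.0])
    w = np.zeros(2)
    for _ in range(200):
        w = weight_decay_step(w, grad_E=lambda w: 2 * (w - target))
    print(w)   # settles slightly short of `target` because of the decay term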

Examples of implicit regularization:

- Training data may be perturbed artificially, using Gaussian noise. This is equivalent to using a stabilizer that imposes a smoothness constraint on the derivative of the squared error function with respect to the input variables.

- Adding random noise to the weights imposes a smoothness constraint on the derivatives of the squared error function with respect to the weights.

- RBF-nets constitute a special case of networks that accomplish regularization.