Chapter 5

Unsupervised Learning

"Unsupervised" learning proceeds teacher-less, to discover special features and patterns from available data without using external help, for tasks such as clustering, vector quantization, approximating probability distributions, feature extraction, and dimensionality reduction.
5.1 Winner-Take-All Networks

Most unsupervised neural networks rely on algorithms to compute and compare distances, determine the "winner" node with the highest level of activation, and use these to adapt network weights.
Hamming networks, shown in Figure 5.1, compute the "Hamming distance," the number of differing bits in two vectors. Weights on links from an input layer to an output layer represent components of stored input patterns.
Figure 5.1: A network to calculate Hamming distances $H$ between stored vectors and input vectors; the output of the third node for a new input vector is $n - H(\text{new input vector}, i_3)$.
Let $i_{p,j} \in \{-1, 1\}$ be the $j$th element of the $p$th input vector. Components of the weight matrix $W$ and of the threshold vector $\theta$ are given by $w_{p,j} = i_{p,j}/2$ and $\theta_p = n/2$, respectively.
When a new input vector $i$ is presented to this network, its upper-level nodes generate the output
$$o = Wi + \theta = \frac{1}{2}\begin{pmatrix} i_1 \cdot i + n \\ i_2 \cdot i + n \\ \vdots \\ i_P \cdot i + n \end{pmatrix}.$$
Since $o_p = n - H(i_p, i)$, the network outputs can be used to compare the Hamming distances between the stored patterns and an input pattern.
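To make this concrete, here is a minimal NumPy sketch of the computation; the stored patterns and the probe vector are made-up placeholders:

```python
import numpy as np

# Stored bipolar patterns (rows), values in {-1, +1}; placeholders for illustration.
patterns = np.array([[ 1, -1,  1, -1],
                     [ 1,  1,  1,  1],
                     [-1, -1, -1,  1]])
P, n = patterns.shape

W = patterns / 2.0           # weights w_{p,j} = i_{p,j} / 2
theta = n / 2.0              # common threshold n/2

x = np.array([1, -1, 1, 1])  # new input vector
o = W @ x + theta            # o_p = n - HammingDistance(i_p, x)
print(o, n - o)              # the node with the largest output is closest
```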
Maxnet is a recurrent one-layer network (cf. Figure 5.2) used to determine which node has the highest initial activation.
Figure 5.2: Maxnet, a simple competitive network.
All nodes update their outputs simultaneously. Each node receives inhibitory inputs from all other nodes via "lateral" (intra-layer) connections of weight $-\epsilon$, while its self-excitation weight is $\theta$. Choosing $\theta = 1$ and $\epsilon < 1/(\text{number of nodes})$ ensures that a single node can prevail as the only active or "winner" node, while the activation of every other node subsides to zero. The node function is $f(\text{net}) = \max(0, \text{net})$, where $\text{net} = \sum_i w_i x_i$.
Example 5.1: Let the initial activation values be $0.5, 0.9, 1.0, 0.9, 0.9$, with $\theta = 1$ and $\epsilon = 0.2$. In the first iteration, the output of the first node falls to $\max(0,\; 0.5 - 0.2(0.9 + 1.0 + 0.9 + 0.9)) = 0$; the outputs of the second, fourth and fifth nodes fall to $\max(0,\; 0.9 - 0.2(0.5 + 1.0 + 0.9 + 0.9)) = 0.24$; and the output of the third node falls to $\max(0,\; 1.0 - 0.2(0.5 + 0.9 + 0.9 + 0.9)) = 0.36$.

In further iterations, $(0, 0.24, 0.36, 0.24, 0.24) \to (0, 0.072, 0.216, 0.072, 0.072) \to (0, 0, 0.1728, 0, 0)$, which is stable. Only three iterations were sufficient, even though some nodes had initial values close to the winner's. The maxnet allows for greater parallelism in execution than an algebraic solution, since every computation is local to each node rather than centralized.
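A minimal sketch of the maxnet iteration, using the activation values of Example 5.1 (the stopping test is an implementation convenience, not part of the network definition):

```python
import numpy as np

def maxnet(activations, eps=0.2, max_iters=100):
    """Iterate x_j <- max(0, x_j - eps * (sum of the other activations))
    until a single winner remains."""
    x = np.array(activations, dtype=float)
    for _ in range(max_iters):
        x_new = np.maximum(0.0, x - eps * (x.sum() - x))  # self-weight theta = 1
        if np.array_equal(x_new, x) or (x_new > 0).sum() <= 1:
            return x_new
        x = x_new
    return x

print(maxnet([0.5, 0.9, 1.0, 0.9, 0.9]))   # -> third node wins
```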
Simple competitive learning.

The Hamming net and maxnet assist more complex unsupervised learning networks, helping to determine the node whose weight vector is nearest to an input pattern.

Simple competitive learning can be accomplished using a network that connects each input node to each output node, with inhibitory connections among output nodes, as shown in Figure 5.3.

The $j$th output node is described by the vector of weights to it from the input nodes, $w_j = (w_{j,1}, \ldots, w_{j,n})$. Initially, all connection weights are assigned randomly, with values between 0 and 1.
Figure 5.3: Simple competitive learning network: an input layer fully connected to an output layer with inhibitory connections; a node's output is 1 if it is the winner and 0 otherwise.

A competition occurs to find the "winner" among the nodes in the outer layer, defined as the node $j^*$ whose weight vector (or "prototype" or "node position") is nearest to the input vector $i_\ell$, i.e.,
$$d(w_{j^*}, i_\ell) \le d(w_j, i_\ell) \quad \text{for all } j \in \{1, \ldots, m\},$$
where
$$d(w_j, i_\ell) = \sqrt{\sum_{k=1}^{n} (i_{\ell,k} - w_{j,k})^2}.$$
If each $\|w_j\| = 1$ and each $\|i_\ell\| = 1$, then $d^2(w, i)$ simplifies to $2 - 2\sum_k w_k i_k = 2(1 - i \cdot w)$. This distance is minimized when the dot product $i \cdot w$ takes its maximum value; a maxnet can be invoked for this purpose.
Connection weights of the winner node $j^*$ are modified so that the weight vector $w_{j^*}$ moves closer to $i_\ell$, while the other weights remain unchanged. This is the winner-take-all phase. The weight update rule can be described as follows:
$$w_j(\text{new}) = w_j(\text{old}) + \eta\,(i_\ell - w_j(\text{old}))\,\delta(j, \ell), \tag{5.1}$$
where
$$\delta(j, \ell) = \begin{cases} 1 & \text{if the } j\text{th node is the winner for input } i_\ell, \\ 0 & \text{otherwise.} \end{cases}$$
Each weight vector $w_j$ moves successively closer to
$$\bar{w}_j = \frac{1}{\sum_\ell \delta(j, \ell)} \sum_\ell i_\ell\, \delta(j, \ell),$$
the average of the input vectors for which $w_j$ is the winner. The learning rate $\eta$ may vary such that $\eta(t+1) \le \eta(t)$, resulting in faster convergence.
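A minimal sketch of this update rule, under the assumption of a multiplicatively decaying learning rate (the data are those of Example 5.2 below):

```python
import numpy as np

def competitive_learning(inputs, m=3, eta=0.5, epochs=10, seed=0):
    """Simple competitive learning: move only the winner toward each input."""
    rng = np.random.default_rng(seed)
    W = rng.random((m, inputs.shape[1]))               # random prototypes in [0,1]
    for _ in range(epochs):
        for x in inputs:
            j = np.argmin(((W - x) ** 2).sum(axis=1))  # winner: nearest prototype
            W[j] += eta * (x - W[j])                   # move winner toward x
        eta *= 0.9                                     # decreasing learning rate
    return W

data = np.array([[1.1, 1.7, 1.8], [0, 0, 0], [0, 0.5, 1.5],
                 [1, 0, 0], [0.5, 0.5, 0.5], [1, 1, 1]])
print(competitive_learning(data))                      # rows approach centroids
```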
Example 5.2: Let the input vectors be $T = \{i_1 = (1.1, 1.7, 1.8),\; i_2 = (0, 0, 0),\; i_3 = (0, 0.5, 1.5),\; i_4 = (1, 0, 0),\; i_5 = (0.5, 0.5, 0.5),\; i_6 = (1, 1, 1)\}$. The network contains three input nodes; assume that there are also three output nodes $A, B, C$.
Algorithm (Simple Competitive Learning):
  Initialize weights randomly;
  repeat
    (Optional:) Adjust the learning rate $\eta(t)$;
    Select an input pattern $i_k$;
    Find the node $j^*$ whose weight vector $w_{j^*}$ is closest to $i_k$;
    Update each weight $w_{j^*,1}, \ldots, w_{j^*,n}$ using the rule
      $\Delta w_{j^*,\ell} = \eta(t)\,(i_{k,\ell} - w_{j^*,\ell})$, for $\ell \in \{1, \ldots, n\}$;
  until the network converges or computational bounds are exceeded.

Figure 5.4: Simple competitive learning algorithm.
Assume the initial weights are randomly chosen as
$$W = \begin{pmatrix} w_A \\ w_B \\ w_C \end{pmatrix} = \begin{pmatrix} 0.2 & 0.7 & 0.3 \\ 0.1 & 0.1 & 0.9 \\ 1 & 1 & 1 \end{pmatrix},$$
and assume $\eta = 0.5$.
The first sample presented is $i_1 = (1.1, 1.7, 1.8)$. The squared Euclidean distance between $A$ and $i_1$ is $(1.1 - 0.2)^2 + (1.7 - 0.7)^2 + (1.8 - 0.3)^2 \approx 4.1$. Similarly, the squared distances to $B$ and $C$ are $4.4$ and $1.1$, respectively; hence $C$ is the winner. The weights of $C$ are updated, moving halfway towards the sample since $\eta = 0.5$. The resulting weight matrix is
$$W = \begin{pmatrix} w_A \\ w_B \\ w_C \end{pmatrix} = \begin{pmatrix} 0.2 & 0.7 & 0.3 \\ 0.1 & 0.1 & 0.9 \\ 1.05 & 1.35 & 1.4 \end{pmatrix}.$$
Assuming the input vectors are presented repeatedly in the same sequence, each weight vector approaches the centroid of the samples for which it wins: $w_A$ approaches the centroid of $\{i_2, i_4, i_5\}$, roughly $(0.5, 0.17, 0.17)$; $w_B$ approaches $i_3 = (0, 0.5, 1.5)$; and $w_C$ approaches the centroid of $\{i_1, i_6\}$, roughly $(1.05, 1.35, 1.4)$.
- Each node tends to be the winner for the same input vectors in later iterations, and moves towards their centroid.
- Convergence is not smooth when $\eta$ is high.
- Different results ensue if another metric is used, such as the Manhattan distance.
- Results depend on the initial values of the node positions.
- Using three nodes is no guarantee that three "clusters" will be obtained by the network.
- Results depend on the sequence in which samples are presented.
- Weight vectors will continue to change at every sample presentation, unless the sample presented coincides with a weight vector.
- Each node can store its position at the beginning of the previous epoch and compare this with its position at the beginning of the next epoch; computations can be terminated when node positions cease to change.
- If the set of input patterns is fixed, and if the same nodes continue to be the winners for the same input patterns, computation can be speeded up by non-neurally computing the centroids of the input pattern clusters, stored as the weights of the relevant nodes.
The simple competitive learning algorithm conducts stochastic gradient descent on the quantization error
$$E = \sum_p \| i_p - W(i_p) \|^2,$$
where $W(i_p)$ is the weight vector nearest to $i_p$.

The number of nodes is assumed to be fixed, but the right choice may not be obvious. We may attempt to minimize $E + c\,(\text{number of nodes})$ instead of $E$, for some $c > 0$.
k-Means Clustering (cf. Figure 5.5) is a statistical procedure closely related to simple competitive learning, which computes cluster centroids directly instead of making small updates to node positions.

The k-means algorithm is fast, and converges to a state in which each prototype changes little, assuming that successive vectors presented to the algorithm are drawn by independent random trials from the input data distribution. Both k-means and the simple competitive learning algorithm can lead to local minima of $E$.
Example 5.3: Given five one-dimensional inputs, the weights of three nodes may move by simple competitive learning to a configuration whose quantization error is not optimal: each node settles at the centroid of the samples it wins, but a different partition of the samples among the three nodes yields a lower error.
Algorithm (k-Means Clustering):
  Initialize $k$ prototypes $w_j = i_\ell$, $j \in \{1, \ldots, k\}$, $\ell \in \{1, \ldots, P\}$;
  each cluster $C_j$ is associated with prototype $w_j$;
  repeat
    for each input vector $i_\ell$, do
      place $i_\ell$ in the cluster with the nearest prototype $w_{j^*}$;
    end-for;
    for each cluster $C_j$, do
      $w_j = \dfrac{\sum_{i_\ell \in C_j} i_\ell}{|C_j|}$;
    end-for;
    compute $E = \sum_{j=1}^{k} \sum_{i_\ell \in C_j} | i_\ell - w_j |^2$;
  until $E$ no longer decreases, or cluster memberships stabilize.

Figure 5.5: k-means clustering algorithm.
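The algorithm of Figure 5.5 can be sketched as follows; initialization from randomly chosen data points is one common choice:

```python
import numpy as np

def kmeans(X, k=3, max_iters=100, seed=0):
    """Plain k-means: assign points to the nearest prototype, then recompute
    each prototype as the centroid of its cluster."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=k, replace=False)]   # prototypes from the data
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (P, k) distances
        labels = d2.argmin(axis=1)
        W_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else W[j] for j in range(k)])
        if np.allclose(W_new, W):                      # memberships stabilized
            break
        W = W_new
    E = ((X - W[labels]) ** 2).sum()                   # quantization error
    return W, labels, E
```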
5.2 Learning Vector Quantizers

Learning vector quantizers (LVQ) apply simple competitive learning to solve supervised learning tasks, where class membership is known for each training pattern.

Each node is associated with a predetermined class label. The number of nodes chosen for each class is roughly proportional to the number of training patterns in that class.

The learning rate varies with time, e.g., $\eta(t) = 1/t$ or $\eta(t) = a(1 - t/A)$, where $a > 0$ and $A > 1$. This helps the network converge to a state in which the weight vectors are stable.
Example 5.4: Let the input vectors be the same as in Example 5.2: $i_1 = (1.1, 1.7, 1.8)$, $i_2 = (0, 0, 0)$, $i_3 = (0, 0.5, 1.5)$, $i_4 = (1, 0, 0)$, $i_5 = (0.5, 0.5, 0.5)$, and $i_6 = (1, 1, 1)$. Assume only the first and last samples come
Algorithm (LVQ1):
  Initialize each weight to a value in $(0, 1)$;
  associate each node with one of the classes;
  repeat
    adjust $\eta(t)$;
    for each $i_k$, do
      find the node $j^*$ whose weight vector $w_{j^*}$ is closest to $i_k$;
      for $\ell = 1, \ldots, n$, do
        if the class label of node $j^*$ equals the desired class of $i_k$
        then $\Delta w_{j^*,\ell} = \eta(t)\,(i_{k,\ell} - w_{j^*,\ell})$
        else $\Delta w_{j^*,\ell} = -\eta(t)\,(i_{k,\ell} - w_{j^*,\ell})$;
  until the network converges or computational bounds are exceeded.

Figure 5.6: LVQ1 algorithm.
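A minimal sketch of LVQ1, applied to the data and initial weights of Example 5.4 below (the learning-rate decay factor is an illustrative assumption):

```python
import numpy as np

def lvq1(X, labels, W, node_classes, eta=0.5, epochs=3, decay=0.5):
    """LVQ1: the winner moves toward the sample if its class matches, else away."""
    for _ in range(epochs):
        for x, c in zip(X, labels):
            j = np.argmin(((W - x) ** 2).sum(axis=1))   # nearest prototype
            sign = 1.0 if node_classes[j] == c else -1.0
            W[j] += sign * eta * (x - W[j])
        eta *= decay                                    # shrinking learning rate
    return W

X = np.array([[1.1, 1.7, 1.8], [0, 0, 0], [0, 0.5, 1.5],
              [1, 0, 0], [0.5, 0.5, 0.5], [1, 1, 1]])
y = np.array([1, 2, 2, 2, 2, 1])                        # classes of the samples
W0 = np.array([[0.2, 0.7, 0.3], [0.1, 0.1, 0.9], [1.0, 1.0, 1.0]])
print(lvq1(X, y, W0.copy(), node_classes=[1, 2, 2]))
```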
from Class 1, and let
$$W = \begin{pmatrix} w_1 \\ w_2 \\ w_3 \end{pmatrix} = \begin{pmatrix} 0.2 & 0.7 & 0.3 \\ 0.1 & 0.1 & 0.9 \\ 1 & 1 & 1 \end{pmatrix},$$
where $w_1$ is associated with Class 1 and the other two nodes with Class 2. Let $\eta(t) = 0.5$ initially, reduced step-wise in later epochs.

1. Sample $i_1$ (Class 1) is presented; the winner is $w_3$ (squared distance $1.1$). Since $w_3$ is associated with Class 2, it is moved away from the sample: $w_3$ changes to $(1,1,1) - 0.5\,((1.1, 1.7, 1.8) - (1,1,1)) = (0.95, 0.65, 0.6)$.

Subsequent samples are processed in the same manner. Associations between input samples and weight vectors stabilize by the second cycle of pattern presentations. The weight vectors continue to change, converging in later iterations to the centroids of their associated input vectors.
Learning rule for "LVQ2": Let $w_1, w_2$ be the two weight vectors nearest to the input vector $i$. If both weight vectors are at comparable distances from $i$, but only $w_2$ is associated with the desired class, then $w_2$ is moved closer to the presented pattern and $w_1$ is moved farther from it, using an update rule similar to that of LVQ1.
5.3 Counterpropagation

The "forward-only" counterpropagation network consists of an input layer, a hidden layer, and an output layer. Connections exist from every input node to every hidden node, and from every hidden node to every hidden and output node.

Forward-only counterpropagation training consists of two phases, each of which uses a separate time-varying learning rate, $\alpha(t)$ and $\beta(t)$.

Phase 1: Hidden layer nodes are used to divide the training set into clusters containing similar input patterns, using the simple competitive learning rule.
Figure 5.7: Architecture of the forward-only counterpropagation neural network: an input layer; a hidden layer with inhibitory connections, whose incoming weights are trained by simple competitive learning; and an output layer, whose incoming weights are trained by the outstar rule.
Phase 2: Weights on edges out of the winner hidden node $k^*$ are modified using gradient descent:
$$\Delta v_{j,k^*} = \beta(t)\,(d_{p,j} - v_{j,k^*}),$$
where $d_p$ is the desired output vector.

The system can evolve with time, since small corrections may be made to the first-layer weights even during the second phase.

After training, the network is used to find the hidden node whose weight vector best matches the input vector. The output pattern generated by the network is identical to the vector of weights leading out of the winner hidden node.
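A minimal sketch of the two training phases and of recall, with illustrative constant-within-phase learning rates standing in for the time-varying $\alpha(t)$ and $\beta(t)$:

```python
import numpy as np

def train_counterprop(X, D, m=4, alpha=0.3, beta=0.3, epochs=50, seed=0):
    """Forward-only counterpropagation, sketched in two phases."""
    rng = np.random.default_rng(seed)
    W = rng.random((m, X.shape[1]))          # input -> hidden (competitive layer)
    V = rng.random((D.shape[1], m))          # hidden -> output (outstar weights)
    for _ in range(epochs):                  # Phase 1: cluster the inputs
        for x in X:
            k = np.argmin(((W - x) ** 2).sum(axis=1))
            W[k] += alpha * (x - W[k])
    for _ in range(epochs):                  # Phase 2: learn outputs per cluster
        for x, d in zip(X, D):
            k = np.argmin(((W - x) ** 2).sum(axis=1))
            V[:, k] += beta * (d - V[:, k])
    return W, V

def recall(x, W, V):
    k = np.argmin(((W - x) ** 2).sum(axis=1))
    return V[:, k]                           # weights out of the winner node
```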
Full counterpropagation works in both directions:
1. to predict $i_p$, given $d_p$; and
2. to predict $d_p$, given $i_p$.
Initially, all four sets of weights $w^{(1)}, w^{(2)}, v^{(1)}, v^{(2)}$ in Figure 5.9 are assigned random values between 0 and 1.
Algorithm (Forward-only Counterpropagation):
  Initialize weights randomly in $(0, 1)$, and select $\alpha(0), \beta(0)$;
  Phase 1: repeat
    for $p = 1, \ldots, P$, do
      find the winner $k^*$, minimizing $\sum_{j=1}^{n} (w_{k^*,j} - i_{p,j})^2$;
      update each $w_{k^*,\ell} := (1 - \alpha)\, w_{k^*,\ell}(\text{old}) + \alpha\, i_{p,\ell}$;
    end-for;
    decrease $\alpha$, e.g., by a small constant;
  until performance or computational bounds are met.
  Phase 2: repeat
    for $p = 1, \ldots, P$, do
      find the winner $k^*$ as in Phase 1;
      for each $j, \ell$, update
        $w_{k^*,\ell} := (1 - \alpha)\, w_{k^*,\ell}(\text{old}) + \alpha\, i_{p,\ell}$;
        $v_{j,k^*} := (1 - \beta)\, v_{j,k^*}(\text{old}) + \beta\, d_{p,j}$;
    end-for;
    decrease $\beta$;
  until performance or computational bounds are met.

Figure 5.8: Forward-only counterpropagation learning algorithm.
Figure 5.9: Full counterpropagation neural network: an input pattern $i_p$ and an output pattern $d_p$ are both presented to a shared hidden layer with inhibitory connections, via weights $w^{(1)}$ and $w^{(2)}$; weights $v^{(1)}$ and $v^{(2)}$ generate the predicted output and predicted input patterns, respectively.
Phase 1 adjusts the weights $w^{(1)}$ and $w^{(2)}$ associated with connections leading into the hidden nodes; $i_p$ and $d_p$ are both used in this process. Phase 2 adjusts the weights $v^{(1)}$ and $v^{(2)}$ associated with connections leading away from the hidden nodes. After training, the network can be used to predict $i_p$ from $d_p$, or conversely.

In a variation of this algorithm, all hidden nodes participate in predicting a value, in proportion to their distance from the input.
5.4 Adaptive Resonance Theory

Adaptive resonance theory (ART) networks allow the number of clusters to vary with problem size. The user can control the dissimilarity between members of the same cluster by selecting a "vigilance" parameter.

ART networks undergo unsupervised learning to associate patterns with prototypes. Each input layer node corresponds to an input vector component. Each outer layer node represents a cluster centroid.

Each input pattern is presented several times, and may be associated with different clusters before the network stabilizes. At each successive presentation, network weights are modified.

The first layer transmits the input patterns to the second layer. Matching succeeds if the pattern returned in response is sufficiently similar to the input pattern. Otherwise, signals travel back and forth between the output layer and the input layer ("resonance") until a match is discovered. If no satisfactory match is found, a new cluster of patterns is formed around the new input vector; the system is thus adaptive.

ART1 restricts all inputs to be binary-valued.
Example 5.5: Let the vigilance parameter be $\rho = 0.7$, let $n = 7$, and let the input set consist of five seven-dimensional binary vectors. For the first output node, initialize the top-down weights $t_{\ell,1}(0) = 1$ and the bottom-up weights $b_{1,\ell}(0) = 1/(1 + n) = 1/8$.
Figure 5.10: ART1 network: an input layer and an output layer with inhibitory connections, linked by bottom-up weights $b_{\ell,j}$ and top-down weights $t_{\ell,j}$.
For the first input vector $x$, the single existing node computes $y_1 = \sum_\ell b_{1,\ell}\, x_\ell = \frac{1}{8} \sum_\ell x_\ell$, and $y_1$ is declared the uncontested winner. Since the top-down weights are initially all 1,
$$\frac{\sum_{\ell=1}^{7} t_{\ell,1}\, x_\ell}{\sum_{\ell=1}^{7} x_\ell} = 1 \ge \rho = 0.7,$$
so the vigilance test succeeds, and the weights are updated:
$$b_{1,\ell}(\text{new}) = \frac{t_{\ell,1}(\text{old})\, x_\ell}{0.5 + \sum_{\ell=1}^{7} t_{\ell,1}(\text{old})\, x_\ell},$$
so that $b_{1,\ell}$ is nonzero exactly where $x_\ell = 1$. Likewise, the top-down weights become $t_{\ell,1}(\text{new}) = t_{\ell,1}(\text{old})\, x_\ell$, i.e., the first row of $T$ becomes a copy of $x$.
Algorithm (ART1):
  Initialize each $t_{\ell,j}(0) = 1$ and each $b_{j,\ell}(0) = \dfrac{1}{1+n}$;
  while the network has not stabilized, do
    1. Let $A$ contain all nodes;
    2. For a randomly chosen input vector $x$, compute $y_j = b_j \cdot x$ for each $j \in A$;
    3. repeat
       (a) Let $j^*$ be a node in $A$ with largest $y_j$;
       (b) Compute $s^* = (s^*_1, \ldots, s^*_n)$, where $s^*_\ell = t_{\ell,j^*}\, x_\ell$;
       (c) If $\dfrac{\sum_{\ell=1}^{n} s^*_\ell}{\sum_{\ell=1}^{n} x_\ell} < \rho$ then remove $j^*$ from set $A$
           else associate $x$ with node $j^*$ and update the weights:
           $$b_{j^*,\ell}(\text{new}) = \frac{t_{\ell,j^*}(\text{old})\, x_\ell}{0.5 + \sum_{\ell=1}^{n} t_{\ell,j^*}(\text{old})\, x_\ell}, \qquad t_{\ell,j^*}(\text{new}) = t_{\ell,j^*}(\text{old})\, x_\ell;$$
       until $A$ is empty or $x$ is associated with some node;
    4. If $A$ is empty, create a new node with weight vector $x$;
  end-while.

Figure 5.11: ART1 learning algorithm.
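A minimal sketch of the ART1 algorithm of Figure 5.11; the binary patterns are made-up placeholders:

```python
import numpy as np

def art1(patterns, rho=0.7, presentations=2):
    """ART1 sketch for nonzero binary patterns: grow nodes as vigilance demands."""
    B, T = [], []                                   # bottom-up, top-down weights
    for _ in range(presentations):
        for x in patterns:
            x = np.asarray(x, dtype=float)
            active = list(range(len(B)))
            placed = False
            while active and not placed:
                j = max(active, key=lambda j: np.dot(B[j], x))  # largest y_j
                s = T[j] * x                                    # intersection
                if s.sum() / x.sum() < rho:
                    active.remove(j)                            # fails vigilance
                else:
                    B[j] = s / (0.5 + s.sum())                  # resonance: learn
                    T[j] = s
                    placed = True
            if not placed:                                      # new cluster node
                B.append(x / (0.5 + x.sum()))
                T.append(x.copy())
    return B, T

pats = [[1, 1, 0, 0, 0, 0, 1], [0, 0, 1, 1, 1, 0, 0], [1, 0, 0, 0, 0, 0, 1]]
B, T = art1(pats)
print(len(T), "nodes")
```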
For the second sample, the existing node wins the initial competition, but
$$\frac{\sum_\ell t_{\ell,1}\, x_\ell}{\sum_\ell x_\ell} < \rho,$$
so the vigilance test fails, and a new node is generated: its top-down weight vector equals the sample, and its bottom-up weights follow from the update rule above. $B$ then has two columns and $T$ two rows, one per cluster.

The network stabilizes with three nodes, in two presentations of each input vector.
Observations:
1. Samples may switch allegiance: a sample previously activating one node may later activate another.
2. When the initial winner (with highest $y_j$) fails the vigilance test, another node may subsequently satisfy it.
3. Top-down weights are modified by computing intersections (conjunctions) with the input vector, so that the number of 1's does not increase; hence we initialize each top-down weight to 1.
4. Given a sufficient number of nodes, "outliers" that ought not to belong to any cluster will also be assigned separate nodes.
Minor Variations of ART1:
1. Each bottom-up weight is assigned some common value such that $0 < b_{j,\ell} < \frac{L}{L - 1 + n}$, where $L$ (typically 1 or 2) is a preassigned positive integer. Given integers $B$ and $D$ such that $D \ge B \ge 1$, the top-down weights are chosen such that $\frac{B-1}{D} < t_{\ell,j} < 1$. The bottom-up weight updates depend on $L$, with
$$b_{j,\ell} = \frac{L\, x_\ell}{L - 1 + \sum_{\ell=1}^{n} x_\ell},$$
and the top-down weights are updated as before.
2. An additional check can be performed to determine the appropriateness of the membership of a given pattern in a cluster. Instead of working with a set of bottom-up and top-down weights, we consider prototypes representing clusters. Let $w_j = (w_{j,1}, \ldots, w_{j,n})$ denote the prototype of the $j$th cluster, for $j = 1, \ldots, m$. Given a pattern $x = (x_1, \ldots, x_n)$, the similarity of the active prototypes is found by examining the value of
$$\frac{\sum_{\ell=1}^{n} w_{j,\ell}\, x_\ell}{\epsilon + \sum_{\ell=1}^{n} w_{j,\ell}},$$
where $\epsilon > 0$. If the best matching prototype $j^*$ is sufficiently similar to $x$, i.e.,
$$\frac{\sum_{\ell=1}^{n} w_{j^*,\ell}\, x_\ell}{\epsilon + \sum_{\ell=1}^{n} w_{j^*,\ell}} > \frac{\sum_{\ell=1}^{n} x_\ell}{\epsilon + n},$$
then examine whether
$$\frac{\sum_\ell w_{j^*,\ell}\, x_\ell}{\sum_{\ell=1}^{n} x_\ell} \ge \rho.$$
Both conditions must be satisfied to associate $x$ with cluster $j^*$ and modify its prototype.
5.5 Topologically Organized Networks

The self-organizing map (SOM), proposed by Kohonen, combines competitive learning with a topological structuring of the network, such that adjacent nodes attain similar weight vectors. Competitive learning requires inhibitory connections among all nodes; topological structuring implies that each node also has excitatory connections to a small number of other nodes in the network.

The topology is specified in terms of a neighborhood relation among nodes. Weight vectors of the winner, as well as of its neighbors, move towards an input sample.

The SOM training algorithm updates the winner node and also the nodes in its topological vicinity whenever an input pattern is presented. The only criterion is the topological distance between nodes; there is no other excitatory "weight" between the winner node and its neighbors.

The neighborhood $N_j(t)$ contains the nodes that are within a topological distance of $D(t)$ from node $j$ at time $t$.
Algorithm (Self-Organize):
  Select the network topology (neighborhood relation);
  initialize weights randomly, and select $D(0) > 0$;
  while computational bounds are not exceeded, do
    1. select an input sample $i_\ell$;
    2. find the output node $j^*$ with minimum $\sum_{k=1}^{n} (i_{\ell,k}(t) - w_{j,k}(t))^2$;
    3. update the weights of all nodes within a topological distance of $D(t)$ from $j^*$, using
       $$w_j(t+1) = w_j(t) + \eta(t)\,(i_\ell(t) - w_j(t)),$$
       where $0 < \eta(t) \le \eta(t-1) \le 1$;
    4. increment $t$;
  end-while.

Figure 5.12: Kohonen's SOM learning algorithm.
Figure 5.13: Time-varying neighborhoods of a node in a SOM network with grid topology. $D(t)$ decreases stepwise: successively smaller neighborhoods apply for $0 \le t < 10$, $10 \le t < 20$, $20 \le t < 30$, and $30 \le t < 40$, and $D(t) = 0$ for $t \ge 40$.
Here $D(t)$ decreases with time. $D(t)$ does not refer to Euclidean distance in input space; it refers only to the length of the path connecting two nodes, in the prespecified topology chosen for the network.

Figure 5.14: Shrinkage of the neighborhood of the winner node in topologically organized SOM networks as time increases (topological distance from the winner node vs. time period of node adaptation).

Weights change at a rate $\eta(t)$ that decreases with time. If $j$ is the winner node,
$$N_j(t) = \{j\} \cup \{\text{neighbors of } j \text{ at time } t\},$$
and $i$ is the input vector presented to the network at
time $t$, then the weight change rule is
$$w_\ell(t+1) = \begin{cases} w_\ell(t) + \eta(t)\,(i - w_\ell(t)) & \text{if } \ell \in N_j(t), \\ w_\ell(t) & \text{if } \ell \notin N_j(t). \end{cases}$$
Weight vectors often become "ordered," i.e., topological neighbors become associated with weight vectors that are near each other in the input space. This is depicted in Figure 5.15 for uniformly distributed input vectors: the weight vectors are initially random, but align themselves as shown in Figures 5.15(a)(ii) and (b)(ii).
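A minimal sketch of a SOM on a one-dimensional chain of nodes, with a stepwise-shrinking neighborhood radius and a decaying learning rate (both schedules are illustrative assumptions):

```python
import numpy as np

def som_1d(X, m=10, eta=0.5, D0=2, epochs=60, seed=0):
    """SOM on a 1-D chain of m nodes: the winner and its chain neighbors within
    topological distance D move toward each sample; eta and D shrink with time."""
    rng = np.random.default_rng(seed)
    W = rng.random((m, X.shape[1]))
    stage = epochs // (D0 + 1)              # epochs spent at each neighborhood size
    for epoch in range(epochs):
        D = max(0, D0 - epoch // stage)     # stepwise-shrinking neighborhood radius
        for x in X:
            j = np.argmin(((W - x) ** 2).sum(axis=1))    # winner node
            lo, hi = max(0, j - D), min(m, j + D + 1)    # topological neighborhood
            W[lo:hi] += eta * (x - W[lo:hi])             # move neighborhood toward x
        eta *= 0.95                                      # decaying learning rate
    return W
```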
Example 5.6: Consider the same data as in Example 5.2: $i_1 = (1.1, 1.7, 1.8)$, $i_2 = (0, 0, 0)$, $i_3 = (0, 0.5, 1.5)$, $i_4 = (1, 0, 0)$, $i_5 = (0.5, 0.5, 0.5)$, $i_6 = (1, 1, 1)$. Assume a 3-node network with linear topology (B–A–C), in which node A is adjacent to nodes B and C.
Figure 5.15: Weight vectors in the input space before and after the SOM learning algorithm is applied: (a) using a one-dimensional (linear) network topology for two-dimensional input data uniformly distributed in a triangular region, and (b) using a two-dimensional grid network topology for two-dimensional input data uniformly distributed in a rectangular region. Weight vectors are initially located randomly, as shown in (a)(i) and (b)(i); by the end of the learning process, the weights become "ordered," as shown in (ii) of each example.
Let the initial weights be given by
$$W = \begin{pmatrix} w_A \\ w_B \\ w_C \end{pmatrix} = \begin{pmatrix} 0.2 & 0.7 & 0.3 \\ 0.1 & 0.1 & 0.9 \\ 1 & 1 & 1 \end{pmatrix}.$$
Let $D(t) = 1$ for one initial epoch (a sequence of six input vector presentations), and $D(t) = 0$ thereafter. Let the initial $\eta = 0.5$, reduced to $0.25$ in the second epoch and to $0.1$ in the third epoch.

$t = 1$: Sample presented: $i_1 = (1.1, 1.7, 1.8)$. The squared Euclidean distance between $A$ and $i_1$ is $d^2(A) = (1.1 - 0.2)^2 + (1.7 - 0.7)^2 + (1.8 - 0.3)^2 \approx 4.1$. Similarly, $d^2(B) \approx 4.4$ and $d^2(C) \approx 1.1$.

$C$ is the winner, since $d^2(C) < d^2(A)$ and $d^2(C) < d^2(B)$. Since $D(t) = 1$ at this stage, the weights of both $C$ and its neighbor $A$ are updated according to the rule above, e.g., $w_{A,1} := w_{A,1} + 0.5\,(x_1 - w_{A,1}) = 0.65$.
The resulting weight matrix is
$$W = \begin{pmatrix} w_A \\ w_B \\ w_C \end{pmatrix} = \begin{pmatrix} 0.65 & 1.2 & 1.05 \\ 0.1 & 0.1 & 0.9 \\ 1.05 & 1.35 & 1.4 \end{pmatrix}.$$
The weights attached to $B$ are not modified, since $B$ falls outside the neighborhood of the winner node $C$.
$t = 7$: $\eta$ is now reduced to $0.25$, and the neighborhood relation shrinks, so that only the winner node is updated henceforth. From this point on, each sample presentation moves only the weight vector of its winner: for instance, $C$ continues to win for $i_1$, and only $w_C$ is updated when $i_1$ is presented. Training continues in this fashion, with $\eta$ further reduced to $0.1$ in the third epoch, so that by the end of the third epoch each row of $W$ lies near the samples for which that node wins.
At the beginning of the training process, the Euclidean distances between the various nodes' weight vectors were
$$|w_A - w_B| \approx 0.85, \quad |w_B - w_C| \approx 1.28, \quad |w_A - w_C| \approx 1.10.$$
The training process increases the relative distance between the non-adjacent nodes $B$ and $C$, while $A$ remains roughly in between $B$ and $C$.
Note that samples may switch allegiance between nodes, especially in the early phases of computation. Computation continues until the same nodes remain the winners for the same input patterns.
Example: Eight three-dimensional samples at the corners of the unit cube are presented to a network in which one node initially lies inside the cube while all others are far away; the node inside the cube is then the winner for all samples. If the neighborhood relation for the inner node does include other nodes, those nodes are pulled towards the cube. If computation continues long enough, some of them eventually become winners for some inputs.
The SOM algorithm with steadily shrinking neighborhood relations has been shown to perform better than networks with static neighborhood functions.

The SOM learning algorithm can also be extended for supervised learning tasks, using a variant of LVQ that maintains a topological relation between nodes and updates the nodes in a neighborhood whose size decreases with time.
Convergence

A one-dimensional SOM, in which the $\ell$th node (with weight vector $w_\ell$) is connected to the $(\ell \pm 1)$st nodes, is "ordered" if
$$|r - s| < |r - q| \;\Rightarrow\; |w_r - w_s| < |w_r - w_q|,$$
where $r, s, q$ are node labels and $w_r, w_s, w_q$ are the weights corresponding to those nodes. If the neighborhood relation satisfies certain properties, an ordered configuration is stationary and stable. This state will be reached with probability 1 as long as the neighborhood function is positive-valued, normalized, and decreasing with distance, so that there exists a sequence of patterns which will lead the network into the ordered configuration.

However, for some sequences of input pattern presentations, e.g., if patterns are presented in a fixed sequence, even the one-dimensional Kohonen network may not converge to an ordered state. The network may instead settle into a state in which topologically adjacent nodes are physically distant from each other in the weight space.
Figure 5.16: Emergence of an "ordered" map. The dark circles represent data points in one-dimensional space, and nodes are identified as A, B, C, and D, with connections indicating a linear topology. Part (a) depicts the initial positions of the nodes before applying the SOM learning algorithm, and part (b) shows the final positions of the nodes, with topologically adjacent nodes having similar weights.
Multi-Dimensional Maps: In practice, the SOM often converges quickly to ordered maps, but there is no guarantee that this will occur. One may view an SOM's execution as consisting of two phases: nodes are more "volatile" in the first phase, searching for niches to move into, while the second, "sober" phase consists of nodes settling into cluster centroids in the vicinity of positions found in the earlier phase. The sober phase converges, but the emergence of an ordered map depends on the result of the volatile phase.

A topologically ordered configuration is reachable, but the network may also move out of such an ordered configuration. In general, there is no energy function on which the SOM weight update rule performs gradient descent.
Example 5.7: Consider the network in Figure 5.17, with four nodes whose weight vectors lie at the corners of a square. If input patterns are repeatedly chosen near the center of the square, but slightly closer to $w_1$, then the nodes previously at $w_2$ and $w_4$ will eventually move closer to each other than to node $w_3$: topological adjacency no longer guarantees proximity of weight vectors.
Figure 5.17: Topologically non-adjacent nodes $w_2, w_4$ move closer to each other than to an adjacent node $w_3$, on repeated presentation of an input sample $i$ whose winner is $w_1$: (a) initial positions of the weight vectors; (b) final positions.
Extensions

A "conscience" mechanism enables most nodes in a network to achieve useful learning: when "rich" enough, nodes give up some of their "wealth" (inputs for which they are winners). For each node $j$, the conscience mechanism monitors the frequency $\text{freq}_j$ with which it becomes the winner. The winner is chosen as the node for which $d(w, x) \cdot g(\text{freq}_j)$ is minimum, where $g$ is a monotonically increasing function, e.g.,
$$g(\text{freq}_j) = c\,\text{freq}_j + \frac{1}{m},$$
where $c > 0$ and $m$ is the number of competing nodes.

Another possible implementation temporarily deactivates a node whose winning frequency exceeds a predetermined limit such as $cP/m$, where $c \ge 1$, $P$ is the number of patterns, and $m$ is the number of nodes; further competition is conducted only among the other, active nodes.
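A minimal sketch of winner selection with a conscience, using $g(\text{freq}_j) = c\,\text{freq}_j + 1/m$; the running frequency estimate is an illustrative choice:

```python
import numpy as np

def conscience_winner(W, x, freq, c=10.0):
    """Pick the winner by distance scaled with g(freq) = c*freq + 1/m,
    so that frequently winning nodes are handicapped."""
    m = len(W)
    d = np.sqrt(((W - x) ** 2).sum(axis=1))   # distances to all prototypes
    g = c * freq + 1.0 / m                    # conscience penalty factors
    j = int(np.argmin(d * g))
    freq *= 0.99                              # running estimate of win frequency
    freq[j] += 0.01
    return j

# freq is typically initialized uniformly, e.g., np.full(m, 1.0 / m).
```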
Hierarchical Maps

"Hierarchical" clustering algorithms develop a tree of clusters, in which subclusters are associated with the children of each node. In "bottom-up" algorithms, smaller lower-level clusters are successively merged to obtain higher-level clusters; "top-down" algorithms successively partition each cluster into subclusters.
The simple SOM is inadequate for applications with elongated, ellipsoidal, or arbitrarily shaped clusters, since it attempts to minimize the Euclidean distance between samples and node weight vectors, generating symmetric clusters. In a multilayer self-organizing map (MSOM), the outputs of the first SOM are fed into a second SOM as inputs. The first layer clusters together all similar elements of the training set, but this similarity is confined to circular symmetry in the original space; the second layer of nodes combines these circular clusters into arbitrarily shaped clusters, as in bottom-up hierarchical algorithms.

Luttrell's approach steadily increases the number of nodes, repeatedly applying the SOM learning rule at each step. Weight updates are restricted to a few topological neighbors of the winner. Each new node is inserted halfway between two existing nodes. High-level clusters obtained at early phases of the algorithm are partitioned into sub-clusters in later phases, as in top-down algorithms.
Growing Cell Structures (GCS) adapt the network size, approximating some probability distributions better than the SOM. GCS networks maintain a hyper-tetrahedral topology of fixed dimensionality; e.g., in Figure 5.18, the topology consists of triangles. The choice of neighbors may be modified as the network evolves, tracking changes in the nature of the problem.

A "signal counter" $\tau_j$ is associated with each node $j$, and estimates the proportion of samples currently associated with a cluster located at that node.

The algorithm has three main components:

Node adaptation: The signal counter $\tau_{j^*}$ of the winner is incremented, while the others decay, $\Delta\tau_\ell = -\alpha\,\tau_\ell$ for each $\ell \ne j^*$. The winner $j^*$ and its immediate neighbors (only) are adapted:
$$w_j(t+1) = w_j(t) + \epsilon\,(i(t) - w_j(t)) \quad \text{if } j \text{ is adjacent to } j^*,$$
$$w_{j^*}(t+1) = w_{j^*}(t) + \epsilon^*\,(i(t) - w_{j^*}(t)), \quad \text{with } \epsilon^* > \epsilon.$$
Figure 5.18: GCS with triangular topology: (a) data distribution and initial network, with a signal counter $\tau$ at each node; (b) node addition, with a new node and new edges inserted near the node with the highest $\tau$; (c) node deletion.
Node insertion: After some adaptation steps, a new node $\ell_{\text{new}}$ is added midway between the node $\ell$ with the largest signal counter value and $\ell$'s farthest neighbor $\ell_{\text{far}}$. The neighbors of $\ell_{\text{new}}$ are $\ell$, $\ell_{\text{far}}$, and others as needed to preserve the chosen topological structure of the network. Signal counter values change in rough proportion to the fraction of the volume of the Voronoi region of each node lost to $\ell_{\text{new}}$.
Node deletion: Nodes with counter values below a threshold are deleted, along with their incident edges; nodes left with no other neighbors are then deleted. Other nodes may also be deleted, to maintain the topological structure of the network.

The GCS network grows and shrinks repeatedly, and may not converge to a stable state.

Instead of the signal counter value, one may use any other "resource" value, e.g., the quantization error.
GCS for supervised learning: The weight vector for each hidden node is obtained using unsupervised GCS. Each hidden node applies a Gaussian function to its net input; each output node applies a sigmoid node function. When a sample is presented, the resource value of the current best matching unit $j$ is updated by
$$\Delta\tau_j = \sum_{\ell=1}^{m} (o_\ell - d_\ell)^2$$
for function approximation problems; for classification problems, $\Delta\tau_j \in \{0, 1\}$.

Weights to output nodes are modified using the delta rule,
$$\Delta w_{\ell,j} = \eta\,(d_\ell - o_\ell)\, y_j,$$
where $y_j$ is the output of the $j$th hidden-layer node.
5.6 Distance-Based Learning

Weight vectors may be updated based on Euclidean distance in the input space instead of topological proximity, with the advantage that convergence results can be obtained by showing that the algorithm conducts gradient descent on an energy function. Simple competitive learning updates only the winner node; the SOM updates the winner and its topological neighbors; the "maximum entropy" algorithm updates all nodes; and the "neural gas" algorithm makes significant updates to only a few nodes. The effects of the new update rules are shown in Figure 5.19; compare with Figure 5.14 for the SOM.

Figure 5.19: Weight adaptation rules in (a) the neural gas and (b) the maximum entropy algorithms. The y-axis measures the relative strength of adaptation, $|\Delta w_j| / |i - w_j|$; the x-axis shows (a) the rank of the weight vector in terms of distance from the input vector, and (b) the Euclidean distance of the weight vector from the input pattern.
Maximum entropy procedure: Each weight vector is drawn closer to the input vector, by a quantity that depends on their Euclidean distance and on a "temperature" parameter $T$, which is steadily decreased:
$$\Delta w_j = \eta\,(i - w_j)\, \frac{\exp(-\|i - w_j\|^2 / T)}{\sum_\ell \exp(-\|i - w_\ell\|^2 / T)}.$$
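A minimal sketch of one maximum-entropy update step at a fixed temperature:

```python
import numpy as np

def max_entropy_update(W, x, eta=0.1, T=1.0):
    """Soft update: every prototype moves toward x, weighted by a Boltzmann
    factor of its squared distance at temperature T."""
    d2 = ((W - x) ** 2).sum(axis=1)
    p = np.exp(-d2 / T)
    p /= p.sum()                          # normalized adaptation strengths
    W += eta * p[:, None] * (x - W)       # all nodes move; nearer ones move more
    return W
```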
Neural Gas Algorithm: The change in the $j$th weight vector depends on its "rank" $k_j(i, W)$, the number of weight vectors $w_\ell \in W$ such that $\|w_\ell - i\| < \|w_j - i\|$:
$$\Delta w_j = \eta\,(i - w_j)\, h_\lambda(k_j(i, W)), \quad j \in \{1, \ldots, N\},$$
where $h_\lambda$ is monotonically decreasing with $h_\lambda(0) = 1$, e.g., $h_\lambda(x) = e^{-x/\lambda}$ with $\lambda > 0$.

This rule conducts stochastic gradient descent on
$$E = \frac{1}{2\,C(\lambda)} \sum_{j=1}^{N} \int P(i)\, h_\lambda(k_j(i, W))\, \|i - w_j\|^2\, di,$$
where $C(\lambda)$ is a normalization factor and $P(i)$ represents the probability distribution of the input vectors. If the input vector set is finite, a summation over all input vectors can be used instead of the integral. $\lambda$ is decreased with time; when it is small enough, the rule resembles "winner-take-all."
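A minimal sketch of one neural gas update step (the ranks are obtained by sorting, here via a double argsort):

```python
import numpy as np

def neural_gas_step(W, x, eta=0.3, lam=1.0):
    """Neural gas: each prototype moves toward x with strength e^(-rank/lambda),
    where rank 0 belongs to the nearest prototype."""
    d = np.sqrt(((W - x) ** 2).sum(axis=1))
    ranks = np.argsort(np.argsort(d))       # rank of each prototype by distance
    h = np.exp(-ranks / lam)                # h(0) = 1 for the winner
    W += eta * h[:, None] * (x - W)
    return W
```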
Although slower, since the weight vectors must be sorted at each input presentation, this algorithm yielded better weight vectors than the SOM, k-means clustering, and maximum entropy algorithms on a larger version of the clustering problem shown in Figure 5.20.
Figure 5.20: Node positions (a) before and (b) after applying the neural gas algorithm, when the data is contained within $n$ small rectangles.
5.7 Neocognitron

The input images are two-dimensional arrays, and the final result of pattern recognition indicates which high-level feature or shape has been found in the entire input image, activating the appropriate output node.

Figure 5.21: The neocognitron network: an input layer $U_0$ followed by a sequence of modules (Module 1 through Module 4), each containing an S-layer and a C-layer ($U_{S1}, U_{C1}, U_{S2}, U_{C2}, \ldots$).

The network uses a hierarchy of many modules, with each module extracting features from the outputs of the previous module. Each module consists of an S-layer and a C-layer.

- An S-layer node searches for a feature in a specific region of the input image ($U_0$ in Figure 5.21), e.g., for a vertical line segment in the top-left corner of the input image.
- A C-layer node corresponds to one relatively position-independent feature, receiving inputs from the S-layer nodes for that feature, e.g., to determine whether a vertical line segment occurs anywhere in the image.
The trainer must first fix the receptive field (the S-nodes) that provides inputs for each C-node. Modules closer to the input layer (lower in the hierarchy) are trained before the others. Higher-level modules represent complex position-independent features that depend on the detection of simpler features in preceding layers.

The simple competitive learning rule is applied to the weights on connections into S-nodes. Since each S-layer contains many nodes that detect identical features at different locations, only one such node is trained for each feature, and its weights are later replicated.
The selective-attention neural network is a neocognitron with additional "backward" connections, i.e., a complete copy of the forward connections wired to run in reverse. Backward-flowing signals identify the portions of the input
Figure 5.22: Connections between the input layer and the first neocognitron module: (a) each S-layer node receives inputs from one quadrant of the input space, and the combined results are output by the C-layer node for each feature; (b) vertical line segments in the second and third input quadrants activate two S-layer nodes.
image that contributed to the activated network output node. A gating network assists in temporarily deactivating some nodes, when the network's attention needs to be refocused.
Figure 5.23: The selective-attention neural network: forward modules ($U_{S1}, U_{C1}, \ldots, U_{S4}, U_{C4}$) paired with backward modules ($W_{S1}, W_{C1}, \ldots$), with gating signals linking the two pathways between the classification output and the input layer.
5.8 Principal Component Analysis (PCA) Networks

The PCA statistical procedure extracts important features of the data, reducing the dimensionality of input vectors while minimizing information loss. Interesting features are obtained as linear combinations of the available attributes; to avoid redundancy, these features are linearly independent.
Example: A 3-dimensional vector $x$ is transformed to a 2-dimensional vector $y$:
$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} a & b & c \\ p & q & r \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} a x_1 + b x_2 + c x_3 \\ p x_1 + q x_2 + r x_3 \end{pmatrix}.$$
If the rows of
$$W = \begin{pmatrix} a & b & c \\ p & q & r \end{pmatrix}$$
are orthonormal (orthogonal, unit length), then
$$W W^T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
and $W^T$ is the pseudo-inverse of $W$, irrespective of the number of orthonormal features and their dimension.
Generalization: For $m < n$, if $W$ minimizes the "information" loss in mapping $n$-dimensional vectors $x$ to $m$-dimensional vectors $y = Wx$, then applying the "opposite" transformation (the pseudo-inverse $W^T$) to $y$ should result in a vector close to $x$, i.e., $\|x - W^T W x\|$ should be minimized. Considering all input vectors together, $W$ must be chosen to minimize
$$\sum_{i=1}^{N} \|x_i - W^T W x_i\|^2.$$
Let $S_T$ be the variance-covariance matrix (defined in Appendix A) of the training set $T$. To extract $m$ orthonormal features with maximum variance, PCA selects the $m$ eigenvectors of $S_T$ that correspond to the $m$ largest eigenvalues, exploiting the following result:

- Let $b$ be any vector such that $\|b\| = 1$. Then the variance of $b \cdot x$, where $x \in T$, is maximized when $b$ is chosen to be the eigenvector of $S_T$ that corresponds to the largest magnitude eigenvalue of $S_T$.
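These observations lead directly to the following sketch of classical PCA in NumPy:

```python
import numpy as np

def pca(X, m=2):
    """Classical PCA: project the data onto the m leading eigenvectors of the
    covariance matrix of the mean-centered training set."""
    Xc = X - X.mean(axis=0)                     # subtract the mean vector
    S = np.cov(Xc, rowvar=False)                # variance-covariance matrix
    vals, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:m]          # indices of m largest eigenvalues
    W = vecs[:, order].T                        # rows are orthonormal features
    explained = vals[order].sum() / vals.sum()  # fraction of variance captured
    return W, Xc @ W.T, explained
```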
Example 5.8: Given a small training set $T$ of three-dimensional patterns, first subtract the mean vector from each pattern in $T$, obtaining centered vectors $x_1, x_2, \ldots$, and compute their covariance matrix $S$. Let $\lambda_1 > \lambda_2 > \lambda_3$ be the eigenvalues of $S$, with associated eigenvectors $e_1, e_2, e_3$.

For $m = 1$, we consider the largest magnitude eigenvalue $\lambda_1$, and this gives $W = W_1 = e_1^T$. For $m = 2$, we use both $\lambda_1$ and $\lambda_2$:
$$W = W_2 = \begin{pmatrix} e_1^T \\ e_2^T \end{pmatrix}.$$
The ratio $\lambda_1 / (\lambda_1 + \lambda_2 + \lambda_3)$ indicates the fraction of the variation in the training set that can be explained by a single eigenvector, using $W_1$ alone. To capture a larger fraction of the input characteristics, we may map the input vectors into a two-dimensional space instead, with $y = W_2\, x$.
PCA networks are feedforward networks without hidden nodes that extract principal components.

Figure 5.24: A one-layer neural network for principal component analysis, computing outputs $y_j$ from inputs $x_i$ via weights $w_{j,i}$.

Each $j$th output node calculates $y_j = \sum_{i=1}^{n} w_{j,i}\, x_i$. Weights are randomly assigned initially, and updated by
$$\Delta W = \eta\,(y_\ell\, x_\ell^T - K_\ell\, W),$$
where $K_\ell$ is one of the following:
1. $K_\ell = y_\ell\, y_\ell^T$ (Williams);
2. $K_\ell = 2\,L[y_\ell\, y_\ell^T] - D[y_\ell\, y_\ell^T]$ (Oja and Karhunen);
3. $K_\ell = L[y_\ell\, y_\ell^T]$ (Sanger);

where $D[A]$ has diagonal elements the same as $A$ (other elements 0), and $L[A]$ has elements on and below the main diagonal the same as $A$ (other elements 0). For example,
$$D\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} \quad \text{and} \quad L\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 3 & 4 \end{pmatrix}.$$

Figure 5.25: A neural network that generates the first principal component, with a single output $y_1 = w \cdot x$.
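A minimal sketch of such a one-layer network trained with Sanger's rule, under the convention above that $L$ keeps the diagonal (the learning rate and epoch count are illustrative choices):

```python
import numpy as np

def sanger_pca(X, m=2, eta=0.01, epochs=200, seed=0):
    """Train a one-layer PCA network with Sanger's rule:
    dW = eta * (y x^T - L[y y^T] W), L = lower triangle incl. the diagonal."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                   # center the data
    W = rng.normal(scale=0.1, size=(m, X.shape[1]))
    for _ in range(epochs):
        for x in Xc:
            y = W @ x
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W                                  # rows approach leading eigenvectors
```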
Example 5.9: Consider the data from Example 5.8, a small learning rate $\eta$, and randomly chosen initial weights for a single-output network. For each input $x_\ell$, the output $y = w \cdot x_\ell$ is computed, and the weights are updated using the $K_\ell$ suggested by Williams:
$$\Delta W = \eta\,(y\, x_\ell^T - y^2\, W).$$
This process is repeated, cycling through the input vectors; over successive iterations the weight vector converges towards the first principal component.
Better features may be extractable by employing nonlinear functions of $x$, e.g., minimizing $E\left[\|x - W^T S(Wx)\|^2\right]$, where $S$ is a nonlinear function with scalar arguments, extended to apply to vector arguments:
$$S\begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} S(y_1) \\ \vdots \\ S(y_m) \end{pmatrix}.$$
A gradient descent technique may then be used for nonlinear principal component analysis, especially when $S$ is a continuous, monotonically increasing, odd function, i.e., $S(-y) = -S(y)$.
5.9 Conclusion

Biological neural networks are capable of learning to recognize patterns by abstracting what is common to many different instances of similar data. The driving force in this learning process is the similarity among various patterns, not an external reinforcement, nor the relationships between various input dimensions.

Simple competitive learning networks modify the weights of connections to a single node when an input pattern is presented. ART networks allow the number of nodes to increase, and also utilize a user-specified threshold to determine the degree of success in matching an input pattern to the best matching prototype represented by a node. An SOM has a predefined topological structure, and adapts nodes such that neighbors in the topology respond similarly to input patterns. PCA networks determine features that are linear combinations of input dimensions, preserving the maximum possible information that distinguishes input samples.
These algorithms can also be used for supervised learning tasks, as in LVQ, counterpropagation, the neocognitron, and supervised GCS, especially when the training data consists of readily identifiable clusters. Such algorithms can be misled by noisy training data, for which algorithms such as backpropagation have better generalization properties.