Chapter 5

Unsupervised Learning

"Unsupervised" learning proceeds teacher-less, to discover special features and patterns from available data without using external help, for tasks such as clustering, vector quantization, approximating probability distributions, feature extraction, and dimensionality reduction.
5.1 Winner-Take-All Networks

Most unsupervised neural networks rely on algorithms to compute and compare distances, determine the "winner" node with the highest level of activation, and use these to adapt network weights.
Hamming networks, shown in Figure 5.1, compute the "Hamming distance," the number of differing bits in two vectors. Weights on links from an input layer to an output layer represent components of stored input patterns.
Figure 5.1: A network to calculate Hamming distances $H$ between stored vectors and input vectors; the output of the third node for a new input vector is $n - H(\text{new input vector}, i_3)$.
Let $i_{p,j} \in \{-1, 1\}$ be the $j$th element of the $p$th input vector. Components of the weight matrix $W$ and of the threshold vector $\theta$ are given by $w_{p,j} = i_{p,j}/2$ and $\theta_p = n/2$, respectively.
When a new input vector $i$ is presented to this network, its upper-level nodes generate the output
$$o = Wi + \theta = \frac{1}{2}\begin{pmatrix} i_1 \cdot i + n \\ i_2 \cdot i + n \\ \vdots \\ i_P \cdot i + n \end{pmatrix}.$$
Since $o_p = n - H(i_p, i)$, the network outputs can be used to compare the Hamming distances between the stored patterns and an input pattern.
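To make this concrete, here is a minimal NumPy sketch of the computation; the stored patterns and the probe vector are made-up placeholders:

```python
import numpy as np

# Stored bipolar patterns (rows), values in {-1, +1}; placeholders for illustration.
patterns = np.array([[ 1, -1,  1, -1],
                     [ 1,  1,  1,  1],
                     [-1, -1, -1,  1]])
P, n = patterns.shape

W = patterns / 2.0           # weights w_{p,j} = i_{p,j} / 2
theta = n / 2.0              # common threshold n/2

x = np.array([1, -1, 1, 1])  # new input vector
o = W @ x + theta            # o_p = n - HammingDistance(i_p, x)
print(o, n - o)              # the node with the largest output is closest
```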
Maxnet is a recurrent one-layer network (cf. Figure 5.2) used to determine which node has the highest initial activation.
Figure 5.2: Maxnet, a simple competitive network.
All nodes update their outputs simultaneously. Each node receives inhibitory inputs from all other nodes via "lateral" (intra-layer) connections of weight $-\epsilon$, while its self-excitation weight is $\theta$. Choosing $\theta = 1$ and $\epsilon < 1/(\text{number of nodes})$ ensures that a single node can prevail as the only active or "winner" node, while the activation of every other node subsides to zero. The node function is $f(\text{net}) = \max(0, \text{net})$, where $\text{net} = \sum_i w_i x_i$.
Example 5.1: Let the initial activation values be $0.5, 0.9, 1.0, 0.9, 0.9$, with $\theta = 1$ and $\epsilon = 0.2$. In the first iteration, the output of the first node falls to $\max(0,\; 0.5 - 0.2(0.9 + 1.0 + 0.9 + 0.9)) = 0$; the outputs of the second, fourth and fifth nodes fall to $\max(0,\; 0.9 - 0.2(0.5 + 1.0 + 0.9 + 0.9)) = 0.24$; and the output of the third node falls to $\max(0,\; 1.0 - 0.2(0.5 + 0.9 + 0.9 + 0.9)) = 0.36$.

In further iterations, $(0, 0.24, 0.36, 0.24, 0.24) \to (0, 0.072, 0.216, 0.072, 0.072) \to (0, 0, 0.1728, 0, 0)$, which is stable. Only three iterations were sufficient, even though some nodes had initial values close to the winner's. The maxnet allows for greater parallelism in execution than an algebraic solution, since every computation is local to each node rather than centralized.
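A minimal sketch of the maxnet iteration, using the activation values of Example 5.1 (the stopping test is an implementation convenience, not part of the network definition):

```python
import numpy as np

def maxnet(activations, eps=0.2, max_iters=100):
    """Iterate x_j <- max(0, x_j - eps * (sum of the other activations))
    until a single winner remains."""
    x = np.array(activations, dtype=float)
    for _ in range(max_iters):
        x_new = np.maximum(0.0, x - eps * (x.sum() - x))  # self-weight theta = 1
        if np.array_equal(x_new, x) or (x_new > 0).sum() <= 1:
            return x_new
        x = x_new
    return x

print(maxnet([0.5, 0.9, 1.0, 0.9, 0.9]))   # -> third node wins
```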
Simple competitive learning.

The Hamming net and maxnet assist more complex unsupervised learning networks, helping to determine the node whose weight vector is nearest to an input pattern.

Simple competitive learning can be accomplished using a network that connects each input node to each output node, with inhibitory connections among output nodes, as shown in Figure 5.3.

The $j$th output node is described by the vector of weights to it from the input nodes, $w_j = (w_{j,1}, \ldots, w_{j,n})$. Initially, all connection weights are assigned randomly, with values between 0 and 1.
Figure 5.3: Simple competitive learning network: an input layer fully connected to an output layer with inhibitory connections; a node's output is 1 if it is the winner and 0 otherwise.

A competition occurs to find the "winner" among the nodes in the outer layer, defined as the node $j^*$ whose weight vector (or "prototype" or "node position") is nearest to the input vector $i_\ell$, i.e.,
$$d(w_{j^*}, i_\ell) \le d(w_j, i_\ell) \quad \text{for all } j \in \{1, \ldots, m\},$$
where
$$d(w_j, i_\ell) = \sqrt{\sum_{k=1}^{n} (i_{\ell,k} - w_{j,k})^2}.$$
If each $\|w_j\| = 1$ and each $\|i_\ell\| = 1$, then $d^2(w, i)$ simplifies to $2 - 2\sum_k w_k i_k = 2(1 - i \cdot w)$. This distance is minimized when the dot product $i \cdot w$ takes its maximum value; a maxnet can be invoked for this purpose.
Connection weights of the winner node $j^*$ are modified so that the weight vector $w_{j^*}$ moves closer to $i_\ell$, while the other weights remain unchanged. This is the winner-take-all phase. The weight update rule can be described as follows:
$$w_j(\text{new}) = w_j(\text{old}) + \eta\,(i_\ell - w_j(\text{old}))\,\delta(j, \ell), \tag{5.1}$$
where
$$\delta(j, \ell) = \begin{cases} 1 & \text{if the } j\text{th node is the winner for input } i_\ell, \\ 0 & \text{otherwise.} \end{cases}$$
Each weight vector $w_j$ moves successively closer to
$$\bar{w}_j = \frac{1}{\sum_\ell \delta(j, \ell)} \sum_\ell i_\ell\, \delta(j, \ell),$$
the average of the input vectors for which $w_j$ is the winner. The learning rate $\eta$ may vary such that $\eta(t+1) \le \eta(t)$, resulting in faster convergence.
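A minimal sketch of this update rule, under the assumption of a multiplicatively decaying learning rate (the data are those of Example 5.2 below):

```python
import numpy as np

def competitive_learning(inputs, m=3, eta=0.5, epochs=10, seed=0):
    """Simple competitive learning: move only the winner toward each input."""
    rng = np.random.default_rng(seed)
    W = rng.random((m, inputs.shape[1]))               # random prototypes in [0,1]
    for _ in range(epochs):
        for x in inputs:
            j = np.argmin(((W - x) ** 2).sum(axis=1))  # winner: nearest prototype
            W[j] += eta * (x - W[j])                   # move winner toward x
        eta *= 0.9                                     # decreasing learning rate
    return W

data = np.array([[1.1, 1.7, 1.8], [0, 0, 0], [0, 0.5, 1.5],
                 [1, 0, 0], [0.5, 0.5, 0.5], [1, 1, 1]])
print(competitive_learning(data))                      # rows approach centroids
```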
Example 5.2: Let the input vectors be $T = \{i_1 = (1.1, 1.7, 1.8),\; i_2 = (0, 0, 0),\; i_3 = (0, 0.5, 1.5),\; i_4 = (1, 0, 0),\; i_5 = (0.5, 0.5, 0.5),\; i_6 = (1, 1, 1)\}$. The network contains three input nodes; assume that there are also three output nodes $A, B, C$.
Algorithm (Simple Competitive Learning):
  Initialize weights randomly;
  repeat
    (Optional:) Adjust the learning rate $\eta(t)$;
    Select an input pattern $i_k$;
    Find the node $j^*$ whose weight vector $w_{j^*}$ is closest to $i_k$;
    Update each weight $w_{j^*,1}, \ldots, w_{j^*,n}$ using the rule
      $\Delta w_{j^*,\ell} = \eta(t)\,(i_{k,\ell} - w_{j^*,\ell})$, for $\ell \in \{1, \ldots, n\}$;
  until the network converges or computational bounds are exceeded.

Figure 5.4: Simple competitive learning algorithm.
Assume the initial weights are randomly chosen as
$$W = \begin{pmatrix} w_A \\ w_B \\ w_C \end{pmatrix} = \begin{pmatrix} 0.2 & 0.7 & 0.3 \\ 0.1 & 0.1 & 0.9 \\ 1 & 1 & 1 \end{pmatrix},$$
and assume $\eta = 0.5$.
The first sample presented is $i_1 = (1.1, 1.7, 1.8)$. The squared Euclidean distance between $A$ and $i_1$ is $(1.1 - 0.2)^2 + (1.7 - 0.7)^2 + (1.8 - 0.3)^2 \approx 4.1$. Similarly, the squared distances to $B$ and $C$ are $4.4$ and $1.1$, respectively; hence $C$ is the winner. The weights of $C$ are updated, moving halfway towards the sample since $\eta = 0.5$. The resulting weight matrix is
$$W = \begin{pmatrix} w_A \\ w_B \\ w_C \end{pmatrix} = \begin{pmatrix} 0.2 & 0.7 & 0.3 \\ 0.1 & 0.1 & 0.9 \\ 1.05 & 1.35 & 1.4 \end{pmatrix}.$$
Assuming the input vectors are presented repeatedly in the same sequence, each weight vector approaches the centroid of the samples for which it wins: $w_A$ approaches the centroid of $\{i_2, i_4, i_5\}$, roughly $(0.5, 0.17, 0.17)$; $w_B$ approaches $i_3 = (0, 0.5, 1.5)$; and $w_C$ approaches the centroid of $\{i_1, i_6\}$, roughly $(1.05, 1.35, 1.4)$.
- Each node tends to be the winner for the same input vectors in later iterations, and moves towards their centroid.
- Convergence is not smooth when $\eta$ is high.
- Different results ensue if another metric is used, such as the Manhattan distance.
- Results depend on the initial values of the node positions.
- Using three nodes is no guarantee that three "clusters" will be obtained by the network.
- Results depend on the sequence in which samples are presented.
- Weight vectors will continue to change at every sample presentation, unless the sample presented coincides with a weight vector.
- Each node can store its position at the beginning of the previous epoch and compare this with its position at the beginning of the next epoch; computations can be terminated when node positions cease to change.
- If the set of input patterns is fixed, and if the same nodes continue to be the winners for the same input patterns, computation can be speeded up by non-neurally computing the centroids of the input pattern clusters, stored as the weights of the relevant nodes.
The simple competitive learning algorithm conducts stochastic gradient descent on the quantization error
$$E = \sum_p \| i_p - W(i_p) \|^2,$$
where $W(i_p)$ is the weight vector nearest to $i_p$.

The number of nodes is assumed to be fixed, but the right choice may not be obvious. We may attempt to minimize $E + c\,(\text{number of nodes})$ instead of $E$, for some $c > 0$.
k-Means Clustering (cf. Figure 5.5) is a statistical procedure closely related to simple competitive learning, which computes cluster centroids directly instead of making small updates to node positions.

The k-means algorithm is fast, and converges to a state in which each prototype changes little, assuming that successive vectors presented to the algorithm are drawn by independent random trials from the input data distribution. Both k-means and the simple competitive learning algorithm can lead to local minima of $E$.
Example 5.3: Given five one-dimensional inputs, the weights of three nodes may move by simple competitive learning to a configuration whose quantization error is not optimal: each node settles at the centroid of the samples it wins, but a different partition of the samples among the three nodes yields a lower error.
Algorithm (k-Means Clustering):
  Initialize $k$ prototypes $w_j = i_\ell$, $j \in \{1, \ldots, k\}$, $\ell \in \{1, \ldots, P\}$;
  each cluster $C_j$ is associated with prototype $w_j$;
  repeat
    for each input vector $i_\ell$, do
      place $i_\ell$ in the cluster with the nearest prototype $w_{j^*}$;
    end-for;
    for each cluster $C_j$, do
      $w_j = \dfrac{\sum_{i_\ell \in C_j} i_\ell}{|C_j|}$;
    end-for;
    compute $E = \sum_{j=1}^{k} \sum_{i_\ell \in C_j} | i_\ell - w_j |^2$;
  until $E$ no longer decreases, or cluster memberships stabilize.

Figure 5.5: k-means clustering algorithm.
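The algorithm of Figure 5.5 can be sketched as follows; initialization from randomly chosen data points is one common choice:

```python
import numpy as np

def kmeans(X, k=3, max_iters=100, seed=0):
    """Plain k-means: assign points to the nearest prototype, then recompute
    each prototype as the centroid of its cluster."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=k, replace=False)]   # prototypes from the data
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (P, k) distances
        labels = d2.argmin(axis=1)
        W_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else W[j] for j in range(k)])
        if np.allclose(W_new, W):                      # memberships stabilized
            break
        W = W_new
    E = ((X - W[labels]) ** 2).sum()                   # quantization error
    return W, labels, E
```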
5.2 Learning Vector Quantizers

Learning vector quantizers (LVQ) apply simple competitive learning to solve supervised learning tasks, where class membership is known for each training pattern.

Each node is associated with a predetermined class label. The number of nodes chosen for each class is roughly proportional to the number of training patterns in that class.

The learning rate varies with time, e.g., $\eta(t) = 1/t$ or $\eta(t) = a(1 - t/A)$, where $a > 0$ and $A > 1$. This helps the network converge to a state in which the weight vectors are stable.
Example 5.4: Let the input vectors be the same as in Example 5.2: $i_1 = (1.1, 1.7, 1.8)$, $i_2 = (0, 0, 0)$, $i_3 = (0, 0.5, 1.5)$, $i_4 = (1, 0, 0)$, $i_5 = (0.5, 0.5, 0.5)$, and $i_6 = (1, 1, 1)$. Assume only the first and last samples come
Algorithm (LVQ1):
  Initialize each weight to a value in $(0, 1)$;
  associate each node with one of the classes;
  repeat
    adjust $\eta(t)$;
    for each $i_k$, do
      find the node $j^*$ whose weight vector $w_{j^*}$ is closest to $i_k$;
      for $\ell = 1, \ldots, n$, do
        if the class label of node $j^*$ equals the desired class of $i_k$
        then $\Delta w_{j^*,\ell} = \eta(t)\,(i_{k,\ell} - w_{j^*,\ell})$
        else $\Delta w_{j^*,\ell} = -\eta(t)\,(i_{k,\ell} - w_{j^*,\ell})$;
  until the network converges or computational bounds are exceeded.

Figure 5.6: LVQ1 algorithm.
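A minimal sketch of LVQ1, applied to the data and initial weights of Example 5.4 below (the learning-rate decay factor is an illustrative assumption):

```python
import numpy as np

def lvq1(X, labels, W, node_classes, eta=0.5, epochs=3, decay=0.5):
    """LVQ1: the winner moves toward the sample if its class matches, else away."""
    for _ in range(epochs):
        for x, c in zip(X, labels):
            j = np.argmin(((W - x) ** 2).sum(axis=1))   # nearest prototype
            sign = 1.0 if node_classes[j] == c else -1.0
            W[j] += sign * eta * (x - W[j])
        eta *= decay                                    # shrinking learning rate
    return W

X = np.array([[1.1, 1.7, 1.8], [0, 0, 0], [0, 0.5, 1.5],
              [1, 0, 0], [0.5, 0.5, 0.5], [1, 1, 1]])
y = np.array([1, 2, 2, 2, 2, 1])                        # classes of the samples
W0 = np.array([[0.2, 0.7, 0.3], [0.1, 0.1, 0.9], [1.0, 1.0, 1.0]])
print(lvq1(X, y, W0.copy(), node_classes=[1, 2, 2]))
```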
from Class 1, and let
$$W = \begin{pmatrix} w_1 \\ w_2 \\ w_3 \end{pmatrix} = \begin{pmatrix} 0.2 & 0.7 & 0.3 \\ 0.1 & 0.1 & 0.9 \\ 1 & 1 & 1 \end{pmatrix},$$
where $w_1$ is associated with Class 1 and the other two nodes with Class 2. Let $\eta(t) = 0.5$ initially, reduced step-wise in later epochs.

1. Sample $i_1$ (Class 1) is presented; the winner is $w_3$ (squared distance $1.1$). Since $w_3$ is associated with Class 2, it is moved away from the sample: $w_3$ changes to $(1,1,1) - 0.5\,((1.1, 1.7, 1.8) - (1,1,1)) = (0.95, 0.65, 0.6)$.

Subsequent samples are processed in the same manner. Associations between input samples and weight vectors stabilize by the second cycle of pattern presentations. The weight vectors continue to change, converging in later iterations to the centroids of their associated input vectors.
Learning rule for "LVQ2": Let $w_1, w_2$ be the two weight vectors nearest to the input vector $i$. If both weight vectors are at comparable distances from $i$, but only $w_2$ is associated with the desired class, then $w_2$ is moved closer to the presented pattern and $w_1$ is moved farther from it, using an update rule similar to that of LVQ1.
5.3 Counterpropagation

The "forward-only" counterpropagation network consists of an input layer, a hidden layer, and an output layer. Connections exist from every input node to every hidden node, and from every hidden node to every hidden and output node.

Forward-only counterpropagation training consists of two phases, each of which uses a separate time-varying learning rate, $\alpha(t)$ and $\beta(t)$.

Phase 1: Hidden layer nodes are used to divide the training set into clusters containing similar input patterns, using the simple competitive learning rule.
Figure 5.7: Architecture of the forward-only counterpropagation neural network: an input layer; a hidden layer with inhibitory connections, whose incoming weights are trained by simple competitive learning; and an output layer, whose incoming weights are trained by the outstar rule.
Phase 2: Weights on edges out of the winner hidden node $k^*$ are modified using gradient descent:
$$\Delta v_{j,k^*} = \beta(t)\,(d_{p,j} - v_{j,k^*}),$$
where $d_p$ is the desired output vector.

The system can evolve with time, since small corrections may be made to the first-layer weights even during the second phase.

After training, the network is used to find the hidden node whose weight vector best matches the input vector. The output pattern generated by the network is identical to the vector of weights leading out of the winner hidden node.
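A minimal sketch of the two training phases and of recall, with illustrative constant-within-phase learning rates standing in for the time-varying $\alpha(t)$ and $\beta(t)$:

```python
import numpy as np

def train_counterprop(X, D, m=4, alpha=0.3, beta=0.3, epochs=50, seed=0):
    """Forward-only counterpropagation, sketched in two phases."""
    rng = np.random.default_rng(seed)
    W = rng.random((m, X.shape[1]))          # input -> hidden (competitive layer)
    V = rng.random((D.shape[1], m))          # hidden -> output (outstar weights)
    for _ in range(epochs):                  # Phase 1: cluster the inputs
        for x in X:
            k = np.argmin(((W - x) ** 2).sum(axis=1))
            W[k] += alpha * (x - W[k])
    for _ in range(epochs):                  # Phase 2: learn outputs per cluster
        for x, d in zip(X, D):
            k = np.argmin(((W - x) ** 2).sum(axis=1))
            V[:, k] += beta * (d - V[:, k])
    return W, V

def recall(x, W, V):
    k = np.argmin(((W - x) ** 2).sum(axis=1))
    return V[:, k]                           # weights out of the winner node
```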
Full counterpropagation works in both directions:
1. to predict $i_p$, given $d_p$; and
2. to predict $d_p$, given $i_p$.
Initially, all four sets of weights $w^{(1)}, w^{(2)}, v^{(1)}, v^{(2)}$ in Figure 5.9 are assigned random values between 0 and 1.
Algorithm (Forward-only Counterpropagation):
  Initialize weights randomly in $(0, 1)$, and select $\alpha(0), \beta(0)$;
  Phase 1: repeat
    for $p = 1, \ldots, P$, do
      find the winner $k^*$, minimizing $\sum_{j=1}^{n} (w_{k^*,j} - i_{p,j})^2$;
      update each $w_{k^*,\ell} := (1 - \alpha)\, w_{k^*,\ell}(\text{old}) + \alpha\, i_{p,\ell}$;
    end-for;
    decrease $\alpha$, e.g., by a small constant;
  until performance or computational bounds are met.
  Phase 2: repeat
    for $p = 1, \ldots, P$, do
      find the winner $k^*$ as in Phase 1;
      for each $j, \ell$, update
        $w_{k^*,\ell} := (1 - \alpha)\, w_{k^*,\ell}(\text{old}) + \alpha\, i_{p,\ell}$;
        $v_{j,k^*} := (1 - \beta)\, v_{j,k^*}(\text{old}) + \beta\, d_{p,j}$;
    end-for;
    decrease $\beta$;
  until performance or computational bounds are met.

Figure 5.8: Forward-only counterpropagation learning algorithm.
Figure 5.9: Full counterpropagation neural network: an input pattern $i_p$ and an output pattern $d_p$ are both presented to a shared hidden layer with inhibitory connections, via weights $w^{(1)}$ and $w^{(2)}$; weights $v^{(1)}$ and $v^{(2)}$ generate the predicted output and predicted input patterns, respectively.
Phase 1 adjusts the weights $w^{(1)}$ and $w^{(2)}$ associated with connections leading into the hidden nodes; $i_p$ and $d_p$ are both used in this process. Phase 2 adjusts the weights $v^{(1)}$ and $v^{(2)}$ associated with connections leading away from the hidden nodes. After training, the network can be used to predict $i_p$ from $d_p$, or conversely.

In a variation of this algorithm, all hidden nodes participate in predicting a value, in proportion to their distance from the input.
5.4 Adaptive Resonance Theory

Adaptive resonance theory (ART) networks allow the number of clusters to vary with problem size. The user can control the dissimilarity between members of the same cluster by selecting a "vigilance" parameter.

ART networks undergo unsupervised learning to associate patterns with prototypes. Each input layer node corresponds to an input vector component. Each outer layer node represents a cluster centroid.

Each input pattern is presented several times, and may be associated with different clusters before the network stabilizes. At each successive presentation, network weights are modified.

The first layer transmits the input patterns to the second layer. Matching succeeds if the pattern returned in response is sufficiently similar to the input pattern. Otherwise, signals travel back and forth between the output layer and the input layer ("resonance") until a match is discovered. If no satisfactory match is found, a new cluster of patterns is formed around the new input vector; the system is thus adaptive.

ART1 restricts all inputs to be binary-valued.
Example 5.5: Let the vigilance parameter be $\rho = 0.7$, let $n = 7$, and let the input set consist of five seven-dimensional binary vectors. For the first output node, initialize the top-down weights $t_{\ell,1}(0) = 1$ and the bottom-up weights $b_{1,\ell}(0) = 1/(1 + n) = 1/8$.
Figure 5.10: ART1 network: an input layer and an output layer with inhibitory connections, linked by bottom-up weights $b_{\ell,j}$ and top-down weights $t_{\ell,j}$.
For the first input vector $x$, the single existing node computes $y_1 = \sum_\ell b_{1,\ell}\, x_\ell = \frac{1}{8} \sum_\ell x_\ell$, and $y_1$ is declared the uncontested winner. Since the top-down weights are initially all 1,
$$\frac{\sum_{\ell=1}^{7} t_{\ell,1}\, x_\ell}{\sum_{\ell=1}^{7} x_\ell} = 1 \ge \rho = 0.7,$$
so the vigilance test succeeds, and the weights are updated:
$$b_{1,\ell}(\text{new}) = \frac{t_{\ell,1}(\text{old})\, x_\ell}{0.5 + \sum_{\ell=1}^{7} t_{\ell,1}(\text{old})\, x_\ell},$$
so that $b_{1,\ell}$ is nonzero exactly where $x_\ell = 1$. Likewise, the top-down weights become $t_{\ell,1}(\text{new}) = t_{\ell,1}(\text{old})\, x_\ell$, i.e., the first row of $T$ becomes a copy of $x$.
Algorithm (ART1):
  Initialize each $t_{\ell,j}(0) = 1$ and each $b_{j,\ell}(0) = \dfrac{1}{1+n}$;
  while the network has not stabilized, do
    1. Let $A$ contain all nodes;
    2. For a randomly chosen input vector $x$, compute $y_j = b_j \cdot x$ for each $j \in A$;
    3. repeat
       (a) Let $j^*$ be a node in $A$ with largest $y_j$;
       (b) Compute $s^* = (s^*_1, \ldots, s^*_n)$, where $s^*_\ell = t_{\ell,j^*}\, x_\ell$;
       (c) If $\dfrac{\sum_{\ell=1}^{n} s^*_\ell}{\sum_{\ell=1}^{n} x_\ell} < \rho$ then remove $j^*$ from set $A$
           else associate $x$ with node $j^*$ and update the weights:
           $$b_{j^*,\ell}(\text{new}) = \frac{t_{\ell,j^*}(\text{old})\, x_\ell}{0.5 + \sum_{\ell=1}^{n} t_{\ell,j^*}(\text{old})\, x_\ell}, \qquad t_{\ell,j^*}(\text{new}) = t_{\ell,j^*}(\text{old})\, x_\ell;$$
       until $A$ is empty or $x$ is associated with some node;
    4. If $A$ is empty, create a new node with weight vector $x$;
  end-while.

Figure 5.11: ART1 learning algorithm.
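A minimal sketch of the ART1 algorithm of Figure 5.11; the binary patterns are made-up placeholders:

```python
import numpy as np

def art1(patterns, rho=0.7, presentations=2):
    """ART1 sketch for nonzero binary patterns: grow nodes as vigilance demands."""
    B, T = [], []                                   # bottom-up, top-down weights
    for _ in range(presentations):
        for x in patterns:
            x = np.asarray(x, dtype=float)
            active = list(range(len(B)))
            placed = False
            while active and not placed:
                j = max(active, key=lambda j: np.dot(B[j], x))  # largest y_j
                s = T[j] * x                                    # intersection
                if s.sum() / x.sum() < rho:
                    active.remove(j)                            # fails vigilance
                else:
                    B[j] = s / (0.5 + s.sum())                  # resonance: learn
                    T[j] = s
                    placed = True
            if not placed:                                      # new cluster node
                B.append(x / (0.5 + x.sum()))
                T.append(x.copy())
    return B, T

pats = [[1, 1, 0, 0, 0, 0, 1], [0, 0, 1, 1, 1, 0, 0], [1, 0, 0, 0, 0, 0, 1]]
B, T = art1(pats)
print(len(T), "nodes")
```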
For the second sample, the existing node wins the initial competition, but
$$\frac{\sum_\ell t_{\ell,1}\, x_\ell}{\sum_\ell x_\ell} < \rho,$$
so the vigilance test fails, and a new node is generated: its top-down weight vector equals the sample, and its bottom-up weights follow from the update rule above. $B$ then has two columns and $T$ two rows, one per cluster.

The network stabilizes with three nodes, in two presentations of each input vector.
Observations:
1. Samples may switch allegiance: a sample previously activating one node may later activate another.
2. When the initial winner (with highest $y_j$) fails the vigilance test, another node may subsequently satisfy it.
3. Top-down weights are modified by computing intersections (conjunctions) with the input vector, so that the number of 1's does not increase; hence we initialize each top-down weight to 1.
4. Given a sufficient number of nodes, "outliers" that ought not to belong to any cluster will also be assigned separate nodes.
Minor Variations of ART1:
1. Each bottom-up weight is assigned some common value such that $0 < b_{j,\ell} < \frac{L}{L - 1 + n}$, where $L$ (typically 1 or 2) is a preassigned positive integer. Given integers $B$ and $D$ such that $D \ge B \ge 1$, the top-down weights are chosen such that $\frac{B-1}{D} < t_{\ell,j} < 1$. The bottom-up weight updates depend on $L$, with
$$b_{j,\ell} = \frac{L\, x_\ell}{L - 1 + \sum_{\ell=1}^{n} x_\ell},$$
and the top-down weights are updated as before.
2. An additional check can be performed to determine the appropriateness of the membership of a given pattern in a cluster. Instead of working with a set of bottom-up and top-down weights, we consider prototypes representing clusters. Let $w_j = (w_{j,1}, \ldots, w_{j,n})$ denote the prototype of the $j$th cluster, for $j = 1, \ldots, m$. Given a pattern $x = (x_1, \ldots, x_n)$, the similarity of the active prototypes is found by examining the value of
$$\frac{\sum_{\ell=1}^{n} w_{j,\ell}\, x_\ell}{\epsilon + \sum_{\ell=1}^{n} w_{j,\ell}},$$
where $\epsilon > 0$. If the best matching prototype $j^*$ is sufficiently similar to $x$, i.e.,
$$\frac{\sum_{\ell=1}^{n} w_{j^*,\ell}\, x_\ell}{\epsilon + \sum_{\ell=1}^{n} w_{j^*,\ell}} > \frac{\sum_{\ell=1}^{n} x_\ell}{\epsilon + n},$$
then examine whether
$$\frac{\sum_\ell w_{j^*,\ell}\, x_\ell}{\sum_{\ell=1}^{n} x_\ell} \ge \rho.$$
Both conditions must be satisfied to associate $x$ with cluster $j^*$ and modify its prototype.
5.5 Topologically Organized Networks

The self-organizing map (SOM), proposed by Kohonen, combines competitive learning with a topological structuring of the network, such that adjacent nodes attain similar weight vectors. Competitive learning requires inhibitory connections among all nodes; topological structuring implies that each node also has excitatory connections to a small number of other nodes in the network.

The topology is specified in terms of a neighborhood relation among nodes. Weight vectors of the winner, as well as of its neighbors, move towards an input sample.

The SOM training algorithm updates the winner node and also the nodes in its topological vicinity whenever an input pattern is presented. The only criterion is the topological distance between nodes; there is no other excitatory "weight" between the winner node and its neighbors.

The neighborhood $N_j(t)$ contains the nodes that are within a topological distance of $D(t)$ from node $j$ at time $t$.
Algorithm (Self-Organize):
  Select the network topology (neighborhood relation);
  initialize weights randomly, and select $D(0) > 0$;
  while computational bounds are not exceeded, do
    1. select an input sample $i_\ell$;
    2. find the output node $j^*$ with minimum $\sum_{k=1}^{n} (i_{\ell,k}(t) - w_{j,k}(t))^2$;
    3. update the weights of all nodes within a topological distance of $D(t)$ from $j^*$, using
       $$w_j(t+1) = w_j(t) + \eta(t)\,(i_\ell(t) - w_j(t)),$$
       where $0 < \eta(t) \le \eta(t-1) \le 1$;
    4. increment $t$;
  end-while.

Figure 5.12: Kohonen's SOM learning algorithm.
Figure 5.13: Time-varying neighborhoods of a node in a SOM network with grid topology. $D(t)$ decreases stepwise: successively smaller neighborhoods apply for $0 \le t < 10$, $10 \le t < 20$, $20 \le t < 30$, and $30 \le t < 40$, and $D(t) = 0$ for $t \ge 40$.
Here $D(t)$ decreases with time. $D(t)$ does not refer to Euclidean distance in input space; it refers only to the length of the path connecting two nodes, in the prespecified topology chosen for the network.

Figure 5.14: Shrinkage of the neighborhood of the winner node in topologically organized SOM networks as time increases (topological distance from the winner node vs. time period of node adaptation).

Weights change at a rate $\eta(t)$ that decreases with time. If $j$ is the winner node,
$$N_j(t) = \{j\} \cup \{\text{neighbors of } j \text{ at time } t\},$$
and $i$ is the input vector presented to the network at
time $t$, then the weight change rule is
$$w_\ell(t+1) = \begin{cases} w_\ell(t) + \eta(t)\,(i - w_\ell(t)) & \text{if } \ell \in N_j(t), \\ w_\ell(t) & \text{if } \ell \notin N_j(t). \end{cases}$$
Weight vectors often become "ordered," i.e., topological neighbors become associated with weight vectors that are near each other in the input space. This is depicted in Figure 5.15 for uniformly distributed input vectors: the weight vectors are initially random, but align themselves as shown in Figures 5.15(a)(ii) and (b)(ii).
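A minimal sketch of a SOM on a one-dimensional chain of nodes, with a stepwise-shrinking neighborhood radius and a decaying learning rate (both schedules are illustrative assumptions):

```python
import numpy as np

def som_1d(X, m=10, eta=0.5, D0=2, epochs=60, seed=0):
    """SOM on a 1-D chain of m nodes: the winner and its chain neighbors within
    topological distance D move toward each sample; eta and D shrink with time."""
    rng = np.random.default_rng(seed)
    W = rng.random((m, X.shape[1]))
    stage = epochs // (D0 + 1)              # epochs spent at each neighborhood size
    for epoch in range(epochs):
        D = max(0, D0 - epoch // stage)     # stepwise-shrinking neighborhood radius
        for x in X:
            j = np.argmin(((W - x) ** 2).sum(axis=1))    # winner node
            lo, hi = max(0, j - D), min(m, j + D + 1)    # topological neighborhood
            W[lo:hi] += eta * (x - W[lo:hi])             # move neighborhood toward x
        eta *= 0.95                                      # decaying learning rate
    return W
```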
Example 5.6: Consider the same data as in Example 5.2: $i_1 = (1.1, 1.7, 1.8)$, $i_2 = (0, 0, 0)$, $i_3 = (0, 0.5, 1.5)$, $i_4 = (1, 0, 0)$, $i_5 = (0.5, 0.5, 0.5)$, $i_6 = (1, 1, 1)$. Assume a 3-node network with linear topology (B–A–C), in which node A is adjacent to nodes B and C.
Figure 5.15: Weight vectors in the input space before and after the SOM learning algorithm is applied: (a) using a one-dimensional (linear) network topology for two-dimensional input data uniformly distributed in a triangular region, and (b) using a two-dimensional grid network topology for two-dimensional input data uniformly distributed in a rectangular region. Weight vectors are initially located randomly, as shown in (a)(i) and (b)(i); by the end of the learning process, the weights become "ordered," as shown in (ii) of each example.
Let the initial weights be given by
$$W = \begin{pmatrix} w_A \\ w_B \\ w_C \end{pmatrix} = \begin{pmatrix} 0.2 & 0.7 & 0.3 \\ 0.1 & 0.1 & 0.9 \\ 1 & 1 & 1 \end{pmatrix}.$$
Let $D(t) = 1$ for one initial epoch (a sequence of six input vector presentations), and $D(t) = 0$ thereafter. Let the initial $\eta = 0.5$, reduced to $0.25$ in the second epoch and to $0.1$ in the third epoch.

$t = 1$: Sample presented: $i_1 = (1.1, 1.7, 1.8)$. The squared Euclidean distance between $A$ and $i_1$ is $d^2(A) = (1.1 - 0.2)^2 + (1.7 - 0.7)^2 + (1.8 - 0.3)^2 \approx 4.1$. Similarly, $d^2(B) \approx 4.4$ and $d^2(C) \approx 1.1$.

$C$ is the winner, since $d^2(C) < d^2(A)$ and $d^2(C) < d^2(B)$. Since $D(t) = 1$ at this stage, the weights of both $C$ and its neighbor $A$ are updated according to the rule above, e.g., $w_{A,1} := w_{A,1} + 0.5\,(x_1 - w_{A,1}) = 0.65$.
The resulting weight matrix is
$$W = \begin{pmatrix} w_A \\ w_B \\ w_C \end{pmatrix} = \begin{pmatrix} 0.65 & 1.2 & 1.05 \\ 0.1 & 0.1 & 0.9 \\ 1.05 & 1.35 & 1.4 \end{pmatrix}.$$
The weights attached to $B$ are not modified, since $B$ falls outside the neighborhood of the winner node $C$.
$t = 7$: $\eta$ is now reduced to $0.25$, and the neighborhood relation shrinks, so that only the winner node is updated henceforth. From this point on, each sample presentation moves only the weight vector of its winner: for instance, $C$ continues to win for $i_1$, and only $w_C$ is updated when $i_1$ is presented. Training continues in this fashion, with $\eta$ further reduced to $0.1$ in the third epoch, so that by the end of the third epoch each row of $W$ lies near the samples for which that node wins.
At the beginning of the training process, the Euclidean distances between the various nodes' weight vectors were
$$|w_A - w_B| \approx 0.85, \quad |w_B - w_C| \approx 1.28, \quad |w_A - w_C| \approx 1.10.$$
The training process increases the relative distance between the non-adjacent nodes $B$ and $C$, while $A$ remains roughly in between $B$ and $C$.
Note that samples may switch allegiance between nodes, especially in the early phases of computation. Computation continues until the same nodes remain the winners for the same input patterns.
Example: Eight three-dimensional samples at the corners of the unit cube are presented to a network in which one node initially lies inside the cube while all others are far away; the node inside the cube is then the winner for all samples. If the neighborhood relation for the inner node does include other nodes, those nodes are pulled towards the cube. If computation continues long enough, some of them eventually become winners for some inputs.
The SOM algorithm with steadily shrinking neighborhood relations has been shown to perform better than networks with static neighborhood functions.

The SOM learning algorithm can also be extended for supervised learning tasks, using a variant of LVQ that maintains a topological relation between nodes and updates the nodes in a neighborhood whose size decreases with time.
Convergence

A one-dimensional SOM, in which the $\ell$th node (with weight vector $w_\ell$) is connected to the $(\ell \pm 1)$st nodes, is "ordered" if
$$|r - s| < |r - q| \;\Rightarrow\; |w_r - w_s| < |w_r - w_q|,$$
where $r, s, q$ are node labels and $w_r, w_s, w_q$ are the weights corresponding to those nodes. If the neighborhood relation satisfies certain properties, an ordered configuration is stationary and stable. This state will be reached with probability 1 as long as the neighborhood function is positive-valued, normalized, and decreasing with distance, so that there exists a sequence of patterns which will lead the network into the ordered configuration.

However, for some sequences of input pattern presentations, e.g., if patterns are presented in a fixed sequence, even the one-dimensional Kohonen network may not converge to an ordered state. The network may instead settle into a state in which topologically adjacent nodes are physically distant from each other in the weight space.
Figure 5.16: Emergence of an "ordered" map. The dark circles represent data points in one-dimensional space, and nodes are identified as A, B, C, and D, with connections indicating a linear topology. Part (a) depicts the initial positions of the nodes before applying the SOM learning algorithm, and part (b) shows the final positions of the nodes, with topologically adjacent nodes having similar weights.
Multi-Dimensional Maps: In practice, the SOM often converges quickly to ordered maps, but there is no guarantee that this will occur. One may view an SOM's execution as consisting of two phases: nodes are more "volatile" in the first phase, searching for niches to move into, while the second, "sober" phase consists of nodes settling into cluster centroids in the vicinity of positions found in the earlier phase. The sober phase converges, but the emergence of an ordered map depends on the result of the volatile phase.

A topologically ordered configuration is reachable, but the network may also move out of such an ordered configuration. In general, there is no energy function on which the SOM weight update rule performs gradient descent.
Example 5.7: Consider the network in Figure 5.17, with four nodes whose weight vectors lie at the corners of a square. If input patterns are repeatedly chosen near the center of the square, but slightly closer to $w_1$, then the nodes previously at $w_2$ and $w_4$ will eventually move closer to each other than to node $w_3$: topological adjacency no longer guarantees proximity of weight vectors.
Figure 5.17: Topologically non-adjacent nodes $w_2, w_4$ move closer to each other than to an adjacent node $w_3$, on repeated presentation of an input sample $i$ whose winner is $w_1$: (a) initial positions of the weight vectors; (b) final positions.
Extensions

A "conscience" mechanism enables most nodes in a network to achieve useful learning: when "rich" enough, nodes give up some of their "wealth" (inputs for which they are winners). For each node $j$, the conscience mechanism monitors the frequency $\text{freq}_j$ with which it becomes the winner. The winner is chosen as the node for which $d(w, x) \cdot g(\text{freq}_j)$ is minimum, where $g$ is a monotonically increasing function, e.g.,
$$g(\text{freq}_j) = c\,\text{freq}_j + \frac{1}{m},$$
where $c > 0$ and $m$ is the number of competing nodes.

Another possible implementation temporarily deactivates a node whose winning frequency exceeds a predetermined limit such as $cP/m$, where $c \ge 1$, $P$ is the number of patterns, and $m$ is the number of nodes; further competition is conducted only among the other, active nodes.
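A minimal sketch of winner selection with a conscience, using $g(\text{freq}_j) = c\,\text{freq}_j + 1/m$; the running frequency estimate is an illustrative choice:

```python
import numpy as np

def conscience_winner(W, x, freq, c=10.0):
    """Pick the winner by distance scaled with g(freq) = c*freq + 1/m,
    so that frequently winning nodes are handicapped."""
    m = len(W)
    d = np.sqrt(((W - x) ** 2).sum(axis=1))   # distances to all prototypes
    g = c * freq + 1.0 / m                    # conscience penalty factors
    j = int(np.argmin(d * g))
    freq *= 0.99                              # running estimate of win frequency
    freq[j] += 0.01
    return j

# freq is typically initialized uniformly, e.g., np.full(m, 1.0 / m).
```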
Hierarchical Maps

"Hierarchical" clustering algorithms develop a tree of clusters, in which subclusters are associated with the children of each node. In "bottom-up" algorithms, smaller lower-level clusters are successively merged to obtain higher-level clusters; "top-down" algorithms successively partition each cluster into subclusters.
The simple SOM is inadequate for applications with elongated, ellipsoidal, or arbitrarily shaped clusters, since it attempts to minimize the Euclidean distance between samples and node weight vectors, generating symmetric clusters. In a multilayer self-organizing map (MSOM), the outputs of the first SOM are fed into a second SOM as inputs. The first layer clusters together all similar elements of the training set, but this similarity is confined to circular symmetry in the original space; the second layer of nodes combines these circular clusters into arbitrarily shaped clusters, as in bottom-up hierarchical algorithms.

Luttrell's approach steadily increases the number of nodes, repeatedly applying the SOM learning rule at each step. Weight updates are restricted to a few topological neighbors of the winner. Each new node is inserted halfway between two existing nodes. High-level clusters obtained at early phases of the algorithm are partitioned into sub-clusters in later phases, as in top-down algorithms.
Growing Cell Structures (GCS) adapt the network size, approximating some probability distributions better than the SOM. GCS networks maintain a hyper-tetrahedral topology of fixed dimensionality; e.g., in Figure 5.18, the topology consists of triangles. The choice of neighbors may be modified as the network evolves, tracking changes in the nature of the problem.

A "signal counter" $\tau_j$ is associated with each node $j$, and estimates the proportion of samples currently associated with a cluster located at that node.

The algorithm has three main components:

Node adaptation: The signal counter $\tau_{j^*}$ of the winner is incremented, while the others decay, $\Delta\tau_\ell = -\alpha\,\tau_\ell$ for each $\ell \ne j^*$. The winner $j^*$ and its immediate neighbors (only) are adapted:
$$w_j(t+1) = w_j(t) + \epsilon\,(i(t) - w_j(t)) \quad \text{if } j \text{ is adjacent to } j^*,$$
$$w_{j^*}(t+1) = w_{j^*}(t) + \epsilon^*\,(i(t) - w_{j^*}(t)), \quad \text{with } \epsilon^* > \epsilon.$$
Figure 5.18: GCS with triangular topology: (a) data distribution and initial network, with a signal counter $\tau$ at each node; (b) node addition, with a new node and new edges inserted near the node with the highest $\tau$; (c) node deletion.
Node insertion: After some adaptation steps, a new node $\ell_{\text{new}}$ is added midway between the node $\ell$ with the largest signal counter value and $\ell$'s farthest neighbor $\ell_{\text{far}}$. The neighbors of $\ell_{\text{new}}$ are $\ell$, $\ell_{\text{far}}$, and others as needed to preserve the chosen topological structure of the network. Signal counter values change in rough proportion to the fraction of the volume of the Voronoi region of each node lost to $\ell_{\text{new}}$.
Node deletion: Nodes with counter values below a threshold are deleted, along with their incident edges; nodes left with no other neighbors are then deleted. Other nodes may also be deleted, to maintain the topological structure of the network.

The GCS network grows and shrinks repeatedly, and may not converge to a stable state.

Instead of the signal counter value, one may use any other "resource" value, e.g., the quantization error.
GCS for supervised learning: The weight vector for each hidden node is obtained using unsupervised GCS. Each hidden node applies a Gaussian function to its net input; each output node applies a sigmoid node function. When a sample is presented, the resource value of the current best matching unit $j$ is updated by
$$\Delta\tau_j = \sum_{\ell=1}^{m} (o_\ell - d_\ell)^2$$
for function approximation problems; for classification problems, $\Delta\tau_j \in \{0, 1\}$.

Weights to output nodes are modified using the delta rule,
$$\Delta w_{\ell,j} = \eta\,(d_\ell - o_\ell)\, y_j,$$
where $y_j$ is the output of the $j$th hidden-layer node.
5.6 Distance-Based Learning

Weight vectors may be updated based on Euclidean distance in the input space instead of topological proximity, with the advantage that convergence results can be obtained by showing that the algorithm conducts gradient descent on an energy function. Simple competitive learning updates only the winner node; the SOM updates the winner and its topological neighbors; the "maximum entropy" algorithm updates all nodes; and the "neural gas" algorithm makes significant updates to only a few nodes. The effects of the new update rules are shown in Figure 5.19; compare with Figure 5.14 for the SOM.

Figure 5.19: Weight adaptation rules in (a) the neural gas and (b) the maximum entropy algorithms. The y-axis measures the relative strength of adaptation, $|\Delta w_j| / |i - w_j|$; the x-axis shows (a) the rank of the weight vector in terms of distance from the input vector, and (b) the Euclidean distance of the weight vector from the input pattern.
Maximum entropy procedure: Each weight vector is drawn closer to the input vector, by a quantity that depends on their Euclidean distance and on a "temperature" parameter $T$, which is steadily decreased:
$$\Delta w_j = \eta\,(i - w_j)\, \frac{\exp(-\|i - w_j\|^2 / T)}{\sum_\ell \exp(-\|i - w_\ell\|^2 / T)}.$$
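A minimal sketch of one maximum-entropy update step at a fixed temperature:

```python
import numpy as np

def max_entropy_update(W, x, eta=0.1, T=1.0):
    """Soft update: every prototype moves toward x, weighted by a Boltzmann
    factor of its squared distance at temperature T."""
    d2 = ((W - x) ** 2).sum(axis=1)
    p = np.exp(-d2 / T)
    p /= p.sum()                          # normalized adaptation strengths
    W += eta * p[:, None] * (x - W)       # all nodes move; nearer ones move more
    return W
```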
Neural Gas Algorithm: The change in the $j$th weight vector depends on its "rank" $k_j(i, W)$, the number of weight vectors $w_\ell \in W$ such that $\|w_\ell - i\| < \|w_j - i\|$:
$$\Delta w_j = \eta\,(i - w_j)\, h_\lambda(k_j(i, W)), \quad j \in \{1, \ldots, N\},$$
where $h_\lambda$ is monotonically decreasing with $h_\lambda(0) = 1$, e.g., $h_\lambda(x) = e^{-x/\lambda}$ with $\lambda > 0$.

This rule conducts stochastic gradient descent on
$$E = \frac{1}{2\,C(\lambda)} \sum_{j=1}^{N} \int P(i)\, h_\lambda(k_j(i, W))\, \|i - w_j\|^2\, di,$$
where $C(\lambda)$ is a normalization factor and $P(i)$ represents the probability distribution of the input vectors. If the input vector set is finite, a summation over all input vectors can be used instead of the integral. $\lambda$ is decreased with time; when it is small enough, the rule resembles "winner-take-all."
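A minimal sketch of one neural gas update step (the ranks are obtained by sorting, here via a double argsort):

```python
import numpy as np

def neural_gas_step(W, x, eta=0.3, lam=1.0):
    """Neural gas: each prototype moves toward x with strength e^(-rank/lambda),
    where rank 0 belongs to the nearest prototype."""
    d = np.sqrt(((W - x) ** 2).sum(axis=1))
    ranks = np.argsort(np.argsort(d))       # rank of each prototype by distance
    h = np.exp(-ranks / lam)                # h(0) = 1 for the winner
    W += eta * h[:, None] * (x - W)
    return W
```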
Although slower, since the weight vectors must be sorted at each input presentation, this algorithm yielded better weight vectors than the SOM, k-means clustering, and maximum entropy algorithms on a larger version of the clustering problem shown in Figure 5.20.
Figure 5.20: Node positions (a) before and (b) after applying the neural gas algorithm, when the data is contained within $n$ small rectangles.
5.7 Neocognitron

The input images are two-dimensional arrays, and the final result of pattern recognition indicates which high-level feature or shape has been found in the entire input image, activating the appropriate output node.

Figure 5.21: The neocognitron network: an input layer $U_0$ followed by a sequence of modules (Module 1 through Module 4), each containing an S-layer and a C-layer ($U_{S1}, U_{C1}, U_{S2}, U_{C2}, \ldots$).

The network uses a hierarchy of many modules, with each module extracting features from the outputs of the previous module. Each module consists of an S-layer and a C-layer.

- An S-layer node searches for a feature in a specific region of the input image ($U_0$ in Figure 5.21), e.g., for a vertical line segment in the top-left corner of the input image.
- A C-layer node corresponds to one relatively position-independent feature, receiving inputs from the S-layer nodes for that feature, e.g., to determine whether a vertical line segment occurs anywhere in the image.
The trainer must first fix the receptive field (the S-nodes) that provides inputs for each C-node. Modules closer to the input layer (lower in the hierarchy) are trained before the others. Higher-level modules represent complex position-independent features that depend on the detection of simpler features in preceding layers.

The simple competitive learning rule is applied to the weights on connections into S-nodes. Since each S-layer contains many nodes that detect identical features at different locations, only one such node is trained for each feature, and its weights are later replicated.
The selective-attention neural network is a neocognitron with additional "backward" connections, i.e., a complete copy of the forward connections wired to run in reverse. Backward-flowing signals identify the portions of the input
Figure 5.22: Connections between the input layer and the first neocognitron module: (a) each S-layer node receives inputs from one quadrant of the input space, and the combined results are output by the C-layer node for each feature; (b) vertical line segments in the second and third input quadrants activate two S-layer nodes.
image that contributed to the activated network output node. A gating network assists in temporarily deactivating some nodes, when the network's attention needs to be refocused.
Figure 5.23: The selective-attention neural network: forward modules ($U_{S1}, U_{C1}, \ldots, U_{S4}, U_{C4}$) paired with backward modules ($W_{S1}, W_{C1}, \ldots$), with gating signals linking the two pathways between the classification output and the input layer.
5.8 Principal Component Analysis (PCA) Networks

The PCA statistical procedure extracts important features of the data, reducing the dimensionality of input vectors while minimizing information loss. Interesting features are obtained as linear combinations of the available attributes; to avoid redundancy, these features are linearly independent.
Example: A 3-dimensional vector $x$ is transformed to a 2-dimensional vector $y$:
$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} a & b & c \\ p & q & r \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} a x_1 + b x_2 + c x_3 \\ p x_1 + q x_2 + r x_3 \end{pmatrix}.$$
If the rows of
$$W = \begin{pmatrix} a & b & c \\ p & q & r \end{pmatrix}$$
are orthonormal (orthogonal, unit length), then
$$W W^T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
and $W^T$ is the pseudo-inverse of $W$, irrespective of the number of orthonormal features and their dimension.
Generalization: For $m < n$, if $W$ minimizes the "information" loss in mapping $n$-dimensional vectors $x$ to $m$-dimensional vectors $y = Wx$, then applying the "opposite" transformation (the pseudo-inverse $W^T$) to $y$ should result in a vector close to $x$, i.e., $\|x - W^T W x\|$ should be minimized. Considering all input vectors together, $W$ must be chosen to minimize
$$\sum_{i=1}^{N} \|x_i - W^T W x_i\|^2.$$
Let $S_T$ be the variance-covariance matrix (defined in Appendix A) of the training set $T$. To extract $m$ orthonormal features with maximum variance, PCA selects the $m$ eigenvectors of $S_T$ that correspond to the $m$ largest eigenvalues, exploiting the following result:

- Let $b$ be any vector such that $\|b\| = 1$. Then the variance of $b \cdot x$, where $x \in T$, is maximized when $b$ is chosen to be the eigenvector of $S_T$ that corresponds to the largest magnitude eigenvalue of $S_T$.
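These observations lead directly to the following sketch of classical PCA in NumPy:

```python
import numpy as np

def pca(X, m=2):
    """Classical PCA: project the data onto the m leading eigenvectors of the
    covariance matrix of the mean-centered training set."""
    Xc = X - X.mean(axis=0)                     # subtract the mean vector
    S = np.cov(Xc, rowvar=False)                # variance-covariance matrix
    vals, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:m]          # indices of m largest eigenvalues
    W = vecs[:, order].T                        # rows are orthonormal features
    explained = vals[order].sum() / vals.sum()  # fraction of variance captured
    return W, Xc @ W.T, explained
```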
Example 5.8: Given a small training set $T$ of three-dimensional patterns, first subtract the mean vector from each pattern in $T$, obtaining centered vectors $x_1, x_2, \ldots$, and compute their covariance matrix $S$. Let $\lambda_1 > \lambda_2 > \lambda_3$ be the eigenvalues of $S$, with associated eigenvectors $e_1, e_2, e_3$.

For $m = 1$, we consider the largest magnitude eigenvalue $\lambda_1$, and this gives $W = W_1 = e_1^T$. For $m = 2$, we use both $\lambda_1$ and $\lambda_2$:
$$W = W_2 = \begin{pmatrix} e_1^T \\ e_2^T \end{pmatrix}.$$
The ratio $\lambda_1 / (\lambda_1 + \lambda_2 + \lambda_3)$ indicates the fraction of the variation in the training set that can be explained by a single eigenvector, using $W_1$ alone. To capture a larger fraction of the input characteristics, we may map the input vectors into a two-dimensional space instead, with $y = W_2\, x$.
PCA networks are feedforward networks without hidden nodes that extract principal components.

Figure 5.24: A one-layer neural network for principal component analysis, computing outputs $y_j$ from inputs $x_i$ via weights $w_{j,i}$.

Each $j$th output node calculates $y_j = \sum_{i=1}^{n} w_{j,i}\, x_i$. Weights are randomly assigned initially, and updated by
$$\Delta W = \eta\,(y_\ell\, x_\ell^T - K_\ell\, W),$$
where $K_\ell$ is one of the following:
1. $K_\ell = y_\ell\, y_\ell^T$ (Williams);
2. $K_\ell = 2\,L[y_\ell\, y_\ell^T] - D[y_\ell\, y_\ell^T]$ (Oja and Karhunen);
3. $K_\ell = L[y_\ell\, y_\ell^T]$ (Sanger);

where $D[A]$ has diagonal elements the same as $A$ (other elements 0), and $L[A]$ has elements on and below the main diagonal the same as $A$ (other elements 0). For example,
$$D\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} \quad \text{and} \quad L\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 3 & 4 \end{pmatrix}.$$

Figure 5.25: A neural network that generates the first principal component, with a single output $y_1 = w \cdot x$.
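A minimal sketch of such a one-layer network trained with Sanger's rule, under the convention above that $L$ keeps the diagonal (the learning rate and epoch count are illustrative choices):

```python
import numpy as np

def sanger_pca(X, m=2, eta=0.01, epochs=200, seed=0):
    """Train a one-layer PCA network with Sanger's rule:
    dW = eta * (y x^T - L[y y^T] W), L = lower triangle incl. the diagonal."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                   # center the data
    W = rng.normal(scale=0.1, size=(m, X.shape[1]))
    for _ in range(epochs):
        for x in Xc:
            y = W @ x
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W                                  # rows approach leading eigenvectors
```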
Example 5.9: Consider the data from Example 5.8, a small learning rate $\eta$, and randomly chosen initial weights for a single-output network. For each input $x_\ell$, the output $y = w \cdot x_\ell$ is computed, and the weights are updated using the $K_\ell$ suggested by Williams:
$$\Delta W = \eta\,(y\, x_\ell^T - y^2\, W).$$
This process is repeated, cycling through the input vectors; over successive iterations the weight vector converges towards the first principal component.
Better features may be extractable by employing nonlinear functions of $x$, e.g., minimizing $E\left[\|x - W^T S(Wx)\|^2\right]$, where $S$ is a nonlinear function with scalar arguments, extended to apply to vector arguments:
$$S\begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} S(y_1) \\ \vdots \\ S(y_m) \end{pmatrix}.$$
A gradient descent technique may then be used for nonlinear principal component analysis, especially when $S$ is a continuous, monotonically increasing, odd function, i.e., $S(-y) = -S(y)$.
5.9 Conclusion

Biological neural networks are capable of learning to recognize patterns by abstracting what is common to many different instances of similar data. The driving force in this learning process is the similarity among various patterns, not an external reinforcement, nor the relationships between various input dimensions.

Simple competitive learning networks modify the weights of connections to a single node when an input pattern is presented. ART networks allow the number of nodes to increase, and also utilize a user-specified threshold to determine the degree of success in matching an input pattern to the best matching prototype represented by a node. An SOM has a predefined topological structure, and adapts nodes such that neighbors in the topology respond similarly to input patterns. PCA networks determine features that are linear combinations of input dimensions, preserving the maximum possible information that distinguishes input samples.
These algorithms can also be used for supervised learning tasks, as in LVQ, counterpropagation, the neocognitron, and supervised GCS, especially when the training data consists of readily identifiable clusters. Such algorithms can be misled by noisy training data, for which algorithms such as backpropagation have better generalization properties.