Dynamics of Learning VQ and Neural Gas
Aree Witoelar, Michael Biehl
Mathematics and Computing Science, University of Groningen, Netherlands
in collaboration with Barbara Hammer (Clausthal), Anarta Ghosh (Groningen)
Dagstuhl Seminar, 25.03.2007
Outline
Vector Quantization (VQ)
Analysis of VQ Dynamics
Learning Vector Quantization (LVQ)
Summary
Vector Quantization
Objective: representation of (many) data points with (few) prototype vectors

Assign each data point ξ^μ to the nearest prototype vector w_j (by a distance measure, e.g. Euclidean), grouping the data into clusters, e.g. for classification.

Quantization error:

E(W) = (1/P) Σ_{μ=1..P} min_j d(ξ^μ, w_j)   (distance of each data point to its nearest prototype)

Find the optimal set W for the lowest quantization error.
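In code, the quantization error could be computed as follows (a minimal NumPy sketch, not from the slides; uniform weighting over the P examples and squared Euclidean distance are assumptions):

```python
import numpy as np

def quantization_error(X, W):
    """E(W) = (1/P) sum_mu min_j d(xi_mu, w_j): mean squared Euclidean
    distance of each data point to its nearest prototype."""
    # d has shape (P, K): squared distance from every point to every prototype
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
W = np.array([[0.0, 0.0], [10.0, 0.0]])
# nearest squared distances are 0, 1, 0, so E(W) = 1/3
print(quantization_error(X, W))
```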
Example: Winner Takes All (WTA)
• initialize K prototype vectors
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner even closer towards the example
• stochastic gradient descent with respect to a cost function
• prototypes end up in areas with high density of data
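The steps above can be sketched as follows (a minimal NumPy sketch, not from the slides; here eta plays the role of the per-step factor η/N, and the function name is my own):

```python
import numpy as np

def wta_step(W, xi, eta):
    """Winner-Takes-All update: move only the closest prototype
    towards the presented example xi by a fraction eta."""
    d = ((W - xi) ** 2).sum(axis=1)   # squared Euclidean distances
    j = int(np.argmin(d))             # the winner
    W[j] += eta * (xi - W[j])
    return W, j

W = np.array([[0.0, 0.0], [5.0, 5.0]])   # K = 2 initialized prototypes
W, j = wta_step(W, np.array([1.0, 0.0]), 0.1)   # winner is prototype 0
```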
Problems
• Winner Takes All: sensitive to initialization
• "winner takes most": update according to "rank" (e.g. Neural Gas); less sensitive to initialization?
(L)VQ algorithms
• intuitive
• fast, powerful algorithms
• flexible
• limited theoretical background w.r.t. convergence speed, robustness to initial conditions, etc.
Analysis of VQ Dynamics
• exact mathematical description in very high dimensions
• study of typical learning behavior
Model: two Gaussian clusters of high dimensional data

Random vectors ξ ∈ ℝ^N generated according to the mixture

P(ξ) = Σ_{σ=±1} p_σ P(ξ|σ),   with P(ξ|σ) Gaussian, centered at ℓ B_σ, variance υ_σ

• prior probabilities: p_+, p_- with p_+ + p_- = 1
• cluster centers: B_+, B_- ∈ ℝ^N, separation ℓ
• variances: υ_+, υ_-
• classes: σ = {+1, -1}

The clusters are separable in the projection to the (B_+, B_-) plane, but not on other planes: the data are only separable in 2 dimensions. A simple model, but not trivial.
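A sketch of how data from this model could be generated (my own NumPy code, not from the slides; taking B_+ and B_- as the first two unit vectors is an assumption, any orthonormal pair works):

```python
import numpy as np

def sample_clusters(P, N, ell, p_plus, v_plus, v_minus, rng):
    """Draw P vectors xi in R^N from the two-cluster model: class
    sigma = +/-1 with prior p_sigma, mean ell * B_sigma, and
    isotropic variance v_sigma per component."""
    B_plus = np.zeros(N); B_plus[0] = 1.0     # assumed orthonormal centers
    B_minus = np.zeros(N); B_minus[1] = 1.0
    sigma = rng.choice([1, -1], size=P, p=[p_plus, 1.0 - p_plus])
    means = np.where(sigma[:, None] == 1, ell * B_plus, ell * B_minus)
    var = np.where(sigma == 1, v_plus, v_minus)[:, None]
    xi = means + np.sqrt(var) * rng.standard_normal((P, N))
    return xi, sigma

rng = np.random.default_rng(0)
xi, sigma = sample_clusters(20000, 50, 1.0, 0.6, 1.5, 1.0, rng)
```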
Online learning

Sequence of independent random data {ξ^μ, σ^μ}, μ = 1, 2, ..., P

Update of prototype vector w_s ∈ ℝ^N:

w_s^μ = w_s^{μ-1} + (η/N) f_s[rank, c_s, σ^μ, ...] (ξ^μ - w_s^{μ-1})

• η: learning rate, step size
• f_s[...]: strength and direction of the update; describes the algorithm used
• rank ∈ {1, ..., K}: "winner" etc.
• c_s = ±1: prototype class; σ^μ: data class
• (ξ^μ - w_s^{μ-1}): moves the prototype towards the current data
1. Define a few characteristic quantities of the system:

R_{sσ}^μ = w_s^μ · B_σ   (projections to cluster centers)
Q_{st}^μ = w_s^μ · w_t^μ   (lengths and overlaps of prototypes)

with s, t ∈ {1, ..., K} and σ = ±1.

2. Derive recursion relations of the quantities for new input data. Inserting the update

w_s^μ = w_s^{μ-1} + (η/N) f_s[...] (ξ^μ - w_s^{μ-1}),

the random vector ξ^μ enters only through its projections h_s^μ = w_s^{μ-1} · ξ^μ and b_σ^μ = B_σ · ξ^μ:

N (R_{sσ}^μ - R_{sσ}^{μ-1}) = η f_s[...] (b_σ^μ - R_{sσ}^{μ-1})
N (Q_{st}^μ - Q_{st}^{μ-1}) = η f_s[...] (h_t^μ - Q_{st}^{μ-1}) + η f_t[...] (h_s^μ - Q_{st}^{μ-1}) + η² f_s[...] f_t[...] (ξ^μ · ξ^μ / N) + O(1/N)

3. Calculate average recursions.
In the thermodynamic limit N → ∞ ...

• the characteristic quantities R_{sσ}^μ, Q_{st}^μ self-average w.r.t. the random sequence of data (fluctuations vanish)
• the projections h_s and b_σ become correlated Gaussian quantities, completely specified in terms of first and second moments:

⟨h_s⟩_σ = ℓ R_{sσ},   ⟨b_τ⟩_σ = ℓ if τ = σ, 0 else
⟨h_s h_t⟩_σ - ⟨h_s⟩_σ ⟨h_t⟩_σ = υ_σ Q_{st}
⟨h_s b_τ⟩_σ - ⟨h_s⟩_σ ⟨b_τ⟩_σ = υ_σ R_{sτ}
⟨b_ρ b_τ⟩_σ - ⟨b_ρ⟩_σ ⟨b_τ⟩_σ = υ_σ δ_{ρτ}

• define continuous learning time t = μ/N; μ: discrete (1, 2, ..., P), t: continuous
4. Derive ordinary differential equations:

dR_{sσ}/dt = η ( ⟨f_s b_σ⟩ - ⟨f_s⟩ R_{sσ} )
dQ_{st}/dt = η ( ⟨f_s h_t⟩ - ⟨f_s⟩ Q_{st} + ⟨f_t h_s⟩ - ⟨f_t⟩ Q_{st} ) + η² Σ_{σ=±1} p_σ υ_σ ⟨f_s f_t⟩_σ

5. Solve for R_{sσ}(t), Q_{st}(t):
• dynamics and asymptotic behavior (t → ∞)
• quantization/generalization error
• sensitivity to initial conditions, learning rates, structure of the data
Results: VQ, 2 prototypes

WTA update, f_s = 1 for the winner (d_s = min_j d_j) and 0 otherwise:

w_s^μ = w_s^{μ-1} + (η/N) δ_{s,winner} (ξ^μ - w_s^{μ-1})

Numerical integration of the ODEs (w_s(0) ≈ 0, p_+ = 0.6, ℓ = 1.0, υ_+ = 1.5, υ_- = 1.0, η = 0.01)

[Figure: characteristic quantities Q_11, Q_22, Q_12 and R_1+, R_2-, R_2+, R_1- vs. t = μ/N; quantization error E(W) vs. t]
Projections of the prototypes on the (B_+, B_-) plane at t = 50

[Figure: R_S+ vs. R_S- for 2 prototypes and for 3 prototypes; cluster centers ℓB_+ (p_+) and ℓB_- (p_-)]

With p_+ > p_-, two prototypes move to the stronger cluster.
Neural Gas: a winner-takes-most algorithm (3 prototypes)

w_s^μ = w_s^{μ-1} + (η/N) (1/C(t)) exp(-rank(s)/λ(t)) (ξ^μ - w_s^{μ-1})

• update strength decreases exponentially with the rank
• λ(t) is large initially (λ_i = 2) and decreased over time (λ_f = 10^-2)
• λ(t) → 0: identical to WTA

[Figure: projections R_S+ vs. R_S- of the prototypes at t = 0 and t = 50; quantization error E(W) vs. t]
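The rank-based update can be sketched as follows (a minimal NumPy sketch, not from the slides; the normalization C(t) and the λ annealing schedule are omitted, eta plays the role of η/N, and the function name is my own):

```python
import numpy as np

def neural_gas_step(W, xi, eta, lam):
    """Winner-takes-most update: every prototype moves towards xi,
    with a strength that decays exponentially in its distance rank."""
    d = ((W - xi) ** 2).sum(axis=1)
    rank = np.argsort(np.argsort(d))   # 0 for the winner, 1 for the next, ...
    f = np.exp(-rank / lam)            # rank-based update strength
    W += eta * f[:, None] * (xi - W)
    return W

# for very small lam, only the winner moves appreciably (WTA limit)
W = np.array([[0.0, 0.0], [5.0, 5.0]])
W = neural_gas_step(W, np.array([1.0, 0.0]), 0.1, 1e-6)
```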
Sensitivity to initialization

[Figure: projections R_S+ vs. R_S- at t = 0 and at t = 50, for Neural Gas and for WTA; E(W) vs. t]

Neural Gas:
• more robust w.r.t. initialization

WTA:
• (eventually) reaches the minimum of E(W)
• depends on initialization: possibly large learning time, with a "plateau" in E(W) where ∇H_VQ ≈ 0
Learning Vector Quantization (LVQ)

Objective: classification of data using labeled prototype vectors {w_s, c_s}, w_s ∈ ℝ^N

Assign data {ξ, σ}, ξ ∈ ℝ^N, to the nearest prototype vector (distance measure, e.g. Euclidean).

Find the optimal set W for the lowest generalization error, counting data misclassified by the nearest prototype w_j:

g(ξ, σ; W) = 1 if c_j ≠ σ, 0 else
LVQ1: update the winner towards the data if the classes agree (c_s = σ^μ), away from the data otherwise:

w_s^μ = w_s^{μ-1} + (η/N) δ_{s,winner} c_s σ^μ (ξ^μ - w_s^{μ-1})

• no cost function related to the generalization error

[Figure: projections R_S+ vs. R_S- for two prototypes, c = {+1, -1}, and for three prototypes, c = {+1, +1, -1} or c = {+1, -1, -1}]

To which class should the 3rd prototype be added?
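The LVQ1 update can be sketched as follows (a minimal NumPy sketch, not from the slides; eta plays the role of η/N and the function name is my own):

```python
import numpy as np

def lvq1_step(W, c, xi, sigma, eta):
    """LVQ1: move the winning prototype towards the example if their
    classes match (c_j == sigma), away from it otherwise."""
    d = ((W - xi) ** 2).sum(axis=1)
    j = int(np.argmin(d))                     # the winner
    W[j] += eta * c[j] * sigma * (xi - W[j])  # c_j * sigma = +1 or -1
    return W, j

W = np.array([[0.0, 0.0], [5.0, 0.0]])
c = np.array([1, -1])
W, j = lvq1_step(W, c, np.array([1.0, 0.0]), 1, 0.1)    # attraction
W, j = lvq1_step(W, c, np.array([0.2, 0.0]), -1, 0.1)   # repulsion
```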
Generalization error: the probability of misclassified data,

ε_g = Σ_{σ=±1} p_σ ∫ dξ P(ξ|σ) g(ξ, σ; W)

[Figure: ε_g vs. t for p_+ = 0.6, p_- = 0.4, υ_+ = 1.5, υ_- = 1.0]
Optimal decision boundary: the (hyper)plane where

p_+ P(ξ|σ=+1) = p_- P(ξ|σ=-1)

• equal variances (υ_+ = υ_-): linear decision boundary
• unequal variances (υ_+ > υ_-): curved decision boundary; K = 2 cannot reach it, optimal with K = 3
• more prototypes: better approximation to the optimal decision boundary

[Figure: clusters on the (B_+, B_-) plane with p_+ > p_-, separation ℓ, and the decision boundary d]
Asymptotic ε_g(t → ∞) as a function of p_+, for υ_+ > υ_- (υ_+ = 0.81, υ_- = 0.25):

c = {+1, +1, -1}:
• Optimal: K = 3 better
• LVQ1: K = 3 better
• best: more prototypes on the class with the larger variance

c = {+1, -1, -1}:
• Optimal: K = 3 equal to K = 2
• LVQ1: K = 3 worse
• more prototypes are not always better for LVQ1

[Figure: ε_g(t → ∞) vs. p_+ for both label configurations]
Summary
dynamics of (Learning) Vector Quantization for high dimensional data
Neural Gas: more robust w.r.t. initialization than WTA
LVQ1: more prototypes are not always better
Outlook
• study different algorithms, e.g. LVQ+/-, LFM, RSLVQ
• more complex models
• multi-prototype, multi-class problems
Reference
M. Biehl, A. Ghosh, and B. Hammer. Dynamics and Generalization Ability of LVQ Algorithms. Journal of Machine Learning Research 8: 323-360 (2007). http://jmlr.csail.mit.edu/papers/v8/biehl07a.html
Questions?
Central Limit Theorem

• Let x_1, x_2, ..., x_N be independent random numbers drawn from an arbitrary probability distribution with finite mean and variance.
• The distribution of the average (1/N) Σ_{j=1}^N x_j approaches a normal distribution as N becomes large.

Example: non-normal distribution p(x_j).

[Figure: p(x_j) and the distribution of the average of the x_j for N = 1, 2, 5, 50]
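This can be illustrated numerically (my own sketch, not from the slides; the uniform distribution stands in for the non-normal example, and its averages have standard deviation sqrt(1/12)/sqrt(N)):

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (1, 2, 5, 50):
    # 100000 averages of N uniform samples; their spread shrinks
    # like 1/sqrt(N) as the CLT takes hold
    avgs = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
    print(N, avgs.std())
```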
Self Averaging
Monte Carlo simulations over 100 independent runs: the fluctuations decrease with a larger number of degrees of freedom N.
As N → ∞, the fluctuations vanish (the variance becomes zero).
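A toy illustration of self-averaging, under assumptions of my own (a single prototype, isotropic standard-normal data, and an unconditional update f_s ≡ 1, rather than the slides' model): the run-to-run variance of Q = w·w at fixed learning time shrinks as N grows.

```python
import numpy as np

def run_Q(N, t_max, eta, rng):
    """One run of a plain online update for a single prototype in R^N;
    returns Q = w.w at learning time t_max, i.e. after t_max * N steps."""
    w = np.zeros(N)
    for _ in range(int(t_max * N)):
        xi = rng.standard_normal(N)     # toy isotropic data
        w += (eta / N) * (xi - w)
    return float(w @ w)

rng = np.random.default_rng(1)
variances = {}
for N in (10, 100, 1000):
    Q = [run_Q(N, 1.0, 0.5, rng) for _ in range(100)]
    variances[N] = float(np.var(Q))
print(variances)   # run-to-run fluctuations of Q shrink with N
```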
"LVQ+/-": update the correct and incorrect winners,

w_j^μ = w_j^{μ-1} ± (η/N) (ξ^μ - w_j^{μ-1}),   j ∈ {s, t}

where
• d_s = min{d_k} among prototypes with c_s = σ^μ: the correct winner, moved towards the data (+)
• d_t = min{d_k} among prototypes with c_t ≠ σ^μ: the incorrect winner, moved away (-)

Strongly divergent! For p_+ >> p_-: strong repulsion by the stronger class.

To overcome the divergence: e.g. early stopping at ε_g(t) = ε_g,min (difficult in practice).

[Figure: ε_g(t) vs. t]
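The two-winner update can be sketched as follows (a minimal NumPy sketch, not from the slides; eta plays the role of η/N, the function name is my own, and both a correct and an incorrect prototype are assumed to exist):

```python
import numpy as np

def lvq_pm_step(W, c, xi, sigma, eta):
    """LVQ+/-: move the closest correct prototype towards the data
    and the closest incorrect prototype away from it."""
    d = ((W - xi) ** 2).sum(axis=1)
    correct = np.where(c == sigma)[0]
    wrong = np.where(c != sigma)[0]
    s = int(correct[np.argmin(d[correct])])   # correct winner
    t = int(wrong[np.argmin(d[wrong])])       # incorrect winner
    W[s] += eta * (xi - W[s])                 # attraction
    W[t] -= eta * (xi - W[t])                 # repulsion (divergent term)
    return W, s, t

W = np.array([[0.0, 0.0], [5.0, 0.0]])
c = np.array([1, -1])
W, s, t = lvq_pm_step(W, c, np.array([1.0, 0.0]), 1, 0.1)
```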
Comparison of LVQ1 and LVQ+/- (c = {+1, +1, -1})

• υ_+ = υ_- = 1.0: LVQ1 outperforms LVQ+/- with early stopping
• υ_+ = 0.81, υ_- = 0.25: LVQ+/- with early stopping outperforms LVQ1 in a certain p_+ interval
• LVQ+/- performance depends on the initial conditions

[Figure: asymptotic ε_g vs. p_+ for both settings]