Eric Xing 1
Machine Learning
10-701/15-781, Spring 2008
Factor Analysis and Metric Learning
Eric Xing
Lecture 26, April 23, 2008
Reading: Chap. 12.3-4, C. Bishop's book
Eric Xing 2
Outline
Probabilistic PCA (brief)
Factor Analysis (in some detail)
ICA (will skip)
Distance metric learning from very little side info (a very cool method)
Eric Xing 3
Recap of PCA
Popular dimensionality reduction technique: project data onto directions of greatest variation.
Consequence: the projected coordinates x_i are uncorrelated, so their covariance matrix is diagonal.
Truncation error comes from keeping only the top q eigen-directions.
The first principal direction maximizes the projected variance:
$$u_1 = \arg\max_{u}\; u^T \mathrm{Cov}(y)\, u \;=\; \arg\max_{u}\; \frac{1}{m}\sum_{i=1}^{m}\left(u^T y_i\right)^2 \;=\; \arg\max_{u}\; u^T\!\left(\frac{1}{m}\sum_{i=1}^{m} y_i y_i^T\right) u$$
Projection onto the top q eigenvectors, and the corresponding reconstruction:
$$x_i = \begin{bmatrix} u_1^T y_i \\ u_2^T y_i \\ \vdots \\ u_q^T y_i \end{bmatrix} = U_q^T y_i \in \mathbb{R}^q, \qquad y_i \approx U_q x_i$$
The sample covariance is expanded in its eigenvectors and truncated to the top q terms:
$$\Sigma_y = \frac{1}{m}\sum_{i=1}^{m} y_i y_i^T = \sum_{k=1}^{K}\gamma_k u_k u_k^T \;\approx\; \sum_{k=1}^{q}\gamma_k u_k u_k^T,
\qquad \mathrm{Cov}(x) = \begin{bmatrix}\gamma_1 & & \\ & \ddots & \\ & & \gamma_q\end{bmatrix}$$
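As a concrete illustration of these formulas, here is a minimal numpy sketch (the names `Y`, `q`, and `pca` are illustrative, not from the lecture):

```python
import numpy as np

def pca(Y, q):
    """Project the rows of Y (an m x d data matrix) onto the top-q principal directions."""
    mean = Y.mean(axis=0)
    Yc = Y - mean                              # PCA needs centering
    S = Yc.T @ Yc / Y.shape[0]                 # sample covariance (1/m) sum_i y_i y_i^T
    gamma, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    idx = np.argsort(gamma)[::-1][:q]          # keep the q directions of greatest variation
    U_q, gamma_q = U[:, idx], gamma[idx]
    X = Yc @ U_q                               # rows are x_i = U_q^T y_i
    Y_recon = X @ U_q.T + mean                 # y_i ~ U_q x_i (truncated reconstruction)
    return X, U_q, gamma_q, Y_recon
```

The covariance of the projected coordinates X is (approximately) diag(γ₁, …, γ_q), i.e., the components are uncorrelated.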
Eric Xing 4
Recap of PCA
Popular dimensionality reduction technique: project data onto directions of greatest variation.
Useful tool for visualising patterns and clusters within the data set, but …
Needs centering.
Does not explicitly model data noise.
Eric Xing 5
Probabilistic Interpretation?
[Figure: two graphical models over a continuous X and a continuous Y; the first is labeled "regression", the second is labeled "?".]
Eric Xing 6
Probabilistic PCA
PCA can be cast as a probabilistic model with q-dimensional latent variables:
$$y_n = \Lambda x_n + \mu + \epsilon_n, \qquad x_n \sim \mathcal{N}(0, I), \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2 I)$$
The resulting data distribution is
$$y_n \sim \mathcal{N}\!\left(\mu,\; \Lambda\Lambda^T + \sigma^2 I\right)$$
The maximum likelihood solution is equivalent to PCA:
$$\mu_{ML} = \frac{1}{N}\sum_n y_n, \qquad \Lambda_{ML} = U_q\left(\Gamma_q - \sigma^2 I\right)^{1/2}$$
The diagonal Γ_q contains the top q sample covariance eigenvalues and U_q contains the associated eigenvectors.
Tipping and Bishop, J. Royal Stat. Soc. B 61, 611 (1999).
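A hedged sketch of this closed-form solution; estimating σ² as the average of the discarded eigenvalues follows Tipping and Bishop and is an assumption, since the slide does not show it:

```python
import numpy as np

def ppca_ml(Y, q):
    """Maximum-likelihood probabilistic PCA: returns mu, Lambda and sigma^2."""
    mu = Y.mean(axis=0)                            # mu_ML = (1/N) sum_n y_n
    S = np.cov(Y, rowvar=False, bias=True)         # sample covariance
    gamma, U = np.linalg.eigh(S)
    order = np.argsort(gamma)[::-1]
    gamma, U = gamma[order], U[:, order]
    d = Y.shape[1]
    sigma2 = gamma[q:].mean() if q < d else 0.0    # average of discarded eigenvalues (Tipping & Bishop)
    Lam = U[:, :q] @ np.diag(np.sqrt(gamma[:q] - sigma2))   # Lambda_ML = U_q (Gamma_q - sigma^2 I)^{1/2}
    return mu, Lam, sigma2
```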
Eric Xing 7
Factor analysis
An unsupervised linear regression model:
$$p(x) = \mathcal{N}(x;\, 0, I), \qquad p(y \mid x) = \mathcal{N}(y;\, \mu + \Lambda x,\, \Psi)$$
where Λ is called the factor loading matrix, and Ψ is diagonal.
Geometric interpretation: to generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.
[Graphical model: latent x → observed y]
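To make the generative view concrete, here is a small sketch that samples from p(x)p(y|x); the dimensions and parameter values below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, n = 2, 5, 5000                         # latent dim, observed dim, sample size (illustrative)
Lam = rng.normal(size=(p, k))                # factor loading matrix Lambda
mu = rng.normal(size=p)                      # mean
Psi = np.diag(rng.uniform(0.1, 0.5, p))      # diagonal noise covariance

X = rng.normal(size=(n, k))                  # x_n ~ N(0, I): points on the latent manifold
E = rng.multivariate_normal(np.zeros(p), Psi, size=n)
Y = mu + X @ Lam.T + E                       # y_n = mu + Lambda x_n + eps_n

# The empirical covariance of Y approaches Lambda Lambda^T + Psi (the FA marginal covariance).
print(np.round(np.cov(Y, rowvar=False) - (Lam @ Lam.T + Psi), 2))
```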
Eric Xing 8
Relationship between PCA and FA
Probabilistic PCA is equivalent to factor analysis with equal noise for every dimension, i.e., isotropic Gaussian noise ε_n ~ N(0, σ²I).
In factor analysis the noise has a general diagonal covariance matrix: ε_n ~ N(0, Ψ), with Ψ diagonal.
An iterative algorithm (e.g., EM) is required to find the parameters if the precisions are not known in advance.
Eric Xing 10
Marginal data distribution
A marginal Gaussian (e.g., p(x)) times a conditional Gaussian (e.g., p(y|x)) is a joint Gaussian. Any marginal (e.g., p(y)) of a joint Gaussian (e.g., p(x, y)) is also a Gaussian.
Since the marginal is Gaussian, we can determine it by just computing its mean and variance. (Assume noise uncorrelated with data.)
$$E[Y] = E[\mu + \Lambda X + W] = \mu + \Lambda E[X] + E[W] = \mu, \qquad w \sim \mathcal{N}(0, \Psi)$$
$$\begin{aligned}
\mathrm{Var}[Y] &= E\!\left[(Y-\mu)(Y-\mu)^T\right] = E\!\left[(\mu + \Lambda X + W - \mu)(\mu + \Lambda X + W - \mu)^T\right] \\
&= E\!\left[(\Lambda X + W)(\Lambda X + W)^T\right] = \Lambda\, E[XX^T]\,\Lambda^T + E[WW^T] \\
&= \Lambda\Lambda^T + \Psi
\end{aligned}$$
Eric Xing 11
FA = Constrained-Covariance Gaussian
Marginal density for factor analysis (y is p-dimensional, x is k-dimensional):
$$p(y \mid \theta) = \mathcal{N}(y;\, \mu,\, \Lambda\Lambda^T + \Psi)$$
So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix.
In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal, then we could model any Gaussian and the factor structure would be pointless.)
Eric Xing 12
Review: a primer on the multivariate Gaussian
Multivariate Gaussian density:
A joint Gaussian:
How to write down p(x1), p(x1|x2) or p(x2|x1) using the block elements in µ and Σ?
Formulas to remember:
Multivariate Gaussian density:
$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}
\exp\!\left\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$$
A joint Gaussian:
$$p\!\left(\begin{bmatrix}x_1\\ x_2\end{bmatrix} \,\middle|\, \mu, \Sigma\right)
= \mathcal{N}\!\left(\begin{bmatrix}x_1\\ x_2\end{bmatrix};\;
\begin{bmatrix}\mu_1\\ \mu_2\end{bmatrix},\;
\begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}\right)$$
Conditional:
$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}), \qquad
m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad
V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
Marginal:
$$p(x_2) = \mathcal{N}(x_2 \mid m_2, V_2), \qquad m_2 = \mu_2, \qquad V_2 = \Sigma_{22}$$
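These formulas translate directly into code; a minimal sketch (the function name and index arguments are illustrative, not from the lecture):

```python
import numpy as np

def gaussian_conditional(mu, Sigma, x2, idx1, idx2):
    """Mean and covariance of p(x1 | x2) for a joint Gaussian N(mu, Sigma)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)          # Sigma_12 Sigma_22^{-1}
    m_cond = mu1 + K @ (x2 - mu2)         # m_{1|2}
    V_cond = S11 - K @ S12.T              # V_{1|2} (Sigma_21 = Sigma_12^T by symmetry)
    return m_cond, V_cond
```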
Eric Xing 13
Review: some matrix algebra

Trace: definition and cyclical permutations:
$$\mathrm{tr}[A] \;\stackrel{\mathrm{def}}{=}\; \sum_i a_{ii}, \qquad
\mathrm{tr}[ABC] = \mathrm{tr}[CAB] = \mathrm{tr}[BCA]$$
Derivatives:
$$\frac{\partial\,\mathrm{tr}[BA]}{\partial A} = B^T, \qquad
\frac{\partial\,\mathrm{tr}[x^T A x]}{\partial A} = \frac{\partial\,\mathrm{tr}[x x^T A]}{\partial A} = x x^T$$
Determinants and derivatives:
$$\frac{\partial \log|A|}{\partial A} = A^{-T}$$
Eric Xing 14
FA joint distribution
Model:
$$p(x) = \mathcal{N}(x;\, 0, I), \qquad p(y \mid x) = \mathcal{N}(y;\, \mu + \Lambda x,\, \Psi)$$
Covariance between x and y (assume the noise W is uncorrelated with the data and the latent variables):
$$\mathrm{Cov}[X, Y] = E\!\left[(X - 0)(Y - \mu)^T\right]
= E\!\left[X(\mu + \Lambda X + W - \mu)^T\right]
= E[XX^T]\,\Lambda^T = \Lambda^T$$
Hence the joint distribution of x and y:
$$p\!\left(\begin{bmatrix}x\\ y\end{bmatrix}\right)
= \mathcal{N}\!\left(\begin{bmatrix}x\\ y\end{bmatrix};\;
\begin{bmatrix}0\\ \mu\end{bmatrix},\;
\begin{bmatrix}I & \Lambda^T\\ \Lambda & \Lambda\Lambda^T + \Psi\end{bmatrix}\right)$$
Eric Xing 15
Inference in Factor Analysis
Apply the Gaussian conditioning formulas to the joint distribution we derived above, where
$$\Sigma_{11} = I, \qquad \Sigma_{12} = \Sigma_{21}^T = \Lambda^T, \qquad \Sigma_{22} = \Lambda\Lambda^T + \Psi$$
We can now derive the posterior of the latent variable x given observation y, $p(x \mid y) = \mathcal{N}(x \mid m_{1|2}, V_{1|2})$, where
$$m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2) = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(y - \mu)$$
$$V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda$$
Applying the matrix inversion lemma
$$\left(E - FH^{-1}G\right)^{-1} = E^{-1} + E^{-1}F\left(H - GE^{-1}F\right)^{-1}GE^{-1}$$
gives
$$V_{1|2} = \left(I + \Lambda^T\Psi^{-1}\Lambda\right)^{-1}, \qquad m_{1|2} = V_{1|2}\Lambda^T\Psi^{-1}(y - \mu)$$
Here we only need to invert a matrix of size |x|×|x|, instead of |y|×|y|.
Eric Xing 16
Geometric interpretation: inference is linear projection
The posterior is
$$p(x \mid y) = \mathcal{N}(x;\, m_{1|2}, V_{1|2}), \qquad
V_{1|2} = \left(I + \Lambda^T\Psi^{-1}\Lambda\right)^{-1}, \qquad
m_{1|2} = V_{1|2}\Lambda^T\Psi^{-1}(y - \mu)$$
The posterior covariance does not depend on the observed data y!
Computing the posterior mean is just a linear operation.
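A minimal sketch of this inference step, using the inversion-lemma form so that only an |x|×|x| matrix is inverted (the function and variable names are assumptions, not from the lecture):

```python
import numpy as np

def fa_posterior(y, mu, Lam, Psi_diag):
    """Posterior p(x | y) = N(m, V) for factor analysis with diagonal noise Psi."""
    Psi_inv = 1.0 / Psi_diag                             # Psi^{-1} (diagonal, stored as a vector)
    A = Lam.T * Psi_inv                                  # Lambda^T Psi^{-1}, shape (k, p)
    V = np.linalg.inv(np.eye(Lam.shape[1]) + A @ Lam)    # V = (I + Lambda^T Psi^{-1} Lambda)^{-1}, k x k
    m = V @ (A @ (y - mu))                               # m = V Lambda^T Psi^{-1} (y - mu)
    return m, V
```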
Eric Xing 17
EM for Factor Analysis
Incomplete data log likelihood function (marginal density of y):
$$\ell(\theta; D) = -\frac{N}{2}\log\left|\Lambda\Lambda^T + \Psi\right|
- \frac{1}{2}\sum_n (y_n - \mu)^T(\Lambda\Lambda^T + \Psi)^{-1}(y_n - \mu)
= -\frac{N}{2}\log\left|\Lambda\Lambda^T + \Psi\right|
- \frac{N}{2}\mathrm{tr}\!\left[(\Lambda\Lambda^T + \Psi)^{-1}S\right],$$
$$\text{where } S = \frac{1}{N}\sum_n (y_n - \mu)(y_n - \mu)^T$$
Estimating μ is trivial: $\hat{\mu}_{ML} = \frac{1}{N}\sum_n y_n$. The parameters Λ and Ψ, however, are coupled nonlinearly in the log-likelihood.
Complete log likelihood (taking the data to be centered so that μ = 0):
$$\begin{aligned}
\ell_c(\theta; D) &= \sum_n \log p(x_n, y_n) = \sum_n \log p(x_n) + \sum_n \log p(y_n \mid x_n)\\
&= -\frac{N}{2}\log|I| - \frac{1}{2}\sum_n x_n^T x_n
- \frac{N}{2}\log|\Psi| - \frac{1}{2}\sum_n (y_n - \Lambda x_n)^T\Psi^{-1}(y_n - \Lambda x_n)\\
&= -\frac{1}{2}\sum_n x_n^T x_n - \frac{N}{2}\log|\Psi| - \frac{N}{2}\mathrm{tr}\!\left[S\Psi^{-1}\right],
\qquad S = \frac{1}{N}\sum_n (y_n - \Lambda x_n)(y_n - \Lambda x_n)^T
\end{aligned}$$
Eric Xing 18
E-step for Factor Analysis
Compute the expected complete log likelihood $\langle \ell_c(\theta; D)\rangle_{p(x|y)}$:
$$\langle \ell_c(\theta; D)\rangle = -\frac{1}{2}\sum_n\mathrm{tr}\!\left[\langle X_n X_n^T\rangle\right]
- \frac{N}{2}\log|\Psi| - \frac{N}{2}\mathrm{tr}\!\left[S\Psi^{-1}\right],$$
$$S = \frac{1}{N}\sum_n\left(y_n y_n^T - y_n\langle X_n\rangle^T\Lambda^T - \Lambda\langle X_n\rangle y_n^T + \Lambda\langle X_n X_n^T\rangle\Lambda^T\right)$$
where
$$\langle X_n\rangle = E[X_n \mid y_n], \qquad \langle X_n X_n^T\rangle = \mathrm{Var}[X_n \mid y_n] + E[X_n \mid y_n]\,E[X_n \mid y_n]^T$$
Recall that we have derived
$$V_{1|2} = \left(I + \Lambda^T\Psi^{-1}\Lambda\right)^{-1}, \qquad m_{x_n|y_n} = V_{1|2}\Lambda^T\Psi^{-1}(y_n - \mu)$$
$$\Rightarrow\quad \langle X_n\rangle = m_{x_n|y_n}, \qquad \langle X_n X_n^T\rangle = V_{1|2} + m_{x_n|y_n}m_{x_n|y_n}^T$$
Eric Xing 19
M-step for Factor Analysis
Take the derivatives of the expected complete log likelihood w.r.t. the parameters, using the trace and determinant derivative rules:
$$\frac{\partial\langle\ell_c\rangle}{\partial\Psi^{-1}}
= \frac{\partial}{\partial\Psi^{-1}}\!\left(\frac{N}{2}\log\left|\Psi^{-1}\right| - \frac{N}{2}\mathrm{tr}\!\left[S\Psi^{-1}\right]\right)
= \frac{N}{2}\Psi - \frac{N}{2}S
\qquad\Rightarrow\qquad \Psi^{t+1} = S$$
$$\frac{\partial\langle\ell_c\rangle}{\partial\Lambda}
= \Psi^{-1}\sum_n y_n\langle X_n\rangle^T - \Psi^{-1}\Lambda\sum_n\langle X_n X_n^T\rangle
\qquad\Rightarrow\qquad
\Lambda^{t+1} = \left(\sum_n y_n\langle X_n\rangle^T\right)\left(\sum_n\langle X_n X_n^T\rangle\right)^{-1}$$
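Putting the E-step and M-step together, here is a compact EM sketch for factor analysis (a rough illustration assuming centered data; the function and variable names are not from the lecture):

```python
import numpy as np

def fa_em(Y, k, n_iter=100, seed=0):
    """EM for factor analysis on centered data Y (N x p). Returns Lambda and the diagonal of Psi."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    Lam = rng.normal(size=(p, k))
    Psi = np.ones(p)                                   # diagonal of Psi
    for _ in range(n_iter):
        # E-step: posterior moments <X_n> and sum_n <X_n X_n^T>
        A = Lam.T / Psi                                # Lambda^T Psi^{-1}
        V = np.linalg.inv(np.eye(k) + A @ Lam)         # (I + Lambda^T Psi^{-1} Lambda)^{-1}
        M = Y @ A.T @ V.T                              # rows are <X_n>^T = (V Lambda^T Psi^{-1} y_n)^T
        Exx = N * V + M.T @ M                          # sum_n <X_n X_n^T>
        # M-step
        YX = Y.T @ M                                   # sum_n y_n <X_n>^T
        Lam = YX @ np.linalg.inv(Exx)                  # Lambda^{t+1}
        Psi = (Y * Y).sum(axis=0) / N - (Lam * (YX / N)).sum(axis=1)   # Psi^{t+1} = diag(S) at the new Lambda
    return Lam, Psi
```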
Eric Xing 20
Comparison of PCA and FA

FA:
$$y_n = \Lambda x_n + \mu + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \Psi)$$
$$\langle X_n\rangle = m_{x_n|y_n} = V_{1|2}\Lambda^T\Psi^{-1}(y_n - \mu), \qquad
\langle X_n X_n^T\rangle = V_{1|2} + m_{x_n|y_n}m_{x_n|y_n}^T$$
$$\Lambda^{t+1} = \left(\sum_n y_n\langle X_n\rangle^T\right)\left(\sum_n \langle X_n X_n^T\rangle\right)^{-1}$$

PCA:
$$u_1 = \arg\max_u\; u^T\mathrm{Cov}(y)\,u, \qquad
x_i = \begin{bmatrix}u_1^T y_i\\ \vdots\\ u_q^T y_i\end{bmatrix} = U_q^T y_i \in \mathbb{R}^q, \qquad
y_n = U x_n$$
Eric Xing 21
Comparison of PCA and FA

FA:
Inverts a q×q matrix.
Covariant under rescaling: diag(α)y.
Neither of the factors found by a two-factor model is necessarily the same as that found by a single-factor model, and …
$$\Lambda^{t+1} = \left(\sum_n y_n\langle X_n\rangle^T\right)\left(\sum_n \langle X_n X_n^T\rangle\right)^{-1}, \qquad \Psi^{t+1} = S$$

PCA:
SVD on a K×K matrix.
Covariant under rotation: Ay.
Principal axes can be found incrementally.
$$u_1 = \arg\max_u\; u^T\mathrm{Cov}(y)\,u, \qquad x_i = U_q^T y_i$$
Eric Xing 22
Example: from the original data matrix to the correlation matrix to the factor matrix.
Observational data: observations o1 … on on variables v1, v2, …, v10.
Correlation coefficients: between variables v1, v2, …, v10.
Factor loadings on factors F1 (Spd), F2 (Str), F3 (End):

Event      Spd   Str   End
100m       .87   .07   .14
Long       .77   .34   .13
High       .51   .27   .34
110m       .63   .32   .05
400m       .74   .06   .38
Discus     .22   .79   .06
Shot       .31   .82   .10
Javelin    .02   .70   .15
Pole       .24   .40   .50
1500m      .02   .12   .89

Decisions: factoring method, number of factors to retain, factor rotation.
Eric Xing 23
Decathlon example
[Figure: the ten events (100 Meters, 400 Meters, Javelin, High Jump, 100m Hurdles, Shot Put, Long Jump, Discus, Pole Vault, 1500 Meters) grouped under three latent factors: "SPEED", "STRENGTH", "ENDURANCE".]
Eric Xing 24
Model Invariance and Identifiability
There is degeneracy in the FA model. Since Λ only appears as the outer product ΛΛ^T, the model is invariant to rotations and axis flips of the latent space. We can replace Λ with ΛQ for any orthonormal matrix Q and the model remains the same: (ΛQ)(ΛQ)^T = Λ(QQ^T)Λ^T = ΛΛ^T. This means that there is no single "best" setting of the parameters: an infinite number of parameter settings all achieve the maximum-likelihood score! Such models are called unidentifiable, since two people fitting ML parameters to identical data are not guaranteed to identify the same parameters.
Eric Xing 25
Why FA
Latent trajectories

[Figure: three chain-structured graphical models x1 → x2 → x3 → … → xN with emissions y1, y2, y3, …, yN, alongside their static single-time-slice counterparts X → Y.]

HMM (for discrete sequential data, e.g., text): discrete X, discrete Y; static counterpart: mixture model, e.g., mixture of multinomials.
HMM (for continuous sequential data, e.g., speech signal): discrete X, continuous Y; static counterpart: mixture model, e.g., mixture of Gaussians.
State space model: continuous X, continuous Y; static counterpart: factor analysis.
Eric Xing 26
Independent Components Analysis (ICA)
ICA is similar to FA, except that it assumes the latent sources have non-Gaussian densities. Hence ICA can extract higher-order moments (not just second order). It is commonly used to solve blind source separation (the cocktail party problem).
[Figure: graphical models of FA (a single latent X generating the observations Y) versus ICA (latent sources X1, X2, … generating observations Y1, …, YK).]
Eric Xing 27
The simple "Cocktail Party" Problem
[Figure: sources s1 and s2 are combined by the mixing matrix A into observations x1 and x2.]
$$x = As, \qquad n \text{ sources},\; m = n \text{ observations}$$
We skip more details and next introduce a more interesting new algorithm!
Eric Xing 28
ICA versus PCA (and FA)
Similarity: feature extraction; dimension reduction.
Difference: PCA uses up to second-order moments of the data to produce uncorrelated components; ICA strives to generate components that are as independent as possible.
Eric Xing 29
Semi-supervised Metric Learning
Xing et al, NIPS 2003
Eric Xing 30
What is a good metric?
What is a good metric over the input space for learning and data-mining?
How to convey metrics sensible to a human user (e.g., dividing traffic along highway lanes rather than between overpasses, categorizing documents according to writing style rather than topic) to a computer data-miner using a systematic mechanism?
Eric Xing 31
Issues in learning a metric
The data distribution is self-informing (e.g., it lies in a sub-manifold).
Learn a metric by finding an embedding of the data in some space. Con: does not reflect (changing) human subjectiveness.
An explicitly labeled dataset offers clues about critical features.
Supervised learning. Con: needs sizable homogeneous training sets.
What about side information? (E.g., x and y look (or read) similar ...)
Providing a small amount of qualitative, less structured side information is often much easier than stating a metric explicitly (what should the metric for writing style be?) or labeling a large set of training data.
Can we learn a distance metric more informative than Euclidean distance using a small amount of side information?
Eric Xing 32
Distance Metric Learning
Eric Xing 33
Optimal Distance Metric
Learning an optimal distance metric with respect to the side-information leads to the following optimization problem.
This optimization problem is convex, and local-minima-free algorithms exist. Xing et al. (2003) provided an efficient gradient descent + iterative constraint-projection method.
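The optimization problem itself did not survive extraction; as recalled from Xing et al. (2003), it learns a positive semidefinite Mahalanobis matrix A that pulls similar pairs together while keeping dissimilar pairs apart. The sketch below is a simplified penalty-style stand-in for that convex program, not the paper's exact algorithm; all names are illustrative:

```python
import numpy as np

def learn_metric(X, similar_pairs, dissimilar_pairs, n_iter=200, lr=0.01):
    """Learn a Mahalanobis matrix A (PSD) from similarity/dissimilarity pair constraints."""
    d = X.shape[1]
    A = np.eye(d)
    for _ in range(n_iter):
        grad = np.zeros((d, d))
        for i, j in similar_pairs:                    # squared distances under A (to be minimized)
            diff = (X[i] - X[j])[:, None]
            grad += diff @ diff.T
        for i, j in dissimilar_pairs:                 # encourage large distances between dissimilar pairs
            diff = (X[i] - X[j])[:, None]
            dist = np.sqrt((diff.T @ A @ diff).item()) + 1e-12
            grad -= (diff @ diff.T) / (2 * dist)      # gradient of sqrt((x_i - x_j)^T A (x_i - x_j))
        A -= lr * grad                                # gradient step
        w, V = np.linalg.eigh(A)                      # project A back onto the PSD cone
        A = V @ np.diag(np.clip(w, 0, None)) @ V.T
    return A
```

The learned A defines the metric d(x, y)² = (x − y)^T A (x − y), which can then be plugged into, e.g., k-means.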
Eric Xing 34
Examples of learned distance metrics
Distance metrics learned on three-cluster artificial data:
Eric Xing 35
Application to Clustering
Artificial Data I: a difficult two-class dataset
Eric Xing 36
Application to Clustering
Artificial Data II: two-class data with a strong irrelevant feature
Eric Xing 37
Application to Clustering
9 datasets from the UC Irvine repository
Eric Xing 38
Accuracy vs. amount of side-information
Two typical examples of how the quality of the clusters found increases with the amount of side-information.
Eric Xing 39
Take home message
Distance metric learning is an important problem in machine learning and data mining.
A good distance metric can be learned from a small amount of side-information, in the form of similarity and dissimilarity constraints on the data, by solving a convex optimization problem.
The learned distance metric can identify the most significant direction(s) in feature space that separate the data well, effectively performing implicit feature selection.
The learned distance metric can be used to improve clustering performance.
Eric Xing 40
Additional Details: Independent Components Analysis (ICA)
ICA is similar to FA, except that it assumes the latent sources have non-Gaussian densities. Hence ICA can extract higher-order moments (not just second order). It is commonly used to solve blind source separation (the cocktail party problem).
[Figure: graphical models of FA (a single latent X generating the observations Y) versus ICA (latent sources X1, X2, … generating observations Y1, …, YK).]
Eric Xing 41
The simple "Cocktail Party" Problem
[Figure: sources s1 and s2 are combined by the mixing matrix A into observations x1 and x2.]
$$x = As, \qquad n \text{ sources},\; m = n \text{ observations}$$
Eric Xing 42
Motivation
Two Independent Sources
Mixture at two Mics
The coefficients a_ij depend on the distances of the microphones from the speakers.
$$x_1(t) = a_{11}s_1(t) + a_{12}s_2(t)$$
$$x_2(t) = a_{21}s_1(t) + a_{22}s_2(t)$$
Eric Xing 43
Motivation
Get the Independent Signals out of the Mixture
Eric Xing 44
Blind Source Separation
Suppose that there are k unknown independent sources
$$s(t) = \left[s_1(t), \ldots, s_k(t)\right]^T, \qquad \text{with } E\!\left[s_i^2(t)\right] = 1$$
A data vector x(t) is observed at each time point t, such that
$$x(t) = A\, s(t)$$
where A is an n × k full-rank scalar matrix.
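A tiny simulation of this mixing model (the sources and the mixing matrix below are invented purely for illustration):

```python
import numpy as np

t = np.linspace(0, 10, 2000)
S = np.vstack([np.sign(np.sin(3 * t)),      # source 1: square-ish wave (non-Gaussian)
               np.sin(7 * t)])              # source 2: sinusoid (non-Gaussian)
A = np.array([[1.0, 0.6],                   # mixing matrix; entries depend on mic/speaker geometry
              [0.4, 1.0]])
X = A @ S                                   # observed mixtures x(t) = A s(t)
```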
Eric Xing 45
Blind source separation
[Figure: the blind sources (independent components) pass through the mixing process A to produce the observed sequences; the de-mixing process W recovers the independent components.]
Eric Xing 46
ICA versus PCA (and FA)
Similarity: feature extraction; dimension reduction.
Difference: PCA uses up to second-order moments of the data to produce uncorrelated components; ICA strives to generate components that are as independent as possible.
Eric Xing 47
Problem formulation
The goal of ICA is to find a linear mapping W such that the unmixed sequences
$$u(t) = W x(t) = W A s(t)$$
are maximally statistically independent, i.e., find some
$$V = WA = PC$$
where C is a diagonal matrix and P is a permutation matrix.
[Figure: independent components pass through the mixing process A to give the observed sequences; the de-mixing process W recovers the independent components.]
Eric Xing 48
The fundamental restriction in ICA is that the independent components must be nongaussian for ICA to be possible.
This is because Gaussianity is invariant under orthogonal transformations, which makes the matrix A unidentifiable when the independent components are Gaussian.
Principle of ICA: Nongaussianity
Eric Xing 49
Measures of nongaussianity (1)
Kurtosis (a small numerical sketch appears after this list).
Kurtosis can be very sensitive to outliers when its value has to be estimated from a measured sample.
Mutual information
Negative Entropy
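As a quick illustration of the first measure, here is kurtosis computed with the standard excess-kurtosis formula kurt(y) = E[y⁴] − 3(E[y²])², which is zero for a Gaussian; the distributions below are just examples:

```python
import numpy as np

def kurtosis(y):
    """Excess kurtosis kurt(y) = E[y^4] - 3 (E[y^2])^2 of a (centered) sample."""
    y = y - y.mean()
    return np.mean(y**4) - 3 * np.mean(y**2)**2

rng = np.random.default_rng(0)
print(kurtosis(rng.normal(size=100_000)))          # ~0 for a Gaussian
print(kurtosis(rng.laplace(size=100_000)))         # > 0: super-Gaussian (peaky, heavy tails)
print(kurtosis(rng.uniform(-1, 1, size=100_000)))  # < 0: sub-Gaussian (flat)
```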
Eric Xing 50
FastICA - Preprocessing
Centering: make the x's zero-mean variables.
Whitening: transform the observed vector x linearly so that it has unit variance (identity covariance).
One can show that
$$\tilde{x} = E D^{-1/2} E^T x$$
satisfies $E[\tilde{x}\tilde{x}^T] = I$, where $E[xx^T] = E D E^T$ is the eigendecomposition of the covariance.
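A sketch of both preprocessing steps using the standard eigendecomposition-based whitening (an assumption, since the slide's own formulas did not survive extraction; names are illustrative):

```python
import numpy as np

def whiten(X):
    """Center the data and transform it so that its covariance is the identity.

    X: (d, T) array of observations, one signal per row.
    """
    Xc = X - X.mean(axis=1, keepdims=True)          # centering: make the x's zero-mean
    C = np.cov(Xc)                                  # covariance E[x x^T]
    d_eig, E = np.linalg.eigh(C)                    # C = E diag(d_eig) E^T
    W_white = E @ np.diag(d_eig ** -0.5) @ E.T      # whitening matrix E D^{-1/2} E^T
    X_white = W_white @ Xc                          # cov(X_white) ~ I
    return X_white, W_white
```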
Eric Xing 51
FastICA algorithm
Initialize the weight matrix W.
Iterate the FastICA fixed-point update for W.
Repeat until convergence.
The independent components are then the components of u = Wx.
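The exact update shown on the slide did not survive extraction; the sketch below uses the standard FastICA fixed-point update with the tanh nonlinearity and symmetric decorrelation, as an assumed stand-in:

```python
import numpy as np

def fastica(X_white, k, n_iter=200, seed=0):
    """FastICA on whitened data X_white (d x T): fixed-point update with g = tanh."""
    rng = np.random.default_rng(seed)
    d, T = X_white.shape
    W = rng.normal(size=(k, d))
    for _ in range(n_iter):
        U = W @ X_white                                # current estimates u = W x
        G, Gp = np.tanh(U), 1.0 - np.tanh(U) ** 2      # nonlinearity g and its derivative g'
        W = (G @ X_white.T) / T - Gp.mean(axis=1)[:, None] * W   # w <- E[x g(w^T x)] - E[g'(w^T x)] w
        s, V_eig = np.linalg.eigh(W @ W.T)             # symmetric decorrelation: W <- (W W^T)^{-1/2} W
        W = V_eig @ np.diag(s ** -0.5) @ V_eig.T @ W
    return W @ X_white, W                              # recovered components u = W x, unmixing matrix W
```

Combined with the whitening sketch above, `U, W = fastica(whiten(X)[0], k=2)` recovers the sources of the cocktail-party example up to permutation and scaling.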
Eric Xing 52
Summary
There has been wide discussion of the application of Independent Component Analysis (ICA) in signal processing, neural computation and finance.
It was first introduced as a novel tool to separate blind sources in a mixed signal.
The basic idea of ICA is to reconstruct, from the observed sequences, the hypothesized independent original sequences.