Principal Component Analysis and Matrix Factorizations for Learning (Part 1), Chris Ding, ICML 2005 Tutorial

Page 1

Principal Component Analysis and Matrix Factorizations for Learning

Chris Ding, Lawrence Berkeley National Laboratory

Supported by Office of Science, U.S. Dept. of Energy

Page 2

Many unsupervised learning methods are closely related in a simple way

Spectral Clustering

NMF

K-means clustering

PCA

Indicator Matrix Quadratic Clustering

Semi-supervised classification

Semi-supervised clustering

Outlier detection

Page 3

Part 1.A.
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)

• Widely used in a large number of different fields
• Most widely known as PCA (multivariate statistics)
• SVD is the theoretical basis for PCA

Page 4

Brief history

• PCA
  – Draw a plane closest to the data points (Pearson, 1901)
  – Retain most variance (Hotelling, 1933)
• SVD
  – Low-rank approximation (Eckart & Young, 1936)
  – Practical application / efficient computation (Golub & Kahan, 1965)
• Many generalizations

Page 5

PCA and SVD

Data: n points in p dimensions: $X = (x_1, x_2, \ldots, x_n)$

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$

Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$

Principal directions $u_k$ (principal axes, subspace); principal components $v_k$ (projections onto the subspace)

Underlying basis, SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
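To make these three factorizations concrete, here is a minimal numpy sketch (toy data and variable names are my own, not from the tutorial): the eigenvalues of the covariance $XX^T$ and of the Gram matrix $X^TX$ coincide with the squared singular values of X, and a single SVD recovers both sets of vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200
X = rng.normal(size=(p, n))
X -= X.mean(axis=1, keepdims=True)          # center each variable

# Covariance side: C = X X^T = sum_k lambda_k u_k u_k^T
lam_u, U = np.linalg.eigh(X @ X.T)          # columns of U are principal directions u_k

# Gram (kernel) side: X^T X = sum_k lambda_k v_k v_k^T
lam_v, V = np.linalg.eigh(X.T @ X)          # columns of V are principal components v_k

# SVD gives both at once: X = U Sigma V^T, with lambda_k = sigma_k^2
Us, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(np.sort(lam_u), np.sort(s**2)))        # covariance spectrum matches
print(np.allclose(np.sort(lam_v)[-p:], np.sort(s**2)))   # nonzero Gram spectrum matches
```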

Page 6

Further Developments

SVD/PCA
• Principal Curves
• Independent Component Analysis
• Sparse SVD/PCA (many approaches)
• Mixture of Probabilistic PCA
• Generalization to exponential family, max-margin
• Connection to K-means clustering

Kernel (inner-product)
• Kernel PCA

Page 7

Methods of PCA Utilization

Principal components (uncorrelated random variables): $u_k \cdot X = u_k(1) X_1 + \cdots + u_k(d) X_d$

Data: $X = (x_1, x_2, \ldots, x_n)$; SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Dimension reduction (projection onto the low-dimensional subspace): $\tilde{X} = U^T X$, with $U = (u_1, \ldots, u_k)$

Sphering the data (transform the data to $N(0,1)$): $\tilde{X} = C^{-1/2} X = U \Sigma^{-1} U^T X$
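A short numpy sketch of these two uses (synthetic data, names are mine): projecting onto the top-k principal directions reduces the dimension, and multiplying by $C^{-1/2} = U\Sigma^{-1}U^T$ spheres the data so that, in the slide's convention $C = XX^T$, the sphered data has identity "covariance".

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 10, 500, 3
X = rng.normal(size=(p, n)) * rng.uniform(0.5, 3.0, size=(p, 1))   # unequal variances
X -= X.mean(axis=1, keepdims=True)                                  # center

U, s, Vt = np.linalg.svd(X, full_matrices=False)                    # X = U Sigma V^T

# Dimension reduction: project onto the k leading principal directions
X_reduced = U[:, :k].T @ X                                          # k x n

# Sphering: C^{-1/2} X = U Sigma^{-1} U^T X   (with C = X X^T as on the slide)
X_sphered = U @ np.diag(1.0 / s) @ U.T @ X
print(np.allclose(X_sphered @ X_sphered.T, np.eye(p)))              # identity "covariance"
```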

Page 8

Applications of PCA/SVD

• Most popular in multivariate statistics
• Image processing, signal processing
• Physics: principal axis, diagonalization of 2nd-order tensors (mass)
• Climate: Empirical Orthogonal Functions (EOF)
• Kalman filter: $s^{(t+1)} = A s^{(t)}$, $P^{(t+1)} = A P^{(t)} A^T$
• Reduced-order analysis

Page 9

Applications of PCA/SVD

• PCA/SVD is as widely used as the Fast Fourier Transform
  – Both are spectral expansions
  – FFT is used more for partial differential equations
  – PCA/SVD is used more for discrete (data) analysis
  – PCA/SVD will surpass FFT as the computational sciences advance further
• PCA/SVD
  – Selects combinations of variables
  – Dimension reduction
    • An image has $10^4$ pixels; its true dimension is ~20!

Page 10

PCA is a Matrix Factorization (spectral/eigen decomposition)

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T = U \Lambda U^T$

Kernel matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T = V \Lambda V^T$

Underlying basis, SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Principal directions: $U = (u_1, u_2, \ldots, u_k)$
Principal components: $V = (v_1, v_2, \ldots, v_k)$

Page 11

From PCA to Spectral Clustering using Generalized Eigenvectors

In Kernel PCA we compute the eigenvectors: $W v = \lambda v$

Consider the kernel matrix: $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

Generalized eigenvector: $W q = \lambda D q$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = \sum_j w_{ij}$

This leads to Spectral Clustering !
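As a quick sketch of this step (random similarity matrix, my own setup): the generalized problem $Wq = \lambda Dq$ can be solved through the ordinary eigenproblem on the scaled matrix $D^{-1/2} W D^{-1/2}$, with $q = D^{-1/2} v$ — the scaling used on the next slide.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 6))
W = (A + A.T) / 2                          # symmetric nonnegative similarity (kernel) matrix
d = W.sum(axis=1)                          # d_i = sum_j w_ij
D = np.diag(d)

W_tilde = W / np.sqrt(np.outer(d, d))      # D^{-1/2} W D^{-1/2}
lam, V = np.linalg.eigh(W_tilde)

q = V[:, -2] / np.sqrt(d)                  # q = D^{-1/2} v for the 2nd-largest eigenvalue
print(np.allclose(W @ q, lam[-2] * (D @ q)))   # q satisfies W q = lambda D q
```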

Page 12

Scaled PCA ⇒ Spectral Clustering

PCA of the scaled matrix: $\tilde{W} = \sum_k \lambda_k v_k v_k^T$, where $\tilde{W} = D^{-1/2} W D^{-1/2}$, i.e. $\tilde{w}_{ij} = w_{ij} / (d_i d_j)^{1/2}$

Scaled PCA: $W = D^{1/2} \tilde{W} D^{1/2} = D \sum_k \lambda_k q_k q_k^T D$

$q_k = D^{-1/2} v_k$ is the scaled principal component

Page 13

Scaled PCA on a Rectangle Matrix ⇒ Correspondence Analysis

Re-scaling: $\tilde{P} = D_r^{-1/2} P D_c^{-1/2}$, i.e. $\tilde{p}_{ij} = p_{ij} / (p_{i.}\, p_{.j})^{1/2}$

Apply SVD on $\tilde{P}$ and subtract the trivial component:

$P - r c^T / p_{..} = D_r \sum_k \lambda_k f_k g_k^T D_c$

where $r = (p_{1.}, \ldots, p_{n.})^T$, $c = (p_{.1}, \ldots, p_{.n})^T$, and $f_k = D_r^{-1/2} u_k$, $g_k = D_c^{-1/2} v_k$ are the scaled row and column principal components (standard coordinates in CA).

(Zha et al., CIKM 2001; Ding et al., PKDD 2002)

Page 14

Nonnegative Matrix Factorization

Data matrix: n points in p dimensions: $X = (x_1, x_2, \ldots, x_n)$, where each $x_i$ is an image, document, web page, etc.

Decomposition (low-rank approximation): $X \approx F G^T$

Nonnegative matrices: $X_{ij} \ge 0$, $F_{ij} \ge 0$, $G_{ij} \ge 0$

$F = (f_1, f_2, \ldots, f_k)$, $G = (g_1, g_2, \ldots, g_k)$

Page 15

Solving NMF with multiplicative updating

$J = \| X - F G^T \|^2$, $F \ge 0$, $G \ge 0$

Fix F, solve for G; fix G, solve for F.

Lee & Seung (2000) propose the multiplicative updates

$G_{jk} \leftarrow G_{jk} \dfrac{(X^T F)_{jk}}{(G F^T F)_{jk}}, \qquad F_{ik} \leftarrow F_{ik} \dfrac{(X G)_{ik}}{(F G^T G)_{ik}}$

Page 16

Matrix Factorization Summary

Symmetric matrix (kernel matrix, graph):
  PCA: $W = V \Lambda V^T$
  Scaled PCA: $W = D^{1/2} \tilde{W} D^{1/2} = D Q \Lambda Q^T D$
  NMF: $W \approx Q Q^T$

Rectangle matrix (contingency table, bipartite graph):
  PCA: $X = U \Sigma V^T$
  Scaled PCA: $X = D_r^{1/2} \tilde{X} D_c^{1/2} = D_r F \Lambda G^T D_c$
  NMF: $X \approx F G^T$

Page 17

Indicator Matrix Quadratic Clustering

Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$

Kernel K-means clustering: $\max_H \mathrm{Tr}(H^T W H)$ s.t. $H^T H = I$, $H \ge 0$
  K-means: $W = X^T X$; Kernel K-means: $W = \big( \langle \phi(x_i), \phi(x_j) \rangle \big)$

Spectral clustering (normalized cut): $\max_H \mathrm{Tr}(H^T W H)$ s.t. $H^T D H = I$, $H \ge 0$

The difference between the two is the orthogonality constraint on H.
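A small sketch (toy data, my own choices) of the indicator-matrix view: build the unsigned, column-normalized indicator H for a fixed partition and check that $\mathrm{Tr}(H^TWH)$ with $W = X^TX$ is exactly the quantity K-means maximizes, i.e. the total squared norm minus the K-means objective.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2, 9))                       # 9 points in 2-D, columns are points
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
K = 3

# Unsigned, normalized indicator H = (h_1, ..., h_K): H^T H = I, H >= 0
H = np.zeros((X.shape[1], K))
for k in range(K):
    members = labels == k
    H[members, k] = 1.0 / np.sqrt(members.sum())

W = X.T @ X                                       # K-means case: W = X^T X
print(np.allclose(H.T @ H, np.eye(K)))            # orthogonal and nonnegative

# J_K = sum_i ||x_i||^2 - Tr(H^T W H)
centroids = np.stack([X[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
J_K = sum(((X[:, labels == k] - centroids[:, [k]]) ** 2).sum() for k in range(K))
print(np.isclose(J_K, (X ** 2).sum() - np.trace(H.T @ W @ H)))
```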

Page 18

Indicator Matrix Quadratic Clustering

Additional features:

Semi-supervised classification: $\max_H \mathrm{Tr}(H^T W H + C^T H)$

Semi-supervised clustering: $\max_H \mathrm{Tr}(H^T W H + \alpha H^T A H - \beta H^T B H)$, with (A) must-link and (B) cannot-link constraints

Outlier detection: $\max_H \mathrm{Tr}(H^T W H)$, allowing zero rows in H

Nonnegative Lagrangian Relaxation: $H_{ik} \leftarrow H_{ik} \dfrac{(W H + C/2)_{ik}}{(H \alpha)_{ik}}$, $\alpha = H^T W H + H^T C$

Page 19

Tutorial Outline
• PCA
  – Recent developments on PCA/SVD
  – Equivalence to K-means clustering
• Scaled PCA
  – Laplacian matrix
  – Spectral clustering
  – Spectral ordering
• Nonnegative Matrix Factorization
  – Equivalence to K-means clustering
  – Holistic vs. parts-based
• Indicator Matrix Quadratic Clustering
  – Uses Nonnegative Lagrangian Relaxation
  – Includes:
    • K-means and spectral clustering
    • Semi-supervised classification
    • Semi-supervised clustering
    • Outlier detection

Page 20

Part 1.B.
Recent Developments on PCA and SVD

Principal Curves
Independent Component Analysis
Kernel PCA
Mixture of PCA (probabilistic PCA)
Sparse PCA/SVD
  Semi-discrete, truncation, L1 constraint, direct sparsification
Column-Partitioned Matrix Factorizations
2D-PCA/SVD
Equivalence to K-means clustering

Page 21

PCA and SVD

Data matrix: $X = (x_1, x_2, \ldots, x_n)$

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$

Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$

Principal directions $u_k$ (principal axes, subspace); principal components $v_k$ (projections onto the subspace)

Underlying basis, SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T$

Page 22

Kernel PCA

Kernel: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$, with the map $x_i \to \phi(x_i)$

Feature extraction onto a PCA component v: $\langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle$

Indefinite kernels

Generalization to graphs with nonnegative weights

(Scholkopf, Smola, Muller, 1996)

Page 23

Mixture of PCA
• Data has local structure
  – Global PCA on all the data is not useful
• Clustering PCA (Hinton et al.):
  – Use clustering to partition the data into clusters
  – Perform PCA within each cluster
  – No explicit generative model
• Probabilistic PCA (Tipping & Bishop)
  – Latent variables
  – Generative model (Gaussian)
  – Mixture of Gaussians ⇒ mixture of PCA
  – Adding Markov dynamics for the latent variables (Linear Gaussian Models)

Page 24

Probabilistic PCA / Linear Gaussian Model

Probabilistic PCA: $x_i = W s_i + \mu + \varepsilon$, $\varepsilon \sim N(0, \sigma_\varepsilon^2 I)$

Latent variables: $S = (s_1, \ldots, s_n)$

Gaussian prior: $P(s) \sim N(s_0, \sigma_s^2 I)$, so that $x \sim N(W s_0,\ \sigma_\varepsilon^2 I + \sigma_s^2 W W^T)$

Linear Gaussian Model: $s_{i+1} = A s_i + \eta$, $x_i = W s_i + \varepsilon$

(Tipping & Bishop, 1995; Roweis & Ghahramani, 1999)

Page 25

Sparse PCA
• Compute a factorization $X \approx U V^T$
  – U or V is sparse, or both are sparse
• Why sparse?
  – Variable selection (sparse U)
  – When n >> d
  – Storage savings
  – Other new reasons?
• L1 and L2 constraints

Page 26

Sparse PCA: Truncation and Discretization

• Sparsified SVD: $X \approx U \Sigma V^T$, $U = (u_1 \cdots u_k)$, $V = (v_1 \cdots v_k)$
  – Compute $\{u_k, v_k\}$ one pair at a time, truncating the entries below a threshold
  – Recursively compute all pairs using deflation: $X \leftarrow X - \sigma u v^T$
  – (Zhang, Zha & Simon, 2002)
• Semi-discrete decomposition
  – U, V only contain entries in {-1, 0, 1}
  – Iterative algorithm to compute U, V using deflation
  – (Kolda & O'Leary, 1999)
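A rough sketch of the truncation-plus-deflation recipe described for the sparsified SVD (the threshold, data, and function name are my own, not from Zhang, Zha & Simon):

```python
import numpy as np

def sparsified_svd(X, k, tau=0.1):
    """Compute k sparse singular-vector pairs by truncation and deflation."""
    X = X.astype(float).copy()
    U, S, V = [], [], []
    for _ in range(k):
        u_all, s, vt = np.linalg.svd(X, full_matrices=False)
        u, v, sigma = u_all[:, 0], vt[0], s[0]
        u = np.where(np.abs(u) < tau, 0.0, u)      # truncate small entries
        v = np.where(np.abs(v) < tau, 0.0, v)
        U.append(u); S.append(sigma); V.append(v)
        X -= sigma * np.outer(u, v)                # deflation: X <- X - sigma u v^T
    return np.column_stack(U), np.array(S), np.column_stack(V)

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 12))
U, S, V = sparsified_svd(X, k=3)
print(U.shape, S.shape, V.shape)                   # (8, 3) (3,) (12, 3)
```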

Page 27

Sparse PCA: L1 Constraint

• LASSO (Tibshirani, 1996): $\min \| y - X^T \beta \|^2$, $\|\beta\|_1 \le t$

• SCoTLASS (Jolliffe & Uddin, 2003): $\max u^T (X X^T) u$, $\|u\|_1 \le t$, $u^T u_h = 0$

• Least Angle Regression (Efron et al., 2004)

• Sparse PCA (Zou, Hastie & Tibshirani, 2004):
  $\min_{\alpha, \beta} \sum_{i=1}^{n} \| x_i - \alpha \beta^T x_i \|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1$, s.t. $\alpha^T \alpha = I_k$, with $v_j = \beta_j / \|\beta_j\|$

Page 28

Sparse PCA: Direct Sparsification

• Sparse SVD with explicit sparsification (Zhang, Zha & Simon, 2003):
  $\min_{u,v} \| X - d\, u v^T \|_F + \mathrm{nnz}(u) + \mathrm{nnz}(v)$
  – rank-one approximation
  – minimize a bound
  – deflation

• Direct sparse PCA, on the covariance matrix S (d'Aspremont, El Ghaoui, Jordan & Lanckriet, 2004):
  $\max u^T S u = \max \mathrm{Tr}(S u u^T) = \max \mathrm{Tr}(S U)$
  s.t. $\mathrm{Tr}(U) = 1$, $\mathrm{nnz}(U) \le k^2$, $U \succeq 0$, $\mathrm{rank}(U) = 1$

Page 29

Sparse PCA Summary
• Many different approaches
  – Truncation, discretization
  – L1 constraint
  – Direct sparsification
  – Other approaches
• Sparse matrix factorization in general
  – L1 constraint
• Many open questions
  – Orthogonality
  – Uniqueness of the solution, global solution

Page 30

PCA: Further Generalizations

• Generalization to the exponential family (Collins, Dasgupta & Schapire, 2001)

• Maximum Margin Matrix Factorization (Srebro, Rennie & Jaakkola, 2004)
  – Collaborative filtering; the input Y is binary
  – $X = U V^T$, with trace norm $\|X\|_\Sigma = \tfrac{1}{2}(\|U\|_{Fro}^2 + \|V\|_{Fro}^2)$
  – Hard margin: $Y_{ia} X_{ia} \ge 1,\ \forall\, ia \in S$
  – Soft margin: $\min \|X\|_\Sigma + c \sum_{ia \in S} \max(0,\, 1 - Y_{ia} X_{ia})$

Page 31

Column-Partitioned Matrix Factorizations

• Column-partitioned data matrix:
  $X = (\underbrace{x_1, \ldots, x_{n_1}}_{n_1},\ \underbrace{x_{n_1+1}, \ldots, x_{n_1+n_2}}_{n_2},\ \ldots,\ \underbrace{x_{n-n_k+1}, \ldots, x_n}_{n_k})$, with $n_1 + \cdots + n_k = n$
• Partitions are generated by clustering
• Centroid matrix $U = (u_1, \ldots, u_k)$
  – $u_k$ is the centroid of partition k
  – Fix U, compute V: $\min \|X - U V^T\|_F^2 \ \Rightarrow\ V = X^T U (U^T U)^{-1}$
• Represent each partition by an SVD:
  $U = (U^{(1)}, \ldots, U^{(k)})$, $U^{(i)} = (u_1^{(i)}, \ldots, u_{l_i}^{(i)})$
  – Pick the leading left singular vectors of each partition to form U
  – Fix U, compute V
• Several other variations

(Zhang & Zha, 2001; Castelli, Thomasian & Li, 2003; Park, Jeon & Rosen, 2003; Dhillon & Modha, 2001; Zeimpekis & Gallopoulos, 2004)
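For the centroid-matrix variant, a brief numpy sketch (block sizes and data are my own): U collects the partition centroids, and for fixed U the least-squares V is $X^T U (U^TU)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(9)
p = 6
# column-partitioned data: one block of columns per partition
blocks = [rng.normal(m, 0.2, size=(p, n)) for m, n in zip((0.0, 1.0, 2.0), (10, 15, 12))]
X = np.hstack(blocks)

# Centroid matrix U = (u_1, ..., u_k): u_k is the centroid of partition k
U = np.stack([B.mean(axis=1) for B in blocks], axis=1)          # p x k

# Fix U, compute V minimizing ||X - U V^T||_F^2:  V = X^T U (U^T U)^{-1}
V = X.T @ U @ np.linalg.inv(U.T @ U)
print(np.linalg.norm(X - U @ V.T) / np.linalg.norm(X))          # relative residual
```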

Page 32

Two-Dimensional SVD

• A large number of data objects are 2-D: images, maps
• Standard method:
  – convert (re-order) each image into a 1-D vector
  – collect all 1-D vectors into a single (big) matrix
  – apply SVD to the big matrix
• 2D-SVD is developed for 2-D objects
  – Extension of the standard SVD
  – Keeps the 2-D characteristics
  – Improves the quality of the low-dimensional approximation
  – Reduces computation and storage

Page 33

[Figure: a 2-D image written out as a pixel vector]

Linearize a 2-D object into a 1-D object

Page 34

SVD and 2D-SVD

SVD: $X = (x_1, x_2, \ldots, x_n)$, $X = U \Sigma V^T$, $\Sigma = U^T X V$
  U, V are eigenvectors of $X X^T$ and $X^T X$

2D-SVD: $\{A\} = \{A_1, A_2, \ldots, A_n\}$
  Row-row covariance: $F = \sum_i (A_i - \bar{A})(A_i - \bar{A})^T$
  Column-column covariance: $G = \sum_i (A_i - \bar{A})^T (A_i - \bar{A})$
  U, V are eigenvectors of F and G; $A_i = U M_i V^T$, $M_i = U^T A_i V$

Page 35

2D-SVD

$\{A\} = \{A_1, A_2, \ldots, A_n\}$; assume $\bar{A} = 0$

Row-row covariance: $F = \sum_i A_i A_i^T = \sum_k \lambda_k u_k u_k^T$
Column-column covariance: $G = \sum_i A_i^T A_i = \sum_k \zeta_k v_k v_k^T$

$U = (u_1, u_2, \ldots, u_k)$, $V = (v_1, v_2, \ldots, v_k)$

$M_i = U^T A_i V$, $A_i = U M_i V^T$, $i = 1, \ldots, n$ (bilinear subspace)

$A_i \in \mathbb{R}^{r \times c}$, $U \in \mathbb{R}^{r \times k}$, $V \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$

Page 36

2D-SVD Error Analysis

$A_i \approx L M_i R^T$, with $A_i \in \mathbb{R}^{r \times c}$, $L \in \mathbb{R}^{r \times k}$, $R \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$

$J_1 = \min \sum_{i=1}^{n} \| A_i - L M_i \|^2 = \sum_{j=k+1}^{r} \lambda_j$

$J_2 = \min \sum_{i=1}^{n} \| A_i - M_i R^T \|^2 = \sum_{j=k+1}^{c} \zeta_j$

$J_3 = \min \sum_{i=1}^{n} \| A_i - L M_i R^T \|^2 \cong \sum_{j=k+1}^{r} \lambda_j + \sum_{j=k+1}^{c} \zeta_j$

$J_4 = \min \sum_{i=1}^{n} \| A_i - L M_i L^T \|^2 \cong 2 \sum_{j=k+1}^{r} \lambda_j$

SVD: $\min \| X - U \Sigma V^T \|^2 = \sum_{i=k+1}^{p} \sigma_i^2$

Page 37

Temperature maps (January, over 100 years)

Reconstruction errors: SVD/2DSVD = 1.1
Storage: SVD/2DSVD = 8

Page 38

Reconstructed image (figure: SVD vs. 2DSVD)

SVD (K=15), storage 160560
2DSVD (K=15), storage 93060

Page 39

2D-SVD Summary

• 2DSVD is an extension of the standard SVD
• Provides the optimal solution for 4 representations of 2D images/maps
• Substantial improvements in storage, computation, and quality of reconstruction
• Captures 2D characteristics

Page 40

Part 1.C.
K-means Clustering ⇔ Principal Component Analysis

(Equivalence between PCA and K-means)

Page 41

K-means Clustering

• Also called "isodata" or "vector quantization"
• Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)
• Computationally efficient (order $mN$)
• Widely used in practice
  – Benchmark to evaluate other algorithms

Given n points in m dimensions, $X = (x_1, x_2, \ldots, x_n)^T$, the K-means objective is

$\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - c_k \|^2$
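A plain Lloyd-iteration sketch of this objective (toy blobs, no empty-cluster handling; everything here is my own illustration, not the tutorial's code):

```python
import numpy as np

def kmeans(X, K, n_iter=50):
    """Minimize J_K = sum_k sum_{i in C_k} ||x_i - c_k||^2; rows of X are points."""
    centers = X[np.linspace(0, len(X) - 1, K, dtype=int)]   # simple spread-out init
    for _ in range(n_iter):
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)                        # assign to nearest centroid
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    J_K = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))
    return labels, centers, J_K

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers, J_K = kmeans(X, K=3)
print(J_K)
```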

Page 42

PCA is equivalent to K-means

The continuous optimal solution for the cluster indicators in K-means clustering is given by the principal components.

The subspace spanned by the K cluster centroids is given by the PCA subspace.

Page 43

2-way K-means Clustering

Cluster membership indicator: $q(i) = \begin{cases} \sqrt{n_2 / n n_1} & \text{if } i \in C_1 \\ -\sqrt{n_1 / n n_2} & \text{if } i \in C_2 \end{cases}$

$J_D = \dfrac{n_1 n_2}{n} \left[ \dfrac{2\, d(C_1, C_2)}{n_1 n_2} - \dfrac{d(C_1, C_1)}{n_1^2} - \dfrac{d(C_2, C_2)}{n_2^2} \right]$

$J_K = n \langle x^2 \rangle - J_D$, so $\min J_K \Rightarrow \max J_D$

Define the distance matrix $D = (d_{ij})$, $d_{ij} = \| x_i - x_j \|^2$. For centered data,

$J_D = -q^T D q = 2\, q^T (X^T X)\, q = 2\, q^T K q$, where $K = X^T X$

The solution is the principal eigenvector $v_1$ of K. The clusters $C_1$, $C_2$ are determined by:
$C_1 = \{ i \mid v_1(i) < 0 \}$, $C_2 = \{ i \mid v_1(i) \ge 0 \}$
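A toy numpy check of this 2-way result (two Gaussian blobs of my own making): the sign pattern of the principal eigenvector of the centered Gram matrix $K = X^TX$ recovers the two clusters.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.hstack([rng.normal(-2.0, 0.5, size=(2, 40)),
               rng.normal(+2.0, 0.5, size=(2, 40))])   # columns are points, two blobs
X = X - X.mean(axis=1, keepdims=True)                   # center the data

K = X.T @ X                                             # centered Gram (kernel) matrix
w, V = np.linalg.eigh(K)
v1 = V[:, -1]                                           # principal eigenvector

C1 = np.where(v1 < 0)[0]                                # C1 = { i : v1(i) < 0 }
C2 = np.where(v1 >= 0)[0]                               # C2 = { i : v1(i) >= 0 }
print(len(C1), len(C2))                                 # the split matches the two blobs
```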

Page 44

A simple illustration

Page 45

DNA Gene Expression File for Leukemia

Using $v_1$, the tissue samples are separated into 2 clusters with 3 errors.

Running one more round of K-means reduces this to 1 error.

Page 46

Multi-way K-means Clustering

Unsigned cluster membership indicators $h_1, \ldots, h_K$: each $h_k$ has a 1 in the rows of the points belonging to cluster $C_k$ and 0 elsewhere, e.g. for four points in three clusters

$(h_1, h_2, h_3) = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$, columns corresponding to $C_1, C_2, C_3$

Page 47

Multi-way K-means Clustering

$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j$

With the (unsigned, normalized) cluster indicators $H = (h_1, \ldots, h_K)$:

$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k = \sum_i x_i^2 - \mathrm{Tr}(H_K^T X^T X H_K)$

Redundancy: $\sum_{k=1}^{K} n_k^{1/2} h_k = e$

Regularized relaxation: transform $h_1, \ldots, h_K$ to $q_1, \ldots, q_K$ via an orthogonal matrix T:
$(q_1, \ldots, q_k) = (h_1, \ldots, h_k) T$, with $q_1 = e / n^{1/2}$ (i.e. $Q_K = H_K T$)

Page 48

Multi-way K-means Clustering

$\max \mathrm{Tr}[\, Q_{k-1}^T (X^T X)\, Q_{k-1} ]$, where $Q_{k-1} = (q_2, \ldots, q_k)$

The optimal solutions for $q_2, \ldots, q_k$ are given by the principal components $v_2, \ldots, v_k$.

$J_K$ is bounded below by the total variance minus the sum of the K-1 largest eigenvalues of the covariance:

$n \langle x^2 \rangle - \sum_{k=1}^{K-1} \lambda_k < \min J_K < n \langle x^2 \rangle$
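A quick numerical check of this bound (synthetic blobs, plain Lloyd iterations; all choices here are mine): the K-means objective found by the algorithm should stay above $n\langle x^2\rangle$ minus the sum of the K-1 largest eigenvalues of $XX^T$.

```python
import numpy as np

rng = np.random.default_rng(8)
X = np.hstack([rng.normal(m, 0.4, size=(2, 50)) for m in (0.0, 2.0, 4.0)])
X = X - X.mean(axis=1, keepdims=True)                   # center; columns are points
n, K = X.shape[1], 3

total = (X ** 2).sum()                                  # n <x^2> after centering
lam = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]        # eigenvalues of X X^T
lower_bound = total - lam[:K - 1].sum()                 # n<x^2> - sum_{k=1}^{K-1} lambda_k

# simple Lloyd's algorithm to get an achievable J_K (an upper bound on min J_K)
centers = X[:, np.linspace(0, n - 1, K, dtype=int)]     # spread-out initialization
for _ in range(100):
    dist2 = ((X[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
    labels = dist2.argmin(axis=1)
    centers = np.stack([X[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
J_K = sum(((X[:, labels == k] - centers[:, [k]]) ** 2).sum() for k in range(K))

print(lower_bound <= J_K, round(lower_bound, 2), round(J_K, 2))
```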

Page 49

Consistency: 2-way and K-way Approaches

The orthogonal transform T maps $(h_1, h_2)$ to $(q_1, q_2)$ and recovers the original 2-way cluster indicator:

$h_1 = (1, \ldots, 1, 0, \ldots, 0)^T$, $h_2 = (0, \ldots, 0, 1, \ldots, 1)^T$

$q_1 = (1, \ldots, 1)^T$, $q_2 = (a, \ldots, a, -b, \ldots, -b)^T$, with $a = \sqrt{n_2 / n n_1}$, $b = \sqrt{n_1 / n n_2}$

$T = \begin{pmatrix} \sqrt{n_1/n} & \sqrt{n_2/n} \\ \sqrt{n_2/n} & -\sqrt{n_1/n} \end{pmatrix}$

Page 50

Test of the lower bound for K-means clustering: the quantity reported is $|J_{opt} - J_{LB}| / J_{opt}$.

The lower bound is within 0.6-1.5% of the optimal value.

Page 51

Cluster Subspace (spanned by the K centroids) = PCA Subspace

Given a data point x, $P = \sum_k c_k c_k^T$ projects x into the cluster subspace.

The centroid is given by $c_k = \sum_i h_k(i)\, x_i = X h_k$, so

$P = \sum_k c_k c_k^T = \sum_k X h_k h_k^T X^T \to \sum_k X v_k v_k^T X^T = \sum_k \lambda_k u_k u_k^T$

$P_{K\text{-means}} = \sum_k \lambda_k u_k u_k^T \ \Leftrightarrow\ \sum_k u_k u_k^T \equiv P_{PCA}$

PCA automatically projects into the cluster subspace.

PCA is the unsupervised version of LDA.

Page 52

Effectiveness of PCA Dimension Reduction

Page 53

Kernel K-means Clustering

Kernel K-means objective, with the map $x_i \to \phi(x_i)$:

$\min J_K^\phi = \sum_{k=1}^{K} \sum_{i \in C_k} \| \phi(x_i) - c_k^\phi \|^2$

$J_K^\phi = \sum_i \| \phi(x_i) \|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)$

$\Rightarrow \max \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle$

Page 54

Kernel K-means clustering is equivalent to Kernel PCA

The continuous optimal solution for the cluster indicators is given by the Kernel PCA components.

The subspace spanned by the K cluster centroids is given by the Kernel PCA principal subspace.