Principal Component Analysis and Matrix Factorizations for Learning (Part 1), Chris Ding, ICML 2005 Tutorial

Page 1

Principal Component Analysis and Matrix Factorizations for Learning

Chris Ding, Lawrence Berkeley National Laboratory

Supported by Office of Science, U.S. Dept. of Energy

Page 2

Many unsupervised learning methods are closely related in a simple way

Spectral Clustering

NMF

K-means clustering

PCA

Indicator Matrix Quadratic Clustering

Semi-supervised classification

Semi-supervised clustering

Outlier detection

Page 3

Part 1.A.
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)

• Widely used in a large number of different fields
• Most widely known as PCA (multivariate statistics)
• SVD is the theoretical basis for PCA

Page 4

Brief history

• PCA
  – Draw a plane closest to the data points (Pearson, 1901)
  – Retain most variance (Hotelling, 1933)
• SVD
  – Low-rank approximation (Eckart & Young, 1936)
  – Practical application / efficient computation (Golub & Kahan, 1965)
• Many generalizations

Page 5

PCA and SVD

Data: n points in p dimensions: $X = (x_1, x_2, \ldots, x_n)$

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$

Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$

Principal directions $u_k$ (principal axes, subspace); principal components $v_k$ (projections onto the subspace)

Underlying basis, SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
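To make these three factorizations concrete, here is a minimal numpy sketch (toy data and variable names are my own, not from the tutorial): the eigenvalues of the covariance $XX^T$ and of the Gram matrix $X^TX$ coincide with the squared singular values of X, and a single SVD recovers both sets of vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200
X = rng.normal(size=(p, n))
X -= X.mean(axis=1, keepdims=True)          # center each variable

# Covariance side: C = X X^T = sum_k lambda_k u_k u_k^T
lam_u, U = np.linalg.eigh(X @ X.T)          # columns of U are principal directions u_k

# Gram (kernel) side: X^T X = sum_k lambda_k v_k v_k^T
lam_v, V = np.linalg.eigh(X.T @ X)          # columns of V are principal components v_k

# SVD gives both at once: X = U Sigma V^T, with lambda_k = sigma_k^2
Us, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(np.sort(lam_u), np.sort(s**2)))        # covariance spectrum matches
print(np.allclose(np.sort(lam_v)[-p:], np.sort(s**2)))   # nonzero Gram spectrum matches
```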

Page 6

Further Developments

SVD/PCA
• Principal Curves
• Independent Component Analysis
• Sparse SVD/PCA (many approaches)
• Mixture of Probabilistic PCA
• Generalization to exponential family, max-margin
• Connection to K-means clustering

Kernel (inner-product)
• Kernel PCA

Page 7

Methods of PCA Utilization

Principal components (uncorrelated random variables): $u_k \cdot X = u_k(1) X_1 + \cdots + u_k(d) X_d$

Data: $X = (x_1, x_2, \ldots, x_n)$; SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Dimension reduction (projection onto the low-dimensional subspace): $\tilde{X} = U^T X$, with $U = (u_1, \ldots, u_k)$

Sphering the data (transform the data to $N(0,1)$): $\tilde{X} = C^{-1/2} X = U \Sigma^{-1} U^T X$
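A short numpy sketch of these two uses (synthetic data, names are mine): projecting onto the top-k principal directions reduces the dimension, and multiplying by $C^{-1/2} = U\Sigma^{-1}U^T$ spheres the data so that, in the slide's convention $C = XX^T$, the sphered data has identity "covariance".

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 10, 500, 3
X = rng.normal(size=(p, n)) * rng.uniform(0.5, 3.0, size=(p, 1))   # unequal variances
X -= X.mean(axis=1, keepdims=True)                                  # center

U, s, Vt = np.linalg.svd(X, full_matrices=False)                    # X = U Sigma V^T

# Dimension reduction: project onto the k leading principal directions
X_reduced = U[:, :k].T @ X                                          # k x n

# Sphering: C^{-1/2} X = U Sigma^{-1} U^T X   (with C = X X^T as on the slide)
X_sphered = U @ np.diag(1.0 / s) @ U.T @ X
print(np.allclose(X_sphered @ X_sphered.T, np.eye(p)))              # identity "covariance"
```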

Page 8

Applications of PCA/SVD

• Most popular in multivariate statistics
• Image processing, signal processing
• Physics: principal axis, diagonalization of 2nd-order tensors (mass)
• Climate: Empirical Orthogonal Functions (EOF)
• Kalman filter: $s^{(t+1)} = A s^{(t)}$, $P^{(t+1)} = A P^{(t)} A^T$
• Reduced-order analysis

Page 9

Applications of PCA/SVD

• PCA/SVD is as widely used as the Fast Fourier Transform
  – Both are spectral expansions
  – FFT is used more for partial differential equations
  – PCA/SVD is used more for discrete (data) analysis
  – PCA/SVD will surpass FFT as the computational sciences advance further
• PCA/SVD
  – Selects combinations of variables
  – Dimension reduction
    • An image has $10^4$ pixels; its true dimension is ~20!

Page 10

PCA is a Matrix Factorization (spectral/eigen decomposition)

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T = U \Lambda U^T$

Kernel matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T = V \Lambda V^T$

Underlying basis, SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Principal directions: $U = (u_1, u_2, \ldots, u_k)$
Principal components: $V = (v_1, v_2, \ldots, v_k)$

Page 11

From PCA to Spectral Clustering using Generalized Eigenvectors

In Kernel PCA we compute the eigenvectors: $W v = \lambda v$

Consider the kernel matrix: $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

Generalized eigenvector: $W q = \lambda D q$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = \sum_j w_{ij}$

This leads to Spectral Clustering !
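As a quick sketch of this step (random similarity matrix, my own setup): the generalized problem $Wq = \lambda Dq$ can be solved through the ordinary eigenproblem on the scaled matrix $D^{-1/2} W D^{-1/2}$, with $q = D^{-1/2} v$ — the scaling used on the next slide.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 6))
W = (A + A.T) / 2                          # symmetric nonnegative similarity (kernel) matrix
d = W.sum(axis=1)                          # d_i = sum_j w_ij
D = np.diag(d)

W_tilde = W / np.sqrt(np.outer(d, d))      # D^{-1/2} W D^{-1/2}
lam, V = np.linalg.eigh(W_tilde)

q = V[:, -2] / np.sqrt(d)                  # q = D^{-1/2} v for the 2nd-largest eigenvalue
print(np.allclose(W @ q, lam[-2] * (D @ q)))   # q satisfies W q = lambda D q
```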

Page 12

Scaled PCA ⇒ Spectral Clustering

PCA of the scaled matrix: $\tilde{W} = \sum_k \lambda_k v_k v_k^T$, where $\tilde{W} = D^{-1/2} W D^{-1/2}$, i.e. $\tilde{w}_{ij} = w_{ij} / (d_i d_j)^{1/2}$

Scaled PCA: $W = D^{1/2} \tilde{W} D^{1/2} = D \sum_k \lambda_k q_k q_k^T D$

$q_k = D^{-1/2} v_k$ is the scaled principal component

Page 13

Scaled PCA on a Rectangle Matrix ⇒ Correspondence Analysis

Re-scaling: $\tilde{P} = D_r^{-1/2} P D_c^{-1/2}$, i.e. $\tilde{p}_{ij} = p_{ij} / (p_{i.}\, p_{.j})^{1/2}$

Apply SVD on $\tilde{P}$ and subtract the trivial component:

$P - r c^T / p_{..} = D_r \sum_k \lambda_k f_k g_k^T D_c$

where $r = (p_{1.}, \ldots, p_{n.})^T$, $c = (p_{.1}, \ldots, p_{.n})^T$, and $f_k = D_r^{-1/2} u_k$, $g_k = D_c^{-1/2} v_k$ are the scaled row and column principal components (standard coordinates in CA).

(Zha et al., CIKM 2001; Ding et al., PKDD 2002)

Page 14

Nonnegative Matrix Factorization

Data matrix: n points in p dimensions: $X = (x_1, x_2, \ldots, x_n)$, where each $x_i$ is an image, document, web page, etc.

Decomposition (low-rank approximation): $X \approx F G^T$

Nonnegative matrices: $X_{ij} \ge 0$, $F_{ij} \ge 0$, $G_{ij} \ge 0$

$F = (f_1, f_2, \ldots, f_k)$, $G = (g_1, g_2, \ldots, g_k)$

Page 15

Solving NMF with multiplicative updating

$J = \| X - F G^T \|^2$, $F \ge 0$, $G \ge 0$

Fix F, solve for G; fix G, solve for F.

Lee & Seung (2000) propose the multiplicative updates

$G_{jk} \leftarrow G_{jk} \dfrac{(X^T F)_{jk}}{(G F^T F)_{jk}}, \qquad F_{ik} \leftarrow F_{ik} \dfrac{(X G)_{ik}}{(F G^T G)_{ik}}$

Page 16

Matrix Factorization Summary

Symmetric matrix (kernel matrix, graph):
  PCA: $W = V \Lambda V^T$
  Scaled PCA: $W = D^{1/2} \tilde{W} D^{1/2} = D Q \Lambda Q^T D$
  NMF: $W \approx Q Q^T$

Rectangle matrix (contingency table, bipartite graph):
  PCA: $X = U \Sigma V^T$
  Scaled PCA: $X = D_r^{1/2} \tilde{X} D_c^{1/2} = D_r F \Lambda G^T D_c$
  NMF: $X \approx F G^T$

Page 17

Indicator Matrix Quadratic Clustering

Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$

Kernel K-means clustering: $\max_H \mathrm{Tr}(H^T W H)$ s.t. $H^T H = I$, $H \ge 0$
  K-means: $W = X^T X$; Kernel K-means: $W = \big( \langle \phi(x_i), \phi(x_j) \rangle \big)$

Spectral clustering (normalized cut): $\max_H \mathrm{Tr}(H^T W H)$ s.t. $H^T D H = I$, $H \ge 0$

The difference between the two is the orthogonality constraint on H.
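A small sketch (toy data, my own choices) of the indicator-matrix view: build the unsigned, column-normalized indicator H for a fixed partition and check that $\mathrm{Tr}(H^TWH)$ with $W = X^TX$ is exactly the quantity K-means maximizes, i.e. the total squared norm minus the K-means objective.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2, 9))                       # 9 points in 2-D, columns are points
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
K = 3

# Unsigned, normalized indicator H = (h_1, ..., h_K): H^T H = I, H >= 0
H = np.zeros((X.shape[1], K))
for k in range(K):
    members = labels == k
    H[members, k] = 1.0 / np.sqrt(members.sum())

W = X.T @ X                                       # K-means case: W = X^T X
print(np.allclose(H.T @ H, np.eye(K)))            # orthogonal and nonnegative

# J_K = sum_i ||x_i||^2 - Tr(H^T W H)
centroids = np.stack([X[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
J_K = sum(((X[:, labels == k] - centroids[:, [k]]) ** 2).sum() for k in range(K))
print(np.isclose(J_K, (X ** 2).sum() - np.trace(H.T @ W @ H)))
```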

Page 18

Indicator Matrix Quadratic Clustering

Additional features:

Semi-supervised classification: $\max_H \mathrm{Tr}(H^T W H + C^T H)$

Semi-supervised clustering: $\max_H \mathrm{Tr}(H^T W H + \alpha H^T A H - \beta H^T B H)$, with (A) must-link and (B) cannot-link constraints

Outlier detection: $\max_H \mathrm{Tr}(H^T W H)$, allowing zero rows in H

Nonnegative Lagrangian Relaxation: $H_{ik} \leftarrow H_{ik} \dfrac{(W H + C/2)_{ik}}{(H \alpha)_{ik}}$, $\alpha = H^T W H + H^T C$

Page 19

Tutorial Outline
• PCA
  – Recent developments on PCA/SVD
  – Equivalence to K-means clustering
• Scaled PCA
  – Laplacian matrix
  – Spectral clustering
  – Spectral ordering
• Nonnegative Matrix Factorization
  – Equivalence to K-means clustering
  – Holistic vs. parts-based
• Indicator Matrix Quadratic Clustering
  – Uses Nonnegative Lagrangian Relaxation
  – Includes:
    • K-means and spectral clustering
    • Semi-supervised classification
    • Semi-supervised clustering
    • Outlier detection

Page 20

Part 1.B.
Recent Developments on PCA and SVD

Principal Curves
Independent Component Analysis
Kernel PCA
Mixture of PCA (probabilistic PCA)
Sparse PCA/SVD
  Semi-discrete, truncation, L1 constraint, direct sparsification
Column-Partitioned Matrix Factorizations
2D-PCA/SVD
Equivalence to K-means clustering

Page 21

PCA and SVD

Data matrix: $X = (x_1, x_2, \ldots, x_n)$

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$

Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$

Principal directions $u_k$ (principal axes, subspace); principal components $v_k$ (projections onto the subspace)

Underlying basis, SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T$

Page 22

Kernel PCA

Kernel: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$, with the map $x_i \to \phi(x_i)$

Feature extraction onto a PCA component v: $\langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle$

Indefinite kernels

Generalization to graphs with nonnegative weights

(Scholkopf, Smola, Muller, 1996)

Page 23

Mixture of PCA
• Data has local structure
  – Global PCA on all the data is not useful
• Clustering PCA (Hinton et al.):
  – Use clustering to partition the data into clusters
  – Perform PCA within each cluster
  – No explicit generative model
• Probabilistic PCA (Tipping & Bishop)
  – Latent variables
  – Generative model (Gaussian)
  – Mixture of Gaussians ⇒ mixture of PCA
  – Adding Markov dynamics for the latent variables (Linear Gaussian Models)

Page 24

Probabilistic PCA / Linear Gaussian Model

Probabilistic PCA: $x_i = W s_i + \mu + \varepsilon$, $\varepsilon \sim N(0, \sigma_\varepsilon^2 I)$

Latent variables: $S = (s_1, \ldots, s_n)$

Gaussian prior: $P(s) \sim N(s_0, \sigma_s^2 I)$, so that $x \sim N(W s_0,\ \sigma_\varepsilon^2 I + \sigma_s^2 W W^T)$

Linear Gaussian Model: $s_{i+1} = A s_i + \eta$, $x_i = W s_i + \varepsilon$

(Tipping & Bishop, 1995; Roweis & Ghahramani, 1999)

Page 25

Sparse PCA
• Compute a factorization $X \approx U V^T$
  – U or V is sparse, or both are sparse
• Why sparse?
  – Variable selection (sparse U)
  – When n >> d
  – Storage savings
  – Other new reasons?
• L1 and L2 constraints

Page 26

Sparse PCA: Truncation and Discretization

• Sparsified SVD: $X \approx U \Sigma V^T$, $U = (u_1 \cdots u_k)$, $V = (v_1 \cdots v_k)$
  – Compute $\{u_k, v_k\}$ one pair at a time, truncating the entries below a threshold
  – Recursively compute all pairs using deflation: $X \leftarrow X - \sigma u v^T$
  – (Zhang, Zha & Simon, 2002)
• Semi-discrete decomposition
  – U, V only contain entries in {-1, 0, 1}
  – Iterative algorithm to compute U, V using deflation
  – (Kolda & O'Leary, 1999)
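A rough sketch of the truncation-plus-deflation recipe described for the sparsified SVD (the threshold, data, and function name are my own, not from Zhang, Zha & Simon):

```python
import numpy as np

def sparsified_svd(X, k, tau=0.1):
    """Compute k sparse singular-vector pairs by truncation and deflation."""
    X = X.astype(float).copy()
    U, S, V = [], [], []
    for _ in range(k):
        u_all, s, vt = np.linalg.svd(X, full_matrices=False)
        u, v, sigma = u_all[:, 0], vt[0], s[0]
        u = np.where(np.abs(u) < tau, 0.0, u)      # truncate small entries
        v = np.where(np.abs(v) < tau, 0.0, v)
        U.append(u); S.append(sigma); V.append(v)
        X -= sigma * np.outer(u, v)                # deflation: X <- X - sigma u v^T
    return np.column_stack(U), np.array(S), np.column_stack(V)

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 12))
U, S, V = sparsified_svd(X, k=3)
print(U.shape, S.shape, V.shape)                   # (8, 3) (3,) (12, 3)
```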

Page 27

Sparse PCA: L1 Constraint

• LASSO (Tibshirani, 1996): $\min \| y - X^T \beta \|^2$, $\|\beta\|_1 \le t$

• SCoTLASS (Jolliffe & Uddin, 2003): $\max u^T (X X^T) u$, $\|u\|_1 \le t$, $u^T u_h = 0$

• Least Angle Regression (Efron et al., 2004)

• Sparse PCA (Zou, Hastie & Tibshirani, 2004):
  $\min_{\alpha, \beta} \sum_{i=1}^{n} \| x_i - \alpha \beta^T x_i \|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1$, s.t. $\alpha^T \alpha = I_k$, with $v_j = \beta_j / \|\beta_j\|$

Page 28

Sparse PCA: Direct Sparsification

• Sparse SVD with explicit sparsification (Zhang, Zha & Simon, 2003):
  $\min_{u,v} \| X - d\, u v^T \|_F + \mathrm{nnz}(u) + \mathrm{nnz}(v)$
  – rank-one approximation
  – minimize a bound
  – deflation

• Direct sparse PCA, on the covariance matrix S (d'Aspremont, El Ghaoui, Jordan & Lanckriet, 2004):
  $\max u^T S u = \max \mathrm{Tr}(S u u^T) = \max \mathrm{Tr}(S U)$
  s.t. $\mathrm{Tr}(U) = 1$, $\mathrm{nnz}(U) \le k^2$, $U \succeq 0$, $\mathrm{rank}(U) = 1$

Page 29

Sparse PCA Summary
• Many different approaches
  – Truncation, discretization
  – L1 constraint
  – Direct sparsification
  – Other approaches
• Sparse matrix factorization in general
  – L1 constraint
• Many open questions
  – Orthogonality
  – Uniqueness of the solution, global solution

Page 30

PCA: Further Generalizations

• Generalization to the exponential family (Collins, Dasgupta & Schapire, 2001)

• Maximum Margin Matrix Factorization (Srebro, Rennie & Jaakkola, 2004)
  – Collaborative filtering; the input Y is binary
  – $X = U V^T$, with trace norm $\|X\|_\Sigma = \tfrac{1}{2}(\|U\|_{Fro}^2 + \|V\|_{Fro}^2)$
  – Hard margin: $Y_{ia} X_{ia} \ge 1,\ \forall\, ia \in S$
  – Soft margin: $\min \|X\|_\Sigma + c \sum_{ia \in S} \max(0,\, 1 - Y_{ia} X_{ia})$

Page 31

Column-Partitioned Matrix Factorizations

• Column-partitioned data matrix:
  $X = (\underbrace{x_1, \ldots, x_{n_1}}_{n_1},\ \underbrace{x_{n_1+1}, \ldots, x_{n_1+n_2}}_{n_2},\ \ldots,\ \underbrace{x_{n-n_k+1}, \ldots, x_n}_{n_k})$, with $n_1 + \cdots + n_k = n$
• Partitions are generated by clustering
• Centroid matrix $U = (u_1, \ldots, u_k)$
  – $u_k$ is the centroid of partition k
  – Fix U, compute V: $\min \|X - U V^T\|_F^2 \ \Rightarrow\ V = X^T U (U^T U)^{-1}$
• Represent each partition by an SVD:
  $U = (U^{(1)}, \ldots, U^{(k)})$, $U^{(i)} = (u_1^{(i)}, \ldots, u_{l_i}^{(i)})$
  – Pick the leading left singular vectors of each partition to form U
  – Fix U, compute V
• Several other variations

(Zhang & Zha, 2001; Castelli, Thomasian & Li, 2003; Park, Jeon & Rosen, 2003; Dhillon & Modha, 2001; Zeimpekis & Gallopoulos, 2004)
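For the centroid-matrix variant, a brief numpy sketch (block sizes and data are my own): U collects the partition centroids, and for fixed U the least-squares V is $X^T U (U^TU)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(9)
p = 6
# column-partitioned data: one block of columns per partition
blocks = [rng.normal(m, 0.2, size=(p, n)) for m, n in zip((0.0, 1.0, 2.0), (10, 15, 12))]
X = np.hstack(blocks)

# Centroid matrix U = (u_1, ..., u_k): u_k is the centroid of partition k
U = np.stack([B.mean(axis=1) for B in blocks], axis=1)          # p x k

# Fix U, compute V minimizing ||X - U V^T||_F^2:  V = X^T U (U^T U)^{-1}
V = X.T @ U @ np.linalg.inv(U.T @ U)
print(np.linalg.norm(X - U @ V.T) / np.linalg.norm(X))          # relative residual
```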

Page 32

Two-Dimensional SVD

• A large number of data objects are 2-D: images, maps
• Standard method:
  – convert (re-order) each image into a 1-D vector
  – collect all 1-D vectors into a single (big) matrix
  – apply SVD to the big matrix
• 2D-SVD is developed for 2-D objects
  – Extension of the standard SVD
  – Keeps the 2-D characteristics
  – Improves the quality of the low-dimensional approximation
  – Reduces computation and storage

Page 33

[Figure: a 2-D image written out as a pixel vector]

Linearize a 2-D object into a 1-D object

Page 34

SVD and 2D-SVD

SVD: $X = (x_1, x_2, \ldots, x_n)$, $X = U \Sigma V^T$, $\Sigma = U^T X V$
  U, V are eigenvectors of $X X^T$ and $X^T X$

2D-SVD: $\{A\} = \{A_1, A_2, \ldots, A_n\}$
  Row-row covariance: $F = \sum_i (A_i - \bar{A})(A_i - \bar{A})^T$
  Column-column covariance: $G = \sum_i (A_i - \bar{A})^T (A_i - \bar{A})$
  U, V are eigenvectors of F and G; $A_i = U M_i V^T$, $M_i = U^T A_i V$

Page 35

2D-SVD

$\{A\} = \{A_1, A_2, \ldots, A_n\}$; assume $\bar{A} = 0$

Row-row covariance: $F = \sum_i A_i A_i^T = \sum_k \lambda_k u_k u_k^T$
Column-column covariance: $G = \sum_i A_i^T A_i = \sum_k \zeta_k v_k v_k^T$

$U = (u_1, u_2, \ldots, u_k)$, $V = (v_1, v_2, \ldots, v_k)$

$M_i = U^T A_i V$, $A_i = U M_i V^T$, $i = 1, \ldots, n$ (bilinear subspace)

$A_i \in \mathbb{R}^{r \times c}$, $U \in \mathbb{R}^{r \times k}$, $V \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$

Page 36

2D-SVD Error Analysis

$A_i \approx L M_i R^T$, with $A_i \in \mathbb{R}^{r \times c}$, $L \in \mathbb{R}^{r \times k}$, $R \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$

$J_1 = \min \sum_{i=1}^{n} \| A_i - L M_i \|^2 = \sum_{j=k+1}^{r} \lambda_j$

$J_2 = \min \sum_{i=1}^{n} \| A_i - M_i R^T \|^2 = \sum_{j=k+1}^{c} \zeta_j$

$J_3 = \min \sum_{i=1}^{n} \| A_i - L M_i R^T \|^2 \cong \sum_{j=k+1}^{r} \lambda_j + \sum_{j=k+1}^{c} \zeta_j$

$J_4 = \min \sum_{i=1}^{n} \| A_i - L M_i L^T \|^2 \cong 2 \sum_{j=k+1}^{r} \lambda_j$

SVD: $\min \| X - U \Sigma V^T \|^2 = \sum_{i=k+1}^{p} \sigma_i^2$

Page 37

Temperature maps (January, over 100 years)

Reconstruction errors: SVD/2DSVD = 1.1
Storage: SVD/2DSVD = 8

Page 38

Reconstructed image (figure: SVD vs. 2DSVD)

SVD (K=15), storage 160560
2DSVD (K=15), storage 93060

Page 39

2D-SVD Summary

• 2DSVD is an extension of the standard SVD
• Provides the optimal solution for 4 representations of 2D images/maps
• Substantial improvements in storage, computation, and quality of reconstruction
• Captures 2D characteristics

Page 40

Part 1.C.
K-means Clustering ⇔ Principal Component Analysis

(Equivalence between PCA and K-means)

Page 41

K-means Clustering

• Also called "isodata" or "vector quantization"
• Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)
• Computationally efficient (order $mN$)
• Widely used in practice
  – Benchmark to evaluate other algorithms

Given n points in m dimensions, $X = (x_1, x_2, \ldots, x_n)^T$, the K-means objective is

$\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - c_k \|^2$
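A plain Lloyd-iteration sketch of this objective (toy blobs, no empty-cluster handling; everything here is my own illustration, not the tutorial's code):

```python
import numpy as np

def kmeans(X, K, n_iter=50):
    """Minimize J_K = sum_k sum_{i in C_k} ||x_i - c_k||^2; rows of X are points."""
    centers = X[np.linspace(0, len(X) - 1, K, dtype=int)]   # simple spread-out init
    for _ in range(n_iter):
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)                        # assign to nearest centroid
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    J_K = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))
    return labels, centers, J_K

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers, J_K = kmeans(X, K=3)
print(J_K)
```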

Page 42

PCA is equivalent to K-means

The continuous optimal solution for the cluster indicators in K-means clustering is given by the principal components.

The subspace spanned by the K cluster centroids is given by the PCA subspace.

Page 43

2-way K-means Clustering

Cluster membership indicator: $q(i) = \begin{cases} \sqrt{n_2 / n n_1} & \text{if } i \in C_1 \\ -\sqrt{n_1 / n n_2} & \text{if } i \in C_2 \end{cases}$

$J_D = \dfrac{n_1 n_2}{n} \left[ \dfrac{2\, d(C_1, C_2)}{n_1 n_2} - \dfrac{d(C_1, C_1)}{n_1^2} - \dfrac{d(C_2, C_2)}{n_2^2} \right]$

$J_K = n \langle x^2 \rangle - J_D$, so $\min J_K \Rightarrow \max J_D$

Define the distance matrix $D = (d_{ij})$, $d_{ij} = \| x_i - x_j \|^2$. For centered data,

$J_D = -q^T D q = 2\, q^T (X^T X)\, q = 2\, q^T K q$, where $K = X^T X$

The solution is the principal eigenvector $v_1$ of K. The clusters $C_1$, $C_2$ are determined by:
$C_1 = \{ i \mid v_1(i) < 0 \}$, $C_2 = \{ i \mid v_1(i) \ge 0 \}$
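A toy numpy check of this 2-way result (two Gaussian blobs of my own making): the sign pattern of the principal eigenvector of the centered Gram matrix $K = X^TX$ recovers the two clusters.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.hstack([rng.normal(-2.0, 0.5, size=(2, 40)),
               rng.normal(+2.0, 0.5, size=(2, 40))])   # columns are points, two blobs
X = X - X.mean(axis=1, keepdims=True)                   # center the data

K = X.T @ X                                             # centered Gram (kernel) matrix
w, V = np.linalg.eigh(K)
v1 = V[:, -1]                                           # principal eigenvector

C1 = np.where(v1 < 0)[0]                                # C1 = { i : v1(i) < 0 }
C2 = np.where(v1 >= 0)[0]                               # C2 = { i : v1(i) >= 0 }
print(len(C1), len(C2))                                 # the split matches the two blobs
```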

Page 44

A simple illustration

Page 45

DNA Gene Expression File for Leukemia

Using $v_1$, the tissue samples are separated into 2 clusters with 3 errors.

Running one more round of K-means reduces this to 1 error.

Page 46

Multi-way K-means Clustering

Unsigned cluster membership indicators $h_1, \ldots, h_K$: each $h_k$ has a 1 in the rows of the points belonging to cluster $C_k$ and 0 elsewhere, e.g. for four points in three clusters

$(h_1, h_2, h_3) = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$, columns corresponding to $C_1, C_2, C_3$

Page 47

Multi-way K-means Clustering

$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j$

With the (unsigned, normalized) cluster indicators $H = (h_1, \ldots, h_K)$:

$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k = \sum_i x_i^2 - \mathrm{Tr}(H_K^T X^T X H_K)$

Redundancy: $\sum_{k=1}^{K} n_k^{1/2} h_k = e$

Regularized relaxation: transform $h_1, \ldots, h_K$ to $q_1, \ldots, q_K$ via an orthogonal matrix T:
$(q_1, \ldots, q_k) = (h_1, \ldots, h_k) T$, with $q_1 = e / n^{1/2}$ (i.e. $Q_K = H_K T$)

Page 48

Multi-way K-means Clustering

$\max \mathrm{Tr}[\, Q_{k-1}^T (X^T X)\, Q_{k-1} ]$, where $Q_{k-1} = (q_2, \ldots, q_k)$

The optimal solutions for $q_2, \ldots, q_k$ are given by the principal components $v_2, \ldots, v_k$.

$J_K$ is bounded below by the total variance minus the sum of the K-1 largest eigenvalues of the covariance:

$n \langle x^2 \rangle - \sum_{k=1}^{K-1} \lambda_k < \min J_K < n \langle x^2 \rangle$
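A quick numerical check of this bound (synthetic blobs, plain Lloyd iterations; all choices here are mine): the K-means objective found by the algorithm should stay above $n\langle x^2\rangle$ minus the sum of the K-1 largest eigenvalues of $XX^T$.

```python
import numpy as np

rng = np.random.default_rng(8)
X = np.hstack([rng.normal(m, 0.4, size=(2, 50)) for m in (0.0, 2.0, 4.0)])
X = X - X.mean(axis=1, keepdims=True)                   # center; columns are points
n, K = X.shape[1], 3

total = (X ** 2).sum()                                  # n <x^2> after centering
lam = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]        # eigenvalues of X X^T
lower_bound = total - lam[:K - 1].sum()                 # n<x^2> - sum_{k=1}^{K-1} lambda_k

# simple Lloyd's algorithm to get an achievable J_K (an upper bound on min J_K)
centers = X[:, np.linspace(0, n - 1, K, dtype=int)]     # spread-out initialization
for _ in range(100):
    dist2 = ((X[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
    labels = dist2.argmin(axis=1)
    centers = np.stack([X[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
J_K = sum(((X[:, labels == k] - centers[:, [k]]) ** 2).sum() for k in range(K))

print(lower_bound <= J_K, round(lower_bound, 2), round(J_K, 2))
```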

Page 49

Consistency: 2-way and K-way Approaches

The orthogonal transform T maps $(h_1, h_2)$ to $(q_1, q_2)$ and recovers the original 2-way cluster indicator:

$h_1 = (1, \ldots, 1, 0, \ldots, 0)^T$, $h_2 = (0, \ldots, 0, 1, \ldots, 1)^T$

$q_1 = (1, \ldots, 1)^T$, $q_2 = (a, \ldots, a, -b, \ldots, -b)^T$, with $a = \sqrt{n_2 / n n_1}$, $b = \sqrt{n_1 / n n_2}$

$T = \begin{pmatrix} \sqrt{n_1/n} & \sqrt{n_2/n} \\ \sqrt{n_2/n} & -\sqrt{n_1/n} \end{pmatrix}$

Page 50

Test of the lower bound for K-means clustering: the quantity reported is $|J_{opt} - J_{LB}| / J_{opt}$.

The lower bound is within 0.6-1.5% of the optimal value.

Page 51

Cluster Subspace (spanned by the K centroids) = PCA Subspace

Given a data point x, $P = \sum_k c_k c_k^T$ projects x into the cluster subspace.

The centroid is given by $c_k = \sum_i h_k(i)\, x_i = X h_k$, so

$P = \sum_k c_k c_k^T = \sum_k X h_k h_k^T X^T \to \sum_k X v_k v_k^T X^T = \sum_k \lambda_k u_k u_k^T$

$P_{K\text{-means}} = \sum_k \lambda_k u_k u_k^T \ \Leftrightarrow\ \sum_k u_k u_k^T \equiv P_{PCA}$

PCA automatically projects into the cluster subspace.

PCA is the unsupervised version of LDA.

Page 52

Effectiveness of PCA Dimension Reduction

Page 53

Kernel K-means Clustering

Kernel K-means objective, with the map $x_i \to \phi(x_i)$:

$\min J_K^\phi = \sum_{k=1}^{K} \sum_{i \in C_k} \| \phi(x_i) - c_k^\phi \|^2$

$J_K^\phi = \sum_i \| \phi(x_i) \|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)$

$\Rightarrow \max \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle$

Page 54

Kernel K-means clustering is equivalent to Kernel PCA

The continuous optimal solution for the cluster indicators is given by the Kernel PCA components.

The subspace spanned by the K cluster centroids is given by the Kernel PCA principal subspace.