Laplacian PCA and Its Applications

Deli Zhao Zhouchen Lin Xiaoou Tang

Visual Computing Group, Microsoft Research Asia, Beijing, China
{i-dezhao,zhoulin,xitang}@microsoft.com

Abstract

Dimensionality reduction plays a fundamental role in data processing, for which principal component analysis (PCA) is widely used. In this paper, we develop the Laplacian PCA (LPCA) algorithm, which extends PCA to a more general form by locally optimizing weighted scatters. In addition to the simplicity of PCA, the benefits brought by LPCA are twofold: strong robustness against noise and weak metric-dependence on sample spaces. The LPCA algorithm is based on the global alignment of locally Gaussian or linear subspaces via an alignment technique borrowed from manifold learning. Based on the coding length of local samples, the weights can be determined to capture the local principal structure of data. We also give an exemplary application of LPCA to manifold learning. Manifold unfolding (non-linear dimensionality reduction) can be performed by the alignment of tangential maps, which are linear transformations of tangent coordinates approximated by LPCA. The superiority of LPCA to PCA and kernel PCA is verified by experiments on face recognition (FRGC version 2 face database) and manifold (Scherk surface) unfolding.

1. Introduction

Principal component analysis (PCA) [13] is widely used in computer vision, pattern recognition, and signal processing. In face recognition, for example, PCA is performed to map samples into a low-dimensional feature space where the new representations are viewed as expressive features [21, 26, 22]. Discriminators like LDA [2, 29], LPP [12], and MFA [31] are then performed in the PCA-transformed spaces. In active appearance models (AAM) [8] and 3D morphable models [5], the textures and shapes of faces are compressed in the PCA-learnt texture and shape subspaces, respectively. Deformation and matching between faces are performed using these texture and shape features. In manifold learning, the tangent spaces of a manifold [14] are represented by PCA subspaces and the tangent coordinates are the PCA features.

Figure 1. Principal subspaces by PCA (a) and LPCA (b) (K = 5). Dotted lines denote the principal directions of unperturbed data (first row). Solid lines denote the principal directions of perturbed data (second and third rows).

The representative algorithms in manifold learning, like Hessian Eigenmaps [9], local tangent space alignment (LTSA) [33], S-Logmaps [7], and Riemannian normal coordinates (RNC) [15], are all based on tangent coordinates. In addition, K-means, the classical algorithm for clustering, was proven equivalent to PCA in a relaxed condition [32]. Thus, PCA features can be naturally adopted for clustering. The performance of the algorithms mentioned above is determined by the subspaces and the features yielded by PCA. There are also variants of PCA, such as probabilistic PCA [24], kernel PCA (KPCA) [20], robust PCA [25], weighted PCA [17], and generalized PCA [27].

However, PCA has some limitations as well. First, PCA is sensitive to noise, meaning that noisy samples may incur a significant change of the principal subspaces. Figure 1 (a) illustrates an example. We can clearly observe the instability of PCA with perturbed sample points. To address this issue, the robust PCA algorithm [25] was proposed, but at the sacrifice of the simplicity of PCA. The weighted PCA [17] was developed to perform smoothing on local patches of data in manifold learning. The authors used an iterative approach to compute the weights, whose convergence cannot be guaranteed.

Besides, the weighted PCA in [17] is performed on local patches of data; the authors did not discuss how to derive a global projection matrix from the locally weighted scatters. Second, in principle, PCA is only reasonable for samples in Euclidean spaces, where distances between samples are measured by l2 norms. For non-Euclidean sample spaces, the scatter of samples cannot be represented by the summation of Euclidean distances. For instance, histogram features are non-Euclidean; their distances are better measured by the Chi square. Therefore, the principal subspaces of such samples cannot be optimally obtained by the traditional PCA. The KPCA algorithm was designed for extracting principal components of samples whose underlying spaces are non-Euclidean. However, KPCA cannot explicitly produce the principal subspaces of samples, which are required in many applications. Besides, KPCA is also sensitive to noisy data because its optimization criterion is intrinsically equivalent to that of PCA.

In this paper, we aim at enhancing the robustness of PCA and, at the same time, freeing it from the limitation of metrics, with a Laplacian PCA (LPCA) algorithm. Different from the conventional PCA, we first formulate the scatter of samples on local patches of the data as a weighted summation of distances. The local scatter can be expressed in a compact form like the global scatter of the traditional PCA. Furthermore, we formulate a general framework for aligning local scatters into a global one. The alignment framework is also applicable to methods based on spectral analysis in manifold learning. The optimal principal subspace can be obtained by solving a simple eigen-decomposition problem. Moreover, an efficient approach is provided for computing local LPCA features, which are frequently utilized as tangent coordinates in manifold learning. As an application of LPCA, we develop tangential maps of manifolds based on tangent coordinates approximated by local LPCA. Particularly, we determine the weights locally by investigating the reductive coding length [16] of a local data patch, which is the variation of the coding length of a data set when one point is left out. Hence, the principal structures of the data can be locally captured in this way.

Experiments are performed on face recognition and manifold unfolding to test LPCA. Face recognition is conducted on a subset of FRGC version 2 [18]. Three representative discriminators, LDA, LPP, and MFA, are performed on LPCA and PCA expressive features. The results indicate that the recognition performance of the three discriminators based on LPCA is consistently better than that based on PCA. Besides, we perform dimensionality reduction on LBP non-Euclidean features using LPCA, PCA, and KPCA; LPCA shows significant superiority to PCA and KPCA. For manifold learning, we introduce the Scherk surface [4] as a new example for manifold unfolding. LPCA-based tangential maps yield faithful embeddings with or without noise.

$I$: the identity matrix.
$e$: the all-one column vector.
$H$: $H = I - \frac{1}{n} e e^T$ is a centering matrix.
$\mathbb{R}^D$: the $D$-dimensional Euclidean space.
$x_i$: the $i$-th sample in $\mathbb{R}^D$, $i = 1, \dots, n$.
$S^x$: $S^x = \{x_1, \dots, x_n\}$.
$X$: $X = [x_1, \dots, x_n]$.
$\bar{x}$: $\bar{x} = \frac{1}{n} X e$ is the center of the sample points.
$x_{i_k}$: the $k$-th nearest neighbor of $x_i$, $k = 1, \dots, K$.
$S^x_i$: $S^x_i = \{x_{i_0}, x_{i_1}, \dots, x_{i_K}\}$, $i_0 = i$.
$X_i$: $X_i = [x_{i_0}, x_{i_1}, \dots, x_{i_K}]$.
$y_i$: the representation of $x_i$ in $\mathbb{R}^d$, $d < D$.
$\mathrm{tr}$: the trace of a matrix.
$\det$: the determinant of a matrix.
$X^T$: the transpose of $X$.
$W_i$: $W_i = \mathrm{diag}(w_{i_0}, \dots, w_{i_K})$ is a diagonal matrix.
$H_w$: $H_w = I - \frac{W_i e e^T}{e^T W_i e}$, a weighted centering matrix.

Table 1. Notations

However, PCA-based tangential maps fail for noisy manifolds.

2. Laplacian PCA

The criterion of LPCA is to maximize the local scatter of the data instead of the global scatter pursued by PCA. The scatter is the summation of weighted distances between the low-dimensional representations of the original samples and their means. Like PCA, we aim at finding a global projection matrix $U$ such that
$$y = U^T(x - \bar{x}), \qquad (1)$$
where $U$ is of size $D \times d$. In matrix form, we can write $Y = U^T(XH)$. (The size of $I$ and the length of $e$ are easily known from the context.) In the following subsections, we present the formulations of local LPCA, the alignment of local LPCA, the global LPCA, and the efficient computation of local LPCA features. The notations used in this paper are listed in Table 1.

2.1. Local LPCA

For non-Gaussian or manifold-valued data, we usually work on local patches, because non-Gaussian data can be viewed as locally Gaussian and a curved manifold can be locally viewed as Euclidean [14]. In particular, the Gaussian distribution is the theoretical basis of many statistical operations [11], and tangent spaces and tangent coordinates are the fundamental descriptors of a manifold. So, we begin with local LPCA.

Specifically, let $\alpha_i$ denote the local scatter on the $i$-th neighborhood $S^y_i$. It is defined as
$$\alpha_i = \sum_{k=0}^{K} w_{i_k} \|y_{i_k} - \bar{y}_i\|^2_{\mathbb{R}^d}, \qquad (2)$$

where $w_{i_k}$ is the related weight and $\bar{y}_i$ is the geometric centroid of $S^y_i$, i.e., $\bar{y}_i = Y_i W_i e / (e^T W_i e)$. We will present the definition of $w_{i_k}$ in Section 4. The distance between $y_{i_k}$ and $\bar{y}_i$ is measured by the $l^2$ norm $\|\cdot\|_{\mathbb{R}^d}$. Rewriting (2) yields
$$\alpha_i = \sum_{k=0}^{K} w_{i_k}\, \mathrm{tr}\big((y_{i_k} - \bar{y}_i)(y_{i_k} - \bar{y}_i)^T\big) \qquad (3)$$
$$\;\; = \mathrm{tr}(Y_i W_i Y_i^T) - \frac{\mathrm{tr}(Y_i W_i e e^T W_i Y_i^T)}{e^T W_i e}. \qquad (4)$$
Thus we obtain
$$\alpha_i = \mathrm{tr}(Y_i L_i Y_i^T), \qquad (5)$$
where
$$L_i = W_i - \frac{W_i e e^T W_i}{e^T W_i e} \qquad (6)$$
is called the local Laplacian scatter matrix. For $Y_i$, we have
$$Y_i = U_i^T (X_i - \bar{x}_i e^T) = U_i^T (X_i H_w), \qquad (7)$$
where $\bar{x}_i$ is the geometric centroid of $S^x_i$. Plugging (7) into (5) gives
$$\alpha_i = \mathrm{tr}(U_i^T X_i H_w L_i H_w^T X_i^T U_i). \qquad (8)$$
It is not hard to check that $H_w L_i H_w^T = L_i$. So we get the final expression of the local scatter:
$$\alpha_i = \mathrm{tr}(U_i^T S^i_l U_i), \qquad (9)$$
where $S^i_l = X_i L_i X_i^T$ is the local scatter matrix of $S^x_i$. Imposing the orthogonality constraint on $U_i$, we arrive at the following maximization problem:
$$\arg\max_{U_i} \alpha_i = \arg\max_{U_i} \mathrm{tr}(U_i^T X_i L_i X_i^T U_i), \quad \text{s.t. } U_i^T U_i = I. \qquad (10)$$
$U_i$ is essentially the principal column space of $S^i_l$, i.e., the space spanned by the eigenvectors associated with the $d$ largest eigenvalues of $S^i_l$. We will present an efficient method for the computation in Section 2.4.

2.2. Alignment of Local Geometry

If we aim at deriving the global $Y$ or the global projection $U$, then global analysis can be performed by aligning the localities. In the traditional approach, the Gaussian mixture model (GMM) [11], along with the EM scheme, is usually applied to fulfill this task (probabilistic PCA, for instance). For spectral methods, however, there is a simpler approach. Here, we present a unified alignment framework for spectral methods, by which the optimal solution can be obtained in closed form by eigen-analysis.

In general, an optimization of the following form, like (5), is involved on local patches:
$$\arg\max_{Y_i} \mathrm{tr}(Y_i L_i Y_i^T), \qquad (11)$$
where $L_i$ is the local Laplacian scatter matrix. (In fact, $L_i$ can be an arbitrary matrix that embodies the geometry of the data on the local patch.) For each $S^y_i$, we have $S^y_i \subset S^y$, meaning that $\{y_{i_0}, y_{i_1}, \dots, y_{i_K}\}$ are always selected from $\{y_1, \dots, y_n\}$. What is more, the selection labels are known from the nearest-neighbor search. Thus, we can write $Y_i = Y S_i$, where $S_i$ is the $n \times (K+1)$ binary selection matrix associated with $S^y_i$. Let $I_i = \{i_0, i_1, \dots, i_K\}$ denote the label set. The structure of $S_i$ is given by
$$(S_i)_{pq} = \begin{cases} 1 & \text{if } p = i_{q-1} \\ 0 & \text{otherwise} \end{cases}, \quad i_{q-1} \in I_i,\; q = 1, \dots, K+1, \qquad (12)$$
meaning that $(S_i)_{pq} = 1$ if the $q$-th vector in $Y_i$ is the $p$-th vector in $Y$. Then rewriting (11) gives
$$\arg\max_{Y} \mathrm{tr}(Y S_i L_i S_i^T Y^T). \qquad (13)$$
Such a maximization must be performed for each $S^y_i$. So we have the following problem:
$$\arg\max_{Y} \sum_{i=1}^{n} \mathrm{tr}(Y S_i L_i S_i^T Y^T) = \arg\max_{Y} \mathrm{tr}(Y L Y^T), \qquad (14)$$
where $L = \sum_{i=1}^{n} S_i L_i S_i^T$ is called the global Laplacian scatter matrix. The expression of $L$ implies that, starting from a zero matrix of the same size, $L$ can be obtained by the updates $L(I_i, I_i) \leftarrow L(I_i, I_i) + L_i$, $i = 1, \dots, n$.

The alignment technique presented here is implicitly contained in [9], formulated (a little differently from ours) in [6] and [33], and applied in [34, 37, 36]. Therefore, it is also capable of aligning general local geometry matrices in manifold learning [35].
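The update $L(I_i, I_i) \leftarrow L(I_i, I_i) + L_i$ is just an indexed accumulation; below is a minimal NumPy sketch of it, with a hypothetical interface in which each patch supplies its index set and its local Laplacian scatter matrix.

```python
import numpy as np

def align_global_laplacian(n, patches):
    """Alignment step of Eq. (14): L = sum_i S_i L_i S_i^T.

    n       : total number of samples.
    patches : iterable of (Ii, Li), where Ii lists the K+1 sample indices
              of patch i and Li is its (K+1, K+1) local Laplacian scatter.
    """
    L = np.zeros((n, n))
    for Ii, Li in patches:
        idx = np.asarray(Ii)
        # The product S_i L_i S_i^T reduces to indexed accumulation.
        L[np.ix_(idx, idx)] += Li
    return L
```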

2.3. LPCA

For LPCA, our goal is to derive a global projection matrix. To this end, we plug $Y = U^T(XH)$ into (14) once the global Laplacian scatter matrix is ready, which gives the expression of the global scatter. Thus we obtain the following maximization problem:
$$\arg\max_{U} \mathrm{tr}(U^T X H L H X^T U), \quad \text{s.t. } U^T U = I. \qquad (15)$$
Similar to the optimization in (10), $U$ can be obtained by the eigen-decomposition of $XHLHX^T$.
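Continuing the sketch above, the global projection of Eq. (15) then comes from the top eigenvectors of $XHLHX^T$; again, this is only an illustrative NumPy fragment with names of our choosing.

```python
import numpy as np

def global_lpca(X, L, d):
    """Global LPCA projection, Eq. (15): maximize tr(U^T X H L H X^T U).

    X : (D, n) data matrix; L : (n, n) global Laplacian scatter matrix.
    """
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I - (1/n) e e^T
    S = X @ H @ L @ H @ X.T
    vals, vecs = np.linalg.eigh(S)
    U = vecs[:, np.argsort(vals)[::-1][:d]]   # eigenvectors of the d largest eigenvalues
    Y = U.T @ X @ H                           # global LPCA features
    return U, Y
```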

2.4. Efficient Computation

For real data, however, the dimension of $x_i$ is large, so it is computationally expensive to compute $U_i$ in (10) by the eigen-decomposition of $S^i_l$. Fortunately, the computation of the local $Y_i$ in (10) can be significantly simplified via SVD [10] in the case of $K \ll D$.

For the local Laplacian scatter matrix $L_i$ and the global Laplacian scatter matrix $L$, it is easy to verify that $L_i e = 0$ and $L e = 0$, implying that they have zero eigenvalues whose corresponding eigenvectors are the all-one vectors. Thus, $Y_i$ in (10) and $Y$ in (15) are all centered at the origin. For $L_i$, it is not hard to check that $L_i = \tilde{L}_i \tilde{L}_i^T$, where
$$\tilde{L}_i = H_w W_i^{\frac{1}{2}}. \qquad (16)$$
Then the local scatter matrix $S^i_l$ can be rewritten as $S^i_l = X_i \tilde{L}_i (X_i \tilde{L}_i)^T$, and we have the following theorem:

Theorem 1. Let the $d$-truncated SVD of the tall-skinny matrix $X_i \tilde{L}_i$ be $X_i \tilde{L}_i = P_i D_i Q_i^T$. Then the left singular matrix $P_i$ is the local projection matrix $U_i$, and the local coordinates are $Y_i = (Q_i D_i)^T W_i^{-\frac{1}{2}}$.

By Theorem 1, the computational complexity of obtaining $U_i$ and $Y_i$ is reduced from $O(D^3)$ to $O(DK^2)$. Such a speedup is critical for computing tangent coordinates in manifold learning.
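To make Theorem 1 concrete, here is a hedged NumPy sketch of the SVD-based computation (the function name is ours; strictly positive weights are assumed so that $W_i^{-1/2}$ exists).

```python
import numpy as np

def local_lpca_svd(Xi, w, d):
    """Efficient local LPCA via Theorem 1.

    Builds Lt = H_w W_i^{1/2}, takes the d-truncated SVD
    X_i Lt = P_i D_i Q_i^T, and returns U_i = P_i and
    Y_i = (Q_i D_i)^T W_i^{-1/2}; cost O(D K^2) instead of O(D^3).
    """
    K1 = Xi.shape[1]
    Hw = np.eye(K1) - np.outer(w, np.ones(K1)) / w.sum()
    Lt = Hw @ np.diag(np.sqrt(w))                  # \tilde{L}_i = H_w W_i^{1/2}
    P, s, QT = np.linalg.svd(Xi @ Lt, full_matrices=False)
    P, s, Q = P[:, :d], s[:d], QT[:d, :].T         # d-truncated SVD
    Yi = (Q * s).T @ np.diag(1.0 / np.sqrt(w))     # Y_i = (Q_i D_i)^T W_i^{-1/2}
    return P, Yi, Q                                # P plays the role of U_i
```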

3. Applications to Manifold Learning

In many cases, the data set $S^x$ is manifold-valued. Low-dimensional representations can then be obtained by non-linear embeddings of the original points. Here, we formulate tangential maps between manifolds to fulfill such tasks. The tangent spaces and tangent coordinates of a manifold are approximated by local LPCA.

3.1. Tangential Maps

For a $d$-dimensional Riemannian manifold $\mathcal{M}^d$, its tangent space at each point is isomorphic to the Euclidean space $\mathbb{R}^d$ [14]. Thus linear transformations are allowable between the tangent spaces of $\mathcal{M}^d$ and $\mathbb{R}^d$. Given a set of points $S^x$ sampled from $\mathcal{M}^d$, the parameterization of $\mathcal{M}^d$ can be performed by tangential maps, where $x_i$ is viewed as the natural coordinate representation in the ambient space $\mathbb{R}^D$ in which $\mathcal{M}^d$ is embedded.

With a little abuse of notation, we let $\tilde{S}^y_i = \{\tilde{y}_{i_0}, \tilde{y}_{i_1}, \dots, \tilde{y}_{i_K}\}$ denote the low-dimensional representation yielded by the local LPCA of $S^x_i$ (tildes distinguish these local coordinates from the global representation), where the extra constraint $d < K$ should be imposed. The global representation $S^y_i$ is obtained via the following linear transformation of $\tilde{S}^y_i$:
$$Y_i H_w = A_i \tilde{Y}_i + E_i, \qquad (17)$$
where $E_i$ is the error matrix and $A_i$ is the $d \times d$ Jacobian matrix to be determined. Here, $Y_i$ is centered by $H_w$ because the center of $\tilde{S}^y_i$ lies at the origin. To derive the optimal $Y_i$, we need to minimize $E_i$, thus giving
$$\arg\min_{Y_i} \|E_i\|^2 = \arg\min_{Y_i} \|Y_i H_w - A_i \tilde{Y}_i\|^2. \qquad (18)$$
For the Jacobian matrix, we have $A_i = Y_i H_w \tilde{Y}_i^{\dagger}$, where $\dagger$ denotes the Moore-Penrose inverse [10] of a matrix. Plugging it into (18) and expanding the norm yields
$$\arg\min_{Y_i} \mathrm{tr}(Y_i Z_i Y_i^T), \qquad (19)$$
where
$$Z_i = H_w (I - \tilde{Y}_i^{\dagger} \tilde{Y}_i)(I - \tilde{Y}_i^{\dagger} \tilde{Y}_i)^T H_w^T. \qquad (20)$$
What we really need is the global representation $Y$ instead of the local $Y_i$. So the alignment technique presented in Section 2.2 is applied to align the local representations into a global one.

To make the optimization well-posed, we need a constraint on $Y$; let it be $YY^T = I$. Putting everything together, we get a well-posed and easily solvable minimization problem:
$$\arg\min_{Y} \mathrm{tr}(Y L Y^T), \quad \text{s.t. } YY^T = I, \qquad (21)$$
where $L = \sum_{i=1}^{n} S_i Z_i S_i^T$. Again, the optimization can be solved by the spectral decomposition of $L$: the $d$-column matrix $Y^T$ consists of the $d$ eigenvectors associated with the $d$ smallest nonzero eigenvalues of $L$. Thus, we complete a general framework of tangential maps.
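Given the per-patch matrices $Z_i$ and their index sets, Eq. (21) reduces to the same alignment-plus-eigendecomposition pattern as before; the sketch below uses our own naming and assumes the trivial constant eigenvector is the single zero mode to discard.

```python
import numpy as np

def embed_from_patches(n, patches, d):
    """Tangential-map embedding, Eq. (21).

    patches : iterable of (Ii, Zi), Ii the K+1 sample indices of patch i,
              Zi its (K+1, K+1) alignment matrix from Eq. (20).
    """
    L = np.zeros((n, n))
    for Ii, Zi in patches:
        idx = np.asarray(Ii)
        L[np.ix_(idx, idx)] += Zi                # L = sum_i S_i Z_i S_i^T
    vals, vecs = np.linalg.eigh(L)
    # Skip the constant eigenvector (eigenvalue 0); take the next d.
    Y = vecs[:, 1:d + 1].T                       # d x n, satisfies Y Y^T = I
    return Y
```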

3.2. LPCA Based on Tangential Maps

In general, the principal subspace of the data set $S^x_i$ is employed as the approximation of the tangent space at the point $x_i$. Thus, a more robust approximation of the tangent space can provide better results of manifold unfolding. For LPCA, moreover, we can obtain $Z_i$ without the explicit computation of $\tilde{Y}_i^{\dagger}$ by the following theorem:

Theorem 2. $Z_i = H_w \big(I - \tilde{Q}_i (\tilde{Q}_i^T \tilde{Q}_i)^{-1} \tilde{Q}_i^T\big) H_w^T$, where $\tilde{Q}_i = W_i^{-\frac{1}{2}} Q_i$.

The inverse of $\tilde{Q}_i^T \tilde{Q}_i$ can be handled efficiently because $\tilde{Q}_i^T \tilde{Q}_i$ is of size $d \times d$. The computation of $Z_i$ is also efficient, noting that $H_w$ is a rank-one modification of $I$ and $W_i$ is diagonal. Zhang and Zha [33] first developed the LTSA algorithm based on tangential maps to unfold manifolds, where the tangent spaces and tangent coordinates are derived by PCA. For LTSA, we have the following observation:

Proposition 1. LPCA-based tangential maps coincide with the LTSA algorithm if $W_i = I$.

Therefore, the framework formulated here is a generalization of Zhang and Zha's LTSA [33].
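A sketch of Theorem 2 in NumPy follows, reusing the right singular vectors $Q_i$ returned by the local SVD sketch above; the helper name is ours and positive weights are assumed.

```python
import numpy as np

def patch_alignment_matrix(Q, w):
    """Z_i from Theorem 2, avoiding the pseudo-inverse of Y_i.

    Q : (K+1, d) right singular vectors from the local SVD (Theorem 1).
    w : (K+1,) array of positive weights of the patch.
    """
    K1 = Q.shape[0]
    Hw = np.eye(K1) - np.outer(w, np.ones(K1)) / w.sum()
    Qt = Q / np.sqrt(w)[:, None]                  # \tilde{Q}_i = W_i^{-1/2} Q_i
    # Projector onto col(Qt); only a d x d system is solved.
    P = Qt @ np.linalg.solve(Qt.T @ Qt, Qt.T)
    return Hw @ (np.eye(K1) - P) @ Hw.T
```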

4. Definition of Weights

For traditional methods [17, 12], weights are determined by exponentials of Euclidean distances or their analogues.

We will show that such pairwise-distance-based dissimilarities cannot robustly capture the principal structure of data. So, we introduce the reductive coding length as a new dissimilarity that is compatible with the intrinsic structure of the data.

4.1. Reductive Coding Length

The coding length [16] $L(S^x_i)$ of a vector-valued set $S^x_i$ is an intrinsic structural characterization of the set (its definition is given in the Appendix). We notice that if a point $x_{i_k}$ complies with the structure of $S^x_i$, then removing $x_{i_k}$ from $S^x_i$ will not affect the structure much. In contrast, if the point $x_{i_k}$ is an outlier or a noise point, then removing $x_{i_k}$ from $S^x_i$ will change the structure significantly. This motivates us to define the variation of the coding length as the structural descriptor between $x_{i_k}$ and $S^x_i$. The reductive variation of $L(S^x_i)$ with and without $x_{i_k}$ is defined as
$$\delta L_{i_k} = \big| L(S^x_i) - L(S^x_i \setminus \{x_{i_k}\}) \big|, \quad k = 0, 1, \dots, K, \qquad (22)$$
where $|\cdot|$ denotes the absolute value of a scalar. Thus, the weight $w_{i_k}$ in (2) can be defined as
$$w_{i_k} = \exp\left(-\frac{(\delta L_{i_k} - \overline{\delta L}_i)^2}{2\sigma_i^2}\right), \qquad (23)$$
where $\overline{\delta L}_i$ and $\sigma_i$ are the mean and the standard deviation of $\{\delta L_{i_0}, \dots, \delta L_{i_K}\}$, respectively.

In fact, the reductive coding length is a kind of contextual distance; one can refer to [36] for more details.
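The weighting scheme of Eqs. (22)-(23) is straightforward to sketch in NumPy; `coding_length` below stands for any implementation of $L(\cdot)$ from [16] (a sketch is given after Eq. (24) in the Appendix), and the small constant guarding against zero variance is our own addition.

```python
import numpy as np

def reductive_weights(Xi, coding_length):
    """Weights from the reductive coding length, Eqs. (22)-(23).

    Xi            : (D, K+1) local patch, columns x_{i_0}, ..., x_{i_K}.
    coding_length : callable implementing L(.) of [16].
    """
    K1 = Xi.shape[1]
    L_full = coding_length(Xi)
    # delta L_{i_k} = |L(S) - L(S \ {x_{i_k}})|, Eq. (22)
    dL = np.array([abs(L_full - coding_length(np.delete(Xi, k, axis=1)))
                   for k in range(K1)])
    mu, sigma = dL.mean(), dL.std()
    # w_{i_k} = exp(-(dL - mean)^2 / (2 sigma^2)), Eq. (23)
    return np.exp(-(dL - mu) ** 2 / (2.0 * sigma ** 2 + 1e-12))
```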

4.2. Coding Length vs. Traditional Distance

We compare the difference between the reductive coding length and the traditional pairwise distance using a toy example.

From Figure 2 (a), we observe that, using the reductive coding length, the perturbed point (bottom) is lightly weighted whereas the five points that are consistent with the principal structure are heavily weighted. As shown in Figure 2 (c), the local principal direction (solid line) learnt by LPCA based on the reductive coding length is highly consistent with the global principal structure (dotted line).

In contrast, as shown in Figure 2 (b), it seems promising that the perturbed point is very lightly weighted. However, the two significant points (pointed to by the arrows), which are important to the principal structure, are also lightly weighted. Thus, the local principal direction is mainly governed by the three central points. As a result, the principal direction (dotted line in Figure 2 (d)) learnt by LPCA based on the pairwise Euclidean distance cannot capture the principal structure of the data. Note that, based on the reductive coding length, the two significant points are the most heavily weighted (Figure 2 (a)).

Figure 2. Illustrations of reductive coding length vs. pairwise Euclidean distance on one of the local patches (red circle markers) of the toy data. (a) and (b) illustrate the weights computed by reductive coding length and pairwise Euclidean distance, respectively. In (b), the green square marker denotes the geometric center instead of the physical centroid. (c) and (d) illustrate the local principal directions (solid lines) learnt by LPCA based on reductive coding length and pairwise Euclidean distance, respectively.

Figure 3. Facial images of one subject for our experiment on FRGC version 2. The first five facial images are in the gallery set and the others are in the probe set.

5. Experiment

We perform experiments on face recognition and manifold unfolding to compare the performance of our LPCA algorithm with that of existing related methods.

5.1. Face Recognition

Face database. We perform face recognition on a subset of the facial data in FRGC version 2 [18]. The query set for Experiment 4 of this database consists of single uncontrolled still images, which contain variations of illumination, expression, time, and blurring. There are 8014 images of 466 subjects in the set. However, only two facial images are available for some persons, so we select a subset for our experiments. We search all images of each person in the set and take the first 10 facial images if the number of facial images is not less than 10. Thus we get 3160 facial images of 316 subjects, which are then divided into three subsets. First, the first 200 subjects are used for the gallery and probe sets, and the remaining 116 subjects are used as the training set. Second, we take the first five facial images of each of the first 200 subjects as the gallery set and the remaining five images as the probe set. Therefore, the set of persons for training is disjoint from that of persons in the gallery and probe sets.

Figure 4. The performance of LPCA and PCA as expressive feature extractors for face recognition (recognition rate vs. feature dimension). We reduce the dimension of the original facial images to 290; thus LPCA and PCA preserve 95% and 98.8% of the power, respectively. Here, the power is defined as the ratio of the sum of the eigenvalues corresponding to the applied eigenvectors to the trace of the scatter matrix. (a) LDA. (b) LPP (K = 2). (c) MFA (k1 = 2, k2 = 20). (d) Random sampling subspace LDA, with four hundred eigenvectors computed. As in [30], we take the first 50 eigenvectors as the base and randomly sample another 100 eigenvectors. Twenty LDA classifiers are designed.

We align the facial images according to the positions of the eyes and mouths. Each facial image is then cropped to a size of 64 × 72. Figure 3 shows the ten images of one subject. The nearest-neighbor classifier is adopted. For the experiments in this subsection, K = 2 for the LPCA method.

Note that the experiments in this section are not designed to achieve the highest possible recognition performance. Rather, the goal is to compare LPCA with PCA or KPCA in the same role.

Dimensionality reduction as expressive features. PCA plays an important role in the development of discriminators for face recognition. These discriminators solve generalized eigen-decomposition problems of the form $Au = \lambda Bu$. Due to the small-sample-size problem, the matrix $B$ is usually singular, which makes the computation difficult. So, dimensionality reduction is first performed by PCA to extract expressive features [21], and the discriminators are then applied in the PCA-transformed space. Here we perform both PCA and LPCA to extract expressive features, and three representative discriminators, LDA [2], LPP [12], and MFA [31], are applied to extract discriminative features from these two kinds of expressive features, respectively. As shown in Figure 4 (a), (b), and (c), the recognition rates of the three discriminators based on LPCA are consistently higher than those based on PCA. Another application of PCA subspaces in face recognition is the random sampling strategy [28, 30]. Figure 4 (d) shows that the discriminative power of LDA based on the random sampling of LPCA subspaces is superior to that based on PCA subspaces. These results verify that robust expressive subspaces can significantly improve the recognition rates of discriminators.

Dimensionality reduction on non-Euclidean features. In image-based recognition, visual features are sometimes extracted as expressive ones. However, the dimension of a visual feature vector is usually high, which increases the storage and computation loads. To reduce these loads, dimensionality reduction is necessary. The LBP algorithm is a newly emerging approach which has proven superior for unsupervised visual feature extraction [1].

Figure 5. The performance of dimensionality reduction on LBP features (recognition rate vs. feature dimension) for LBP plus LPCA, LBP plus PCA, and LBP plus KPCA. The recognition rate of LBP is the baseline.

The LBP features are based on histograms, so the LBP feature space is non-Euclidean. A distance measure in such a space is often chosen as the Chi square, defined as
$$\chi^2(x_i, x_j) = \sum_{s=1}^{D} \frac{(x_i^s - x_j^s)^2}{x_i^s + x_j^s},$$
where $x_i^s$ is the $s$-th component of $x_i$. In this experiment, we compare the results of dimensionality reduction on LBP features by PCA, KPCA, and LPCA.
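For reference, the Chi-square distance above is a one-liner in NumPy; the small epsilon guarding empty histogram bins is our own addition.

```python
import numpy as np

def chi_square(xi, xj, eps=1e-12):
    """Chi-square distance between two histogram feature vectors:
    chi^2(xi, xj) = sum_s (xi_s - xj_s)^2 / (xi_s + xj_s)."""
    return float(np.sum((xi - xj) ** 2 / (xi + xj + eps)))
```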

We perform LBP on each facial image and then subdivide each facial image by a 7 × 7 grid. A 59-bin histogram is computed on each sub-block, and an LBP feature vector is obtained by concatenating the feature vectors of the sub-blocks. Here we use the 58 uniform patterns of LBP, each accounting for one bin; the remaining 198 binary patterns are all put into another bin, resulting in a 59-bin histogram. So, the number of tuples in an LBP feature vector is 59 × (7 × 7) = 2891. The (8, 2) LBP is adopted; namely, the number of circular neighbors for each pixel is 8 and the radius of the circle is 2. The above settings are consistent with those in [1]. As shown in Figure 5, with 250-dimensional features the recognition rate of LBP plus LPCA is already higher than that of LBP alone, which means that the dimensionality reduction is effective. In comparison, PCA and KPCA need higher-dimensional features. Overall, the performance of dimensionality reduction by LPCA is significantly better than that of PCA and KPCA.

Figure 6. Scherk surface and its faithful embeddings in the least-squares sense. (a) Scherk surface (b = 1). (b) Randomly sampled points (without noise points). (c) and (d) are the embeddings yielded by PCA-based tangential maps (LTSA) and LPCA-based tangential maps, respectively.

Figure 7. Noisy Scherk surface unfolding by tangential maps. (a) Randomly sampled points with noise points shown as crosses. (b) PCA-based tangential maps (LTSA). (c) LPCA-based tangential maps.

5.2. Manifold Learning

The following experiments mainly examine the capability of LPCA and PCA in approximating the tangent spaces and tangent coordinates of a manifold via tangential maps.

The manifold we use here is the Scherk surface (Figure 6 (a)), a classical minimal surface formulated as [4]
$$f(x, y) = \frac{1}{b} \ln \frac{\cos(bx)}{\cos(by)},$$
where $-\frac{\pi}{2} < bx < \frac{\pi}{2}$, $-\frac{\pi}{2} < by < \frac{\pi}{2}$, and $b$ is a positive constant. A minimal surface is a surface of zero mean curvature. Therefore, the Scherk surface cannot be isometrically parameterized by an open subset of the two-dimensional Euclidean space; faithful embeddings are only obtainable in the least-squares sense. So, it is more challenging to unfold the Scherk surface than the Swiss roll [23] (a surface of zero Gaussian curvature) that is widely used in manifold learning experiments. In addition, we think the randomly sampled points on the Scherk surface simulate the real-world distribution of random samples well: dense close to the center and sparse close to the boundary, like a noncurved Gaussian normal distribution [11]. These are our motivations for testing on this surface.

For each trial, 1200 points (including noise points) are randomly sampled from the surface, as shown in Figure 6 (b). For all trials in this subsection, K = 15 for all involved methods. As shown in Figure 6 (c) and (d), both PCA-based tangential maps (LTSA) and LPCA-based tangential maps result in faithful embeddings.

Figure 8. Noisy Scherk surface unfolding by existing representative methods. (a) Randomly sampled points with noise points shown as crosses. (b) Isomap. (c) LLE. (d) Laplacian Eigenmaps.

However, PCA-based tangential maps distort the embeddings (Figure 7 (b)) when noise points appear. We can clearly see that PCA-based tangential maps are very sensitive to noise: even a few noise points can change the result of the algorithm. In contrast, LPCA-based tangential maps show robustness against noise (Figure 7 (c)). These results imply that LPCA can yield faithful tangent spaces that are less affected by noise on the manifold. From Figure 8, we can see that LLE [19] and Laplacian Eigenmaps [3] produce unsatisfactory embeddings. Among the three existing representative algorithms in Figure 8, only the Isomap [23] algorithm works for this surface. (Actually, there is an outlier-detection procedure in the published Matlab code of Isomap, which is one of the reasons why Isomap is robust.)

6. Conclusion

The sensitivity to noise and the inapplicability to non-Euclidean samples are two major problems of the traditional PCA. To address these two issues, we propose a novel algorithm, named Laplacian PCA. LPCA is an extension of PCA that optimizes locally weighted scatters instead of the single global non-weighted scatter of PCA. The principal subspace is learnt by the alignment of local optimizations, for which a general alignment technique is formulated. Based on the coding length from information theory, we present a new approach to determining the weights. As an application, we formulate tangential maps in manifold learning via LPCA, which can be exploited for non-linear dimensionality reduction. Experiments are performed on face recognition and manifold unfolding, which testify to the superiority of LPCA over PCA and other variants of PCA, such as KPCA.

Acknowledgement

The authors would like to thank Yi Ma and John Wright for discussion, and Wei Liu for his comments.

Appendix

Coding Length. For the set of vectors $S^x_i$, the total number of bits needed to code $S^x_i$ is [16]

$$L(S^x_i) = \frac{K + 1 + D}{2} \log \det\!\left(I + \frac{D}{\varepsilon^2 (K + 1)}\, X_i H X_i^T\right) + \frac{D}{2} \log\!\left(1 + \frac{\bar{x}_i^T \bar{x}_i}{\varepsilon^2}\right), \qquad (24)$$
where $\varepsilon$ is the allowable distortion. In fact, the computation can be considerably reduced by the commutativity of the determinant,
$$\det(I + X_i H X_i^T) = \det(I + H X_i^T X_i H), \qquad (25)$$
in the case of $K + 1 \ll D$. It is worth noting that $X_i^T X_i$ in (25) becomes a kernel matrix if the kernel trick is exploited. The allowable distortion $\varepsilon$ in $L(S^x_i)$ is a free parameter. In all our experiments, we empirically choose $\varepsilon = (10n/K)^{\frac{1}{2}}$.
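For completeness, here is a hedged NumPy sketch of Eq. (24) that uses the smaller determinant of Eq. (25); it assumes base-2 logarithms as in [16] and takes the ambient dimension from the patch itself.

```python
import numpy as np

def coding_length(Xi, eps):
    """Coding length of a patch, Eq. (24), via the (K+1)x(K+1) determinant of Eq. (25).

    Xi  : (D, m) matrix whose m = K+1 columns are the points of the patch.
    eps : allowable distortion (the paper uses eps = sqrt(10 n / K)).
    """
    D, m = Xi.shape
    H = np.eye(m) - np.ones((m, m)) / m                 # centering matrix
    xbar = Xi.mean(axis=1)
    c = D / (eps ** 2 * m)
    # log det(I + c X_i H X_i^T) = log det(I + c H X_i^T X_i H), Eq. (25)
    sign, logdet = np.linalg.slogdet(np.eye(m) + c * (H @ Xi.T @ Xi @ H))
    first = (m + D) / 2.0 * logdet / np.log(2.0)        # base-2 logarithm
    second = D / 2.0 * np.log2(1.0 + float(xbar @ xbar) / eps ** 2)
    return first + second
```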

References

[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: application to face recognition. IEEE Trans. on PAMI, 28(12):2037-2041, 2006.
[2] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Trans. on PAMI, 19(7):711-720, 1997.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373-1396, 2003.
[4] M. Berger and B. Gostiaux. Differential Geometry: Manifolds, Curves, and Surfaces. Springer-Verlag, 1988.
[5] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Trans. on PAMI, 25(9):1063-1074, 2003.
[6] M. Brand. Charting a manifold. NIPS, 15, 2003.
[7] A. Brun, C. Westin, M. Herberthsson, and H. Knutsson. Sample logmaps: intrinsic processing of empirical manifold data. Proceedings of the Symposium on Image Analysis, 2006.
[8] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Trans. on PAMI, 23(6):681-685, 2001.
[9] D. Donoho and C. Grimes. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. PNAS, 100:5591-5596, 2003.
[10] G. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
[11] T. Hastie. The Elements of Statistical Learning. Springer, 2001.
[12] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang. Face recognition using Laplacianfaces. IEEE Trans. on PAMI, 27(3):328-340, 2005.
[13] T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York.
[14] J. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer-Verlag, 2003.
[15] T. Lin, H. Zha, and S. Lee. Riemannian manifold learning for nonlinear dimensionality reduction. ECCV, pages 44-55, 2006.
[16] Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Trans. on PAMI, 2007.
[17] J. Park, Z. Zhang, H. Zha, and R. Kasturi. Local smoothing for manifold learning. CVPR, 2004.
[18] P. Phillips, P. Flynn, T. Scruggs, and K. Bowyer. Overview of the face recognition grand challenge. CVPR, 2005.
[19] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[20] B. Scholkopf, A. Smola, and K. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[21] D. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Trans. on PAMI, 18(8):831-836, 1996.
[22] X. Tang and X. Wang. Face sketch recognition. IEEE Trans. on Circuits and Systems for Video Technology, 14(1):50-57, 2004.
[23] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[24] M. Tipping and C. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443-482, 2001.
[25] D. Torre and M. Black. Robust principal component analysis for computer vision. ICCV, pages 362-369, 2001.
[26] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71-86, 1991.
[27] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis. IEEE Trans. on PAMI, 27:1-15, 2005.
[28] X. Wang and X. Tang. Random sampling LDA for face recognition. CVPR, pages 259-265, 2004.
[29] X. Wang and X. Tang. A unified framework for subspace face recognition. IEEE Trans. on PAMI, 26(9):1222-1228, 2004.
[30] X. Wang and X. Tang. Random sampling for subspace face recognition. International Journal of Computer Vision, 70(1):91-104, 2006.
[31] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Trans. on PAMI, 29(1):40-51, 2007.
[32] H. Zha, C. Ding, M. Gu, X. He, and H. Simon. Spectral relaxation for K-means clustering. NIPS, 2001.
[33] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimensionality reduction by local tangent space alignment. SIAM Journal of Scientific Computing, 26:313-338, 2004.
[34] D. Zhao. Formulating LLE using alignment technique. Pattern Recognition, 39:2233-2235, 2006.
[35] D. Zhao. Numerical geometry of data manifolds. Master's thesis, Shanghai Jiao Tong University, Jan. 2006.
[36] D. Zhao, Z. Lin, and X. Tang. Contextual distance for data perception. ICCV, 2007.
[37] D. Zhao, Z. Lin, R. Xiao, and X. Tang. Linear Laplacian discrimination for feature extraction. CVPR, 2007.