



International Journal of Computer Vision
https://doi.org/10.1007/s11263-020-01307-0

Tensorized Multi-view Subspace Representation Learning

Changqing Zhang1 · Huazhu Fu2 · Jing Wang3 · Wen Li4 · Xiaochun Cao5 · Qinghua Hu1

Received: 21 April 2019 / Accepted: 10 February 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Self-representation based subspace learning has shown its effectiveness in many applications. In this paper, we promote the traditional subspace representation learning by simultaneously taking advantage of multiple views and a prior constraint. Accordingly, we establish a novel algorithm termed Tensorized Multi-view Subspace Representation Learning. To exploit different views, the subspace representation matrices of different views are regarded as a low-rank tensor, which effectively models the high-order correlations of multi-view data. To incorporate prior information, a constraint matrix is devised to guide the subspace representation learning within a unified framework. The subspace representation tensor equipped with a low-rank constraint elegantly models the complementary information among different views, reduces the redundancy of the subspace representations, and thus improves the accuracy of subsequent tasks. We formulate the model as a tensor nuclear norm minimization problem constrained with the ℓ2,1-norm and linear equalities. The minimization problem is efficiently solved by using an Augmented Lagrangian Alternating Direction Minimization method. Extensive experimental results on diverse multi-view datasets demonstrate the effectiveness of our algorithm.

Keywords Multi-view representation learning · Subspace clustering · Low-rank tensor · Constraint matrix

1 Introduction

Recently, data collected from various sources or represented by different types of features are available in many real-world applications (Sui et al. 2018; Zhang et al. 2018a; Yang et al. 2018; Li and Tang 2016; Li et al. 2018; Tong et al. 2005). For images, different types of features are usually extracted based on color, texture and edge. For web pages, different types of features could be extracted based on text, hyperlinks and possibly existing visual information.

Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.

B Xiaochun Cao
[email protected]

Changqing Zhang
[email protected]

Huazhu Fu
[email protected]

Jing Wang
[email protected]

Wen Li
[email protected]

Qinghua Hu
[email protected]

These different types of information can be considered as different views describing subjects. Different views describe samples from different perspectives, hence it is beneficial to integrate the information from multiple views for more comprehensive learning (Ding et al. 2018; de Sa 2005; Bickel and Scheffer 2004; Chaudhuri et al. 2009; Blaschko and Lampert 2008; Tang et al. 2009; Kumar et al. 2011; Cao et al. 2015a, b; Zhang et al. 2018b; Liu et al. 2019; Xie et al. 2018). Moreover, real-world data are usually associated with prior knowledge such as label information, which, if utilized, can improve the discriminability of the representation.

1 College of Intelligence and Computing, Tianjin Key Lab of Machine Learning, Tianjin University, Tianjin 300350, China
2 Inception Institute of Artificial Intelligence, Abu Dhabi, UAE
3 School of Computing and Mathematical Sciences, University of Greenwich, London SE10 9LS, UK
4 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
5 State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China


To utilize these two cues, in this work, we focus on advancing representation learning by making use of multiple views and a prior constraint within a unified framework.

Most existing multi-view clustering methods exploit different views with graph-based models. Typically, some early approaches address the "2-view" case (de Sa 2005; Bickel and Scheffer 2004; Chaudhuri et al. 2009). The method in de Sa (2005) relates two views with a bipartite graph, and the final clustering result is obtained by using a standard spectral clustering algorithm. The method in Bickel and Scheffer (2004) focuses on handling data with two conditionally independent views based on k-means clustering. To be applicable to data with three or more views, Linked Matrix Factorization (LMF) (Tang et al. 2009) fuses the information from multiple graphs, where a common factor is shared by all graphs and a view-specific factor is assigned to each individual graph. Some methods (Kumar et al. 2011; Kumar and Daumé III 2011) co-regularize or co-train different views to enforce consistency among multiple views. The common-space based models (Blaschko and Lampert 2008; Chaudhuri et al. 2009) usually focus on learning a common representation by using Canonical Correlation Analysis (CCA) to project multiple views onto a low-dimensional common subspace, followed by conventional clustering algorithms. Recently, some methods (Zhan et al. 2017; Wang et al. 2019) have been proposed for multi-graph fusion with a rank constraint on the Laplacian matrix, so that the cluster indicators are directly obtained from the global graph without performing any post-processing (e.g., k-means clustering). Based on self-representation subspace learning, several multi-view subspace learning methods (Cao et al. 2015a; Zhang et al. 2017b, 2020) have been proposed, which usually jointly learn multiple subspace representation matrices or one unified subspace representation matrix.

Although great progress has been achieved, there are still two limitations of existing methods: (1) previous approaches usually capture pairwise correlations between different views, ignoring the essentially high-order relationship of multi-view data; (2) for multi-view representation learning, there are usually prior constraints (e.g., must-link constraints or partial label information) which could improve the learned multi-view representation; however, exploiting them is not guaranteed in existing multi-view clustering approaches.

To address these issues, we propose a novel method termed Tensorized Multi-view Subspace Representation Learning (TMSRL), which is outlined in Fig. 1. The whole procedure includes the following two aspects. Firstly, the proposed TMSRL regards the subspace representations of all views as a high-order structure, i.e., a 3-order tensor. To model the high-order correlation across different views, the tensor is enforced to be low-rank to enhance the consistency and reduce the redundancy of these multiple subspace representations. Secondly, to incorporate the prior information and thus guide the representation learning, a constraint matrix is introduced into our framework. Therefore, the learned representation benefits from both the complementarity of multiple views and the effective prior constraint. Notably, in our model, the high-order correlation indicates the linear correlation obtained by simultaneously considering all views instead of in a pairwise manner. The well-known Canonical Correlation Analysis (CCA) and its variants are designed for multi-view representation learning by maximizing the sum of pairwise correlations. Although this is a popular way in multi-view learning, it fails to incorporate higher-order correlations. The main contributions are summarized as follows:

– By integrating all the subspace representations of different views into a low-rank tensor, the proposed TMSRL captures the global structure of all views, and explores the correlations within each view and across multiple views. The proposed algorithm levels up conventional multi-view learning, which can only explore pairwise correlations.

– With a constraint matrix that encodes labels as hard constraints, TMSRL guarantees that data with the same label have the same subspace representation, which seamlessly utilizes the prior in the unified multi-view subspace representation framework and promotes subsequent tasks. The strategy of incorporating additional prior information is parameter-free, which makes the algorithm more applicable for practical applications.

– Extensive experiments on benchmark datasets demonstrate the effectiveness of exploring the high-order correlations among multiple views, and the effectiveness of incorporating constraints as well.

2 Related Work

2.1 Multi-view Learning

Different categories of multi-view learning algorithms have been proposed and applied in various applications. For example, graph fusion based methods (Tang et al. 2009; Tong et al. 2005) usually construct multiple graphs for multiple views and then fuse them into a common graph. Co-regularization based methods (Kumar et al. 2011; Wang et al. 2014) jointly regularize the hypotheses to explore the complementary information. Co-training based methods (Kumar and Daumé III 2011; Zhao et al. 2014) search for results that agree across different views. Multiple Kernel Learning (MKL) methods usually combine different kernels by adding them equally (Cortes et al. 2009) or learning the combination weights (Grigorios and Aristidis 2012). It is also noteworthy that there are models designed for real-world applications,


Fig. 1 Overview of Tensorized Multi-view Subspace Representation Learning (TMSRL). Given a collection of data points with multiple views, X(1) · · · X(V), our method integrates all the learned subspace representations, Z(1) · · · Z(V), into a low-rank tensor, Z, under the prior constraints. Then, TMSRL integrates the information from each individual view by exploring high-order correlations and utilizing the prior constraints as well, which jointly promote the multi-view subspace representation

including video face clustering (Cao et al. 2015b), medical diagnosis (Zhang et al. 2018a) and recommendation systems (Elkahky et al. 2015).

2.2 Multi-view Representation Learning

For multi-view representation learning, CCA-based algorithms basically maximize the correlation of two different views by mapping the original features into a common space. Generally, CCA can be expressed as an optimization problem over matrix variables as follows:

(P^(1), P^(2)) = argmax_{P^(1), P^(2)} tr(P^(1)T X^(1) X^(2)T P^(2))
s.t. P^(v)T X^(v) X^(v)T P^(v) = I, v = 1, 2,

where X^(v) = [x_1^(v), ..., x_N^(v)] ∈ R^{d_v×N} is the feature matrix corresponding to the vth view, with N and d_v being the number of samples and the dimensionality of the vth view, respectively. I is an identity matrix. P^(v) ∈ R^{d_v×k} is the projection matrix for the vth view, and k is the dimensionality of the common space. To address nonlinear correlations, the kernel extension of CCA was proposed.
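For illustration, a minimal NumPy sketch of this two-view CCA problem is given below; the whitening-based solver, the small ridge term and all variable names are assumptions for illustration rather than part of the original formulation.

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Symmetric inverse square root via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def linear_cca(X1, X2, k, reg=1e-6):
    """Two-view CCA: maximize tr(P1' X1 X2' P2) s.t. Pv' Xv Xv' Pv = I.
    Columns of X1 (d1 x N) and X2 (d2 x N) are samples, as in the text."""
    X1 = X1 - X1.mean(axis=1, keepdims=True)
    X2 = X2 - X2.mean(axis=1, keepdims=True)
    C11 = X1 @ X1.T + reg * np.eye(X1.shape[0])   # within-view covariances
    C22 = X2 @ X2.T + reg * np.eye(X2.shape[0])
    C12 = X1 @ X2.T                               # cross-view covariance
    W1, W2 = inv_sqrt(C11), inv_sqrt(C22)
    # Whitened cross-covariance; its top singular vectors give the projections.
    U, _, Vt = np.linalg.svd(W1 @ C12 @ W2)
    P1 = W1 @ U[:, :k]
    P2 = W2 @ Vt.T[:, :k]
    return P1, P2
```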

To utilize neural networks for more general correlations, Deep CCA (Andrew et al. 2013) jointly learns two deep neural networks (DNNs) for the different views, and the autoencoder based model (Ngiam et al. 2011) aims to obtain a compact representation which can well reconstruct the original input. Similar to CCA based methods, the flexible multi-view dimensionality co-reduction method (Zhang et al. 2017a) introduces the Hilbert-Schmidt independence criterion (HSIC) to exploit the correlations among different views:

(P^(1), ..., P^(V)) = argmax_{P^(v) ∈ R^{K^(v)×D^(v)}} Σ_{v=1}^{V} tr(P^(v) X^(v) L^(v) X^(v)T P^(v)T) + λ Σ_{v≠u} HSIC(P^(v) X^(v), P^(u) X^(u)),
s.t. P^(v) P^(v)T = I, v = 1, ..., V,

where L^(v) is the graph Laplacian for the vth view, and V is the number of views.
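As a small illustration of the inter-view coupling term, the following sketch computes a (biased) HSIC estimate between two projected views with linear kernels; the kernel choice and function names are assumptions for illustration.

```python
import numpy as np

def hsic(A, B):
    """Biased HSIC estimate between two projected views A (k1 x n), B (k2 x n),
    using linear kernels on the projected samples (columns)."""
    n = A.shape[1]
    K = A.T @ A                               # n x n kernel of view 1 projections
    L = B.T @ B                               # n x n kernel of view 2 projections
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```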


The hierarchical semi-nonnegative matrix factorization was proposed to obtain the semantics from multi-view data in a layer-wise manner (Zhao et al. 2017). The common representation of all views is obtained by enforcing the coefficients of different views in the final layer to be the same:

(P_1^(v), ..., P_L^(v), H) = argmin_{P_1^(v), ..., P_L^(v), H} ||X^(v) − P_1^(v) P_2^(v) · · · P_L^(v) H||_F^2  s.t. H ≥ 0,

where the feature matrix X^(v) is factorized into a hierarchical product of matrices P_1^(v), ..., P_L^(v) and the latent representation H.

2.3 Subspace Representation Learning

Our work is closely related to subspace clustering (Elhamifar and Vidal 2013; Liu et al. 2013a). Sparse Subspace Clustering (SSC) (Elhamifar and Vidal 2013) aims to find a sparse representation matrix, whose objective function is:

min ||Z||_1  s.t. X = XZ, diag(Z) = 0,   (1)

where Z is the subspace representation matrix and can be used for subsequent clustering or classification. Low-Rank Representation (LRR) (Liu et al. 2013a) introduces low-rank regularization to subspace clustering by solving the following problem:

min_{Z,E} ||Z||_* + λ||E||_{2,1},  s.t. X = XZ + E,   (2)

where E corresponds to the reconstruction error. Smooth Representation clustering (SMR) (Hu et al. 2014) underlines the importance of the grouping effect in subspace clustering, and the corresponding model is:

min_Z α||X − XZ||_F^2 + tr(Z L̃ Z^T),   (3)

where L̃ is the graph Laplacian matrix. Although impressive performance has been achieved with these existing methods (Elhamifar and Vidal 2013; Liu et al. 2013a; Hu et al. 2014), they are only applicable to data with single-view features. Recently, multi-view subspace clustering has achieved impressive performance. Specifically, the methods in Guo (2013), White et al. (2012) and Zhang et al. (2017b) formulate multi-view learning as learning a common subspace representation. The dimensionality reduction based methods (Blaschko and Lampert 2008; Chaudhuri et al. 2009) usually learn a low-dimensional subspace to integrate the multiple views and then obtain the final clustering result by using a traditional clustering algorithm. Recently, several multi-view subspace clustering methods have been proposed (Cao et al. 2015a; Gao et al. 2015; Zhang et al. 2017b; Xie et al. 2018; Cheng et al. 2018) based on self-representation subspace clustering. Different from these methods, which learn a common representation (Zhang et al. 2017b) or explore correlations of pairwise views (Cao et al. 2015a; Gao et al. 2015), we conduct multi-view subspace clustering with a low-rank tensor to explore the high-order correlations across multiple views, and incorporate prior information as well.

2.4 Semi-supervised Clustering

Generally, clustering is related to representation learning, and here we mainly review constrained clustering, which can be roughly classified into two groups. The first category (Wu et al. 2013) usually learns a Mahalanobis distance to minimize the distance between samples within the same class and maximize the distance between samples of different classes. However, it may lead to overfitting since constraints are usually scarce. The second category extends traditional clustering methods, e.g., k-means (Basu et al. 2004; Wagstaff et al. 2001) or Gaussian mixtures (Lu and Leen 2007), to the constrained setting. There are also some methods (Zhou et al. 2014; Kamvar et al. 2003) that simply replace the entries of the affinity matrix with 1 for must-link pairs and with 0 for cannot-link pairs.

For example, Video Face Clustering via Constrained Sparse Representation (CS-VFC) (Zhou et al. 2014) utilizes must-link and cannot-link constraints in two steps, i.e., sparse representation and spectral clustering. Its objective function is as follows:

min ||Z||_1  s.t. X = XZ, Z_{ji} = 0, (j, i) ∈ (M ∪ C ∪ I),   (4)

where M, C, I are defined as the sets of the must-link constraints, cannot-link constraints and indices corresponding to the elements with value 1 in the identity matrix, respectively. The way of constructing the affinity matrix is provided as:

W_const = |Z| + |Z|^T + λM + βC,   (5)

where M ∈ R^{N×N} and C ∈ R^{N×N} are the must-link matrix and cannot-link matrix, respectively. Here λ and β are trade-off parameters. Note that we perform normalization for Z as z_i ← z_i/||z_i||_∞ so as to make the values in the affinity matrix of the same scale.

3 The Proposed Approach

3.1 Preliminary

In subspace clustering, the subspace representation is usually obtained in a self-representation manner.


The affinity matrix is constructed according to the learned subspace representation. Given the data matrix X = [x_1, ..., x_N] with each column being a D-dimensional sample, where N is the number of samples, representative subspace representation learning methods (Elhamifar and Vidal 2013; Liu et al. 2013a; Hu et al. 2014) usually share the following formulation to obtain the subspace representation:

min_{Z,E} L(X, XZ) + λΩ(Z)  s.t. X = XZ + E,   (6)

where Z = [z_1, z_2, ..., z_N] ∈ R^{N×N} is the reconstruction coefficient matrix, whose column z_i is the learned subspace representation vector corresponding to the sample x_i, and E ∈ R^{D×N} is the reconstruction error matrix. L(·, ·) denotes the loss function measuring the reconstruction error, Ω(·) is the regularization term, and λ is the trade-off parameter that balances the intensity of the loss and the regularization. For the clustering task, after obtaining the subspace representation matrix Z, an affinity matrix is constructed as (|Z| + |Z^T|)/2, where |·| denotes the absolute operator, and the spectral clustering algorithm (Ng et al. 2001) is applied to the affinity matrix to obtain the final clustering result.
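As a minimal sketch of this post-processing step, assuming scikit-learn's SpectralClustering with a precomputed affinity:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(Z, n_clusters):
    """Build the affinity (|Z| + |Z^T|) / 2 and apply spectral clustering
    (Ng et al. 2001) to obtain the final cluster labels."""
    affinity = (np.abs(Z) + np.abs(Z).T) / 2.0
    model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return model.fit_predict(affinity)
```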

However, Eq. (6) can only handle single-view data. To extend single-view subspace representation learning to the multi-view setting, we can rewrite Eq. (6) as:

min_{Z^(v), E^(v)} Σ_{v=1}^{V} (Ω(Z^(v)) + λ_v L(X^(v), X^(v) Z^(v)))
s.t. X^(v) = X^(v) Z^(v) + E^(v), v = 1, 2, ..., V,   (7)

where X^(v), Z^(v) and E^(v) denote the data matrix of the vth view, the corresponding subspace representation matrix and the reconstruction error matrix, respectively. Here λ_v denotes the hyperparameter for the vth view and V is the number of views. Apparently, this naive way deals with the data of each view independently, which ignores the correlations among different views. Thus, our proposed algorithm aims to capture the high-order correlation among multiple views.

3.2 Multi-view Subspace Representation Learning with Low-Rank Tensor

In our work, we propose a multi-view subspace clustering method with a low-rank tensor constraint, which aims to learn the subspace representations of distinct views jointly and explore the high-order correlation underlying multiple views. The proposed method regards the subspace representation matrices of all views as a tensor, which is the generalization of the matrix concept. The tensor nuclear norm (Liu et al. 2013a; Zhang et al. 2013, 2014; Liu et al. 2013b; Tomioka et al. 2010) is defined as follows:

||Z||_* = Σ_{m=1}^{M} ξ_m ||Z_(m)||_*  s.t. ξ_m > 0, Σ_{m=1}^{M} ξ_m = 1,   (8)

where the ξ_m are constants and M is the number of modes. An M-order tensor (or M-mode tensor) is defined as Z ∈ R^{I_1×I_2×...×I_M}, and the unfold operation along the mth mode transforms the tensor Z into a matrix Z_(m), defined as unfold_m(Z) = Z_(m) ∈ R^{I_m×(I_1×...×I_{m−1}×I_{m+1}×...×I_M)} (De Lathauwer et al. 2000; Lathauwer et al. 2000). The nuclear norm ||·||_* can well approximate the rank of a matrix, since it is the tightest convex envelope of the matrix rank. Essentially, the nuclear norm of a tensor is a convex combination of the nuclear norms of all matrices unfolded along each mode.
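A small NumPy sketch of the mode-m unfolding and the weighted sum of unfolded nuclear norms in Eq. (8) is given below; the ordering of the trailing modes follows NumPy's default reshape and is only one possible unfolding convention.

```python
import numpy as np

def unfold(T, mode):
    """Mode-m unfolding: move the given mode to the front and flatten the rest,
    giving a matrix of shape (I_m, product of the other dimensions)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tensor_nuclear_norm(T, xi):
    """Weighted tensor nuclear norm of Eq. (8): sum_m xi_m * ||T_(m)||_*,
    with xi_m > 0 summing to one."""
    return sum(x * np.linalg.norm(unfold(T, m), ord='nuc')
               for m, x in enumerate(xi))

# Example: stack the V subspace representation matrices (each N x N) into an
# N x N x V tensor, e.g.  T = np.stack([Z1, Z2, Z3], axis=2)
```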

We use the nuclear norm to enforce a low-rank constraint on the tensor Z as

min_{Z^(v), E^(v)} ||E||_{2,1} + λ||Z||_*
s.t. X^(v) = X^(v) Z^(v) + E^(v), v = 1, 2, ..., V,
     Z = Ψ(Z^(1), ..., Z^(V)), E = [E^(1); ...; E^(V)],   (9)

where Ψ(·) combines the representations Z^(v) of the distinct views into a 3-order tensor Z, whose dimensionality is N × N × V. We concatenate the errors of all views vertically along the columns, forming E = [E^(1); E^(2); ...; E^(V)], and apply the ℓ2,1-norm ||·||_{2,1} to encourage E to be sparse in columns. The underlying assumption is that corruptions are sample-specific, i.e., some instances are corrupted. With this manner of integration, the columns of [E^(1); E^(2); ...; E^(V)] are constrained to have jointly consistent magnitude values (Cheng et al. 2011). Note that, to decrease the variation in the magnitude of the error corresponding to different views, we normalize the data matrices of each view to impose the same scale on the errors of distinct views. Specifically, we normalize x_i as x_i ← x_i/||x_i||_2.

3.3 Constrained Multi-view Subspace Representation Learning

Although the subspace representation learning can be improved with multiple views, it is still challenging because there is no label information guiding the learning process. Fortunately, prior knowledge (e.g., must-link constraints) is usually available, which injects discriminative information into the representation learning. The must-link prior information indicates whether samples belong to the same cluster.


To incorporate must-link constraints into the proposed multi-view representation learning model, a constraint matrix (Jing et al. 2016) is constructed. Suppose there are L samples belonging to C sets, where the samples in each set belong to the same class. The remaining N − L samples with no constraints are considered to form N − L singleton sets. Then, the dataset is partitioned into N − L + C sets, and accordingly, we can construct a constraint matrix Q ∈ R^{N×(N−L+C)}, where Q_{i,j} = 1 if x_i is in the jth set. To ensure that samples in the same set are clustered into the same cluster, an auxiliary matrix U^(v) ∈ R^{N×(N−L+C)} is designed for each view, satisfying Z^(v) = U^(v) Q^T.

The objective function in Eq. (9) can then be reformulated as

min_{Z^(v), E^(v)} ||E||_{2,1} + λ||Z||_*
s.t. X^(v) = X^(v) U^(v) Q^T + E^(v),
     E = [E^(1); ...; E^(V)],
     Z = Ψ(Z^(1), ..., Z^(V)), Z^(v) = U^(v) Q^T.   (10)

Proposition 1 Under the equation Z^(v) = U^(v) Q^T, we have ||Z||_* ≤ ||U||_*.

Based on Proposition 1, ||U||_* is an upper bound of ||Z||_*. Therefore, for our objective function Eq. (10), we substitute ||Z||_* with ||U||_* to avoid the inverse operation. Accordingly, the optimization problem of Eq. (10) is transformed into

min_{U^(v), E^(v)} ||E||_{2,1} + λ||U||_*
s.t. X^(v) = X^(v) U^(v) Q^T + E^(v),
     U = Ψ(U^(1), ..., U^(V)), E = [E^(1); ...; E^(V)].   (11)

According to Proposition 1, minimizing ||U||_* can be considered as an approximation of minimizing ||Z||_*. This way of approximation is widely used in the field of optimization. Specifically, because ||U||_* is an upper bound of ||Z||_*, any constraint ||Z||_* < a can be satisfied by enforcing ||U||_* < b (b ≤ a is a sufficient condition). Therefore, in practical use, we can control the strength of the low-rank property of Z by setting an appropriate value for the hyper-parameter λ.

Model properties To summarize, we highlight that the proposed tensorized multi-view subspace representation learning has the following merits: (1) Our model explores the high-order correlations by simultaneously mining the intra-view and inter-view correlations, which is especially important for multi-view data. (2) The supervised information is incorporated into the proposed multi-view subspace representation learning model, which can guide the learning process toward more accurate results. (3) The proposed algorithm is a flexible framework, where the constraint matrix is constructed automatically according to the supervised information, and the model reduces to an unconstrained one if there is no prior information.

3.4 Optimization

The Augmented Lagrange Multiplier (ALM) method is an efficient algorithm for solving optimization problems under equality constraints. The ALM with an alternating direction minimizing strategy is an efficient solver for our problem (10). To adopt this strategy, it is necessary to make our objective function separable. Thus, we follow Tomioka et al. (2010) and introduce an auxiliary tensor G, consisting of V variables G^(v), to replace U, converting the problem to the following optimization problem:

min_{U^(v), E^(v), G} ||E||_{2,1} + λ||G||_*
s.t. U = G, X^(v) = X^(v) U^(v) Q^T + E^(v),
     U = Ψ(U^(1), ..., U^(V)), G = Ψ(G^(1), ..., G^(V)),
     E = [E^(1); ...; E^(V)],   (12)

where G is the augmented variable corresponding to U that makes our problem separable. The first constraint ensures the equivalence between (11) and (12). The second constraint jointly relates the data points of the same cluster, i.e., the same linear subspace, and takes the prior into consideration. The last constraint with the ℓ2,1-norm encodes the underlying assumption on the error, i.e., sample-specific error. The optimization problem of Eq. (12) can be solved by the AL-ADM method (Lin et al. 2010), which minimizes the following augmented Lagrangian function:

L_{μ>0}({U^(v); E^(v)}_{v=1}^{V}; {G_(m)}_{m=1}^{M})
= ||E||_{2,1} + Σ_{m=1}^{M} λ_m ||G_(m)||_* + Φ(W, U − G)
+ Σ_{v=1}^{V} Φ(Y_v^T, X^(v) − X^(v) U^(v) Q^T − E^(v)),   (13)

where λ_m = λ ξ_m > 0 encodes the intensity of the low-rank tensor constraint and G_(m) is the mth mode unfolding matrix of G. For convenience, we define Φ(Y, C) = (μ/2)||C||_F^2 + ⟨Y, C⟩, where ⟨·,·⟩ denotes the matrix inner product and μ is a positive penalty scalar. The above unconstrained problem can be solved by an alternating minimization method with respect to the variables E^(v), U^(v) and G_(m), followed by updating the Lagrange multipliers Y_v and W accordingly. In this paper, the AL-ADM strategy is adopted and outlined in Algorithm 1, which optimizes our problem by updating each variable in each iteration. The optimization of each subproblem is as follows:


1. U^(v)-subproblem: To update U^(v), we solve the following problem with the other variables fixed:

U^(v)* = argmin_{U^(v)} Φ(W^(v), U^(v) − G^(v)) + Φ(Y_v^T, X^(v) − X^(v) U^(v) Q^T − E^(v)).   (14)

Taking the derivative with respect to U^(v) and setting it to zero, we obtain the following equation:

A U^(v) + U^(v) B = C
with A = (X^(v)T X^(v))^{-1}, B = Q^T Q,
     C = (X^(v)T X^(v))^{-1} (G^(v) − W^(v)/μ + X^(v)T Y_v Q/μ + X^(v)T X^(v) Q − X^(v)T E^(v) Q).   (15)

This is a Sylvester equation (Bartels and Stewart 1972), which admits a unique solution. The classical algorithm for solving the Sylvester equation is the Bartels-Stewart algorithm (Bartels and Stewart 1972).
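A possible implementation of this update assembles A, B and C as in Eq. (15) and calls SciPy's Bartels-Stewart solver; the small ridge added to X^(v)T X^(v) for invertibility is an assumption, not part of the original formulation.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_U(X, Q, G, W, Y, E, mu, reg=1e-8):
    """Solve A U + U B = C of Eq. (15) for one view using the
    Bartels-Stewart algorithm (scipy.linalg.solve_sylvester)."""
    XtX = X.T @ X + reg * np.eye(X.shape[1])   # small ridge for invertibility
    A = np.linalg.inv(XtX)
    B = Q.T @ Q
    C = A @ (G - W / mu + X.T @ Y @ Q / mu + X.T @ X @ Q - X.T @ E @ Q)
    return solve_sylvester(A, B, C)
```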

2. E-subproblem: The reconstruction error matrix E is optimized by:

E* = argmin_E ||E||_{2,1} + Σ_{v=1}^{V} Φ(Y_v^T, X^(v) − X^(v) U^(v) Q^T − E^(v))
   = argmin_E (1/μ)||E||_{2,1} + (1/2)||E − F||_F^2,   (16)

where F is formed by vertically concatenating the matrices X^(v) − X^(v) U^(v) Q^T + Y_v/μ along the column direction. This subproblem can be efficiently solved by Lemma 3.2 in Liu et al. (2013a).
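This column-wise shrinkage (the standard ℓ2,1 proximal operator) can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def prox_l21(F, tau):
    """Column-wise solution of  min_E tau*||E||_{2,1} + 0.5*||E - F||_F^2:
    each column f_i is shrunk to max(0, 1 - tau/||f_i||) * f_i."""
    norms = np.linalg.norm(F, axis=0)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return F * scale  # broadcasting scales each column

# In the E-subproblem of Eq. (16), tau = 1/mu and F stacks
# X(v) - X(v) U(v) Q^T + Y_v/mu over the views along the rows.
```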

3. Y_v-subproblem: The multiplier Y_v is updated by:

Y_v* = Y_v + μ(X^(v) − X^(v) U^(v) Q^T − E^(v)).   (17)

Intuitively, the multiplier is updated proportionally to the violation of the equality constraint.

4. G-subproblem: G_(m) is updated by:

G_(m)* = argmin_{G_(m)} λ_m ||G_(m)||_* + Φ(W_(m), U_(m) − G_(m))
       = argmin_{G_(m)} (λ_m/μ)||G_(m)||_* + (1/2)||G_(m) − (U_(m) + W_(m)/μ)||_F^2.   (18)

Specifically, there are three ways of unfolding a three-mode tensor in our model, and G_(m) is the matrix corresponding to the mth mode unfolding of G.

Algorithm 1: Algorithm of TMSRL

Input: Multiple types of feature matrices X^(1), X^(2), ..., X^(V), prior knowledge matrix Q, parameters λ_m and the number of clusters K.
Initialize: U^(1) = ... = U^(V) = 0; Z^(1) = ... = Z^(V) = 0; E^(1) = ... = E^(V) = 0; Y_1 = ... = Y_V = 0; W^(1) = ... = W^(V) = 0; G^(1) = ... = G^(V) = 0; μ = 10^{-5}; ρ = 1.5; ε = 10^{-5}; max_μ = 10^{10}.
while not converged do
    for each of the V views do
        Update U^(v), E^(v) and Y_v according to Eqs. (15), (16) and (17), respectively;
        Compute the subspace representation of each view by Z^(v) = U^(v) Q^T;
    end
    for each of the M modes do
        Update G_(m) and W according to Eqs. (18) and (19), respectively;
    end
    Update the parameter μ by μ = min(ρμ, max_μ);
    Check the convergence conditions: ||X^(v) − X^(v) U^(v) Q^T − E^(v)||_∞ < ε and ||U^(v) − G^(v)||_∞ < ε;
end
Combine all subspace representations by S = (1/V) Σ_{v=1}^{V} (|Z^(v)| + |Z^(v)T|);
Apply the spectral clustering/classification algorithm with S;
Output: Clustering/classification result.

We can update G_(m) by the singular value thresholding operator (Cai et al. 2010).
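A minimal sketch of this singular value thresholding step:

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding (Cai et al. 2010): the proximal operator of
    tau*||.||_*, i.e. argmin_G tau*||G||_* + 0.5*||G - A||_F^2."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

# G-subproblem of Eq. (18): G_(m) = svt(U_(m) + W_(m)/mu, lambda_m/mu),
# then fold the result back into the tensor G.
```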

5. W-subproblem: Similar to updating Y_v, the variable W is updated by:

W* = W + μ(U − G).   (19)

Compared to the penalty method, taking μ → ∞ is not necessary for the ALM method to solve the original constrained problem. In contrast, owing to the Lagrangian multiplier term, our method has a fast convergence speed since μ can be kept much smaller. Actually, any suitable optimization algorithm can be utilized to solve our problem in Eq. (12), and we just provide a general optimization scheme. For example, for large-scale problems, LADMPSAP (Lin et al. 2015) can substitute for AL-ADM to achieve more efficient performance. Furthermore, some methods can also be employed to speed up the matrix inversion computation (e.g., Quintana 2001; Soleymani 2013) and the Singular Value Thresholding (SVT) operator (e.g., Oh et al. 2015).

3.5 Complexity and Convergence

The details of our method are summarized in Algorithm 1. The optimization process of our model mainly consists of five sub-problems. Firstly, solving the U^(v)-subproblem involves a matrix inversion and a Sylvester equation, both of which have complexity O(n^3).


The overall complexity of updating U^(v) is O(dn^2 + n^3), where d and n are the dimension of the single-view feature and the number of data points, respectively. The computations for updating E^(v) and Y_v are matrix multiplications with complexity O(dn^2). The complexity of the G_(m)-subproblem is also O(n^3), since it involves the nuclear norm proximal operator. Overall, the total complexity of our algorithm is O(dn^2 + n^3) per iteration.

Generally, it is difficult to prove the convergence of our proposed algorithm in theory, but its convergence properties can be analyzed similarly to those in Lin et al. (2010). For the U^(v)-subproblem, we can find a unique solution (Bartels and Stewart 1972). Lemma 3.2 in Liu et al. (2013a) gives the optimal solution of the E-subproblem, and the convergence of the G_(m)-subproblem is guaranteed by the work of Cai et al. (2010). Hence, the convergence of each subproblem is well ensured. Moreover, the empirical evidence on real data suggests that our algorithm has stable convergence behavior.

4 Experiments

4.1 Experimental Datasets

Figure 2 presents some example images of four benchmark datasets used in our experiments. These datasets are widely used for face and image clustering tasks in recent works (Elhamifar and Vidal 2013; Liu et al. 2013a; Hu et al. 2014). The detailed information of these datasets is as follows:

• Yale1 The Yale face dataset contains 165 grayscale images of 15 individuals. There are 11 images per subject, one per facial expression or configuration.
• Extended YaleB2 The Extended YaleB dataset consists of 38 individuals with around 64 near-frontal images under different illuminations for each individual. Similar to other work (Liu et al. 2013a), we use the images of the first 10 classes, i.e., 640 frontal face images.
• ORL3 There are 10 different images of each of 40 distinct subjects in the ORL face dataset. The images were taken at different times, varying the lighting, facial expressions and facial details for some subjects.
• COIL-204 The Columbia Object Image Library (COIL-20) dataset contains 1440 images of 20 object categories. Each category contains 72 images. All the images are normalized to 32 × 32 pixel arrays with 256 gray levels per pixel.

1 http://cvc.yale.edu/projects/yalefaces/yalefaces.html.
2 http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html.
3 http://www.uk.research.att.com/facedatabase.html.
4 http://www.cs.columbia.edu/CAVE/software/softlib/.

Fig. 2 Example images of the four datasets used in this paper (the rows from top to bottom correspond to Yale, Extended YaleB, ORL and COIL-20, respectively)

• BBCSport5 The dataset consists of documents of sports news corresponding to 5 topics, where two different types of features are extracted (Xia et al. 2014b).
• Football6 The dataset is a collection of 248 English Premier League football players and clubs active on Twitter. The disjoint ground-truth communities correspond to the 20 clubs in the league.
• Politicsie7 The dataset is a collection of Irish politicians and political organizations assigned to seven disjoint groups according to their affiliation. The two Twitter datasets are associated with 9 different views.

In our experiments, we extract three types of features [i.e., intensity, Local Binary Pattern (LBP) (Ojala et al. 2002) and Gabor (Lades et al. 1993)] for the image datasets (i.e., Yale, Extended YaleB, ORL, and COIL-20). The intensity feature is the intensity of a single-channel image pixel, e.g., the image grayscale. The standard LBP features are extracted with a sampling density of 8 and a blocking number of 7 × 8. We extract the Gabor wavelets at four orientations θ = {0°, 45°, 90°, 135°} with one scale λ = 4. Accordingly, the dimensionality of the intensity feature depends on the size of the image, and the numbers of dimensions for LBP and Gabor are 3304 and 6750, respectively. For the BBCSport dataset, each document is divided into two segments, and then standard stemming, stop-word removal and TF-IDF normalization procedures are applied to the two segments separately to produce two different views (Greene and Cunningham 2009). For the Football and Politicsie datasets from Twitter, the social relationships (networks) 'follows', 'followed by', 'mentions', 'mentioned by', 'retweets' and 'retweeted by' between two users are utilized as six views. Each user belongs to a specific user list with a detailed description, and two further views are constructed from two kinds of features of the user lists, i.e., user-list names and key words of user-list names with textual descriptions.
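For the image views described above, a possible scikit-image sketch of the LBP and Gabor feature extraction is shown below; the block pooling, histogram binning and the exact mapping to the parameter settings in the text (8 sampling points, 7 × 8 blocks, λ = 4) are assumptions for illustration.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.filters import gabor

def lbp_feature(img, P=8, R=1, blocks=(7, 8)):
    """Blockwise LBP histogram: split the image into a 7 x 8 grid of blocks and
    concatenate per-block histograms of uniform LBP codes (P = 8 samples)."""
    codes = local_binary_pattern(img, P, R, method='uniform')
    bins = P + 2                                   # number of uniform LBP codes
    h_step, w_step = img.shape[0] // blocks[0], img.shape[1] // blocks[1]
    feats = []
    for i in range(blocks[0]):
        for j in range(blocks[1]):
            patch = codes[i * h_step:(i + 1) * h_step, j * w_step:(j + 1) * w_step]
            hist, _ = np.histogram(patch, bins=bins, range=(0, bins))
            feats.append(hist)
    return np.concatenate(feats)

def gabor_feature(img, thetas=(0, 45, 90, 135), wavelength=4.0):
    """Gabor responses at four orientations and one scale (lambda = 4)."""
    feats = []
    for t in thetas:
        real, imag = gabor(img, frequency=1.0 / wavelength, theta=np.deg2rad(t))
        feats.append(np.sqrt(real ** 2 + imag ** 2).ravel())
    return np.concatenate(feats)
```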

5 http://mlg.ucd.ie/datasets/.
6 http://mlg.ucd.ie/aggregation/.
7 http://mlg.ucd.ie/aggregation/.


Table 1 Statistics of the used datasets

Dataset          #instance  #view  #class  Domain
Yale             165        3      15      Image
Extended YaleB   640        3      10      Image
ORL              400        3      40      Image
COIL-20          1440       3      20      Image
BBCSport         737        2      5       Text
Football         248        9      5       Social media
Politicsie       348        9      7       Social media

Moreover, the tweet profile vector, constructed from a certain number of each user's most recent tweets, generates the last view. The statistics of the used datasets are shown in Table 1.

For a fair comparison, we do not use the must-link information for either the compared methods or ours in the unsupervised experiments. Specifically, the subspace representation matrices are learned according to Eq. (9) and then combined into an affinity graph. In the constrained multi-view representation learning experiment, we introduce the must-link information for all compared methods.

4.2 Experiments on Unsupervised Multi-view Representation Learning

We first compare our method with other multi-view clustering methods, since the subspace representation is usually used for the clustering task. For a comprehensive evaluation, there are 10 compared methods in our experiments, including 3 single-view and 7 multi-view ones. Specifically, these methods are as follows:

• SPCbest. This is the standard spectral clustering algorithm (Ng et al. 2001) employing the most informative view.
• LRRbest (Liu et al. 2013a). This is the low-rank constrained subspace clustering algorithm with the best-performing single view.
• RTC (Cao et al. 2013). The method utilizes a tensor to represent images and is robust to outliers.
• FeatConcatePCA. The method first concatenates all views together and then employs PCA to reduce the number of dimensions to 300.
• PCA+LRR. The method concatenates all views and employs PCA to reduce the feature dimension to 300, on which LRR is applied.
• Co-Reg SPC (Kumar et al. 2011). The method co-regularizes the clustering hypotheses to enforce corresponding samples to have the same cluster membership.

Fig. 3 Comparison between LRR with the features of each single view and our TMSRL with multiple views, in terms of NMI and ACC on Yale, Extended YaleB, ORL and COIL-20

• Co-Training SPC (Kumar and Daumé III 2011). The method uses the co-training manner within the spectral clustering framework.
• Min-Disagreement (de Sa 2005). The idea of "minimizing-disagreement" is realized based on a bipartite graph.
• ConvexReg SPC (Collins et al. 2014). The method learns a common representation for all views.
• RMSC (Xia et al. 2014a). The method seeks a cross-view shared low-rank transition probability matrix for clustering.
• MSSC (Abavisani and Patel 2018). The method exploits the complementarity by using a common representation across different modalities.

Each of the above methods is run 30 times, and we report the average performance and standard deviation. We utilize two commonly used metrics to evaluate the clustering quality: Normalized Mutual Information (NMI) and Accuracy (ACC), which have been widely used for clustering evaluation (Christopher et al. 2008; Lawrence and Phipps 1985). For instance, the compared methods Co-Train SPC (Kumar and Daumé III 2011) and LRR (Liu et al. 2013a) also utilize these metrics: Co-Train SPC uses NMI and LRR uses accuracy (ACC) for evaluating the clustering task. ACC and NMI favor different properties of the clustering, and a higher value indicates a better clustering performance for both of them.
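A small sketch of the two metrics (NMI via scikit-learn, ACC via Hungarian matching between predicted clusters and ground-truth classes) is given below; it assumes non-negative integer labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: find the best one-to-one mapping between predicted clusters and
    ground-truth classes (Hungarian algorithm), then measure accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    count = np.zeros((D, D), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    row, col = linear_sum_assignment(-count)   # maximize matched counts
    return count[row, col].sum() / len(y_true)

# NMI is available directly:  normalized_mutual_info_score(y_true, y_pred)
```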

In our experiments, we adopt the inner product kernel to compute the graph similarity. For the parameters of our approach on all four datasets, we simply set the M parameters to equal values, i.e., λ_1 = ... = λ_M = λ, so that only one parameter λ needs to be tuned. For all the compared methods, we tune the parameters as carefully as possible for the best performance.

In Fig. 3, we compare our model with LRR using each single view. It is observed that LRR using the best single view achieves promising performance, while the performance with different views varies significantly. For example, LBP is the best view on ORL and COIL-20, but there is a serious degeneration on Extended YaleB.


Fig. 4 Results (mean ± standard deviation) in terms of accuracy and NMI: (a) Yale, (b) Extended YaleB, (c) ORL, (d) COIL-20


Fig. 4 continued: (e) BBCSport, (f) Football, (g) Politicsie, (h) averaged performance on all datasets


Fig. 5 Affinity matrices obtained in the naive manner with LRR (top row), as in Eq. (7), and by TMSRL (bottom row), on (a) Yale, (b) Extended YaleB, (c) ORL and (d) COIL-20

Therefore, it is not reasonable to choose the same view for different datasets. On the contrary, our method directly uses all views and achieves competitive performance, while the other multi-view clustering methods cannot produce promising results. This demonstrates that our model can effectively integrate information from multiple views.

Figure 4 shows the detailed clustering results of the different methods in terms of accuracy and NMI. Our method basically outperforms all the baselines on the four benchmark datasets. Specifically, for the Yale dataset, it is worth noting that RMSC, the most competitive multi-view clustering model, obtains a promising performance, but LRR achieves an even better result when provided with the best feature. TMSRL outperforms LRR by approximately 3.6% and 5.6% in terms of ACC and NMI, respectively. In addition, as shown in Fig. 4, concatenating all views and reducing the dimensionality with PCA (FeatConcatePCA) is not as promising as expected, since its performance is not always superior to the result of the best single view. Besides, our method also performs better than two state-of-the-art multi-view clustering methods (Xia et al. 2014a; Collins et al. 2014). The experimental results on ORL and COIL-20 in the last two rows of Fig. 4 further verify the effectiveness of our approach.

Note that most comparison methods have unpromising performance on Extended YaleB except the self-representation based subspace methods (e.g., LRR), as shown in the second row of Fig. 4. This is mainly due to the large variation of illumination. For instance, benefiting from the self-representation manner, the subspace clustering methods are robust with respect to the intensity feature, while the traditional distance-based methods are dramatically degraded. We can see that LRR shows the best performance among the baselines (e.g., Co-Training SPC, Min-Disagreement and ConvexReg SPC). The clustering results of our model are much better than those of PCA+LRR, with the help of the high-order low-rank tensor constraint. Besides, we find that our method gains a significant improvement over LRR on Extended YaleB, although not as large as on the other datasets. This is mainly because the LBP and Gabor features are not as effective as the intensity features, which degrades our clustering results.

Figure 5 shows visualizations of the affinity matrices of our method and of LRR, which independently learns multiple affinity matrices and then adds them. The affinity matrices are visualized according to the ground-truth clusters. Compared to LRR, our algorithm reveals the underlying clustering structures more clearly. These visualizations further verify that our method can well explore the high-order correlation across multiple views.

We also compare our proposed method with FeatConcate, CCA, Deep Canonical Correlation Analysis (DCCA) (Andrew et al. 2013) and Deep Canonically Correlated AutoEncoders (DCCAE) (Wang et al. 2015) on the classification task. As shown in Fig. 6, the performance of our proposed method is rather competitive, performing best on three out of four datasets. It is observed that FeatConcate performs competitively when the quality of each view is promising.


Fig. 6 Classification comparison in terms of accuracy

Specifically, according to Figs. 2 and 6, we can find that the performance with each single view of COIL-20 is generally good, which indicates the high quality of each view. A possible reason is that simple methods may also work well when each view is sufficient for promising performance. We can also find that FeatConcate performs rather unpromisingly on the difficult datasets, i.e., Yale, Football and Politicsie.

4.3 Experiments on Constrained Multi-view Representation Learning

We compare our method with 5 semi-supervised clustering methods under different ratios of supervised information. We introduce the semi-supervised manner for 3 subspace clustering methods by introducing the must-link constraint (Liu et al. 2012), which serves as prior information to modify the affinity matrix by setting S_{i,j} = 1 if and only if x_i and x_j are of must-link (a minimal sketch of this modification is given after the list below). Specifically, the comparison approaches include 3 slightly modified subspace clustering methods and 2 constrained clustering methods:

• SemiLRR (Liu et al. 2013a). The method concatenates all views and employs PCA to reduce the feature dimension to 1000. SemiLRR modifies the affinity matrices learned by LRR with the must-link constraint.
• SemiSMR (Hu et al. 2014). Smooth Representation clustering (SMR) introduces the enforced grouping effect conditions into a representation based subspace clustering model. SemiSMR modifies the affinity matrices learned by SMR with the must-link constraint.
• SemiLS3C (Patel and Nguyen 2014). Latent Space Sparse Subspace Clustering (LS3C) learns the projection of data and finds the sparse coefficients in the low-dimensional latent space. SemiLS3C modifies the affinity matrices learned by LS3C with the must-link constraint.
• CS-VFC (Zhou et al. 2014). Video Face Clustering via Constrained Sparse Representation (CS-VFC) utilizes the must-link and cannot-link constraints in the video face clustering task at the two stages of sparse representation and spectral clustering.
• COSC (Rangapuram and Hein 2012). Constrained 1-Spectral Clustering (COSC) presents a generalization of the popular spectral clustering technique which integrates must-link and cannot-link constraints.
• CLRR (Jing et al. 2016). The method ensures that data sharing a must-link constraint or the same label have the same coordinates in the new representation.
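A minimal sketch of the must-link modification used for the Semi* baselines above, which simply sets the corresponding affinity entries to 1:

```python
import numpy as np

def apply_must_link(S, must_link_pairs):
    """Inject the must-link prior into an affinity matrix S by setting
    S[i, j] = S[j, i] = 1 for every must-link pair (i, j)."""
    S = S.copy()
    for i, j in must_link_pairs:
        S[i, j] = S[j, i] = 1.0
    return S
```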

Figure 7 shows the affinity matrices on the 4 datasets. The visualization clearly shows the block-diagonal structure, which makes the results of subspace clustering more accurate. Moreover, as the supervision ratio (denoted by sup_ratio) increases, the block-diagonal structure of the affinity matrices becomes much clearer.

Figure 8 shows the comparison among the different methods with must-link constraints under different ratios of supervised information, in terms of clustering accuracy and NMI. Obviously, the clustering performance becomes better as the amount of supervised information increases. Note that COSC achieves promising performance and defeats the subspace clustering methods, while our algorithm attains competitive results. We also note that the improvements of SemiSMR are not as significant as those of the other methods as the supervision ratio increases on ORL and COIL-20, while our competitive results further validate the effectiveness of our method.


Fig. 7 Visualization of affinity matrices on Yale, Extended YaleB, ORL and COIL-20 under different ratios of constraints (columns from left to right: sup_ratio = 0, 0.3, 0.6, 0.9)

4.4 Parameter Tuning, Convergence and Computational Cost

The parameter tuning experiments are shown in Fig. 9. In our model, there is only one parameter, λ, to be tuned, and we tune it on the 4 benchmark datasets. Overall, the performance with regularization λ > 0 is better than with λ = 0 on all 4 datasets, which demonstrates the effectiveness of the low-rank tensor constraint. Moreover, we can find a reasonable parameter interval for each dataset that achieves promising performance. However, different datasets have distinctly different parameter intervals. For instance, our method performs well on ORL with a small λ, which indicates that a slight constraint is sufficient to cluster the ORL data, while for Extended YaleB a much larger λ is needed.

Generally, it is difficult to select a value for the parameter λ in advance for a new dataset, because there is no validation set guiding the selection as in a supervised task. Even so, we provide a possible way to guide the hyper-parameter selection for clustering. Specifically, since the label information is unavailable, we introduce an internal evaluation scheme, where the clustering result is evaluated using quantities and features inherent to the dataset.


Fig. 8 Clustering performance (accuracy and NMI) under different ratios of constraints (sup-ratio) on Yale, Extended YaleB, ORL and COIL-20

Fig. 9 Parameter tuning in terms of internal (i.e., Davies-Bouldin Index) and external (ACC and NMI) metrics on four benchmark datasets: (a) Yale, (b) Extended YaleB, (c) ORL, (d) COIL-20. Since we set λ1 = · · · = λM = λ in our experiments, only the parameter λ needs to be tuned

We use the Davies-Bouldin index (DBI) (Davies and Bouldin 1979) as a metric for evaluating clustering results without label information, where a smaller DBI value indicates a better clustering result. As shown in Fig. 9, promising performance in terms of the external metrics (e.g., ACC and NMI) is usually consistent with that of the internal metric, demonstrating the effectiveness of the proposed parameter selection strategy. Moreover, other internal metrics, or combinations of multiple internal metrics, could be considered in the future.
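To make this selection procedure concrete, the following minimal Python sketch (not from the paper) selects λ by sweeping a grid of candidate values and keeping the one with the smallest DBI; the callable affinity_for and the feature matrix X_feat are hypothetical placeholders for the affinity matrix produced by the model and the (e.g., concatenated) view features used only for scoring.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import davies_bouldin_score

def select_lambda(X_feat, affinity_for, candidate_lambdas, n_clusters):
    """Label-free selection of lambda via the Davies-Bouldin index (lower is better).

    X_feat            -- (n_samples, d) features used only for DBI scoring (hypothetical input)
    affinity_for(lam) -- user-supplied callable that runs the model with regularization lam
                         and returns an (n, n) symmetric affinity matrix (hypothetical)
    """
    best_lam, best_dbi = None, np.inf
    for lam in candidate_lambdas:
        A = affinity_for(lam)                       # affinity from the learned representation
        labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity="precomputed",
                                    random_state=0).fit_predict(A)
        dbi = davies_bouldin_score(X_feat, labels)  # internal metric, no ground-truth labels
        if dbi < best_dbi:
            best_lam, best_dbi = lam, dbi
    return best_lam, best_dbi

# Example grid for a dataset where a small lambda suffices (cf. the ORL interval in Fig. 9):
# lam_star, _ = select_lambda(X_feat, affinity_for, np.linspace(0.0, 0.12, 13), n_clusters=40)
```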

Figure 10 shows the convergence experiments on Yale and Extended YaleB. We normalize the values of the convergence conditions to (0, 1). The results validate that our algorithm can achieve convergence within a few iterations.
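As a rough illustration (not the authors' code), the stopping conditions plotted in Fig. 10 can be monitored per view as below; the variable names mirror the residuals in the figure legend, and the matrices are assumed to be the per-view quantities maintained during the alternating optimization.

```python
import numpy as np

def convergence_residuals(X, U, E, G, Q):
    """Per-view residuals tracked at each iteration (a sketch): X, U, E, G are lists of
    per-view matrices X^(v), U^(v), E^(v), G^(v); Q is the shared matrix appearing in
    the reconstruction term."""
    recon = [np.linalg.norm(X[v] - X[v] @ U[v] @ Q.T - E[v])  # ||X^(v) - X^(v) U^(v) Q^T - E^(v)||
             for v in range(len(X))]
    gap = [np.linalg.norm(U[v] - G[v])                        # ||U^(v) - G^(v)||
           for v in range(len(X))]
    return recon, gap

def normalize_curve(values):
    """Rescale a residual curve to (0, 1) for plotting, as done in Fig. 10."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + 1e-12)
```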


Fig. 10 Convergence experiment on (a) Yale and (b) Extended YaleB: the normalized convergence conditions ||X^(v) − X^(v) U^(v) Q^T − E^(v)||_2 and ||U^(v) − G^(v)||_2 (v = 1, 2, 3) are plotted against the number of iterations

Table 2 Computation cost on COIL-20

Method      SPC     LRR       RMSC      MSSC    LT-MSC

Time (s)    3.84    276.19    389.78    1833    452.25

We report the computational time of representative multi-view learning methods on COIL-20 in Table 2. All methods are tested on a computer with an Intel(R) Core(TM) i5-8400 CPU and 8.00 GB RAM. Because the existing subspace clustering methods involve an n × n graph, they require costly matrix operations, and the time complexities of these subspace-based clustering methods are generally at the same level. We observe that spectral clustering is much faster because it does not require a number of iterations.

5 Conclusion

We introduce a framework to learn representations for multi-view data by exploiting the complementary information from multiple views. A tensor is introduced to explore the high-order correlations of multi-view data, and a constraint matrix is devised to further promote the learned representation. We formulate the problem within a unified optimization framework and propose an efficient algorithm to obtain the optimal solution. Extensive experimental results validate the effectiveness of the proposed method in exploring high-order correlations and prior information.

Acknowledgements This work is supported by the National Natural Science Foundation of China (Nos. 61976151, 61732011, 61925602 and U1636214) and the Beijing Natural Science Foundation (No. 4172068).

References

Abavisani, M., & Patel, V. M. (2018). Multimodal sparse and low-rank subspace clustering. Information Fusion, 39, 168–177.
Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. In ICML (pp. 1247–1255).
Bartels, R. H., & Stewart, G. W. (1972). Solution of the matrix equation AX + XB = C. Communications of the ACM, 15(9), 820–826.
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In ACM SIGKDD (pp. 59–68).
Bickel, S., & Scheffer, T. (2004). Multi-view clustering. In ICDM.
Blaschko, M. B., & Lampert, C. H. (2008). Correlational spectral clustering. In CVPR.
Cai, J. F., Candès, E. J., & Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. Philadelphia: Society for Industrial and Applied Mathematics.
Cao, X., Wei, X., Han, Y., Yang, Y., & Lin, D. (2013). Robust tensor clustering with non-greedy maximization. In AAAI.
Cao, X., Zhang, C., Fu, H., Liu, S., & Zhang, H. (2015a). Diversity-induced multi-view subspace clustering. In CVPR.
Cao, X., Zhang, C., Zhou, C., Fu, H., & Foroosh, H. (2015b). Constrained multi-view video face clustering. TIP, 24(11), 4381–4393.
Chaudhuri, K., Kakade, S. M., Livescu, K., & Sridharan, K. (2009). Multi-view clustering via canonical correlation analysis. In ICML.
Cheng, B., Liu, G., Wang, J., Huang, Z., & Yan, S. (2011). Multi-task low-rank affinity pursuit for image segmentation. In ICCV.
Cheng, M., Jing, L., & Ng, M. K. (2018). Tensor-based low-dimensional representation learning for multi-view clustering. IEEE Transactions on Image Processing, 28(5), 2399–2414.
Christopher, D. M., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (Vol. 1). Cambridge: Cambridge University Press.
Collins, M. D., Liu, J., Xu, J., Mukherjee, L., & Singh, V. (2014). Spectral clustering with a convex regularizer on millions of images. In ECCV.
Cortes, C., Mohri, M., & Rostamizadeh, A. (2009). Learning non-linear combination of kernels. In NIPS.
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 224–227.
De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). On the best rank-1 and rank-(r1, r2, ..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4), 1324–1342.
de Sa, V. R. (2005). Spectral clustering with two views. In ICML.
Ding, Z., Zhao, H., & Fu, Y. (2018). Learning representation for multi-view data analysis: Models and applications. Berlin: Springer.


Elhamifar, E., & Vidal, R. (2013). Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2765–2781.
Elkahky, A. M., Song, Y., & He, X. (2015). A multi-view deep learning approach for cross domain user modeling in recommendation systems. In WWW (pp. 278–288).
Gao, H., Nie, F., Li, X., & Huang, H. (2015). Multi-view subspace clustering. In ICCV (pp. 4238–4246).
Greene, D., & Cunningham, P. (2009). A matrix factorization approach for integrating multiple data views. In Joint European conference on machine learning and knowledge discovery in databases (pp. 423–438). Springer.
Grigorios, T., & Aristidis, L. (2012). Kernel-based weighted multi-view clustering. In ICDM.
Guo, Y. (2013). Convex subspace representation learning from multi-view data. In AAAI.
Hu, H., Lin, Z., Feng, J., & Zhou, J. (2014). Smooth representation clustering. In CVPR (pp. 3834–3841).
Jing, W., Xiao, W., Feng, T., Chang, H. L., & Yu, H. (2016). Constrained low-rank representation for robust subspace clustering. IEEE Transactions on Cybernetics, 47(99), 1–13.
Kamvar, K., Sepandar, S., Klein, K., Dan, D., Manning, M., & Christopher, C. (2003). Spectral learning. In IJCAI (pp. 561–566).
Kumar, A., & Daumé III, H. (2011). A co-training approach for multi-view spectral clustering. In ICML.
Kumar, A., Rai, P., & Daumé III, H. (2011). Co-regularized multi-view spectral clustering. In NIPS.
Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., et al. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3), 300–311.
Lathauwer, L. D., Moor, B. D., & Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4), 1253–1278.
Lawrence, H., & Phipps, A. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Lin, Z., Chen, M., & Ma, Y. (2010). The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1010.0789.
Lin, Z., Liu, R., & Li, H. (2015). Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. Machine Learning, 99(2), 287–325.
Li, Z., & Tang, J. (2016). Weakly supervised deep matrix factorization for social image understanding. IEEE Transactions on Image Processing, 26(1), 276–288.
Li, Z., Tang, J., & Mei, T. (2018). Deep collaborative embedding for social image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9), 2070–2083.
Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., & Ma, Y. (2013a). Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 171–184.
Liu, J., Musialski, P., Wonka, P., & Ye, J. (2013b). Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 208–220.
Liu, H., Wu, Z., Cai, D., & Huang, T. S. (2012). Constrained nonnegative matrix factorization for image representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7), 1299–1311.
Liu, X., Zhu, X., Li, M., Wang, L., Tang, C., Yin, J., et al. (2019). Late fusion incomplete multi-view clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41, 2410–2423.
Lu, Z., & Leen, T. K. (2007). Penalized probabilistic clustering. Neural Computation, 19(6), 1528–1567.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In ICML (pp. 689–696).
Oh, T.-H., Matsushita, Y., Tai, Y.-W., & Kweon, I. S. (2015). Fast randomized singular value thresholding for nuclear norm minimization. In CVPR.
Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Patel, V. M., & Nguyen, H. V. (2014). Latent space sparse subspace clustering. In ICCV (pp. 225–232).
Quintana, E. (2001). A note on parallel matrix inversion. SIAM Journal on Scientific Computing, 22(5), 1762–1771.
Rangapuram, S. S., & Hein, M. (2012). Constrained 1-spectral clustering. In AISTATS.
Soleymani, F. (2013). A fast convergent iterative solver for approximate inverse of matrices. Numerical Linear Algebra with Applications, 21(3), 439–452.
Sui, J., Qi, S., van Erp, T. G., Bustillo, J., Jiang, R., Lin, D., et al. (2018). Multimodal neuromarkers in schizophrenia via cognition-guided MRI fusion. Nature Communications, 9(1), 1–14.
Tang, W., Lu, Z., & Dhillon, I. S. (2009). Clustering with multiple graphs. In ICDM.
Tomioka, R., Hayashi, K., & Kashima, H. (2010). Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789.
Tong, H., He, J., Li, M., Zhang, C., & Ma, W.-Y. (2005). Graph based multi-modality learning. In ACM MM (pp. 862–871). ACM.
Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al. (2001). Constrained k-means clustering with background knowledge. In ICML (Vol. 1, pp. 577–584).
Wang, H., Weng, C., & Yuan, J. (2014). Multi-feature spectral clustering with minimax optimization. In CVPR.
Wang, H., Yang, Y., & Liu, B. (2019). GMC: Graph-based multi-view clustering. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2019.2903810.
Wang, W., Arora, R., Livescu, K., & Bilmes, J. (2015). On deep multi-view representation learning (pp. 1083–1092).
White, M., Zhang, X., Schuurmans, D., & Yu, Y.-l. (2012). Convex multi-view subspace learning. In NIPS.
Wu, B., Zhang, Y., Hu, B., & Ji, Q. (2013). Constrained clustering and its application to face clustering in videos. In CVPR (pp. 3507–3514).
Xia, R., Pan, Y., Du, L., & Yin, J. (2014a). Robust multi-view spectral clustering via low-rank and sparse decomposition. In AAAI.
Xia, R., Pan, Y., Du, L., & Yin, J. (2014b). Robust multi-view spectral clustering via low-rank and sparse decomposition. In AAAI (pp. 2149–2155).
Xie, Y., Tao, D., Zhang, W., Liu, Y., Zhang, L., & Qu, Y. (2018). On unifying multi-view self-representations for clustering by tensor multi-rank minimization. International Journal of Computer Vision, 126(11), 1157–1179.
Yang, E., Deng, C., Li, C., Liu, W., Li, J., & Tao, D. (2018). Shared predictive cross-modal deep quantization. IEEE Transactions on Neural Networks and Learning Systems, 29(11), 5292–5303.
Zhang, C., Adeli, E., Zhou, T., Chen, X., & Shen, D. (2018a). Multi-layer multi-view classification for Alzheimer's disease diagnosis. In AAAI.
Zhang, C., Fu, H., Hu, Q., Cao, X., Xie, Y., Tao, D., et al. (2020). Generalized latent multi-view subspace clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 86–99.
Zhang, C., Fu, H., Hu, Q., Zhu, P., & Cao, X. (2017a). Flexible multi-view dimensionality co-reduction. IEEE Transactions on Image Processing, 26(2), 648–659.
Zhang, C., Hu, Q., Fu, H., Zhu, P., & Cao, X. (2017b). Latent multi-view subspace clustering. In CVPR (pp. 4333–4341).


Zhang, T., Ghanem, B., Liu, S., Xu, C., & Ahuja, N. (2013). Low-rank sparse coding for image classification. In ICCV.
Zhang, T., Liu, S., Ahuja, N., Yang, M.-H., & Ghanem, B. (2014). Robust visual tracking via consistent low-rank sparse learning. International Journal of Computer Vision, 111(2), 171–190.
Zhang, Z., Liu, L., Shen, F., Shen, H. T., & Shao, L. (2018b). Binary multi-view clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7), 1774–1782.
Zhan, K., Zhang, C., Guan, J., & Wang, J. (2017). Graph learning for multiview clustering. IEEE Transactions on Cybernetics, 48(10), 2887–2895.
Zhao, H., Ding, Z., & Fu, Y. (2017). Multi-view clustering via deep matrix factorization. In AAAI (pp. 2921–2927).
Zhao, X., Evans, N., & Dugelay, J.-L. (2014). A subspace co-training framework for multi-view clustering. Pattern Recognition Letters, 41, 73–82.
Zhou, C., Zhang, C., Li, X., Shi, G., & Cao, X. (2014). Video face clustering via constrained sparse representation. In ICME (pp. 1–6).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
