

Deep Matrix Factorizations

Pierre De Handschutter, Nicolas Gillis, Xavier Siebert

Abstract—Constrained low-rank matrix approximations have been known for decades as powerful linear dimensionality reduction techniques able to extract the information contained in large data sets in a relevant way. However, such low-rank approaches are unable to mine complex, interleaved features that underlie hierarchical semantics. Recently, deep matrix factorization (deep MF) was introduced to deal with the extraction of several layers of features and has been shown to reach outstanding performances on unsupervised tasks. Deep MF was motivated by the success of deep learning, as it is conceptually close to some neural network paradigms. In this paper, we present the main models, algorithms, and applications of deep MF through a comprehensive literature review. We also discuss theoretical questions and perspectives of research.

Index Terms—matrix factorization, low-rank matrix approximation, non-negative matrix factorization, sparsity, deep learning, linear networks.

1 INTRODUCTION

In the current era of data deluge, the automatic extraction of interpretable features in unlabelled data sets is a key challenge. For many years, linear algebra tools have been used to deal with such tasks. Among these techniques, the constrained low-rank matrix approximations (CLRMA) [1] mine relevant information from large data sets and have therefore attracted the attention of numerous researchers. In practice, many data sets appear to be well approximated by low-rank matrices [2], and hence CLRMA are particularly appropriate to extract pertinent information. Within this general framework, some well-known techniques such as principal component analysis (PCA) [3], singular value decomposition (SVD) [4], sparse coding [5], sparse component analysis (SCA) [6], and non-negative matrix factorization (NMF) [7], to name only a few, have been used in many applications over the last decades. These variants mostly differ by the function chosen to measure the quality of the approximation and by the additional constraints considered. Given a set of $n$ data points lying in an $m$-dimensional space, a matrix $X \in \mathbb{R}^{m \times n}$ is built such that each data point corresponds to a column of $X$. The goal of a low-rank matrix approximation is to express each data point as a linear combination of a few basis vectors. In other words, one has to find a matrix $W \in \mathbb{R}^{m \times r}$ and a matrix $H \in \mathbb{R}^{r \times n}$ such that each data point can be approximated as

$$X(:, j) \approx \sum_{k=1}^{r} W(:, k) H(k, j) \quad \text{for } j = 1, \dots, n,$$

where $W(:, k)$ denotes the $k$-th column of $W$ and corresponds to the $k$-th basis element, and $H(k, j)$ is the weight with which the $k$-th basis element appears in the $j$-th data point. More precisely, $H(:, j)$ is the representation of data point $X(:, j)$ in an $r$-dimensional linear subspace spanned by the columns of $W$. In matrix form, this approximation, also sometimes called factorization, is written as $X \approx WH$.

This work was supported by the European Research Council (ERC starting grant no. 679515), and by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no. O005318F-RG47. The authors are with the Department of Mathematics and Operational Research, Faculté Polytechnique, Université de Mons, 7000 Mons, Belgium. Pierre De Handschutter is a research fellow of the Fonds de la Recherche Scientifique - FNRS. E-mails: {pierre.dehandschutter, nicolas.gillis, xavier.siebert}@umons.ac.be.

On the other hand, deep neural networks have been widely used as deep learning gained success in many supervised classification tasks [8], [9] and even in generative models [10]. Their main advantage lies in their ability to combine features in a highly non-linear way, but their theoretical foundations remain quite elusive.

Midway between the linear algebra models and the deep neural networks lies deep matrix factorization (deep MF), the core of this paper. The main motivation of deep MF is to combine both interpretability, as in classical matrix factorizations, of which it is an extension, and the extraction of multiple hierarchical features, as allowed by multilayer architectures. One-layer matrix approximations are not able to extract multi-level features in complex data sets. The goal of deep MF is to decompose a data matrix $X \in \mathbb{R}^{m \times n}$ as

$$X \approx W_1 W_2 \cdots W_L H_L, \qquad (1)$$

where $L$ is the number of layers, $W_l \in \mathbb{R}^{d_{l-1} \times d_l}$ for $l = 1, \dots, L$ with $d_0 = m$, and $H_L \in \mathbb{R}_+^{d_L \times n}$. The approximation in (1) corresponds to successive factorizations of $X$:

$$\begin{aligned}
X &\approx W_1 H_1, \\
H_1 &\approx W_2 H_2, \\
&\;\;\vdots \\
H_{L-1} &\approx W_L H_L,
\end{aligned} \qquad (2)$$

where $H_l \in \mathbb{R}_+^{d_l \times n}$ for all $l$. Each matrix $W_l$ ($l = 1, \dots, L$) can be interpreted as the feature matrix of layer $l$ and each $H_l$ can be interpreted as the representation matrix of layer $l$. In other words, successive factorizations of rank $d_l$ ($1 \le l \le L$) are performed such that various recombinations of the features of the first layers appear in the following ones, allowing numerous interpretations of the semantics hidden in the data set.

arXiv:2010.00380v2 [cs.LG] 3 Oct 2020


Fig. 1: Hierarchical features extracted by deep MF on the CMU-PIE face data set. At each layer $l$ ($l = 1, 2, 3$), the columns of the representation matrix $H_l$ are clustered according to k-means to obtain the clusters shown above. Figure from [11].

One of the first applications for which deep MF has proven to be useful is in the seminal paper of Trigeorgis et al. [11] for the extraction of facial features. Given a set of $n$ gray-scale facial images, each one described by $m$ pixel values, deep MF extracts several layers of features, each one corresponding to a specific interpretation ranging from low-level features at the first layer to high-level features at the last layer. Fig. 1 illustrates such a decomposition for a factorization of depth $L = 3$ on the CMU-PIE face data set. The basis matrix $W_1$ contains $d_1$ archetypes of pose features, that is, $d_1$ faces having discriminative pose attributes, $W_1 W_2$ corresponds to $d_2$ basis faces having different expressions, and $W_1 W_2 W_3$ retrieves the identities of the faces. Each $H_l$ indicates in which proportions each feature appears in each face of the original data set; for example, the $j$-th column of $H_2$ contains the "proportions" in which the $j$-th subject is disgusted, surprised, or neutral. Note that the use of prior information, such as the fact that some faces share the same label at some layer, helps to improve the performance of the model; see Section 3.2.5 for more information.

Without any constraint on the factors of deep MF, (1) merely degenerates into classical matrix factorization. In this case, the product of the matrices $W_l$ could be replaced by a single equivalent matrix (without any additional particular property) whose rank is less than or equal to the minimum of the $d_l$'s, and the factorization is highly non-unique. For example, one could simply replace any $W_l$ by $W_l Q$ and $W_{l+1}$ by $Q^{-1} W_{l+1}$ for any $l$ and any invertible matrix $Q \in \mathbb{R}^{d_l \times d_l}$, and obtain another decomposition of $X$ with the same approximation error but most likely with a rather different interpretation. Therefore, constraints on the factors such as non-negativity and sparsity, and/or regularizations, should be considered, which results in various deep MF models. Most deep MF models assume the non-negativity of several factors of the decomposition and therefore extend some NMF ideas.

This paper serves as a survey of the recent literature on deep MF. It is organized as follows. We first briefly summarize the main ideas behind CLRMA in Section 2. In Section 3, we present the models from the early multi-level ones, namely multilayer MF, up to recent deep MF, together with their regularizations and the different algorithmic approaches to tackle them. In Section 4, we present the performances of deep MF on two illustrative examples (namely, recommender systems and hyperspectral imaging), and review the main applications. Connections with deep learning are drawn in Section 5, while Section 6 highlights the lack of theoretical guarantees that have come with the models so far. However, contributions from deep linear networks might open new directions of research. Finally, in Section 7, we summarize the identified perspectives of future research and conclude.

2 A BRIEF SUMMARY ON MATRIX FACTORIZATIONS

In this section, we recall the basics of matrix factorizations, which will be key to understanding deep MF.

Low-rank matrix approximations consist in finding two matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ such that the product $WH$ approximates as well as possible a data matrix $X \in \mathbb{R}^{m \times n}$ made of $n$ points in dimension $m$, where $r$, called the rank of the approximation, is generally small compared to $m$ and $n$ and is fixed in advance in many practical applications.

A critical aspect of matrix approximations is the choice of the loss metric between $X$ and $WH$, that is, the way to evaluate how good the approximation is. Most models aim at minimizing a divergence between the original data matrix and its low-rank reconstruction. More precisely, the $\beta$-divergences are usually considered to quantify the fidelity between the original data matrix and its low-rank approximation [12], and a common choice in the community is to minimize the Frobenius norm of the difference between these two matrices, which corresponds to $\beta = 2$. Therefore, the standard matrix factorization optimization problem is formulated as

$$\min_{W \in \mathbb{R}^{m \times r},\; H \in \mathbb{R}^{r \times n}} \|X - WH\|_F^2, \qquad (3)$$

where $\|A\|_F^2 = \sum_{i,j} A(i, j)^2$ is the squared Frobenius norm of matrix $A$. This essentially corresponds to PCA (although PCA typically performs mean centering before computing the principal components), and can be computed via the SVD.
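As a small illustration of the last remark, the following sketch computes a solution of (3) from the truncated SVD with NumPy. The function name `best_rank_r` and the random toy data are assumptions made for this example only; the check at the end relies on the classical fact that the squared error of the rank-$r$ truncated SVD equals the sum of the discarded squared singular values.

```python
import numpy as np

def best_rank_r(X, r):
    """Return W (m x r) and H (r x n) such that W @ H is the rank-r truncated SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = U[:, :r] * s[:r]   # first r left singular vectors, scaled by the singular values
    H = Vt[:r, :]          # first r right singular vectors (stored as rows)
    return W, H

rng = np.random.default_rng(0)   # toy data, for illustration only
X = rng.random((8, 6))
r = 2
W, H = best_rank_r(X, r)

# Squared Frobenius error of the truncated SVD versus the discarded singular values.
err = np.linalg.norm(X - W @ H, "fro") ** 2
tail = np.sum(np.linalg.svd(X, compute_uv=False)[r:] ** 2)
print(err, tail)
```

Note that the factors returned here have mixed signs in general, which is one reason why constrained variants such as NMF are considered next.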

A usual feature of the data matrix is that it is entry-wise non-negative, that is, $X(i, j) \ge 0$ for all $i, j$, which is denoted $X \ge 0$. Many real-world applications record such non-negative measurements, which has led to the development of the so-called non-negative matrix factorization (NMF) model [13]. In NMF, the input data matrix $X$ is entry-wise non-negative and, in turn, entry-wise non-negativity is required for both factors $W$ and $H$. NMF has been widely studied, in terms of theoretical guarantees, models and applications [14–18], and is formulated as

$$\min_{W \in \mathbb{R}^{m \times r},\; H \in \mathbb{R}^{r \times n}} \|X - WH\|_F^2 \quad \text{such that } W \ge 0 \text{ and } H \ge 0, \qquad (4)$$

which corresponds to (3) with the additional non-negativity constraints.

A strong advantage of NMF is the interpretability of the factors [16]. The matrix $W$ is often considered as the matrix of features, with each column of $W$ corresponding to a basis vector, while the matrix $H$ corresponds to the activations of each basis vector in each original data point. In particular, if it is also required that the sum of the elements of any column of $H$ is equal to 1, that is, $H$ is column stochastic with $\sum_{j=1}^{r} H(j, k) = 1$ for all $k$, then the entries of the $k$-th column of $H$ can be interpreted as the proportions in which each feature vector appears in the $k$-th data point. In this sense, NMF can be seen as a soft clustering technique since, for all $j$ and $k$, $H(k, j)$ is the membership indicator of the $j$-th data point in the $k$-th cluster. This model is sometimes referred to as simplex-structured matrix factorization; see [19] and the references therein.

Most of the time, additional properties are enforced for the two factors $W$ and $H$. This can be done through hard constraints or through a penalty term, called a regularizer, added to the data fitting term in the objective function. Several models and algorithms have been designed, exploiting geometric or algebraic properties [17]. Among the most widely used techniques, minimum-volume NMF (MinVolNMF) [20–22], sparse NMF [23], and variants of archetypal analysis (AA) [24–26] have led to the best performances. For example, minimum-volume NMF aims at minimizing the volume delimited by the basis vectors, while sparse NMF imposes that the factors only contain a reduced number of non-zero entries. Moreover, these methods are starting to be supported by theoretical advances, such as identifiability results which provide conditions under which the ground-truth matrices $W$ and $H$ are unique (up to trivial ambiguities such as permutation and scaling); see [22] and the references therein. Some of these variants will be detailed in Section 3.2 as they have been extended to the multilayer case.

NMF is a non-convex and NP-hard problem [27] which is generally solved through an alternating scheme known as block-coordinate descent (BCD), as described in Algorithm 1. This consists in alternately optimizing one of the two factors of (4) while keeping the other fixed. Note that the corresponding subproblems are convex: they are non-negative least squares problems, which are efficiently solvable (see footnote 1).

Algorithm 1 Two-block coordinate descent to solve NMF

Input: Non-negative matrix $X$, rank $r$ of the factorization
Output: Matrices $W$ and $H$ minimizing (4)

1: Compute initial matrices $W^{(0)}$ and $H^{(0)}$, $t = 1$
2: while stopping criterion not met do
3:   $W^{(t)} = \text{update\_W}(X, W^{(t-1)}, H^{(t-1)})$
4:   $H^{(t)} = \text{update\_H}(X, W^{(t)}, H^{(t-1)})$
5:   $t = t + 1$
6: end while
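Algorithm 1 leaves the choice of `update_W` and `update_H` open. The sketch below is one possible instance, assuming the well-known multiplicative updates for the Frobenius norm rather than an exact non-negative least squares solver; the function name `nmf`, the random initialization, the fixed iteration count used as stopping criterion, and the safeguard `eps` are all assumptions of this example.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-12, seed=0):
    """Two-block coordinate descent for NMF with multiplicative updates (illustrative)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))          # line 1: random non-negative initialization
    H = rng.random((r, n))
    errors = []
    for _ in range(n_iter):         # line 2: fixed number of iterations as stopping rule
        # line 3: update W with H fixed
        W *= (X @ H.T) / np.maximum(W @ (H @ H.T), eps)
        # line 4: update H with the new W fixed
        H *= (W.T @ X) / np.maximum((W.T @ W) @ H, eps)
        errors.append(np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, errors

rng = np.random.default_rng(1)      # toy non-negative data
X = rng.random((20, 15))
W, H, errors = nmf(X, r=4)
print(errors[0], errors[-1])
```

Each multiplicative update keeps the factor non-negative and does not increase the objective of (4), so the recorded errors should be non-increasing up to rounding; an exact solver for each subproblem would typically make more progress per iteration at a higher cost.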

3 DEEP MF MODELS AND ALGORITHMS

Although CLRMA such as NMF with non-negativity constraints and SCA with sparsity constraints lead to a compact and meaningful representation of the input data, they are limited by the shallowness of the representation. Such techniques are only able to extract a single layer of features, which prevents them from revealing hierarchical features. While the standard matrix factorization decomposes the data matrix in only two factors, deep MF, inspired by the success of deep learning, is able to extract several layers of features in a hierarchical way, giving new insights in a broad range of applications.

Deep MF considers a product of matrices $W_l$ ($l = 1, \dots, L$) in place of a single matrix $W$ in the approximation; see (1). As constraints on the factors of this decomposition are necessary to make the model meaningful (see Section 1), the next sections present various models and algorithms of deep MF. We first present the evolution from the early multilayer models to the recent deep models in Section 3.1. Then, in Section 3.2, we describe the main variants, which are inspired by those of classical matrix factorizations. Section 3.3 describes the possible algorithmic choices and briefly discusses the computational cost.

3.1 From multilayer MF to deep MF

The first model extending CLRMA to several levels is multilayer NMF, proposed by Cichocki et al. in 2006 [28], [29]. Based on the hierarchical factorizations of a non-negative data matrix $X \in \mathbb{R}_+^{m \times n}$ as described by (2), multilayer MF decomposes $X$ in a sequential manner. At the first layer, a low-rank factorization of $X$ is computed such that $X \approx W_1 H_1$. At the second layer, the matrix $H_1$ is factorized as $H_1 \approx W_2 H_2$, and so on until $H_{L-1}$ is decomposed as $W_L H_L$; see Algorithm 2. Moreover, all the factors of the decomposition are constrained to be non-negative.

1. It can be solved, for example, in Matlab via the function lsqnonneg.


[Figure 2: diagrams of the factors $X$, $W_1, \dots, W_L$ and $H_1, \dots, H_L$ in three panels (a), (b), (c), annotated "Direction of decomposition", "Forward propagation" and "Backward propagation".]

Fig. 2: (a) MF, (b) multilayer MF [28], (c) deep MF [11]. An arrow means a matrix product is performed: $H \xrightarrow{W} X$ means that $H$ is multiplied by $W$ to approximate $X$.

Algorithm 2 Early multilayer NMF [28]

Input: Non-negative data matrix $X$, number of layers $L$, inner ranks $d_l$
Output: Matrices $W_1, \dots, W_L$ and $H_1, \dots, H_L$

1: $Y = X$
2: for $l = 1, \dots, L$ do
3:   $(W_l, H_l)$ = Algorithm 1 $(Y, d_l)$
4:   $Y = H_l$
5: end for
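The sequential scheme of Algorithm 2 can be sketched as follows. This is only an illustration: the inner solver `nmf` is a compact multiplicative-update routine standing in for Algorithm 1, and the function names, iteration counts and toy data are assumptions of this example.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-12, seed=0):
    """Stand-in for Algorithm 1: NMF with multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r))
    H = rng.random((r, X.shape[1]))
    for _ in range(n_iter):
        W *= (X @ H.T) / np.maximum(W @ (H @ H.T), eps)
        H *= (W.T @ X) / np.maximum((W.T @ W) @ H, eps)
    return W, H

def multilayer_nmf(X, ranks, n_iter=200):
    """Algorithm 2: factorize X, then H_1, then H_2, ... one layer at a time."""
    Ws, Hs = [], []
    Y = X                                   # line 1
    for l, d in enumerate(ranks):           # line 2
        W, H = nmf(Y, d, n_iter=n_iter, seed=l)   # line 3
        Ws.append(W)
        Hs.append(H)
        Y = H                               # line 4: the next layer factorizes H_l
    return Ws, Hs

rng = np.random.default_rng(0)              # toy non-negative data
X = rng.random((20, 15))
Ws, Hs = multilayer_nmf(X, ranks=(6, 4, 2))
print([W.shape for W in Ws], [H.shape for H in Hs])
```

Once a layer has been computed it is never revisited, which is exactly the limitation discussed next.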

However, the multilayer NMF of [29] does not fully exploit the hierarchical power of deep schemes, as the decomposition is purely sequential. More precisely, Algorithm 2 consists in sequentially minimizing the reconstruction errors $\|H_{i-1} - W_i H_i\|_F^2$ for all $i = 1, \dots, L$ with $H_0 = X$, but it does not involve a global cost function. In other words, the error is minimized layer by layer as in (4), but there is no feedback from the last layers to the first ones through some backpropagation mechanism.

A key improvement was achieved by the papers of Trigeorgis et al., who introduced deep MF [11], [30]. The data matrix $X$ still undergoes successive factorizations as in (2), but the breakthrough lies in the way the optimization of the factors $W_l$ and $H_l$ is performed. The main algorithmic novelty is the fact that, rather than using a purely sequential approach, the factors are iteratively updated and the information not only propagates from the first, more abstract layer, to the last, more refined layer, but also in the reverse direction. The following error function involving the factors of all layers is considered

$$\mathcal{L}(W_1, W_2, \dots, W_L; H_L) = \|X - W_1 W_2 \cdots W_L H_L\|_F^2, \qquad (5)$$

and a block-coordinate descent strategy is used to iteratively update all the factors. The deep MF algorithm [11] is described in Algorithm 3, and illustrated in Fig. 2(c). In Algorithm 3, arg reduce means that the factor is updated through some algorithm (see Section 3.3) that (typically) decreases the objective function for several iterations.

Algorithm 3 Deep semi-NMF [11]

Input: Data matrix $X$, number of layers $L$, inner ranks $d_l$
Output: Matrices $W_1, \dots, W_L$ and $H_1, \dots, H_L$

1: Compute initial matrices $W_l^{(0)}$ and $H_l^{(0)}$ for all $l$ through a sequential decomposition of $X$ (for example Algorithm 2)
2: for $k = 1, \dots$ do
3:   for $l = 1, \dots, L$ do
4:     $A_l^{(k)} = \prod_{j<l} W_j^{(k)}$
5:     $B_l^{(k)} = H_L^{(k-1)}$ if $l = L$, and $B_l^{(k)} = W_{l+1}^{(k-1)} H_{l+1}^{(k-1)}$ otherwise
6:     $W_l^{(k)} = \operatorname{arg\,reduce}_{W} \|X - A_l^{(k)} W B_l^{(k)}\|_F^2$
7:     $H_l^{(k)} = \operatorname{arg\,reduce}_{H \ge 0} \|X - A_l^{(k)} W_l^{(k)} H\|_F^2$
8:   end for
9: end for
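A minimal sketch of the loop structure of Algorithm 3 is given below. Several choices are assumptions of this example and not prescriptions of [11]: the initialization is random rather than sequential, line 6 uses the minimum-norm least squares solution $W = A^{+} X B^{+}$ (with $^{+}$ the pseudoinverse), line 7 performs a few projected gradient steps, and the names `deep_semi_nmf`, `n_outer` and `n_inner` are hypothetical.

```python
import numpy as np

def deep_semi_nmf(X, ranks, n_outer=30, n_inner=5, seed=0):
    """Illustrative loop structure of Algorithm 3 (deep semi-NMF)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    dims = [m] + list(ranks)
    L = len(ranks)
    # Assumption: random initialization instead of the sequential one of line 1.
    Ws = [rng.standard_normal((dims[l], dims[l + 1])) for l in range(L)]
    Hs = [rng.random((d, n)) for d in ranks]
    for _ in range(n_outer):                       # line 2
        for l in range(L):                         # line 3 (l is 0-based here)
            A = np.eye(m)
            for W in Ws[:l]:
                A = A @ W                          # line 4: product of the updated W_j, j < l
            # line 5: H_L for the last layer, W_{l+1} H_{l+1} otherwise
            B = Hs[l] if l == L - 1 else Ws[l + 1] @ Hs[l + 1]
            # line 6: unconstrained least squares in W (mixed signs allowed)
            Ws[l] = np.linalg.pinv(A) @ X @ np.linalg.pinv(B)
            # line 7: projected gradient steps on ||X - C H||_F^2 with H >= 0
            C = A @ Ws[l]
            sigma2 = max(np.linalg.norm(C, 2) ** 2, 1e-12)
            for _ in range(n_inner):
                Hs[l] = np.maximum(Hs[l] - C.T @ (C @ Hs[l] - X) / sigma2, 0.0)
    return Ws, Hs

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 10))                  # mixed-sign toy data
Ws, Hs = deep_semi_nmf(X, ranks=(5, 3))
print([W.shape for W in Ws], [H.shape for H in Hs])
```

The sketch only mirrors the order of the updates; it makes no claim about monotonic decrease of (5), since the intermediate matrices $H_l$ do not appear in that objective.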

Several comments can be formulated:

• First, the work of Trigeorgis et al. was inspired by semi non-negative matrix factorization (semi-NMF) [31], a variant of NMF where only one factor, typically $H$, must contain non-negative entries while the other, $W$, is allowed to contain mixed-sign elements. Therefore, this model should rather be called deep semi-NMF, as the $W_l$'s are not directly constrained to be non-negative, and the matrix $X$ is not required to have non-negative entries. However, one should keep in mind that each factor $H_l \approx W_{l+1} H_{l+1}$ is required to be non-negative, which implies an implicit constraint on the $W_l$'s. In practice, the requirement for non-negativity constraints on the basis vectors $W_l$ depends on the application: as most physical systems record non-negative data, it often makes sense to impose the non-negativity of the basis vectors. Therefore, if non-negativity of the basis vectors is meaningful, one can easily modify the model by adding non-negativity constraints on the $W_l$'s, and modifying line 6 of Algorithm 3.

• Second, to initialize all factors, a forward decomposition of the input matrix is employed, as done in Algorithm 2. Once all the factors are initialized, the updates of all matrices as in Algorithm 3 are performed until some stopping criterion is met.

• Third, the attentive reader may have noticed that Algorithm 3 does not correspond to applying a BCD method on (5) by optimizing the factors $(W_1, \dots, W_L, H_L)$ alternately. In fact, the non-negative matrices $H_l$ ($l = 1, \dots, L-1$) are intermediate variables that do not appear in (5). However, one needs to remember the underlying sequential decomposition of (2): as $H_l \approx W_{l+1} H_{l+1}$ ($l = 1, \dots, L-1$) are constrained to be non-negative, they have a dedicated update rule. However, this raises important research questions that have not been investigated much. In particular, other choices for $B_l^{(k)}$ are possible; for example, [32] considers $B_l^{(k)} = \left(\prod_{j>l} W_j^{(k-1)}\right) H_L^{(k-1)}$, while simply setting $B_l^{(k)} = H_l^{(k-1)}$ also makes sense, without a clear motivation why one should be preferred over the other. Moreover, how does replacing the function to minimize at line 7 by $\|H_{l-1}^{(k)} - W_l^{(k)} H\|$ change the iterates $H_l$? If non-negativity constraints are imposed on the $W_l$'s, a classical BCD can be applied to optimize alternately the factors of (5), as described in Algorithm 4, where only the variables $(W_1, \dots, W_L, H_L)$ are alternately updated. Note that non-negativity constraints can be replaced with other constraints such as sparsity.

Algorithm 4 BCD to minimize (5)

Input: Data matrix $X$, number of layers $L$, inner ranks $d_l$
Output: Matrices $W_1, \dots, W_L$ and $H_L$ minimizing (5)

1: Compute initial matrices $W_l^{(0)}$ for all $l$ and $H_L^{(0)}$
2: for $k = 1, \dots$ do
3:   for $l = 1, \dots, L$ do
4:     $A_l^{(k)} = \prod_{j<l} W_j^{(k)}$
5:     $B_l^{(k)} = \left(\prod_{j>l} W_j^{(k-1)}\right) H_L^{(k-1)}$
6:     $W_l^{(k)} = \operatorname{arg\,reduce}_{W \ge 0} \|X - A_l^{(k)} W B_l^{(k)}\|_F^2$
7:   end for
8:   $H_L^{(k)} = \operatorname{arg\,reduce}_{H \ge 0} \|X - W_1^{(k)} \cdots W_L^{(k)} H\|_F^2$
9: end for

• Finally, the choice of the loss function itself is not obvious. In CLRMA, this issue has been investigated thoroughly, and several strategies exist, based on the statistics of the noise, or cross validation, among others [33]. In deep MF, the question of the structure of the loss function also arises. Is a loss function of the type

$$D(X, W_1 W_2 \cdots W_L H_L),$$

where $D(A, B)$ is a similarity measure between two matrices $A$ and $B$, a good choice? Or would a loss function that balances the contribution of each layer, such as

$$\mathcal{L} = D(X, W_1 H_1) + \lambda_1 D(H_1, W_2 H_2) + \dots + \lambda_{L-1} D(H_{L-1}, W_L H_L),$$

be more appropriate? This question has not been addressed yet, to the best of our knowledge. Moreover, most works in the deep MF literature have only considered the Frobenius norm. It would be worth investigating other similarity measures such as the Kullback-Leibler and Itakura-Saito divergences, which have been shown to be particularly appropriate for specific applications in the case of standard NMF [34], [35].
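To make the two candidate loss structures of the last comment concrete, the sketch below evaluates both with $D$ taken as a $\beta$-divergence. It assumes the usual entry-wise definition of the $\beta$-divergence, for which $\beta = 2$ gives half the squared Frobenius norm, $\beta = 1$ the Kullback-Leibler divergence and $\beta = 0$ the Itakura-Saito divergence; the function names and the toy factors are hypothetical.

```python
import numpy as np
from functools import reduce

def beta_divergence(A, B, beta):
    """Sum of entry-wise beta-divergences d_beta(A_ij | B_ij); needs A, B > 0 if beta < 2."""
    if beta == 1:    # Kullback-Leibler
        return np.sum(A * np.log(A / B) - A + B)
    if beta == 0:    # Itakura-Saito
        return np.sum(A / B - np.log(A / B) - 1)
    num = A**beta + (beta - 1) * B**beta - beta * A * B**(beta - 1)
    return np.sum(num) / (beta * (beta - 1))

def global_loss(X, Ws, H_last, beta=2):
    """D(X, W_1 W_2 ... W_L H_L): a single term involving all layers."""
    return beta_divergence(X, reduce(np.matmul, Ws) @ H_last, beta)

def layered_loss(X, Ws, Hs, lambdas, beta=2):
    """D(X, W_1 H_1) + sum_l lambda_l D(H_l, W_{l+1} H_{l+1})."""
    targets = [X] + Hs[:-1]
    weights = [1.0] + list(lambdas)
    return sum(w * beta_divergence(T, W @ H, beta)
               for w, T, W, H in zip(weights, targets, Ws, Hs))

rng = np.random.default_rng(0)       # strictly positive toy factors
W1, W2 = rng.random((6, 4)) + 0.1, rng.random((4, 3)) + 0.1
H2 = rng.random((3, 5)) + 0.1
H1 = W2 @ H2                         # exact second-layer factorization
X = W1 @ H1 + 0.05                   # data that the model does not fit exactly
for beta in (0, 1, 2):
    print(beta, global_loss(X, [W1, W2], H2, beta),
          layered_loss(X, [W1, W2], [H1, H2], lambdas=[1.0], beta=beta))
```

In this toy case the second layer is exact, so both losses reduce to the first-layer term; in general they differ as soon as $H_1 \ne W_2 H_2$.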

To conclude, a comparison of one-layer matrix factorization, multilayer MF [28] and deep MF [11] is illustrated in Fig. 2. Multilayer MF in Fig. 2(b) and deep MF in Fig. 2(c) both perform several levels of decomposition, but the key difference is the iterative nature of the update rules in deep MF, while the decomposition is only sequential, that is, unidirectional, in multilayer MF.
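The iterative scheme can be made concrete with a sketch of Algorithm 4, the BCD on (5) with non-negative factors. As an assumption of this example, each "arg reduce" is a single projected gradient step with step size equal to the inverse of the Lipschitz constant of the block gradient, which guarantees that no step increases (5); the names `chain`, `pg_step` and `deep_nmf_bcd`, the random initialization and the toy data are hypothetical.

```python
import numpy as np
from functools import reduce

def chain(mats, size):
    """Product of a list of matrices; identity of the given size if the list is empty."""
    return reduce(np.matmul, mats, np.eye(size))

def pg_step(X, A, W, B):
    """One projected gradient step on ||X - A W B||_F^2 subject to W >= 0."""
    lip = 2 * np.linalg.norm(A, 2) ** 2 * np.linalg.norm(B, 2) ** 2
    grad = 2 * A.T @ (A @ W @ B - X) @ B.T
    return np.maximum(W - grad / max(lip, 1e-12), 0.0)

def deep_nmf_bcd(X, ranks, n_outer=100, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    dims = [m] + list(ranks)
    L = len(ranks)
    Ws = [rng.random((dims[l], dims[l + 1])) for l in range(L)]   # line 1
    H = rng.random((ranks[-1], n))
    errors = []
    for _ in range(n_outer):                                      # line 2
        for l in range(L):                                        # line 3 (0-based l)
            A = chain(Ws[:l], m)                                  # line 4: updated factors
            B = chain(Ws[l + 1:], dims[l + 1]) @ H                # line 5: previous factors
            Ws[l] = pg_step(X, A, Ws[l], B)                       # line 6
        A = chain(Ws, m)
        H = pg_step(X, A, H, np.eye(n))                           # line 8
        errors.append(np.linalg.norm(X - A @ H, "fro") ** 2)
    return Ws, H, errors

rng = np.random.default_rng(1)
X = rng.random((12, 10))
Ws, H, errors = deep_nmf_bcd(X, ranks=(5, 3))
print(errors[0], errors[-1])
```

Unlike the sequential scheme, every factor is revisited at each outer iteration, so later layers can influence earlier ones.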

3.2 Variants and regularizations of deep MF

Besides the standard models presented in the previous section, some variants have been studied in the recent literature. These variants consist in adding constraints on the factors or enforcing some properties, and are mostly inspired by CLRMA. As highlighted in Section 1, without any additional constraints on the factors, deep MF admits highly non-unique decompositions. The uniqueness of the solution is critical to ensure reproducibility and interpretability of the results. Depending on the application at hand, various constraints and regularizations can be used. In this section, we briefly review some of these models. In many of them, non-negativity is assumed on the factors, and the variants are therefore closely related to various NMF models.
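The non-uniqueness recalled above is easy to check numerically. The following sketch, with arbitrary toy dimensions and a hypothetical invertible matrix $Q$, replaces $W_1$ by $W_1 Q$ and $W_2$ by $Q^{-1} W_2$ and verifies that the product, and hence the approximation error, is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d1, d2, n = 6, 4, 3, 5
W1 = rng.standard_normal((m, d1))
W2 = rng.standard_normal((d1, d2))
H2 = rng.random((d2, n))

# A well-conditioned invertible Q: an orthogonal matrix with rescaled columns.
Q = np.linalg.qr(rng.standard_normal((d1, d1)))[0] * np.array([0.5, 1.0, 2.0, 4.0])

W1_alt = W1 @ Q
W2_alt = np.linalg.solve(Q, W2)      # Q^{-1} W2 without forming the inverse

same_product = np.allclose(W1 @ W2 @ H2, W1_alt @ W2_alt @ H2)
factors_differ = not np.allclose(W1, W1_alt)
rank_of_product = np.linalg.matrix_rank(W1 @ W2)
print(same_product, factors_differ, rank_of_product)
```

The rank of $W_1 W_2$ is also at most $\min(d_1, d_2)$, in line with the remark of Section 1 that the unconstrained product collapses into a single low-rank factor.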

3.2.1 Deep orthogonal NMF

Orthogonal NMF (ONMF) [36] is a variant of NMF which imposes that the matrix $H$ is non-negative and row-wise orthogonal, that is, $H \ge 0$ and $H H^T = I_r$ where $I_r$ is the identity matrix of dimension $r$. In other words, all rows of $H$ are orthogonal to each other, and their $\ell_2$ norm is equal to 1. It is easy to see that these constraints imply that there is at most one non-zero value in each column of $H$. Hence each data point is only associated with one basis vector (one column of $W$), and ONMF is equivalent to a weighted variant of spherical k-means [37], which is a hard clustering problem. ONMF therefore imposes that each data point belongs to a single cluster which is represented by a single basis vector. This allows a very straightforward interpretation of ONMF factors: the columns of $W$ are cluster centroids, while the columns of $H$ assign each data point to its closest centroid (up to a scaling factor).
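The structure implied by these constraints can be illustrated by building a feasible $H$ directly from a hard clustering. In the sketch below, the label vector is a hypothetical assignment of six points to three non-empty clusters; the resulting matrix is non-negative, has orthonormal rows, and has exactly one non-zero entry per column.

```python
import numpy as np

labels = np.array([0, 2, 1, 0, 2, 2])    # hypothetical cluster index of each data point
r, n = 3, labels.size

H = np.zeros((r, n))
H[labels, np.arange(n)] = 1.0             # one non-zero entry per column
H /= np.linalg.norm(H, axis=1, keepdims=True)   # rescale each row to unit l2 norm

print(H @ H.T)                            # identity matrix of size r
print((H > 0).sum(axis=0))                # number of non-zeros per column
```

Rows with disjoint supports are automatically orthogonal, which is why non-negativity combined with $H H^T = I_r$ forces this one-cluster-per-point pattern.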

A relaxation of the orthogonality constraint consists in adding a penalty term of the form $\sum_{j<k} H(j, :) H(k, :)^T$ to the objective function. This is referred to as approximately orthogonal NMF (AONMF) [38].

The deep version of ONMF was introduced in [39] and enriched in [40]. The decomposition is slightly different from the one in multilayer and deep MF because, rather than having the activation matrices $H_l$ successively decomposed, they decompose the feature matrices $W_l$:

$$\begin{aligned}
X &\approx W_1 H_1, \\
W_1 &\approx W_2 H_2, \\
&\;\;\vdots \\
W_{L-1} &\approx W_L H_L,
\end{aligned}$$

leading to $X \approx W_L H_L \cdots H_1$, with each $H_l$ constrained to be non-negative and row-wise orthogonal, that is, $H_l \ge 0$ and $H_l H_l^T = I_r$ for all $l$. Similarly, deep AONMF adds a penalty to the objective that minimizes the inner products $H_l(j, :) H_l(k, :)^T$, for all $j \ne k$ in each layer $l$. Applying the successive decompositions over the basis matrices $W_l$ rather than the activation matrices $H_l$ as in [11] seems more natural: it allows one to directly interpret the basis vectors of a given layer as combinations of the basis vectors of the next layer. For example, if the ranks $d_l$ are chosen such that $d_L = d_{L-1} - 1$, $d_{L-1} = d_{L-2} - 1$, $\dots$, $d_2 = d_1 - 1$, each layer will merge two clusters of the previous layer while keeping the others unchanged, and hence deep ONMF performs a hierarchical clustering. This will be illustrated on two showcase examples in Section 4.1.

3.2.2 Deep sparse MF

A very common constraint considered in CLRMA is the sparsity of some factors, referred to as SCA and closely related to dictionary learning (see Section 3.2.7). It consists in limiting the number of non-zero elements of $W$ and/or $H$. Many papers have studied the case of one-layer sparse MF; see for example [23], [41–44] among others. The goal of sparse MF is to render the factors more interpretable. In particular, the fact that each column of $H$ contains only a few non-zero entries means that each data point is associated with a few basis vectors.

The extension of sparse NMF to the deep setting was proposed in [45]. Based on (2), an $\ell_1$ norm penalty is considered on each column of the matrices $W_l$ and/or on each column of the matrices $H_l$. Similarly to shallow sparse MF, the goal of sparse deep MF is to obtain sparse and easily interpretable factors at each layer. The subproblems with respect to the regularized factor can be efficiently solved, for example, through a proximal gradient descent method, such as the (fast) iterative shrinkage thresholding algorithm ((F)ISTA) [46]. Note that a normalization of the other factor should be used to avoid a pathological case where the entries of the factor for which sparsity is promoted tend to zero while those of the other factor tend to infinity, because of the scaling degree of freedom in such decompositions ($W_l H_l = (\alpha W_l)(H_l / \alpha)$ for any $\alpha > 0$). Furthermore, using the same regularization parameter for all columns of a factor at a given layer is discouraged. In practice, several regularization parameters can be tuned automatically to ensure balanced levels of sparsity [47]. Another sparse framework, inspired by multilayer NMF [28], consists in adding a regularizer based on the Dirichlet distribution on the columns of the factors [48].
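As an illustration of the proximal gradient approach mentioned above, the sketch below applies plain ISTA (without acceleration) to one subproblem, assuming the formulation $\min_{H \ge 0} \tfrac{1}{2}\|X - WH\|_F^2 + \lambda \sum_{k,j} H(k, j)$ with a single parameter $\lambda$ and a fixed $W$ with unit-norm columns. The function name `ista_sparse_h`, the single-parameter penalty and the toy data are assumptions of this example, not the exact setting of [45].

```python
import numpy as np

def ista_sparse_h(X, W, lam, n_iter=200):
    """ISTA for min_{H >= 0} 0.5 * ||X - W H||_F^2 + lam * sum(H), with W fixed."""
    step = 1.0 / max(np.linalg.norm(W, 2) ** 2, 1e-12)   # 1 / Lipschitz constant
    H = np.zeros((W.shape[1], X.shape[1]))
    objective = []
    for _ in range(n_iter):
        grad = W.T @ (W @ H - X)                  # gradient of the data fitting term
        # Proximal step: soft-thresholding combined with projection onto H >= 0.
        H = np.maximum(H - step * (grad + lam), 0.0)
        objective.append(0.5 * np.linalg.norm(X - W @ H, "fro") ** 2 + lam * H.sum())
    return H, objective

rng = np.random.default_rng(0)
W = rng.random((10, 4))
W /= np.linalg.norm(W, axis=0)                    # normalize the other factor
X = rng.random((10, 8))

H, objective = ista_sparse_h(X, W, lam=0.5)
lam_big = float((W.T @ X).max()) + 1.0            # large enough to zero out H entirely
H_zero, _ = ista_sparse_h(X, W, lam=lam_big)
print(np.count_nonzero(H), np.count_nonzero(H_zero))
```

Normalizing the columns of $W$ removes the scaling ambiguity discussed above: without it, the penalty could be made arbitrarily small by shrinking $H$ and inflating $W$.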

There exist many ways to impose sparsity on the factors, such as $\ell_0$-norm [49] or $\ell_{1/2}$-norm [50] regularizations, among others. Inspired by deep learning, dropout could also enforce sparsity. Dropout [51] consists in randomly “dropping” some activations during the learning process to improve generalization. It has recently been employed for one-layer NMF [52], and was shown to be equivalent to a deterministic low-rank regularizer [53]. It would be interesting to see to what extent dropout might regularize deep MF networks as well. Hence, deep sparse MF has not yet been explored to its full extent.

3.2.3 Deep non-smooth NMF

Non-smooth NMF (nsNMF) [54] consists in using a so-called smoothing matrix $S$ between $W$ and $H$ in NMF, which has the form

$$S = (1 - \theta) I_r + \frac{\theta}{r} e e^T,$$

where $e$ is the vector of all ones of appropriate dimension. Note that nsNMF reduces to NMF when $\theta = 0$. The parameter $\theta \in [0, 1)$ promotes the sparseness of both $W$ and $H$. Let us briefly explain why. We have $WS = (1 - \theta) W + \theta \bar{w} e^T$, where $\bar{w}$ is the average of the columns of $W$, that is, $\bar{w} = \frac{1}{r} W e$, and $\bar{w}$ is denser than any column of $W$. When $\theta > 0$, $WS$ therefore moves the columns of $W$ towards $\bar{w}$, and $W$ is sparser than the smoothed matrix $WS$.
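The effect of the smoothing matrix can be checked numerically. The sketch below builds $S$ for a small, randomly generated sparse nonnegative $W$ (the sizes, the sparsity level and $\theta = 0.5$ are arbitrary choices for illustration) and verifies that $WS$ is the stated convex combination of $W$ and its average column.

```python
import numpy as np

def smoothing_matrix(r, theta):
    """S = (1 - theta) * I_r + (theta / r) * e e^T, with e the all-ones vector."""
    e = np.ones((r, 1))
    return (1.0 - theta) * np.eye(r) + (theta / r) * (e @ e.T)

rng = np.random.default_rng(0)
r, theta = 4, 0.5
# Sparse nonnegative factor: roughly 40% of the entries are non-zero.
W = rng.random((6, r)) * (rng.random((6, r)) < 0.4)

S = smoothing_matrix(r, theta)
w_bar = W.mean(axis=1, keepdims=True)          # average of the columns of W
WS = W @ S

# WS = (1 - theta) W + theta * w_bar e^T
identity_holds = np.allclose(WS, (1 - theta) * W + theta * w_bar @ np.ones((1, r)))
nnz_W, nnz_WS = np.count_nonzero(W), np.count_nonzero(WS)
```

For nonnegative $W$ and $\theta \in (0, 1)$, every non-zero entry of $W$ stays non-zero in $WS$, so `nnz_WS` is never smaller than `nnz_W`.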

Deep nsNMF (dnsNMF) [32] introduces a smoothing matrix at each layer of (2), with a common fixed parameter $\theta$, that is, $X \approx W_1 S_1 H_1$ and $H_{l-1} \approx W_l S_l H_l$ for all $l = 2, \dots, L$ with $S_l = (1 - \theta) I_{d_l} + \frac{\theta}{d_l} e e^T$. Note that multilayer non-smooth NMF was already proposed in 2013 in [55] to empirically show how a multilayer architecture, very similar to the one proposed by Cichocki et al. [29], is able to extract features in a hierarchical way in the context of text mining, but without any backpropagation strategy.

3.2.4 Deep Archetypal Analysis

Archetypal analysis (AA) [24], also known as convex NMF [31], is a variant of NMF in which the basis vectors are constrained to be convex combinations of the data points. In other words, in addition to the constraints stated in (4), one should have $W = XA$ where $A \geq 0$ and $A^T e = e$. Intuitively, the basis vectors are restricted to lie in the convex hull of the data points, and can be interpreted as extremal points of the data set. Although the fitting error might be higher than in standard NMF, the closeness of the archetypes to the convex hull of the data gives them better interpretability.

The first proposal of deep AA was made in [56], to the best of our knowledge, for the acoustic scene classification task. Given a data matrix $X$ composed of $n$ temporal frames, each characterized by an $m$-dimensional feature vector, a discriminative representation is learnt through successive AA decompositions performed in a greedy forward way:

$$\begin{aligned}
X &\approx X A_1 H_1, \\
H_1 &\approx H_1 A_2 H_2, \\
&\;\;\vdots \\
H_{L-1} &\approx H_{L-1} A_L H_L.
\end{aligned}$$

However, schemes including a backpropagation stage do not seem to have been tested yet for deep AA.

Finally, the non-deterministic deep AA of Keller et al. [57] approximates the data points by samples drawn from a Gaussian distribution whose parameters are learnt through a deep encoding phase, and is based on the deep variational information bottleneck framework [58]. Since the model is probabilistic and non-linear, its spirit is quite different from that of the deep MF models presented previously.

Closely related to AA, concept factorization (CF) consists in approximating the basis elements as linear combinations of the data points. The difference with AA lies in the absence of the sum-to-one constraints on both $A$ and $H$. Again, the first model containing several levels of decomposition was purely sequential [59], and was outperformed by more recent approaches based on the deep MF algorithm; see for example [60]. We refer the reader to [61] for a comprehensive review of shallow and deep CF.


Fig. 3: Canonical polyadic decomposition of a 3-D tensor.

3.2.5 Semi-supervised settings

While the models presented so far were all unsupervised, some deep MF models are able to cope with available prior information in a semi-supervised fashion, such as deep weakly-supervised semi-NMF [11]. To handle side information, a weighted graph is built at each layer, where the nodes are the data points and two nodes are connected by an edge if they share the same label. In the simplest case, the graph weights, denoted by the $n \times n$ symmetric matrix $G_l$ for the $l$-th layer, are binary, that is, $G_l(i, j) = 1$ if $X(:, i)$ and $X(:, j)$ share the same label w.r.t. the features extracted at layer $l$. A smoothness regularization term is added to the loss function of (5) with the form:

$$\sum_{l=1}^{L} \lambda_l \sum_{\substack{j,k=1 \\ j \neq k}}^{n} \| H_l(:, j) - H_l(:, k) \|^2 \, G_l(j, k) = \sum_{l=1}^{L} \lambda_l \, \mathrm{Tr}(H_l L_l H_l^T) \qquad (6)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, that is, the sum of its diagonal elements, and $L_l = D_l - G_l$ is the Laplacian matrix at layer $l$ with $D_l$ a diagonal matrix such that $D_l(j, j) = \sum_{k=1}^{n} G_l(j, k)$ for $j = 1, \dots, n$. Intuitively, (6) enforces the hidden representations $H_l(:, j)$ and $H_l(:, k)$ of data points $j$ and $k$ that share the same label at layer $l$ to be as close as possible.
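The link between the pairwise penalty and the Laplacian trace in (6) can be checked numerically on a toy graph. In the sketch below, the labels and the matrix $H$ are made up for illustration. Note that when the double sum runs over all ordered pairs $(j, k)$ with a symmetric $G_l$, each pair is counted twice, so the pairwise sum equals twice the trace term; this constant factor can be absorbed in $\lambda_l$.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 1, 2])      # hypothetical labels for n = 6 data points
n = labels.size

# Binary graph: G(j, k) = 1 if points j and k share the same label (no self-loops).
G = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(G, 0.0)
D = np.diag(G.sum(axis=1))                  # degree matrix
Lap = D - G                                 # graph Laplacian

H = rng.random((3, n))                      # hidden representation with d_l = 3

# Sum over all ordered pairs j != k of ||H(:, j) - H(:, k)||^2 * G(j, k).
pairwise = sum(G[j, k] * np.sum((H[:, j] - H[:, k]) ** 2)
               for j in range(n) for k in range(n) if j != k)
trace_term = np.trace(H @ Lap @ H.T)
```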

When the available information is such that each data point might be associated with several labels, a dual-hypergraph Laplacian is built to grasp richer underlying information and is such that an edge can connect any number of vertices [62].

3.2.6 Deep tensor decomposition

The extension of deep MF to tensors, that is, arrays of more than two dimensions, has not yet been much investigated. The analog of MF in the tensor world is the canonical polyadic decomposition (CPD), which decomposes a tensor as the sum of rank-one tensors; see [63] and the references therein. Given a tensor $T \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_K}$ where $K$ is the order of the tensor (its number of dimensions), the CPD of rank $R$ aims at finding the vectors $a_j^{(i)} \in \mathbb{R}^{I_i}$ ($i = 1, \dots, K$, $j = 1, \dots, R$) such that

$$T \approx \sum_{j=1}^{R} a_j^{(1)} \circ a_j^{(2)} \circ \cdots \circ a_j^{(K)}, \qquad (7)$$

where the $\circ$ operator denotes the outer product. The principle of a CPD is illustrated in Fig. 3 for a 3-dimensional tensor.
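As an illustration of (7), the following sketch rebuilds a tensor from given CPD factors by summing $R$ rank-one tensors. The function name and the random factor matrices are assumptions made for the example; column $j$ of `factors[i]` plays the role of $a_j^{(i+1)}$.

```python
import numpy as np

def cpd_reconstruct(factors):
    """Sum of R rank-one tensors; factors[i] has shape (I_i, R)."""
    R = factors[0].shape[1]
    T = np.zeros(tuple(A.shape[0] for A in factors))
    for j in range(R):
        rank_one = factors[0][:, j]
        for A in factors[1:]:
            # Outer product with the j-th column of the next factor matrix.
            rank_one = np.multiply.outer(rank_one, A[:, j])
        T += rank_one
    return T

rng = np.random.default_rng(0)
A1, A2, A3 = rng.random((4, 2)), rng.random((5, 2)), rng.random((3, 2))
T = cpd_reconstruct([A1, A2, A3])            # 4 x 5 x 3 tensor of rank at most 2
```

For a 3-dimensional tensor, the same result is obtained with `np.einsum("ir,jr,kr->ijk", A1, A2, A3)`.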

Multilayer frameworks of tensor decomposition have been proposed in [64] for context-aware recommender systems (CARS), and in [65] for audio source separation in a mixed scene, as an extension of the matrix model of [66]. Deep tensor decompositions have also been introduced for action recognition in [67] and for the optimization of convolutional neural nets in [68].

However, all these models are quite different from each other and do not resemble deep MF as defined in this paper. No general framework has been introduced for deep tensor decompositions, and it would be worth investigating deep CPD. A particularly interesting feature of CPD is that it is identifiable under weak conditions (see for example [69] and the references therein), as opposed to standard MF, which could be leveraged in the deep setting.

3.2.7 Related models

In this section, we briefly introduce two models closely related to MF that have also been extended to a deep setting, namely transform learning and dictionary learning. They are not the main focus of the survey, so we encourage the interested reader to look at the references for more details.

Given an input data matrix $X \in \mathbb{R}^{m \times n}$, the goal of dictionary learning (DL) is to find a dictionary $D \in \mathbb{R}^{m \times r}$, whose columns are referred to as atoms, and a representation matrix $H \in \mathbb{R}^{r \times n}$ such that $X \approx DH$ and each column of $H$ is sparse, typically $k$-sparse (that is, with at most $k$ non-zero entries, where $k$ is a parameter). When the number $r$ of atoms is smaller than $m$, the dictionary is said to be undercomplete and DL is equivalent to SCA. However, DL typically looks for overcomplete dictionaries with $m \ll r$. A particularly interesting variant of DL is convolutional sparse coding (CSC), where the atoms are convolved with the representation matrix $H$. In [5], Papyan et al. proposed deep CSC where successive factorizations are performed exactly as in (2), with sparsity constraints on the $H_l$'s; the columns of $H_l$ are required to be $k_l$-sparse, where $k_l$ is the desired level of sparsity at layer $l$. Similarly to deep MF, this decomposition allows a hierarchical interpretation of the dictionaries extracted. When a sequential thresholding algorithm is applied to enforce sparsity, deep CSC is equivalent to the forward pass of a convolutional neural network (CNN).

Transform learning [70] also shares similarity with MF. Given a matrix $X$, it consists in premultiplying $X$ with $W$ to obtain $H$, such that $WX \approx H$. The matrix $H$ is promoted to be sparse through an $\ell_1$-norm regularizer while $W$ is constrained to be full rank. Recently, a sequential multilayer transform learning framework was proposed by Maggu and Majumdar [71], similar to the multilayer idea of Cichocki et al. [28]. At the first layer, the approximation $W_1 X \approx H_1$ is considered. Then, $H_1$ is premultiplied such that $W_2 H_1 \approx H_2$, and the process continues until $W_L$ and $H_L$ are found, with all $H_l$'s sparse and all $W_l$'s full-rank. Similarly to deep MF, the matrices $H_l$'s are hierarchical representations of the original data points that can be used for further clustering.

3.3 How to solve deep MF?

In this part, we briefly describe the initialization techniques as well as the algorithms that can be used to solve the sub-problems w.r.t. either $W_l$ or $H_l$ at lines 6 and 7 of Algorithm 3. As these algorithms are standard optimization techniques, we refer the reader to previous surveys [16], [18] for more details and references on their applications in CLRMA problems. We also discuss the choice of several parameters such as the number of layers $L$ and the inner ranks $d_l$'s.


3.3.1 Initializations

The most commonly used initialization of deep MF consists in applying a sequential decomposition of the data matrix $X$, but there is no guarantee about the quality of this initialization. For the initialization of each factor $W_l$ and $H_l$ of each layer $l$, several routines have been presented in the literature, for example random initializations, initializations based on the SVD of $X$ [11], or column subset selection methods that initialize $W$ with columns of $X$ [72]. The study of initialization techniques dedicated to deep MF is still an open direction of research.
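A sequential initialization of the kind described above can be sketched as follows for the nonnegative case. This is only an illustration under stated assumptions: each layer is factorized with the classical multiplicative updates for the Frobenius loss from a random starting point, and the function names, the number of iterations and the small constant `eps` guarding against division by zero are choices made for the example, not those of a specific paper.

```python
import numpy as np

def nmf_mu(X, r, n_iter=200, seed=0, eps=1e-10):
    """One-layer NMF X ~ W H with multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r))
    H = rng.random((r, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def sequential_init(X, ranks):
    """Greedy layer-wise scheme: X ~ W1 H1, then H1 ~ W2 H2, and so on."""
    Ws, Hs, current = [], [], X
    for layer, d in enumerate(ranks):
        W, H = nmf_mu(current, d, seed=layer)
        Ws.append(W)
        Hs.append(H)
        current = H                         # the next layer factorizes H_l
    return Ws, Hs

rng = np.random.default_rng(42)
X = rng.random((9, 15))                     # synthetic nonnegative data
Ws, Hs = sequential_init(X, ranks=[4, 3])
rel_err_layer1 = np.linalg.norm(X - Ws[0] @ Hs[0]) / np.linalg.norm(X)
```

The factors returned by `sequential_init` can then serve as a starting point for an alternating scheme that updates all layers jointly.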

3.3.2 Algorithms

Similarly to standard NMF, most algorithms for deep MF consist in alternately updating each factor while keeping the others fixed, as in Algorithm 3. The stopping criterion can be a maximum number of iterations, an insufficient decrease of the loss function, or an insufficient modification of the factors between two consecutive iterations.

The subproblems w.r.t. either $W_l$ or $H_l$ for any $l$ are typically solved using standard first-order optimization algorithms. Trigeorgis et al. [11] used a closed-form expression for the $W_l$'s and a multiplicative update (MU) for the $H_l$'s. MU is a well-known algorithm to solve NMF [73], and was also proposed to solve a sequential multilayer MF in [74]. Other techniques such as the projected gradient descent (PGD) method [75], possibly combined with an acceleration scheme such as Nesterov's [76], are widely used; see for example [32], [45]. PGD, a well-known first-order method to solve constrained optimization problems, is an extension of gradient descent (GD) where the iterates are projected onto the feasible set at each iteration. The acceleration consists in adding a momentum term to the gradient step to allow faster convergence. Another standard optimization scheme is the alternating direction method of multipliers (ADMM), which consists in reformulating the problem by decoupling the variables and minimizing the augmented Lagrangian. It is standard in the CLRMA literature [77], and it has also been used for constrained deep MF in [78].
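To make the PGD step and its acceleration concrete, here is a minimal sketch for one nonnegative subproblem in $H$ with $W$ fixed. It uses a standard Nesterov-type extrapolation sequence and a constant step size $1/L$; the function name, the synthetic data and the number of iterations are assumptions for the example, and the schemes used in [32], [45] may differ in their details.

```python
import numpy as np

def accelerated_pgd_h(X, W, n_iter=200):
    """Approximately solve min_{H >= 0} 0.5 * ||X - W H||_F^2 with W fixed."""
    WtW, WtX = W.T @ W, W.T @ X
    step = 1.0 / np.linalg.norm(WtW, 2)      # inverse Lipschitz constant of the gradient
    H = np.zeros((W.shape[1], X.shape[1]))
    Y, t = H.copy(), 1.0
    for _ in range(n_iter):
        # Gradient step at the extrapolated point, then projection onto H >= 0.
        H_new = np.maximum(Y - step * (WtW @ Y - WtX), 0.0)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        # Momentum term: extrapolate along the last displacement.
        Y = H_new + ((t - 1.0) / t_new) * (H_new - H)
        H, t = H_new, t_new
    return H

rng = np.random.default_rng(0)
W = rng.random((20, 5))
H_true = rng.random((5, 30))
X = W @ H_true                               # noiseless synthetic data
H = accelerated_pgd_h(X, W)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Dropping the extrapolation (setting `Y = H_new`) recovers plain PGD.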

Since most deep MF algorithms are based on first-order methods, their computational cost is linear w.r.t. the size of the input data and the ranks, and hence these methods are scalable. For example, the algorithm of Trigeorgis et al. [11] requires $O(Lt(mnd + (m+n)d^2))$ operations, where $t$ is the number of iterations and $d = \max_{l=1,\dots,L} d_l$.
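The $d^2$ term in this bound does not contradict the linear dependence on the ranks. Assuming $d \leq \min(m, n)$, which is the usual low-rank regime, the second term is dominated by the first:

$$(m + n) d^2 = md \cdot d + nd \cdot d \leq md \cdot n + nd \cdot m = 2mnd,$$

so that the overall cost is $O(Lt \, mnd)$, which is linear in the number $mn$ of entries of $X$ and in $d$.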

3.3.3 Parameters

The choice of the parameters of a deep MF model, and in particular the number of layers and the inner ranks, mainly depends on the application. In most cases, the number of layers used for the results reported in the literature does not exceed three. Moreover, the ranks tend to be chosen in decreasing order, as the first layers of the model are expected to capture attributes with a larger variance, thus requiring a larger capacity to encode them, while the last layers capture attributes with a lower variance [11]. This observation is also derived from the analogy with autoencoders: the inner layer of an autoencoder is generally the one that contains the fewest units, as the goal is to obtain a compact representation of the input data; see Section 5 for more details. On the other hand, when the decomposition is performed on the feature matrices such as in (6), the ranks should also be chosen in decreasing order [40]: given $W_1 \in \mathbb{R}^{m \times d_1}$ with $d_1$ columns in dimension $m$, it only makes sense to approximate the columns of $W_1 \approx W_2 H_2$ as linear combinations of $d_2 \leq d_1$ columns of $W_2$; see Section 4.1 for two numerical examples.

4 APPLICATIONS

We now describe several applications for which deep MF is useful.

CLRMA has already been used successfully for countless real-world applications, and deep MF models have contributed to improving the performance. However, a clear definition of deep MF is absent within the community, and the term has been used in very diverse ways. Besides, [79] mentions that some researchers call their model “deep MF” but use it for supervised tasks, or introduce a high degree of non-linearity inside it. This is for example the case of [80], which performs deep non-linear matrix completion, and of [81], where the inner representations are obtained through a deep highly non-linear MF architecture. The term “deep MF” was also given in [82] to an iterative procedure to solve classical MF through a deep unfolding of the iterations over time, though this has almost nothing in common with deep MF as we have defined it in this paper.

Therefore, in this part, we mainly focus on works in the same spirit as Trigeorgis et al. [11], aiming to extract hierarchical features in an unsupervised context, which has led to breakthrough results in several applications. This section is organized as follows. In Section 4.1, we present two simple showcase examples showing the ability of deep ONMF to extract hierarchical features. We choose deep ONMF because, as explained in Section 3.2.1, its factors are easily interpretable. A Matlab implementation is available online (see footnote 2) to allow the interested reader to explore these deep MF examples, and play with the different parameters. Then, in Section 4.2, we present several applications for which deep MF has been successfully used in the recent literature.

4.1 Two showcase examples

In this section, we detail two showcase examples on which deep MF reveals its inner workings. The first one is recommender systems (Section 4.1.1), and the second one is hyperspectral unmixing (HU) (Section 4.1.2). For both applications, we describe the results obtained with deep ONMF (see Section 3.2.1), which is a variant of deep MF particularly easy to interpret. Indeed, as each representation matrix $H_l$ contains only one non-zero entry per column, each data point is associated with a single cluster.

4.1.1 Recommender systems

Recommender systems consist in predicting the ratings of users over unseen items based on historical ratings on seen items. In other words, given an incomplete rating matrix $X \in \mathbb{R}^{m \times n}$ of $n$ users over $m$ items (such as movies), the goal is to predict the missing entries. A standard approach to perform this task is to factorize $X$ as the product of two

2. http://bit.ly/deepMF_v1


matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ where $W$ contains the ratings of $r$ basis users over the $m$ items and $H$ represents the proportions in which each user behaves as the $r$ basis users [83]. In the following, we describe how deep MF is able to extract hierarchical levels of basis users on a simple example.

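Let us consider a matrix $X \in \mathbb{R}^{9 \times 15}$ such that $X(i, j)$ contains the rating of user $j$ for movie $i$, and is between 1 (highly dislike) and 10 (highly like). Our goal is to apply deep MF on $X$ and show the hierarchy of basis users extracted. The matrix $X$ that will be considered throughout this synthetic example is the following:

$$X = \begin{pmatrix}
7 & 8 & 10 & 8 & 9 & 4 & 1 & 2 & 3 & 2 & 4 & 1 & 3 & 5 & 2 \\
8 & 8 & 7 & 8 & 10 & 5 & 1 & 2 & 6 & 1 & 4 & 2 & 2 & 1 & 2 \\
9 & 9 & 9 & 9 & 10 & 4 & 1 & 4 & 2 & 1 & 4 & 3 & 1 & 1 & 1 \\
3 & 1 & 2 & 1 & 3 & 8 & 8 & 7 & 8 & 10 & 3 & 4 & 1 & 2 & 2 \\
2 & 1 & 3 & 2 & 3 & 9 & 10 & 9 & 9 & 8 & 2 & 2 & 2 & 2 & 2 \\
1 & 2 & 1 & 1 & 2 & 8 & 8 & 9 & 9 & 8 & 3 & 2 & 3 & 3 & 1 \\
4 & 1 & 1 & 2 & 2 & 2 & 5 & 1 & 1 & 5 & 9 & 7 & 8 & 9 & 7 \\
2 & 2 & 2 & 1 & 2 & 2 & 6 & 2 & 1 & 3 & 8 & 8 & 8 & 8 & 8 \\
1 & 2 & 2 & 1 & 3 & 2 & 4 & 1 & 1 & 4 & 6 & 9 & 8 & 10 & 7
\end{pmatrix}.$$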

Let us suppose, for example, that the first three movies (first three rows of $X$) are horror movies, the next three are comedies, and the last three are biopics. We observe that the first five users mostly enjoy horror movies, the next five comedies, and the last five biopics. Note that we do not consider missing data in this example, as our goal is to interpret the deep MF decomposition rather than to predict missing entries.

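We apply deep ONMF on $X$ with $L = 2$, $d_1 = 4$, $d_2 = 3$, that is, by computing the following decomposition:

$$\begin{aligned}
X &\approx W_1 H_1, & H_1 H_1^T &= I_4, & (W_1, H_1) &\geq 0, \\
W_1 &\approx W_2 H_2, & H_2 H_2^T &= I_3, & (W_2, H_2) &\geq 0.
\end{aligned} \qquad (8)$$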

To render the interpretation of the factors easier, we relax the orthogonality constraints by only imposing that $H_l H_l^T$ is diagonal, which does not change the hard clustering interpretation but simply allows each row of the $H_l$'s to have a norm different from 1. In return, we normalize the $W_l$'s such that all their elements are between 0 and 10. This allows an easier comparison between the features extracted at each layer.

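Let us interpret such a decomposition, layer by layer. At the first layer, we have $X \approx W_1 H_1$, and the matrices $W_1$ and $H_1$ are as follows (the values are rounded to one digit of accuracy):

$$W_1 = \begin{pmatrix}
9.1 & 1.7 & 3.4 & 3.8 \\
8.9 & 1.1 & 4.9 & 2.7 \\
10.0 & 1.1 & 3.7 & 2.5 \\
2.2 & 10.0 & 8.6 & 3.0 \\
2.4 & 10.0 & 10.0 & 2.5 \\
1.5 & 8.9 & 9.6 & 3.0 \\
2.2 & 5.5 & 1.5 & 10.0 \\
2.0 & 5.0 & 1.8 & 9.9 \\
2.0 & 4.4 & 1.5 & 10.0
\end{pmatrix}, \qquad
H_1 = \begin{pmatrix}
0.88 & 0 & 0 & 0 \\
0.87 & 0 & 0 & 0 \\
0.92 & 0 & 0 & 0 \\
0.87 & 0 & 0 & 0 \\
1.05 & 0 & 0 & 0 \\
0 & 0 & 0.92 & 0 \\
0 & 0.92 & 0 & 0 \\
0 & 0 & 0.85 & 0 \\
0 & 0 & 0.92 & 0 \\
0 & 0.88 & 0 & 0 \\
0 & 0 & 0 & 0.82 \\
0 & 0 & 0 & 0.80 \\
0 & 0 & 0 & 0.79 \\
0 & 0 & 0 & 0.89 \\
0 & 0 & 0 & 0.71
\end{pmatrix}^T.$$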

The columns of $W_1$ are themselves combinations of those of $W_2$, as $W_1 \approx W_2 H_2$ with

$$H_2 = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1.02 & 0.98 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.$$

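At the second layer, we have $X \approx W_2 \tilde{H}_2$, with $\tilde{H}_2 = H_2 H_1$. As $d_2 = 3$, we expect that each column of $W_2$ corresponds to the profile of a basis user liking only one category of movies, which is indeed the case as we obtain

$$W_2 = \begin{pmatrix}
9.1 & 2.7 & 3.8 \\
8.9 & 3.3 & 2.7 \\
10.0 & 2.6 & 2.5 \\
2.2 & 9.1 & 3.0 \\
2.4 & 10.0 & 2.5 \\
1.5 & 9.3 & 3.0 \\
2.2 & 3.2 & 10.0 \\
2.0 & 3.1 & 9.9 \\
2.0 & 2.7 & 10.0
\end{pmatrix}, \qquad
\tilde{H}_2^T = \begin{pmatrix}
0.88 & 0 & 0 \\
0.87 & 0 & 0 \\
0.92 & 0 & 0 \\
0.87 & 0 & 0 \\
1.05 & 0 & 0 \\
0 & 0.91 & 0 \\
0 & 0.94 & 0 \\
0 & 0.84 & 0 \\
0 & 0.91 & 0 \\
0 & 0.90 & 0 \\
0 & 0 & 0.82 \\
0 & 0 & 0.80 \\
0 & 0 & 0.79 \\
0 & 0 & 0.89 \\
0 & 0 & 0.71
\end{pmatrix}.$$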

The first column of $W_2$ corresponds to a basis user liking horror movies, the second comedies, and the last biopics.
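The way the second layer merges clusters of the first layer can be reproduced from the values reported above. The sketch below rebuilds $H_1$ from the cluster of each user and its weight, multiplies it by $H_2$, and reads off the layer-2 cluster of each user. Only the rounded values printed in the text are used, so the product agrees with the reported layer-2 representation up to rounding; the variable names are chosen for the example.

```python
import numpy as np

# Layer-1 cluster (0-based) of each of the 15 users, with the reported weights.
clusters = [0, 0, 0, 0, 0, 2, 1, 2, 2, 1, 3, 3, 3, 3, 3]
weights = [0.88, 0.87, 0.92, 0.87, 1.05, 0.92, 0.92, 0.85, 0.92, 0.88,
           0.82, 0.80, 0.79, 0.89, 0.71]
H1 = np.zeros((4, 15))
H1[clusters, np.arange(15)] = weights       # one non-zero entry per column

H2 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.02, 0.98, 0.0],
               [0.0, 0.0, 0.0, 1.0]])

H2_tilde = H2 @ H1                          # 3 x 15 representation at layer 2
layer2_cluster = H2_tilde.argmax(axis=0)    # hard assignment of each user
gram = H1 @ H1.T                            # diagonal under the relaxed constraint
```

The second row of $H_2$ has two non-zero entries, which is exactly how the second and third basis users of layer 1 are merged into the "comedy" basis user of layer 2.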

The first and fourth columns of $W_1$ are identical to the first and third columns of $W_2$, respectively, while the second and third columns of $W_1$ bring more refined information. While the second column of $W_2$ only exhibits strong ratings for items 4 to 6 and low ones for the other items, the second and third columns of $W_1$ correspond to two different patterns, such that the second basis user of layer 2 can be seen as the merging of two more informative basis users of layer 1. Both the second and third columns of $W_1$ have high ratings over items 4 to 6, as does the second column of $W_2$, but the other ratings are different. Indeed, the second column of $W_1$ has intermediate ratings for biopics (items 7 to 9) but very poor ones for horror movies (items 1 to 3), and conversely for the third column. This level of granularity is not caught by the second layer of decomposition, which grasps the three more general patterns that appear at first sight when looking at $X$. This justifies the benefit of


using two layers of factorizations. To be more precise, as $d_1 > d_2$ in this example, deep MF first extracts 4 refined basis users and then gathers two of them at layer 2 to model the more global structure of the data.

Let us mention that a single-layer ONMF with $r = 4$ recovers matrices similar to $W_1$ and $H_1$, and similarly for the second layer when $r = 3$. However, single-layer factorizations do not exhibit any hierarchical relation between basis users.

4.1.2 Hyperspectral unmixing

Hyperspectral unmixing (HU) is a classical application of NMF, and many models taking into account various priors have been developed [84], [85]. A hyperspectral image is composed of $n$ pixels, each one characterized by the reflectance value (fraction of the light reflected) in $m$ wavelengths, which is referred to as its spectral signature. Representing this image as a matrix $X \in \mathbb{R}^{m \times n}$ where each column is the spectral signature of a pixel, the purpose of HU is to identify the spectral signatures of the $r$ materials present in the image, that is, the columns of $W$, as well as their respective abundances in every pixel, that is, the columns of $H$. However, the precise number of materials is not always easy to determine, as some materials have similar spectral signatures or are highly mixed. In this section, we consider the HYDICE Urban hyperspectral image, which is an airborne image of a Walmart in Copperas Cove (Texas); see Fig. 4. It is made of $n = 307 \times 307$ pixels with $m = 162$ spectral bands. There are several versions of the ground truth depending on the number of materials considered [86].

Deep ONMF is able to extract the materials in a hierarchical manner, as illustrated in Fig. 5, which shows the abundance maps $H_l$'s, representing the proportions of a given material in the pixels. This solution was obtained by applying deep ONMF on the Urban image with $L = 4$, $d_1 = 7$, $d_2 = 6$, $d_3 = 4$, $d_4 = 2$. The first layer extracts several materials, namely two types of grass, trees, road, dirt, metal and roof. At the next layers, the materials are successively merged two by two within a single cluster. At layer 2, the two clusters corresponding to road and metal, which have similar spectral signatures, are merged in a single cluster. At layer 3, the road/metal and dirt clusters are merged into a single cluster, while the two kinds of grass are also merged into a single cluster. At layer 4, the road and roof are merged, while trees and grass are also merged into a cluster made of vegetation. This example illustrates the ability of deep MF to extract materials in a hierarchical manner in hyperspectral images. Compared to traditional shallow extraction methods, deep MF brings a clear added value in terms of interpretability.

Fig. 6 provides a comparison between the extracted spectral signatures at the third ($d_3 = 4$) layer and a ground truth from [87]. The signatures retrieved by deep MF are similar to the ground truth, which indicates that deep MF is able to extract meaningful features through several layers.

4.2 Real-world applications of deep MF

In this section, we review several applications of deep MF presented in the literature.

Fig. 4: The HYDICE Urban hyperspectral image.

4.2.1 Recommender systems

We have already shown in Section 4.1.1 that deep MF extracts hierarchical information in the context of recommender systems. Several models based on deep MF have been proposed in the recent literature.

Mongia et al. [88] use a projected gradient descent method to tackle deep MF with missing entries in the data matrix $X$. They refer to their model as a deep latent factor model, and use it to infer missing entries while extracting several layers of explanatory factors, with a similar interpretation as in our showcase example in Section 4.1.

Xue et al. [89] derive a latent representation of both users and items through a so-called "deep factorization", though the model is different from deep MF as defined in this paper. More precisely, based on a rating matrix $Y \in \mathbb{R}^{m \times n}$ containing both explicit ratings and non-preference implicit feedback (corresponding to a 0 value) of $m$ users over $n$ items, each row $Y(i, :)$ is mapped to a vector $p_i$ such that $p_i = Y(i, :) W_1 \cdots W_L$, and similarly (with another set of matrices) for each column $Y(:, j)$, which is mapped into $r_j$. Then, a matrix $\hat{Y}$ is built, such that $\hat{Y}(i, j) = \frac{p_i^T r_j}{\|p_i\| \|r_j\|}$, and a cross-entropy based loss function between this similarity matrix and the original rating matrix $Y$ is minimized on a training subset of the data. This deep approach reaches higher performance than state-of-the-art MF methods in terms of the ranking of suggested items, based on their predicted ratings.
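The similarity matrix described above can be computed as in the following sketch, which follows the linear description given in this section. The function name, the random matrices and the small constant `eps` (which avoids a division by zero for all-zero rows or columns of $Y$) are assumptions for illustration; the training of the matrices through the cross-entropy loss is not shown.

```python
import numpy as np

def cosine_scores(Y, user_mats, item_mats, eps=1e-12):
    """Cosine similarity between mapped rows (users) and mapped columns (items) of Y."""
    P = Y.astype(float)
    for W in user_mats:
        P = P @ W                            # row i becomes p_i = Y(i, :) W_1 ... W_L
    R = Y.T.astype(float)
    for V in item_mats:
        R = R @ V                            # row j becomes r_j, built from Y(:, j)
    P = P / (np.linalg.norm(P, axis=1, keepdims=True) + eps)
    R = R / (np.linalg.norm(R, axis=1, keepdims=True) + eps)
    return P @ R.T                           # entry (i, j) is the cosine of (p_i, r_j)

rng = np.random.default_rng(0)
m, n = 6, 8                                  # m users, n items
Y = rng.integers(0, 6, size=(m, n))          # 0 encodes a missing (implicit) rating
user_mats = [rng.standard_normal((n, 5)), rng.standard_normal((5, 3))]
item_mats = [rng.standard_normal((m, 4)), rng.standard_normal((4, 3))]
Y_hat = cosine_scores(Y, user_mats, item_mats)
```

Both chains of matrices must end in the same dimension (3 here) so that the inner products $p_i^T r_j$ are well defined.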

Deep MF was also used in the context of recommender systems with implicit feedback, when the ratings are not given as a numerical value but as a binary feedback (such as "like" or "dislike") [90]. For each user and each item, a vector containing both a representation of this implicit feedback and side information is provided. Then, deep MF is applied separately on the users and items to derive meaningful representations $H_L^{(u_i)}$'s and $H_L^{(v_j)}$'s for all users $u_i$'s and items $v_j$'s respectively, where the inner dimension $d_L$ is the same for both factorizations. The rating $r_{ij}$ of user $i$ over item $j$ is predicted as $r_{ij} = H_L^{(u_i)T} H_L^{(v_j)} + G_{u_i} + G_{v_j}$, where $G_{u_i}$ and $G_{v_j}$ are obtained through maximum likelihood estimation and describe the specific influence of user and item, respectively. Using deep MF improves the root mean square error between the predicted and actual ratings compared to standard MF models on several benchmark datasets.

Fig. 5: Abundance maps hierarchically extracted by deep ONMF on the Urban data set. From top to bottom: first, second, third and fourth layer.

4.2.2 Multi-view clustering

Multi-view clustering consists in clustering items for which the data are described by several views, for example images described both by their pixels and by textual tags; see [91] for a survey. In [92] and [93], each data matrix $X^{(v)}$ of each of the $V$ views is deeply factorized and the last hidden representations $H_L^{(v)}$'s are constrained to be the same for all views. In [93], Cui et al. design the objective function as a weighted sum of the squared Frobenius norms of the residuals of each view, where the weights are also learned. A further refinement is proposed by Wei et al. [94], who add to the objective function a penalty term measuring the redundancy of the clusterings $H_i$ and $H_j$ of different layers $i$ and $j$. More precisely, the matrices $C^{(l)} = H_l^T H_l \in \mathbb{R}^{n \times n}$ for all $l$ indicate whether two data points are clustered identically or not at layer $l$. A penalty aiming at minimizing $\|C^{(i)} \odot C^{(j)}\|_1$ is added to the objective for each pair of layers to avoid redundancy, where $\odot$ denotes the element-wise multiplication.
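The redundancy penalty can be illustrated on toy hard clusterings of four points. In the sketch below, the matrices are made up for the example: `H_a` and `H_b` encode two different partitions of the same points, and the penalty is larger when a clustering is compared with itself than when it is compared with a different one.

```python
import numpy as np

def coclustering(H):
    """C = H^T H: entry (p, q) is non-zero when points p and q share a cluster."""
    return H.T @ H

def redundancy(H_i, H_j):
    """Entrywise l1 norm of the element-wise product of the co-clustering matrices."""
    return np.abs(coclustering(H_i) * coclustering(H_j)).sum()

# Two hard clusterings of n = 4 points into 2 clusters (one non-zero per column).
H_a = np.array([[1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0]])      # clusters {0, 1} and {2, 3}
H_b = np.array([[1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 0.0, 1.0]])      # clusters {0, 2} and {1, 3}

same = redundancy(H_a, H_a)                  # identical clusterings
different = redundancy(H_a, H_b)             # only the diagonal entries overlap
```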

A semi-supervised variant is considered by Xu et al. [95] with a graph Laplacian penalty aiming to both minimize the gap between the inner representations $H_L$ of instances sharing the same label and maximize the gap between the inner representations $H_L$ of instances belonging to different classes. Huang et al. [96] constrain the entries of $H_L$ to be either 0 or 1. Finally, when the data are given through several views, such as images and documents, Xiong et al. [97] show that binary hashing codes derived through deep MF are able to find meaningful items with a binary code close to the one of a given query. In addition to the data fitting error, the loss function contains terms that aim at finding a unified latent representation $H$, and at minimizing the classification error of a linear classifier based on $H$. Moreover, each entry of the unified code matrix $H$ is constrained to be either +1


[Figure 6 panels: Road, Grass, Tree, Roof. Horizontal axis: wavelength band number (20 to 160). Vertical axis: reflectance (0 to 1). Legend: ground truth from [84], deep MF endmembers.]

Fig. 6: Comparison between the endmembers extracted by deep ONMF at the third (d3 = 4) layer and the "ground-truth"endmembers for the HYDICE Urban hyperspectral image.

or −1.

4.2.3 Community detection

Community detection consists in identifying communities, that is, subsets of nodes that are highly connected, inside a given graph. While NMF is able to extract overlapping communities [98], deep MF makes it possible to interpret the dynamics along which the nodes are progressively grouped. More precisely, taking the adjacency matrix as input of deep MF leads to the extraction of the membership coefficients of all nodes to $d_l$ communities, $H_l \in \mathbb{R}^{d_l \times n}$, at layer $l$, with $d_L \leq d_{L-1} \leq \cdots \leq d_0 = n$ [99]. The interest of the deep architecture lies in the fact that nodes belonging to the same community gather closer to each other in terms of inner representations as the layers go deeper. In other words, deep MF extracts communities at different scales: smaller communities at the first layers are merged together into larger communities in the deeper layers as deep MF unfolds.

4.2.4 Hyperspectral unmixing

As illustrated in Section 4.1, deep MF can be used meaningfully for HU, extracting several layers of materials.

The early sequential multilayer NMF of Cichocki et al. [28] (see (2)) was used, together with sparsity regularization, by Rajabi and Ghassemian [100]. Though the endmembers are estimated more accurately, no additional insight is given on the interpretability power of the model. Later, Tong et al. [101] use the deep model of Trigeorgis et al. [11] and show that it is efficient for the extraction of endmembers, though the interpretation of the successive inner representations is not emphasized. A similar approach takes into account an additional regularization [102]: on the one hand, a sparsity constraint is considered on the abundance matrix $H_L$, while on the other hand, a spatial regularization is applied through total variation minimization (TVM). In a nutshell, TVM [103] is a well-known regularization which consists in computing the differences between the abundances of each pair of adjacent pixels and minimizing their sum, to reduce the noise and get a smooth abundance map. A deep purely sequential model of archetypal analysis (deep AA), similar to the one developed by [56] (see Section 3.2.4), was also used for HU in [104].
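The total variation of an abundance map, as described above, can be computed in a few lines. The sketch below uses the anisotropic variant (sum of absolute differences between horizontally and vertically adjacent pixels), which is an assumption made for illustration since the exact variant used in [102] is not specified here; the two toy maps are also made up.

```python
import numpy as np

def total_variation(abundance_map):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    vertical = np.abs(np.diff(abundance_map, axis=0)).sum()
    horizontal = np.abs(np.diff(abundance_map, axis=1)).sum()
    return vertical + horizontal

# A constant map has zero total variation; a checkerboard is maximally irregular.
smooth_map = np.full((4, 4), 0.5)
checkerboard = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)

tv_smooth = total_variation(smooth_map)
tv_checker = total_variation(checkerboard)
```

In HU, one row of $H_L$ reshaped to the image size (for example with `H_L[k].reshape(height, width)`, where the pixel ordering is an assumption) plays the role of `abundance_map`.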

4.2.5 Synthetic aperture radar (SAR)

SAR change detection consists in analysing the changes that appear on the surface of the Earth through high-resolution SAR images. Given two images of the same location at different times, the goal of SAR change detection is to produce a binary map indicating the changed and unchanged pixels over the considered period.

Gao et al. [105] use deep semi-NMF to cluster the pixels in three categories: "unchanged", "changed" and "intermediate". Then, a more refined classification step accurately determines which pixels of the landscape have changed or not. Similarly, Li et al. propose to solve the SAR change detection problem with non-smooth deep MF [106]. The framework is again semi-supervised since a classification stage aims at reconstructing the label matrix based on the inner representation matrix $H_L$.

4.2.6 Audio processing

The audio source separation problem consists in extracting the frequential spectra of the sources contained in a sound


recording as well as their respective activations over time. NMF has been shown to be efficient to solve this problem when the matrix $X$ is a time-frequency representation of the input data, for example the spectrogram obtained with a short-time Fourier transform (STFT); see [34] and the references therein.

Sharma et al. [107] use deep MF for speech recognition: given a matrix $X$ of $n$ frames, each one corresponding to a sequence of successive words and described by an $m$-dimensional vector of the well-known cepstral coefficients [108], a deep MF alternating sparse and dense layers factorizes $X$. The authors empirically notice that alternating sparse ($l$ odd) and dense ($l$ even) layers leads to more discriminative features. The features used for the classification of the frames are obtained by applying a PCA on the concatenation of the inner representations $H_l$'s corresponding to sparse layers.

Hsu et al. [109] apply deep MF on the spectrogram matrix of a set of spoken sentences to extract several layers of frequential basis features; the resulting model is better able to separate the speakers in a mixture than a simple one-layer NMF. Thakur et al. [110] used deep AA to extract sources based on the spectrograms of bioacoustics signals: the dictionaries learnt at the first layers correspond to archetypes on the convex hull of the data, while deeper atoms lie closer to the center of the data. The classification accuracy obtained with an SVM based on the inner representations $H_L$'s is higher than that of other state-of-the-art classification methods. An extension was proposed in [111] with a more sophisticated classification approach.

4.2.7 Perspectives

Deep MF does not seem to have been tested yet on several applications in which it has significant potential. For example, in text mining tasks, it seems logical that hierarchical structures appear. Given a word-by-document matrix $X$ where the entry $X(i,j)$ is the number of times the word $i$ appears in the document $j$, NMF automatically extracts topics as the columns of the basis matrix $W$, while $H$ indicates which document discusses which topic [7]. In this context, deep NMF would be able to extract hierarchies of topics, from coarser to finer topics: the first layers would extract general topics such as politics, geography and sports, while the deeper layers would refine these topics in sub-topics. For instance, sports would be divided into tennis, soccer and golf, while soccer would contain results from different competitions. Note that NMF is known to be a simple topic model equivalent to probabilistic latent semantic analysis/indexing (PLSA/PLSI) [112], and designing refined deep models would be of particular interest, similarly as done for NMF [113].
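To make the topic-extraction idea concrete, here is a minimal sketch of a two-layer sequential factorization of a word-by-document matrix. It assumes the Frobenius loss and the classical multiplicative updates for each NMF; the helper `nmf`, the random toy counts and the ranks 4 and 2 are hypothetical choices made for illustration only, not a method prescribed by the works cited above.

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0, eps=1e-12):
    """Frobenius-norm NMF X ~ W H with multiplicative updates (a basic sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r)) + 0.1
    H = rng.random((r, n)) + 0.1
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

# Toy word-by-document count matrix: 6 words, 8 documents (made-up data).
rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(6, 8)).astype(float)

W1, H1 = nmf(X, r=4)    # first layer: X ~ W1 H1, columns of W1 are topics over words
W2, H2 = nmf(H1, r=2)   # second layer: H1 ~ W2 H2, relating the two levels of topics

second_level_topics = W1 @ W2   # second-layer topics expressed over the words
rel_error = np.linalg.norm(X - W1 @ H1) / np.linalg.norm(X)
```

The matrix `W2` links the two levels: its entry `W2[i, k]` indicates how strongly the `i`-th first-layer topic contributes to the `k`-th second-layer topic.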

Though this survey focuses on linear deep MF, some applications would benefit from the introduction of non-linearities (see Section 5 for more details). For example, in hyperspectral unmixing, scattering and various interactions may justify the use of non-linear models [114]. Therefore, it would be interesting to investigate non-linear deep MF in view of the specific requirements of the applications.

5 CONNECTIONS WITH NEURAL NETWORKS

Several connections can be made between deep MF and deep learning. However, we restrict ourselves as much as possible to models aiming at extracting features from data in an unsupervised and interpretable way. Works embedding MF ideas in a neural network architecture, such as [115–118], are interesting but are further away from the focus of this survey.

Deep artificial neural networks [8] have been known for several years as one of the best classification paradigms. On Fig. 7, we have represented a standard neural network made of a succession of $P$ fully-connected layers³. Each layer $k$, $k = 0, \dots, P-1$, is made of $s_k$ units. Let us consider a data matrix $X \in \mathbb{R}^{m \times n}$ of $n$ points in dimension $m = s_0$ and a binary label matrix $Y \in \mathbb{R}^{c \times n}$ indicating the membership of each data point $X(:,j)$ to each of the $c = s_{P-1}$ classes, that is, $Y(i,j) = 1$ if $X(:,j)$ belongs to the $i$-th class. Given $X(:,j)$ for any $j$ as input, the network produces as output a $c$-dimensional vector $\hat{Y}(:,j)$. Calling $Z_k \in \mathbb{R}^{s_k \times s_{k-1}}$, $k = 1, \dots, P-1$, the weight matrix between layer $k-1$ and layer $k$, the first layer computes a vector $M_1(:,j) = g(Z_1 X(:,j))$ where $g$ is a non-linear activation function applied element-wise. Then, any layer $k$, $k = 2, \dots, P-1$, computes $M_k(:,j) = g(Z_k M_{k-1}(:,j))$, with $M_{P-1}(:,j) = \hat{Y}(:,j)$. The goal of the neural network is to classify the data points $X(:,j)$'s at best, that is, optimize the $Z_k$'s such that the prediction $\hat{Y}(:,j)$ is as close as possible to the ground-truth $Y(:,j)$ for all $j$. Overall, considering all the data points, the prediction matrix is given by $\hat{Y} = g(Z_{P-1} g(Z_{P-2} \cdots g(Z_1 X)))$.
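The forward computation of the prediction matrix can be sketched as follows. This is an illustrative sketch only: the layer sizes, the random weights and the choice of ReLU for $g$ are assumptions, the weight shapes are chosen so that each product is defined, and $g$ is applied at every layer as in the overall formula for the prediction matrix.

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit, used here as an example of g."""
    return np.maximum(x, 0.0)

def forward(X, weights, g=relu):
    """Return the layer outputs [M_1, ..., M_{P-1}]; the last one is the prediction."""
    outputs = []
    M = X
    for Z in weights:        # Z_k has shape (s_k, s_{k-1}) so that Z_k @ M_{k-1} is defined
        M = g(Z @ M)
        outputs.append(M)
    return outputs

sizes = [5, 4, 3, 2]         # s_0 = m, ..., s_{P-1} = c for P = 4 layers (arbitrary values)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((sizes[k], sizes[k - 1])) for k in range(1, len(sizes))]
X = rng.standard_normal((sizes[0], 7))   # n = 7 data points stored as columns

layer_outputs = forward(X, weights)
Y_hat = layer_outputs[-1]                # c x n prediction matrix
```

Training would then adjust the weights so that `Y_hat` approaches the label matrix; that optimization step is deliberately left out of this sketch.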

Autoencoders [119] are particular neural networks where the output matrix does not correspond to a membership matrix but is identical to the input, that is, $Y = X$. Assuming that the number of layers $P$ is odd, the purpose of an autoencoder is to extract a compressed representation $M_Q$ of the input data at the central layer $Q = \frac{P-1}{2}$ through the encoder, and approximate as well as possible the initial data back after the decoder layers. Fig. 8 (a) provides an illustration when the encoder and decoder are symmetric, that is, $s_k = s_{P-1-k}$ for all $k = 0, \dots, P-1$ and $Z_k = Z_{P-k}^T$ for all $k = 1, \dots, P-1$. This leads to the following approximation

$$X \approx \hat{Y} = g(Z_1^T g(Z_2^T \cdots g(Z_Q^T M_Q))).$$

Let us number the layers in the reverse sense, that is, let us consider $l = P-1-k$ and $d_l = s_l$ for $l = 1, \dots, L$. Let us also denote $W_l = Z_{P-l} = Z_l^T$ and $H_l = M_{P-1-l}$, with $L = Q$ such that the decoder performs the following decomposition:

$$X \approx \hat{Y} = g(W_1 g(W_2 \cdots g(W_L H_L))). \tag{9}$$

When the activation function $g$ is the identity, (9) becomes

$$X \approx \hat{Y} = W_1 \cdots W_L H_L, \tag{10}$$

which corresponds to a so-called linear network. The decomposition performed by (10) is the same as deep MF but deep MF usually requires additional constraints, such as the non-negativity of some factors, to render the solution meaningful and interpretable (see Section 1).
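The relation between the decoder (9) and the linear network (10) can be checked numerically with a small sketch. The dimensions, the random non-negative factors and the helper name `decode` are assumptions chosen for illustration.

```python
import numpy as np

def decode(Ws, H_L, g=lambda x: x):
    """Evaluate g(W_1 g(W_2 ... g(W_L H_L))) from the innermost layer outwards."""
    out = H_L
    for W in reversed(Ws):
        out = g(W @ out)
    return out

rng = np.random.default_rng(0)
m, n = 6, 10
ranks = [4, 3, 2]                        # d_1, d_2, d_3 (arbitrary illustrative values)
dims = [m] + ranks
# W_l has shape (d_{l-1}, d_l) with d_0 = m; entries are non-negative by construction.
Ws = [rng.random((dims[l], dims[l + 1])) for l in range(len(ranks))]
H_L = rng.random((ranks[-1], n))

linear_out = decode(Ws, H_L)                          # identity activation, as in (10)
product = np.linalg.multi_dot(Ws + [H_L])             # W_1 W_2 W_3 H_3 computed directly
relu_out = decode(Ws, H_L, g=lambda x: np.maximum(x, 0.0))  # ReLU activation, as in (9)
```

With the identity activation the decoder output coincides with the plain matrix product. In this toy setting all factors are non-negative, so the ReLU version returns the same matrix as well; this would no longer hold if some factors had negative entries.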

3. We use an unusual naming of the parameters to avoid confusion with the notation introduced for deep MF.

Fig. 7: Illustration of an artificial neural network. The input layer 0 receives the rows $X(1,:), \dots, X(m,:)$, the hidden layers 1 to $P-2$ apply the activation $g$ and produce $M_1, \dots, M_{P-2}$, and the output layer $P-1$ returns the rows $Y(1,:), \dots, Y(c,:)$; the weight matrices $Z_1, \dots, Z_{P-1}$ connect consecutive layers.

A widely used activation function is the rectified linear unit, that is,

$$g(x) = \mathrm{ReLU}(x) = \max(x, 0).$$

In this setting, each inner representation matrix $H_{l-1} = g(W_l H_l)$ for $l = 2, \dots, L$ is constrained to be non-negative, as in the original deep MF of [11]. Though such a network is very similar to deep MF, as shown on Fig. 8 (b), the two models are not exactly equivalent since the representation matrix $H_L$ of an autoencoder is learnt in a supervised way and is given by $H_L = g(W_L^T g(W_{L-1}^T \cdots g(W_1^T X)))$, which makes it close to deep archetypal analysis. In fact, autoencoders are mainly used in semi-supervised settings, for example to pre-train the networks for classification tasks, while deep MF mines unknown hierarchical features hidden in the data set. This connection suggests that the ranks of the factorization $d_l$'s in deep MF should be chosen in a decreasing order, as the central layer of an autoencoder, corresponding to $H_L$, is usually the smallest one.

Interestingly, the use of non-negativity constraints on the representation matrices within an autoencoder with a single layer in the encoder has been shown to produce parts-based representations, as for NMF, while the overfitting was reduced [120]. Also, improvements were achieved in [121] using a deep structure promoting sparse activations $H_l$'s, which is similar to deep sparse MF. However, only the pretraining stage is unsupervised and is similar to deep MF (though non-linear activations are used), while a supervised classification stage follows. This connection between neural networks and deep MF was also highlighted in [122] where the discriminative power of such a hierarchical model was observed on topic mining and audio source separation tasks.

Also inspired by deep learning non-linearities, Trigeorgis et al. [11] proposed to introduce a non-linear function, such as the sigmoid, at each layer of the model (2), that is, use $H_{l-1} = g(W_l H_l)$ for all $l$ where $g(x) = \frac{1}{1+e^{-x}}$, which still provides a parts-based decomposition but at the cost of a possibly weaker interpretability [123].

Similarly, deep AA is closely related to neural networks. An archetypal regularization based on an autoencoder was proposed by van Dijk et al. [124]. The latent representation $H$ is learnt through a deep encoder performing a non-linear transformation of the input data and the addition of Gaussian noise to $H$ enforces the basis vectors to be close to the data at the decoding layer. More precisely, the noise pushes the columns of $H$ outside the unit simplex, which in turn enforces the columns of $W$ to shrink in order to maintain a low reconstruction error. The strong connection between autoencoders and deep AA in the process of learning hierarchical features in image patches was also highlighted in [125].

6 THEORETICAL ASPECTS OF DEEP MF

Although numerous formulations, algorithms and applications have been developed for deep MF, proper theoretical studies remain scarce, apart from insights from the deep learning community working on linear networks. To the best of our knowledge, the main theoretical contributions so far concern mostly the convergence of algorithms and, to a lesser extent, identifiability.

6.1 Convergence issues

Fig. 8: Illustration of the similarity between (a) deep autoencoders and (b) deep MF. Panel (a) shows layers 0 to $P-1$, with encoder weights $Z_1, \dots, Z_Q$ and decoder weights $Z_Q^T, \dots, Z_1^T$ around the central layer $Q$; panel (b) shows layers $L$ to 1, with representations $H_L, \dots, H_1$ and weights $W_L, \dots, W_1$.

When the factors of deep MF are updated through a BCD (see Algorithm 4), the subproblems w.r.t. a single factor are convex, as for most CLRMA. Standard convergence results give conditions under which the iterates tend to stationary points, depending on whether the subproblems are solved through an exact algorithm or through an approximate framework, such as majorization minimization (MM). This encompasses most gradient descents used in practice. For example, when the global objective function is the least squares (5), alternating projected gradient descent is guaranteed to converge to stationary points, because the subproblems are convex and Lipschitz smooth [126].
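A minimal sketch of alternating projected gradient descent is given here for illustration. It assumes that the objective is the least-squares error $\|X - W_1 W_2 H_2\|_F^2$ with non-negativity constraints on all factors, and it uses the step size $1/\text{Lip}$, where Lip is the Lipschitz constant of the gradient of each convex subproblem. The function names, the toy data and the number of sweeps are hypothetical; this is not claimed to be Algorithm 4 of the survey.

```python
import numpy as np

def loss(X, factors):
    """Least-squares error between X and the product of all factors."""
    return np.linalg.norm(X - np.linalg.multi_dot(factors)) ** 2

def chain(mats, identity_size):
    """Product of a list of matrices, or the identity if the list is empty."""
    if not mats:
        return np.eye(identity_size)
    out = mats[0]
    for M in mats[1:]:
        out = out @ M
    return out

def projected_gradient_sweep(X, factors):
    """One BCD sweep: a projected gradient step on each factor in turn."""
    m, n = X.shape
    for i in range(len(factors)):
        A = chain(factors[:i], m)        # product of the factors on the left
        B = chain(factors[i + 1:], n)    # product of the factors on the right
        F = factors[i]
        grad = 2 * A.T @ (A @ F @ B - X) @ B.T
        lip = 2 * np.linalg.norm(A.T @ A, 2) * np.linalg.norm(B @ B.T, 2)
        factors[i] = np.maximum(F - grad / max(lip, 1e-12), 0.0)  # project onto F >= 0
    return factors

rng = np.random.default_rng(0)
m, n, d1, d2 = 8, 12, 4, 2
# Toy data admitting an exact non-negative two-layer factorization.
X = rng.random((m, d1)) @ rng.random((d1, d2)) @ rng.random((d2, n))
factors = [rng.random((m, d1)), rng.random((d1, d2)), rng.random((d2, n))]

history = [loss(X, factors)]
for _ in range(50):
    factors = projected_gradient_sweep(X, factors)
    history.append(loss(X, factors))
```

Because each subproblem is convex with a Lipschitz-continuous gradient, a projected step of length $1/\text{Lip}$ cannot increase the objective, so `history` is non-increasing. This illustrates monotonic descent only and says nothing about the quality of the stationary point that is reached.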

6.1.1 Convergence of first-order methods

The effect of the number of layers on the convergence of first-order methods applied to problem (5) has not been much studied, to the best of our knowledge, but recent results have been obtained on deep linear networks (see Section 5).

Some of the theoretical results presented in the following are not directly related to the deep MF models described so far but rather concern networks aimed at supervised learning. However, we strongly believe that these insights coming from the deep linear networks community might be helpful to better understand deep MF and possibly open directions of future research. Let us mention a few important results. We refer the interested reader to the recent survey [127] for more details.

When the thinnest layer of a deep linear network is either the input or the output one, Laurent et al. [128] showed that deep linear networks with arbitrary convex differentiable loss produce local minima that are all global. In addition, when the input data is whitened (that is, the covariance matrix is the identity) and a proper initialization of all layers is chosen, Arora et al. [129] proved the linear convergence of gradient descent to a global minimum on such a network. This generalizes the results of [130] in which linear residual networks, where the weights of each layer are initialized to be the identity matrix and the inner ranks $d_0 = m, \dots, d_l$ are the same, are considered. Indeed, in [129], the network architecture is more general and milder restrictions on the initialization are required. When the loss function is the squared error between $Y$ and $\hat{Y}$, Arora et al. [131] provide an interesting result: if the weights $W_l$'s are updated with gradient descent and if the initialization $(W_{l+1}^{(0)})^T W_{l+1}^{(0)} = (W_l^{(0)})^T W_l^{(0)}$ holds for all $l$, then there exists an equivalent update rule for the end-to-end matrix $W = W_1 \cdots W_L$ which can be seen as an acceleration of the gradient descent update as long as the learning step is sufficiently small. As the depth $L$ grows, the effect is intensified, which shows that over-parametrization, that is, considering several hidden layers, might accelerate the optimization process.

In the same spirit, when the network is restricted to be such that each hidden layer contains the same number $d$ of units, but without specific assumptions on the input data or the initialization, Du et al. [132] prove the linear convergence of gradient descent to a global optimum if the width of each layer is sufficient.

Finally, the convergence of gradient descent on a function of a product of matrices, especially the loss $\|Y - \prod_l W_l\|_F^2$, where $Y = -I_d$ and each $W_l$ is a square matrix of size $d$, was studied by Shamir et al. in [133]. Independent initialization of each layer is considered, that is, either Xavier initialization (the entries of the $W_l$'s are sampled from a zero-mean Gaussian distribution) or near-identity initialization (each $W_l = I + M$ with $I$ the identity and $M$ a matrix whose elements are sampled from a zero-mean Gaussian distribution). The smaller the variance of the initialization distribution is, the more likely it is that gradient descent has exponential runtime w.r.t. the depth of the network, therefore advocating for shallow nets in this case. However, this result is rather empirical and the architecture of the network is particular, as each layer has the same number of units.
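The two initialization schemes mentioned above can be sketched as follows. The variance $1/d$ used for the Xavier-type scheme, the value of `sigma`, the sizes and the reading of the target as the negative identity of size $d$ are all assumptions made for this illustration; they are not taken from [133].

```python
import numpy as np

def xavier_init(d, depth, rng):
    """Zero-mean Gaussian entries; the variance 1/d is an assumed scaling."""
    return [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

def near_identity_init(d, depth, rng, sigma=0.01):
    """W_l = I + M with zero-mean Gaussian M; sigma is an assumed value."""
    return [np.eye(d) + sigma * rng.standard_normal((d, d)) for _ in range(depth)]

def product_loss(Ws, Y):
    """Squared Frobenius distance between Y and the product W_1 W_2 ... W_L."""
    P = Ws[0]
    for W in Ws[1:]:
        P = P @ W
    return np.linalg.norm(Y - P, "fro") ** 2

d, depth = 3, 5
rng = np.random.default_rng(0)
Y = -np.eye(d)                           # target assumed to be the negative identity

loss_xavier = product_loss(xavier_init(d, depth, rng), Y)
loss_near_id = product_loss(near_identity_init(d, depth, rng), Y)
```

With the near-identity scheme the initial product is close to $I$, so the initial loss is close to $\|-I - I\|_F^2 = 4d$, which shows how far from the target such an initialization starts.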

6.1.2 Low-rank structure

Arora et al. [79] demonstrate some advantages of unconstrained deep MF compared to standard shallow MF [134] in terms of regularization properties. Indeed, deep MF enhances an implicit tendency towards low-rank solutions. The problem considered is matrix completion, that is, imputing missing entries in a given matrix $X \in \mathbb{R}^{m \times n}$. When the number of known entries is sufficiently large (this depends on the rank of $X$), factorizations of any depth admit solutions that tend to minimize the nuclear norm of the end-to-end matrix $W = W_1 \cdots W_L$, that is, to minimize the sum of the singular values of $W$. However, when there are fewer observed entries, the approximation tends to have a lower effective rank at the expense of a higher nuclear norm, especially when the depth increases. More interestingly, the evolution of the singular values of $W$ obtained with gradient flow, that is, gradient descent with infinitesimally small learning rate, reveals that the solutions tend to have a few large singular values and many small ones, with a gap that intensifies with the depth of the factorization. This can be seen as an implicit regularization promoting low-rank solutions.
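The two quantities compared in this discussion can be computed as in the sketch below. The nuclear norm is the sum of the singular values. For the effective rank, the sketch uses one common definition, the exponential of the Shannon entropy of the normalized singular values; whether this matches the exact definition used in [79] is an assumption, and the toy end-to-end matrix is hypothetical.

```python
import numpy as np

def nuclear_norm(W):
    """Sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

def effective_rank(W, eps=1e-12):
    """Exponential of the entropy of the normalized singular values (assumed definition)."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                       # discard numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# End-to-end matrix W = W_1 W_2 W_3 of a depth-3 factorization with inner rank 2.
W1, W2, W3 = rng.random((6, 2)), rng.random((2, 2)), rng.random((2, 6))
W = W1 @ W2 @ W3

nuc_identity = nuclear_norm(np.eye(3))
erank_identity = effective_rank(np.eye(3))
erank_rank_one = effective_rank(np.outer(np.arange(1.0, 4.0), np.ones(5)))
erank_W = effective_rank(W)
```

A matrix with a few dominant singular values has an effective rank close to the number of those values, even when its nuclear norm is large, which is the trade-off described above.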

In summary, the recent literature gives evidence of the advantages of using a deep factorization, in terms of both speed of convergence of gradient descent to a global minimum and low-rankness of the factors. However, the settings described do not assume constraints on the factors of the decomposition, unlike most deep MF models. Extending these observations to the constrained case is an important direction of research.

6.2 Identifiability

Identifiability of exact deep MF is an important theoretical research question. It consists in establishing the conditions under which the factors $W_1, \dots, W_L$ and $H_L$ can be uniquely retrieved, up to trivial permutation and scaling. For various CLRMA problems, thorough conditions have been proposed; see for example [15] for NMF, [43] for SCA and DL, and [19] for simplex-structured MF, and the references therein.

Of course, any result for CLRMA can be extended to the corresponding deep MF model. Let us illustrate this with NMF. A sufficient condition for an exact NMF $X = W_1 H_1$ with $W_1 \geq 0$ and $H_1 \geq 0$ to be unique, up to permutation and scaling, is that $W_1^T$ and $H_1$ satisfy the so-called sufficiently scattered condition (SSC) [135]. Intuitively, the SSC requires that the rows of $W_1$ and the columns of $H_1$ are sufficiently well spread within the nonnegative orthant and have some degree of sparsity. Then, the two-layer factorization $X = W_1 W_2 H_2$ is also unique, up to permutation and scaling, if $H_1 = W_2 H_2$ is unique, which is guaranteed if $W_2^T$ and $H_2$ satisfy the SSC. Similar observations would apply for SCA, among others.
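The phrase "up to permutation and scaling" refers to an ambiguity that no constraint of this kind can remove, as the following sketch illustrates. The sizes and the random factors are arbitrary, and the names `P`, `D`, `W1_alt` and `H1_alt` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, n = 5, 3, 6
W1 = rng.random((m, r))                  # non-negative factors of an exact NMF
H1 = rng.random((r, n))
X = W1 @ H1

perm = rng.permutation(r)
P = np.eye(r)[:, perm]                   # permutation matrix
D = np.diag(rng.uniform(0.5, 2.0, r))    # positive diagonal scaling

W1_alt = W1 @ P @ D                      # columns of W1 permuted and rescaled
H1_alt = np.linalg.inv(D) @ P.T @ H1     # rows of H1 permuted and inversely rescaled
X_alt = W1_alt @ H1_alt
```

Both pairs of factors are non-negative and reproduce the same matrix, since $P D D^{-1} P^T = I$. Uniqueness results such as those based on the SSC therefore state that every other exact factorization is of this trivial form.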

However, there are very few results tackling the identifiability of deep MF directly. As far as we know, the only attempt is by Malgouyres and Landsberg [136], in a very particular setting. The factorization $X \approx M_1(q_1) M_2(q_2) \cdots M_L(q_L)$ is considered where each matrix $M_l$ is described through a small number $S$ of parameters with $q_l \in \mathbb{R}^S$ for all $l$. A necessary and sufficient condition for identifiability in the noiseless case is provided as well as stability guarantees in the noisy case. However, these are quite abstract conditions involving advanced concepts such as the tensorial lifting property and the Segre embedding, and these conditions are difficult to check in practice. These results are further discussed in [137] where the conditions are extended to the case of convolutional linear networks.

Needless to say, the robustness to noise (also sometimes referred to as the stability) of deep MF models is also an important issue that has not been investigated yet. In fact, even in the matrix case, most known results apply to the unconstrained case or under orthogonality constraints [138]. For most other CLRMA problems, such results are rather scarce and difficult to derive.

7 PERSPECTIVES AND CONCLUSION

Deep MF is an emerging research topic, at the intersection of low-rank matrix approximations and deep learning. In this literature review, we presented multilayer and deep MF variants, which are being used successfully in an increasing number of applications, from recommender systems and hyperspectral unmixing to multi-view clustering and community detection.

Although many models and algorithms have been introduced for deep MF, the theoretical insights remain weak. In our opinion, this is a main direction of research that should be tackled by deep MF researchers. Interestingly, a similar trend was observed for neural networks: the theory started to be investigated thoroughly (and it is still a very active area of research) only after many models and algorithms were shown to perform well in practice.

Many perspectives have been presented throughout this survey, concerning the various aspects of deep MF:

• Choice of the parameters: The choice of the parameters, namely the inner ranks and the number of layers, has not been discussed much as it is mostly application dependent (see Section 3.3). Establishing proper guidelines to choose these parameters is a crucial issue.

• Identifiability: Identifiability of deep MF has not been investigated much, see Section 6.2. However, deriving conditions for deep MF to be unique could be particularly meaningful in some applications.

• Loss function: Very few works have carefully investigated the choice of the loss function, see the discussion in Section 3.1. Besides, this influences the way the algorithms are designed.

• Design of new models and algorithms: As evoked in Section 4.2, many MF models have not been extended to a deep setting yet. Moreover, efficient algorithms and initializations dedicated to deep MF are still lacking (see Section 3.3).

• Links between deep MF and deep learning: Though there exist obvious connections between deep neural networks and deep MF as described in Section 5, they do not seem to have been fully exploited yet. It is not clear either whether it is possible to integrate advanced deep learning frameworks, such as convolutions, inside deep MF. Convolutional neural networks are known to extract several levels of visual features in image patches through highly non-linear operations [139]: is deep MF capable of such performance in a linear and more interpretable way?

• Applications: Deep MF has not been applied yet on several important applications such as text mining. Moreover, in the applications described in Section 4.2, the interpretation of the features obtained at each layer is not always clear. This is also an important research issue. In particular, the original data points are usually clustered by applying k-means on the last inner representation matrix $H_L$. However, more robust techniques taking into account the information of the previous layers have not been used yet, to the best of our knowledge.

We believe deep MF could be a particularly useful framework as it combines the ability to extract hierarchical features, like deep learning models, with a high interpretability power, like low-rank matrix approximations. These advantages justify the necessity to maintain research efforts in deep MF, especially to improve the explainability of AI techniques.

REFERENCES

[1] M. Udell, C. Horn, R. Zadeh, and S. Boyd, “Generalized low rank models,” Foundations and Trends in Machine Learning, vol. 9, no. 1, pp. 1–118, 2016.

[2] M. Udell and A. Townsend, “Why are big data matrices approximately low rank?” SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 144–160, 2019.

[3] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.

[4] G. H. Golub and C. Reinsch, “Singular value decomposition and least squares solutions,” in Linear Algebra. Springer, 1971, pp. 134–151.

[5] V. Papyan, Y. Romano, J. Sulam, and M. Elad, “Theoretical foundations of deep learning via sparse representations: A multilayer sparse model and its connection to convolutional neural networks,” IEEE Signal Processing Magazine, vol. 35, no. 4, pp. 72–89, 2018.

[6] P. Georgiev, F. Theis, and A. Cichocki, “Sparse component analysis and blind source separation of underdetermined mixtures,” IEEE Transactions on Neural Networks, vol. 16, no. 4, pp. 992–996, 2005.

[7] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[9] G. Marcus, “Deep learning: A critical appraisal,” arXiv preprint arXiv:1801.00631, 2018.

[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[11] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. Schuller, “A deep matrix factorization method for learning attribute representations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 417–429, 2016.

[12] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the β-divergence,” Neural computation, vol. 23, no. 9, pp. 2421–2456, 2011.

[13] Y.-X. Wang and Y.-J. Zhang, “Nonnegative matrix factorization: A comprehensive review,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1336–1353, 2012.

[14] J. Kim, Y. He, and H. Park, “Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework,” Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, 2014.

[15] X. Fu, K. Huang, N. D. Sidiropoulos, and W. Ma, “Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications,” IEEE Signal Processing Magazine, vol. 36, no. 2, pp. 59–80, 2019.

[16] N. Gillis, “The why and how of nonnegative matrix factorization,” Regularization, Optimization, Kernels, and Support Vector Machines, vol. 12, no. 257, 2014.

[17] ——, “Introduction to nonnegative matrix factorization,” SIAG/OPT Views and News, vol. 25, no. 1, pp. 7–16, 2017.

[18] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009.

[19] M. Abdolali and N. Gillis, “Simplex-structured matrix factorization: Sparsity-based identifiability and provably correct algorithms,” arXiv preprint arXiv:2007.11446, 2020.

[20] L. Miao and H. Qi, “Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 3, pp. 765–777, 2007.

[21] M. A. Ang and N. Gillis, “Volume regularized non-negative matrix factorizations,” in 2018 9th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS). IEEE, 2018, pp. 1–5.

[22] X. Fu, K. Huang, N. D. Sidiropoulos, and W. Ma, “Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications,” IEEE Signal Processing Magazine, vol. 36, no. 2, pp. 59–80, 2019.

[23] P. O. Hoyer, “Non-negative sparse coding,” in Neural Networks for Signal Processing, 2002. Proceedings of the 2002 12th IEEE Workshop on. IEEE, 2002, pp. 557–565.

[24] M. Mørup and L. K. Hansen, “Archetypal analysis for machine learning and data mining,” Neurocomputing, vol. 80, pp. 54–63, 2012.


[25] P. De Handschutter, N. Gillis, A. Vandaele, and X. Siebert,“Near-convex archetypal analysis,” IEEE Signal Processing Letters,vol. 27, pp. 81–85, 2019.

[26] H. Javadi and A. Montanari, “Nonnegative matrix factorizationvia archetypal analysis,” Journal of the American Statistical Associ-ation, pp. 1–22, 2019.

[27] S. A. Vavasis, “On the complexity of nonnegative matrix factor-ization,” SIAM J. Opt., vol. 20, no. 3, pp. 1364–1377, 2009.

[28] A. Cichocki and R. Zdunek, “Multilayer nonnegative matrixfactorisation,” Electronics Letters, vol. 42, no. 16, pp. 947–948, 2006.

[29] ——, “Multilayer nonnegative matrix factorization using projec-ted gradient approaches,” International Journal of Neural Systems,vol. 17, no. 06, pp. 431–446, 2007.

[30] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. Schuller, “Adeep semi-NMF model for learning hidden representations,” inInternational Conference on Machine Learning, 2014, pp. 1692–1700.

[31] C. H. Ding, T. Li, and M. I. Jordan, “Convex and semi-nonnegative matrix factorizations,” IEEE transactions on patternanalysis and machine intelligence, vol. 32, no. 1, pp. 45–55, 2010.

[32] J. Yu, G. Zhou, A. Cichocki, and S. Xie, “Learning the hierarchicalparts of objects by deep non-smooth nonnegative matrix factor-ization,” IEEE Access, vol. 6, pp. 58 096–58 105, 2018.

[33] O. Dikmen, Z. Yang, and E. Oja, “Learning the informationdivergence,” IEEE Transactions on Pattern Analysis and MachineIntelligence, vol. 37, no. 7, pp. 1442–1454, 2014.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrixfactorization with the itakura-saito divergence: With applicationto music analysis,” Neural computation, vol. 21, no. 3, pp. 793–830,2009.

[35] V. Leplat, N. Gillis, and A. M. S. Ang, “Blind audio sourceseparation with minimum-volume beta-divergence NMF,” IEEETransactions on Signal Processing, pp. 3400–3410, 2020.

[36] C. H. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegativematrix t-factorizations for clustering,” in Proceedings of the 12thACM SIGKDD international conference on Knowledge discovery anddata mining, 2006, pp. 126–135.

[37] F. Pompili, N. Gillis, P.-A. Absil, and F. Glineur, “Two algorithmsfor orthogonal nonnegative matrix factorization with applicationto clustering,” Neurocomputing, vol. 141, pp. 15–25, 2014.

[38] B. Li, G. Zhou, and A. Cichocki, “Two efficient algorithms for ap-proximately orthogonal nonnegative matrix factorization,” IEEESignal Processing Letters, vol. 22, no. 7, pp. 843–846, 2014.

[39] B. Lyu, K. Xie, and W. Sun, “A deep orthogonal non-negativematrix factorization method for learning attribute representa-tions,” in International Conference on Neural Information Processing.Springer, 2017, pp. 443–452.

[40] Y. Qiu, G. Zhou, and K. Xie, “Deep approximately orthogonalnonnegative matrix factorization for clustering,” arXiv preprintarXiv:1711.07437, 2017.

[41] J. Eggert and E. Korner, “Sparse coding and NMF,” in 2004 IEEEInternational Joint Conference on Neural Networks, vol. 4. IEEE,2004, pp. 2529–2533.

[42] J. Kim and H. Park, “Sparse nonnegative matrix factorization forclustering,” Georgia Institute of Technology, Tech. Rep., 2008.

[43] R. Gribonval, R. Jenatton, and F. Bach, “Sparse and spurious:dictionary learning with noise and outliers,” IEEE Transactions onInformation Theory, vol. 61, no. 11, pp. 6298–6319, 2015.

[44] J. E. Cohen and N. Gillis, “Nonnegative low-rank sparse compon-ent analysis,” in ICASSP 2019-2019 IEEE International Conferenceon Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019,pp. 8226–8230.

[45] Z. Guo and S. Zhang, “Sparse deep nonnegative matrix factor-ization,” Big Data Mining and Analytics, vol. 3, no. 1, pp. 13–28,2019.

[46] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholdingalgorithm for linear inverse problems,” SIAM journal on imagingsciences, vol. 2, no. 1, pp. 183–202, 2009.

[47] N. Gillis, “Sparse and unique nonnegative matrix factorizationthrough data preprocessing,” Journal of Machine Learning Research,vol. 13, no. Nov, pp. 3349–3386, 2012.

[48] S. Lyu and X. Wang, “On algorithms for sparse multi-factorNMF,” in Advances in Neural Information Processing Systems, 2013,pp. 602–610.

[49] R. Peharz and F. Pernkopf, “Sparse nonnegative matrix factor-ization with `0-constraints,” Neurocomputing, vol. 80, pp. 38–46,2012.

[50] Y. Qian, S. Jia, J. Zhou, and A. Robles-Kelly, “Hyperspectralunmixing via `1/2 sparsity-constrained nonnegative matrix fac-torization,” IEEE Transactions on Geoscience and Remote Sensing,vol. 49, no. 11, pp. 4282–4297, 2011.

[51] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov, “Dropout: a simple way to prevent neural net-works from overfitting,” The journal of machine learning research,vol. 15, no. 1, pp. 1929–1958, 2014.

[52] Z. He, J. Liu, C. Liu, Y. Wang, A. Yin, and Y. Huang, “Dropoutnon-negative matrix factorization,” Knowledge and InformationSystems, vol. 60, no. 2, pp. 781–806, 2019.

[53] J. Cavazza, P. Morerio, B. Haeffele, C. Lane, V. Murino, andR. Vidal, “Dropout as a low-rank regularizer for matrix factor-ization,” in International Conference on Artificial Intelligence andStatistics, 2018, pp. 435–444.

[54] A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann, andR. D. Pascual-Marqui, “Nonsmooth nonnegative matrix factoriz-ation (nsNMF),” IEEE transactions on pattern analysis and machineintelligence, vol. 28, no. 3, pp. 403–415, 2006.

[55] H. A. Song and S. Lee, “Hierarchical representation usingNMF,” in International conference on neural information processing.Springer, 2013, pp. 466–473.

[56] P. Sharma, V. Abrol, and A. Thakur, “Ase: Acoustic scene embed-ding using deep archetypal analysis and GMM.” in Interspeech,2018, pp. 3299–3303.

[57] S. M. Keller, M. Samarin, M. Wieser, and V. Roth, “Deep ar-chetypal analysis,” in German Conference on Pattern Recognition.Springer, 2019, pp. 171–185.

[58] A. Alemi, I. Fischer, J. Dillon, and K. Murphy, “Deep variationalinformation bottleneck,” in ICLR, 2017.

[59] X. Li, C. Zhao, Z. Shu, and Q. Wang, “Multilayer concept factoriz-ation for data representation,” in 2015 10th International Conferenceon Computer Science & Education (ICCSE). IEEE, 2015, pp. 486–491.

[60] Y. Zhang, Z. Zhang, Z. Zhang, M. Zhao, L. Zhang, Z. Zha,and M. Wang, “Deep self-representative concept factorizationnetwork for representation learning,” in Proceedings of the 2020SIAM International Conference on Data Mining. SIAM, 2020, pp.361–369.

[61] Z. Zhang, Y. Zhang, L. Zhang, and S. Yan, “A survey on conceptfactorization: From shallow to deep representation learning,”arXiv preprint arXiv:2007.15840, 2020.

[62] Y. Meng, R. Shang, F. Shang, L. Jiao, S. Yang, and R. Stolkin,“Semi-supervised graph regularized deep NMF with bi-orthogonal constraints for data representation,” IEEE transactionson neural networks and learning systems, 2019.

[63] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3551–3582, 2017.

[64] X. Bi, A. Qu, and X. Shen, “Multilayer tensor factorization with applications to recommender systems,” The Annals of Statistics, vol. 46, no. 6B, pp. 3308–3333, 2018.

[65] J. Casebeer, M. Colomb, and P. Smaragdis, “Deep tensor factorization for spatially-aware scene decomposition,” in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019, pp. 180–184.

[66] P. Smaragdis and S. Venkataramani, “A neural network alternative to non-negative audio models,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 86–90.

[67] C. Jia, M. Shao, and Y. Fu, “Sparse canonical temporal alignment with deep tensor decomposition for action recognition,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 738–750, 2016.

[68] S. Oymak and M. Soltanolkotabi, “End-to-end learning of a convolutional neural network via deep tensor decomposition,” arXiv preprint arXiv:1805.06523, 2018.

[69] I. Domanov and L. De Lathauwer, “Generic uniqueness conditions for the canonical polyadic decomposition and INDSCAL,” SIAM Journal on Matrix Analysis and Applications, vol. 36, no. 4, pp. 1567–1589, 2015.

[70] S. Ravishankar and Y. Bresler, “Learning sparsifying transforms,” IEEE Transactions on Signal Processing, vol. 61, no. 5, pp. 1072–1086, 2012.

[71] J. Maggu and A. Majumdar, “Unsupervised deep transform learning,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 6782–6786.

[72] N. Gillis, “Successive nonnegative projection algorithm for robust nonnegative blind source separation,” SIAM Journal on Imaging Sciences, vol. 7, no. 2, pp. 1420–1450, 2014.

[73] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems, 2001, pp. 556–562.

[74] J. Ahn, S. Choi, and J. Oh, “A multiplicative up-propagation algorithm,” in Proceedings of the twenty-first International Conference on Machine Learning. ACM, 2004, p. 3.

[75] C. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural computation, vol. 19, no. 10, pp. 2756–2779, 2007.

[76] Y. E. Nesterov, “A method for solving the convex programming problem with convergence rate O(1/k^2),” in Dokl. Akad. Nauk SSSR, vol. 269, 1983, pp. 543–547.

[77] K. Huang, N. D. Sidiropoulos, and A. P. Liavas, “A flexible and efficient algorithmic framework for constrained matrix and tensor factorization,” IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5052–5065, 2016.

[78] Y. Zhou and L. Xu, “A deep structure-enforced nonnegative matrix factorization for data representation,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2018, pp. 340–350.

[79] S. Arora, N. Cohen, W. Hu, and Y. Luo, “Implicit regularization in deep matrix factorization,” in Advances in Neural Information Processing Systems, 2019, pp. 7411–7422.

[80] J. Fan and J. Cheng, “Matrix completion by deep matrix factorization,” Neural Networks, vol. 98, pp. 34–41, 2018.

[81] Q. Wang, M. Sun, L. Zhan, P. Thompson, S. Ji, and J. Zhou, “Multi-modality disease modeling via collective deep matrix factorization,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 1155–1164.

[82] J. Le Roux, J. R. Hershey, and F. Weninger, “Deep NMF for speech separation,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 66–70.

[83] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.

[84] J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot, “Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches,” IEEE journal of selected topics in applied earth observations and remote sensing, vol. 5, no. 2, pp. 354–379, 2012.

[85] W.-K. Ma, J. M. Bioucas-Dias, T.-H. Chan, N. Gillis, P. Gader, A. J. Plaza, A. Ambikapathi, and C.-Y. Chi, “A signal processing perspective on hyperspectral unmixing: Insights from remote sensing,” IEEE Signal Processing Magazine, vol. 31, no. 1, pp. 67–81, 2013.

[86] F. Zhu, “Spectral unmixing datasets with ground truths,” arXiv preprint arXiv:1708.05125, 2017.

[87] “Data - rslab,” https://rslab.ut.ac.ir/data, (Accessed on 09/09/2020).

[88] A. Mongia, N. Jhamb, E. Chouzenoux, and A. Majumdar, “Deep latent factor model for collaborative filtering,” Signal Processing, vol. 169, p. 107366, 2020.

[89] H. Xue, X. Dai, J. Zhang, S. Huang, and J. Chen, “Deep matrix factorization models for recommender systems,” in IJCAI, 2017, pp. 3203–3209.

[90] B. Yi, X. Shen, H. Liu, Z. Zhang, W. Zhang, S. Liu, and N. Xiong, “Deep matrix factorization with implicit feedback embedding for recommendation system,” IEEE Transactions on Industrial Informatics, vol. 15, no. 8, pp. 4591–4601, 2019.

[91] Y. Yang and H. Wang, “Multi-view clustering: A survey,” Big Data Mining and Analytics, vol. 1, no. 2, pp. 83–107, 2018.

[92] H. Zhao, Z. Ding, and Y. Fu, “Multi-view clustering via deep matrix factorization,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 2921–2927.

[93] B. Cui, H. Yu, T. Zhang, and S. Li, “Self-weighted multi-view clustering with deep matrix factorization,” in Asian Conference on Machine Learning, 2019, pp. 567–582.

[94] S. Wei, J. Wang, G. Yu, C. Domeniconi, and X. Zhang, “Multi-view multiple clusterings using deep matrix factorization.” in AAAI, 2020, pp. 6348–6355.

[95] C. Xu, Z. Guan, W. Zhao, Y. Niu, Q. Wang, and Z. Wang, “Deep multi-view concept learning.” in IJCAI, 2018, pp. 2898–2904.

[96] S. Huang, Z. Kang, and Z. Xu, “Auto-weighted multi-view clustering via deep matrix decomposition,” Pattern Recognition, vol. 97, p. 107015, 2020.

[97] Y. Xiong, Y. Xu, and X. Shu, “Cross-view hashing via supervised deep discrete matrix factorization,” Pattern Recognition, vol. 103, p. 107270, 2020.

[98] J. Yang and J. Leskovec, “Overlapping community detection at scale: a nonnegative matrix factorization approach,” in Proceedings of the sixth ACM international conference on Web search and data mining, 2013, pp. 587–596.

[99] F. Ye, C. Chen, and Z. Zheng, “Deep autoencoder-like nonnegative matrix factorization for community detection,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 2018, pp. 1393–1402.

[100] R. Rajabi and H. Ghassemian, “Spectral unmixing of hyperspectral imagery using multilayer NMF,” IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 1, pp. 38–42, 2014.

[101] L. Tong, J. Yu, C. Xiao, and B. Qian, “Hyperspectral unmixing via deep matrix factorization,” International Journal of Wavelets, Multiresolution and Information Processing, vol. 15, no. 06, p. 1750058, 2017.

[102] X. Feng, H. Li, J. Li, Q. Du, A. Plaza, and W. J. Emery, “Hyperspectral unmixing using sparsity-constrained deep nonnegative matrix factorization with total variation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 10, pp. 6245–6257, 2018.

[103] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: nonlinear phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.

[104] G. Zhao, C. Zhao, and X. Jia, “Multilayer unmixing for hyperspectral imagery with fast kernel archetypal analysis,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 10, pp. 1532–1536, 2016.

[105] F. Gao, X. Liu, J. Dong, G. Zhong, and M. Jian, “Change detection in SAR images based on deep semi-NMF and SVD networks,” Remote Sensing, vol. 9, no. 5, p. 435, 2017.

[106] H. Li, G. Yang, W. Yang, Q. Du, and W. J. Emery, “Deep nonsmooth nonnegative matrix factorization network with semi-supervised learning for SAR image change detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 160, pp. 167–179, 2020.

[107] P. Sharma, V. Abrol, and A. K. Sao, “Deep sparse representation based features for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2162–2175, 2017.

[108] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.

[109] C. Hsu, J. Chien, and T. Chi, “Layered nonnegative matrix factorization for speech separation,” in 16th annual conference of the international speech communication association (interspeech 2015), Vols 1-5, 2015, pp. 628–632.

[110] A. Thakur, V. Abrol, P. Sharma, and P. Rajan, “Deep convex representations: Feature representations for bioacoustics classification.” in Interspeech, 2018, pp. 2127–2131.

[111] A. Thakur and P. Rajan, “Deep archetypal analysis based intermediate matching kernel for bioacoustic classification,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 298–309, 2019.

[112] C. Ding, T. Li, and W. Peng, “On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing,” Computational Statistics & Data Analysis, vol. 52, no. 8, pp. 3913–3927, 2008.

[113] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, “A practical algorithm for topic modeling with provable guarantees,” in International Conference on Machine Learning, 2013, pp. 280–288.

[114] N. Dobigeon, J. Tourneret, C. Richard, J. C. M. Bermudez, S. McLaughlin, and A. O. Hero, “Nonlinear unmixing of hyperspectral images: Models and algorithms,” IEEE Signal Processing Magazine, vol. 31, no. 1, pp. 82–94, 2013.

[115] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6655–6659.

[116] Y. Zhang, E. Chuangsuwanich, and J. Glass, “Extracting deep neural network bottleneck features using low-rank matrix factorization,” in 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2014, pp. 185–189.

[117] T. G. Kang, K. Kwon, J. W. Shin, and N. S. Kim, “NMF-based target source separation using deep neural network,” IEEE Signal Processing Letters, vol. 22, no. 2, pp. 229–233, 2014.

[118] S. Ozkan, B. Kaya, and G. B. Akar, “Endnet: Sparse autoencoder network for endmember extraction and hyperspectral unmixing,” IEEE Transactions on Geoscience and Remote Sensing, no. 99, pp. 1–15, 2018.

[119] A. Ng, “Sparse autoencoder,” CS294A Lecture notes, vol. 72, no. 2011, pp. 1–19, 2011.

[120] A. Lemme, R. F. Reinhart, and J. J. Steil, “Online learning and generalization of parts-based image representations by non-negative sparse autoencoders,” Neural Networks, vol. 33, pp. 194–203, 2012.

[121] E. Hosseini-Asl, J. M. Zurada, and O. Nasraoui, “Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints,” IEEE transactions on neural networks and learning systems, vol. 27, no. 12, pp. 2486–2498, 2016.

[122] J. Flenner and B. Hunter, “A deep non-negative matrix factorization neural network,” Semantic Scholar, 2017.

[123] S. Tariyal, A. Majumdar, R. Singh, and M. Vatsa, “Deep dictionary learning,” IEEE Access, vol. 4, pp. 10 096–10 109, 2016.

[124] D. van Dijk, D. B. Burkhardt, M. Amodio, A. Tong, G. Wolf, and S. Krishnaswamy, “Finding archetypal spaces using neural networks,” in 2019 IEEE International Conference on Big Data. IEEE, 2019, pp. 2634–2643.

[125] C. Bauckhage, K. Kersting, F. Hoppe, and C. Thurau, “Archetypal analysis as an autoencoder,” in Workshop New Challenges in Neural Computation, 2015, p. 8.

[126] M. Razaviyayn, M. Hong, and Z.-Q. Luo, “A unified convergence analysis of block successive minimization methods for nonsmooth optimization,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126–1153, 2013.

[127] R. Sun, D. Li, S. Liang, T. Ding, and R. Srikant, “The global landscape of neural networks: An overview,” IEEE Signal Processing Magazine, vol. 37, no. 5, pp. 95–108, 2020.

[128] T. Laurent and J. Brecht, “Deep linear networks with arbitrary loss: All local minima are global,” in International Conference on Machine Learning. PMLR, 2018, pp. 2902–2907.

[129] S. Arora, N. Golowich, N. Cohen, and W. Hu, “A convergence analysis of gradient descent for deep linear neural networks,” in 7th International Conference on Learning Representations, ICLR 2019, 2019.

[130] P. L. Bartlett, D. P. Helmbold, and P. M. Long, “Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks,” Neural computation, vol. 31, no. 3, pp. 477–502, 2019.

[131] S. Arora, N. Cohen, and E. Hazan, “On the optimization of deep networks: Implicit acceleration by overparameterization,” in International Conference on Machine Learning, 2018, pp. 244–253.

[132] S. Du and W. Hu, “Width provably matters in optimization for deep linear neural networks,” in International Conference on Machine Learning, 2019, pp. 1655–1664.

[133] O. Shamir, “Exponential convergence time of gradient descent for one-dimensional deep linear neural networks,” in Conference on Learning Theory, 2019, pp. 2691–2713.

[134] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Implicit regularization in matrix factorization,” in Advances in Neural Information Processing Systems, 2017, pp. 6151–6159.

[135] K. Huang, N. D. Sidiropoulos, and A. Swami, “Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition,” IEEE Transactions on Signal Processing, vol. 62, no. 1, pp. 211–224, 2013.

[136] F. Malgouyres and J. Landsberg, “On the identifiability and stable recovery of deep/multi-layer structured matrix factorization,” in 2016 IEEE Information Theory Workshop (ITW). IEEE, 2016, pp. 315–319.

[137] F. Malgouyres and J. Landsberg, “Multilinear compressive sensing and an application to convolutional linear networks,” SIAM Journal on Mathematics of Data Science, vol. 1, no. 3, pp. 446–475, 2019.

[138] G. W. Stewart and J.-G. Sun, “Matrix perturbation theory,” 1990.

[139] O. Seddati, S. Dupont, S. Mahmoudi, and M. Parian, “Towards good practices for image retrieval based on CNN features,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 1246–1255.