Singular Value Decomposition for High-dimensional High-order Data

Anru Zhang
Department of Statistics, University of Wisconsin-Madison

CSML @ Princeton University, May 14, 2018

Joint work with Dong Xia


Introduction


• Tensors, or high-order arrays, have attracted a lot of attention recently.

• e.g., an order-d tensor

  X ∈ R^(p1 × ··· × pd),   X = (X_{i1···id}),   1 ≤ ik ≤ pk,   k = 1, ..., d.


Example

• Brain imaging

• Hyperspectral imaging

• Recommender system


• Higher order is fancier!

• Higher-order tensor problems are far more than extensions of matrix problems:

  - More structures
  - High dimensionality
  - Computational difficulty


In this talk, we focus on tensor SVD.


Tensor SVD

SVD and PCA

• Singular value decomposition (SVD) is one of the most important tools in multivariate analysis.

• Goal: find the underlying low-rank structure from the data matrix.

• Closely related to principal component analysis (PCA): find the one or multiple directions that explain most of the variance.

• Variations: sparse PCA, robust PCA, sparse SVD, kernel SVD, ...
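
To make the low-rank recovery idea concrete, here is a minimal numpy sketch (not from the slides; the dimensions, rank, and noise level are arbitrary illustrative choices) that recovers a planted rank-r structure via the truncated SVD.

```python
import numpy as np

# Minimal sketch: recover a planted rank-r structure from a noisy data matrix
# via the truncated SVD. All sizes below are illustrative choices.
rng = np.random.default_rng(0)
p, n, r = 50, 200, 3

X = rng.normal(size=(p, r)) @ rng.normal(size=(r, n))   # low-rank signal
Y = X + 0.1 * rng.normal(size=(p, n))                   # noisy observation

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
X_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]           # rank-r approximation of Y
U_hat = U[:, :r]                                        # estimated principal subspace

print(np.linalg.norm(X_hat - X) / np.linalg.norm(X))    # relative recovery error
```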


Related Works

• Rank-1 tensor SVD: Richard and Montanari, 2014; Hopkins, Shi, Steurer, 2015; Perry, Wein, Bandeira, 2017; Anandkumar, Deng, Ge, Mobahi, 2017.

  Y = λ · u ⊗ v ⊗ w + Z,   where Z has iid N(0, σ²) entries.

• Methods:
  - power methods, sum-of-squares, approximate message passing, homotopy, ...
  - MLE, warm-start power iterations, ...

• Statistical and computational trade-off & phase transition effects.

• The statistical framework for tensor SVD when r ≥ 2 is not well defined or solved.


Tensor SVD

• We propose a general framework for tensor SVD.

  Y = X + Z,

  where
  - Y ∈ R^(p1×p2×p3) is the observation;
  - Z is the noise;
  - X is a low-rank tensor.

• We wish to recover the high-dimensional low-rank structure X.
  → Unfortunately, there is no unified definition of tensor rank.


Low-rankness for Tensors

• Canonical polyadic (CP) low-rankness is widely used in the literature. Definition:

  r_cp = min r   s.t.   X = ∑_{i=1}^{r} λi · ui ⊗ vi ⊗ wi.

• Disadvantages:
  - possibly r_cp > max{p1, p2, p3};
  - the set of CP rank-r tensors may not be closed
    (possible situation: the limit of a sequence of CP rank-2 tensors has rank 3!);
  - the ui's (vi's, wi's) are usually not orthogonal.

Picture Source: Guoxu Zhou’s website. http://www.bsp.brain.riken.jp/ zhougx/tensor.html
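
To make the CP definition above concrete, here is a small numpy sketch (illustrative only; the sizes, rank, and random factors are my own choices, not from the slides) that builds a CP rank-r tensor as a sum of rank-1 outer products.

```python
import numpy as np

# Illustrative sketch of the CP form X = sum_i lambda_i * u_i (x) v_i (x) w_i.
rng = np.random.default_rng(1)
p1, p2, p3, r = 10, 12, 15, 4

lam = rng.uniform(1, 2, size=r)
U = rng.normal(size=(p1, r))     # columns u_i (not necessarily orthogonal)
V = rng.normal(size=(p2, r))     # columns v_i
W = rng.normal(size=(p3, r))     # columns w_i

# Sum of r rank-1 outer products, written as a single einsum contraction.
X = np.einsum('r,ir,jr,kr->ijk', lam, U, V, W)
print(X.shape)                   # (10, 12, 15)
```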


Tucker Low-rankness

• Let M1(X), M2(X), and M3(X) be the mode-1, mode-2, and mode-3 matricizations (unfoldings) of X.

• We write rk = rank(Mk(X)), k = 1, 2, 3, and denote

  Tucker-rank(X) = (r1, r2, r3).

• Note:
  - Matrix: dim(row span) = dim(column span).
  - Tensor: the parallel quantities r1, r2, r3 are not necessarily equal.

Picture Source: Tao, M., Zhou, F., Liu, Y. and Zhang, Z. (2015)
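
A minimal sketch of mode-k matricization and the resulting Tucker rank, assuming numpy; the `unfold` helper, the CP rank-2 test tensor, and all sizes are hypothetical illustrative choices.

```python
import numpy as np

def unfold(X, mode):
    """Mode-k matricization M_k(X): move axis `mode` to the front, flatten the rest."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

rng = np.random.default_rng(2)
# Illustrative tensor of CP rank 2, hence Tucker rank (2, 2, 2).
A, B, C = rng.normal(size=(10, 2)), rng.normal(size=(12, 2)), rng.normal(size=(15, 2))
X = np.einsum('ir,jr,kr->ijk', A, B, C)

tucker_rank = tuple(np.linalg.matrix_rank(unfold(X, k)) for k in range(3))
print(tucker_rank)   # (2, 2, 2) for generic random factors
```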


More General Assumption: Tucker Low-rankness

• Equivalent definition via the Tucker decomposition:

  X = S ×1 U1 ×2 U2 ×3 U3,   S ∈ R^(r1×r2×r3),   Uk ∈ R^(pk×rk).

• The smallest such (r1, r2, r3) is exactly the Tucker rank of X.

Picture Source: Guoxu Zhou’s website. http://www.bsp.brain.riken.jp/ zhougx/tensor.html
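
The Tucker form above can be written out with an explicit mode-k product. The sketch below is illustrative: the `mode_k_product` helper, sizes, and ranks are my own choices, and the Uk are made orthonormal by a QR step to match the convention Uk ∈ O_{pk,rk} used later in the talk.

```python
import numpy as np

def mode_k_product(X, M, mode):
    """Multiply tensor X by matrix M along axis `mode` (contract that axis)."""
    Xm = np.moveaxis(X, mode, 0)                    # bring the mode to the front
    out = np.tensordot(M, Xm, axes=(1, 0))          # matrix times the unfolded mode
    return np.moveaxis(out, 0, mode)                # move the axis back into place

rng = np.random.default_rng(3)
p, r = (10, 12, 15), (2, 3, 4)
S = rng.normal(size=r)                              # core tensor
U = [np.linalg.qr(rng.normal(size=(p[k], r[k])))[0] for k in range(3)]  # orthonormal factors

X = S
for k in range(3):
    X = mode_k_product(X, U[k], k)                  # X = S x1 U1 x2 U2 x3 U3
print(X.shape)                                      # (10, 12, 15)
```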


Model

• Observations: Y ∈ R^(p1×p2×p3),

  Y = X + Z = S ×1 U1 ×2 U2 ×3 U3 + Z,

  where Z has iid N(0, σ²) entries, Uk ∈ O_{pk,rk}, and S ∈ R^(r1×r2×r3).

• Goal: estimate U1,U2,U3, and the original tensor X.
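
A sketch of simulating data from this model, assuming numpy; the dimensions, ranks, noise level, and core scaling are illustrative choices, not values used in the talk.

```python
import numpy as np

# Generate Y = S x1 U1 x2 U2 x3 U3 + Z with orthonormal Uk and iid N(0, sigma^2) noise.
def mode_k_product(X, M, mode):
    Xm = np.moveaxis(X, mode, 0)
    return np.moveaxis(np.tensordot(M, Xm, axes=(1, 0)), 0, mode)

rng = np.random.default_rng(4)
p, r, sigma = (30, 30, 30), (3, 3, 3), 1.0

S = 20.0 * rng.normal(size=r)                              # core sets the signal strength
U_true = [np.linalg.qr(rng.normal(size=(p[k], r[k])))[0] for k in range(3)]

X = S
for k in range(3):
    X = mode_k_product(X, U_true[k], k)
Y = X + sigma * rng.normal(size=p)                         # noisy observation
```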


Straightforward Idea 1: Higher order SVD (HOSVD)

• Since Uk spans the column space of Mk(X), let

  Ûk = SVD_rk(Mk(Y)),   k = 1, 2, 3,

  i.e., the leading rk left singular vectors of the matrix whose columns are the mode-k fibers.

  Note: SVD_r(·) denotes the first r left singular vectors of a given matrix.
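
A minimal HOSVD-style sketch of this idea in numpy; the helper names (`unfold`, `svd_r`, `hosvd_subspaces`) are hypothetical, and `Y` and the ranks are assumed to come from a model like the one two slides back.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k matricization M_k(T)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def svd_r(M, r):
    """First r left singular vectors of M (the SVD_r(.) operator on this slide)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :r]

def hosvd_subspaces(Y, r):
    """HOSVD: leading r_k left singular vectors of each matricization of Y."""
    return [svd_r(unfold(Y, k), r[k]) for k in range(3)]

# Example usage (with hypothetical data): U_hat = hosvd_subspaces(Y, r=(3, 3, 3))
```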


Straightforward Idea 1: Higher order SVD (HOSVD)

(De Lathauwer, De Moor, and Vandewalle, SIAM J. Matrix Anal. & Appl. 2000a)

• Advantage: easy to implement and analyze.

• Disadvantage: performs sub-optimally.
  Reason: simply unfolding the tensor fails to utilize the tensor structure!


Straightforward Idea 2: Maximum Likelihood Estimator

• Maximum likelihood estimator:

  (Û1^mle, Û2^mle, Û3^mle, Ŝ^mle) = argmin_{U1,U2,U3,S} ‖Y − S ×1 U1 ×2 U2 ×3 U3‖F².

• Equivalently, Û1^mle, Û2^mle, Û3^mle can be calculated via

  max ‖Y ×1 V1⊤ ×2 V2⊤ ×3 V3⊤‖F²   subject to   V1 ∈ O_{p1,r1}, V2 ∈ O_{p2,r2}, V3 ∈ O_{p3,r3}.

• Advantage: achieves statistical optimality (shown later).
• Disadvantage:
  - non-convex and computationally intractable;
  - NP-hard to approximate even when r = 1 (Hillar and Lim, 2013).


Phase Transition in Tensor SVD

• The difficulty is driven by signal-to-noise ratio (SNR).

  λ = min_{k=1,2,3} σ_{rk}(Mk(X)),   i.e., the least non-zero singular value of Mk(X), minimized over k = 1, 2, 3;

  σ = SD(Z) = noise level.

• Suppose p1 ≍ p2 ≍ p3 ≍ p. Three phases:

  λ/σ ≥ C p^(3/4)            (strong SNR case),
  λ/σ < c p^(1/2)            (weak SNR case),
  p^(1/2) ≲ λ/σ ≲ p^(3/4)    (moderate SNR case).
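
As a small illustration of these definitions, the sketch below computes λ = min_k σ_{rk}(Mk(X)) and classifies the SNR regime; it ignores the unspecified constants C and c, and all helper names are hypothetical (numpy assumed).

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def signal_strength(X, r):
    """lambda = min_k (r_k-th largest singular value of M_k(X))."""
    return min(np.linalg.svd(unfold(X, k), compute_uv=False)[r[k] - 1] for k in range(3))

def snr_phase(X, r, sigma, p):
    """Rough regime label with the constants C, c dropped (illustration only)."""
    snr = signal_strength(X, r) / sigma
    if snr >= p ** 0.75:
        return "strong"
    elif snr < p ** 0.5:
        return "weak"
    return "moderate"
```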


Strong SNR Case: Methodology

• When λ/σ ≥ C p^(3/4), apply higher-order orthogonal iteration (HOOI).
  (De Lathauwer, De Moor, and Vandewalle, SIAM J. Matrix Anal. Appl., 2000b)

• (Step 1. Spectral initialization)

  Ûk^(0) = SVD_rk(Mk(Y)),   k = 1, 2, 3.

• (Step 2. Power iterations)
  Repeat: let t = t + 1 and calculate

  Û1^(t) = SVD_r1( M1( Y ×2 (Û2^(t−1))⊤ ×3 (Û3^(t−1))⊤ ) ),
  Û2^(t) = SVD_r2( M2( Y ×1 (Û1^(t))⊤ ×3 (Û3^(t−1))⊤ ) ),
  Û3^(t) = SVD_r3( M3( Y ×1 (Û1^(t))⊤ ×2 (Û2^(t))⊤ ) ).

  Until t = t_max or convergence.
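
A numpy sketch of the HOOI procedure above, spectral initialization followed by power iterations; the helper names and the default t_max are illustrative choices, while the sequential mode-by-mode updates follow the displayed scheme.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_k_product(T, M, mode):
    Tm = np.moveaxis(T, mode, 0)
    return np.moveaxis(np.tensordot(M, Tm, axes=(1, 0)), 0, mode)

def svd_r(M, r):
    return np.linalg.svd(M, full_matrices=False)[0][:, :r]

def hooi(Y, r, t_max=20):
    # Step 1: spectral initialization (HOSVD)
    U = [svd_r(unfold(Y, k), r[k]) for k in range(3)]
    # Step 2: power iterations, updating one mode at a time with the latest estimates
    for _ in range(t_max):
        for k in range(3):
            T = Y
            for j in range(3):
                if j != k:
                    T = mode_k_product(T, U[j].T, j)   # project the other two modes
            U[k] = svd_r(unfold(T, k), r[k])
    return U
```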


Interpretation

1. Spectral initialization provides a “warm start.”
2. Power iteration refines the initialization.

  Given Û1^(t−1), Û2^(t−1), Û3^(t−1), denoise Y via

  Y ×2 (Û2^(t−1))⊤ ×3 (Û3^(t−1))⊤.

  - The mode-1 singular subspace is preserved;
  - the noise can be greatly reduced.

  Thus, we update

  Û1^(t) = SVD_r1( M1( Y ×2 (Û2^(t−1))⊤ ×3 (Û3^(t−1))⊤ ) ).


Higher-order orthogonal iteration (HOOI)

(De Lathauwer, Moor, and Vandewalle, SIAM. J. Matrix Anal. & Appl. 2000b)


Strong SNR Case: Theoretical Analysis

Theorem (Upper Bound)

Suppose λ/σ ≥ C p^(3/4) and other regularity conditions hold. Then after at most O(log(p/λ) ∨ 1) iterations,

• (Recovery of U1, U2, U3)

  E min_{O ∈ O_rk} ‖Ûk − Uk O‖F ≤ C √(pk rk) / (λ/σ),   k = 1, 2, 3;

• (Recovery of X)

  sup_{X ∈ F_{p,r}(λ)} E ‖X̂ − X‖F² ≤ C (p1 r1 + p2 r2 + p3 r3) σ²,

  sup_{X ∈ F_{p,r}(λ)} E ‖X̂ − X‖F² / ‖X‖F² ≤ C (p1 + p2 + p3) σ² / λ².
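
The loss min_{O ∈ O_rk} ‖Ûk − Uk O‖F used in the theorem has a closed form via the orthogonal Procrustes problem; here is a small numpy sketch (the inputs are assumed to be p × r matrices with orthonormal columns).

```python
import numpy as np

def procrustes_loss(U_hat, U):
    """min over orthogonal O of ||U_hat - U @ O||_F, via orthogonal Procrustes."""
    A, _, Bt = np.linalg.svd(U.T @ U_hat)    # SVD of the r x r alignment matrix
    O = A @ Bt                               # optimal rotation aligning U to U_hat
    return np.linalg.norm(U_hat - U @ O)
```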


Key Tool

• Key tool: a one-sided perturbation bound.

• The matricization M1(Y) ∈ R^(p1×(p2p3)) is flat:

  M1(Y) = M1(X) + M1(Z) ∈ R^(p1×(p2p3)),

  where M1(Y) is observed, rank(M1(X)) ≤ r1, and M1(Z) has iid N(0, σ²) entries.

• Let

  Û1 = SVD_r(M1(Y)),   U1 = SVD_r(M1(X)),
  V̂1 = SVD_r(M1(Y)⊤),  V1 = SVD_r(M1(X)⊤).

  Naturally, the left singular subspace Û1 of M1(Y) is more “informative” than the right one V̂1.


One-sided Perturbation Analysis

• Traditional perturbation bounds are usually two-sided, e.g., Wedin's lemma (Wedin, 1972):

  max{ ‖sin Θ(Û1, U1)‖F, ‖sin Θ(V̂1, V1)‖F } ≤ ...

• One-sided perturbation bound (Cai and Z., 2018):

  Y = X + Z,   X, Y, Z ∈ R^(p1×(p2p3)),
  rank(X) = r,   σ_r(X) = λ,   Z has iid N(0, 1) entries.

Theorem

  E ‖sin Θ(Û1, U1)‖² ≍ p1/λ² + p1 p2 p3/λ⁴,

  E ‖sin Θ(V̂1, V1)‖² ≍ p2 p3/λ² + p1 p2 p3/λ⁴.
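
For reference, the sin Θ distance can be computed from the principal angles between the two column spaces; a minimal numpy sketch follows (inputs assumed to have orthonormal columns; the choice between the spectral and Frobenius norms is a parameter).

```python
import numpy as np

def sin_theta(U_hat, U, norm="spectral"):
    """||sin Theta(U_hat, U)||: singular values of U^T U_hat are the cosines
    of the principal angles, so the sines follow from sqrt(1 - cos^2)."""
    cosines = np.clip(np.linalg.svd(U.T @ U_hat, compute_uv=False), 0.0, 1.0)
    sines = np.sqrt(1.0 - cosines ** 2)
    return sines.max() if norm == "spectral" else np.linalg.norm(sines)
```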


Strong SNR Case: Lower Bound

Define the following class of low-rank tensors with signal strength λ:

  F_{p,r}(λ) = { X ∈ R^(p1×p2×p3) : rank(X) = (r1, r2, r3), σ_{rk}(Mk(X)) ≥ λ }.

Theorem (Lower Bound)

(Recovery of U1, U2, U3)

  inf_{Ûk} sup_{X ∈ F_{p,r}(λ)} E min_{O ∈ O_rk} ‖Ûk − Uk O‖F ≥ c √(pk rk) / (λ/σ),   k = 1, 2, 3.

(Recovery of X)

  inf_{X̂} sup_{X ∈ F_{p,r}(λ)} E ‖X̂ − X‖F² ≥ c (p1 r1 + p2 r2 + p3 r3) σ²,

  inf_{X̂} sup_{X ∈ F_{p,r}(λ)} E ‖X̂ − X‖F² / ‖X‖F² ≥ c (p1 + p2 + p3) σ² / λ².


HOSVD vs. HOOI

  E min_{O ∈ O_rk} ‖Ûk^HOSVD − Uk O‖F ≍ √(pk rk)/(λ/σ) + √(p1 p2 p3 rk)/(λ/σ)²;

  E min_{O ∈ O_rk} ‖Ûk^HOOI − Uk O‖F ≍ √(pk rk)/(λ/σ).

• When λ/σ ≤ cp, HOOI significantly improves upon HOSVD.

• The analysis of rank-r tensor SVD is more difficult than that of rank-1 tensor SVD or rank-r matrix SVD.
  - Many concepts (e.g., singular values) are not well defined for tensors.


Weak SNR Case

Under the weak SNR case λ/σ < c p^(1/2), U1, U2, U3, or X cannot be stably estimated in general.

Theorem

(Recovery of U1, U2, U3)

  inf_{Ûk} sup_{X ∈ F_{p,r}(λ)} E min_{O ∈ O_rk} rk^(−1/2) ‖Ûk − Uk O‖F ≥ c,   k = 1, 2, 3.

(Recovery of X)

  inf_{X̂} sup_{X ∈ F_{p,r}(λ)} E ‖X̂ − X‖F² / ‖X‖F² ≥ c.


Moderate SNR Case: Statistical Optimality

• First, the MLE achieves statistical optimality.

Theorem (Performance of the MLE)

When λ/σ ≥ C p^(1/2),

• (Recovery of U1, U2, U3)

  sup_{X ∈ F_{p,r}(λ)} E min_{O ∈ O_rk} ‖Ûk^mle − Uk O‖F ≤ C √(pk rk) / (λ/σ),   k = 1, 2, 3;

• (Recovery of X)

  sup_{X ∈ F_{p,r}(λ)} E ‖X̂^mle − X‖F² ≤ C (p1 r1 + p2 r2 + p3 r3) σ²,

  sup_{X ∈ F_{p,r}(λ)} E ‖X̂^mle − X‖F² / ‖X‖F² ≤ C (p1 + p2 + p3) σ² / λ².

• However, the MLE is computationally intractable.


Simulation Analysis

• Consider random settings: λ = p^α, α ∈ [0.4, 0.9], σ = 1.

  [Figure: estimation error ℓ∞(Û) versus α ∈ [0.4, 0.9]; curves for warm-start iterations and spectral (HOSVD) initialization at p = 50, 80, 100.]

• Two phase transitions:
  - the computationally inefficient method performs well starting at λ/σ ≈ p^(1/2);
  - the computationally efficient HOOI performs well starting at λ/σ ≈ p^(3/4).


Moderate SNR Case: Computational Optimality

Moreover, the following theorem shows the computational hardness for polynomial-time algorithms under moderate SNR.

Theorem

Assume the hypergraphic planted clique conjecture holds and λ/σ = O(p^(3(1−τ)/4)) for any fixed τ > 0. Then for any polynomial-time algorithm producing Û1, Û2, Û3, X̂,

(Recovery of U1, U2, U3)

  lim inf_{p→∞} sup_{X ∈ F_{p,r}(λ)} E ‖sin Θ(Ûk^(p), Uk)‖² ≥ c1,   k = 1, 2, 3,

(Recovery of X)

  lim inf_{p→∞} sup_{X ∈ F_{p,r}(λ)} E ‖X̂^(p) − X‖F² / ‖X‖F² ≥ c1.


Remarks

• The analysis relies on the hypergraphic planted clique detection conjecture.

• The result shows the hardness of tensor SVD in the moderate SNR case.

• More recently, Ben Arous, Mei, Montanari, and Nica (2017) analyzed the landscape of the rank-1 spiked tensor model.
  - The MLE objective has exponentially many critical points.


Summary

Tensor SVD exhibits three phases:

• (Strong SNR) λ/σ ≥ C p^(3/4)
  → there is an efficient algorithm to estimate U1, U2, U3, and X.

• (Weak SNR) λ/σ < c p^(1/2)
  → no algorithm can stably recover U1, U2, U3, or X.

• (Moderate SNR) p^(1/2) ≲ λ/σ ≲ p^(3/4)
  - the non-convex MLE stably recovers U1, U2, U3, and X;
  - possibly no polynomial-time algorithm performs stably.


Further Generalization to Order-d Tensors

• The results can be generalized to order-d tensors.

• Three phases:
  - (Strong SNR) λ/σ ≥ C p^(d/4) → an efficient algorithm exists.
  - (Weak SNR) λ/σ < c p^(1/2) → no algorithm can stably recover.
  - (Moderate SNR) p^(1/2) ≲ λ/σ ≲ p^(d/4):
    * an inefficient, statistically optimal algorithm exists;
    * possibly no polynomial-time algorithm performs stably.

• Remark:
  - d = 2, i.e., matrix SVD: the statistical and computational gap closes.
  - d ≥ 3: tensor SVD involves not only statistical but also computational challenges.


References

• Zhang, A. and Xia, D. (2018). Tensor SVD: Statistical and Computational Limits. IEEE Transactions on Information Theory, to appear.

• Cai, T. and Zhang, A. (2018). Rate-Optimal Perturbation Bounds for Singular Subspaces with Applications to High-Dimensional Statistics. Annals of Statistics, 46, 60-89.

• Zhang, A. and Han, R. (2017+). Optimal Denoising and Singular Value Decomposition for Sparse High-dimensional High-order Data. Submitted.
