Tensor decompositions, sum-of-squares proofs, and spectral ...prevailing wisdom: sum-of-squares is “strictly theoretical” sum-of-squares algorithms have nothing to do with practical

Quarterly Theory Workshop, Northwestern, May 2016

Tensor decompositions, sum-of-squares proofs, and

spectral algorithms

David SteurerCornell

Tengyu Ma

Princeton

Sam B. Hopkins

Cornell

Jonathan Shi

Cornell

Tselil Schramm

Berkeley

tensor multi-index array of numbers (typically ≥ 3 indices/modes)

𝑇 =

𝑖,𝑗,𝑘∈ 𝑑

𝑇𝑖𝑗𝑘 ⋅ 𝑒𝑖 ⊗𝑒𝑗 ⊗𝑒𝑘 ∈ ℝ𝑑 ⊗3

standard basis 𝑒1, … , 𝑒𝑑 ∈ ℝ𝑑

𝑎 ⊗ 𝑏⊗ 𝑐 =


𝑎, 𝑒𝑖 𝑏, 𝑒𝑗 ⟨𝑐, 𝑒𝑘⟩ ⋅ 𝑒𝑖 ⊗𝑒𝑗 ⊗𝑒𝑘

tensor multi-index array of numbers (typically ≥ 3 indices/modes)

𝑇 =


𝑇𝑖𝑗𝑘 ⋅ 𝑒𝑖 ⊗𝑒𝑗 ⊗𝑒𝑘 ∈ ℝ𝑑 ⊗3

standard basis 𝑒1, … , 𝑒𝑑 ∈ ℝ𝑑

𝑎 ⊗ 𝑏⊗ 𝑐 =


𝑎, 𝑒𝑖 𝑏, 𝑒𝑗 ⟨𝑐, 𝑒𝑘⟩ ⋅ 𝑒𝑖 ⊗𝑒𝑗 ⊗𝑒𝑘

“tensors are the new matrices”

natural shape of data

"deep learning” frameworks: torch / theano / tensorflow

coefficients of multivariate polynomials 𝑇 = σ𝑖𝑗𝑘 𝑇𝑖𝑗𝑘 ⋅ 𝑥𝑖𝑥𝑗𝑥𝑘

moments of multivariate distributions 𝑇 = 𝔼𝑥∼𝐷𝑥⊗3

states of composite quantum systems 𝜓 ∈ 𝐴⊗𝐵⊗𝐶

“algorithms for the tensor age”

tie together wide range of disciplines

hope to repeat success for matrices

tensor decomposition (tensor rank)

given 3-tensor 𝑇, find as few vectors 𝑎𝑖 , 𝑏𝑖 , 𝑐𝑖 𝑖∈[𝑟] as possible such that

𝑇 =

𝑖=1

𝑟

𝑎𝑖 ⊗𝑏𝑖 ⊗ 𝑐𝑖

𝑇

tensor decomposition is NP-hard in worst case

cannot hope for same theory as for matrices

but: can still hope for algorithms with strong provable guarantees

key advantage over matrix rank/factorization

intuition: explain data in simplest way possible

in contrast: tensor decomposition often unique

tractability appears to go hand in hand with uniqueness

matrix factorization suffers from “rotation problem”

key challenge

𝑀 = 𝐴𝐵⊤

⇔𝑀 = 𝐴𝑈 𝐵𝑈−1 ⊤

tensor decomposition (tensor rank)

given 3-tensor 𝑇, find as few vectors 𝑎𝑖 , 𝑏𝑖 , 𝑐𝑖 𝑖∈[𝑟] as possible such that

𝑇 =

𝑖=1

𝑟

𝑎𝑖 ⊗𝑏𝑖 ⊗ 𝑐𝑖

𝑇

poly-time & practical unsupervised learning via tensor decomposition

Gaussian mixtures

topic modelling (latent Dirichlet allocation)

phylogenetic tree / hidden Markov model

blind-source separation, independent component analysis

[Chang’96; Mossel, Roch]

[Leurgans; Lathauwer, Castaing, Cardoso’07]

[Anandkumar, Ge, Hsu, Kakade, Telgarsky’14]

[Bhaskara-Charikar-Moitra-Vijayaraghavan’14]

moment problem for multivariate discrete distributions

hidden: set of vectors 𝑎1, … , 𝑎𝑛 ∈ ℝ𝑑

given: low-degree moments ℳ1, … ,ℳ𝑘 of uniform distribution over 𝑎1, … , 𝑎𝑛

find: set of vectors ≈ 𝑎1, … , 𝑎𝑛

under what conditions on the vectors and 𝒌 can we solve this problem efficiently and robustly?

(reformulation of tensor decomposition problem)

ℳ𝑘 ≔1

𝑛

𝑖=1

𝑛

𝑎𝑖⊗𝑘

linearly independent vectors (thus, 𝑛 ≤ 𝑑)

spectral algorithm for 𝑘 = 3 (matrix diagonalization) [Jennrich via Harshman’70; Leurgans-Ross-Abel’93; rediscovered many times]

wlog 𝑎1, … , 𝑎𝑛 orthonormal (apply linear transformation 1

𝑛ℳ2

−1/2)





ℳ𝑘 ≔1

𝑛

𝑖=1

𝑛

𝑎𝑖⊗𝑘

key challenge: decompose overcomplete tensors, i.e., rank ≫ dimension

∃ poly-time algorithm for rank 𝒏 = 𝒅𝟏.𝟎𝟏?





ℳ𝑘 ≔1

𝑛

𝑖=1

𝑛

𝑎𝑖⊗𝑘

random unit vectors for rank 𝒏 ≫ 𝒅 and 𝒌 = 𝟑 moments

[Anandkumar-Ge-Janzamin’15]

[Ge-Ma’15 analysis of Barak-Kelner-S.’15 algorithm]

2𝐶2⋅ 𝑑3

𝑑log 𝑑

𝐶 ⋅ 𝑑

𝑑1.5

𝑑1.5 [Anandkumar-Ge-Janzamin’15](only local convergence)tensor power iteration

sum-of-squares

spectral algorithm

largest rank 𝒏 running time

[Hopkins-Schramm-Shi-S’16]

[Ma-Shi-S’16+]

𝑑1+𝜔 ≤ 𝑑3.33𝑑𝑂 1

𝑑1.33𝑑1.5

this talk:

SOS-flavored spectral

sum-of-squares





ℳ𝑘 ≔1

𝑛

𝑖=1

𝑛

𝑎𝑖⊗𝑘

random unit vectors for rank 𝒏 ≫ 𝒅 and 𝒌 = 𝟑 moments

[Anandkumar-Ge-Janzamin’15]

[Ge-Ma’15 analysis of Barak-Kelner-S.’15 algorithm]

2𝐶2⋅ 𝑑3

𝑑log 𝑑

𝐶 ⋅ 𝑑

𝑑1.5

𝑑1.5 [Anandkumar-Ge-Janzamin’15](only local convergence)tensor power iteration

sum-of-squares

spectral algorithm

largest rank 𝒏 running time

smoothed unit vectors (Spielman–Teng smoothed analysis framework)

assume each vector is independently perturbed by 𝑛−𝑂 1 norm Gaussian

combines large linear system and spectral algorithm (FOOBI)

assumes exact input; not known to tolerate 𝑛−𝑂 1 error

spectral algorithm; tolerates 𝑛−𝑂 1 error

this talk:





ℳ𝑘 ≔1

𝑛

𝑖=1

𝑛

𝑎𝑖⊗𝑘

[Lathauwer, Castaing, Cardoso’07]

poly-time algorithm for 𝑘 = 4 up to rank 𝑛 ≤ 𝑑2

poly-time algorithm for 𝑘 = 5 up to rank 𝑛 ≤ 𝑑2 [Bhaskara-Charikar-Moitra-Vijayaraghavan’14]

same guarantees as FOOBI but tolerate 𝑛−𝑂 1 errorbased on sum-of-squares

general unit vectors

for simplicity: isotropic position σ𝑖=1𝑛 𝑎𝑖𝑎𝑖

⊤ =𝑛

𝑑Id

quasi-poly time algorithm with accuracy 𝜀 for 𝑘 ≥ 𝜀−1 log𝑛

𝑑

based on sum-of-squares

corollary: overcomplete dictionary learning with constant relative sparsity and constant accuracy in polynomial time

previous best: either sparsity 𝑛−Ω 1 or time 𝑛𝑂(log 𝑛)

this talk: poly-time algorithm (in size of input) with same recovery guarantees





ℳ𝑘 ≔1

𝑛

𝑖=1

𝑛

𝑎𝑖⊗𝑘

[Barak-Kelner-S.’15]

[Barak-Kelner-S.’15]





ℳ𝑘 ≔1

𝑛

𝑖=1

𝑛

𝑎𝑖⊗𝑘

𝑔

ℳ3

Jennrich’s algorithm on 3rd moments

assume 𝑎1, … , 𝑎𝑛 orthonormal

let 𝑔 ∼ 𝒩(0, Id𝑑) be standard Gaussian vector

then, Id ⊗ Id⊗ 𝑔⊤ ℳ3 =1

𝑛σ𝑖 𝑔, 𝑎𝑖 ⋅ 𝑎𝑖𝑎𝑖

⊤

every 𝑎𝑖 is eigenvector with value 𝑔, 𝑎𝑖 /𝑛;w.h.p. all eigenvalues distinct

eigendecomposition recovers 𝑎1, … , 𝑎𝑛

“random contraction of one mode”

challenge: what can we do when 𝒏 ≫ 𝒅 (overcomplete case)?

approach for random overcomplete 3-tensors

(lifts 3rd moments to 6th moments)

magic box

Jennrich’salgorithm

𝑎1⊗2, … , 𝑎𝑛

⊗2

(𝑎1⊗2, … , 𝑎𝑛

⊗2 orthogonal enoughfor Jennrich’s algorithm)

6th moments of 𝑎1, … , 𝑎𝑛= 3rd moments of 𝑎1

⊗2, … , 𝑎𝑛⊗2?

let 𝑎1, … , 𝑎𝑛 be random unit vectors for 𝑛 ≪ 𝑑1.5

ℳ3 =

𝑖

𝑎𝑖⊗3 ℳ6 =

𝑖

𝑎𝑖⊗6


ℳ3 =

𝑖

𝑎𝑖⊗3 ℳ6 =

𝑖

𝑎𝑖⊗6magic

box


𝑎1⊗2, … , 𝑎𝑛

⊗2

(𝑎1⊗2, … , 𝑎𝑛



“ideal implementation”(ignore efficiency for now)

find distribution 𝐷 over unit sphere

subject to 𝔼𝑢∼𝐷𝑢⊗3 =ℳ3

𝔼𝑢∼𝐷𝑢⊗6


ℳ3 =

𝑖

𝑎𝑖⊗3 ℳ6 =

𝑖

𝑎𝑖⊗6magic

box


𝑎1⊗2, … , 𝑎𝑛

⊗2

(𝑎1⊗2, … , 𝑎𝑛







claim: 𝔼𝐷 𝑢 𝑢⊗6 ≈ℳ6

ℳ3, 𝔼𝐷 𝑢 𝑢⊗3 =

1

𝑛𝔼𝐷 𝑢 σ𝑖=1

𝑛 𝑎𝑖 , 𝑢3

ℳ3,ℳ3 =1

𝑛2σ𝑖,𝑗∈ 𝑛 𝑎𝑖 , 𝑎𝑗

3 =1±𝑜 1

𝑛

𝔼𝐷 𝑢 σ𝑖=1𝑛 𝑎𝑖 , 𝑢

3 = 1 ± 𝑜 1 (∗)

crucially: with high prob. over 𝑎1, … , 𝑎𝑛,

∀𝑢. σ𝑖=1𝑛 𝑎𝑖 , 𝑢

3 = max𝑖∈[𝑛]

𝑎𝑖 , 𝑢3 ± 𝑜(1)

therefore, ∗ implies

Pr𝐷 𝑢

max𝑖

𝑎𝑖 , 𝑢 ≥ 1 − 𝑜 1 ≥ 1 − 𝑜 1

𝔼𝐷 𝑢 σ𝑖=1𝑛 𝑎𝑖 , 𝑢

6 = 1 ± 𝑜 1

ℳ6 − 𝔼𝐷 𝑢 𝑢⊗6 ≤ 𝑜 1 ⋅ ℳ6

…

proof:

𝑎6

𝑎2𝑎3𝑎4𝑎5

𝑎1

plot of σ𝑖=1𝑛 𝑎𝑖 , 𝑢

3


ℳ3 =

𝑖

𝑎𝑖⊗3 ℳ6 =

𝑖

𝑎𝑖⊗6magic

box


𝑎1⊗2, … , 𝑎𝑛

⊗2

(𝑎1⊗2, … , 𝑎𝑛







claim: 𝔼𝐷 𝑢 𝑢⊗6 ≈ℳ6

two remaining questions:

1. how to find 𝐷 efficiently?

2. can Jennrich tolerate this kind of error?

relax search to sum-of-squares pseudo-distributions

no, error is too large!

add maximum entropy constraint 𝔼𝑢∼𝐷𝑢⊗4

spectral≤

1+𝑜(1)

𝑛

suppose ෩ℳ3 −ℳ3 𝐹≤ 𝑜 1 ⋅ ℳ3 𝐹 and ෩ℳ2 spectral

≤ 𝑂 1 /𝑛.

𝑎1, … , 𝑎𝑛 ∈ ℝ𝑑 orthonormal; moments ℳ𝑘 =

1

𝑛σ𝑖=1𝑛 𝑎𝑖

⊗𝑘

distribution 𝐷 over sphere; moments ෩ℳ𝑘 = 𝔼𝐷 𝑢 𝑢⊗𝑘

then, for most 𝑖 ∈ [𝑛], with probability 1

𝑛𝑂 1 over the choice 𝑔 ∼ 𝒩 0, Id𝑑2 ,

Id𝑑 ⊗ Id𝑑 ⊗𝑔⊤ ෩ℳ3 has top eigenvector ≈ 𝑎𝑖

robust analysis of Jennrich’s algorithm

+ Id𝑑 ⊗ Id𝑑 ⊗ ො𝑔⊤ ෩ℳ3

𝑔, 𝑎𝑖 ⋅ 𝑎𝑖𝑔

ො𝑔= 𝑔, 𝑎𝑖 Id𝑑 ⊗ Id𝑑 ⊗𝑎𝑖

⊤ ෩ℳ3

Id𝑑 ⊗ Id𝑑 ⊗𝑔⊤ ෩ℳ3

≈1

𝑛𝑎𝑖𝑎𝑖

⊤‖ ⋅ ‖spectral ≤

log𝑑

𝑛

overwhelms noise with

probability 𝑒− 𝑙𝑜𝑔 𝑑2

≥ 𝑑−𝑂 1

□

degree-𝒌 pseudo-distribution over unit sphere 𝕊𝑑−1 ⊆ ℝ𝑑

• finitely supported function 𝐷: 𝕊𝑑−1 → ℝ• σ𝑢𝐷 𝑢 = 1 (sum is only over support of 𝐷)

• σ𝑢𝐷 𝑢 ⋅ 𝑓 𝑢 2 ≥ 0 for every 𝑓: 𝕊𝑑−1 → ℝ with deg 𝑓 ≤ 𝑘/2

notation: ෩𝔼𝐷 𝑢 𝑓 𝑢 ≝ σ𝑢𝐷 𝑢 ⋅ 𝑓 𝑢

probability theory meets complexity theory

efficiency of pseudo-distributions [Shor, Parrilo, Lasserre]

deg∞

deg 𝑑

deg2

deg6

set of degree-𝑘 pseudo-moments has 𝑑𝑂(𝑘)-time separation oracle;

key step: check 𝑘th pseudo-moment satisfies ෩𝔼𝐷 𝑢 𝑢⊗𝑘/2 𝑢⊗𝑘/2 ⊤

≽ 0

• low-complexity events always have nonnegative probability• high-complexity events may have negative probability

— pseudo-expectation of 𝑓 under 𝐷

generalizes best known poly-time algorithms for wide range of problems

degree-𝒌 pseudo-distribution over unit sphere 𝕊𝒅−𝟏 ⊆ ℝ𝒅

• finitely supported function 𝐷: 𝕊𝑑−1 → ℝ• σ𝑢𝐷 𝑢 = 1 (sum is only over support of 𝐷)

• σ𝑢𝐷 𝑢 ⋅ 𝑓 𝑢 2 ≥ 0 for every 𝑓: 𝕊𝑑−1 → ℝ with deg 𝑓 ≤ 𝑘

notation: ෩𝔼𝐷 𝑢 𝑓 𝑢 ≝ σ𝑢𝐷 𝑢 ⋅ 𝑓 𝑢

degree-𝒌 sum-of-squares proof of ∀𝑢 ∈ 𝕊𝑑−1. 𝑓 𝑢 ≥ 𝑔(𝑢)

functions ℎ1, … , ℎ𝑟 with deg ℎ1 , … , deg ℎ𝑟 ≤ 𝑘/2

∀𝑢 ∈ 𝕊𝑑−1. 𝑓 𝑢 − 𝑔 𝑢 = ℎ1 𝑢 2 +⋯+ ℎ𝑟 𝑢 2

if 𝑓 ≥ 𝑔 has degree-𝑘 sos proof, then ෩𝔼𝐷 𝑢 𝑓 𝑢 ≥ ෩𝔼𝐷 𝑢 𝑔 𝑢

for every degree-𝑘 pseudo-distribution 𝐷

duality: pseudo-distributions vs sum-of-squares proofs

lifting moments higher via sum-of-squares

ℳ3 =1


⊗3 for random unit vectors 𝑎1, … , 𝑎𝑛 ∈ ℝ𝑑 and 𝑛 ≪ 𝑑1.5

theorem: w.h.p. over 𝑎1, … , 𝑎𝑛, every degree-12 pseudo-distribution 𝐷

with ෩𝔼𝐷 𝑢 𝑢⊗3 = ℳ3 satisfies ෩𝔼𝐷 𝑢 𝑢

⊗6 −ℳ6 ≤ 𝑜 1 ℳ6 .

lifting moments higher via sum-of-squares

ℳ3 =1


⊗3 for random unit vectors 𝑎1, … , 𝑎𝑛 ∈ ℝ𝑑 and 𝑛 ≪ 𝑑1.5

theorem: w.h.p. over 𝑎1, … , 𝑎𝑛, every degree-12 pseudo-distribution 𝐷

with ෩𝔼𝐷 𝑢 𝑢⊗3 = ℳ3 satisfies ෩𝔼𝐷 𝑢 𝑢

⊗6 −ℳ6 ≤ 𝑜 1 ℳ6 .

enough to show (same as in previous proof for probability distributions):

෩𝔼𝐷 𝑢 σ𝑖=1𝑛 𝑎𝑖 , 𝑢

3 ≥ 1 − 𝑜 1 ෩𝔼𝐷 𝑢 σ𝑖=1𝑛 𝑎𝑖 , 𝑢

6 ≥ 1 − 𝑜 1⇒

w.h.p. over 𝑎1, … , 𝑎𝑛, the following inequality has degree-12 sos proof

∀𝑢 ∈ 𝕊𝑑−1. σ𝑖=1𝑛 𝑎𝑖 , 𝑢

3 ≤3

4+

1

4σ𝑖=1𝑛 𝑎𝑖 , 𝑢

6 + 𝑜 1

key ingredient is to bound spectral norm of random matrix polynomial

[Ge–Ma’15]

σ𝑖≠𝑗 𝑎𝑖 , 𝑎𝑗 ⋅ 𝑎𝑖𝑎𝑖⊤⊗𝑎𝑗𝑎𝑗

⊤ ≤ 𝑜 1

prevailing wisdom: sum-of-squares is “strictly theoretical”

sum-of-squares algorithms have nothing to do with practical ones

at odds with tenet of computational complexity that polynomial-time is a good model for practical algorithms

next:

algorithm to decompose random overcomplete 3-tensorwith close to linear running time (in size of input)and guarantees close to those of sum-of-squares

general recipe for new kinds of fast spectral algorithms inspired by SOS

ℳ3෩ℳ6SOS


𝑎1⊗2, … , 𝑎𝑛

⊗2

random unit vectors 𝑎1, … , 𝑎𝑛 ∈ ℝ𝑑 with 𝑛 ≪ 𝑑1.5 ; moments ℳ𝑘 =

1


⊗𝑘

approach for fast decomposition of overcomplete 3-tensor

too slow!

ℳ3෩ℳ6SOS


𝑎1⊗2, … , 𝑎𝑛

⊗2

random unit vectors 𝑎1, … , 𝑎𝑛 ∈ ℝ𝑑 with 𝑛 ≪ 𝑑1.5 ; moments ℳ𝑘 =

1


⊗𝑘

approach for fast decomposition of overcomplete 3-tensor

direct algorithmto fool Jennrich

𝑖𝑗𝑘

𝑎𝑖 , 𝑎𝑘 ⟨𝑎𝑗 , 𝑎𝑘⟩ ⋅ 𝑎𝑖 ⊗𝑎𝑗 ⊗ 𝑎𝑖 ⊗𝑎𝑗 ⊗𝑎𝑘

=𝑖=1

𝑛

𝑎𝑖⊗5 + 𝐸 with 𝐸 “random-like”

and 𝐸 𝐹2 ≤

𝑛3

𝑑2

claim: if 𝑛 ≪ 𝑑1.33 then 𝐸 contributes negligible spectral error for Jennrich

(huge Frobenius norm)

input to Jennrich has “size” 𝑑5 (computing it takes naively O 𝑑6 time)

exploit tensor structure to implement Jennrich in time 𝑂 𝑑1+𝜔 ≤ 𝑂 𝑑3.3…

meta result* (* some technical conditions omitted)

efficient algorithm to solve polynomial optimization problems that have only few global optima

sum-of-squares method (based on semidefinite programming) [Shor, Parrilo, Lasserre]

running time poly #solutions also need short sum-of-squares certificate for this fact

previous work: running time 𝑛𝑂 log #solutions

(quasi-poly time for poly #solutions)

[Barak-Kelner-S STOC’15]

# bad local optimacan be exponential

local-search algorithms fail

meta result* (* some technical conditions omitted)

efficient algorithm to solve polynomial optimization problems that have only few global optima

sum-of-squares method (based on semidefinite programming) [Shor, Parrilo, Lasserre]

running time poly #solutions also need short sum-of-squares certificate for this fact

applications: unsupervised learning problems tend to have this property

identifiability: data uniquely determines parameters of model

our work: notion of constructive identifiability proofs that implies efficient inference algorithms

tensor decomposition / polynomial optimization via sum-of-squares

use Jennrich’s algorithm (small spectral gaps) as rounding algorithm

fast spectral algorithms via sum-of-squares

fool rounding algorithm by low-degree matrix polynomial of input

sum-of-squares proof for approximate uniqueness (identifiability)

exploit tensor structure for fast algebraic operations

conclusions

tensor decomposition / polynomial optimization via sum-of-squares

use Jennrich’s algorithm (small spectral gaps) as rounding algorithm

fast spectral algorithms via sum-of-squares

fool rounding algorithm by low-degree matrix polynomial of input

sum-of-squares proof for approximate uniqueness (identifiability)

exploit tensor structure for fast algebraic operations

conclusions

thank youvery much!random 3-tensors beyond rank 𝒅𝟏.𝟓?

lower bounds? hard to distinguish from completely random 3-tensors?

smoothed analysis for overcomplete 3-tensors?

strong bounds known for 4-tensors [Lathauwer, Castaing, Cardoso’07]

questions

Documents

Tensor decompositions, sum-of-squares proofs, and spectral ...prevailing wisdom: sum-of-squares is “strictly theoretical” sum-of-squares algorithms have nothing to do with practical