TENSORDECOMPOSITIONSANDTHEIR …people.csail.mit.edu/moitra/docs/Tensors.pdfSPEARMAN’SHYPOTHESIS...

TENSOR DECOMPOSITIONS AND THEIR APPLICATIONS

ANKUR MOITRA MASSACHUSETTS INSTITUTE OF TECHNOLOGY

SPEARMAN’S HYPOTHESIS

Charles Spearman (1904): There are two types of intelligence, eductive and reproductive

eductive (adj): the ability to make sense out of complexity

reproductive (adj): the ability to store and reproduce information

To test this theory, he invented Factor Analysis:

$M \approx A B^T$

where M is indexed by students (1000) and tests (10), and the inner dimension is 2

Given: $M = \sum_i a_i \otimes b_i = A B^T$ (the “correct” factors)

When can we recover the factors $a_i$ and $b_i$ uniquely?

$M = A B^T = (A R)(R^{-1} B^T)$ (an alternative factorization)

Claim: The factors $\{a_i\}$ and $\{b_i\}$ are not determined uniquely unless we impose additional conditions on them, e.g. if $\{a_i\}$ and $\{b_i\}$ are orthogonal, or rank(M)=1

This is called the rotation problem, and is a major issue in factor analysis and motivates the study of tensor methods…
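The rotation ambiguity can be checked numerically. The following is a minimal sketch, assuming NumPy; the sizes mirror the students/tests example, and the particular invertible matrix `R` is an arbitrary illustrative choice, not something from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Correct" factors: 1000 students, 10 tests, inner dimension 2.
A = rng.standard_normal((1000, 2))
B = rng.standard_normal((10, 2))
M = A @ B.T

# Any invertible 2x2 matrix R gives an alternative factorization.
# (This R is an arbitrary choice for illustration; its determinant is 1.)
R = np.array([[2.0, 1.0],
              [1.0, 1.0]])
A_alt = A @ R                      # A R
B_alt = B @ np.linalg.inv(R).T     # so that B_alt.T == R^{-1} B^T

same_product = np.allclose(M, A_alt @ B_alt.T)
same_factors = np.allclose(A, A_alt)
print(same_product, same_factors)
```

The product is unchanged while the factors differ, which is why M alone cannot pin down A and B.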

OUTLINE

The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions

Part I: Algorithms

• The Rotation Problem

• Jennrich’s Algorithm

Part II: Applications

• Phylogenetic Reconstruction

• Pure Topic Models

Part III: Smoothed Analysis

• Overcomplete Problems

• Kruskal Rank and the Khatri-Rao Product

MATRIX DECOMPOSITIONS

$M = a_1 \otimes b_1 + a_2 \otimes b_2 + \cdots + a_R \otimes b_R$

TENSOR DECOMPOSITIONS

$T = a_1 \otimes b_1 \otimes c_1 + \cdots + a_R \otimes b_R \otimes c_R$

The $(i, j, k)$ entry of $x \otimes y \otimes z$ is $x(i) \times y(j) \times z(k)$
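As a small illustration of the entrywise definition, here is a sketch assuming NumPy; the vectors are made-up values chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Three small vectors (illustrative values only).
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0, 5.0])
z = np.array([6.0, 7.0])

# Rank-one tensor: T[i, j, k] = x[i] * y[j] * z[k]
T = np.einsum('i,j,k->ijk', x, y, z)

i, j, k = 1, 2, 0
entry = T[i, j, k]
print(T.shape, entry, x[i] * y[j] * z[k])
```

A rank-R tensor is then a sum of R such terms, for example `np.einsum('ir,jr,kr->ijk', A, B, C)` when the factors are stored as the columns of A, B and C.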

When are tensor decompositions unique?

Theorem [Jennrich 1970]: Suppose $\{a_i\}$ and $\{b_i\}$ are linearly independent and no pair of vectors in $\{c_i\}$ is a scalar multiple of each other. Then

$T = a_1 \otimes b_1 \otimes c_1 + \cdots + a_R \otimes b_R \otimes c_R$

is unique up to permuting the rank one terms and rescaling the factors.

Equivalently, the rank one factors are unique

There is a simple algorithm to compute the factors too!

JENNRICH’S ALGORITHM

Compute $T(\cdot, \cdot, x)$, i.e. add up the matrix slices, weighted by the entries of x (x is chosen uniformly at random from $S^{n-1}$)

If $T = a \otimes b \otimes c$ then $T(\cdot, \cdot, x) = \langle c, x \rangle \, a \otimes b$

Compute $T(\cdot, \cdot, x) = \sum_i \langle c_i, x \rangle \, a_i \otimes b_i = A D_x B^T$, where $D_x = \mathrm{Diag}(\langle c_i, x \rangle)$

Compute $T(\cdot, \cdot, y) = A D_y B^T$

Diagonalize $T(\cdot, \cdot, x) \, T(\cdot, \cdot, y)^{-1} = A D_x B^T (B^T)^{-1} D_y^{-1} A^{-1} = A D_x D_y^{-1} A^{-1}$

Claim: whp (over x, y) the eigenvalues are distinct, so the eigendecomposition is unique and recovers the $a_i$’s

Diagonalize $T(\cdot, \cdot, y) \, T(\cdot, \cdot, x)^{-1}$

Match up the factors (their eigenvalues are reciprocals) and find $\{c_i\}$ by solving a linear system.
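The first diagonalization step can be sketched in a few lines. This is a minimal illustration, assuming NumPy and assuming A and B are square and invertible (so the plain inverse in the formula applies); the function name `jennrich_first_factor` is hypothetical, and only the recovery of the $a_i$ is shown, not the matching step or the linear system for the $c_i$.

```python
import numpy as np

def jennrich_first_factor(T, rng):
    """Recover the a_i (up to scaling and permutation) from T = sum_i a_i (x) b_i (x) c_i.

    Sketch only: assumes the factor matrices A and B are square and invertible.
    """
    n = T.shape[2]
    # A normalized Gaussian vector is uniform on the unit sphere.
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    y = rng.standard_normal(n)
    y /= np.linalg.norm(y)
    Tx = np.einsum('ijk,k->ij', T, x)   # equals A Dx B^T
    Ty = np.einsum('ijk,k->ij', T, y)   # equals A Dy B^T
    # Tx Ty^{-1} = A Dx Dy^{-1} A^{-1}, so its eigenvectors are the columns of A.
    _, eigvecs = np.linalg.eig(Tx @ np.linalg.inv(Ty))
    A_hat = np.real(eigvecs)
    return A_hat / np.linalg.norm(A_hat, axis=0)

rng = np.random.default_rng(0)
n = 4
A, B, C = (rng.standard_normal((n, n)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

A_hat = jennrich_first_factor(T, rng)
A_unit = A / np.linalg.norm(A, axis=0)
# |cosine| between each true column and each recovered column.
match = np.abs(A_unit.T @ A_hat)
print(np.round(match.max(axis=1), 6))
```

Each true column should have a recovered column with |cosine| close to 1. In a noisy setting one would typically replace the inverse with a pseudoinverse and keep only the top R eigenvectors; that variant is not shown here.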

Given: $M = \sum_i a_i \otimes b_i$

When can we recover the factors $a_i$ and $b_i$ uniquely?

This is only possible if $\{a_i\}$ and $\{b_i\}$ are orthonormal, or rank(M)=1

Given: $T = \sum_i a_i \otimes b_i \otimes c_i$

When can we recover the factors $a_i$, $b_i$ and $c_i$ uniquely?

Jennrich: If $\{a_i\}$ and $\{b_i\}$ are full rank and no pair in $\{c_i\}$ are scalar multiples of each other


PHYLOGENETIC RECONSTRUCTION

[Figure: the “Tree of Life”, with a legend marking nodes as extinct or extant]

root: $\pi : \Sigma \to \mathbb{R}^+$ “initial distribution”, where $\Sigma$ = alphabet

$R_{z,b}$ “conditional distribution”

In each sample, we observe a symbol (from $\Sigma$) at each extant node, where we sample from $\pi$ for the root, and propagate it using $R_{x,y}$, etc.

HIDDEN MARKOV MODELS

[Figure: a hidden Markov model, with a legend marking nodes as hidden or observed and edges labeled “transition matrices” and “obs. matrices”]

$\pi : \Sigma_s \to \mathbb{R}^+$ “initial distribution”

In each sample, we observe a symbol (from $\Sigma_o$) at each observed node, where we sample from $\pi$ (over $\Sigma_s$) for the start, and propagate it using $R_{x,y}$, etc.

Question: Can we reconstruct just the topology from random samples?

Usually, we assume $T_{x,y}$, etc. are full rank so that we can re-root the tree arbitrarily

[Steel, 1994]: The following is a distance function on the edges

$d_{x,y} = -\ln |\det(P_{x,y})| + \frac{1}{2} \ln \prod_{\sigma \in \Sigma} \pi_{x,\sigma} - \frac{1}{2} \ln \prod_{\sigma \in \Sigma} \pi_{y,\sigma}$

where $P_{x,y}$ is the joint distribution, and the distance between leaves is the sum of distances on the path in the tree

(It’s not even obvious it’s nonnegative!)

[Erdos, Steel, Szekely, Warnow, 1997]: Used Steel’s distance function and quartet tests to reconstruct the topology, from polynomially many samples

[Figure: alternative quartet topologies on the leaves, joined by “OR”]

For many problems (e.g. HMMs) finding the transition matrices is the main issue…

[Chang, 1996]: The model is identifiable (if R’s are full rank)

Joint distribution over (a, b, c):

$\sum_{\sigma} \Pr[z = \sigma] \, \Pr[a \mid z = \sigma] \otimes \Pr[b \mid z = \sigma] \otimes \Pr[c \mid z = \sigma]$

where the vectors $\Pr[b \mid z = \sigma]$ are the columns of $R_{z,b}$
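To make the tensor structure of this joint distribution concrete, here is a small sketch assuming NumPy. The alphabet size, the randomly drawn distributions, and the helper `random_column_stochastic` are all illustrative assumptions; column $\sigma$ of each matrix plays the role of $\Pr[\cdot \mid z = \sigma]$.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3  # alphabet size (illustrative)

def random_column_stochastic(rng, rows, cols):
    """Hypothetical helper: random matrix whose columns are probability vectors."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0)

pi = rng.random(k)
pi /= pi.sum()                      # Pr[z = sigma]
R_za, R_zb, R_zc = (random_column_stochastic(rng, k, k) for _ in range(3))

# P[a, b, c] = sum_sigma Pr[z=sigma] Pr[a|z=sigma] Pr[b|z=sigma] Pr[c|z=sigma]
P = np.einsum('s,as,bs,cs->abc', pi, R_za, R_zb, R_zc)

marginal_a = P.sum(axis=(1, 2))     # should equal R_za @ pi
print(P.sum(), np.allclose(marginal_a, R_za @ pi))
```

The tensor P is a sum of k rank-one terms, one per value of the hidden node z, which is exactly the form a tensor decomposition can take apart when the conditional matrices are full rank.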

[Mossel, Roch, 2006]: There is an algorithm to PAC learn a phylogenetic tree or an HMM (if its transition/output matrices are full rank) from polynomially many samples

Question: Is the full-rank assumption necessary?

[Mossel, Roch, 2006]: It is as hard as noisy-parity to learn the parameters of a general HMM

Noisy-parity is an infamous problem in learning, where O(n) samples suffice but the best algorithms run in time $2^{n/\log(n)}$

Due to [Blum, Kalai, Wasserman, 2003]

(It’s now used as a hard problem to build cryptosystems!)

THE POWER OF CONDITIONAL INDEPENDENCE

[Phylogenetic Trees/HMMs]:

(joint distribution on leaves a, b, c)

PURE TOPIC MODELS

[Figure: the matrix A, indexed by words (m) and topics (r)]

• Each topic is a distribution on words

• Each document is about only one topic (stochastically generated)

• For each document, we sample L words from its distribution

[Figure: $M = A W$, and with sampled words $M \approx A W$]

[Anandkumar, Hsu, Kakade, 2012]: Algorithm for learning pure topic models from polynomially many samples (A is full rank)

Question: Where can we find three conditionally independent random variables?

The first, second and third words are independent conditioned on the topic t (and are random samples from $A_t$)

[Pure Topic Models/LDA]:

$\sum_j \Pr[\text{topic} = j] \, A_j \otimes A_j \otimes A_j$

(joint distribution on first three words)
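The claim that the first three words give a low-rank tensor can be checked by simulation. The sketch below assumes NumPy; the vocabulary size, topic weights, and number of documents are made-up values, and only the first three words of each document are sampled.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, N = 4, 2, 200_000            # words, topics, documents (illustrative sizes)

A = rng.random((m, r))
A /= A.sum(axis=0)                 # column j is topic j's distribution on words
w = np.array([0.3, 0.7])           # Pr[topic = j] (illustrative)

# Exact joint distribution of the first three words.
exact = np.einsum('j,aj,bj,cj->abc', w, A, A, A)

# Simulate: pick one topic per document, then three words from that topic.
topics = rng.choice(r, size=N, p=w)
words = np.empty((N, 3), dtype=int)
for j in range(r):
    idx = np.flatnonzero(topics == j)
    words[idx] = rng.choice(m, size=(idx.size, 3), p=A[:, j])

# Empirical frequency of each (first, second, third) word triple.
empirical = np.zeros((m, m, m))
np.add.at(empirical, (words[:, 0], words[:, 1], words[:, 2]), 1.0)
empirical /= N

max_error = np.abs(empirical - exact).max()
print(max_error)
```

With enough documents the empirical triple counts approach the rank-r tensor, which is the object a tensor decomposition would then be applied to.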

[Community Detection]:

$\sum_j \Pr[C_x = j] \, (C_A \Pi)_j \otimes (C_B \Pi)_j \otimes (C_C \Pi)_j$

(counting stars)


So far, Jennrich’s algorithm has been the key but it has a crucial limitation. Let

$T = \sum_{i=1}^{R} a_i \otimes a_i \otimes a_i$

where $\{a_i\}$ are n-dimensional vectors

Question: What if R is much larger than n?

This is called the overcomplete case, e.g. the number of factors is much larger than the number of observations…

In such cases, why stop at third-order tensors?

Consider a sixth-order tensor T:

$T = \sum_{i=1}^{R} a_i \otimes a_i \otimes a_i \otimes a_i \otimes a_i \otimes a_i$

Question: Can we find its factors, even if R is much larger than n?

Let’s flatten it by rearranging its entries into a third-order tensor:

$\mathrm{flat}(T) = \sum_{i=1}^{R} b_i \otimes b_i \otimes b_i$

where $b_i = a_i \otimes_{KR} a_i$ is the Khatri-Rao product: the $n^2$-dimensional vector whose (j,k)th entry is the product of the jth and kth entries of $a_i$
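The flattening is just a reshape. Here is a minimal check, assuming NumPy and small illustrative sizes (n = 3, R = 5, so the problem is already overcomplete); the index convention, where entry (j,k) lands at position j·n + k, is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, R = 3, 5                         # R > n: overcomplete (illustrative sizes)
A = rng.standard_normal((n, R))     # columns are the a_i

# Sixth-order tensor: sum_i of the six-fold outer product of a_i with itself.
T6 = np.einsum('ar,br,cr,dr,er,fr->abcdef', A, A, A, A, A, A)

# Khatri-Rao factors: column i of B is a_i (x)_KR a_i, an n^2-dimensional vector
# whose entry at position j*n + k equals a_i[j] * a_i[k].
B = np.einsum('jr,kr->jkr', A, A).reshape(n * n, R)

# Flatten by grouping the six modes into three pairs.
flat = T6.reshape(n * n, n * n, n * n)
T3 = np.einsum('ar,br,cr->abc', B, B, B)

flatten_ok = np.allclose(flat, T3)
print(flat.shape, flatten_ok)
```

So flat(T) is an ordinary third-order tensor with factors $b_i$ in dimension $n^2$, and the remaining question is whether those factors are linearly independent.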

Question: Can we apply Jennrich’s Algorithm to flat(T)?

When are the new factors $b_i = a_i \otimes_{KR} a_i$ linearly independent?

Example #1:

Let $\{a_i\}$ be all $\binom{n}{2}$ vectors with exactly two ones

Then $\{b_i\}$ are vectorizations of:

[Figure: a matrix with entries marked “Non-zero only in $b_i$”]

and are linearly independent
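Example #1 is easy to verify for a small case. The sketch below assumes NumPy and picks n = 5 for illustration.

```python
import numpy as np
from itertools import combinations

n = 5                                         # illustrative dimension
pairs = list(combinations(range(n), 2))       # all n-choose-2 supports

# Column i of A has ones exactly at positions j and k of the i-th pair.
A = np.zeros((n, len(pairs)))
for col, (j, k) in enumerate(pairs):
    A[j, col] = 1.0
    A[k, col] = 1.0

# Khatri-Rao factors b_i = a_i (x)_KR a_i, stored as columns.
B = np.einsum('jr,kr->jkr', A, A).reshape(n * n, len(pairs))

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
print(len(pairs), rank_A, rank_B)
```

The 10 vectors $a_i$ live in 5 dimensions and so cannot be independent, but the $b_i$ are: the (j,k) entry is non-zero only in the $b_i$ built from the pair (j,k).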

Example #2:

Let $\{a_1, \ldots, a_n\}$ and $\{a_{n+1}, \ldots, a_{2n}\}$ be two random orthonormal bases

Then there is a linear dependence with 2n terms:

$\sum_{i=1}^{n} a_i \otimes_{KR} a_i - \sum_{i=n+1}^{2n} a_i \otimes_{KR} a_i = 0$

(as matrices, both sum to the identity)
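The dependence in Example #2 can also be checked directly. This sketch assumes NumPy, uses n = 4, and draws each orthonormal basis from the QR factorization of a Gaussian matrix, which is one convenient (assumed) way to get a random basis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                              # illustrative dimension

# Two random orthonormal bases, as the columns of Q1 and Q2.
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = np.hstack([Q1, Q2])                            # a_1..a_n, a_{n+1}..a_{2n}

# Khatri-Rao factors as columns; each is vec(a_i a_i^T).
B = np.einsum('jr,kr->jkr', A, A).reshape(n * n, 2 * n)

# Sum over the first basis minus sum over the second: both sums are vec(I).
dependence = B[:, :n].sum(axis=1) - B[:, n:].sum(axis=1)
rank_B = np.linalg.matrix_rank(B, tol=1e-8)
print(np.abs(dependence).max(), rank_B)
```

The 2n vectors $b_i$ satisfy one linear relation, so their span has dimension 2n-1 here rather than 2n.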

THE KRUSKAL RANK

Definition: The Kruskal rank (k-rank) of $\{b_i\}$ is the largest k s.t. every set of k vectors is linearly independent
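For small examples the definition can be evaluated by brute force. The function below is a sketch assuming NumPy; the name `kruskal_rank` and the numerical tolerance are illustrative choices, and the running time is exponential in the number of vectors.

```python
import numpy as np
from itertools import combinations

def kruskal_rank(V, tol=1e-10):
    """Largest k such that every set of k columns of V is linearly independent.

    Brute-force sketch: checks every subset, so only usable for a few columns.
    """
    num_cols = V.shape[1]
    k = 0
    for size in range(1, num_cols + 1):
        subsets = combinations(range(num_cols), size)
        if all(np.linalg.matrix_rank(V[:, list(s)], tol=tol) == size for s in subsets):
            k = size
        else:
            break
    return k

# Three vectors in the plane, pairwise non-parallel: k-rank 2.
V = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
# First two columns are parallel: k-rank 1, although the matrix rank is 2.
W = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
print(kruskal_rank(V), kruskal_rank(W))
```

The second example shows how the Kruskal rank differs from the usual rank: a single dependent pair is enough to pull it down.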

$b_i = a_i \otimes_{KR} a_i$

Example #1: k-rank($\{a_i\}$) = n, k-rank($\{b_i\}$) = R = $\binom{n}{2}$

Example #2: k-rank($\{a_i\}$) = n, k-rank($\{b_i\}$) = 2n-1

The Kruskal rank always adds under the Khatri-Rao product, but sometimes it multiplies and that can allow us to handle R >> n

[Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product

Proof: The set of $\{a_i\}$ where $b_i = a_i \otimes_{KR} a_i$ and $\det(\{b_i\}) = 0$ is measure zero

But this yields a very weak bound on the condition number of $\{b_i\}$…

… which is what we need to apply it to learning/statistics, where we have an estimate to T

Definition: The robust Kruskal rank ($k\text{-rank}_\gamma$) of $\{b_i\}$ is the largest k s.t. every set of k vectors has condition number at most $O(\gamma)$

[Bhaskara, Charikar, Vijayaraghavan, 2013]: The robust Kruskal rank always adds under the Khatri-Rao product

[Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors $\{a_i\}$ are ε-perturbed. Then

$k\text{-rank}_\gamma(\{b_i\}) = R$

for $R = n^2/2$ and $\gamma = \mathrm{poly}(1/n, \varepsilon)$ with exponentially small failure probability (δ)

Hence we can apply Jennrich’s Algorithm to flat(T) with R >> n
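A rough numerical illustration of the effect of perturbation follows. It is a sketch assuming NumPy; the starting point (all vectors equal), the perturbation size `eps`, and plain Gaussian noise are illustrative assumptions and do not reproduce the exact perturbation model or the quantitative bound of the theorem.

```python
import numpy as np

def khatri_rao_self(A):
    """Columns b_i = a_i (x)_KR a_i for the columns a_i of A."""
    n, R = A.shape
    return np.einsum('jr,kr->jkr', A, A).reshape(n * n, R)

rng = np.random.default_rng(0)
n = 6
R = n * n // 2                       # R = n^2 / 2 = 18 factors in dimension 6
eps = 0.5                            # illustrative perturbation size

# A bad starting point: every a_i is the same vector, so all b_i coincide.
A0 = np.ones((n, R))
rank_before = np.linalg.matrix_rank(khatri_rao_self(A0), tol=1e-8)

# Perturb each a_i with independent Gaussian noise.
A = A0 + eps * rng.standard_normal((n, R))
B = khatri_rao_self(A)
rank_after = np.linalg.matrix_rank(B, tol=1e-8)
smallest_singular_value = np.linalg.svd(B, compute_uv=False)[-1]
print(rank_before, rank_after, smallest_singular_value)
```

After perturbation all R = 18 factors $b_i$ are linearly independent even though R is three times n. The theorem is about the stronger, quantitative statement that the smallest singular value is not too small, which is what matters once T is only estimated from samples.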

Note: These bounds are easy to prove with inverse polynomial failure probability, but then γ depends on δ

This can be extended to any constant order Khatri-Rao product

Sample application: Algorithm for learning mixtures of $n^{O(1)}$ spherical Gaussians in $\mathbb{R}^n$, if their means are ε-perturbed

This was also obtained independently by [Anderson, Belkin, Goyal, Rademacher, Voss, 2014]

Summary:

• Tensor decompositions are unique under much more general conditions, compared to matrix decompositions

• Jennrich’s Algorithm (rediscovered many times!), and its many applications in learning/statistics

• Introduced new models to study overcomplete problems (R >> n)

• Are there algorithms for order-k tensors that work with $R = n^{0.51k}$?

Any Questions?
