School of Computer Science
LDA, Sparse Coding, Matrix Factorization, and All That
Yaoliang Yu, Carnegie Mellon University
© Eric Xing @ CMU, PGM 2015
Take-home
- Topic models are cool
- Matrix factorizations are … simple: X ≈ AB
- But they are intimately related, allowing knowledge transfer
The Vector Space Model
- Bag-of-words representation of a document (sketched below)
- Order is ignored; only counts matter

[Figure: a document's text ⇒ its vector of term counts]
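As a minimal sketch of this representation (the toy corpus below is made up for illustration), scikit-learn's CountVectorizer builds exactly such a count matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: within-document word order will be discarded.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse D x W matrix of counts

# The slides write X as a W x D term-document matrix, so transpose.
X = counts.T.toarray()
print(vectorizer.get_feature_names_out())
print(X)
```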
Latent Semantic Indexing (Deerwester et al., JASIS'90)

[Figure: truncated SVD of the term-document matrix,
X (W×D) ≈ U (W×K) · Λ (K×K) · Vᵀ (K×D), rows = terms, columns = documents]

- Startling applications (Berry et al., SIAM Rev'95):
  - Cross-language retrieval
  - TOEFL/GRE synonym questions
  - Matching paper submissions with reviewers (SIGIR'92!)
What is LSI, really?
- From an optimization point of view, it is nothing but matrix factorization / approximation:

    min_{U,V} ‖X − UV‖_F²   s.t.  U ∈ ℝ^{W×K}, V ∈ ℝ^{K×D}

- K ≪ W and K ≪ D, otherwise the problem is trivial
- Solvable by SVD, even though not jointly convex (see the sketch below)
- Things that can be improved:
  - Real-valued entries?
  - How to set K?
  - No probabilistic model?
  - Squared loss?
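A minimal numpy sketch of LSI as a rank-K truncated SVD (by Eckart-Young, this solves the Frobenius-norm problem above); the toy data is an assumption for illustration:

```python
import numpy as np

def lsi(X, K):
    """Best rank-K approximation of the W x D matrix X in Frobenius
    norm, via truncated SVD (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]
    term_coords = U_k * s_k        # W x K latent term coordinates
    doc_coords = Vt_k.T * s_k      # D x K latent document coordinates
    X_hat = (U_k * s_k) @ Vt_k     # rank-K reconstruction
    return term_coords, doc_coords, X_hat

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(50, 30)).astype(float)  # toy W x D counts
terms, docs, X_hat = lsi(X, K=5)
print(np.linalg.norm(X - X_hat, "fro"))            # reconstruction error
```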
Towards a Probabilistic Model
- What probability distribution should we use? The exponential family!

    X ∼ Pr(·|Θ), where Θ = UVᵀ has low rank

- Maximum likelihood with independence:

    L(Θ) = −log p(X|Θ) ∝ −⟨X, Θ⟩ + G(Θ)

- Low rank makes estimation possible
- Gaussian, i.e. G(Θ) = ½‖Θ‖_F², gives back LSI
Exponential Family PCA (Collins et al., NIPS'01)
- G: the log-partition function, strictly convex and analytic

    min_{U,V} −⟨X, UVᵀ⟩ + G(UVᵀ)

- Alternating minimization:
  - Fix U, solve the convex subproblem w.r.t. V
  - Fix V, solve the convex subproblem w.r.t. U
  - Converges to a stationary point
- Amounts to factorizing X under a fancier loss than least squares
- Choose Poisson for count data (unconstrained), as sketched below:

    min_{Θ=UVᵀ} Σ_{w,d} [−X_{w,d} Θ_{w,d} + exp(Θ_{w,d})]
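To make the Poisson case concrete, here is a hedged numpy sketch that alternates gradient steps on U and V (the slide's convex subproblems would ideally be solved to completion; the fixed step size and toy data are assumptions for illustration):

```python
import numpy as np

def exp_pca_poisson(X, K, lr=1e-3, iters=2000, seed=0):
    """Alternating gradient descent on the Poisson EXP-PCA loss
    sum(-X * Theta + exp(Theta)), Theta = U V^T (a sketch, not tuned)."""
    rng = np.random.default_rng(seed)
    W, D = X.shape
    U = 0.01 * rng.standard_normal((W, K))
    V = 0.01 * rng.standard_normal((D, K))
    for _ in range(iters):
        G = -X + np.exp(U @ V.T)   # dLoss/dTheta
        U -= lr * (G @ V)          # fix V, gradient step in U
        G = -X + np.exp(U @ V.T)
        V -= lr * (G.T @ U)        # fix U, gradient step in V
    return U, V

X = np.random.default_rng(1).poisson(2.0, size=(40, 25)).astype(float)
U, V = exp_pca_poisson(X, K=5)
print(np.abs(X - np.exp(U @ V.T)).mean())  # Poisson mean is exp(Theta)
```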
Nonnegative Matrix Factorization (Lee & Seung, Nature'99)
- Count data X is nonnegative; arguably it only makes sense to approximate it with nonnegative numbers:

    min_{U,V≥0} ‖X − UVᵀ‖_F²

- Can also use the KL divergence:

    min_{U,V≥0} Σ_{w,d} [−X_{w,d} log Θ_{w,d} + Θ_{w,d}],   Θ = UVᵀ

- Very similar to EXP PCA! Yet another fancier loss
- Multiplicative updates (must initialize densely!), sketched below:

    U_{w,k} ← U_{w,k} · (Σ_d V_{d,k} X_{w,d}/Θ_{w,d}) / (Σ_d V_{d,k})
    V_{d,k} ← V_{d,k} · (Σ_w U_{w,k} X_{w,d}/Θ_{w,d}) / (Σ_w U_{w,k})
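The KL-divergence multiplicative updates above fit in a few lines of numpy; a minimal sketch (toy data assumed), with the dense random initialization the slide warns is necessary, since zeros are absorbing under multiplicative updates:

```python
import numpy as np

def nmf_kl(X, K, iters=200, eps=1e-10, seed=0):
    """Lee & Seung multiplicative updates for the KL objective
    sum(-X * log(Theta) + Theta), Theta = U V^T, U, V >= 0."""
    rng = np.random.default_rng(seed)
    W, D = X.shape
    U = rng.uniform(0.1, 1.0, (W, K))  # dense init: zeros stay zero!
    V = rng.uniform(0.1, 1.0, (D, K))
    for _ in range(iters):
        Theta = U @ V.T + eps
        U *= ((X / Theta) @ V) / V.sum(axis=0)    # U_{w,k} update
        Theta = U @ V.T + eps
        V *= ((X / Theta).T @ U) / U.sum(axis=0)  # V_{d,k} update
    return U, V

X = np.random.default_rng(2).poisson(3.0, size=(30, 20)).astype(float)
U, V = nmf_kl(X, K=4)
print(np.abs(X - U @ V.T).mean())
```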
Synonymy vs. Polysemy
- LSI (and the aforementioned relatives) is good at synonymy: similar words are close in the latent semantic space
- But less so at polysemy: "It was a nice shot."
Probabilistic LSI (Hofmann, ML'01)

[Graphical model: document d → topic z_n → word w_n; plates over N words and M documents, K topics]

    p(X) = Π_{d=1}^D p(d) Π_{w=1}^{N_d} Σ_{k=1}^K p(X_{w,d} | Z_{w,d}=k) · p(Z_{w,d}=k | d)

where U_{w,k} = p(X_{w,d} | Z_{w,d}=k) and V_{d,k} = p(Z_{w,d}=k | d).
pLSI cont'
- A "generative" model:
  - Models each word in a document as a sample from a mixture model
  - Each word is generated from a single topic; different words in the document may be generated from different topics
  - A topic is characterized by a distribution over words
  - Each document is represented as a list of admixing proportions for the components (i.e., the topic vector θ)
pLSI cont''
- Maximize the marginal likelihood:

    min_{Θ=UVᵀ} Σ_{w,d} −X_{w,d} log Θ_{w,d}   s.t.  U, V ≥ 0,  Uᵀ1 = 1,  V1 = 1

- Similar to EXP PCA and NMF, but with more stringent normalization constraints
- Can use EM to optimize U, V (Hofmann '01), as sketched below, or simply alternate over U and V
- Solves polysemy, since the same word can be drawn from different topics
- But it may overfit! The parameter V grows with the number of documents D
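The EM scheme for this objective is short enough to sketch in numpy (toy data and dense arrays are assumptions; real implementations exploit sparsity):

```python
import numpy as np

def plsi_em(X, K, iters=100, eps=1e-12, seed=0):
    """EM for pLSI: U[w,k] = p(w|k) (columns sum to 1),
    V[d,k] = p(k|d) (rows sum to 1)."""
    rng = np.random.default_rng(seed)
    W, D = X.shape
    U = rng.random((W, K)); U /= U.sum(axis=0)
    V = rng.random((D, K)); V /= V.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities r[w,d,k] proportional to U[w,k] * V[d,k]
        R = U[:, None, :] * V[None, :, :]
        R /= R.sum(axis=2, keepdims=True) + eps
        # M-step: reweight by the counts and renormalize
        C = X[:, :, None] * R                       # expected topic counts
        U = C.sum(axis=1); U /= U.sum(axis=0) + eps
        V = C.sum(axis=0); V /= V.sum(axis=1, keepdims=True) + eps
    return U, V

X = np.random.default_rng(3).poisson(1.0, size=(25, 15)).astype(float)
U, V = plsi_em(X, K=3)
print(U.sum(axis=0), V.sum(axis=1))  # both should be all ones
```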
Latent Dirichlet Allocation (Blei et al., JMLR'03)

[Graphical model: α → θ → z_n → w_n ← β_k ← η; plates over N words, M documents, K topics]

Essentially a Bayesian pLSI:

    p(w) = ∫ p(θ) p(β) [ Π_{n=1}^N Σ_{z_n} p(z_n | θ) p(w_n | β_{z_n}) ] dθ dβ
LDA
- Generative model:
  - Models each word in a document as a sample from a mixture model
  - Each word is generated from a single topic; different words in the document may be generated from different topics
  - A topic is characterized by a distribution over words
  - Each document is represented as a list of admixing proportions for the components (i.e., the topic vector)
  - The topic vectors and the word rates each follow a Dirichlet prior: essentially a Bayesian pLSI
- How does LDA avoid overfitting?
PCA revisited
- PCA enjoys a set of salient properties:
  - Orthogonal basis
  - Implicit Gaussian assumption
  - Second-order de-correlation
  - "Noiseless"
- All of the above have been modified:
  - Overcomplete basis (sparse coding)
  - Exponential family (more later)
  - Higher-order de-correlation (ICA)
  - Probabilistic model
- LDA is a probabilistic PCA
Probabilistic PCA (Tipping & Bishop, JRSSB'99)
- Probabilistic model (U, σ are hyperparameters):

    X | V ∼ N(UVᵀ, σ²I_W),   Vᵀ ∼ N(0, I_K)

- Analytic marginalization:

    X ∼ N(0, UUᵀ + σ²I_W)

- MLE (S is the sample covariance):

    min_{U,σ²} log det(UUᵀ + σ²I) + ⟨S, (UUᵀ + σ²I)⁻¹⟩

- Apply von Neumann's trace inequality:

    min_{σ_k²,σ²} Σ_{k=1}^K [log(σ_k² + σ²) + s_k²/(σ_k² + σ²)] + Σ_{k=K+1}^W [log σ² + s_k²/σ²]

- Set the derivative to 0 (a code sketch follows):

    σ_k² = s_k² − σ²,   σ² = (1/(W−K)) Σ_{k=K+1}^W s_k²
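These closed-form expressions translate directly into code: eigendecompose the sample covariance, estimate σ² from the tail eigenvalues, and shrink the top K. A sketch, assuming the rows of X are i.i.d. observations and K < W:

```python
import numpy as np

def ppca_mle(X, K):
    """Closed-form PPCA MLE (Tipping & Bishop): X is N x W data,
    rows are observations. Returns U (W x K) and sigma^2."""
    S = np.cov(X, rowvar=False)                 # W x W sample covariance
    evals, evecs = np.linalg.eigh(S)            # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]  # sort descending: s_k^2
    sigma2 = evals[K:].mean()                   # average of tail eigenvalues
    # sigma_k^2 = s_k^2 - sigma^2 for the top K directions
    U = evecs[:, :K] * np.sqrt(np.maximum(evals[:K] - sigma2, 0.0))
    return U, sigma2

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
U, sigma2 = ppca_mle(X, K=3)
print(U.shape, sigma2)
```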
PPCA cont'
- Similar to conventional PCA:
  - Take the eigenvectors of the sample covariance
  - Shrink the first K eigenvalues by the average of the tail eigenvalues
  - Recovers PCA when σ = 0, i.e., noiseless
- Many extensions:
  - Mixture of PPCA (Tipping & Bishop, NC'99)
  - Hierarchical PPCA (Bishop & Tipping, PAMI'98)
  - Sparse PPCA (Guan & Dy, AISTATS'09)
  - Robust PPCA (Archambeau et al., ICML'06)
  - Infinite PPCA??
Multinomial PCA (Buntine, ECML'02)
- Apply the PPCA recipe to count data X:

    X_{:,d} | V_{d,:} ∼ Mult(U V_{d,:}ᵀ, N_d),   V_{d,:} ∼ Dir(α)

- Like before, U and α are hyperparameters, but each column of U is a probability distribution
- This is LDA! Essentially marginalizing Z out
- But V can no longer be marginalized out analytically; solve by VI, SVI, or Gibbs sampling (see the sketch below)
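For example, scikit-learn ships a variational-inference LDA whose hyperparameters line up with the slide's α and η; a usage sketch on made-up counts:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy D x W count matrix (documents in rows, as sklearn expects).
X = np.random.default_rng(5).poisson(1.0, size=(100, 50))

lda = LatentDirichletAllocation(
    n_components=10,           # K topics
    doc_topic_prior=0.1,       # alpha
    topic_word_prior=0.01,     # eta
    learning_method="online",  # stochastic variational inference
    random_state=0,
)
V = lda.fit_transform(X)       # D x K doc-topic proportions
U = lda.components_.T          # W x K (unnormalized) topic-word weights
print(V.shape, U.shape)
```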
What we have learned so far
- Matrix factorization view of topic models: LSI, EXP PCA, NMF, and pLSI are all matrix factorizations, under different losses / constraints
- Probabilistic view of matrix factorizations: PPCA; LDA = multinomial PCA
- This connection is exploited in recent theoretical results (papers with "… provable …" or "… spectral …" in the title)
- Ideas from matrix factorization translate easily to topic models, and vice versa: SVI, parallel implementations, etc.
- Issues not yet considered:
  - Sparsity: each document contains few topics; each word appears in few topics
  - How to set K? (Yes, you've learned Dirichlet processes)
Strive for Sparsity
- Each document contains few topics; each word appears in few topics, so each column of V or U needs to be sparse
- Exact sparsity cannot be achieved with Bayesian methods:
  - Bayesian estimates are conditional means (under least-squares risk)
  - Averaging never yields exact sparsity!
  - Sparsity can still be achieved by ad hoc truncation
- But, hey, the LASSO achieves exact sparsity (LASSO is MAP, not Bayesian)
- We have seen that topic model = matrix factorization, and there is sparse matrix factorization: steal ideas from there!
Sparse Coding

    X ≈ UVᵀ

- U is called the dictionary, e.g., topic distributions
- V is called the coefficients (coding / loading / mixing proportions)
- For each column x of X, encode x with v under the dictionary U:

    x ≈ Uv

- For v to be sparse, U needs to be overcomplete:
  - Wavelets, random matrices
  - Or learned jointly with V
Sparse coding cont' (Olshausen & Field, Nature'96)

    min_{U,V} ℓ(X, UVᵀ) + λ Σ_d g(V_{d,:})
          (reconstruction)   (sparseness)

- ℓ: likelihood; g: prior on V; this is the MAP estimate of U, V
- Need to constrain U due to indeterminacy
- Choose g to be, e.g., the 1-norm (an ISTA sketch for the coding step follows)
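With squared loss and g the 1-norm, the coding subproblem (U fixed) is a lasso; here is a minimal ISTA sketch for it (the dictionary update is omitted; the data and λ are made up for illustration):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_ista(U, x, lam, iters=500):
    """ISTA for min_v 0.5*||x - U v||^2 + lam*||v||_1 (the coding step)."""
    L = np.linalg.eigvalsh(U.T @ U).max()  # Lipschitz constant of the gradient
    v = np.zeros(U.shape[1])
    for _ in range(iters):
        grad = U.T @ (U @ v - x)
        v = soft_threshold(v - grad / L, lam / L)
    return v

rng = np.random.default_rng(6)
U = rng.standard_normal((20, 50))            # overcomplete dictionary
v_true = np.zeros(50); v_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
x = U @ v_true + 0.01 * rng.standard_normal(20)
v = sparse_code_ista(U, x, lam=0.1)
print(np.flatnonzero(np.round(v, 2)))        # recovered support is sparse
```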
Sparse coding cont''
- Exponential family ℓ (Lee et al., AAAI'10):

    min_{U,V} ℓ(X, UVᵀ) + λ Σ_d Σ_k g(V_{d,k})

- Solve by alternating over U and V:
  - Converges to a stationary point
  - In general NP-hard; recent results give global convergence guarantees
- Group sparse coding (Bengio et al., NIPS'09): group along the document dimension, e.g., with an ℓ1/ℓ2 penalty
Letting K → ∞ (Zhang et al., AAAI'11, NIPS'12)
- Take g = ‖·‖
- As K → ∞, the regularizer becomes a norm (convex!): the gauge |||Θ||| induced by the set

    {uvᵀ : |u| ≤ 1, ‖v‖ ≤ 1},   with dual norm |||Θ|||° = max{uᵀΘv : |u| ≤ 1, ‖v‖ ≤ 1}

- For |u| = ‖u‖₂ and ‖v‖ = ‖v‖₂ the dual is the spectral norm, |||Θ|||° = ‖Θ‖_sp, thus we need to solve

    min_Θ ℓ(X, Θ) + λ‖Θ‖_tr,   where ‖Θ‖_tr = |||Θ||| = min_{Θ=UVᵀ, |U_{:,k}|≤1} Σ_k ‖V_{:,k}‖

- Induces low rank; convex, so no local minima! (a proximal-gradient sketch follows)
- Can play with other choices of norms (e.g., White et al., NIPS'12)
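Proximal gradient solves this convex program; the prox of the trace norm is singular-value soft-thresholding. A sketch on a masked squared loss (matrix-completion flavor; the mask and data are assumptions), where the gradient is 1-Lipschitz so step size 1 works:

```python
import numpy as np

def svt(Theta, t):
    """Prox of t * ||.||_tr: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def trace_norm_mf(X, mask, lam, iters=300):
    """Proximal gradient for
    min_Theta 0.5*||mask*(X - Theta)||_F^2 + lam*||Theta||_tr."""
    Theta = np.zeros_like(X)
    for _ in range(iters):
        grad = mask * (Theta - X)       # gradient of the masked squared loss
        Theta = svt(Theta - grad, lam)  # prox step, step size 1
    return Theta

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 5)) @ rng.standard_normal((5, 20))  # low rank
mask = rng.random(X.shape) < 0.6        # observe 60% of the entries
Theta = trace_norm_mf(X, mask, lam=1.0)
print(np.linalg.matrix_rank(Theta, tol=1e-6))  # low rank is induced
```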
Sparse Topical Coding (Zhu & Xing, UAI'11)

[Figure: the STC objective, annotated with its parts: reconstruction loss, non-negative word codes, topical bases, truncated aggregation, sparse document codes]

- Goal: design a non-probabilistic topic model that is amenable to:
  - direct control of the posterior sparsity of the inferred representations
  - avoiding the normalization constant when adding supervision or rich features
  - seamless integration with a convex loss function (e.g., the SVM hinge loss)
- We extend sparse coding to hierarchical sparse topical coding:
  - word code θ
  - document code s
Comparisons

[Figure: LDA vs. STC]
Sparse word codes
- Sparsity ratio: percentage of zeros
- Baselines:
  - NMF: non-negative matrix factorization
  - MedLDA (Zhu et al., 2009)
  - regLDA: LDA with an entropic regularizer
  - gaussSTC: STC with the ℓ2-norm rather than the ℓ1-norm
Computation in STC
- Hierarchical sparse coding, for each document:
  - Word code
  - Document code (truncated averaging)
- Dictionary learning:
  - Projected gradient descent
  - Any faster alternative method can be used
Performance
- Roughly a 10× speed-up in both training and testing
The shocking results on LDA

[Figure panels: Retrieval, Classification, Annotation]

- LDA actually does very poorly on several "objectively" evaluatable predictive tasks
Why?
- LDA is neither designed nor trained for such tasks (e.g., classification); there is no guarantee that the estimated topic vector θ is good at discriminating documents
Unsupervised Latent Subspace Discovery
- Finding latent subspace representations (an old topic): mapping a high-dimensional representation into a latent low-dimensional representation, where each dimension can have some interpretable meaning, e.g., a semantic topic
- Examples:
  - Topic models (aka LDA) [Blei et al., 2003]
  - Total scene latent space models [Li et al., 2009]
  - Multi-view latent Markov models [Xing et al., 2005]
  - PCA, CCA, …

[Figure: an image ⇒ interpretable latent dimensions: Athlete, Horse, Grass, Trees, Sky, Saddle]
Predictive Subspace Learning with Supervision
- Unsupervised latent subspace representations are generic but can be sub-optimal for prediction
- Many datasets come with supervised side information:
  - It can be noisy, but it is not random noise (Ames & Naaman, 2007)
  - Labels and rating scores are usually assigned based on some intrinsic property of the data
  - It is helpful for suppressing noise and capturing the most useful aspects of the data
- Examples: TripAdvisor hotel reviews (http://www.tripadvisor.com), LabelMe (http://labelme.csail.mit.edu/), Flickr (http://www.flickr.com/), and many others
- Goal: discover latent subspace representations that are both predictive and interpretable, by exploiting weak supervision information
Support Vector Machines

[Figure: two classes separated by a max-margin hyperplane]

    min_{w,b,ξ} ½ wᵀw + C Σ_{i=1}^m ξ_i
    s.t.  y_i(wᵀx_i + b) ≥ 1 − ξ_i, ∀i;   ξ_i ≥ 0, ∀i
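A usage sketch: scikit-learn's LinearSVC with loss="hinge" solves (up to how the intercept is regularized) this same primal, with C weighting the slacks; the toy data is made up:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(+2, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = LinearSVC(C=1.0, loss="hinge")  # C multiplies the slack penalty
clf.fit(X, y)
print(clf.coef_, clf.intercept_, clf.score(X, y))
```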
Supervised STC
- Joint loss minimization
- Coordinate descent applies, with closed-form update rules
- No sum-exp function; seamless integration with the non-probabilistic large-margin principle
Classification accuracy
- 20 Newsgroups data:
  - regLDA+: LDA with an entropic regularizer, achieving its best classification accuracy
  - regLDA−: LDA with an entropic regularizer, achieving a sparsity ratio comparable to STC
Time efficiency
- Training & testing time:
  - No calls to the digamma function
  - Converges faster, with one additional degree of freedom
Summary
- Topic models are intimately related to matrix factorization: LDA = multinomial PCA
- Exploring the connection can be beneficial for:
  - Understanding
  - Sparsity
  - Efficient inference
  - Supervision
- More elaborate matrix factorizations can be devised
- Until next time!