Machine Learning
Latent Aspect Models
Eric Xing
Lecture 14, August 15, 2010
Reading: see class homepage
Apoptosis + Medicine

(Figure, shown over three slides: a collection of documents about apoptosis and medicine, first produced by a probabilistic generative model, then analyzed by statistical inference.)

Connecting Probability Models to Data

(Diagram: Probabilistic Model <-> Real World Data; P(Data | Parameters) is the generative model, and P(Parameters | Data) is inference.)
Motivation for modeling latent topical aspects

Dimensionality reduction:
- A VSM lives in a very high-dimensional feature space (usually the size of the vocabulary, |V|).
- Document representations are sparse (|V| >> the number of distinct words appearing in any given document), which is often too sparse and noisy a signal for many IR tasks.

Semantic analysis and comprehension:
- A need to define conceptual closeness,
- to capture relations between features,
- to distinguish and infer features from heterogeneous sources ...
Latent Semantic Indexing (Deerwester et al., 1990)
- A classic attempt at solving this problem in information retrieval
- Uses SVD to reduce document representations
- Models synonymy and polysemy
- Computing the SVD is slow
- Non-probabilistic model
Term-document matrix decomposition (figure):

$X_{m \times n} = T_{m \times k} \, S_{k \times k} \, D^{T}_{k \times n}$

where $X$ is the term-document matrix, $T$ maps terms to the $k$ latent concepts, $S$ is the diagonal matrix of singular values, and $D^{T}$ maps concepts to documents; keeping only the top $k$ singular values gives a low-rank approximation of $X$, as sketched below.
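As a minimal sketch of this decomposition, the following numpy snippet builds a toy term-document matrix (the counts and vocabulary are invented for illustration) and truncates its SVD to $k = 2$ concepts:

```python
import numpy as np

# Toy term-document matrix X (m terms x n documents); the counts are invented.
X = np.array([
    [2, 0, 1, 0],   # "bank"
    [1, 0, 2, 0],   # "money"
    [0, 2, 0, 1],   # "river"
    [0, 1, 0, 2],   # "stream"
], dtype=float)

k = 2  # number of latent concepts to keep

# Full SVD, then truncate to rank k: X ~= T S D^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :k]          # m x k: terms -> concepts
S = np.diag(s[:k])    # k x k: singular values
Dt = Vt[:k, :]        # k x n: concepts -> documents

X_k = T @ S @ Dt      # rank-k approximation of X
print(np.round(X_k, 2))

# Documents can now be compared in the k-dimensional concept space.
doc_vectors = (S @ Dt).T   # n x k
```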
Latent Semantic Structure

(Figure: a latent structure $\ell$ generates the observed words $\mathbf{w}$.)

Distribution over words:
$P(\mathbf{w}) = \sum_{\ell} P(\mathbf{w}, \ell)$

Inferring latent structure:
$P(\ell \mid \mathbf{w}) = \dfrac{P(\mathbf{w} \mid \ell)\, P(\ell)}{P(\mathbf{w})}$

Prediction:
$P(w_{n+1} \mid w_1, \ldots, w_n) = \ldots$
Topic Models

How to Model Semantics? Q: What is it about? A: Mainly MT, with syntax, some learning.

A Hierarchical Phrase-Based Model for Statistical Machine Translation

We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.

(Figure: each topic is a unigram distribution over the vocabulary; the document mixes them with proportions 0.6 / 0.3 / 0.1:
- MT (0.6): Source, Target, SMT, Alignment, Score, BLEU
- Syntax (0.3): Parse, Tree, Noun, Phrase, Grammar, CFG
- Learning (0.1): likelihood, EM, Hidden, Parameters, Estimation, argMax)
Words in Contexts

the opposition Labor Party fared even worse, with a predicted 35 seats, seven less than last election.
"Words" in Contexts (con'd)
Sivic et al. ICCV 2005
GENERATIVE PROCESS

(Figure: two mixture components, each a distribution over words:
- TOPIC 1: money, loan, bank
- TOPIC 2: river, stream, bank

Each document mixes the topics with its own mixture weights; the superscripts mark which topic generated each word.

DOCUMENT 1 (weights .8 / .2): money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

DOCUMENT 2 (weights .3 / .7): river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1

A code sketch of this process follows below.)
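A minimal sketch of this generative process in numpy; the per-topic word probabilities are invented for illustration, while the mixture weights .8/.2 and .3/.7 come from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "loan", "bank", "river", "stream"]

# Mixture components: one unigram distribution per topic (probabilities invented).
topics = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0],   # TOPIC 1: money, loan, bank
    [0.0, 0.0, 0.3, 0.4, 0.3],   # TOPIC 2: bank, river, stream
])

def generate_doc(weights, length=20):
    """Pick a topic per word from the mixture weights, then a word from that topic."""
    words = []
    for _ in range(length):
        z = rng.choice(2, p=weights)        # topic indicator for this word
        w = rng.choice(5, p=topics[z])      # word drawn from that topic
        words.append(f"{vocab[w]}{z + 1}")  # label each word with its topic
    return " ".join(words)

print("DOCUMENT 1:", generate_doc([0.8, 0.2]))  # mostly TOPIC 1
print("DOCUMENT 2:", generate_doc([0.3, 0.7]))  # mostly TOPIC 2
```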
Why is this Useful?

Q: What is it about? A: Mainly MT, with syntax, some learning. (The same abstract as before, "A Hierarchical Phrase-Based Model for Statistical Machine Translation", is now summarized by its topic mixing proportions 0.6 / 0.3 / 0.1 over MT / Syntax / Learning.)

Q: Give me similar documents? A structured way of browsing the collection.

Other tasks:
- Dimensionality reduction: TF-IDF vector vs. topic mixing proportion
- Classification, clustering, and more ...
Topic Models = Mixed Membership Models

Generating a document:
- Draw $\theta$ from the prior
- For each word $n = 1, \ldots, N_d$:
  - Draw $z_n$ from $\text{multinomial}(\theta)$
  - Draw $w_n \mid z_n, \beta_{1:k}$ from $\text{multinomial}(\beta_{z_n})$

(Plate diagram: prior $\rightarrow \theta \rightarrow z \rightarrow w \leftarrow \beta$; $N_d$ words per document, $N$ documents, $K$ topics.)

Which prior to use?
Latent Dirichlet Allocation

Blei, Ng and Jordan (2003)

(Plate diagram: $\alpha \rightarrow \theta \rightarrow z_n \rightarrow w_n \leftarrow \beta_k$; $N$ words per document, $M$ documents, $K$ topics.)

$p(\mathbf{w} \mid \alpha, \eta) = \int\!\!\int p(\theta \mid \alpha)\, p(\beta \mid \eta) \prod_{n=1}^{N} \Big( \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big)\, d\theta\, d\beta$

Essentially a Bayesian pLSI.
LDA

Generative model:
- Models each word in a document as a sample from a mixture model.
- Each word is generated from a single topic; different words in the document may be generated from different topics.
- A topic is characterized by a distribution over words.
- Each document is represented as a list of mixing proportions for the mixture components (i.e., a topic vector).
- The topic vectors and the word rates each follow a Dirichlet prior --- essentially a Bayesian pLSI. (A generative sketch in code follows below.)
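A minimal sketch of this generative process, with Dirichlet draws for both the topic vectors and the topic-word distributions; the hyperparameters and corpus sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 10, 5, 8   # topics, vocabulary size, documents, words per document
alpha, eta = 0.5, 0.1      # Dirichlet hyperparameters (illustrative values)

# Topic-word distributions: beta_k ~ Dirichlet(eta), one per topic.
beta = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for d in range(M):
    theta = rng.dirichlet(np.full(K, alpha))  # topic vector for document d
    doc = []
    for n in range(N):
        z = rng.choice(K, p=theta)            # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])          # w_n ~ Multinomial(beta_{z_n})
        doc.append(w)
    corpus.append(doc)

print(corpus[0])  # word ids of the first generated document
```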
LSI versus pLSI

(Figure: both methods factor the words $\times$ documents matrix.
- LSI: $X' = W D$, a words $\times$ topics matrix times a topics $\times$ documents matrix, with unconstrained real-valued entries.
- Topic models: $P(w) = \sum_z P(w \mid z)\, P(z)$, the same factorization but with probability distributions; topic mixing is via repeated word labeling.)
Inference Tasks
Bayesian inference

A possible query:
$p(\theta_n \mid D) = ?$ and $p(z_{n,m} \mid D) = ?$

$p(\theta_n \mid D) = \dfrac{p(\theta_n, D)}{p(D)} = \dfrac{\int \sum_{\{z_{n,m}\}} \prod_m p(x_{n,m} \mid z_{n,m}, \beta)\, p(z_{n,m} \mid \theta_n)\, p(\theta_n \mid \alpha)\, p(\beta \mid G)\, d\beta}{p(D)}$

where

$p(D) = \prod_n \int\!\!\int \sum_{\{z_{n,m}\}} \prod_m p(x_{n,m} \mid z_{n,m}, \beta)\, p(z_{n,m} \mid \theta_n)\, p(\theta_n \mid \alpha)\, p(\beta \mid G)\, d\theta_n\, d\beta$

Closed-form solution? No: the sum in the denominator is over $T^n$ terms, and we must integrate over $n$ $k$-dimensional topic vectors. For example, with $T = 10$ topics and $n = 100$ words, the sum alone has $10^{100}$ terms, so exact inference is intractable.
Approximate Inference

Variational inference:
- Mean field approximation (Blei et al.)
- Expectation propagation (Minka et al.)
- Variational 2nd-order Taylor approximation (Xing)

Markov chain Monte Carlo:
- Gibbs sampling (Griffiths et al.)
Collapsed Gibbs sampling (Tom Griffiths & Mark Steyvers)

Collapsed Gibbs sampling: integrate out $\theta$. Then, for the topic variables $\mathbf{z} = z_1, z_2, \ldots, z_n$:

Draw $z_i^{(t+1)}$ from $P(z_i \mid \mathbf{z}_{-i}, \mathbf{w})$, where
$\mathbf{z}_{-i} = z_1^{(t+1)}, z_2^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_n^{(t)}$
Gibbs sampling

Need the full conditional distributions for the variables. Since we only sample $\mathbf{z}$, we need (in Griffiths and Steyvers' notation)

$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \dfrac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \dfrac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$

where $n^{(w_i)}_{-i,j}$ is the number of times word $w_i$ is assigned to topic $j$, and $n^{(d_i)}_{-i,j}$ is the number of times topic $j$ is used in document $d_i$ (both counts excluding the current token $i$). A code sketch follows below.
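A minimal sketch of one collapsed Gibbs sweep over this full conditional; the function and variable names are mine, and a real implementation would also handle initialization, burn-in, and averaging over samples:

```python
import numpy as np

def gibbs_sweep(words, docs, z, n_wt, n_dt, n_t, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling for LDA.

    words[i], docs[i]: word id and document id of token i
    z[i]:  current topic of token i
    n_wt:  V x T word-topic counts;  n_dt: D x T document-topic counts
    n_t:   length-T totals of words per topic
    """
    V, T = n_wt.shape
    for i in range(len(words)):
        w, d, t = words[i], docs[i], z[i]
        # Remove token i from all counts (the "-i" in the full conditional).
        n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
        # Full conditional over topics j, up to a constant; the document-length
        # term in the denominator is the same for all j, so it drops out.
        p = (n_wt[w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
        p /= p.sum()
        t = rng.choice(T, p=p)
        # Add token i back with its newly sampled topic.
        z[i] = t
        n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
```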
Gibbs sampling

Worked example (shown over several slides). Each token $i$ has a word $w_i$, a document $d_i$, and a topic assignment $z_i$. During iteration 2 the sampler revisits the tokens one at a time ($z_1$ first, shown as "?", then $z_2$, and so on), each draw conditioned on all the other current assignments:

i   w_i           d_i   z_i (iter 1)   z_i (iter 2)   ...   z_i (iter 1000)
1   MATHEMATICS   1     2              2                    2
2   KNOWLEDGE     1     2              1                    2
3   RESEARCH      1     1              1                    2
4   WORK          1     2              2                    1
5   MATHEMATICS   1     1              2                    2
6   RESEARCH      1     2              2                    2
7   WORK          1     2              2                    2
8   SCIENTIFIC    1     1              1                    1
9   MATHEMATICS   1     2              2                    2
10  WORK          1     1              2                    2
11  SCIENTIFIC    2     1              1                    2
12  KNOWLEDGE     2     1              2                    2
...
50  JOY           5     2              1                    1

After many sweeps (here, 1000 iterations) the assignments settle into a sample from the posterior over topic assignments.
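Once the chain has mixed (e.g., after the 1000 iterations above), point estimates of the topic-word and document-topic distributions can be read off the final counts; in Griffiths and Steyvers' notation:

$\hat{\phi}^{(w)}_j = \dfrac{n^{(w)}_j + \beta}{n^{(\cdot)}_j + W\beta}, \qquad \hat{\theta}^{(d)}_j = \dfrac{n^{(d)}_j + \alpha}{n^{(d)}_{\cdot} + T\alpha}$

where $W$ is the vocabulary size and $T$ the number of topics.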
Topic Models for Images

Latent Dirichlet Allocation (LDA) applied to images, e.g., a "beach" topic (Fei-Fei et al., ICCV 2005).

(Plate diagram: class $c$, topic $z$, visual word $w$; $N$ words per image, $D$ images.)
Image Representation

Each image $d$ is represented by two kinds of variables:

$r_d = [r_{d1}, \ldots, r_{dn}]$ : representation vectors (real-valued, one per image segment)
$w_d = [w_{d1}, \ldots, w_{d|V|}]$ : annotation vector (binary, the same for each segment)

Example annotation: cat, grass, tiger, water.
To Generate an Image …
Annotated images

- Forsyth et al. (2001): images as documents, where region-specific feature vectors are like visual words.
- A captioned image can be thought of as annotated data: two documents, one of which describes the other.
Gaussian-multinomial LDA

A natural next step is to glue two LDA models together:
- Bottom: a traditional LDA model on captions
- Top: a Gaussian-LDA model on images, where each region is a multivariate Gaussian

Does not work well.
Automatic annotation
Conclusion

GM-based topic models are cool:
- Flexible
- Modular
- Interactive
- ...