Machine Learning
Latent Aspect Models
Eric Xing
Lecture 14, August 15, 2010
Reading: see class homepage
Apoptosis + Medicine

(Figure, shown over three slides: a collection of documents about apoptosis and medicine, first produced by a probabilistic generative model, then analyzed by statistical inference.)

Connecting Probability Models to Data

(Diagram: Probabilistic Model <-> Real World Data; P(Data | Parameters) is the generative model, and P(Parameters | Data) is inference.)
Motivation for modeling latent topical aspects

Dimensionality reduction:
- A VSM lives in a very high-dimensional feature space (usually the size of the vocabulary, |V|).
- Document representations are sparse (|V| >> the number of distinct words appearing in any given document), which is often too sparse and noisy a signal for many IR tasks.

Semantic analysis and comprehension:
- A need to define conceptual closeness,
- to capture relations between features,
- to distinguish and infer features from heterogeneous sources ...
Latent Semantic Indexing (Deerwester et al., 1990)
- A classic attempt at solving this problem in information retrieval
- Uses SVD to reduce document representations
- Models synonymy and polysemy
- Computing the SVD is slow
- Non-probabilistic model
Term-document matrix decomposition (figure):

$X_{m \times n} = T_{m \times k} \, S_{k \times k} \, D^{T}_{k \times n}$

where $X$ is the term-document matrix, $T$ maps terms to the $k$ latent concepts, $S$ is the diagonal matrix of singular values, and $D^{T}$ maps concepts to documents; keeping only the top $k$ singular values gives a low-rank approximation of $X$, as sketched below.
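As a minimal sketch of this decomposition, the following numpy snippet builds a toy term-document matrix (the counts and vocabulary are invented for illustration) and truncates its SVD to $k = 2$ concepts:

```python
import numpy as np

# Toy term-document matrix X (m terms x n documents); the counts are invented.
X = np.array([
    [2, 0, 1, 0],   # "bank"
    [1, 0, 2, 0],   # "money"
    [0, 2, 0, 1],   # "river"
    [0, 1, 0, 2],   # "stream"
], dtype=float)

k = 2  # number of latent concepts to keep

# Full SVD, then truncate to rank k: X ~= T S D^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :k]          # m x k: terms -> concepts
S = np.diag(s[:k])    # k x k: singular values
Dt = Vt[:k, :]        # k x n: concepts -> documents

X_k = T @ S @ Dt      # rank-k approximation of X
print(np.round(X_k, 2))

# Documents can now be compared in the k-dimensional concept space.
doc_vectors = (S @ Dt).T   # n x k
```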
Latent Semantic Structure

(Figure: a latent structure $\ell$ generates the observed words $\mathbf{w}$.)

Distribution over words:
$P(\mathbf{w}) = \sum_{\ell} P(\mathbf{w}, \ell)$

Inferring latent structure:
$P(\ell \mid \mathbf{w}) = \dfrac{P(\mathbf{w} \mid \ell)\, P(\ell)}{P(\mathbf{w})}$

Prediction:
$P(w_{n+1} \mid w_1, \ldots, w_n) = \ldots$
Topic Models

How to Model Semantics? Q: What is it about? A: Mainly MT, with syntax, some learning.

A Hierarchical Phrase-Based Model for Statistical Machine Translation

We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.

(Figure: each topic is a unigram distribution over the vocabulary; the document mixes them with proportions 0.6 / 0.3 / 0.1:
- MT (0.6): Source, Target, SMT, Alignment, Score, BLEU
- Syntax (0.3): Parse, Tree, Noun, Phrase, Grammar, CFG
- Learning (0.1): likelihood, EM, Hidden, Parameters, Estimation, argMax)
Words in Contexts

the opposition Labor Party fared even worse, with a predicted 35 seats, seven less than last election.
"Words" in Contexts (con'd)
Sivic et al. ICCV 2005
GENERATIVE PROCESS

(Figure: two mixture components, each a distribution over words:
- TOPIC 1: money, loan, bank
- TOPIC 2: river, stream, bank

Each document mixes the topics with its own mixture weights; the superscripts mark which topic generated each word.

DOCUMENT 1 (weights .8 / .2): money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

DOCUMENT 2 (weights .3 / .7): river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1

A code sketch of this process follows below.)
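A minimal sketch of this generative process in numpy; the per-topic word probabilities are invented for illustration, while the mixture weights .8/.2 and .3/.7 come from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "loan", "bank", "river", "stream"]

# Mixture components: one unigram distribution per topic (probabilities invented).
topics = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0],   # TOPIC 1: money, loan, bank
    [0.0, 0.0, 0.3, 0.4, 0.3],   # TOPIC 2: bank, river, stream
])

def generate_doc(weights, length=20):
    """Pick a topic per word from the mixture weights, then a word from that topic."""
    words = []
    for _ in range(length):
        z = rng.choice(2, p=weights)        # topic indicator for this word
        w = rng.choice(5, p=topics[z])      # word drawn from that topic
        words.append(f"{vocab[w]}{z + 1}")  # label each word with its topic
    return " ".join(words)

print("DOCUMENT 1:", generate_doc([0.8, 0.2]))  # mostly TOPIC 1
print("DOCUMENT 2:", generate_doc([0.3, 0.7]))  # mostly TOPIC 2
```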
Why is this Useful?

Q: What is it about? A: Mainly MT, with syntax, some learning. (The same abstract as before, "A Hierarchical Phrase-Based Model for Statistical Machine Translation", is now summarized by its topic mixing proportions 0.6 / 0.3 / 0.1 over MT / Syntax / Learning.)

Q: Give me similar documents? A structured way of browsing the collection.

Other tasks:
- Dimensionality reduction: TF-IDF vector vs. topic mixing proportion
- Classification, clustering, and more ...
Topic Models = Mixed Membership Models

Generating a document:
- Draw $\theta$ from the prior
- For each word $n = 1, \ldots, N_d$:
  - Draw $z_n$ from $\text{multinomial}(\theta)$
  - Draw $w_n \mid z_n, \beta_{1:k}$ from $\text{multinomial}(\beta_{z_n})$

(Plate diagram: prior $\rightarrow \theta \rightarrow z \rightarrow w \leftarrow \beta$; $N_d$ words per document, $N$ documents, $K$ topics.)

Which prior to use?
Latent Dirichlet Allocation

Blei, Ng and Jordan (2003)

(Plate diagram: $\alpha \rightarrow \theta \rightarrow z_n \rightarrow w_n \leftarrow \beta_k$; $N$ words per document, $M$ documents, $K$ topics.)

$p(\mathbf{w} \mid \alpha, \eta) = \int\!\!\int p(\theta \mid \alpha)\, p(\beta \mid \eta) \prod_{n=1}^{N} \Big( \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big)\, d\theta\, d\beta$

Essentially a Bayesian pLSI.
LDA

Generative model:
- Models each word in a document as a sample from a mixture model.
- Each word is generated from a single topic; different words in the document may be generated from different topics.
- A topic is characterized by a distribution over words.
- Each document is represented as a list of mixing proportions for the mixture components (i.e., a topic vector).
- The topic vectors and the word rates each follow a Dirichlet prior --- essentially a Bayesian pLSI. (A generative sketch in code follows below.)
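A minimal sketch of this generative process, with Dirichlet draws for both the topic vectors and the topic-word distributions; the hyperparameters and corpus sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 10, 5, 8   # topics, vocabulary size, documents, words per document
alpha, eta = 0.5, 0.1      # Dirichlet hyperparameters (illustrative values)

# Topic-word distributions: beta_k ~ Dirichlet(eta), one per topic.
beta = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for d in range(M):
    theta = rng.dirichlet(np.full(K, alpha))  # topic vector for document d
    doc = []
    for n in range(N):
        z = rng.choice(K, p=theta)            # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])          # w_n ~ Multinomial(beta_{z_n})
        doc.append(w)
    corpus.append(doc)

print(corpus[0])  # word ids of the first generated document
```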
LSI versus pLSI

(Figure: both methods factor the words $\times$ documents matrix.
- LSI: $X' = W D$, a words $\times$ topics matrix times a topics $\times$ documents matrix, with unconstrained real-valued entries.
- Topic models: $P(w) = \sum_z P(w \mid z)\, P(z)$, the same factorization but with probability distributions; topic mixing is via repeated word labeling.)
Inference Tasks
Bayesian inference

A possible query:
$p(\theta_n \mid D) = ?$ and $p(z_{n,m} \mid D) = ?$

$p(\theta_n \mid D) = \dfrac{p(\theta_n, D)}{p(D)} = \dfrac{\int \sum_{\{z_{n,m}\}} \prod_m p(x_{n,m} \mid z_{n,m}, \beta)\, p(z_{n,m} \mid \theta_n)\, p(\theta_n \mid \alpha)\, p(\beta \mid G)\, d\beta}{p(D)}$

where

$p(D) = \prod_n \int\!\!\int \sum_{\{z_{n,m}\}} \prod_m p(x_{n,m} \mid z_{n,m}, \beta)\, p(z_{n,m} \mid \theta_n)\, p(\theta_n \mid \alpha)\, p(\beta \mid G)\, d\theta_n\, d\beta$

Closed-form solution? No: the sum in the denominator is over $T^n$ terms, and we must integrate over $n$ $k$-dimensional topic vectors. For example, with $T = 10$ topics and $n = 100$ words, the sum alone has $10^{100}$ terms, so exact inference is intractable.
Approximate Inference

Variational inference:
- Mean field approximation (Blei et al.)
- Expectation propagation (Minka et al.)
- Variational 2nd-order Taylor approximation (Xing)

Markov chain Monte Carlo:
- Gibbs sampling (Griffiths et al.)
Collapsed Gibbs sampling (Tom Griffiths & Mark Steyvers)

Collapsed Gibbs sampling: integrate out $\theta$. Then, for the topic variables $\mathbf{z} = z_1, z_2, \ldots, z_n$:

Draw $z_i^{(t+1)}$ from $P(z_i \mid \mathbf{z}_{-i}, \mathbf{w})$, where
$\mathbf{z}_{-i} = z_1^{(t+1)}, z_2^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_n^{(t)}$
Gibbs sampling

Need the full conditional distributions for the variables. Since we only sample $\mathbf{z}$, we need (in Griffiths and Steyvers' notation)

$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \dfrac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \dfrac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$

where $n^{(w_i)}_{-i,j}$ is the number of times word $w_i$ is assigned to topic $j$, and $n^{(d_i)}_{-i,j}$ is the number of times topic $j$ is used in document $d_i$ (both counts excluding the current token $i$). A code sketch follows below.
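A minimal sketch of one collapsed Gibbs sweep over this full conditional; the function and variable names are mine, and a real implementation would also handle initialization, burn-in, and averaging over samples:

```python
import numpy as np

def gibbs_sweep(words, docs, z, n_wt, n_dt, n_t, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling for LDA.

    words[i], docs[i]: word id and document id of token i
    z[i]:  current topic of token i
    n_wt:  V x T word-topic counts;  n_dt: D x T document-topic counts
    n_t:   length-T totals of words per topic
    """
    V, T = n_wt.shape
    for i in range(len(words)):
        w, d, t = words[i], docs[i], z[i]
        # Remove token i from all counts (the "-i" in the full conditional).
        n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
        # Full conditional over topics j, up to a constant; the document-length
        # term in the denominator is the same for all j, so it drops out.
        p = (n_wt[w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
        p /= p.sum()
        t = rng.choice(T, p=p)
        # Add token i back with its newly sampled topic.
        z[i] = t
        n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
```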
Gibbs sampling

Worked example (shown over several slides). Each token $i$ has a word $w_i$, a document $d_i$, and a topic assignment $z_i$. During iteration 2 the sampler revisits the tokens one at a time ($z_1$ first, shown as "?", then $z_2$, and so on), each draw conditioned on all the other current assignments:

i   w_i           d_i   z_i (iter 1)   z_i (iter 2)   ...   z_i (iter 1000)
1   MATHEMATICS   1     2              2                    2
2   KNOWLEDGE     1     2              1                    2
3   RESEARCH      1     1              1                    2
4   WORK          1     2              2                    1
5   MATHEMATICS   1     1              2                    2
6   RESEARCH      1     2              2                    2
7   WORK          1     2              2                    2
8   SCIENTIFIC    1     1              1                    1
9   MATHEMATICS   1     2              2                    2
10  WORK          1     1              2                    2
11  SCIENTIFIC    2     1              1                    2
12  KNOWLEDGE     2     1              2                    2
...
50  JOY           5     2              1                    1

After many sweeps (here, 1000 iterations) the assignments settle into a sample from the posterior over topic assignments.
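Once the chain has mixed (e.g., after the 1000 iterations above), point estimates of the topic-word and document-topic distributions can be read off the final counts; in Griffiths and Steyvers' notation:

$\hat{\phi}^{(w)}_j = \dfrac{n^{(w)}_j + \beta}{n^{(\cdot)}_j + W\beta}, \qquad \hat{\theta}^{(d)}_j = \dfrac{n^{(d)}_j + \alpha}{n^{(d)}_{\cdot} + T\alpha}$

where $W$ is the vocabulary size and $T$ the number of topics.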
Topic Models for Images

Latent Dirichlet Allocation (LDA) applied to images, e.g., a "beach" topic (Fei-Fei et al., ICCV 2005).

(Plate diagram: class $c$, topic $z$, visual word $w$; $N$ words per image, $D$ images.)
Image Representation

Each image $d$ is represented by two kinds of variables:

$r_d = [r_{d1}, \ldots, r_{dn}]$ : representation vectors (real-valued, one per image segment)
$w_d = [w_{d1}, \ldots, w_{d|V|}]$ : annotation vector (binary, the same for each segment)

Example annotation: cat, grass, tiger, water.
To Generate an Image …
Annotated images

- Forsyth et al. (2001): images as documents, where region-specific feature vectors are like visual words.
- A captioned image can be thought of as annotated data: two documents, one of which describes the other.
Gaussian-multinomial LDA

A natural next step is to glue two LDA models together:
- Bottom: a traditional LDA model on captions
- Top: a Gaussian-LDA model on images, where each region is a multivariate Gaussian

Does not work well.
Automatic annotation
Conclusion

GM-based topic models are cool:
- Flexible
- Modular
- Interactive
- ...