Introduction to Expectation Maximization
CMSC 678, UMBC
March 28th, 2018
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Announcement 1: Assignment 3
Due Wednesday April 11th, 11:59 AM
Any questions?
Announcement 2: Progress Report on Project
Due Monday April 16th, 11:59 AM
Build on the proposal:
• Update to address comments
• Discuss the progress you’ve made
• Discuss what remains to be done
• Discuss any new blocks you’ve experienced (or anticipate experiencing)
Any questions?
Recap(+) from last time…
Optimization for SVMs
Separable case (hard margin SVM): maximize the margin
Non-separable case (soft margin SVM): maximize the margin while minimizing slack
Subgradient of Hinge Loss
A subgradient is any direction that lies below the function. For the hinge loss $\ell(y, \hat{y}) = \max\{0, 1 - y\, w^T x\}$, with $\hat{y} = w^T x$, a possible subgradient is $-y x$ when $y\, w^T x < 1$ and $0$ otherwise.
Plotted against $z = w^T x$ (for $y = 1$), the loss is not differentiable at $z = 1$.
Example: Hinge Loss Objective
The loss term is active only for points that violate the margin; on those points, the subgradient step is the perceptron update.
Take-away: the perceptron algorithm optimizes a hinge loss objective.
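To make the connection concrete, here is a minimal sketch (hypothetical data; not from the slides) of one subgradient step on the hinge loss; on a misclassified point with learning rate 1, the step is exactly the perceptron update:

```python
def hinge_subgradient_step(w, x, y, lr=1.0):
    # Hinge loss: l(y, w.x) = max(0, 1 - y * (w.x)).
    # A subgradient w.r.t. w is -y*x when y * (w.x) < 1, and 0 otherwise.
    score = sum(wi * xi for wi, xi in zip(w, x))
    if y * score < 1:  # inside the margin or misclassified
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]  # step opposite the subgradient
    return w

# A misclassified point (score = 0 < 1) triggers the perceptron-style update.
print(hinge_subgradient_step([0.0, 0.0], x=[1.0, 2.0], y=1))  # [1.0, 2.0]
```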
Lagrange multipliers: assume an original optimization problem; we convert it to a new optimization problem.
Dual formulation for SVM
Basics of Probability
• Requirements to be a distribution (“proportional to”, ∝)
• Definitions of conditional probability, joint probability, and independence
• Bayes rule, (probability) chain rule
• Expectation (of a random variable & function)

Empirical Risk Minimization
• Gradient descent
• Loss functions: what are they, what do they measure, and what are some computational difficulties with them?
• Regularization: what is it, how does it work, and why might you want it?

Tasks (High Level)
• Data set splits: training vs. dev vs. test
• Classification: posterior decoding/MAP classifier
• Classification evaluations: accuracy, precision, recall, and F scores
• Regression (vs. classification)
• Comparing supervised vs. unsupervised learning and their tradeoffs: why might you want to use one vs. the other, and what are some potential issues?
• Clustering: high-level goal/task, K-means as an example
• Tradeoffs among clustering evaluations

Linear Models
• Basic form of a linear model (classification or regression)
• Perceptron (simple vs. other variants, like averaged or voted)
• When you should use perceptron (what are its assumptions?)
• Perceptron as SGD

Maximum Entropy Models
• Meanings of feature functions and weights
• How to learn the weights: gradient descent
• Meaning of the maxent gradient

Neural Networks
• Relation to linear models and maxent
• Types (feedforward, CNN, RNN)
• Learning representations (e.g., “feature maps”)
• What is a convolution (e.g., 1D vs. 2D; high-level notions of why you might want to change padding or the width)
• How to learn: gradient descent, backprop
• Common activation functions
• Neural network regularization

Dimensionality Reduction
• What is the basic task & goal in dimensionality reduction?
• Dimensionality reduction tradeoffs: why might you want to, and what are some potential issues?
• Linear Discriminant Analysis vs. Principal Component Analysis: what are they trying to do, how are they similar, how do they differ?

Kernel Methods & SVMs
• Feature expansion and kernels
• Two views: maximizing a separating hyperplane margin vs. loss optimization (norm minimization)
• Non-separability & slack
• Sub-gradients
Course Overview (so far)
Remember from the first day: A Terminology Buffet
the task (what kind of problem are you solving?): Classification, Regression, Clustering
the data (amount of human input/number of labeled examples): Fully-supervised, Semi-supervised, Un-supervised
the approach (how any data are being used): Probabilistic, Generative, Conditional, Spectral, Neural, Memory-based, Exemplar…
what we’ve currently sampled… what we’ll be sampling next…
LATENT VARIABLE MODELS AND EXPECTATION MAXIMIZATION
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Is (Supervised) Classification “Latent?” Not Obviously
ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
best label = $\arg\max_{\text{label}}\, p(\text{label} \mid \text{item})$ (MAP classifier)
These values are unknown, but the generation process (explanation) is transparent.
Input: an instance d; a fixed set of classes C = {c1, c2, …, cJ}; a training set of m hand-labeled instances (d1, c1), …, (dm, cm)
Output: a learned (probabilistic) classifier γ that maps instances to classes
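As a small illustration (hypothetical posterior values, not from the slides), MAP classification just picks the label with the highest posterior:

```python
# Hypothetical posteriors p(label | item) for one document.
posterior = {"ATTACK": 0.7, "ELECTION": 0.2, "SPORTS": 0.1}

# MAP classifier: best label = argmax_label p(label | item).
best_label = max(posterior, key=posterior.get)
print(best_label)  # ATTACK
```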
Latent Sequences: Part of Speech
British Left Waffles on Falkland Islands
Two readings: (i) Adjective Noun Verb … vs. (ii) Noun Verb Noun …
Levels of linguistic structure behind the observed text: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse. (Adapted from Jason Eisner, Noah Smith)
Latent Modeling
Explain what you see/annotate with things “of importance” that you don’t observe.
Latent Sequence Models: Part of Speech
p(British Left Waffles on Falkland Islands)
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)
2. Produce a tag sequence for this sentence
Example: Rolling a Die
$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$
N different (independent) rolls: $w_1 = 1$, $w_2 = 5$, $w_3 = 4$, …
Maximize the (log-)likelihood to learn the probability parameters.
Maximum likelihood estimates: p(1) = 2/9, p(2) = 1/9, p(3) = 1/9, p(4) = 3/9, p(5) = 1/9, p(6) = 1/9
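As a quick sketch (the nine rolls below are hypothetical, chosen to reproduce the estimates above), the maximum likelihood estimate for a die is just normalized counts:

```python
from collections import Counter

# Hypothetical 9 rolls consistent with the MLEs above:
# two 1s, three 4s, and one each of 2, 3, 5, 6.
rolls = [1, 1, 2, 3, 4, 4, 4, 5, 6]

# MLE for a categorical distribution: p(face) = count(face) / total rolls.
counts = Counter(rolls)
mle = {face: counts[face] / len(rolls) for face in range(1, 7)}
print(mle)  # {1: 0.22..., 2: 0.11..., 3: 0.11..., 4: 0.33..., 5: 0.11..., 6: 0.11...}
```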
Example: Conditionally Rolling a Die
Add complexity to better explain what we see:
$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$
$w_1 = 1, z_1 = A$; $w_2 = 5, z_2 = B$; …
Examples of latent classes z: part of speech tag; topic (“sports” vs. “politics”)
Goal: maximize the (log-)likelihood. But we don’t actually observe these z values; we just see the observed items (rolls) w.
The goal therefore becomes: maximize the marginalized (log-)likelihood
$p(w_1, w_2, \ldots, w_N) = \sum_{z_1} p(z_1, w_1)\, \sum_{z_2} p(z_2, w_2) \cdots \sum_{z_N} p(z_N, w_N)$
A chicken-and-egg problem:
If we did observe z, estimating the probability parameters would be easy… but we don’t! :(
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(
http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg
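To make the marginalization concrete, here is a minimal sketch (with hypothetical parameters for a two-die mixture; none of this is from the slides) that sums out each $z_i$ to evaluate the marginal log-likelihood of the observed rolls:

```python
import math

# Hypothetical two-die mixture: latent z in {"A", "B"}.
p_z = {"A": 0.5, "B": 0.5}                       # p(z)
p_w_given_z = {                                  # p(w | z) over faces 1..6
    "A": {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},
    "B": {face: 1 / 6 for face in range(1, 7)},
}

def marginal_log_likelihood(rolls):
    # log p(w_1, ..., w_N) = sum_i log sum_z p(z) * p(w_i | z)
    return sum(
        math.log(sum(p_z[z] * p_w_given_z[z][w] for z in p_z))
        for w in rolls
    )

print(marginal_log_likelihood([1, 5, 4, 1]))
```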
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Expectation Maximization (EM)
0. Assume some value for your parameters.
A two-step, iterative algorithm:
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize the log-likelihood, assuming these uncertain counts
Expectation Maximization (EM): E-step
1. E-step: count under uncertainty, assuming the current parameters: accumulate expected counts $\mathrm{count}(z_i, w_i)\, p(z_i)$.
We’ve already seen this type of counting, when computing the gradient in maxent models.
Expectation Maximization (EM): M-step
2. M-step: maximize the log-likelihood, assuming these uncertain counts: update $p^{(t)}(z) \to p^{(t+1)}(z)$ from the estimated counts.
EM Math
$\max_\theta\; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)} \left[ \log p_\theta(z, w) \right]$
E-step: count under uncertainty; the expectation is taken under the posterior distribution given the old parameters $\theta^{(t)}$.
M-step: maximize the log-likelihood with respect to the new parameters $\theta$.
Why EM? Semi-Supervised Learning
labeled data: human annotated; relatively small/few examples
unlabeled data: raw, not annotated; plentiful
EM lets a model learn from both the labeled and the unlabeled data.
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Three Coins Example
Imagine three coins. Flip the 1st coin (penny). If heads: flip the 2nd coin (dollar coin). If tails: flip the 3rd coin (dime).
We only observe the second flip (we record its heads vs. tails outcome); we don’t observe the penny.
Analogues: observed: a, b, e, etc., or “We run the code” vs. “The run failed”; unobserved: part of speech? genre?
Each coin has its own heads probability:
Penny: $p(\text{heads}) = \lambda$, $p(\text{tails}) = 1 - \lambda$
Dollar coin: $p(\text{heads}) = \gamma$, $p(\text{tails}) = 1 - \gamma$
Dime: $p(\text{heads}) = \psi$, $p(\text{tails}) = 1 - \psi$
Three parameters to estimate: λ, γ, and ψ
Three Coins Example: if all flips were observed
penny: H H T H T H
2nd flip: H T H T T T
Penny: $p(\text{heads}) = 4/6$, $p(\text{tails}) = 2/6$
Dollar coin (flipped on the four penny-heads): $p(\text{heads}) = 1/4$, $p(\text{tails}) = 3/4$
Dime (flipped on the two penny-tails): $p(\text{heads}) = 1/2$, $p(\text{tails}) = 1/2$
Three Coins Example: but not all flips are observed
Observed: H T H T T T (the penny row is hidden)
Set parameter values: penny $p(\text{heads}) = \lambda = .6$, $p(\text{tails}) = .4$; dollar coin $p(\text{heads}) = .8$, $p(\text{tails}) = .2$; dime $p(\text{heads}) = .6$, $p(\text{tails}) = .4$
Use these values to compute posteriors over the hidden penny, rewriting each joint with Bayes rule:
$p(\text{heads} \mid \text{observed item H}) = \frac{p(\text{heads} \,\&\, \text{H})}{p(\text{H})} = \frac{p(\text{H} \mid \text{heads})\, p(\text{heads})}{p(\text{H})}$, and likewise $p(\text{heads} \mid \text{observed item T}) = \frac{p(\text{T} \mid \text{heads})\, p(\text{heads})}{p(\text{T})}$
Here $p(\text{H} \mid \text{heads}) = .8$ and $p(\text{T} \mid \text{heads}) = .2$ (the dollar coin), and the denominator is the marginal likelihood:
$p(\text{H}) = p(\text{H} \mid \text{heads})\, p(\text{heads}) + p(\text{H} \mid \text{tails})\, p(\text{tails}) = .8 \cdot .6 + .6 \cdot .4 = .72$
$p(\text{heads} \mid \text{obs. H}) = \frac{p(\text{H} \mid \text{heads})\, p(\text{heads})}{p(\text{H})} = \frac{.8 \cdot .6}{.8 \cdot .6 + .6 \cdot .4} \approx 0.667$
$p(\text{heads} \mid \text{obs. T}) = \frac{p(\text{T} \mid \text{heads})\, p(\text{heads})}{p(\text{T})} = \frac{.2 \cdot .6}{.2 \cdot .6 + .4 \cdot .4} \approx 0.429$
Q: Is $p(\text{heads} \mid \text{obs. H}) + p(\text{heads} \mid \text{obs. T}) = 1$?
A: No. In general, $p(\text{heads} \mid \text{obs. H})$ and $p(\text{heads} \mid \text{obs. T})$ do NOT sum to 1.
Three Coins Example: use posteriors to update parameters
Fully observed setting: $p(\text{heads}) = \frac{\#\,\text{heads from penny}}{\#\,\text{total flips of penny}}$
Our setting, partially observed: $p^{(t+1)}(\text{heads}) = \frac{\#\,\textit{expected}\,\text{heads from penny}}{\#\,\text{total flips of penny}} = \frac{\mathbb{E}_{p^{(t)}}[\#\,\text{heads from penny}]}{\#\,\text{total flips of penny}}$
With two observed H’s and four observed T’s among the six flips:
$p^{(t+1)}(\text{heads}) = \frac{2 \cdot p(\text{heads} \mid \text{obs. H}) + 4 \cdot p(\text{heads} \mid \text{obs. T})}{6} = \frac{2 \cdot 0.667 + 4 \cdot 0.429}{6} \approx 0.508$
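Putting the E-step and M-step together, here is a minimal EM sketch for the three-coins model (em_three_coins is a hypothetical helper, not from the slides); starting from the initialization above, one iteration reproduces the penny update, and it re-estimates the dollar coin and dime as well:

```python
def em_three_coins(obs, lam, gam, psi, iters=20):
    """EM for the three-coins model.

    obs: observed 2nd-flip outcomes ('H'/'T')
    lam: p(penny=heads); gam: p(dollar=heads); psi: p(dime=heads)
    """
    for _ in range(iters):
        # E-step: posterior that the hidden penny was heads, per observation.
        q = []
        for w in obs:
            ph = lam * (gam if w == "H" else 1 - gam)        # penny=H path
            pt = (1 - lam) * (psi if w == "H" else 1 - psi)  # penny=T path
            q.append(ph / (ph + pt))
        # M-step: re-estimate each parameter from the expected counts.
        lam = sum(q) / len(q)
        gam = sum(qi for qi, w in zip(q, obs) if w == "H") / sum(q)
        psi = sum(1 - qi for qi, w in zip(q, obs) if w == "H") / sum(1 - qi for qi in q)
    return lam, gam, psi

# One iteration from the initialization above: lam moves from 0.6 to ~0.508.
print(em_three_coins(list("HTHTTT"), lam=0.6, gam=0.8, psi=0.6, iters=1))
```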
Expectation Maximization (EM)
0. Assume some value for your parameters.
A two-step, iterative algorithm:
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize the log-likelihood, assuming these uncertain counts
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Why does EM work?
$X$: observed data; $Y$: unobserved data
$\mathcal{C}(\theta)$ = log-likelihood of complete data (X, Y)
$\mathcal{P}(\theta)$ = posterior log-likelihood of incomplete data Y
$\mathcal{M}(\theta)$ = marginal log-likelihood of observed data X
What do $\mathcal{C}$, $\mathcal{M}$, and $\mathcal{P}$ look like?
$\mathcal{C}(\theta) = \sum_i \log p(x_i, y_i)$
$\mathcal{M}(\theta) = \sum_i \log p(x_i) = \sum_i \log \sum_k p(x_i, y = k)$
$\mathcal{P}(\theta) = \sum_i \log p(y_i \mid x_i)$
By the definition of conditional probability, $p_\theta(Y \mid X) = \frac{p_\theta(X, Y)}{p_\theta(X)}$; by algebra, $p_\theta(X) = \frac{p_\theta(X, Y)}{p_\theta(Y \mid X)}$.
Taking logs and summing over the data:
$\mathcal{M}(\theta) = \mathcal{C}(\theta) - \mathcal{P}(\theta)$
Take a conditional expectation over $Y \sim \theta^{(t)}$ given $X$ (why? we’ll cover this more in variational inference):
$\mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{M}(\theta) \mid X] = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$
Since $\mathcal{M}$ already sums over $Y$, the left side is unchanged:
$\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$
The first term is the expected complete-data log-likelihood:
$\mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X] = \sum_i \sum_k p_{\theta^{(t)}}(y = k \mid x_i) \log p_\theta(x_i, y = k)$
Abbreviate $\mathcal{M}(\theta) = Q(\theta, \theta^{(t)}) - R(\theta, \theta^{(t)})$, where $Q(\theta, \theta^{(t)}) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X]$ and $R(\theta, \theta^{(t)}) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$.
Let $\theta^*$ be the value that maximizes $Q(\theta, \theta^{(t)})$. Then
$\mathcal{M}(\theta^*) - \mathcal{M}(\theta^{(t)}) = \left[ Q(\theta^*, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right] - \left[ R(\theta^*, \theta^{(t)}) - R(\theta^{(t)}, \theta^{(t)}) \right]$
The first bracket is $\geq 0$ (because $\theta^*$ maximizes $Q$); the second bracket is $\leq 0$ (we’ll see why with Jensen’s inequality, in variational inference). Therefore:
$\mathcal{M}(\theta^*) - \mathcal{M}(\theta^{(t)}) \geq 0$: EM does not decrease the marginal log-likelihood.
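As an informal numerical check of this guarantee, one can track the marginal log-likelihood across iterations (reusing the hypothetical em_three_coins sketch from the three-coins section) and confirm it never decreases:

```python
import math

def marginal_ll(obs, lam, gam, psi):
    # M(theta) = sum_i log p(w_i) = sum_i log [p(penny=H path) + p(penny=T path)]
    return sum(
        math.log(lam * (gam if w == "H" else 1 - gam)
                 + (1 - lam) * (psi if w == "H" else 1 - psi))
        for w in obs
    )

obs, params = list("HTHTTT"), (0.6, 0.8, 0.6)
for t in range(5):
    print(t, round(marginal_ll(obs, *params), 4))  # non-decreasing in t
    params = em_three_coins(obs, *params, iters=1)
```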
Generalized EM
Partial M-step: find a θ that simply increases, rather than maximizes, Q
Partial E-step: only consider some of the variables (an online learning algorithm)
EM has its pitfalls
The objective is not convex: EM can converge to a bad local optimum.
Computing expectations can be hard: the E-step may require clever algorithms.
How well does log-likelihood correlate with the end task?
A Maximization-Maximization Procedure
$F(\theta, q) = \mathbb{E}_q[\mathcal{C}(\theta)] - \mathbb{E}_q[\log q(Z)]$
where $q$ is any distribution over Z; F lower-bounds the observed-data log-likelihood. We’ll see this again with variational inference.