Introduction to Expectation Maximization
CMSC 678, UMBC
March 28th, 2018
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Announcement 1: Assignment 3
Due Wednesday April 11th, 11:59 AM
Any questions?
Announcement 2: Progress Report on Project
Due Monday April 16th, 11:59 AM
Build on the proposal:
• Update to address comments
• Discuss the progress you’ve made
• Discuss what remains to be done
• Discuss any new blocks you’ve experienced (or anticipate experiencing)
Any questions?
Recap(+) from last time…
Optimization for SVMs
Separable case (hard margin SVM): maximize the margin
Non-separable case (soft margin SVM): maximize the margin while minimizing slack
Subgradient of Hinge Loss
A subgradient is any direction that lies below the function. For the hinge loss $\ell(y, \hat{y}) = \max\{0, 1 - y\, w^T x\}$, with $\hat{y} = w^T x$, a possible subgradient is $-y x$ when $y\, w^T x < 1$ and $0$ otherwise.
Plotted against $z = w^T x$ (for $y = 1$), the loss is not differentiable at $z = 1$.
Example: Hinge Loss Objective
The loss term is active only for points that violate the margin; on those points, the subgradient step is the perceptron update.
Take-away: the perceptron algorithm optimizes a hinge loss objective.
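To make the connection concrete, here is a minimal sketch (hypothetical data; not from the slides) of one subgradient step on the hinge loss; on a misclassified point with learning rate 1, the step is exactly the perceptron update:

```python
def hinge_subgradient_step(w, x, y, lr=1.0):
    # Hinge loss: l(y, w.x) = max(0, 1 - y * (w.x)).
    # A subgradient w.r.t. w is -y*x when y * (w.x) < 1, and 0 otherwise.
    score = sum(wi * xi for wi, xi in zip(w, x))
    if y * score < 1:  # inside the margin or misclassified
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]  # step opposite the subgradient
    return w

# A misclassified point (score = 0 < 1) triggers the perceptron-style update.
print(hinge_subgradient_step([0.0, 0.0], x=[1.0, 2.0], y=1))  # [1.0, 2.0]
```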
Lagrange multipliers: assume an original optimization problem; we convert it to a new optimization problem.
Dual formulation for SVM
Basics of Probability
• Requirements to be a distribution (“proportional to”, ∝)
• Definitions of conditional probability, joint probability, and independence
• Bayes rule, (probability) chain rule
• Expectation (of a random variable & function)

Empirical Risk Minimization
• Gradient descent
• Loss functions: what are they, what do they measure, and what are some computational difficulties with them?
• Regularization: what is it, how does it work, and why might you want it?

Tasks (High Level)
• Data set splits: training vs. dev vs. test
• Classification: posterior decoding/MAP classifier
• Classification evaluations: accuracy, precision, recall, and F scores
• Regression (vs. classification)
• Comparing supervised vs. unsupervised learning and their tradeoffs: why might you want to use one vs. the other, and what are some potential issues?
• Clustering: high-level goal/task, K-means as an example
• Tradeoffs among clustering evaluations

Linear Models
• Basic form of a linear model (classification or regression)
• Perceptron (simple vs. other variants, like averaged or voted)
• When you should use perceptron (what are its assumptions?)
• Perceptron as SGD

Maximum Entropy Models
• Meanings of feature functions and weights
• How to learn the weights: gradient descent
• Meaning of the maxent gradient

Neural Networks
• Relation to linear models and maxent
• Types (feedforward, CNN, RNN)
• Learning representations (e.g., “feature maps”)
• What is a convolution (e.g., 1D vs. 2D; high-level notions of why you might want to change padding or the width)
• How to learn: gradient descent, backprop
• Common activation functions
• Neural network regularization

Dimensionality Reduction
• What is the basic task & goal in dimensionality reduction?
• Dimensionality reduction tradeoffs: why might you want to, and what are some potential issues?
• Linear Discriminant Analysis vs. Principal Component Analysis: what are they trying to do, how are they similar, how do they differ?

Kernel Methods & SVMs
• Feature expansion and kernels
• Two views: maximizing a separating hyperplane margin vs. loss optimization (norm minimization)
• Non-separability & slack
• Sub-gradients
Course Overview (so far)
Remember from the first day: A Terminology Buffet
the task (what kind of problem are you solving?): Classification, Regression, Clustering
the data (amount of human input/number of labeled examples): Fully-supervised, Semi-supervised, Un-supervised
the approach (how any data are being used): Probabilistic, Generative, Conditional, Spectral, Neural, Memory-based, Exemplar…
what we’ve currently sampled… what we’ll be sampling next…
LATENT VARIABLE MODELS AND EXPECTATION MAXIMIZATION
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Is (Supervised) Classification “Latent?” Not Obviously
ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
best label = $\arg\max_{\text{label}}\, p(\text{label} \mid \text{item})$ (MAP classifier)
These values are unknown, but the generation process (explanation) is transparent.
Input: an instance d; a fixed set of classes C = {c1, c2, …, cJ}; a training set of m hand-labeled instances (d1, c1), …, (dm, cm)
Output: a learned (probabilistic) classifier γ that maps instances to classes
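As a small illustration (hypothetical posterior values, not from the slides), MAP classification just picks the label with the highest posterior:

```python
# Hypothetical posteriors p(label | item) for one document.
posterior = {"ATTACK": 0.7, "ELECTION": 0.2, "SPORTS": 0.1}

# MAP classifier: best label = argmax_label p(label | item).
best_label = max(posterior, key=posterior.get)
print(best_label)  # ATTACK
```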
Latent Sequences: Part of Speech
British Left Waffles on Falkland Islands
Two readings: (i) Adjective Noun Verb … vs. (ii) Noun Verb Noun …
Levels of linguistic structure behind the observed text: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse. (Adapted from Jason Eisner, Noah Smith)
Latent Modeling
Explain what you see/annotate with things “of importance” that you don’t observe.
Latent Sequence Models: Part of Speech
p(British Left Waffles on Falkland Islands)
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)
2. Produce a tag sequence for this sentence
Example: Rolling a Die
$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$
N different (independent) rolls: $w_1 = 1$, $w_2 = 5$, $w_3 = 4$, …
Maximize the (log-)likelihood to learn the probability parameters.
Maximum likelihood estimates: p(1) = 2/9, p(2) = 1/9, p(3) = 1/9, p(4) = 3/9, p(5) = 1/9, p(6) = 1/9
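As a quick sketch (the nine rolls below are hypothetical, chosen to reproduce the estimates above), the maximum likelihood estimate for a die is just normalized counts:

```python
from collections import Counter

# Hypothetical 9 rolls consistent with the MLEs above:
# two 1s, three 4s, and one each of 2, 3, 5, 6.
rolls = [1, 1, 2, 3, 4, 4, 4, 5, 6]

# MLE for a categorical distribution: p(face) = count(face) / total rolls.
counts = Counter(rolls)
mle = {face: counts[face] / len(rolls) for face in range(1, 7)}
print(mle)  # {1: 0.22..., 2: 0.11..., 3: 0.11..., 4: 0.33..., 5: 0.11..., 6: 0.11...}
```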
Example: Conditionally Rolling a Die
Add complexity to better explain what we see:
$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$
$w_1 = 1, z_1 = A$; $w_2 = 5, z_2 = B$; …
Examples of latent classes z: part of speech tag; topic (“sports” vs. “politics”)
Goal: maximize the (log-)likelihood. But we don’t actually observe these z values; we just see the observed items (rolls) w.
The goal therefore becomes: maximize the marginalized (log-)likelihood
$p(w_1, w_2, \ldots, w_N) = \sum_{z_1} p(z_1, w_1)\, \sum_{z_2} p(z_2, w_2) \cdots \sum_{z_N} p(z_N, w_N)$
A chicken-and-egg problem:
If we did observe z, estimating the probability parameters would be easy… but we don’t! :(
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(
http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg
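To make the marginalization concrete, here is a minimal sketch (with hypothetical parameters for a two-die mixture; none of this is from the slides) that sums out each $z_i$ to evaluate the marginal log-likelihood of the observed rolls:

```python
import math

# Hypothetical two-die mixture: latent z in {"A", "B"}.
p_z = {"A": 0.5, "B": 0.5}                       # p(z)
p_w_given_z = {                                  # p(w | z) over faces 1..6
    "A": {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},
    "B": {face: 1 / 6 for face in range(1, 7)},
}

def marginal_log_likelihood(rolls):
    # log p(w_1, ..., w_N) = sum_i log sum_z p(z) * p(w_i | z)
    return sum(
        math.log(sum(p_z[z] * p_w_given_z[z][w] for z in p_z))
        for w in rolls
    )

print(marginal_log_likelihood([1, 5, 4, 1]))
```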
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Expectation Maximization (EM)
0. Assume some value for your parameters.
A two-step, iterative algorithm:
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize the log-likelihood, assuming these uncertain counts
Expectation Maximization (EM): E-step
1. E-step: count under uncertainty, assuming the current parameters: accumulate expected counts $\mathrm{count}(z_i, w_i)\, p(z_i)$.
We’ve already seen this type of counting, when computing the gradient in maxent models.
Expectation Maximization (EM): M-step
2. M-step: maximize the log-likelihood, assuming these uncertain counts: update $p^{(t)}(z) \to p^{(t+1)}(z)$ from the estimated counts.
EM Math
$\max_\theta\; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)} \left[ \log p_\theta(z, w) \right]$
E-step: count under uncertainty; the expectation is taken under the posterior distribution given the old parameters $\theta^{(t)}$.
M-step: maximize the log-likelihood with respect to the new parameters $\theta$.
Why EM? Semi-Supervised Learning
labeled data: human annotated; relatively small/few examples
unlabeled data: raw, not annotated; plentiful
EM lets a model learn from both the labeled and the unlabeled data.
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Three Coins Example
Imagine three coins. Flip the 1st coin (penny). If heads: flip the 2nd coin (dollar coin). If tails: flip the 3rd coin (dime).
We only observe the second flip (we record its heads vs. tails outcome); we don’t observe the penny.
Analogues: observed: a, b, e, etc., or “We run the code” vs. “The run failed”; unobserved: part of speech? genre?
Each coin has its own heads probability:
Penny: $p(\text{heads}) = \lambda$, $p(\text{tails}) = 1 - \lambda$
Dollar coin: $p(\text{heads}) = \gamma$, $p(\text{tails}) = 1 - \gamma$
Dime: $p(\text{heads}) = \psi$, $p(\text{tails}) = 1 - \psi$
Three parameters to estimate: λ, γ, and ψ
Three Coins Example: if all flips were observed
penny: H H T H T H
2nd flip: H T H T T T
Penny: $p(\text{heads}) = 4/6$, $p(\text{tails}) = 2/6$
Dollar coin (flipped on the four penny-heads): $p(\text{heads}) = 1/4$, $p(\text{tails}) = 3/4$
Dime (flipped on the two penny-tails): $p(\text{heads}) = 1/2$, $p(\text{tails}) = 1/2$
Three Coins Example: but not all flips are observed
Observed: H T H T T T (the penny row is hidden)
Set parameter values: penny $p(\text{heads}) = \lambda = .6$, $p(\text{tails}) = .4$; dollar coin $p(\text{heads}) = .8$, $p(\text{tails}) = .2$; dime $p(\text{heads}) = .6$, $p(\text{tails}) = .4$
Use these values to compute posteriors over the hidden penny, rewriting each joint with Bayes rule:
$p(\text{heads} \mid \text{observed item H}) = \frac{p(\text{heads} \,\&\, \text{H})}{p(\text{H})} = \frac{p(\text{H} \mid \text{heads})\, p(\text{heads})}{p(\text{H})}$, and likewise $p(\text{heads} \mid \text{observed item T}) = \frac{p(\text{T} \mid \text{heads})\, p(\text{heads})}{p(\text{T})}$
Here $p(\text{H} \mid \text{heads}) = .8$ and $p(\text{T} \mid \text{heads}) = .2$ (the dollar coin), and the denominator is the marginal likelihood:
$p(\text{H}) = p(\text{H} \mid \text{heads})\, p(\text{heads}) + p(\text{H} \mid \text{tails})\, p(\text{tails}) = .8 \cdot .6 + .6 \cdot .4 = .72$
$p(\text{heads} \mid \text{obs. H}) = \frac{p(\text{H} \mid \text{heads})\, p(\text{heads})}{p(\text{H})} = \frac{.8 \cdot .6}{.8 \cdot .6 + .6 \cdot .4} \approx 0.667$
$p(\text{heads} \mid \text{obs. T}) = \frac{p(\text{T} \mid \text{heads})\, p(\text{heads})}{p(\text{T})} = \frac{.2 \cdot .6}{.2 \cdot .6 + .4 \cdot .4} \approx 0.429$
Q: Is $p(\text{heads} \mid \text{obs. H}) + p(\text{heads} \mid \text{obs. T}) = 1$?
A: No. In general, $p(\text{heads} \mid \text{obs. H})$ and $p(\text{heads} \mid \text{obs. T})$ do NOT sum to 1.
Three Coins Example: use posteriors to update parameters
Fully observed setting: $p(\text{heads}) = \frac{\#\,\text{heads from penny}}{\#\,\text{total flips of penny}}$
Our setting, partially observed: $p^{(t+1)}(\text{heads}) = \frac{\#\,\textit{expected}\,\text{heads from penny}}{\#\,\text{total flips of penny}} = \frac{\mathbb{E}_{p^{(t)}}[\#\,\text{heads from penny}]}{\#\,\text{total flips of penny}}$
With two observed H’s and four observed T’s among the six flips:
$p^{(t+1)}(\text{heads}) = \frac{2 \cdot p(\text{heads} \mid \text{obs. H}) + 4 \cdot p(\text{heads} \mid \text{obs. T})}{6} = \frac{2 \cdot 0.667 + 4 \cdot 0.429}{6} \approx 0.508$
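Putting the E-step and M-step together, here is a minimal EM sketch for the three-coins model (em_three_coins is a hypothetical helper, not from the slides); starting from the initialization above, one iteration reproduces the penny update, and it re-estimates the dollar coin and dime as well:

```python
def em_three_coins(obs, lam, gam, psi, iters=20):
    """EM for the three-coins model.

    obs: observed 2nd-flip outcomes ('H'/'T')
    lam: p(penny=heads); gam: p(dollar=heads); psi: p(dime=heads)
    """
    for _ in range(iters):
        # E-step: posterior that the hidden penny was heads, per observation.
        q = []
        for w in obs:
            ph = lam * (gam if w == "H" else 1 - gam)        # penny=H path
            pt = (1 - lam) * (psi if w == "H" else 1 - psi)  # penny=T path
            q.append(ph / (ph + pt))
        # M-step: re-estimate each parameter from the expected counts.
        lam = sum(q) / len(q)
        gam = sum(qi for qi, w in zip(q, obs) if w == "H") / sum(q)
        psi = sum(1 - qi for qi, w in zip(q, obs) if w == "H") / sum(1 - qi for qi in q)
    return lam, gam, psi

# One iteration from the initialization above: lam moves from 0.6 to ~0.508.
print(em_three_coins(list("HTHTTT"), lam=0.6, gam=0.8, psi=0.6, iters=1))
```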
Expectation Maximization (EM)
0. Assume some value for your parameters.
A two-step, iterative algorithm:
1. E-step: count under uncertainty (compute expectations)
2. M-step: maximize the log-likelihood, assuming these uncertain counts
Outline
Administrivia & recap
Latent and probabilistic modeling
EM (Expectation Maximization): basic idea, three coins example, why EM works
Why does EM work?
$X$: observed data; $Y$: unobserved data
$\mathcal{C}(\theta)$ = log-likelihood of complete data (X, Y)
$\mathcal{P}(\theta)$ = posterior log-likelihood of incomplete data Y
$\mathcal{M}(\theta)$ = marginal log-likelihood of observed data X
What do $\mathcal{C}$, $\mathcal{M}$, and $\mathcal{P}$ look like?
$\mathcal{C}(\theta) = \sum_i \log p(x_i, y_i)$
$\mathcal{M}(\theta) = \sum_i \log p(x_i) = \sum_i \log \sum_k p(x_i, y = k)$
$\mathcal{P}(\theta) = \sum_i \log p(y_i \mid x_i)$
By the definition of conditional probability, $p_\theta(Y \mid X) = \frac{p_\theta(X, Y)}{p_\theta(X)}$; by algebra, $p_\theta(X) = \frac{p_\theta(X, Y)}{p_\theta(Y \mid X)}$.
Taking logs and summing over the data:
$\mathcal{M}(\theta) = \mathcal{C}(\theta) - \mathcal{P}(\theta)$
Take a conditional expectation over $Y \sim \theta^{(t)}$ given $X$ (why? we’ll cover this more in variational inference):
$\mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{M}(\theta) \mid X] = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$
Since $\mathcal{M}$ already sums over $Y$, the left side is unchanged:
$\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$
The first term is the expected complete-data log-likelihood:
$\mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X] = \sum_i \sum_k p_{\theta^{(t)}}(y = k \mid x_i) \log p_\theta(x_i, y = k)$
Abbreviate $\mathcal{M}(\theta) = Q(\theta, \theta^{(t)}) - R(\theta, \theta^{(t)})$, where $Q(\theta, \theta^{(t)}) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X]$ and $R(\theta, \theta^{(t)}) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$.
Let $\theta^*$ be the value that maximizes $Q(\theta, \theta^{(t)})$. Then
$\mathcal{M}(\theta^*) - \mathcal{M}(\theta^{(t)}) = \left[ Q(\theta^*, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right] - \left[ R(\theta^*, \theta^{(t)}) - R(\theta^{(t)}, \theta^{(t)}) \right]$
The first bracket is $\geq 0$ (because $\theta^*$ maximizes $Q$); the second bracket is $\leq 0$ (we’ll see why with Jensen’s inequality, in variational inference). Therefore:
$\mathcal{M}(\theta^*) - \mathcal{M}(\theta^{(t)}) \geq 0$: EM does not decrease the marginal log-likelihood.
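As an informal numerical check of this guarantee, one can track the marginal log-likelihood across iterations (reusing the hypothetical em_three_coins sketch from the three-coins section) and confirm it never decreases:

```python
import math

def marginal_ll(obs, lam, gam, psi):
    # M(theta) = sum_i log p(w_i) = sum_i log [p(penny=H path) + p(penny=T path)]
    return sum(
        math.log(lam * (gam if w == "H" else 1 - gam)
                 + (1 - lam) * (psi if w == "H" else 1 - psi))
        for w in obs
    )

obs, params = list("HTHTTT"), (0.6, 0.8, 0.6)
for t in range(5):
    print(t, round(marginal_ll(obs, *params), 4))  # non-decreasing in t
    params = em_three_coins(obs, *params, iters=1)
```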
Generalized EM
Partial M-step: find a θ that simply increases, rather than maximizes, Q
Partial E-step: only consider some of the variables (an online learning algorithm)
EM has its pitfalls
The objective is not convex: EM can converge to a bad local optimum.
Computing expectations can be hard: the E-step may require clever algorithms.
How well does log-likelihood correlate with the end task?
A Maximization-Maximization Procedure
$F(\theta, q) = \mathbb{E}_q[\mathcal{C}(\theta)] - \mathbb{E}_q[\log q(Z)]$
where $q$ is any distribution over Z; F lower-bounds the observed-data log-likelihood. We’ll see this again with variational inference.