
Introduction to Expectation Maximization

CMSC 678, UMBC

March 28th, 2018

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization): basic idea, three coins example, why EM works

Announcement 1: Assignment 3

Due Wednesday April 11th, 11:59 AM

Any questions?

Announcement 2: Progress Report on Project

Due Monday April 16th, 11:59 AM

Build on the proposal:
Update to address comments
Discuss the progress you’ve made
Discuss what remains to be done
Discuss any new blocks you’ve experienced

(or anticipate experiencing)

Any questions?

Recap(+) from last time…

Optimization for SVMs

Separable case: hard margin SVM → maximize the margin

Non-separable case: soft margin SVM → maximize the margin, minimize the slack

Subgradient of Hinge Loss

A subgradient is the slope of any line that lies below the function (a linear lower bound).

ŷ = wᵀx,   ℓ(y, ŷ) = max{0, 1 − y·wᵀx}

The hinge loss is not differentiable at the kink where y·wᵀx = 1 (z = 1 in the plot of the loss against z = wᵀx); a possible subgradient with respect to w is −y·x when y·wᵀx < 1 and 0 otherwise.

Example: Hinge Loss Objective

In the hinge-loss objective, the loss term is active only for points that violate the margin, and for those points the subgradient update has the same form as the perceptron update.

Take-away: the perceptron algorithm optimizes a hinge loss objective
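As a concrete illustration of the recap above, here is a minimal Python sketch (my own, not from the slides) of one possible subgradient of the hinge loss and the resulting SGD step; the function names are illustrative only.

import numpy as np

def hinge_subgradient(w, x, y):
    # ℓ(y, ŷ) = max{0, 1 − y·wᵀx}; at the kink (y·wᵀx = 1) the loss is not
    # differentiable, and returning 0 there is one valid choice of subgradient.
    w, x = np.asarray(w), np.asarray(x)
    if y * np.dot(w, x) < 1:
        return -y * x              # margin violated: gradient of 1 − y·wᵀx
    return np.zeros_like(w)        # margin satisfied: 0 is a subgradient

def sgd_step(w, x, y, lr=1.0):
    # with lr = 1, a step on a margin-violating example adds y·x to w,
    # the same form as the perceptron update
    return np.asarray(w) - lr * hinge_subgradient(w, x, y)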

Assume an original optimization problem

We convert it to a new optimization problem:

Lagrange multipliers

Dual formulation for SVM

Basics of Probability
Requirements to be a distribution (“proportional to”, ∝)
Definitions of conditional probability, joint probability, and independence
Bayes rule, (probability) chain rule
Expectation (of a random variable & function)

Empirical Risk Minimization
Gradient descent
Loss functions: what are they, what do they measure, and what are some computational difficulties with them?
Regularization: what is it, how does it work, and why might you want it?

Tasks (High Level)
Data set splits: training vs. dev vs. test
Classification: posterior decoding/MAP classifier
Classification evaluations: accuracy, precision, recall, and F scores
Regression (vs. classification)
Comparing supervised vs. unsupervised learning and their tradeoffs: why might you want to use one vs. the other, and what are some potential issues?
Clustering: high-level goal/task, K-means as an example
Tradeoffs among clustering evaluations

Linear Models
Basic form of a linear model (classification or regression)
Perceptron (simple vs. other variants, like averaged or voted)
When you should use the perceptron (what are its assumptions?)
Perceptron as SGD

Maximum Entropy Models
Meanings of feature functions and weights
How to learn the weights: gradient descent
Meaning of the maxent gradient

Neural Networks
Relation to linear models and maxent
Types (feedforward, CNN, RNN)
Learning representations (e.g., “feature maps”)
What is a convolution (e.g., 1D vs. 2D; high-level notions of why you might want to change the padding or the width)
How to learn: gradient descent, backprop
Common activation functions
Neural network regularization

Dimensionality Reduction
What is the basic task & goal in dimensionality reduction?
Dimensionality reduction tradeoffs: why might you want to do it, and what are some potential issues?
Linear Discriminant Analysis vs. Principal Component Analysis: what are they trying to do, how are they similar, and how do they differ?

Kernel Methods & SVMs
Feature expansion and kernels
Two views: maximizing a separating hyperplane margin vs. loss optimization (norm minimization)
Non-separability & slack
Sub-gradients

Course Overview (so far)

Remember from the first day: A Terminology Buffet

Classification

Regression

Clustering

Fully-supervised

Semi-supervised

Un-supervised

Probabilistic

Generative

Conditional

Spectral

Neural

Memory-based

Exemplar…

the data: amount of human input/number of labeled examples

the approach: how any data are being used

the task: what kind of problem are you solving?

what we’ve currently sampled… what we’ll be sampling next…

LATENT VARIABLE MODELS AND EXPECTATION MAXIMIZATION

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization): basic idea, three coins example, why EM works

Is (Supervised) Classification “Latent?” Not Obviously

ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

best label = argmax_label p(label | item)   (MAP classifier)

These label values are unknown, but the generation process (explanation) is transparent.

Input: an instance d; a fixed set of classes C = {c1, c2, …, cJ}; a training set of m hand-labeled instances (d1, c1), …, (dm, cm)

Output: a learned (probabilistic) classifier γ that maps instances to classes
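A small Python sketch (my own, not from the slides) of the MAP decision rule above; classes, log_prior, and log_likelihood are hypothetical stand-ins for the learned classifier γ.

def map_classify(item, classes, log_prior, log_likelihood):
    # best label = argmax_label p(label | item) ∝ p(item | label) p(label)
    return max(classes, key=lambda c: log_prior[c] + log_likelihood(item, c))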

Latent Sequences: Part of Speech

British Left Waffles on Falkland Islands
Adjective Noun Verb …
Noun Verb Noun …

Levels of structure behind the observed text: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse

Adapted from Jason Eisner, Noah Smith

Latent Modeling

explain what you see/annotate with things “of importance” you don’t

(observed text, explained by latent levels: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse)

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)

2. Produce a tag sequence for this sentence

Example: Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

N different (independent) rolls, e.g. w_1 = 1, w_2 = 5, w_3 = 4, …

maximize the (log-)likelihood to learn the probability parameters

maximum likelihood estimates:
p(1) = 2/9   p(2) = 1/9   p(3) = 1/9
p(4) = 3/9   p(5) = 1/9   p(6) = 1/9
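A quick Python check of the maximum likelihood estimates (a sketch; the roll sequence is hypothetical data chosen to match the counts above):

from collections import Counter

rolls = [1, 4, 4, 1, 3, 4, 6, 2, 5]          # 9 i.i.d. rolls (hypothetical)
counts = Counter(rolls)
mle = {face: counts[face] / len(rolls) for face in range(1, 7)}
# mle[1] == 2/9, mle[4] == 3/9, etc.: relative frequencies maximize the likelihood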

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)

add complexity to better explain what we see, e.g. w_1 = 1 with z_1 = A, w_2 = 5 with z_2 = B, …

examples of latent classes z:
• part of speech tag
• topic (“sports” vs. “politics”)

goal: maximize the (log-)likelihood

we don’t actually observe these z values; we just see the observed items (rolls) w
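A minimal Python sketch of the joint probability above, assuming p_z and p_w_given_z are dictionaries holding the model’s parameters (the names are mine):

import math

def log_joint(z_seq, w_seq, p_z, p_w_given_z):
    # log p(z_1, w_1, …, z_N, w_N) = Σ_i [ log p(z_i) + log p(w_i | z_i) ]
    return sum(math.log(p_z[z]) + math.log(p_w_given_z[z][w])
               for z, w in zip(z_seq, w_seq))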

Example: Conditionally Rolling a Die

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)

we don’t actually observe these z values

goal: maximize the marginalized (log-)likelihood

each observed w is compatible with any of the latent values: w → (z1 & w), (z2 & w), (z3 & w), (z4 & w), …

p(w_1, w_2, …, w_N) = (Σ_{z_1} p(z_1, w_1)) (Σ_{z_2} p(z_2, w_2)) ⋯ (Σ_{z_N} p(z_N, w_N))
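And the marginalized likelihood, as a Python sketch under the same assumed parameter dictionaries; because the rolls are independent, each z_i can be summed out separately.

import math

def log_marginal(w_seq, p_z, p_w_given_z):
    # log p(w_1, …, w_N) = Σ_i log Σ_z p(z) p(w_i | z)
    return sum(math.log(sum(p_z[z] * p_w_given_z[z][w] for z in p_z))
               for w in w_seq)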

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(

(a chicken-or-egg problem; image: http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg)

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization): basic idea, three coins example, why EM works

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty (compute expectations)

2. M-step: maximize log-likelihood, assuming these uncertain counts

Expectation Maximization (EM): E-step

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

current parameters p(z_i) → expected counts count(z_i, w_i)

We’ve already seen this type of counting, when computing the gradient in maxent models.

Expectation Maximization (EM): M-step

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts → new parameters p^(t+1)(z), replacing the old p^(t)(z)
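Putting the two steps together, a generic EM skeleton might look like the following Python sketch (e_step, m_step, and the fixed iteration count are placeholders, not a specific implementation from the slides):

def em(init_params, e_step, m_step, n_iters=50):
    params = init_params
    for _ in range(n_iters):
        expected_counts = e_step(params)   # E-step: count under uncertainty
        params = m_step(expected_counts)   # M-step: maximize log-likelihood
                                           #         given those soft counts
    return params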

EM Math

max_θ  𝔼_{z ∼ p_{θ^(t)}(·|w)} [ log p_θ(z, w) ]

E-step: count under uncertainty — the expectation is taken under the posterior distribution p_{θ^(t)}(z | w), computed with the old parameters θ^(t)

M-step: maximize log-likelihood — the max over θ, and the log p_θ(z, w) inside it, use the new parameters
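Written out as an explicit sum over the latent values (assuming z is discrete), the quantity being maximized in the M-step is:

\max_\theta \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\!\left[\log p_\theta(z, w)\right]
\;=\; \max_\theta \sum_{z} p_{\theta^{(t)}}(z \mid w)\,\log p_\theta(z, w)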

Why EM? Semi-Supervised Learning

labeled data:
• human annotated
• relatively small/few examples

unlabeled data (the grid of “?” marks):
• raw; not annotated
• plentiful

EM connects the two: it can use the small labeled set together with expected counts computed on the plentiful unlabeled set.

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization): basic idea, three coins example, why EM works

Three Coins Example

Imagine three coins

Flip the 1st coin (penny)

If heads: flip the 2nd coin (dollar coin)

If tails: flip the 3rd coin (dime)

We only observe the outcome (heads vs. tails) of the second flip; we don’t observe the penny.

Analogy — observed: characters a, b, e, etc.; sentences like “We run the code” vs. “The run failed”. Unobserved: part of speech? genre?

Three Coins Example

Imagine three coins

Flip the 1st coin (penny): p(heads) = λ, p(tails) = 1 − λ

If heads: flip the 2nd coin (dollar coin): p(heads) = γ, p(tails) = 1 − γ

If tails: flip the 3rd coin (dime): p(heads) = ψ, p(tails) = 1 − ψ

Three parameters to estimate: λ, γ, and ψ
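A small Python sketch of this generative story (my own simulation, not part of the slides); only the returned outcome would be observed.

import random

def flip(p_heads):
    return 'H' if random.random() < p_heads else 'T'

def sample_observation(lam, gamma, psi):
    z = flip(lam)                          # penny: latent, never recorded
    w = flip(gamma if z == 'H' else psi)   # dollar coin if heads, dime if tails
    return w                               # only this outcome is observed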

Three Coins Example

If all flips were observed:

penny:               H H T H T H
dollar coin / dime:  H T H T T T

maximum likelihood estimates:
penny:       p(heads) = 4/6, p(tails) = 2/6
dollar coin: p(heads) = 1/4, p(tails) = 3/4
dime:        p(heads) = 1/2, p(tails) = 1/2
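If the penny were observed, the maximum likelihood estimates are just relative frequencies; a Python sketch reproducing the numbers above (the row labels are my reading of the slide’s two rows):

penny    = ['H', 'H', 'T', 'H', 'T', 'H']   # 1st coin (latent in our real setting)
observed = ['H', 'T', 'H', 'T', 'T', 'T']   # 2nd/3rd coin outcome actually recorded

lam   = penny.count('H') / len(penny)                                                   # 4/6
gamma = sum(o == 'H' for p, o in zip(penny, observed) if p == 'H') / penny.count('H')   # 1/4
psi   = sum(o == 'H' for p, o in zip(penny, observed) if p == 'T') / penny.count('T')   # 1/2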

Three Coins Example

But not all flips are observed (the penny is hidden); set parameter values:

observed flips: H T H T T T

penny:       p(heads) = λ = .6, p(tails) = .4
dollar coin: p(heads) = γ = .8, p(tails) = .2
dime:        p(heads) = ψ = .6, p(tails) = .4

Use these values to compute posteriors:

p(heads | observed item H) = p(heads & H) / p(H) = p(H | heads) p(heads) / p(H)
p(heads | observed item T) = p(heads & T) / p(T) = p(T | heads) p(heads) / p(T)

(rewrite the joint using Bayes rule; the denominator is the marginal likelihood)

p(H | heads) = .8,   p(T | heads) = .2

p(H) = p(H | heads) · p(heads) + p(H | tails) · p(tails) = .8 · .6 + .6 · .4

Three Coins Example

observed flips: H T H T T T

p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667

p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429

(the denominator for p(T) uses p(T | tails) = 1 − ψ = .4)

Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?

A: No. (In general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1.)

Use posteriors to update parameters

fully observed setting:  p(heads) = (# heads from penny) / (# total flips of penny)

our setting, partially observed:  p(heads) = (# expected heads from penny) / (# total flips of penny)

Three Coins Example

Use posteriors to update parameters (our setting: partially observed)

p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny)
             = 𝔼_{p^(t)}[# heads from penny] / (# total flips of penny)
             = (2 · p(heads | obs. H) + 4 · p(heads | obs. T)) / 6 ≈ 0.508

(using p(heads | obs. H) ≈ 0.667 and p(heads | obs. T) ≈ 0.429 from the previous slide)
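One full E-step/M-step pass for the penny parameter, as a Python sketch using the slide’s initial values (λ = .6, γ = .8, ψ = .6) and the six observed outcomes:

lam, gamma, psi = 0.6, 0.8, 0.6
obs = ['H', 'T', 'H', 'T', 'T', 'T']

def posterior_heads(o):
    # p(penny heads | observed o), by Bayes rule
    like_h = gamma if o == 'H' else 1 - gamma   # dollar coin
    like_t = psi   if o == 'H' else 1 - psi     # dime
    return like_h * lam / (like_h * lam + like_t * (1 - lam))

# E-step: expected number of penny heads; M-step: re-estimate lambda
lam_next = sum(posterior_heads(o) for o in obs) / len(obs)   # ≈ 0.508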

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm:

1. E-step: count under uncertainty (compute expectations)

2. M-step: maximize log-likelihood, assuming these uncertain counts

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization): basic idea, three coins example, why EM works

Why does EM work?

X: observed data;  Y: unobserved data

𝒞(θ) = log-likelihood of the complete data (X, Y)
𝒫(θ) = posterior log-likelihood of the incomplete data Y
ℳ(θ) = marginal log-likelihood of the observed data X

What do 𝒞, ℳ, and 𝒫 look like?

𝒞(θ) = Σ_i log p(x_i, y_i)

ℳ(θ) = Σ_i log p(x_i) = Σ_i log Σ_k p(x_i, y = k)

𝒫(θ) = Σ_i log p(y_i | x_i)

Why does EM work?

p_θ(Y | X) = p_θ(X, Y) / p_θ(X)   (definition of conditional probability)

p_θ(X) = p_θ(X, Y) / p_θ(Y | X)   (algebra)

Taking logs and summing over the data:

ℳ(θ) = 𝒞(θ) − 𝒫(θ)
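Per example, the identity behind ℳ(θ) = 𝒞(θ) − 𝒫(θ) is just the definition of conditional probability, taken in log form:

\log p_\theta(x_i) \;=\; \log p_\theta(x_i, y_i) \;-\; \log p_\theta(y_i \mid x_i)
\quad\Longrightarrow\quad
\mathcal{M}(\theta) \;=\; \mathcal{C}(\theta) \;-\; \mathcal{P}(\theta)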

Why does EM work?

ℳ(θ) = 𝒞(θ) − 𝒫(θ)

Take a conditional expectation of both sides, with Y drawn from the posterior under the old parameters θ^(t) (why this expectation? we’ll cover this more in variational inference):

𝔼_{Y∼θ^(t)}[ℳ(θ) | X] = 𝔼_{Y∼θ^(t)}[𝒞(θ) | X] − 𝔼_{Y∼θ^(t)}[𝒫(θ) | X]

ℳ(θ) = 𝔼_{Y∼θ^(t)}[𝒞(θ) | X] − 𝔼_{Y∼θ^(t)}[𝒫(θ) | X]     (ℳ already sums over Y, so the expectation leaves it unchanged)

Why does EM work?

ℳ(θ) = 𝔼_{Y∼θ^(t)}[𝒞(θ) | X] − 𝔼_{Y∼θ^(t)}[𝒫(θ) | X]

𝔼_{Y∼θ^(t)}[𝒞(θ) | X] = Σ_i Σ_k p_{θ^(t)}(y = k | x_i) log p(x_i, y = k)

Why does EM work?

ℳ(θ) = 𝔼_{Y∼θ^(t)}[𝒞(θ) | X] − 𝔼_{Y∼θ^(t)}[𝒫(θ) | X]
     =        Q(θ, θ^(t))       −        R(θ, θ^(t))

Let θ* be the value that maximizes Q(θ, θ^(t)). Then

ℳ(θ*) − ℳ(θ^(t)) = [Q(θ*, θ^(t)) − Q(θ^(t), θ^(t))] − [R(θ*, θ^(t)) − R(θ^(t), θ^(t))]

The first bracket is ≥ 0 (θ* maximizes Q); the second is ≤ 0 (we’ll see why with Jensen’s inequality, in variational inference).

ℳ(θ*) − ℳ(θ^(t)) ≥ 0: EM does not decrease the marginal log-likelihood
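A small numerical check of this guarantee on the three coins example, as a Python sketch (the updates for γ and ψ follow the same expected-count recipe used for λ; all names are mine): the printed marginal log-likelihood should never decrease.

import math

obs = ['H', 'T', 'H', 'T', 'T', 'T']
lam, gamma, psi = 0.6, 0.8, 0.6

def marginal_loglik(lam, gamma, psi):
    # M(theta) = sum_i log [ p(o_i | heads) * lam + p(o_i | tails) * (1 - lam) ]
    return sum(math.log((gamma if o == 'H' else 1 - gamma) * lam +
                        (psi   if o == 'H' else 1 - psi) * (1 - lam))
               for o in obs)

for t in range(10):
    # E-step: posterior that the latent penny flip was heads, per observation
    post = [(gamma if o == 'H' else 1 - gamma) * lam /
            ((gamma if o == 'H' else 1 - gamma) * lam +
             (psi   if o == 'H' else 1 - psi) * (1 - lam))
            for o in obs]
    # M-step: expected-count updates for all three coins
    lam   = sum(post) / len(obs)
    gamma = sum(p for p, o in zip(post, obs) if o == 'H') / sum(post)
    psi   = sum(1 - p for p, o in zip(post, obs) if o == 'H') / sum(1 - p for p in post)
    print(t, round(marginal_loglik(lam, gamma, psi), 6))   # non-decreasing sequence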

Generalized EM

Partial M step: find a θ that simply increases, rather than maximizes, Q

Partial E step: only consider some of the variables (an online learning algorithm)

EM has its pitfalls

Objective is not convex → may converge to a bad local optimum

Computing expectations can be hard: the E-step could require clever algorithms

How well does log-likelihood correlate with an end task?

A Maximization-Maximization Procedure

F(θ, q) = 𝔼_q[𝒞(θ)] − 𝔼_q[log q(Z)],   where q is any distribution over Z

Alternately maximizing F in θ and in q (a maximization-maximization view of EM) also pushes up the observed-data log-likelihood.

we’ll see this again with variational inference
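Spelled out with 𝒞(θ) = log p_θ(X, Z), this is the familiar lower bound on the observed-data log-likelihood (a sketch of the claim; the full argument comes with variational inference):

F(\theta, q) \;=\; \mathbb{E}_{q(Z)}\!\left[\log p_\theta(X, Z)\right] \;-\; \mathbb{E}_{q(Z)}\!\left[\log q(Z)\right]
\;\le\; \log p_\theta(X) \;=\; \mathcal{M}(\theta),
\quad \text{with equality when } q(Z) = p_\theta(Z \mid X)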

Recommended