Introduction to Expectation Maximization, CMSC 678, UMBC, March 28th, 2018


Page 1

Introduction to Expectation Maximization

CMSC 678, UMBC

March 28th, 2018

Page 2

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization)
• Basic idea
• Three coins example
• Why EM works

Page 3

Announcement 1: Assignment 3

Due Wednesday April 11th, 11:59 AM

Any questions?

Page 4

Announcement 2: Progress Report on Project

Due Monday April 16th, 11:59 AM

Build on the proposal:
• Update to address comments
• Discuss the progress you’ve made
• Discuss what remains to be done
• Discuss any new blocks you’ve experienced (or anticipate experiencing)

Any questions?

Page 5

Recap (+) from last time…

Page 6

Optimization for SVMs

Separable case: hard margin SVM

Non-separable case: soft margin SVM

maximize margin

maximize margin and minimize slack

Page 7

Subgradient of Hinge Loss

A subgradient is any direction that lies below the function. For the hinge loss, a possible subgradient is:

z = wᵀx
ŷ = wᵀx
ℓ(y, ŷ) = max{0, 1 − y wᵀx}

(not differentiable at z = 1)

Page 8

Example: Hinge Loss

(figure: the hinge loss objective; the loss term is active only for points that violate the margin, and its subgradient yields the perceptron update)

Take-away: the perceptron algorithm optimizes a hinge loss objective
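The take-away can be checked numerically. This is a minimal sketch (the data point and step size are invented): one subgradient step on the hinge loss at a misclassified point is exactly a perceptron-style update w ← w + y·x.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of max{0, 1 - y * w^T x} with respect to w.

    At the kink (y * w^T x == 1) many subgradients are valid; we pick
    -y * x there for simplicity.
    """
    if y * (w @ x) <= 1:
        return -y * x            # margin violated: push w toward y * x
    return np.zeros_like(w)      # classified with margin: no update

# One subgradient step with step size 1 on a misclassified point:
w = np.zeros(2)
x, y = np.array([1.0, 2.0]), 1
w = w - hinge_subgradient(w, x, y)   # same as the perceptron update w + y*x
```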

Page 9

Assume an original optimization problem

We convert it to a new optimization problem:

Lagrange multipliers

Page 10

Dual formulation for SVM

Page 11

Course Overview (so far)

Basics of Probability
• Requirements to be a distribution (“proportional to”, ∝)
• Definitions of conditional probability, joint probability, and independence
• Bayes rule, (probability) chain rule
• Expectation (of a random variable & function)

Empirical Risk Minimization
• Gradient descent
• Loss functions: what are they, what do they measure, and what are some computational difficulties with them?
• Regularization: what is it, how does it work, and why might you want it?

Tasks (High Level)
• Data set splits: training vs. dev vs. test
• Classification: posterior decoding/MAP classifier
• Classification evaluations: accuracy, precision, recall, and F scores
• Regression (vs. classification)
• Comparing supervised vs. unsupervised learning and their tradeoffs: why might you want to use one vs. the other, and what are some potential issues?
• Clustering: high-level goal/task, K-means as an example
• Tradeoffs among clustering evaluations

Linear Models
• Basic form of a linear model (classification or regression)
• Perceptron (simple vs. other variants, like averaged or voted)
• When you should use perceptron (what are its assumptions?)
• Perceptron as SGD

Maximum Entropy Models
• Meanings of feature functions and weights
• How to learn the weights: gradient descent
• Meaning of the maxent gradient

Neural Networks
• Relation to linear models and maxent
• Types (feedforward, CNN, RNN)
• Learning representations (e.g., “feature maps”)
• What is a convolution (e.g., 1D vs. 2D; high-level notions of why you might want to change padding or the width)
• How to learn: gradient descent, backprop
• Common activation functions
• Neural network regularization

Dimensionality Reduction
• What is the basic task & goal in dimensionality reduction?
• Dimensionality reduction tradeoffs: why might you want to, and what are some potential issues?
• Linear Discriminant Analysis vs. Principal Component Analysis: what are they trying to do, how are they similar, how do they differ?

Kernel Methods & SVMs
• Feature expansion and kernels
• Two views: maximizing a separating hyperplane margin vs. loss optimization (norm minimization)
• Non-separability & slack
• Sub-gradients

Page 12

Remember from the first day:
A Terminology Buffet

Classification

Regression

Clustering

Fully-supervised

Semi-supervised

Un-supervised

Probabilistic

Generative

Conditional

Spectral

Neural

Memory-based

Exemplar…

the data: amount of human input/number of labeled examples

the approach: how any data are being used

the task: what kind of problem are you solving?

Page 13

(same terminology buffet as above)

what we’ve currently sampled…

Page 14

(same terminology buffet as above)

what we’ve currently sampled… what we’ll be sampling next…

Page 15

LATENT VARIABLE MODELS AND EXPECTATION MAXIMIZATION

Page 16

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization)
• Basic idea
• Three coins example
• Why EM works

Page 17

ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junín department, central Peruvian mountain region.

Is (Supervised) Classification “Latent?”

Page 18

ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junín department, central Peruvian mountain region.

Is (Supervised) Classification “Latent?” Not obviously

best label = argmax_label p(label | item)   (MAP classifier)

Page 19

ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junín department, central Peruvian mountain region.

Is (Supervised) Classification “Latent?” Not obviously

these values are unknown

but the generation process (explanation) is transparent

Input:
• an instance d
• a fixed set of classes C = {c1, c2, …, cJ}
• a training set of m hand-labeled instances (d1, c1), …, (dm, cm)

Output: a learned (probabilistic) classifier γ that maps instances to classes
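As a toy sketch of that output: posterior decoding (the MAP classifier) just takes an argmax over a model's posteriors p(c | d). The posterior numbers below are invented for illustration, not from the slides.

```python
def map_classify(posterior):
    """MAP / posterior decoding: return the class with the largest
    posterior probability p(c | d)."""
    return max(posterior, key=posterior.get)

# Hypothetical posteriors for the news snippet above.
prediction = map_classify({"ATTACK": 0.7, "ELECTION": 0.2, "SPORTS": 0.1})
```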

Page 20

Latent Sequences: Part of Speech

British Left Waffles on Falkland Islands

(i): Adjective Noun Verb
(ii): Noun Verb Noun

Page 21

orthography

morphology

Adapted from Jason Eisner, Noah Smith

lexemes

syntax

semantics

pragmatics

discourse

observed text

Page 22

Adapted from Jason Eisner, Noah Smith

Latent Modeling

explain what you see/annotate

with things “of importance” you don’t

orthography

morphology

lexemes

syntax

semantics

pragmatics

discourse

observed text

Page 23

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

Page 24

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Page 25

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)

(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Page 26

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)

2. Produce a tag sequence for this sentence

(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Page 27

Example: Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

Page 28

Example: Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

N different (independent) rolls

w_1 = 1
w_2 = 5
w_3 = 4

Page 29

Example: Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

maximize (log-) likelihood to learn the probability parameters

Page 30

Example: Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

maximize (log-) likelihood to learn the probability parameters

p(1) = ?

p(3) = ?

p(5) = ?

p(2) = ?

p(4) = ?

p(6) = ?

maximum likelihood estimates

Page 31

Example: Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

maximize (log-) likelihood to learn the probability parameters

p(1) = 2/9

p(3) = 1/9

p(5) = 1/9

p(2) = 1/9

p(4) = 3/9

p(6) = 1/9

maximum likelihood estimates
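The maximum likelihood estimates above are just normalized counts. A quick check, using a hypothetical nine-roll sample whose counts are consistent with the slide's estimates:

```python
from collections import Counter

# A hypothetical nine-roll sample: face 1 twice, face 4 three times,
# every other face once (consistent with the estimates above).
rolls = [1, 5, 4, 1, 4, 4, 2, 3, 6]

# MLE for a categorical distribution: p(face) = count(face) / total rolls.
counts = Counter(rolls)
mle = {face: counts[face] / len(rolls) for face in range(1, 7)}
```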

Page 32

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

add complexity to better explain what we see

Page 33

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

add complexity to better explain what we see

w_1 = 1, z_1 = A
w_2 = 5, z_2 = B
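The generative story behind the joint p(z, w) can be sampled directly: draw the latent class z first, then the observed roll w given z. The class labels A and B follow the slide; the probability tables themselves are invented for illustration.

```python
import random

random.seed(0)

# Illustrative parameters: p(z) and p(w | z) for two latent dice.
p_z = {"A": 0.5, "B": 0.5}
p_w_given_z = {
    "A": [0.4, 0.12, 0.12, 0.12, 0.12, 0.12],  # die A favors face 1
    "B": [0.12, 0.12, 0.12, 0.12, 0.4, 0.12],  # die B favors face 5
}

def sample_pair():
    # First draw the latent class z, then the observed roll w given z.
    z = random.choices(list(p_z), weights=list(p_z.values()))[0]
    w = random.choices(range(1, 7), weights=p_w_given_z[z])[0]
    return z, w

pairs = [sample_pair() for _ in range(5)]
```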

Page 34

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

add complexity to better explain what we see

examples of latent classes z:
• part of speech tag
• topic (“sports” vs. “politics”)

Page 35

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

add complexity to better explain what we see

goal: maximize (log-)likelihood

Page 36

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

we don’t actually observe these z values

we just see the observed items (rolls) w

add complexity to better explain what we see

goal: maximize (log-)likelihood

Page 37

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

we don’t actually observe these z values; we just see the items w

add complexity to better explain what we see

goal: maximize (log-)likelihood

if we did observe z, estimating the probability parameters would be easy…

but we don’t! :(

Page 38

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

we don’t actually observe these z values; we just see the items w

add complexity to better explain what we see

goal: maximize (log-)likelihood

if we knew the probability parameters, then we could estimate z and evaluate likelihood… but we don’t! :(

if we did observe z, estimating the probability parameters would be easy…

but we don’t! :(

Page 39

Example: Conditionally Rolling a Die

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

we don’t actually observe these z values

goal: maximize marginalized (log-)likelihood

Page 40

Example: Conditionally Rolling a Die

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

we don’t actually observe these z values

goal: maximize marginalized (log-)likelihood

Page 41

Example: Conditionally Rolling a Die

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

we don’t actually observe these z values

goal: maximize marginalized (log-)likelihood

(figure: the observation w decomposes into the joint events z_1 & w, z_2 & w, z_3 & w, z_4 & w)

Page 42

Example: Conditionally Rolling a Die

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

we don’t actually observe these z values

goal: maximize marginalized (log-)likelihood

(figure: the observation w decomposes into the joint events z_1 & w, z_2 & w, z_3 & w, z_4 & w)

p(w_1, w_2, …, w_N) = [∑_{z_1} p(z_1, w_1)] [∑_{z_2} p(z_2, w_2)] ⋯ [∑_{z_N} p(z_N, w_N)]
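Because the (z_i, w_i) pairs are independent, the marginalized likelihood factorizes into a per-roll sum over z, which is cheap to compute. A sketch with invented two-class parameters:

```python
# Illustrative parameters for two latent classes (not from the slides).
p_z = {"A": 0.6, "B": 0.4}
p_w_given_z = {
    "A": [0.4, 0.12, 0.12, 0.12, 0.12, 0.12],
    "B": [0.12, 0.12, 0.12, 0.12, 0.4, 0.12],
}

def marginal_likelihood(rolls):
    """p(w_1, ..., w_N) = prod_i sum_{z_i} p(w_i | z_i) p(z_i)."""
    total = 1.0
    for w in rolls:
        # Sum out the latent z for this roll.
        total *= sum(p_w_given_z[z][w - 1] * p_z[z] for z in p_z)
    return total

lik = marginal_likelihood([1, 5, 4])
```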

Page 43

Example: Conditionally Rolling a Die

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N)

goal: maximize marginalized (log-)likelihood

(figure: the observation w decomposes into the joint events z_1 & w, z_2 & w, z_3 & w, z_4 & w)

p(w_1, w_2, …, w_N) = [∑_{z_1} p(z_1, w_1)] [∑_{z_2} p(z_2, w_2)] ⋯ [∑_{z_N} p(z_N, w_N)]

if we did observe z, estimating the probability parameters would be easy…

but we don’t! :(

if we knew the probability parameters, then we could estimate z and evaluate likelihood… but we don’t! :(

Page 44

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

if we did observe z, estimating the probability parameters would be easy…

but we don’t! :(

if we knew the probability parameters, then we could estimate z and evaluate likelihood… but we don’t! :(

Page 45

(repeat of the previous slide)

Page 46

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization)
• Basic idea
• Three coins example
• Why EM works

Page 47

Expectation Maximization (EM)

0. Assume some value for your parameters

Two-step, iterative algorithm

1. E-step: count under uncertainty (compute expectations)

2. M-step: maximize log-likelihood, assuming these uncertain counts
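The two-step loop can be written as a small skeleton. Here `model` is any object exposing the two hypothetical `e_step`/`m_step` methods named in the comments; their concrete formulas depend on the model family.

```python
def em(model, data, iterations=20):
    """Generic EM loop: alternate expected counts and maximization."""
    for _ in range(iterations):
        counts = model.e_step(data)   # 1. count under uncertainty
        model.m_step(counts)          # 2. maximize, given those counts
    return model
```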

Page 48

Expectation Maximization (EM): E-step

0. Assume some value for your parameters

Two-step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

count(z_i, w_i) · p(z_i)

Page 49

Expectation Maximization (EM): E-step

0. Assume some value for your parameters

Two-step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

count(z_i, w_i) · p(z_i)

We’ve already seen this type of counting, when computing the gradient in maxent models.

Page 50

Expectation Maximization (EM): M-step

0. Assume some value for your parameters

Two-step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

p^(t+1)(z) from p^(t)(z) and the estimated counts

Page 51

EM Math

max_θ E_{z ~ p_{θ^(t)}(·|w)} [ log p_θ(z, w) ]

Page 52

EM Math

max_θ E_{z ~ p_{θ^(t)}(·|w)} [ log p_θ(z, w) ]

E-step: count under uncertainty

M-step: maximize log-likelihood

Page 53

EM Math

max_θ E_{z ~ p_{θ^(t)}(·|w)} [ log p_θ(z, w) ]

E-step: count under uncertainty

M-step: maximize log-likelihood

old parameters

posterior distribution

Page 54

EM Math

max_θ E_{z ~ p_{θ^(t)}(·|w)} [ log p_θ(z, w) ]

E-step: count under uncertainty

M-step: maximize log-likelihood

old parameters define the posterior distribution; new parameters appear inside the log-likelihood

Page 55

Why EM? Semi-Supervised Learning

labeled data:
• human annotated
• relatively small/few examples

unlabeled data:
• raw; not annotated
• plentiful

Page 56

Why EM? Semi-Supervised Learning

(same labeled/unlabeled data setup as above, now linked by EM)

Page 57

(repeat of the previous slide)

Page 58

Why EM? Semi-Supervised Learning

(figure: EM applied to the unlabeled data)

Page 59

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization)
• Basic idea
• Three coins example
• Why EM works

Page 60

Three Coins Example

Imagine three coins

Flip 1st coin (penny)

If heads: flip 2nd coin (dollar coin)

If tails: flip 3rd coin (dime)

Page 61

Three Coins Example

Imagine three coins

Flip 1st coin (penny)

If heads: flip 2nd coin (dollar coin)

If tails: flip 3rd coin (dime)

only observe these (record heads vs. tails outcome)

don’t observe this

Page 62

Three Coins Example

Imagine three coins

Flip 1st coin (penny)

If heads: flip 2nd coin (dollar coin)

If tails: flip 3rd coin (dime)

observed: a, b, e, etc.; “We run the code” vs. “The run failed”

unobserved: part of speech? genre?

Page 63

Three Coins Example

Imagine three coins

Flip 1st coin (penny)

If heads: flip 2nd coin (dollar coin)

If tails: flip 3rd coin (dime)

p(heads) = λ, p(tails) = 1 − λ   (penny)
p(heads) = γ, p(tails) = 1 − γ   (dollar coin)
p(heads) = ψ, p(tails) = 1 − ψ   (dime)

Page 64

Three Coins Example

Imagine three coins

p(heads) = λ, p(tails) = 1 − λ   (penny)
p(heads) = γ, p(tails) = 1 − γ   (dollar coin)
p(heads) = ψ, p(tails) = 1 − ψ   (dime)

Three parameters to estimate: λ, γ, and ψ

Page 65

Three Coins Example

If all flips were observed:

penny:   H H T H T H
outcome: H T H T T T

p(heads) = λ, p(tails) = 1 − λ   (penny)
p(heads) = γ, p(tails) = 1 − γ   (dollar coin)
p(heads) = ψ, p(tails) = 1 − ψ   (dime)

Page 66

Three Coins Example

If all flips were observed:

penny:   H H T H T H
outcome: H T H T T T

p(heads) = λ = 4/6, p(tails) = 2/6   (penny)
p(heads) = γ = 1/4, p(tails) = 3/4   (dollar coin)
p(heads) = ψ = 1/2, p(tails) = 1/2   (dime)

Page 67

Three Coins Example

But not all flips are observed; set parameter values:

penny (unobserved): H H T H T H
outcome (observed): H T H T T T

p(heads) = λ = .6, p(tails) = .4   (penny)
p(heads) = γ = .8, p(tails) = .2   (dollar coin)
p(heads) = ψ = .6, p(tails) = .4   (dime)

Page 68

Three Coins Example

But not all flips are observed; set parameter values:

penny (unobserved): H H T H T H
outcome (observed): H T H T T T

p(heads) = λ = .6, p(tails) = .4   (penny)
p(heads) = γ = .8, p(tails) = .2   (dollar coin)
p(heads) = ψ = .6, p(tails) = .4   (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(heads & H) / p(H)

p(heads | observed item T) = p(heads & T) / p(T)

Page 69

Three Coins Example

But not all flips are observed; set parameter values:

penny (unobserved): H H T H T H
outcome (observed): H T H T T T

p(heads) = λ = .6, p(tails) = .4   (penny)
p(heads) = γ = .8, p(tails) = .2   (dollar coin)
p(heads) = ψ = .6, p(tails) = .4   (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

(rewrite the joint using Bayes rule; the denominator p(H) is the marginal likelihood)

Page 70

Three Coins Example

But not all flips are observed; set parameter values:

penny (unobserved): H H T H T H
outcome (observed): H T H T T T

p(heads) = λ = .6, p(tails) = .4   (penny)
p(heads) = γ = .8, p(tails) = .2   (dollar coin)
p(heads) = ψ = .6, p(tails) = .4   (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

p(H | heads) = .8, p(T | heads) = .2

Page 71

Three Coins Example

But not all flips are observed; set parameter values:

penny (unobserved): H H T H T H
outcome (observed): H T H T T T

p(heads) = λ = .6, p(tails) = .4   (penny)
p(heads) = γ = .8, p(tails) = .2   (dollar coin)
p(heads) = ψ = .6, p(tails) = .4   (dime)

Use these values to compute posteriors:

p(H) = p(H | heads) p(heads) + p(H | tails) p(tails) = .8 ∗ .6 + .6 ∗ .4 = .72

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

p(H | heads) = .8, p(T | heads) = .2

Page 72

Three Coins Example

H H T H T H
H T H T T T

Use posteriors to update parameters:

p(heads | obs. H) = p(H | heads) p(heads) / p(H)
                  = (0.8 * 0.6) / (0.8 * 0.6 + 0.6 * 0.4) ≈ 0.667

p(heads | obs. T) = p(T | heads) p(heads) / p(T)
                  = (0.2 * 0.6) / (0.2 * 0.6 + 0.4 * 0.4) ≈ 0.429

Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?

Page 73

Three Coins Example

H H T H T H
H T H T T T

Use posteriors to update parameters:

p(heads | obs. H) = p(H | heads) p(heads) / p(H)
                  = (0.8 * 0.6) / (0.8 * 0.6 + 0.6 * 0.4) ≈ 0.667

p(heads | obs. T) = p(T | heads) p(heads) / p(T)
                  = (0.2 * 0.6) / (0.2 * 0.6 + 0.4 * 0.4) ≈ 0.429

Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?

A: No.

Page 74

Three Coins Example

H H T H T H
H T H T T T

Use posteriors to update parameters.

fully observed setting:
p(heads) = (# heads from penny) / (# total flips of penny)

our setting, partially observed:
p(heads) = (# expected heads from penny) / (# total flips of penny)

p(heads | obs. H) = (0.8 * 0.6) / (0.8 * 0.6 + 0.6 * 0.4) ≈ 0.667
p(heads | obs. T) = (0.2 * 0.6) / (0.2 * 0.6 + 0.4 * 0.4) ≈ 0.429

(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)

Page 75

Three Coins Example

H H T H T H
H T H T T T

Use posteriors to update parameters.

our setting, partially observed:
p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny)
               = E_{p^(t)}[# heads from penny] / (# total flips of penny)

p(heads | obs. H) = (0.8 * 0.6) / (0.8 * 0.6 + 0.6 * 0.4) ≈ 0.667
p(heads | obs. T) = (0.2 * 0.6) / (0.2 * 0.6 + 0.4 * 0.4) ≈ 0.429

Page 76

Three Coins Example

H H T H T H
H T H T T T

Use posteriors to update parameters.

our setting, partially observed:
p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny)
               = E_{p^(t)}[# heads from penny] / (# total flips of penny)
               = (2 * p(heads | obs. H) + 4 * p(heads | obs. T)) / 6 ≈ 0.508

p(heads | obs. H) = (0.8 * 0.6) / (0.8 * 0.6 + 0.6 * 0.4) ≈ 0.667
p(heads | obs. T) = (0.2 * 0.6) / (0.2 * 0.6 + 0.4 * 0.4) ≈ 0.429
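The full E-step/M-step round above can be sketched in a few lines of Python. This is a minimal illustration assuming the slides' setup (penny bias λ = 0.6, coin biases 0.8 and 0.6, six observed flips: 2 H and 4 T); the helper name `posterior_heads` is ours, not from the slides.

```python
# Minimal sketch of one EM iteration for the three-coins example.
# Assumed setup (from the slides): a hidden penny with p(heads) = lam
# picks which of two biased coins produces each observed flip.

lam = 0.6                              # penny: p(heads)
p_h = {"heads": 0.8, "tails": 0.6}     # p(obs = H | penny side)
obs = ["H", "H", "T", "T", "T", "T"]   # 2 observed heads, 4 observed tails

def posterior_heads(o, lam, p_h):
    """E-step quantity: p(penny = heads | observed flip o), by Bayes' rule."""
    like_h = p_h["heads"] if o == "H" else 1 - p_h["heads"]
    like_t = p_h["tails"] if o == "H" else 1 - p_h["tails"]
    joint_h = like_h * lam                   # p(o, penny = heads)
    marginal = joint_h + like_t * (1 - lam)  # p(o)
    return joint_h / marginal

# E-step: expected number of penny-heads over all six flips
expected_heads = sum(posterior_heads(o, lam, p_h) for o in obs)

# M-step: re-estimate the penny bias from the expected counts
lam_new = expected_heads / len(obs)
```

Iterating to convergence simply repeats the same pair of computations with the new λ.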

Page 77

Expectation Maximization (EM)

0. Assume some value for your parameters.

Two-step, iterative algorithm:

1. E-step: count under uncertainty (compute expectations)

2. M-step: maximize log-likelihood, assuming these uncertain counts
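These two steps can be written as a generic loop. A sketch, where `e_step` and `m_step` are hypothetical placeholders supplied by the specific model:

```python
# A generic EM skeleton (sketch). `e_step` and `m_step` come from the
# model: e_step computes expected counts under the current parameters;
# m_step re-estimates the parameters from those counts.
def run_em(theta, e_step, m_step, n_iters=100):
    for _ in range(n_iters):
        expected_counts = e_step(theta)  # 1. count under uncertainty
        theta = m_step(expected_counts)  # 2. maximize given those counts
    return theta
```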

Page 78

Outline

Administrivia & recap

Latent and probabilistic modeling

EM (Expectation Maximization)
  Basic idea
  Three coins example
  Why EM works

Page 79

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

What do 𝒞, ℳ, 𝒫 look like?

Page 80

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

𝒞(θ) = Σ_i log p(x_i, y_i)

Page 81

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

𝒞(θ) = Σ_i log p(x_i, y_i)

ℳ(θ) = Σ_i log p(x_i) = Σ_i log Σ_k p(x_i, y = k)

Page 82

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

𝒞(θ) = Σ_i log p(x_i, y_i)

ℳ(θ) = Σ_i log p(x_i) = Σ_i log Σ_k p(x_i, y = k)

𝒫(θ) = Σ_i log p(y_i | x_i)

Page 83

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

p_θ(Y | X) = p_θ(X, Y) / p_θ(X)     (definition of conditional probability)

p_θ(X) = p_θ(X, Y) / p_θ(Y | X)     (algebra)

Page 84

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

p_θ(Y | X) = p_θ(X, Y) / p_θ(X)
p_θ(X) = p_θ(X, Y) / p_θ(Y | X)

ℳ(θ) = 𝒞(θ) − 𝒫(θ)

𝒞(θ) = Σ_i log p(x_i, y_i)
ℳ(θ) = Σ_i log p(x_i) = Σ_i log Σ_k p(x_i, y = k)
𝒫(θ) = Σ_i log p(y_i | x_i)

Page 85

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

p_θ(Y | X) = p_θ(X, Y) / p_θ(X)
p_θ(X) = p_θ(X, Y) / p_θ(Y | X)

ℳ(θ) = 𝒞(θ) − 𝒫(θ)

E_{Y∼θ^(t)}[ℳ(θ) | X] = E_{Y∼θ^(t)}[𝒞(θ) | X] − E_{Y∼θ^(t)}[𝒫(θ) | X]

take a conditional expectation (why? we'll cover this more in variational inference)

Page 86

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

p_θ(Y | X) = p_θ(X, Y) / p_θ(X)
p_θ(X) = p_θ(X, Y) / p_θ(Y | X)

ℳ(θ) = 𝒞(θ) − 𝒫(θ)

E_{Y∼θ^(t)}[ℳ(θ) | X] = E_{Y∼θ^(t)}[𝒞(θ) | X] − E_{Y∼θ^(t)}[𝒫(θ) | X]

ℳ(θ) = E_{Y∼θ^(t)}[𝒞(θ) | X] − E_{Y∼θ^(t)}[𝒫(θ) | X]     (ℳ already sums over Y)

ℳ(θ) = Σ_i log p(x_i) = Σ_i log Σ_k p(x_i, y = k)

Page 87

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

ℳ(θ) = E_{Y∼θ^(t)}[𝒞(θ) | X] − E_{Y∼θ^(t)}[𝒫(θ) | X]

E_{Y∼θ^(t)}[𝒞(θ) | X] = Σ_i Σ_k p_{θ^(t)}(y = k | x_i) log p(x_i, y = k)
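This double sum is easy to evaluate for the three-coins example. A sketch, under the assumption that the two coin biases (0.8 and 0.6) stay fixed and only the penny bias λ is a free parameter; it also computes the M-step update, the λ that maximizes Q:

```python
import math

# Sketch: expected complete-data log-likelihood Q(lam, lam_t) for the
# three-coins example. Coin biases 0.8 and 0.6 are held fixed (an
# assumption); only the penny bias lam is optimized.

OBS = ["H", "H", "T", "T", "T", "T"]
P_H = {"heads": 0.8, "tails": 0.6}   # p(obs = H | penny side)

def joint(o, side, lam):
    """p(x = o, y = side) when the penny bias is lam."""
    prior = lam if side == "heads" else 1 - lam
    like = P_H[side] if o == "H" else 1 - P_H[side]
    return prior * like

def Q(lam, lam_t):
    """sum_i sum_k p_{lam_t}(y = k | x_i) * log p_{lam}(x_i, y = k)"""
    total = 0.0
    for o in OBS:
        z = sum(joint(o, k, lam_t) for k in ("heads", "tails"))
        for k in ("heads", "tails"):
            total += (joint(o, k, lam_t) / z) * math.log(joint(o, k, lam))
    return total

# M-step: the ratio of expected counts maximizes Q(., lam_t)
lam_t = 0.6
lam_star = sum(joint(o, "heads", lam_t) /
               sum(joint(o, k, lam_t) for k in ("heads", "tails"))
               for o in OBS) / len(OBS)
```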

Page 88

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

ℳ(θ) = E_{Y∼θ^(t)}[𝒞(θ) | X] − E_{Y∼θ^(t)}[𝒫(θ) | X]
     =       Q(θ, θ^(t))       −       R(θ, θ^(t))

Let θ* be the value that maximizes Q(θ, θ^(t)).

Page 89

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

ℳ(θ) = Q(θ, θ^(t)) − R(θ, θ^(t))

Let θ* be the value that maximizes Q(θ, θ^(t)).

ℳ(θ*) − ℳ(θ^(t)) = [Q(θ*, θ^(t)) − Q(θ^(t), θ^(t))] − [R(θ*, θ^(t)) − R(θ^(t), θ^(t))]

Page 90

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

ℳ(θ) = Q(θ, θ^(t)) − R(θ, θ^(t))

Let θ* be the value that maximizes Q(θ, θ^(t)).

ℳ(θ*) − ℳ(θ^(t)) = [Q(θ*, θ^(t)) − Q(θ^(t), θ^(t))] − [R(θ*, θ^(t)) − R(θ^(t), θ^(t))]

the Q difference is ≥ 0 (θ* maximizes Q); the R difference is ≤ 0 (we'll see why with Jensen's inequality, in variational inference)

Page 91

Why does EM work?

X: observed data; Y: unobserved data

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X

ℳ(θ) = Q(θ, θ^(t)) − R(θ, θ^(t))

Let θ* be the value that maximizes Q(θ, θ^(t)).

ℳ(θ*) − ℳ(θ^(t)) = [Q(θ*, θ^(t)) − Q(θ^(t), θ^(t))] − [R(θ*, θ^(t)) − R(θ^(t), θ^(t))]

ℳ(θ*) − ℳ(θ^(t)) ≥ 0: EM does not decrease the marginal log-likelihood
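This guarantee is easy to check empirically. A sketch on the three-coins setup (coin biases 0.8 and 0.6 assumed fixed, only the penny bias updated), tracking the marginal log-likelihood across EM iterations:

```python
import math

# Sketch: the marginal log-likelihood never decreases across EM updates.
# Three-coins setup; coin biases 0.8 and 0.6 are fixed assumptions, and
# only the penny bias lam is re-estimated.

OBS = ["H", "H", "T", "T", "T", "T"]

def marginal_ll(lam):
    """M(lam) = sum_i log p(x_i), marginalizing out the hidden penny."""
    p_heads_obs = 0.8 * lam + 0.6 * (1 - lam)   # p(obs = H)
    return sum(math.log(p_heads_obs if o == "H" else 1 - p_heads_obs)
               for o in OBS)

def em_update(lam):
    """One E-step + M-step: posterior-weighted count of penny heads."""
    exp_heads = 0.0
    for o in OBS:
        jh = lam * (0.8 if o == "H" else 0.2)        # p(o, penny = heads)
        jt = (1 - lam) * (0.6 if o == "H" else 0.4)  # p(o, penny = tails)
        exp_heads += jh / (jh + jt)
    return exp_heads / len(OBS)

lam, lls = 0.6, []
for _ in range(20):
    lls.append(marginal_ll(lam))
    lam = em_update(lam)
```

Every entry of `lls` is at least as large as the one before it, matching the bound above.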

Page 92

Generalized EM

Partial M-step: find a θ that simply increases, rather than maximizes, Q

Partial E-step: only consider some of the variables (an online learning algorithm)

Page 93

EM has its pitfalls

Objective is not convex: EM may converge to a bad local optimum

Computing expectations can be hard: the E-step could require clever algorithms

How well does log-likelihood correlate with an end task?

Page 94

A Maximization-Maximization Procedure

F(θ, q) = E_q[𝒞(θ)] − E_q[log q(Z)]

q: any distribution over Z
F lower-bounds the observed-data log-likelihood

we'll see this again with variational inference
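As a small numeric check of this view, the following sketch evaluates F for a single observation x = H under the three-coins numbers used earlier (the joint probabilities are assumptions carried over from those slides). F equals the marginal log-likelihood exactly when q is the posterior, and is smaller for any other q:

```python
import math

# Sketch: F(theta, q) = E_q[log p(x, Z)] - E_q[log q(Z)] lower-bounds
# the marginal log-likelihood, with equality at q = posterior.
# Single observation x = "H"; joint probs assume the three-coins setup.

joint = {"heads": 0.6 * 0.8, "tails": 0.4 * 0.6}   # p(x = H, y = k)
marginal = sum(joint.values())                     # p(x = H)

def F(q):
    """Free energy for one observation under a distribution q over y."""
    return sum(q[k] * (math.log(joint[k]) - math.log(q[k])) for k in joint)

posterior = {k: v / marginal for k, v in joint.items()}
```

The E-step of EM is exactly the q-maximization: setting q to the posterior makes the bound tight.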