Directed Probabilistic Graphical Models
CMSC 678
UMBC
Announcement 1: Assignment 3
Due Wednesday April 11th, 11:59 AM
Any questions?
Announcement 2: Progress Report on Project
Due Monday April 16th, 11:59 AM
Build on the proposal:
- Update to address comments
- Discuss the progress you've made
- Discuss what remains to be done
- Discuss any new blocks you've experienced (or anticipate experiencing)
Any questions?
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Recap from last time…
Expectation Maximization (EM): E-step
Two step, iterative algorithm
0. Assume some value for your parameters
1. E-step: count under uncertainty, assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
(figure: estimated counts $\mathrm{count}(z_i, w_i)\,p(z_i)$, computed under the current parameters $p^{(t)}(z)$, feed the update to the new parameters $p^{(t+1)}(z)$)
EM Math
$$\max_{\theta} \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)} \big[ \log p_{\theta}(z, w) \big]$$
E-step: count under uncertainty (the expectation is taken under the posterior distribution given the old parameters $\theta^{(t)}$)
M-step: maximize log-likelihood (with respect to the new parameters $\theta$)
$\mathcal{C}(\theta)$ = log-likelihood of complete data $(X, Y)$
$\mathcal{P}(\theta)$ = posterior log-likelihood of incomplete data $Y$
$\mathcal{M}(\theta)$ = marginal log-likelihood of observed data $X$
$$\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}\big[\mathcal{C}(\theta) \mid X\big] - \mathbb{E}_{Y \sim \theta^{(t)}}\big[\mathcal{P}(\theta) \mid X\big]$$
EM does not decrease the marginal log-likelihood
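One way to see that last claim, as a standard sketch using the decomposition above (all expectations over $Y \sim \theta^{(t)}$ given $X$):
$$\mathcal{M}(\theta) - \mathcal{M}(\theta^{(t)})
= \underbrace{\Big(\mathbb{E}[\mathcal{C}(\theta) \mid X] - \mathbb{E}[\mathcal{C}(\theta^{(t)}) \mid X]\Big)}_{\ge\, 0 \text{ by the M-step}}
+ \underbrace{\Big(\mathbb{E}[\mathcal{P}(\theta^{(t)}) \mid X] - \mathbb{E}[\mathcal{P}(\theta) \mid X]\Big)}_{=\, \mathrm{KL}\big(p_{\theta^{(t)}}(Y \mid X)\,\|\,p_{\theta}(Y \mid X)\big) \,\ge\, 0}
\;\ge\; 0$$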
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Lagrange multipliers
Assume an original (constrained) optimization problem.
We convert it to a new, unconstrained optimization problem.
Is the new problem equivalent to the original one?
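As a sketch of the standard setup (the same recipe applied to the die-rolling example in the next section): to maximize $f(\theta)$ subject to an equality constraint $g(\theta) = 0$, find stationary points of the Lagrangian instead.
$$\mathcal{F}(\theta, \lambda) = f(\theta) - \lambda\, g(\theta), \qquad
\frac{\partial \mathcal{F}}{\partial \theta} = 0, \quad
\frac{\partial \mathcal{F}}{\partial \lambda} = -g(\theta) = 0$$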
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Probabilistic Estimation of Rolling a Die
$$p(w_1, w_2, \ldots, w_N) = p(w_1)\,p(w_2)\cdots p(w_N) = \prod_i p(w_i)$$
N different (independent) rolls, e.g. $w_1 = 1$, $w_2 = 5$, $w_3 = 4$, …
Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$, where $\theta$ is a probability distribution over the 6 sides of the die: $\sum_{k=1}^{6} \theta_k = 1$, $0 \le \theta_k \le 1\ \forall k$

Maximize Log-likelihood
$$\mathcal{L}(\theta) = \sum_i \log p_\theta(w_i) = \sum_i \log \theta_{w_i}$$
Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing $\theta_k$ (we know $\theta$ must be a distribution, but that constraint isn't part of the objective as written)

Maximize Log-likelihood (with distribution constraints)
$$\mathcal{L}(\theta) = \sum_i \log \theta_{w_i} \quad \text{s.t.} \quad \sum_{k=1}^{6} \theta_k = 1$$
(we could also include the inequality constraints $0 \le \theta_k$, but they complicate the problem and, right now, are not needed)
Solve using Lagrange multipliers:
$$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$$
$$\frac{\partial \mathcal{F}(\theta)}{\partial \theta_k} = \sum_{i:\, w_i = k} \frac{1}{\theta_{w_i}} - \lambda
\qquad\qquad
\frac{\partial \mathcal{F}(\theta)}{\partial \lambda} = -\sum_{k=1}^{6} \theta_k + 1$$
Setting the derivatives to zero gives $\theta_k = \frac{\sum_{i:\, w_i = k} 1}{\lambda}$, with the optimal $\lambda$ chosen so that $\sum_{k=1}^{6} \theta_k = 1$:
$$\theta_k = \frac{\sum_{i:\, w_i = k} 1}{\sum_{k'} \sum_{i:\, w_i = k'} 1} = \frac{N_k}{N}$$
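A minimal sketch of this closed-form answer in Python (the names `die_mle` and `rolls` are illustrative, not from the slides): counting each face and normalizing recovers $\theta_k = N_k / N$.

```python
from collections import Counter

def die_mle(rolls, num_sides=6):
    """Closed-form MLE for a categorical die: theta_k = N_k / N."""
    counts = Counter(rolls)
    n = len(rolls)
    return [counts.get(k, 0) / n for k in range(1, num_sides + 1)]

# Example: the observed rolls w_1 = 1, w_2 = 5, w_3 = 4 from the slides
print(die_mle([1, 5, 4]))  # [1/3, 0, 0, 1/3, 1/3, 0]
```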
Example: Conditionally Rolling a Die
Add complexity to better explain what we see. Instead of a single independent model,
$$p(w_1, w_2, \ldots, w_N) = p(w_1)\,p(w_2)\cdots p(w_N) = \prod_i p(w_i),$$
condition each observation on a hidden choice $z_i$:
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\,p(w_1 \mid z_1) \cdots p(z_N)\,p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$$
e.g. observations $w_1 = 1$, $w_2 = 5$, … with hidden values $z_1 = H$, $z_2 = T$, …
Parameters: $\lambda$ = distribution over the penny ($p(\text{heads}) = \lambda$, $p(\text{tails}) = 1 - \lambda$); $\gamma$ = distribution for the dollar coin ($p(\text{heads}) = \gamma$, $p(\text{tails}) = 1 - \gamma$); $\psi$ = distribution over the dime ($p(\text{heads}) = \psi$, $p(\text{tails}) = 1 - \psi$)
Generative Story:
for item $i = 1$ to $N$:
  $z_i \sim \mathrm{Bernoulli}(\lambda)$
  if $z_i = H$: $w_i \sim \mathrm{Bernoulli}(\gamma)$
  else: $w_i \sim \mathrm{Bernoulli}(\psi)$
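A hedged sketch of this generative story as sampling code (the parameter values below are made up purely for illustration):

```python
import random

def sample_coin_mixture(n, lam=0.5, gamma=0.9, psi=0.2):
    """Generative story: flip the penny (prob. lam of heads) to pick a coin,
    then flip the chosen coin (dollar coin: gamma, dime: psi)."""
    data = []
    for _ in range(n):
        z = 'H' if random.random() < lam else 'T'        # hidden choice
        p_heads = gamma if z == 'H' else psi             # chosen coin's bias
        w = 'H' if random.random() < p_heads else 'T'    # observed flip
        data.append((z, w))
    return data

print(sample_coin_mixture(5))
```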
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Classify with Bayes Rule
$$\operatorname*{argmax}_Y \; p(Y \mid X) = \operatorname*{argmax}_Y \; \big[ \log p(X \mid Y) + \log p(Y) \big]$$
(likelihood term: $\log p(X \mid Y)$; prior term: $\log p(Y)$)
The Bag of Words Representation
(figures adapted from Jurafsky & Martin, draft)
Bag of Words Representation
(figure: a document is reduced to word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …, and fed to a classifier: γ(document) = c)
Adapted from Jurafsky & Martin (draft)
Naïve Bayes: A Generative Story
Generative Story:
- $\phi$ = distribution over $K$ labels (global parameters)
- for label $k = 1$ to $K$: $\theta_k$ = generate parameters
- for item $i = 1$ to $N$ (local variables):
  - $y_i \sim \mathrm{Cat}(\phi)$
  - for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$
(graph: label $y$ is the parent of the features $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}$)
Each $x_{ij}$ is conditionally independent of the others, given the label.
Maximize Log-likelihood:
$$\mathcal{L}(\theta) = \sum_i \sum_j \log F_{y_i}(x_{ij}; \theta_{y_i}) + \sum_i \log \phi_{y_i}
\quad \text{s.t.} \quad \sum_k \phi_k = 1, \;\; \theta_k \text{ is valid for } F_k$$
Multinomial Naïve Bayes: A Generative Story
Generative Story:
- $\phi$ = distribution over $K$ labels
- for label $k = 1$ to $K$: $\theta_k$ = distribution over $J$ feature values
- for item $i = 1$ to $N$:
  - $y_i \sim \mathrm{Cat}(\phi)$
  - for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i,j})$
Maximize Log-likelihood:
$$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i}
\quad \text{s.t.} \quad \sum_k \phi_k = 1, \;\; \sum_j \theta_{kj} = 1 \;\forall k$$
Maximize Log-likelihood via Lagrange Multipliers:
$$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i}
- \mu \left( \sum_k \phi_k - 1 \right) - \sum_k \lambda_k \left( \sum_j \theta_{kj} - 1 \right)$$
Multinomial Naïve Bayes: Learning
Calculate class priors:
  For each k: items_k = all items with class = k
  $$p(k) = \frac{|\text{items}_k|}{\#\,\text{items}}$$
Calculate feature generation terms:
  For each k: obs_k = single object containing all items labeled as k
  For each feature j: $n_{kj}$ = # of occurrences of j in obs_k
  $$p(j \mid k) = \frac{n_{kj}}{\sum_{j'} n_{kj'}}$$
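A minimal sketch of these counting steps in Python (function and variable names are illustrative; no smoothing, exactly as on the slide):

```python
from collections import Counter, defaultdict

def train_multinomial_nb(items):
    """items: list of (label, list_of_feature_values). Returns class priors
    p(k) = |items_k| / #items and feature terms p(j|k) = n_kj / sum_j' n_kj'."""
    label_counts = Counter(label for label, _ in items)
    feature_counts = defaultdict(Counter)
    for label, features in items:
        feature_counts[label].update(features)

    n = len(items)
    priors = {k: c / n for k, c in label_counts.items()}
    likelihoods = {
        k: {j: c / sum(counts.values()) for j, c in counts.items()}
        for k, counts in feature_counts.items()
    }
    return priors, likelihoods

priors, likelihoods = train_multinomial_nb([
    ("pos", ["sweet", "happy", "happy"]),
    ("neg", ["boring", "boring"]),
])
print(priors["pos"], likelihoods["pos"]["happy"])  # 0.5, 2/3
```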
Brill and Banko (2001)
With enough data, the classifier may not matter
Adapted from Jurafsky & Martin (draft)
Summary: Naïve Bayes is Not So Naïve, but not without issue
Pro:
- Very fast, low storage requirements
- Robust to irrelevant features
- Very good in domains with many equally important features
- Optimal if the independence assumptions hold
- Dependable baseline for text classification (but often not the best)
Con:
- Model the posterior in one go? (e.g., use conditional maxent)
- Are the features really uncorrelated?
- Are plain counts always appropriate?
- Are there "better" (automated, more principled) ways of handling missing/noisy data?
Adapted from Jurafsky & Martin (draft)
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Hidden Markov Models
p(British Left Waffles on Falkland Islands)
Two possible tag sequences:
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Model all class sequences, with a class-based model of the words and a bigram model of the classes:
$$\sum_{z_1, \ldots, z_N} p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = \sum_{z_1, \ldots, z_N} \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$$
Hidden Markov Model
Goal: maximize (log-)likelihood
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$$
In practice: we don't actually observe these z values; we just see the words w.
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
If we did observe z, estimating the probability parameters would be easy… but we don't! :(
Hidden Markov Model Terminology
Each $z_i$ can take the value of one of K latent states.
Transition and emission distributions do not change.
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$$
transition probabilities/parameters: $p(z_i \mid z_{i-1})$
emission probabilities/parameters: $p(w_i \mid z_i)$
Q: How many different probability values are there with K states and V vocab items?
A: VK emission values and K² transition values
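For example, with the K = 2 tagset {N, V} and the 4-word vocabulary used in the lattice example below, that is 2 × 4 = 8 emission values and 2 × 2 = 4 transition values (plus the start/end transitions shown in the tables).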
Hidden Markov Model Representation
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$$
(transition probabilities/parameters: $p(z_i \mid z_{i-1})$; emission probabilities/parameters: $p(w_i \mid z_i)$)
Represent the probabilities and independence assumptions in a graph:
(figure: chain z1 → z2 → z3 → z4 → …, each z_i emitting w_i; transition edges labeled p(z2 | z1), p(z3 | z2), p(z4 | z3), and p(z1 | z0) from the initial starting distribution ("__SEQ_SYM__"); emission edges labeled p(w1 | z1), …, p(w4 | z4))
Each $z_i$ can take the value of one of K latent states; transition and emission distributions do not change.
Example: 2-state Hidden Markov Model as a Lattice
(figure, built up over several slides: a lattice with rows z_i = N and z_i = V for each position i = 1…4 and observations w1…w4 below; emission arcs p(w_i | N) and p(w_i | V) at every position, and transition arcs p(N | start), p(V | start), p(N | N), p(V | N), p(N | V), p(V | V) between adjacent columns)

Transition probabilities (row = from, column = to):
        N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities:
     w1    w2    w3    w4
N    .7    .2    .05   .05
V    .2    .6    .1    .1
A Latent Sequence is a Path through the Graph
(figure: the highlighted path N → V → V → N through the lattice above, with its emission and transition arcs)
Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7 * .7) * (.8 * .6) * (.35 * .1) * (.6 * .05) ≈ 0.000247
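As a quick check of that arithmetic, a minimal Python sketch (the dictionary names `p_trans` and `p_emit` are just for illustration):

```python
p_trans = {"start": {"N": .7, "V": .2, "end": .1},
           "N": {"N": .15, "V": .8, "end": .05},
           "V": {"N": .6, "V": .35, "end": .05}}
p_emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
          "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}

def path_probability(states, words):
    """Joint probability of one latent path and the observed words:
    product of p(z_i | z_{i-1}) * p(w_i | z_i) along the path."""
    prob, prev = 1.0, "start"
    for z, w in zip(states, words):
        prob *= p_trans[prev][z] * p_emit[z][w]
        prev = z
    return prob

print(path_probability(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"]))  # ~0.000247
```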
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.
If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side
ITILA, Ch 16
What's the Maximum Weighted Path?
(figure, built up over several slides: a small graph with weighted edges; at each node the best running total from the start is recorded and then extended one edge at a time, so the maximum-weight path is found without enumerating every path separately)
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
What's the Maximum Value?
Consider any path ending in B: its last step arrives from A, B, or C (the arcs A → B, B → B, or C → B).
Maximize across the previous hidden state values:
$$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$$
v(i, B) is the maximum probability of any path to state B (from the beginning) at step i, including emitting the observation there.
(figure: lattice columns of states A, B, C at steps i-2, i-1, i)
Computing v at time i-1 will correctly incorporate (maximize over) paths through time i-2: we correctly obey the Markov property.
Viterbi algorithm
v = double[N+2][K*]        // best (max) probability of any path ending in each state
b = int[N+2][K*]           // backpointers/book-keeping
v[*][*] = 0
v[0][START] = 1
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = pemission(obs_i | state)
    for(old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      if(v[i-1][old] * pobs * pmove > v[i][state]) {
        v[i][state] = v[i-1][old] * pobs * pmove
        b[i][state] = old
      }
    }
  }
}
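A runnable Python sketch of the same idea, using the 2-state (N, V) tables from the lattice example above; it is dictionary-based rather than the array layout in the pseudocode, and the start/end handling is one reasonable choice, not the only one.

```python
def viterbi(obs, states, p_trans, p_emit):
    """Most likely latent state sequence for the observations."""
    v = [{s: 0.0 for s in states} for _ in range(len(obs))]      # best path probability
    back = [{s: None for s in states} for _ in range(len(obs))]  # backpointers
    for s in states:                                             # initialization from "start"
        v[0][s] = p_trans["start"][s] * p_emit[s][obs[0]]
    for i in range(1, len(obs)):                                 # recursion: max over previous states
        for s in states:
            best_prev, best_p = None, 0.0
            for s_prev in states:
                p = v[i - 1][s_prev] * p_trans[s_prev][s] * p_emit[s][obs[i]]
                if p > best_p:
                    best_prev, best_p = s_prev, p
            v[i][s], back[i][s] = best_p, best_prev
    # termination: fold in the transition to "end", then follow backpointers
    last = max(states, key=lambda s: v[-1][s] * p_trans[s]["end"])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

p_trans = {"start": {"N": .7, "V": .2, "end": .1},
           "N": {"N": .15, "V": .8, "end": .05},
           "V": {"N": .6, "V": .35, "end": .05}}
p_emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
          "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}
print(viterbi(["w1", "w2", "w3", "w4"], ["N", "V"], p_trans, p_emit))
```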
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Forward Probability
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
Marginalize (sum) across the previous hidden state values:
$$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$$
(α(i-1, s'): what's the total probability up until now? p(s | s'): what are the immediate ways to get into state s? the whole product: how likely is it to get into state s this way?)
Computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property.
(figure: lattice columns of states A, B, C at steps i-2, i-1, i)
Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][end]
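The forward pass is the Viterbi sketch above with max replaced by a sum; a minimal Python version (reusing the illustrative `p_trans` / `p_emit` tables defined in the earlier sketches):

```python
def forward(obs, states, p_trans, p_emit):
    """Total (marginal) probability of the observations: sum over all latent paths."""
    alpha = [{s: 0.0 for s in states} for _ in range(len(obs))]
    for s in states:                                   # initialization from "start"
        alpha[0][s] = p_trans["start"][s] * p_emit[s][obs[0]]
    for i in range(1, len(obs)):                       # recursion: sum instead of max
        for s in states:
            alpha[i][s] = (sum(alpha[i - 1][sp] * p_trans[sp][s] for sp in states)
                           * p_emit[s][obs[i]])
    # termination: fold in the transition to "end"; this plays the role of alpha[N+1][end]
    return sum(alpha[-1][s] * p_trans[s]["end"] for s in states)

# e.g. forward(["w1", "w2", "w3", "w4"], ["N", "V"], p_trans, p_emit)
```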
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Forward & Backward Message Passing
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)
(figure: lattice columns of states A, B, C at steps i-1, i, i+1, with α(i, B) accumulated from the left and β(i, B) accumulated from the right)
α(i, B) * β(i, B) = total probability of paths through state B at step i
α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the B → s' arc (at time i)
With Both Forward and Backward Values
α(i, s) * β(i, s) = total probability of paths through state s at step i
α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s → s' arc (at time i)
$$p(z_i = s \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, \beta(i, s)}{\alpha(N+1, \text{END})}$$
$$p(z_i = s,\, z_{i+1} = s' \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, p(s' \mid s)\, p(\text{obs}_{i+1} \mid s')\, \beta(i+1, s')}{\alpha(N+1, \text{END})}$$
Expectation Maximization (EM)
Two step, iterative algorithm
0. Assume some value for your parameters
1. E-step: count under uncertainty, assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
For an HMM the parameters are the emission and transition distributions, p_obs(w | s) and p_trans(s' | s), and the estimated counts come from the posteriors computed with α and β:
$$p^*(z_i = s \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, \beta(i, s)}{\alpha(N+1, \text{END})}$$
$$p^*(z_i = s,\, z_{i+1} = s' \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, p(s' \mid s)\, p(\text{obs}_{i+1} \mid s')\, \beta(i+1, s')}{\alpha(N+1, \text{END})}$$
M-Step
"maximize log-likelihood, assuming these uncertain counts"
If we observed the hidden transitions:
$$p_{\text{new}}(s' \mid s) = \frac{c(s \to s')}{\sum_x c(s \to x)}$$
We don't observe the hidden transitions, but we can approximately count them:
$$p_{\text{new}}(s' \mid s) = \frac{\mathbb{E}[c(s \to s')]}{\sum_x \mathbb{E}[c(s \to x)]}$$
We compute these expected counts in the E-step, with our α and β values.
EM For HMMs (Baum-Welch Algorithm)
α = computeForwards()
β = computeBackwards()
L = α[N+1][END]                     // total likelihood of the observed sequence
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    // expected emission count: posterior probability of being in state `next` at step i+1
    cobs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
    for(state = 0; state < K*; ++state) {
      // expected transition count: posterior probability of the state → next arc at step i
      u = pobs(obs_{i+1} | next) * ptrans(next | state)
      ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
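A minimal sketch of the M-step that would follow these expected counts (the dictionary names `c_trans` and `c_obs` are illustrative stand-ins for the ctrans/cobs counts above):

```python
def m_step(c_trans, c_obs):
    """Normalize expected counts into new distributions:
    p_new(s' | s) = E[c(s -> s')] / sum_x E[c(s -> x)], and likewise for emissions."""
    p_trans = {s: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
               for s, nxt in c_trans.items()}
    p_emit = {s: {w: c / sum(ws.values()) for w, c in ws.items()}
              for s, ws in c_obs.items()}
    return p_trans, p_emit

# e.g. expected counts accumulated by the E-step loop above (made-up numbers)
p_trans, p_emit = m_step({"N": {"N": 1.2, "V": 2.8}}, {"N": {"w1": 3.0, "w2": 1.0}})
print(p_trans["N"]["V"], p_emit["N"]["w1"])  # 0.7, 0.75
```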
Bayesian Networks: Directed Acyclic Graphs
$$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$$
$\pi(x_i)$ = "parents of" $x_i$; the factors can be written down by following a topological sort of the graph.
Bayesian Networks: Directed Acyclic Graphs
$$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$$
Exact inference in general DAGs is NP-hard; inference in trees can be exact.
D-Separation: Testing for Conditional Independence
Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.
d-separation: X & Y are d-separated if, for every path P, one of the following is true:
- P has a chain with an observed middle node (X → Z → Y): observing Z blocks the path from X to Y
- P has a fork with an observed parent node (X ← Z → Y): observing Z blocks the path from X to Y
- P includes a "v-structure" or "collider" (X → Z ← Y) with Z and all its descendants unobserved: not observing Z blocks the path from X to Y
For the v-structure, marginalizing out the unobserved Z leaves X and Y independent:
$$p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y), \qquad p(x, y) = \sum_z p(x)\, p(y)\, p(z \mid x, y) = p(x)\, p(y)$$
Markov Blanket
The Markov blanket of a node $x$ is its parents, children, and children's parents: the set of nodes needed to form the complete conditional for a variable $x_i$.
$$p(x_i \mid x_{j \ne i}) = \frac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_i}
= \frac{\prod_k p(x_k \mid \pi(x_k))}{\int \prod_k p(x_k \mid \pi(x_k))\, dx_i}
= \frac{\prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))}{\int \prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))\, dx_i}$$
(the second equality uses the factorization of the graph; the third factors out, and cancels, terms not dependent on $x_i$)