Directed Probabilistic Graphical Models CMSC 678 UMBC

Page 1: Directed Graphical Models

Directed Probabilistic Graphical Models

CMSC 678

UMBC

Page 2: Directed Graphical Models

Announcement 1: Assignment 3

Due Wednesday April 11th, 11:59 AM

Any questions?

Page 3: Directed Graphical Models

Announcement 2: Progress Report on Project

Due Monday April 16th, 11:59 AM

Build on the proposal:
Update to address comments
Discuss the progress you’ve made
Discuss what remains to be done
Discuss any new blocks you’ve experienced (or anticipate experiencing)

Any questions?

Page 4: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 5: Directed Graphical Models

Recap from last time…

Page 6: Directed Graphical Models

Expectation Maximization (EM): E-step

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

[diagram: estimated counts count($z_i$, $w_i$), computed under the current parameters $p^{(t)}(z)$, feed the update to $p^{(t+1)}(z)$]

Page 7: Directed Graphical Models

EM Math

$\max_\theta \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)} \left[ \log p_\theta(z, w) \right]$

E-step: count under uncertainty

M-step: maximize log-likelihood

(annotations: $\theta^{(t)}$ are the old parameters, $\theta$ the new parameters, and $p_{\theta^{(t)}}(\cdot \mid w)$ the posterior distribution)

$\mathcal{C}(\theta)$ = log-likelihood of complete data (X, Y)

$\mathcal{P}(\theta)$ = posterior log-likelihood of incomplete data Y

$\mathcal{M}(\theta)$ = marginal log-likelihood of observed data X

$\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$

EM does not decrease the marginal log-likelihood
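Writing $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X]$ for the E-step objective above and $H(\theta \mid \theta^{(t)}) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$, a brief sketch of why (my addition; this is the standard argument, not spelled out on the slide):

$\mathcal{M}(\theta) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)})$

$Q(\theta^{(t+1)} \mid \theta^{(t)}) \ge Q(\theta^{(t)} \mid \theta^{(t)})$  (the M-step maximizes $Q$)

$H(\theta^{(t+1)} \mid \theta^{(t)}) \le H(\theta^{(t)} \mid \theta^{(t)})$  (Gibbs' inequality)

so $\mathcal{M}(\theta^{(t+1)}) \ge \mathcal{M}(\theta^{(t)})$.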

Page 8: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 9: Directed Graphical Models

Assume an original optimization problem

Lagrange multipliers

Page 10: Directed Graphical Models

Assume an original optimization problem

We convert it to a new optimization problem:

Lagrange multipliers

Page 11: Directed Graphical Models

Lagrange multipliers: an equivalent problem?

Page 12: Directed Graphical Models

Lagrange multipliers: an equivalent problem?

Page 13: Directed Graphical Models

Lagrange multipliers: an equivalent problem?

Page 14: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 15: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

$w_1 = 1, \quad w_2 = 5, \quad w_3 = 4$

Generative Story

for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

$\theta$ is a probability distribution over the 6 sides of the die: $\sum_{k=1}^{6} \theta_k = 1$, $\quad 0 \le \theta_k \le 1 \; \forall k$
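As a small illustration of this generative story, here is a minimal Python sketch (my addition, not from the slides); the fair-die value for $\theta$ is an assumption:

import numpy as np

rng = np.random.default_rng(0)
theta = np.full(6, 1 / 6)                     # probability of each of the 6 sides (sums to 1)
N = 10
rolls = rng.choice(6, size=N, p=theta) + 1    # w_i ~ Cat(theta), reported as sides 1..6
print(rolls)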

Page 16: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

$w_1 = 1, \quad w_2 = 5, \quad w_3 = 4$

Generative Story

for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood

$\mathcal{L}(\theta) = \sum_i \log p_\theta(w_i) = \sum_i \log \theta_{w_i}$

Page 17: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Generative Story

for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood

$\mathcal{L}(\theta) = \sum_i \log \theta_{w_i}$

Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

Page 18: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Generative Story

for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood

$\mathcal{L}(\theta) = \sum_i \log \theta_{w_i}$

Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

A: Just keep increasing each $\theta_k$ (we know $\theta$ must be a distribution, but that constraint is not part of the objective as written)

Page 19: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints)

$\mathcal{L}(\theta) = \sum_i \log \theta_{w_i} \quad \text{s.t.} \quad \sum_{k=1}^{6} \theta_k = 1$

(we can include the inequality constraints $0 \le \theta_k$, but doing so complicates the problem and, right now, is not needed)

solve using Lagrange multipliers

Page 20: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints)

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

(we can include the inequality constraints $0 \le \theta_k$, but doing so complicates the problem and, right now, is not needed)

$\dfrac{\partial \mathcal{F}(\theta)}{\partial \theta_k} = \sum_{i:\, w_i = k} \dfrac{1}{\theta_{w_i}} - \lambda \qquad \dfrac{\partial \mathcal{F}(\theta)}{\partial \lambda} = -\sum_{k=1}^{6} \theta_k + 1$

Page 21: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints)

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

(we can include the inequality constraints $0 \le \theta_k$, but doing so complicates the problem and, right now, is not needed)

$\theta_k = \dfrac{\sum_{i:\, w_i = k} 1}{\lambda}$, with the optimal $\lambda$ found when $\sum_{k=1}^{6} \theta_k = 1$

Page 22: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints)

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

(we can include the inequality constraints $0 \le \theta_k$, but doing so complicates the problem and, right now, is not needed)

$\theta_k = \dfrac{\sum_{i:\, w_i = k} 1}{\sum_{k'} \sum_{i:\, w_i = k'} 1} = \dfrac{N_k}{N}$, using the optimal $\lambda$ found when $\sum_{k=1}^{6} \theta_k = 1$
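A minimal Python sketch (my addition, not from the slides) of this closed-form answer, with made-up example rolls:

import numpy as np

rolls = np.array([1, 5, 4, 1, 2, 6, 1, 3])    # example observations w_1..w_N
counts = np.bincount(rolls - 1, minlength=6)  # N_k = number of times side k was rolled
theta_mle = counts / counts.sum()             # theta_k = N_k / N
print(theta_mle)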

Page 23: Directed Graphical Models

Example: Conditionally Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$

add complexity to better explain what we see

$w_1 = 1, \quad w_2 = 5 \qquad z_1 = H, \quad z_2 = T$

$p(\text{heads}) = \lambda, \quad p(\text{tails}) = 1 - \lambda$

$p(\text{heads}) = \gamma, \quad p(\text{tails}) = 1 - \gamma \qquad p(\text{heads}) = \psi, \quad p(\text{tails}) = 1 - \psi$

Page 24: Directed Graphical Models

Example: Conditionally Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$

add complexity to better explain what we see

$p(\text{heads}) = \lambda, \quad p(\text{tails}) = 1 - \lambda$

$p(\text{heads}) = \gamma, \quad p(\text{tails}) = 1 - \gamma \qquad p(\text{heads}) = \psi, \quad p(\text{tails}) = 1 - \psi$

Generative Story

$\lambda$ = distribution over penny, $\gamma$ = distribution for dollar coin, $\psi$ = distribution over dime

for item $i = 1$ to $N$:
  $z_i \sim \mathrm{Bernoulli}(\lambda)$
  if $z_i = H$: $w_i \sim \mathrm{Bernoulli}(\gamma)$
  else: $w_i \sim \mathrm{Bernoulli}(\psi)$
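A minimal Python sketch of this coin-flipping generative story (my addition; the values of λ, γ, ψ are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
lam, gamma, psi = 0.5, 0.8, 0.3          # penny, dollar-coin, and dime head probabilities (assumed)
N = 5
z = rng.random(N) < lam                  # z_i ~ Bernoulli(lambda): True means heads
w = np.where(z,
             rng.random(N) < gamma,      # if z_i = H: w_i ~ Bernoulli(gamma)
             rng.random(N) < psi)        # else:       w_i ~ Bernoulli(psi)
print(z, w)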

Page 25: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 26: Directed Graphical Models

Classify with Bayes Rule

$\operatorname{argmax}_Y\; p(Y \mid X) \;=\; \operatorname{argmax}_Y\; \log p(X \mid Y) + \log p(Y)$

(the first term is the likelihood, the second the prior)

Page 27: Directed Graphical Models

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

Page 28: Directed Graphical Models

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

Page 29: Directed Graphical Models

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

Page 30: Directed Graphical Models

Bag of Words Representation

[figure: a document is reduced to bag-of-words counts (seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …), and the classifier γ maps it to a class c]

Adapted from Jurafsky & Martin (draft)

Page 31: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story (global parameters):

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

Page 32: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

for item $i = 1$ to $N$: $y_i \sim \mathrm{Cat}(\phi)$

Page 33: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

[graphical model: label node $y$ with children $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}$ (the local variables)]

Page 34: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

each $x_{ij}$ is conditionally independent of the others (given the label)

[graphical model: label node $y$ with children $x_{i1}, \ldots, x_{i5}$]

Page 35: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

Maximize Log-likelihood

$\mathcal{L}(\theta) = \sum_i \sum_j \log F_{y_i}(x_{ij}; \theta_{y_i}) + \sum_i \log \phi_{y_i} \quad \text{s.t.} \quad \sum_k \phi_k = 1, \;\; \theta_k \text{ is valid for } F_k$

[graphical model: label node $y$ with children $x_{i1}, \ldots, x_{i5}$]

Page 36: Directed Graphical Models

Multinomial Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = distribution over $J$ feature values

for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i, j})$

Maximize Log-likelihood

$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i} \quad \text{s.t.} \quad \sum_k \phi_k = 1, \;\; \sum_j \theta_{kj} = 1 \; \forall k$

[graphical model: label node $y$ with children $x_{i1}, \ldots, x_{i5}$]

Page 37: Directed Graphical Models

Multinomial Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = distribution over $J$ feature values

for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i, j})$

Maximize Log-likelihood via Lagrange Multipliers

$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i} - \mu \left( \sum_k \phi_k - 1 \right) - \sum_k \lambda_k \left( \sum_j \theta_{kj} - 1 \right)$

[graphical model: label node $y$ with children $x_{i1}, \ldots, x_{i5}$]

Page 38: Directed Graphical Models

Multinomial Naïve Bayes: Learning

Calculate class priors:
For each k: items_k = all items with class = k

$p(k) = \dfrac{|\text{items}_k|}{\#\,\text{items}}$

Calculate feature generation terms:
For each k:
  obs_k = single object containing all items labeled as k
  For each feature j: n_kj = # of occurrences of j in obs_k

$p(j \mid k) = \dfrac{n_{kj}}{\sum_{j'} n_{kj'}}$
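A minimal Python sketch (my addition, not from the slides) of these count-based estimates, assuming each item is a (label, feature-count dictionary) pair with made-up data:

from collections import Counter, defaultdict

items = [("pos", {"seen": 2, "sweet": 1}),
         ("neg", {"boring": 3}),
         ("pos", {"happy": 1, "seen": 1})]

prior_counts = Counter(label for label, _ in items)
feature_counts = defaultdict(Counter)
for label, feats in items:
    feature_counts[label].update(feats)                     # accumulate n_kj per class k

p_k = {k: c / len(items) for k, c in prior_counts.items()}  # p(k) = |items_k| / # items
p_j_given_k = {k: {j: n / sum(cnts.values()) for j, n in cnts.items()}
               for k, cnts in feature_counts.items()}       # p(j|k) = n_kj / sum_j' n_kj'
print(p_k)
print(p_j_given_k["pos"])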

Page 39: Directed Graphical Models

Brill and Banko (2001)

With enough data, the classifier may not matter

Adapted from Jurafsky & Martin (draft)

Page 40: Directed Graphical Models

Summary: Naïve Bayes is Not So Naïve, but not without issue

Pro:
Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)

Con:
Model the posterior in one go? (e.g., use conditional maxent)
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” ways of handling missing/noisy data? (automated, more principled)

Adapted from Jurafsky & Martin (draft)

Page 41: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 42: Directed Graphical Models

Hidden Markov Models

p(British Left Waffles on Falkland Islands)

Two candidate class (tag) sequences, (i) and (ii):
Adjective Noun Verb Prep Noun Noun
Noun Verb Noun Prep Noun Noun

Class-based model: $p(w_i \mid z_i)$

Bigram model of the classes: $p(z_i \mid z_{i-1})$

Model all class sequences: $\sum_{z_1, \ldots, z_N} p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N)$

Page 43: Directed Graphical Models

Hidden Markov Model

Goal: maximize (log-)likelihood

In practice: we don’t actually observe these z values; we just see the words w

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Page 44: Directed Graphical Models

Hidden Markov Model

Goal: maximize (log-)likelihood

In practice: we don’t actually observe these z values; we just see the words w

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

Page 45: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Page 46: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

transition probabilities/parameters: $p(z_i \mid z_{i-1})$

Page 47: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$

Page 48: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$

Page 49: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$

Q: How many different probability values are there with K states and V vocab items?

Page 50: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$

Q: How many different probability values are there with K states and V vocab items?

A: V·K emission values and K² transition values

Page 51: Directed Graphical Models

Hidden Markov Model Representation

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$

[graphical model: chain $z_1 \to z_2 \to z_3 \to z_4$, each $z_i$ emitting $w_i$]

represent the probabilities and independence assumptions in a graph

Page 52: Directed Graphical Models

Hidden Markov Model Representation

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$

[graphical model: chain $z_1 \to z_2 \to z_3 \to z_4$, each $z_i$ emitting $w_i$; arcs labeled $p(z_1 \mid z_0), p(z_2 \mid z_1), p(z_3 \mid z_2), p(z_4 \mid z_3)$ and $p(w_1 \mid z_1), \ldots, p(w_4 \mid z_4)$]

$p(z_1 \mid z_0)$ is the initial starting distribution (“__SEQ_SYM__”)

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

Page 53: Directed Graphical Models

Example: 2-state Hidden Markov Model as a Lattice

[lattice: states $z_i = N$ and $z_i = V$ at each position $i = 1, \ldots, 4$, emitting $w_1, \ldots, w_4$]

Transitions (row = previous state):
        N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emissions:
      w1    w2    w3    w4
N     .7    .2    .05   .05
V     .2    .6    .1    .1

Page 54: Directed Graphical Models

Example: 2-state Hidden Markov Model as a Lattice

[lattice: the same two-state lattice, now with the emission arcs $p(w_1 \mid N), \ldots, p(w_4 \mid N)$ and $p(w_1 \mid V), \ldots, p(w_4 \mid V)$ drawn]

Transitions (row = previous state):
        N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emissions:
      w1    w2    w3    w4
N     .7    .2    .05   .05
V     .2    .6    .1    .1

Page 55: Directed Graphical Models

Example: 2-state Hidden Markov Model as a Lattice

[lattice: emission arcs as before, plus transition arcs $p(N \mid \text{start})$, $p(V \mid \text{start})$, $p(N \mid N)$, and $p(V \mid V)$ drawn]

Transitions (row = previous state):
        N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emissions:
      w1    w2    w3    w4
N     .7    .2    .05   .05
V     .2    .6    .1    .1

Page 56: Directed Graphical Models

Example: 2-state Hidden Markov Model as a Lattice

[lattice: all transition arcs between N and V at adjacent positions ($p(N \mid N)$, $p(V \mid N)$, $p(N \mid V)$, $p(V \mid V)$), plus the emission arcs]

Transitions (row = previous state):
        N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emissions:
      w1    w2    w3    w4
N     .7    .2    .05   .05
V     .2    .6    .1    .1

Page 57: Directed Graphical Models

A Latent Sequence is a Path through the Graph

[lattice: the single path start → N → V → V → N highlighted, with its transition and emission arcs]

Q: What’s the probability of (N, w1), (V, w2), (V, w3), (N, w4)?

A: (.7 × .7) × (.8 × .6) × (.35 × .1) × (.6 × .05) ≈ 0.000247

Transitions (row = previous state):
        N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emissions:
      w1    w2    w3    w4
N     .7    .2    .05   .05
V     .2    .6    .1    .1
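A quick Python check of that arithmetic (my addition), reading the needed entries off the two tables above:

# multiply transition * emission along the path start -> N -> V -> V -> N
trans = {("start", "N"): .7, ("N", "V"): .8, ("V", "V"): .35, ("V", "N"): .6}
emit = {("N", "w1"): .7, ("V", "w2"): .6, ("V", "w3"): .1, ("N", "w4"): .05}

path = [("N", "w1"), ("V", "w2"), ("V", "w3"), ("N", "w4")]
prob, prev = 1.0, "start"
for state, word in path:
    prob *= trans[(prev, state)] * emit[(state, word)]
    prev = state
print(prob)   # ~ 0.000247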

Page 58: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 59: Directed Graphical Models

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.

If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.

If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side

ITILA, Ch 16

Page 60: Directed Graphical Models

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.

If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.

If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side

ITILA, Ch 16

Page 61: Directed Graphical Models

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.

If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.

If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side

ITILA, Ch 16

Page 62: Directed Graphical Models

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.

If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.

If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side

ITILA, Ch 16

Page 63: Directed Graphical Models

What’s the Maximum Weighted Path?

[figure: example graph with weighted nodes]

Page 64: Directed Graphical Models

What’s the Maximum Weighted Path?

[figure: example graph with weighted nodes]

Page 65: Directed Graphical Models

What’s the Maximum Weighted Path?

[figure: example graph with weighted nodes; partial sums accumulate along the path]

Page 66: Directed Graphical Models

What’s the Maximum Weighted Path?

[figure: example graph with weighted nodes; partial sums accumulate along the path]

Page 67: Directed Graphical Models

What’s the Maximum Weighted Path?

[figure: example graph with weighted nodes; accumulated partial sums passed forward to the next layer]

Page 68: Directed Graphical Models

What’s the Maximum Weighted Path?

[figure: example graph with weighted nodes; accumulated partial sums passed forward to the next layer]

Page 69: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 70: Directed Graphical Models

What’s the Maximum Value?

consider any path ending in state B (arriving via A→B, B→B, or C→B)

maximize across the previous hidden state values

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

v(i, B) is the maximum probability of any path to state B from the beginning (and emitting the observation)

[lattice: states A, B, C at steps i-1 and i]

Page 71: Directed Graphical Models

What’s the Maximum Value?

consider any path ending in state B (arriving via A→B, B→B, or C→B)

maximize across the previous hidden state values

[lattice: states A, B, C at steps i-2, i-1, and i]

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

v(i, B) is the maximum probability of any path to state B from the beginning (and emitting the observation)

Page 72: Directed Graphical Models

What’s the Maximum Value?

consider any path ending in state B (arriving via A→B, B→B, or C→B)

maximize across the previous hidden state values

[lattice: states A, B, C at steps i-2, i-1, and i]

computing v at time i-1 will correctly incorporate (maximize over) paths through time i-2: we correctly obey the Markov property

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

v(i, B) is the maximum probability of any path to state B from the beginning (and emitting the observation)

Viterbi algorithm

Page 73: Directed Graphical Models

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]      // backpointers/book-keeping
v[*][*] = 0
v[0][START] = 1

for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    pobs = pemission(obs_i | state)
    for (old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      if (v[i-1][old] * pobs * pmove > v[i][state]) {
        v[i][state] = v[i-1][old] * pobs * pmove
        b[i][state] = old
      }
    }
  }
}
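For concreteness, here is a runnable Python sketch of the same recursion (my addition; the dictionary-based probability tables and the explicit START handling are assumptions, not the slide’s exact data structures):

def viterbi(obs, states, ptrans, pemit):
    # v[i][s]: best probability of any state sequence emitting obs[:i] and ending in s
    v = [{"START": 1.0}] + [dict() for _ in obs]
    b = [dict() for _ in range(len(obs) + 1)]      # backpointers
    for i, o in enumerate(obs, start=1):
        for s in states:
            best_old, best_p = None, 0.0
            for old, p_old in v[i - 1].items():
                p = p_old * ptrans.get((old, s), 0.0) * pemit.get((s, o), 0.0)
                if p > best_p:
                    best_old, best_p = old, p
            v[i][s], b[i][s] = best_p, best_old
    last = max(v[-1], key=v[-1].get)               # best final state
    path = [last]
    for i in range(len(obs), 1, -1):               # follow backpointers
        path.append(b[i][path[-1]])
    return list(reversed(path)), v[-1][last]

# example using the 2-state N/V tables from the lattice slides
states = ["N", "V"]
ptrans = {("START", "N"): .7, ("START", "V"): .2, ("N", "N"): .15, ("N", "V"): .8,
          ("V", "N"): .6, ("V", "V"): .35}
pemit = {("N", "w1"): .7, ("N", "w2"): .2, ("N", "w3"): .05, ("N", "w4"): .05,
         ("V", "w1"): .2, ("V", "w2"): .6, ("V", "w3"): .1, ("V", "w4"): .1}
print(viterbi(["w1", "w2", "w3", "w4"], states, ptrans, pemit))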

Page 74: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 75: Directed Graphical Models

Forward Probability

α(i, B) is the total probability of all paths to that state B from the beginning

[lattice: states A, B, C at steps i-2, i-1, and i]

Page 76: Directed Graphical Models

Forward Probability

marginalize across the previous hidden state values

$\alpha(i, B) = \sum_{s'} \alpha(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property

α(i, B) is the total probability of all paths to that state B from the beginning

[lattice: states A, B, C at steps i-2, i-1, and i]

Page 77: Directed Graphical Models

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

Page 78: Directed Graphical Models

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

how likely is it to get into state s this way?

what are the immediate ways to get into state s?

what’s the total probability up until now?

Page 79: Directed Graphical Models

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

how likely is it to get into state s this way?

what are the immediate ways to get into state s?

what’s the total probability up until now?

Q: What do we return? (How do we return the likelihood of the sequence?) A: α[N+1][end]
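A matching Python sketch of the forward recursion (my addition; same assumed dictionary-based setup as the Viterbi sketch earlier, with an optional END-transition table). Summing instead of maximizing is the only change relative to Viterbi.

def forward(obs, states, ptrans, pemit, pend=None):
    # alpha[i][s]: total probability of all paths that emit obs[:i] and end in state s
    alpha = [{"START": 1.0}] + [dict() for _ in obs]
    for i, o in enumerate(obs, start=1):
        for s in states:
            alpha[i][s] = pemit.get((s, o), 0.0) * sum(
                a * ptrans.get((old, s), 0.0) for old, a in alpha[i - 1].items())
    if pend is not None:                     # alpha[N+1][END]: fold in the end transition
        return alpha, sum(a * pend.get(s, 0.0) for s, a in alpha[-1].items())
    return alpha, sum(alpha[-1].values())    # otherwise just sum over the final states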

Page 80: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models
  Naïve Bayes
  Hidden Markov Models

Message Passing: Directed Graphical Model Inference
  Most likely sequence
  Total (marginal) probability
  EM in D-PGMs

Page 81: Directed Graphical Models

Forward & Backward Message Passing

[lattice: states A, B, C at steps i-1, i, and i+1]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end
3. (that emit the observation obs at i+1)

Page 82: Directed Graphical Models

Forward & Backward Message Passing

[lattice: states A, B, C at steps i-1, i, and i+1]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B) β(i, B)

Page 83: Directed Graphical Models

Forward & Backward Message Passing

[lattice: states A, B, C at steps i-1, i, and i+1]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B) β(i, B)

α(i, B) * β(i, B) = total probability of paths through state B at step i

Page 84: Directed Graphical Models

Forward & Backward Message Passing

[lattice: states A, B, C at steps i-1, i, and i+1]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B) β(i+1, s)


Page 85: Directed Graphical Models

Forward & Backward Message Passing

[lattice: states A, B, C at steps i-1, i, and i+1]

α(i, B) β(i+1, s’)


α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B) * p(s’ | B) * p(obs at i+1 | s’) * β(i+1, s’) = total probability of paths through the B→s’ arc (at time i)

Page 86: Directed Graphical Models

With Both Forward and Backward Values

α(i, s) * p(s’ | s) * p(obs at i+1 | s’) * β(i+1, s’) = total probability of paths through the s→s’ arc (at time i)

α(i, s) * β(i, s) = total probability of paths through state s at step i

$p(z_i = s \mid w_1, \ldots, w_N) = \dfrac{\alpha(i, s)\, \beta(i, s)}{\alpha(N+1, \text{END})}$

$p(z_i = s, z_{i+1} = s' \mid w_1, \ldots, w_N) = \dfrac{\alpha(i, s)\, p(s' \mid s)\, p(\text{obs}_{i+1} \mid s')\, \beta(i+1, s')}{\alpha(N+1, \text{END})}$
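A small Python sketch (my addition) of these two posteriors, assuming alpha and beta have already been computed as dictionaries keyed by (position, state) and L = alpha[(N+1, "END")]:

def state_posterior(alpha, beta, L, i, s):
    # p(z_i = s | w_1..w_N) = alpha(i, s) * beta(i, s) / alpha(N+1, END)
    return alpha[(i, s)] * beta[(i, s)] / L

def arc_posterior(alpha, beta, L, ptrans, pemit, i, s, s_next, obs_next):
    # p(z_i = s, z_{i+1} = s' | w_1..w_N)
    return (alpha[(i, s)] * ptrans[(s, s_next)] *
            pemit[(s_next, obs_next)] * beta[(i + 1, s_next)]) / L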

Page 87: Directed Graphical Models

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

Page 88: Directed Graphical Models

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

pobs(w | s)

ptrans(s’ | s)

Page 89: Directed Graphical Models

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

pobs(w | s)

ptrans(s’ | s)

$p^*(z_i = s \mid w_1, \ldots, w_N) = \dfrac{\alpha(i, s)\, \beta(i, s)}{\alpha(N+1, \text{END})}$

$p^*(z_i = s, z_{i+1} = s' \mid w_1, \ldots, w_N) = \dfrac{\alpha(i, s)\, p(s' \mid s)\, p(\text{obs}_{i+1} \mid s')\, \beta(i+1, s')}{\alpha(N+1, \text{END})}$

Page 90: Directed Graphical Models

M-Step

“maximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{c(s \to s')}{\sum_x c(s \to x)}$

if we observed the hidden transitions…

Page 91: Directed Graphical Models

M-Step

“maximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{\mathbb{E}[c(s \to s')]}{\sum_x \mathbb{E}[c(s \to x)]}$

we don’t observe the hidden transitions, but we can approximately count them

Page 92: Directed Graphical Models

M-Step

“maximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{\mathbb{E}[c(s \to s')]}{\sum_x \mathbb{E}[c(s \to x)]}$

we don’t observe the hidden transitions, but we can approximately count them

we compute these in the E-step, with our α and β values

Page 93: Directed Graphical Models

EM for HMMs (Baum-Welch Algorithm)

α = computeForwards()
β = computeBackwards()
L = α[N+1][END]

for (i = N; i ≥ 0; --i) {
  for (next = 0; next < K*; ++next) {
    cobs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
    for (state = 0; state < K*; ++state) {
      u = pobs(obs_{i+1} | next) * ptrans(next | state)
      ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
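After the E-step loop above accumulates expected counts, the M-step just renormalizes them; a minimal Python sketch of that normalization (my addition, with counts assumed to be a dict keyed by (conditioning state, outcome)):

def normalize(counts):
    # counts: {(cond, outcome): expected count}  ->  {(cond, outcome): probability}
    totals = {}
    for (cond, _), c in counts.items():
        totals[cond] = totals.get(cond, 0.0) + c
    return {(cond, out): c / totals[cond] for (cond, out), c in counts.items()}

# e.g. ptrans_new = normalize(ctrans); pobs_new = normalize(cobs)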

Page 94: Directed Graphical Models

Bayesian Networks: Directed Acyclic Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

“parents of”

topological sort

Page 95: Directed Graphical Models

Bayesian Networks: Directed Acyclic Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

exact inference in general DAGs is NP-hard

inference in trees can be exact

Page 96: Directed Graphical Models

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation

X & Y are d-separated if for all paths P, one of the following is true:

P has a chain with an observed middle node

P has a fork with an observed parent node

P includes a “v-structure” or “collider” with all unobserved descendants

[diagrams: the three path types between X and Y]

Page 97: Directed Graphical Models

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation

X & Y are d-separated if for all paths P, one of the following is true:

P has a chain with an observed middle node

P has a fork with an observed parent node

P includes a “v-structure” or “collider” with all unobserved descendants

[diagrams: chain X → Z → Y and fork X ← Z → Y, where observing Z blocks the path from X to Y; collider X → Z ← Y, where not observing Z blocks the path from X to Y]

Page 98: Directed Graphical Models

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation

X & Y are d-separated if for all paths P, one of the following is true:

P has a chain with an observed middle node

P has a fork with an observed parent node

P includes a “v-structure” or “collider” with all unobserved descendants

[diagrams: chain X → Z → Y and fork X ← Z → Y, where observing Z blocks the path from X to Y; collider X → Z ← Y, where not observing Z blocks the path from X to Y]

$p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y)$

$p(x, y) = \sum_z p(x)\, p(y)\, p(z \mid x, y) = p(x)\, p(y)$

Page 99: Directed Graphical Models

Markov Blanket

Markov blanket of a node x is its parents, children, and children's parents

$p(x_i \mid x_{j \ne i}) = \dfrac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_i}$

$= \dfrac{\prod_k p(x_k \mid \pi(x_k))}{\int \prod_k p(x_k \mid \pi(x_k))\, dx_i}$  (factorization of graph)

$= \dfrac{\prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))}{\int \prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))\, dx_i}$  (factor out terms not dependent on $x_i$)

the Markov blanket is the set of nodes needed to form the complete conditional for a variable $x_i$