Directed Probabilistic Graphical Models
CMSC 678
UMBC
Announcement 1: Assignment 3
Due Wednesday April 11th, 11:59 AM
Any questions?
Announcement 2: Progress Report on Project
Due Monday April 16th, 11:59 AM
Build on the proposal:
- Update to address comments
- Discuss the progress you've made
- Discuss what remains to be done
- Discuss any new blocks you've experienced (or anticipate experiencing)
Any questions?
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Recap from last time…
Expectation Maximization (EM): E-step
Two step, iterative algorithm
0. Assume some value for your parameters
1. E-step: count under uncertainty, assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
(figure: estimated counts $\mathrm{count}(z_i, w_i)\,p(z_i)$, computed under the current parameters $p^{(t)}(z)$, feed the update to the new parameters $p^{(t+1)}(z)$)
EM Math
$$\max_{\theta} \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)} \big[ \log p_{\theta}(z, w) \big]$$
E-step: count under uncertainty (the expectation is taken under the posterior distribution given the old parameters $\theta^{(t)}$)
M-step: maximize log-likelihood (with respect to the new parameters $\theta$)
$\mathcal{C}(\theta)$ = log-likelihood of complete data $(X, Y)$
$\mathcal{P}(\theta)$ = posterior log-likelihood of incomplete data $Y$
$\mathcal{M}(\theta)$ = marginal log-likelihood of observed data $X$
$$\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}\big[\mathcal{C}(\theta) \mid X\big] - \mathbb{E}_{Y \sim \theta^{(t)}}\big[\mathcal{P}(\theta) \mid X\big]$$
EM does not decrease the marginal log-likelihood
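One way to see that last claim, as a standard sketch using the decomposition above (all expectations over $Y \sim \theta^{(t)}$ given $X$):
$$\mathcal{M}(\theta) - \mathcal{M}(\theta^{(t)})
= \underbrace{\Big(\mathbb{E}[\mathcal{C}(\theta) \mid X] - \mathbb{E}[\mathcal{C}(\theta^{(t)}) \mid X]\Big)}_{\ge\, 0 \text{ by the M-step}}
+ \underbrace{\Big(\mathbb{E}[\mathcal{P}(\theta^{(t)}) \mid X] - \mathbb{E}[\mathcal{P}(\theta) \mid X]\Big)}_{=\, \mathrm{KL}\big(p_{\theta^{(t)}}(Y \mid X)\,\|\,p_{\theta}(Y \mid X)\big) \,\ge\, 0}
\;\ge\; 0$$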
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Lagrange multipliers
Assume an original (constrained) optimization problem.
We convert it to a new, unconstrained optimization problem.
Is the new problem equivalent to the original one?
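As a sketch of the standard setup (the same recipe applied to the die-rolling example in the next section): to maximize $f(\theta)$ subject to an equality constraint $g(\theta) = 0$, find stationary points of the Lagrangian instead.
$$\mathcal{F}(\theta, \lambda) = f(\theta) - \lambda\, g(\theta), \qquad
\frac{\partial \mathcal{F}}{\partial \theta} = 0, \quad
\frac{\partial \mathcal{F}}{\partial \lambda} = -g(\theta) = 0$$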
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Probabilistic Estimation of Rolling a Die
$$p(w_1, w_2, \ldots, w_N) = p(w_1)\,p(w_2)\cdots p(w_N) = \prod_i p(w_i)$$
N different (independent) rolls, e.g. $w_1 = 1$, $w_2 = 5$, $w_3 = 4$, …
Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$, where $\theta$ is a probability distribution over the 6 sides of the die: $\sum_{k=1}^{6} \theta_k = 1$, $0 \le \theta_k \le 1\ \forall k$

Maximize Log-likelihood
$$\mathcal{L}(\theta) = \sum_i \log p_\theta(w_i) = \sum_i \log \theta_{w_i}$$
Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing $\theta_k$ (we know $\theta$ must be a distribution, but that constraint isn't part of the objective as written)

Maximize Log-likelihood (with distribution constraints)
$$\mathcal{L}(\theta) = \sum_i \log \theta_{w_i} \quad \text{s.t.} \quad \sum_{k=1}^{6} \theta_k = 1$$
(we could also include the inequality constraints $0 \le \theta_k$, but they complicate the problem and, right now, are not needed)
Solve using Lagrange multipliers:
$$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$$
$$\frac{\partial \mathcal{F}(\theta)}{\partial \theta_k} = \sum_{i:\, w_i = k} \frac{1}{\theta_{w_i}} - \lambda
\qquad\qquad
\frac{\partial \mathcal{F}(\theta)}{\partial \lambda} = -\sum_{k=1}^{6} \theta_k + 1$$
Setting the derivatives to zero gives $\theta_k = \frac{\sum_{i:\, w_i = k} 1}{\lambda}$, with the optimal $\lambda$ chosen so that $\sum_{k=1}^{6} \theta_k = 1$:
$$\theta_k = \frac{\sum_{i:\, w_i = k} 1}{\sum_{k'} \sum_{i:\, w_i = k'} 1} = \frac{N_k}{N}$$
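A minimal sketch of this closed-form answer in Python (the names `die_mle` and `rolls` are illustrative, not from the slides): counting each face and normalizing recovers $\theta_k = N_k / N$.

```python
from collections import Counter

def die_mle(rolls, num_sides=6):
    """Closed-form MLE for a categorical die: theta_k = N_k / N."""
    counts = Counter(rolls)
    n = len(rolls)
    return [counts.get(k, 0) / n for k in range(1, num_sides + 1)]

# Example: the observed rolls w_1 = 1, w_2 = 5, w_3 = 4 from the slides
print(die_mle([1, 5, 4]))  # [1/3, 0, 0, 1/3, 1/3, 0]
```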
Example: Conditionally Rolling a Die
Add complexity to better explain what we see. Instead of a single independent model,
$$p(w_1, w_2, \ldots, w_N) = p(w_1)\,p(w_2)\cdots p(w_N) = \prod_i p(w_i),$$
condition each observation on a hidden choice $z_i$:
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\,p(w_1 \mid z_1) \cdots p(z_N)\,p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$$
e.g. observations $w_1 = 1$, $w_2 = 5$, … with hidden values $z_1 = H$, $z_2 = T$, …
Parameters: $\lambda$ = distribution over the penny ($p(\text{heads}) = \lambda$, $p(\text{tails}) = 1 - \lambda$); $\gamma$ = distribution for the dollar coin ($p(\text{heads}) = \gamma$, $p(\text{tails}) = 1 - \gamma$); $\psi$ = distribution over the dime ($p(\text{heads}) = \psi$, $p(\text{tails}) = 1 - \psi$)
Generative Story:
for item $i = 1$ to $N$:
  $z_i \sim \mathrm{Bernoulli}(\lambda)$
  if $z_i = H$: $w_i \sim \mathrm{Bernoulli}(\gamma)$
  else: $w_i \sim \mathrm{Bernoulli}(\psi)$
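A hedged sketch of this generative story as sampling code (the parameter values below are made up purely for illustration):

```python
import random

def sample_coin_mixture(n, lam=0.5, gamma=0.9, psi=0.2):
    """Generative story: flip the penny (prob. lam of heads) to pick a coin,
    then flip the chosen coin (dollar coin: gamma, dime: psi)."""
    data = []
    for _ in range(n):
        z = 'H' if random.random() < lam else 'T'        # hidden choice
        p_heads = gamma if z == 'H' else psi             # chosen coin's bias
        w = 'H' if random.random() < p_heads else 'T'    # observed flip
        data.append((z, w))
    return data

print(sample_coin_mixture(5))
```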
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Classify with Bayes Rule
$$\operatorname*{argmax}_Y \; p(Y \mid X) = \operatorname*{argmax}_Y \; \big[ \log p(X \mid Y) + \log p(Y) \big]$$
(likelihood term: $\log p(X \mid Y)$; prior term: $\log p(Y)$)
The Bag of Words Representation
(figures adapted from Jurafsky & Martin, draft)
Bag of Words Representation
(figure: a document is reduced to word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …, and fed to a classifier: γ(document) = c)
Adapted from Jurafsky & Martin (draft)
Naïve Bayes: A Generative Story
Generative Story:
- $\phi$ = distribution over $K$ labels (global parameters)
- for label $k = 1$ to $K$: $\theta_k$ = generate parameters
- for item $i = 1$ to $N$ (local variables):
  - $y_i \sim \mathrm{Cat}(\phi)$
  - for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$
(graph: label $y$ is the parent of the features $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}$)
Each $x_{ij}$ is conditionally independent of the others, given the label.
Maximize Log-likelihood:
$$\mathcal{L}(\theta) = \sum_i \sum_j \log F_{y_i}(x_{ij}; \theta_{y_i}) + \sum_i \log \phi_{y_i}
\quad \text{s.t.} \quad \sum_k \phi_k = 1, \;\; \theta_k \text{ is valid for } F_k$$
Multinomial Naïve Bayes: A Generative Story
Generative Story:
- $\phi$ = distribution over $K$ labels
- for label $k = 1$ to $K$: $\theta_k$ = distribution over $J$ feature values
- for item $i = 1$ to $N$:
  - $y_i \sim \mathrm{Cat}(\phi)$
  - for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i,j})$
Maximize Log-likelihood:
$$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i}
\quad \text{s.t.} \quad \sum_k \phi_k = 1, \;\; \sum_j \theta_{kj} = 1 \;\forall k$$
Maximize Log-likelihood via Lagrange Multipliers:
$$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i}
- \mu \left( \sum_k \phi_k - 1 \right) - \sum_k \lambda_k \left( \sum_j \theta_{kj} - 1 \right)$$
Multinomial Naïve Bayes: Learning
Calculate class priors:
  For each k: items_k = all items with class = k
  $$p(k) = \frac{|\text{items}_k|}{\#\,\text{items}}$$
Calculate feature generation terms:
  For each k: obs_k = single object containing all items labeled as k
  For each feature j: $n_{kj}$ = # of occurrences of j in obs_k
  $$p(j \mid k) = \frac{n_{kj}}{\sum_{j'} n_{kj'}}$$
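A minimal sketch of these counting steps in Python (function and variable names are illustrative; no smoothing, exactly as on the slide):

```python
from collections import Counter, defaultdict

def train_multinomial_nb(items):
    """items: list of (label, list_of_feature_values). Returns class priors
    p(k) = |items_k| / #items and feature terms p(j|k) = n_kj / sum_j' n_kj'."""
    label_counts = Counter(label for label, _ in items)
    feature_counts = defaultdict(Counter)
    for label, features in items:
        feature_counts[label].update(features)

    n = len(items)
    priors = {k: c / n for k, c in label_counts.items()}
    likelihoods = {
        k: {j: c / sum(counts.values()) for j, c in counts.items()}
        for k, counts in feature_counts.items()
    }
    return priors, likelihoods

priors, likelihoods = train_multinomial_nb([
    ("pos", ["sweet", "happy", "happy"]),
    ("neg", ["boring", "boring"]),
])
print(priors["pos"], likelihoods["pos"]["happy"])  # 0.5, 2/3
```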
Brill and Banko (2001)
With enough data, the classifier may not matter
Adapted from Jurafsky & Martin (draft)
Summary: Naïve Bayes is Not So Naïve, but not without issue
Pro:
- Very fast, low storage requirements
- Robust to irrelevant features
- Very good in domains with many equally important features
- Optimal if the independence assumptions hold
- Dependable baseline for text classification (but often not the best)
Con:
- Model the posterior in one go? (e.g., use conditional maxent)
- Are the features really uncorrelated?
- Are plain counts always appropriate?
- Are there "better" (automated, more principled) ways of handling missing/noisy data?
Adapted from Jurafsky & Martin (draft)
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Hidden Markov Models
p(British Left Waffles on Falkland Islands)
Two possible tag sequences:
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Model all class sequences, with a class-based model of the words and a bigram model of the classes:
$$\sum_{z_1, \ldots, z_N} p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = \sum_{z_1, \ldots, z_N} \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$$
Hidden Markov Model
Goal: maximize (log-)likelihood
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$$
In practice: we don't actually observe these z values; we just see the words w.
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
If we did observe z, estimating the probability parameters would be easy… but we don't! :(
Hidden Markov Model Terminology
Each $z_i$ can take the value of one of K latent states.
Transition and emission distributions do not change.
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$$
transition probabilities/parameters: $p(z_i \mid z_{i-1})$
emission probabilities/parameters: $p(w_i \mid z_i)$
Q: How many different probability values are there with K states and V vocab items?
A: VK emission values and K² transition values
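For example, with the K = 2 tagset {N, V} and the 4-word vocabulary used in the lattice example below, that is 2 × 4 = 8 emission values and 2 × 2 = 4 transition values (plus the start/end transitions shown in the tables).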
Hidden Markov Model Representation
$$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$$
(transition probabilities/parameters: $p(z_i \mid z_{i-1})$; emission probabilities/parameters: $p(w_i \mid z_i)$)
Represent the probabilities and independence assumptions in a graph:
(figure: chain z1 → z2 → z3 → z4 → …, each z_i emitting w_i; transition edges labeled p(z2 | z1), p(z3 | z2), p(z4 | z3), and p(z1 | z0) from the initial starting distribution ("__SEQ_SYM__"); emission edges labeled p(w1 | z1), …, p(w4 | z4))
Each $z_i$ can take the value of one of K latent states; transition and emission distributions do not change.
Example: 2-state Hidden Markov Model as a Lattice
(figure, built up over several slides: a lattice with rows z_i = N and z_i = V for each position i = 1…4 and observations w1…w4 below; emission arcs p(w_i | N) and p(w_i | V) at every position, and transition arcs p(N | start), p(V | start), p(N | N), p(V | N), p(N | V), p(V | V) between adjacent columns)

Transition probabilities (row = from, column = to):
        N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities:
     w1    w2    w3    w4
N    .7    .2    .05   .05
V    .2    .6    .1    .1
A Latent Sequence is a Path through the Graph
(figure: the highlighted path N → V → V → N through the lattice above, with its emission and transition arcs)
Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7 * .7) * (.8 * .6) * (.35 * .1) * (.6 * .05) ≈ 0.000247
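As a quick check of that arithmetic, a minimal Python sketch (the dictionary names `p_trans` and `p_emit` are just for illustration):

```python
p_trans = {"start": {"N": .7, "V": .2, "end": .1},
           "N": {"N": .15, "V": .8, "end": .05},
           "V": {"N": .6, "V": .35, "end": .05}}
p_emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
          "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}

def path_probability(states, words):
    """Joint probability of one latent path and the observed words:
    product of p(z_i | z_{i-1}) * p(w_i | z_i) along the path."""
    prob, prev = 1.0, "start"
    for z, w in zip(states, words):
        prob *= p_trans[prev][z] * p_emit[z][w]
        prev = z
    return prob

print(path_probability(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"]))  # ~0.000247
```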
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.
If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side
ITILA, Ch 16
What's the Maximum Weighted Path?
(figure, built up over several slides: a small graph with weighted edges; at each node the best running total from the start is recorded and then extended one edge at a time, so the maximum-weight path is found without enumerating every path separately)
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
What's the Maximum Value?
Consider any path ending in B: its last step arrives from A, B, or C (the arcs A → B, B → B, or C → B).
Maximize across the previous hidden state values:
$$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$$
v(i, B) is the maximum probability of any path to state B (from the beginning) at step i, including emitting the observation there.
(figure: lattice columns of states A, B, C at steps i-2, i-1, i)
Computing v at time i-1 will correctly incorporate (maximize over) paths through time i-2: we correctly obey the Markov property.
Viterbi algorithm
v = double[N+2][K*]        // best (max) probability of any path ending in each state
b = int[N+2][K*]           // backpointers/book-keeping
v[*][*] = 0
v[0][START] = 1
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = pemission(obs_i | state)
    for(old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      if(v[i-1][old] * pobs * pmove > v[i][state]) {
        v[i][state] = v[i-1][old] * pobs * pmove
        b[i][state] = old
      }
    }
  }
}
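A runnable Python sketch of the same idea, using the 2-state (N, V) tables from the lattice example above; it is dictionary-based rather than the array layout in the pseudocode, and the start/end handling is one reasonable choice, not the only one.

```python
def viterbi(obs, states, p_trans, p_emit):
    """Most likely latent state sequence for the observations."""
    v = [{s: 0.0 for s in states} for _ in range(len(obs))]      # best path probability
    back = [{s: None for s in states} for _ in range(len(obs))]  # backpointers
    for s in states:                                             # initialization from "start"
        v[0][s] = p_trans["start"][s] * p_emit[s][obs[0]]
    for i in range(1, len(obs)):                                 # recursion: max over previous states
        for s in states:
            best_prev, best_p = None, 0.0
            for s_prev in states:
                p = v[i - 1][s_prev] * p_trans[s_prev][s] * p_emit[s][obs[i]]
                if p > best_p:
                    best_prev, best_p = s_prev, p
            v[i][s], back[i][s] = best_p, best_prev
    # termination: fold in the transition to "end", then follow backpointers
    last = max(states, key=lambda s: v[-1][s] * p_trans[s]["end"])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

p_trans = {"start": {"N": .7, "V": .2, "end": .1},
           "N": {"N": .15, "V": .8, "end": .05},
           "V": {"N": .6, "V": .35, "end": .05}}
p_emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
          "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}
print(viterbi(["w1", "w2", "w3", "w4"], ["N", "V"], p_trans, p_emit))
```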
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Forward Probability
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
Marginalize (sum) across the previous hidden state values:
$$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$$
(α(i-1, s'): what's the total probability up until now? p(s | s'): what are the immediate ways to get into state s? the whole product: how likely is it to get into state s this way?)
Computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property.
(figure: lattice columns of states A, B, C at steps i-2, i-1, i)
Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][end]
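The forward pass is the Viterbi sketch above with max replaced by a sum; a minimal Python version (reusing the illustrative `p_trans` / `p_emit` tables defined in the earlier sketches):

```python
def forward(obs, states, p_trans, p_emit):
    """Total (marginal) probability of the observations: sum over all latent paths."""
    alpha = [{s: 0.0 for s in states} for _ in range(len(obs))]
    for s in states:                                   # initialization from "start"
        alpha[0][s] = p_trans["start"][s] * p_emit[s][obs[0]]
    for i in range(1, len(obs)):                       # recursion: sum instead of max
        for s in states:
            alpha[i][s] = (sum(alpha[i - 1][sp] * p_trans[sp][s] for sp in states)
                           * p_emit[s][obs[i]])
    # termination: fold in the transition to "end"; this plays the role of alpha[N+1][end]
    return sum(alpha[-1][s] * p_trans[s]["end"] for s in states)

# e.g. forward(["w1", "w2", "w3", "w4"], ["N", "V"], p_trans, p_emit)
```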
Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing: Directed Graphical Model Inference (most likely sequence, total/marginal probability, EM in D-PGMs)
Forward & Backward Message Passing
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)
(figure: lattice columns of states A, B, C at steps i-1, i, i+1, with α(i, B) accumulated from the left and β(i, B) accumulated from the right)
α(i, B) * β(i, B) = total probability of paths through state B at step i
α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the B → s' arc (at time i)
With Both Forward and Backward Values
α(i, s) * β(i, s) = total probability of paths through state s at step i
α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s → s' arc (at time i)
$$p(z_i = s \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, \beta(i, s)}{\alpha(N+1, \text{END})}$$
$$p(z_i = s,\, z_{i+1} = s' \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, p(s' \mid s)\, p(\text{obs}_{i+1} \mid s')\, \beta(i+1, s')}{\alpha(N+1, \text{END})}$$
Expectation Maximization (EM)
Two step, iterative algorithm
0. Assume some value for your parameters
1. E-step: count under uncertainty, assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
For an HMM the parameters are the emission and transition distributions, p_obs(w | s) and p_trans(s' | s), and the estimated counts come from the posteriors computed with α and β:
$$p^*(z_i = s \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, \beta(i, s)}{\alpha(N+1, \text{END})}$$
$$p^*(z_i = s,\, z_{i+1} = s' \mid w_1, \ldots, w_N) = \frac{\alpha(i, s)\, p(s' \mid s)\, p(\text{obs}_{i+1} \mid s')\, \beta(i+1, s')}{\alpha(N+1, \text{END})}$$
M-Step
"maximize log-likelihood, assuming these uncertain counts"
If we observed the hidden transitions:
$$p_{\text{new}}(s' \mid s) = \frac{c(s \to s')}{\sum_x c(s \to x)}$$
We don't observe the hidden transitions, but we can approximately count them:
$$p_{\text{new}}(s' \mid s) = \frac{\mathbb{E}[c(s \to s')]}{\sum_x \mathbb{E}[c(s \to x)]}$$
We compute these expected counts in the E-step, with our α and β values.
EM For HMMs (Baum-Welch Algorithm)
α = computeForwards()
β = computeBackwards()
L = α[N+1][END]                     // total likelihood of the observed sequence
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    // expected emission count: posterior probability of being in state `next` at step i+1
    cobs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
    for(state = 0; state < K*; ++state) {
      // expected transition count: posterior probability of the state → next arc at step i
      u = pobs(obs_{i+1} | next) * ptrans(next | state)
      ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
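A minimal sketch of the M-step that would follow these expected counts (the dictionary names `c_trans` and `c_obs` are illustrative stand-ins for the ctrans/cobs counts above):

```python
def m_step(c_trans, c_obs):
    """Normalize expected counts into new distributions:
    p_new(s' | s) = E[c(s -> s')] / sum_x E[c(s -> x)], and likewise for emissions."""
    p_trans = {s: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
               for s, nxt in c_trans.items()}
    p_emit = {s: {w: c / sum(ws.values()) for w, c in ws.items()}
              for s, ws in c_obs.items()}
    return p_trans, p_emit

# e.g. expected counts accumulated by the E-step loop above (made-up numbers)
p_trans, p_emit = m_step({"N": {"N": 1.2, "V": 2.8}}, {"N": {"w1": 3.0, "w2": 1.0}})
print(p_trans["N"]["V"], p_emit["N"]["w1"])  # 0.7, 0.75
```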
Bayesian Networks: Directed Acyclic Graphs
$$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$$
$\pi(x_i)$ = "parents of" $x_i$; the factors can be written down by following a topological sort of the graph.
Bayesian Networks: Directed Acyclic Graphs
$$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$$
Exact inference in general DAGs is NP-hard; inference in trees can be exact.
D-Separation: Testing for Conditional Independence
Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.
d-separation: X & Y are d-separated if, for every path P, one of the following is true:
- P has a chain with an observed middle node (X → Z → Y): observing Z blocks the path from X to Y
- P has a fork with an observed parent node (X ← Z → Y): observing Z blocks the path from X to Y
- P includes a "v-structure" or "collider" (X → Z ← Y) with Z and all its descendants unobserved: not observing Z blocks the path from X to Y
For the v-structure, marginalizing out the unobserved Z leaves X and Y independent:
$$p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y), \qquad p(x, y) = \sum_z p(x)\, p(y)\, p(z \mid x, y) = p(x)\, p(y)$$
Markov Blanket
The Markov blanket of a node $x$ is its parents, children, and children's parents: the set of nodes needed to form the complete conditional for a variable $x_i$.
$$p(x_i \mid x_{j \ne i}) = \frac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_i}
= \frac{\prod_k p(x_k \mid \pi(x_k))}{\int \prod_k p(x_k \mid \pi(x_k))\, dx_i}
= \frac{\prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))}{\int \prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))\, dx_i}$$
(the second equality uses the factorization of the graph; the third factors out, and cancels, terms not dependent on $x_i$)