Machine Learning
Mixture Model, HMM, and Expectation Maximization
Eric Xing
Lecture 9, August 14, 2010
Reading:
Data log-likelihood

$\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$
$\quad = \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k N(x_n; \mu_k, \sigma)^{z_n^k}$
$\quad = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C$

MLE

$\hat{\pi}_{k,MLE} = \arg\max_\pi \ell(\theta; D), \qquad \hat{\sigma}_{k,MLE} = \arg\max_\sigma \ell(\theta; D)$
$\hat{\mu}_{k,MLE} = \arg\max_\mu \ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,MLE} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$

What if we do not know $z_n$?

Gaussian Discriminative Analysis (fully observed model: $z_i \to x_i$, plate over $N$); the posterior probability of the class label is

$p(y_n^k = 1 \mid x_n, \mu, \sigma) = \dfrac{\pi_k \,(2\pi\sigma_k^2)^{-1/2} \exp\{-\frac{1}{2\sigma_k^2}(x_n - \mu_k)^2\}}{\sum_{k'} \pi_{k'} \,(2\pi\sigma_{k'}^2)^{-1/2} \exp\{-\frac{1}{2\sigma_{k'}^2}(x_n - \mu_{k'})^2\}}$
Clustering
Unobserved Variables
A variable can be unobserved (latent) because:
- it is an imaginary quantity meant to provide a simplified and abstract view of the data generation process (e.g., speech recognition models, mixture models, ...)
- it is a real-world object and/or phenomenon that is difficult or impossible to measure (e.g., the temperature of a star, causes of a disease, evolutionary ancestors, ...)
- it is a real-world object and/or phenomenon that simply wasn't measured, because of faulty sensors, or was measured through a noisy channel, etc. (e.g., traffic radio, aircraft signal on a radar screen, ...)
Discrete latent variables can be used to partition/cluster data into sub-groups (mixture models, forthcoming).
Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc., later lectures).
Mixture Models
A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians). Each mode may correspond to a different sub-population (e.g., male and female).
Gaussian Mixture Models (GMMs)
Consider a mixture of K Gaussian components:

Z is a latent class indicator vector:
$p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$

X is a conditional Gaussian variable with a class-specific mean/covariance:
$p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\big\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \big\}$

The likelihood of a sample:
$p(x_n \mid \mu, \Sigma) = \sum_k p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma)$
$\quad = \sum_{z_n} \prod_k \big( (\pi_k)^{z_n^k}\, N(x_n : \mu_k, \Sigma_k)^{z_n^k} \big) = \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$

(mixture proportion $\pi_k$; mixture component $N(x_n \mid \mu_k, \Sigma_k)$; graphical model: Z → X)
Gaussian Mixture Models (GMMs)
Consider a mixture of K Gaussian components:
This model can be used for unsupervised clustering. This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
$p(x_n \mid \mu, \Sigma) = \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$

(mixture proportion $\pi_k$; mixture component $N(x_n \mid \mu_k, \Sigma_k)$)
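To make the mixture density concrete, here is a minimal sketch (not from the slides) of evaluating the GMM log-likelihood $\sum_n \log \sum_k \pi_k N(x_n \mid \mu_k, \Sigma_k)$; the function name and parameter layout (`pis`, `mus`, `covs`) are illustrative assumptions.

```python
# Minimal sketch: total GMM log-likelihood of a data matrix X (N x d).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, covs):
    """sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pis)
    # densities[n, k] = pi_k * N(x_n | mu_k, Sigma_k)
    densities = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
        for k in range(K)
    ])
    return np.sum(np.log(densities.sum(axis=1)))
```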
Learning mixture models
Given data
Likelihood:
$L(\pi, \mu, \Sigma; D) = \prod_n p(x_n \mid \pi, \mu, \Sigma) = \prod_n \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$

$(\pi^*, \mu^*, \Sigma^*) = \arg\max_{\pi, \mu, \Sigma} L(\pi, \mu, \Sigma; D)$
Why is Learning Harder?
In fully observed iid settings, the log likelihood decomposes into a sum of local terms.
With latent variables, all the parameters become coupled together via marginalization
$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$

$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$
Recall MLE for completely observed data
Data log-likelihood

$\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$
$\quad = \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k N(x_n; \mu_k, \sigma)^{z_n^k}$
$\quad = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C$

MLE

$\hat{\pi}_{k,MLE} = \arg\max_\pi \ell(\theta; D), \qquad \hat{\sigma}_{k,MLE} = \arg\max_\sigma \ell(\theta; D)$
$\hat{\mu}_{k,MLE} = \arg\max_\mu \ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,MLE} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$

What if we do not know $z_n$?

Toward the EM algorithm (fully observed model: $z_i \to x_i$, plate over $N$)
Recall K-means
Start: "Guess" the centroid k and coveriance k of each of the K clusters
Loop For each point n=1 to N,
compute its cluster label:
For each cluster k=1:K
)()(minarg )()(1)()( tkn
tk
Ttkn
k
tn xxz
n
tn
n ntnt
k kz
xkz
),(
),()(
)()(
1 ...)( 1tk
Expectation-Maximization
Start: "Guess" the centroid k and coveriance k of each of the K clusters
Loop
─ Expectation step: computing the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., π, μ, and Σ).
Here we are essentially doing inference
E-step:

$\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)}) = \dfrac{\pi_k^{(t)}\, N(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)}\, N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}$

(Graphical model: Z_n → X_n, plate over N)
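A minimal sketch (assumed function and variable names, not from the slides) of this E-step responsibility computation:

```python
# Minimal sketch: E-step responsibilities tau[n, k] = p(z_n^k = 1 | x_n, params).
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, covs):
    K = len(pis)
    # weighted[n, k] = pi_k * N(x_n | mu_k, Sigma_k)
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
        for k in range(K)
    ])
    # normalize each row so responsibilities sum to one per data point
    return weighted / weighted.sum(axis=1, keepdims=True)
```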
─ Maximization step: compute the parameters under current results of the expected value of the hidden variables
This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics")
M-step:

$\pi_k^* = \arg\max_\pi \langle \ell_c(\theta) \rangle$, with $\frac{\partial}{\partial \pi_k}\langle \ell_c(\theta) \rangle = 0$ s.t. $\sum_k \pi_k = 1,\ \forall k \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n \langle z_n^k \rangle_{q^{(t)}}}{N} = \frac{\langle n_k \rangle}{N}$

$\mu_k^* = \arg\max_\mu \langle \ell_c(\theta) \rangle \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}$

$\Sigma_k^* = \arg\max_\Sigma \langle \ell_c(\theta) \rangle \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}$

(Graphical model: Z_n → X_n, plate over N)
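A corresponding minimal M-step sketch, again with assumed names, implementing the three updates above from the responsibilities tau:

```python
# Minimal sketch: M-step updates of pi, mu, Sigma given responsibilities tau (N x K).
import numpy as np

def m_step(X, tau):
    N, d = X.shape
    Nk = tau.sum(axis=0)                      # expected counts <n_k>
    pis = Nk / N                              # pi_k^(t+1)
    mus = (tau.T @ X) / Nk[:, None]           # mu_k^(t+1)
    covs = []
    for k in range(tau.shape[1]):
        diff = X - mus[k]                     # (N, d)
        covs.append((tau[:, k, None] * diff).T @ diff / Nk[k])   # Sigma_k^(t+1)
    return pis, mus, covs
```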
How is EM derived?
A mixture of K Gaussians (graphical model: Z_n → X_n, plate over N):

Z is a latent class indicator vector:
$p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$

X is a conditional Gaussian variable with a class-specific mean/covariance:
$p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\big\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \big\}$

The likelihood of a sample:
$p(x_n \mid \mu, \Sigma) = \sum_k p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$

The "complete" likelihood:
$p(x_n, z_n^k = 1 \mid \mu, \Sigma) = p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$
$p(x_n, z_n \mid \mu, \Sigma) = \prod_k \big( \pi_k\, N(x_n \mid \mu_k, \Sigma_k) \big)^{z_n^k}$

But this is itself a random variable! Not good as objective function
How is EM derived?
The complete log likelihood:
$\ell_c(\theta; x, z) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$
$\quad = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C$

The expected complete log likelihood:
$\langle \ell_c(\theta; x, z) \rangle = \Big\langle \sum_n \log p(z_n \mid \pi) \Big\rangle_{p(z \mid x)} + \Big\langle \sum_n \log p(x_n \mid z_n, \mu, \Sigma) \Big\rangle_{p(z \mid x)}$
$\quad = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k \;-\; \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log|\Sigma_k| + C \big)$

We maximize $\langle \ell_c(\theta) \rangle$ iteratively using the above iterative procedure.

(Graphical model: Z_n → X_n, plate over N)
Compare: K-means
The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.
In the K-means “E-step” we do hard assignment:
In the K-means “M-step” we update the means as the weighted sum of the data, but now the weights are 0 or 1:
K-means hard assignment (E-step):
$z_n^{(t)} = \arg\min_k\, (x_n - \mu_k^{(t)})^T \Sigma_k^{(t)-1} (x_n - \mu_k^{(t)})$

K-means update (M-step), with 0/1 weights:
$\mu_k^{(t+1)} = \dfrac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$

EM soft update, with weights $\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}}$:
$\mu_k^{(t+1)} = \dfrac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}$
Theory underlying EM
What are we doing?
Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data.

But we do not observe z, so computing

$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$

is difficult!

What shall we do?
Complete & Incomplete Log Likelihoods
Complete log likelihood: Let X denote the observable variable(s), and Z denote the latent variable(s). If Z could be observed, then

$\ell_c(\theta; x, z) \stackrel{\mathrm{def}}{=} \log p(x, z \mid \theta)$

Usually, optimizing $\ell_c(\theta)$ given both z and x is straightforward (cf. MLE for fully observed models). Recall that in this case the objective for, e.g., MLE, decomposes into a sum of factors; the parameter for each factor can be estimated separately.

But given that Z is not observed, $\ell_c(\theta)$ is a random quantity and cannot be maximized directly.

Incomplete log likelihood: With z unobserved, our objective becomes the log of a marginal probability:

$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$

This objective won't decouple.
Expected Complete Log Likelihood
For any distribution q(z), define the expected complete log likelihood:

$\langle \ell_c(\theta; x, z) \rangle_q \stackrel{\mathrm{def}}{=} \sum_z q(z \mid x)\, \log p(x, z \mid \theta)$

A deterministic function of θ. Linear in ℓ_c(θ) --- it inherits its factorizability.
Does maximizing this surrogate yield a maximizer of the likelihood?

Jensen's inequality:

$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\geq\; \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}$

$\Rightarrow\quad \ell(\theta; x) \;\geq\; \langle \ell_c(\theta; x, z) \rangle_q + H_q$
Lower Bounds and Free Energy
For fixed data x, define a functional called the free energy:

$F(q, \theta) \stackrel{\mathrm{def}}{=} \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\leq\; \ell(\theta; x)$

The EM algorithm is coordinate-ascent on F:

E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
E-step: maximization of expected ℓ_c w.r.t. q

Claim:
$q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$

This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform classification).

Proof (easy): this setting attains the bound $\ell(\theta; x) \geq F(q, \theta)$:

$F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t)\, \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t)\, \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$

Can also show this result using variational calculus or the fact that
$\ell(\theta; x) = F(q, \theta) + \mathrm{KL}\big( q \,\|\, p(z \mid x, \theta) \big)$
E-step: plug in posterior expectation of latent variables

Without loss of generality, assume that p(x, z | θ) is a generalized exponential family distribution:

$p(x, z \mid \theta) = \frac{1}{Z(\theta)}\, h(x, z)\, \exp\Big\{ \sum_i \theta_i f_i(x, z) \Big\}$

Special case: if p(X | Z) are GLIMs, then $f_i(x, z) = \eta_i(z)^T \xi_i(x)$.

The expected complete log likelihood under $q^{t+1} = p(z \mid x, \theta^t)$ is

$\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t)\, \log p(x, z \mid \theta) = \sum_i \theta_i \big\langle f_i(x, z) \big\rangle_{q(z \mid x, \theta^t)} - A(\theta)$

$\stackrel{\mathrm{GLIM}}{\longrightarrow}\;\; \sum_i \theta_i \big\langle \eta_i(z) \big\rangle_{q(z \mid x, \theta^t)}^T \xi_i(x) - A(\theta)$
M-step: maximization of expected ℓ_c w.r.t. θ

Note that the free energy breaks into two terms:

$F(q, \theta) = \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x)\, \log p(x, z \mid \theta) - \sum_z q(z \mid x)\, \log q(z \mid x) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$

The first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy.

Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:

$\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x)\, \log p(x, z \mid \theta)$

Under the optimal $q^{t+1}$, this is equivalent to solving a standard MLE of the fully observed model p(x, z | θ), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z | x, θ).
Summary: EM Algorithm
A way of maximizing likelihood function for latent variable models. Finds MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some “missing” or “unobserved” data from observed data and current parameters.
2. Using this “complete” data, find the maximum likelihood parameter estimates.
Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$

In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.
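Putting the two steps together, a minimal EM loop sketch for the GMM; it reuses the hypothetical e_step, m_step, and gmm_log_likelihood helpers sketched earlier, and the stopping rule is an assumption:

```python
# Minimal sketch: EM for a GMM, alternating the E-step and M-step above.
import numpy as np

def fit_gmm_em(X, pis, mus, covs, n_iter=100, tol=1e-6):
    prev_ll = -np.inf
    for _ in range(n_iter):
        tau = e_step(X, pis, mus, covs)        # E-step: q^{t+1} = p(z | x, theta^t)
        pis, mus, covs = m_step(X, tau)        # M-step: theta^{t+1} = argmax <l_c>
        ll = gmm_log_likelihood(X, pis, mus, covs)
        if ll - prev_ll < tol:                 # likelihood is non-decreasing under EM
            break
        prev_ll = ll
    return pis, mus, covs
```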
From static to dynamic mixture models
Static mixture: a single hidden label Y1 with its observation X1, i.i.d. over N samples.
Dynamic mixture: a chain of hidden states Y1 → Y2 → Y3 → ... → YT (with transition matrix A), each state Yt emitting an observation Xt.

The underlying source: phonemes, dice, ...
The sequence: speech signal, sequence of rolls, ...
Chromosomes of tumor cell:
Predicting Tumor Cell States
Copy number profile for chromosome 1 from 600 MPE cell line
Copy number profile for chromosome 8 from COLO320 cell line
60-70 fold amplification of CMYC region
Copy number profile for chromosome 8 in MDA-MB-231 cell line
deletion
DNA Copy number aberration types in breast cancer
A real CGH run
Hidden Markov Model

Observation space
Alphabetic set: $\mathcal{C} = \{ c_1, c_2, \ldots, c_K \}$
Euclidean space: $\mathbb{R}^d$

Index set of hidden states
$\mathcal{I} = \{ 1, 2, \ldots, M \}$

Transition probabilities between any two states
$p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$,
or $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in \mathcal{I}$

Start probabilities
$p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$

Emission probabilities associated with each state
$p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in \mathcal{I}$
or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in \mathcal{I}$

(Graphical model: hidden chain y1 → y2 → ... → yT with emissions yt → xt; equivalently a state automaton over the hidden states.)
The Dishonest Casino
A casino has two dice:
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

Casino player switches back-&-forth between fair and loaded die once every 20 turns

Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die, maybe with loaded die)
4. Highest number wins $2
The Dishonest Casino Model

States: FAIR, LOADED
Transitions: P(stay in same state) = 0.95, P(switch) = 0.05 (in both directions)
Emissions:
FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Puzzles Regarding the Dishonest Casino
GIVEN: A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION: How likely is this sequence, given our model of how the casino works?
This is the EVALUATION problem in HMMs.

What portion of the sequence was generated with the fair die, and what portion with the loaded die?
This is the DECODING question in HMMs.

How "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs.
Joint Probability
1245526462146146136136661664661636616366163616515615115146123562344
Probability of a Parse
Given a sequence x = x1……xT
and a parse y = y1, ……, yT, to find how likely the parse is:
(given our HMM and the sequence)
p(x, y) = p(x1……xT, y1, ……, yT) (Joint probability)
= p(y1) p(x1 | y1) p(y2 | y1) p(x2 | y2) … p(yT | yT-1) p(xT | yT)
= p(y1) P(y2 | y1) … p(yT | yT-1) × p(x1 | y1) p(x2 | y2) … p(xT | yT)
Marginal probability:
$p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y}) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$

Posterior probability:
$p(\mathbf{y} \mid \mathbf{x}) = p(\mathbf{x}, \mathbf{y}) / p(\mathbf{x})$
Example: the Dishonest Casino
Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
Then, what is the likelihood of y = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?
(say initial probs a_{0,Fair} = ½, a_{0,Loaded} = ½)

½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair)
= ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5.21 × 10^-9
Example: the Dishonest Casino
So, the likelihood that the die is fair throughout this run is just 5.21 × 10^-9.

OK, but what is the likelihood of y = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?

½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded)
= ½ × (1/10)^8 × (1/2)^2 × (0.95)^9 = 0.00000000078781176215 ≈ 0.79 × 10^-9
Therefore, it is after all 6.59 times more likely that the die is fair all the way, than that it is loaded all the way
Example: the Dishonest Casino
Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
Now, what is the likelihood of y = F, F, …, F?
½ × (1/6)^10 × (0.95)^9 ≈ 5.21 × 10^-9, same as before

What is the likelihood of y = L, L, …, L?

½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 ≈ 5 × 10^-7
So, it is 100 times more likely the die is loaded
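A quick numeric check (not part of the slides) of the hand calculations on the last two slides:

```python
# Verify the all-FAIR vs. all-LOADED likelihoods for the two roll sequences.
p_fair_1 = 0.5 * (1/6)**10 * 0.95**9                 # x = 1,2,1,5,6,2,1,6,2,4, all FAIR
p_load_1 = 0.5 * (1/10)**8 * (1/2)**2 * 0.95**9      # same x, all LOADED
p_fair_2 = 0.5 * (1/6)**10 * 0.95**9                 # x = 1,6,6,5,6,2,6,6,3,6, all FAIR
p_load_2 = 0.5 * (1/10)**4 * (1/2)**6 * 0.95**9      # same x, all LOADED
print(p_fair_1, p_load_1, p_fair_1 / p_load_1)       # ~5.21e-09, ~7.88e-10, ~6.6
print(p_fair_2, p_load_2, p_load_2 / p_fair_2)       # ~5.21e-09, ~4.92e-07, ~95 (about 100)
```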
Three Main Questions on HMMs
1. Evaluation
GIVEN an HMM M, and a sequence x,
FIND P(x | M)
ALGO: Forward

2. Decoding
GIVEN an HMM M, and a sequence x,
FIND the sequence y of states that maximizes, e.g., P(y | x, M), or the most probable subsequence of states
ALGO: Viterbi, Forward-backward

3. Learning
GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x,
FIND parameters θ = (π_i, a_ij, η_ik) that maximize P(x | θ)
ALGO: Baum-Welch (EM)
Applications of HMMs
Some early applications of HMMs: finance (but we never saw them), speech recognition, modelling ion channels.
In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.
Some current applications of HMMs to biology: mapping chromosomes, aligning biological sequences, predicting sequence structure, inferring evolutionary relationships, finding genes in DNA sequence.
The Forward Algorithm
We want to calculate P(x), the likelihood of x, given the HMM, by summing over all possible ways of generating x:

$p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y}) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$

To avoid summing over an exponential number of paths y, define

$\alpha_t^k \stackrel{\mathrm{def}}{=} P(x_1, \ldots, x_t, y_t^k = 1)$   (the forward probability)

The recursion:
$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i\, a_{i,k}$

$P(\mathbf{x}) = \sum_k \alpha_T^k$
The Forward Algorithm – derivation
Compute the forward probability:
$\alpha_t^k = P(x_1, \ldots, x_{t-1}, x_t, y_t^k = 1)$
$\quad = \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1}, x_1, \ldots, x_{t-1})\, P(x_t \mid y_t^k = 1, x_1, \ldots, x_{t-1}, y_{t-1})$
$\quad = \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1})\, P(x_t \mid y_t^k = 1)$
$\quad = P(x_t \mid y_t^k = 1) \sum_i P(x_1, \ldots, x_{t-1}, y_{t-1}^i = 1)\, P(y_t^k = 1 \mid y_{t-1}^i = 1)$
$\quad = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i\, a_{i,k}$

(Chain rule: $P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid A, B)$)
The Forward Algorithm
We can compute $\alpha_t^k$ for all k, t, using dynamic programming!

Initialization:
$\alpha_1^k = P(x_1, y_1^k = 1) = P(y_1^k = 1)\, P(x_1 \mid y_1^k = 1) = \pi_k\, P(x_1 \mid y_1^k = 1)$

Iteration:
$\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i\, a_{i,k}$

Termination:
$P(\mathbf{x}) = \sum_k \alpha_T^k$
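A minimal sketch of this forward recursion; the parameter layout (pi0 start probabilities, A[i, k] transitions, B[k, x] emissions) is an assumption, and the casino numbers in the usage example are taken from the earlier slides:

```python
# Minimal sketch: forward algorithm for a discrete-emission HMM.
import numpy as np

def forward(x, pi0, A, B):
    """Return alpha (T x K) with alpha[t, k] = P(x_1..x_t, y_t = k)."""
    T, K = len(x), len(pi0)
    alpha = np.zeros((T, K))
    alpha[0] = pi0 * B[:, x[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)    # alpha_t^k = p(x_t|y_t^k) sum_i alpha_{t-1}^i a_{i,k}
    return alpha                                      # p(x) = alpha[-1].sum()

# Dishonest casino example: states 0 = FAIR, 1 = LOADED
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
pi0 = np.array([0.5, 0.5])
x = np.array([1, 2, 1, 5, 6, 2, 1, 6, 2, 4]) - 1      # 0-indexed die faces
print(forward(x, pi0, A, B)[0])                       # [0.0833, 0.05], cf. the Alpha table on the example slides below
```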
The Backward Algorithm
We want to compute $P(y_t^k = 1 \mid \mathbf{x})$, the posterior probability distribution on the t-th position, given x.

We start by computing
$P(y_t^k = 1, \mathbf{x}) = P(x_1, \ldots, x_t, y_t^k = 1, x_{t+1}, \ldots, x_T)$
$\quad = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t, y_t^k = 1)$
$\quad = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid y_t^k = 1) = \alpha_t^k\, \beta_t^k$

Forward probability: $\alpha_t^k$;  Backward probability: $\beta_t^k \stackrel{\mathrm{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$

The recursion:
$\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
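A matching backward-recursion sketch under the same assumed parameter layout as the forward sketch above:

```python
# Minimal sketch: backward algorithm for a discrete-emission HMM.
import numpy as np

def backward(x, A, B):
    """Return beta (T x K) with beta[t, k] = P(x_{t+1}..x_T | y_t = k)."""
    T, K = len(x), A.shape[0]
    beta = np.zeros((T, K))
    beta[T - 1] = 1.0                                   # beta_T^k = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])    # beta_t^k = sum_i a_{k,i} p(x_{t+1}|y_{t+1}^i) beta_{t+1}^i
    return beta
```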
Example: the dishonest casino model above (FAIR/LOADED; stay probability 0.95, switch probability 0.05; emissions as given), with

x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4

Forward recursion: $\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i\, a_{i,k}$
Backward recursion: $\beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
Alpha (actual), listed as (FAIR, LOADED) pairs for t = 1, ..., 10:
(0.0833, 0.0500) (0.0136, 0.0052) (0.0022, 0.0006) (0.0004, 0.0001) (0.0001, 0.0000)
(0.0000, 0.0000) (0.0000, 0.0000) (0.0000, 0.0000) (0.0000, 0.0000) (0.0000, 0.0000)

Beta (actual), listed as (FAIR, LOADED) pairs for t = 1, ..., 10:
(0.0000, 0.0000) (0.0000, 0.0000) (0.0000, 0.0000) (0.0000, 0.0000) (0.0001, 0.0001)
(0.0007, 0.0006) (0.0045, 0.0055) (0.0264, 0.0112) (0.1633, 0.1033) (1.0000, 1.0000)

(Same casino model, same sequence x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4, and the forward/backward recursions above.)
Alpha (logs), listed as (FAIR, LOADED) pairs for t = 1, ..., 10:
(-2.4849, -2.9957) (-4.2969, -5.2655) (-6.1201, -7.4896) (-7.9499, -9.6553) (-9.7834, -10.1454)
(-11.5905, -12.4264) (-13.4110, -14.6657) (-15.2391, -15.2407) (-17.0310, -17.5432) (-18.8430, -19.8129)

Beta (logs), listed as (FAIR, LOADED) pairs for t = 1, ..., 10:
(-16.2439, -17.2014) (-14.4185, -14.9922) (-12.6028, -12.7337) (-10.8042, -10.4389) (-9.0373, -9.7289)
(-7.2181, -7.4833) (-5.4135, -5.1977) (-3.6352, -4.4938) (-1.8120, -2.2698) (0, 0)

(Same casino model, same sequence x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4, and the forward/backward recursions above.)
What is the probability of a hidden state prediction?
Posterior decoding
We can now calculate

$P(y_t^k = 1 \mid \mathbf{x}) = \dfrac{P(y_t^k = 1, \mathbf{x})}{P(\mathbf{x})} = \dfrac{\alpha_t^k\, \beta_t^k}{P(\mathbf{x})}$

Then, we can ask: what is the most likely state at position t of sequence x?

$k_t^* = \arg\max_k P(y_t^k = 1 \mid \mathbf{x})$

Note that this is an MPA of a single hidden state; what if we want an MPA of the whole hidden state sequence?

Posterior decoding: $\{ y_t^{k_t^*} = 1 : t = 1, \ldots, T \}$

This is different from the MPA of a whole sequence of hidden states.
This can be understood as bit error rate vs. word error rate.

Example: MPA of X? MPA of (X, Y)?

x  y  P(x,y)
0  0  0.35
0  1  0.05
1  0  0.3
1  1  0.3
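A minimal posterior-decoding sketch built on the forward/backward sketches above (names are assumptions):

```python
# Minimal sketch: per-position posteriors gamma[t, k] = P(y_t = k | x) and k_t^*.
import numpy as np

def posterior_decode(x, pi0, A, B):
    alpha, beta = forward(x, pi0, A, B), backward(x, A, B)
    px = alpha[-1].sum()                       # p(x) = sum_k alpha_T^k
    gamma = alpha * beta / px                  # gamma[t, k] = alpha_t^k beta_t^k / p(x)
    return gamma, gamma.argmax(axis=1)         # posteriors and the per-position argmax states
```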
Viterbi decoding
GIVEN x = x1, ..., xT, we want to find y = y1, ..., yT, such that P(y|x) is maximized:

$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} P(\mathbf{y}, \mathbf{x})$

Let
$V_t^k = \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$
= probability of the most likely sequence of states ending at state $y_t = k$.

The recursion:
$V_t^k = p(x_t \mid y_t^k = 1)\, \max_i \big( a_{i,k}\, V_{t-1}^i \big)$

(computed over the trellis of states 1, ..., K against positions x1, x2, x3, ..., xT, using
$p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1} b_{y_1, x_1}\, a_{y_1, y_2} b_{y_2, x_2} \cdots a_{y_{t-1}, y_t} b_{y_t, x_t}$)

Underflows are a significant problem: these numbers become extremely small. Solution: take the logs of all values:
$V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \big( \log a_{i,k} + V_{t-1}^i \big)$
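A minimal log-space Viterbi sketch under the same assumed parameter layout as the forward/backward sketches above:

```python
# Minimal sketch: Viterbi decoding in log space, with backtracking.
import numpy as np

def viterbi(x, pi0, A, B):
    T, K = len(x), len(pi0)
    logA, logB = np.log(A), np.log(B)
    V = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    V[0] = np.log(pi0) + logB[:, x[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] + logA             # scores[i, k] = V_{t-1}^i + log a_{i,k}
        back[t] = scores.argmax(axis=0)               # best predecessor i for each state k
        V[t] = logB[:, x[t]] + scores.max(axis=0)     # V_t^k = log p(x_t|y_t^k) + max_i (...)
    # backtrack the most likely state sequence
    y = np.zeros(T, dtype=int)
    y[-1] = V[-1].argmax()
    for t in range(T - 2, -1, -1):
        y[t] = back[t + 1, y[t + 1]]
    return y, V[-1].max()
```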
Computational Complexity and implementation details
What is the running time, and space required, for Forward and Backward?

Time: O(K²N); Space: O(KN), for the recursions
$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i\, a_{i,k}$
$\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
$V_t^k = p(x_t \mid y_t^k = 1)\, \max_i a_{i,k}\, V_{t-1}^i$

Useful implementation techniques to avoid underflows:
Viterbi: sum of logs
Forward/Backward: rescaling at each position by multiplying by a constant
(Homework!)
Learning HMM
Given x = x1…xN for which the true state path y = y1…yN is known,

Define:
A_ij = # times state transition i→j occurs in y
B_ik = # times state i in y emits k in x

The complete log likelihood:
$\ell(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$

We can show that the maximum likelihood parameters are:

$a_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \dfrac{A_{ij}}{\sum_{j'} A_{ij'}}$

$b_{ik}^{ML} = \dfrac{\#(i \to k)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \dfrac{B_{ik}}{\sum_{k'} B_{ik'}}$

(Homework!)

What if y is continuous? We can treat $\{ (y_{n,t}, x_{n,t}) : t = 1, \ldots, T,\; n = 1, \ldots, N \}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian …
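A minimal counting sketch of these supervised estimates (function and variable names are assumptions; pseudocounts/smoothing are omitted):

```python
# Minimal sketch: supervised (fully observed) MLE of HMM transition/emission probs.
import numpy as np

def supervised_mle(ys, xs, M, K):
    """ys, xs: lists of equal-length integer state / observation sequences."""
    A = np.zeros((M, M))    # A[i, j] = # transitions i -> j
    Bc = np.zeros((M, K))   # Bc[i, k] = # times state i emits symbol k
    for y, x in zip(ys, xs):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            Bc[y[t], x[t]] += 1
    a = A / A.sum(axis=1, keepdims=True)     # a_ij^ML = A_ij / sum_j' A_ij'
    b = Bc / Bc.sum(axis=1, keepdims=True)   # b_ik^ML = B_ik / sum_k' B_ik'
    return a, b
```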
Unsupervised ML estimation
Given x = x1…xN for which the true state path y = y1…yN is unknown,
EXPECTATION MAXIMIZATION
0. Starting with our best guess of a model M, parameters θ:
1. Estimate A_ij, B_ik in the training data:
   $A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle, \qquad B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle\, x_{n,t}^k$
   How? (homework)
2. Update θ according to A_ij, B_ik
   Now a "supervised learning" problem
3. Repeat 1 & 2, until convergence

This is called the Baum-Welch Algorithm.
We can get to a provably more (or equally) likely parameter set each iteration.
The Baum-Welch algorithm

The complete log likelihood:
$\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$

The expected complete log likelihood:
$\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k}$

EM

The E step:
$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$
$\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i\, y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$

The M step ("symbolically" identical to MLE):
$a_{ij}^{ML} = \dfrac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \qquad b_{ik}^{ML} = \dfrac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}, \qquad \pi_i^{ML} = \dfrac{\sum_n \gamma_{n,1}^i}{N}$
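A minimal sketch of one Baum-Welch iteration for a single sequence, built on the forward/backward sketches above; pseudocounts, multiple sequences, and convergence checks are omitted, and all names are assumptions:

```python
# Minimal sketch: one Baum-Welch (EM) update for a single observation sequence.
import numpy as np

def baum_welch_step(x, pi0, A, B):
    """x: NumPy integer array of observed symbols; returns updated (pi0, A, B)."""
    T, K = len(x), len(pi0)
    alpha, beta = forward(x, pi0, A, B), backward(x, A, B)
    px = alpha[-1].sum()
    gamma = alpha * beta / px                                    # gamma[t, i] = p(y_t = i | x)
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        # xi[t, i, j] = alpha_t^i a_{i,j} p(x_{t+1}|y_{t+1}^j) beta_{t+1}^j / p(x)
        xi[t] = alpha[t][:, None] * A * B[:, x[t + 1]] * beta[t + 1] / px
    # M-step ("symbolically" identical to the supervised MLE)
    new_pi0 = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[x == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi0, new_A, new_B
```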
Summary
Modeling hidden transitional trajectories (in discrete state space, such as cluster label, DNA copy number, dice-choice, etc.) underlying observed sequence data (discrete, such as dice outcomes; or continuous, such as CGH signals)
Useful for parsing, segmenting sequential data. Important HMM computations:

The joint likelihood of a parse and data can be written as a product of local terms (i.e., initial prob, transition prob, emission prob.)
Computing the marginal likelihood of the observed sequence: the forward algorithm
Predicting a single hidden state: forward-backward
Predicting an entire sequence of hidden states: Viterbi
Learning HMM parameters: an EM algorithm known as Baum-Welch