Hidden Markov Model
Prof. Yan Wang
Woodruff School of Mechanical Engineering
Georgia Institute of Technology
Atlanta, GA 30332, [email protected]
Multiscale Systems Engineering Research Group
Learning Objectives
To understand the hidden Markov model (HMM) as a generalization of the Markov chain
To understand the three basic problems (evaluation, decoding, and learning) in HMM construction and application
Hidden Markov Model (HMM)
HMM is an extension of the regular Markov chain in which the state variables q's are not directly observable
All statistical inference about the Markov chain itself has to be done in terms of the observables o's
[Figure: graphical model of an HMM — a hidden state chain q_1, …, q_{t-1}, q_t, q_{t+1}, …, q_T, with each hidden state q_t emitting an observable o_t]
HMM
The observations o's are conditionally independent of one another given the state sequence {q_t}.
However, {o_t} is neither an independent sequence nor a Markov chain itself.
HMM Components
State sequence: Q = {q_1, q_2, …, q_T}, where each q_t takes one of N possible values
Observation sequence: O = {o_1, o_2, …, o_T}, where each o_t takes one of M possible symbols {v_1, v_2, …, v_M}
State transition matrix: A = (a_{ij})_{N×N}, where a_{ij} = P(q_{t+1} = j | q_t = i)
Observation matrix: B = (b_{ik})_{N×M}, where b_{ik} = b_i(v_k) = P(o_t = v_k | q_t = i)
Initial state distribution: Δ = (δ_i)_{1×N}, where δ_i = P(q_1 = i)
Model parameters: λ = {A, B, Δ}
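The components above can be written down concretely. A minimal sketch in Python/NumPy, using a hypothetical 2-state, 2-symbol model whose numbers are purely illustrative:

```python
import numpy as np

# Hypothetical 2-state, 2-symbol HMM; all numbers are illustrative only.
# States: 0 and 1; observation symbols: v_0 and v_1.
A = np.array([[0.7, 0.3],        # a_ij = P(q_{t+1} = j | q_t = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # b_ik = P(o_t = v_k | q_t = i)
              [0.2, 0.8]])
delta = np.array([0.5, 0.5])     # delta_i = P(q_1 = i)

# lambda = {A, B, Delta}: each row of A and B, and delta itself, must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(delta.sum(), 1.0)
```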
Three Basic Problems in HMM
Evaluation: Given a model with parameters λ and an observation sequence O = {o_1, o_2, …, o_T}, what is the probability P(O|λ) that the model generates those observations?
Decoding: Given a model with parameters λ and an observation sequence O = {o_1, o_2, …, o_T}, what is the most likely state sequence Q = {q_1, q_2, …, q_T} in the model that produces the observations?
Learning: Given a set of observations O = {o_1, o_2, …, o_T}, how can we find the model parameters λ with the maximum likelihood P(O|λ)?
Evaluation Problem
Given an O={o1,o2,…,oT}, P(O|λ)=?
Naïve algorithm:
$$P(O \mid \lambda) = \sum_{Q^{(d)}} P(O, Q^{(d)} \mid \lambda) = \sum_{Q^{(d)}} P(O \mid Q^{(d)}, \lambda) \, P(Q^{(d)} \mid \lambda)$$
where Q^{(d)} is one of all possible combinations of state sequences.
Assuming conditional independence between observations,
$$P(O \mid Q^{(d)}, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = \prod_{t=1}^{T} b_{q_t}(o_t)$$
and, because of lemma 3.1 (the Markov property),
$$P(Q^{(d)} \mid \lambda) = \delta_{q_1} \prod_{t=1}^{T-1} P(q_{t+1} \mid q_t, \lambda) = \delta_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}}$$
However, the number of possible combinations of state sequences is huge: the sum runs over N^T paths.
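The naïve sum over all N^T state sequences can be sketched directly (Python/NumPy; the toy parameters are hypothetical and illustrative only):

```python
import itertools
import numpy as np

# Hypothetical toy model (illustrative numbers only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])

def naive_evaluate(O, A, B, delta):
    """P(O|lambda) by enumerating all N**T state sequences Q."""
    N, T = A.shape[0], len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):
        # P(O|Q,lambda) P(Q|lambda)
        #   = delta_{q1} b_{q1}(o1) * prod_t a_{q_{t-1} q_t} b_{q_t}(o_t)
        p = delta[Q[0]] * B[Q[0], O[0]]
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
        total += p
    return total

p = naive_evaluate([0, 0, 1], A, B, delta)   # feasible only for tiny T
```

The cost is O(T·N^T), which is why the forward algorithm is needed in practice.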
Evaluation Problem –Forward Algorithm
Define a forward variable
$$\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = i \mid \lambda)$$
which can be recursively calculated. Initialization:
$$\alpha_1(i) = P(o_1, q_1 = i \mid \lambda) = P(q_1 = i \mid \lambda) \, P(o_1 \mid q_1 = i, \lambda) = \delta_i \, b_i(o_1)$$
Recursion:
$$\begin{aligned}
\alpha_{t+1}(i) &= P(o_1, \dots, o_t, o_{t+1}, q_{t+1} = i \mid \lambda) \\
&= P(o_{t+1} \mid q_{t+1} = i, \lambda) \, P(o_1, \dots, o_t, q_{t+1} = i \mid \lambda) \\
&= P(o_{t+1} \mid q_{t+1} = i, \lambda) \sum_{q_t} P(o_1, \dots, o_t, q_t, q_{t+1} = i \mid \lambda) \\
&= P(o_{t+1} \mid q_{t+1} = i, \lambda) \sum_{q_t} P(q_{t+1} = i \mid q_t, \lambda) \, P(o_1, \dots, o_t, q_t \mid \lambda) \\
&= b_i(o_{t+1}) \sum_{q_t} a_{q_t i} \, \alpha_t(q_t)
\end{aligned}$$
Termination:
$$P(O \mid \lambda) = \sum_{i=1}^{N} P(O, q_T = i \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
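A minimal sketch of the forward recursion in Python/NumPy (the toy parameters are hypothetical and illustrative only):

```python
import numpy as np

# Hypothetical toy model (illustrative numbers only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])

def forward(O, A, B, delta):
    """Forward variables; alpha[t, i] is alpha_{t+1}(i) in 0-based indexing."""
    T, N = len(O), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = delta * B[:, O[0]]                    # alpha_1(i) = delta_i b_i(o_1)
    for t in range(T - 1):
        # alpha_{t+1}(i) = b_i(o_{t+1}) * sum_j alpha_t(j) a_{ji}
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha

alpha = forward([0, 0, 1], A, B, delta)
p_O = float(alpha[-1].sum())                         # P(O|lambda) = sum_i alpha_T(i)
```

The cost is O(T·N²), versus O(T·N^T) for the naïve enumeration.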
Evaluation Problem –Backward Algorithm
Define a backward variable
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = i, \lambda)$$
which can be recursively calculated. Initialization:
$$\beta_T(i) = 1$$
Recursion:
$$\begin{aligned}
\beta_t(i) &= P(o_{t+1}, \dots, o_T \mid q_t = i, \lambda) \\
&= \sum_{q_{t+1}} P(o_{t+1}, \dots, o_T, q_{t+1} \mid q_t = i, \lambda) \\
&= \sum_{q_{t+1}} P(q_{t+1} \mid q_t = i, \lambda) \, P(o_{t+1} \mid q_{t+1}, \lambda) \, P(o_{t+2}, \dots, o_T \mid q_{t+1}, \lambda) \\
&= \sum_{q_{t+1}} a_{i q_{t+1}} \, b_{q_{t+1}}(o_{t+1}) \, \beta_{t+1}(q_{t+1})
\end{aligned}$$
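A matching sketch of the backward recursion (same hypothetical toy model as before). As a consistency check, P(O|λ) can also be recovered as Σ_i δ_i b_i(o_1) β_1(i), which should match the forward pass:

```python
import numpy as np

# Hypothetical toy model (illustrative numbers only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])

def backward(O, A, B):
    """Backward variables; beta[t, i] is beta_{t+1}(i) in 0-based indexing."""
    T, N = len(O), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

O = [0, 0, 1]
beta = backward(O, A, B)
p_O = float(np.sum(delta * B[:, O[0]] * beta[0]))     # P(O|lambda)
```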
Decoding Problem
Given an O = {o_1, o_2, …, o_T}, find the state sequence Q* = {q_1*, q_2*, …, q_T*} that maximizes P(Q | O, λ), or equivalently the joint probability P(O, Q | λ)
Viterbi Algorithm is a dynamic programming method to solve the decoding problem
Dynamic Programming
Breaking down a complex problem into subproblems in a recursive manner.
e.g., searching for the shortest route in the traveling salesman problem
Decoding Problem – Viterbi Algorithm
Define an auxiliary probability
$$\rho_t(i) := \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = i, o_1, \dots, o_t \mid \lambda)$$
which is the highest probability that a single path leads to q_t = i at time t.
Recursively,
$$\begin{aligned}
\rho_{t+1}(j) &= \max_i \left\{ \rho_t(i) \, P(q_{t+1} = j \mid q_t = i, \lambda) \right\} P(o_{t+1} \mid q_{t+1} = j, \lambda) \\
&= \max_i \left\{ \rho_t(i) \, a_{ij} \right\} b_j(o_{t+1})
\end{aligned}$$
Decoding Problem – Viterbi Algorithm – cont'd
Define an auxiliary variable
$$\psi_{t+1}(j) = \arg\max_i \left\{ \rho_t(i) \, a_{ij} \, b_j(o_{t+1}) \right\} = \arg\max_i \left\{ \rho_t(i) \, a_{ij} \right\}$$
to store the optimal state at time t that leads to state j at time t+1.
The algorithm is:
1. Initialize
$$\rho_1(j) = \delta_j \, b_j(o_1), \quad \psi_1(j) = 0, \quad \forall j = 1, \dots, N$$
2. Recursion
$$\rho_{t+1}(j) = \max_i \left\{ \rho_t(i) \, a_{ij} \right\} b_j(o_{t+1}), \quad \psi_{t+1}(j) = \arg\max_i \left\{ \rho_t(i) \, a_{ij} \right\}$$
Decoding Problem – Viterbi Algorithm – cont'd
3. Terminate
• The optimal probability: $P^* = \max_j \left\{ \rho_T(j) \right\}$
• The optimal final state: $q_T^* = \arg\max_j \left\{ \rho_T(j) \right\}$
4. Backtrack the state sequence:
$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \dots, 1$$
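Steps 1-4 above can be sketched compactly in Python/NumPy (the toy model is hypothetical and illustrative only):

```python
import numpy as np

# Hypothetical toy model (illustrative numbers only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])

def viterbi(O, A, B, delta):
    T, N = len(O), A.shape[0]
    rho = np.zeros((T, N))               # rho_t(j): best single-path probability ending in j
    psi = np.zeros((T, N), dtype=int)    # psi_t(j): best predecessor of j
    rho[0] = delta * B[:, O[0]]          # 1. initialize: rho_1(j) = delta_j b_j(o_1)
    for t in range(1, T):                # 2. recursion
        scores = rho[t - 1][:, None] * A          # rho_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        rho[t] = scores.max(axis=0) * B[:, O[t]]
    q = np.zeros(T, dtype=int)           # 3. terminate: q_T* = argmax_j rho_T(j)
    q[-1] = rho[-1].argmax()
    for t in range(T - 2, -1, -1):       # 4. backtrack: q_t* = psi_{t+1}(q_{t+1}*)
        q[t] = psi[t + 1, q[t + 1]]
    return q.tolist(), float(rho[-1].max())

path, p_star = viterbi([0, 0, 1], A, B, delta)
```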
Learning Problem
Given an O={o1,o2,…,oT}, find a λ*={A*, B*, Δ*} with the maximum likelihood P(O|λ)
Finding the global optimum would require searching over all possible state and observation sequences
Instead, the Baum-Welch algorithm is usually used as a heuristic search
$$\lambda^* = \arg\max_\lambda P(O \mid \lambda) = \arg\max_\lambda \log P(O \mid \lambda)$$
$$\lambda^* = \arg\max_\lambda \sum_{O^{(d)}} \log \sum_{Q^{(d)}} P(O^{(d)}, Q^{(d)} \mid \lambda)$$
Learning Problem –Baum-Welch Algorithm
The algorithm:
1. Guess some parameters λ
2. Determine some "probable paths" {Q^{(1)}, …, Q^{(d)}}
3. Estimate the number of transitions from state i to state j, $\hat{a}_{ij}$, given the current estimate of λ
4. Estimate the number of times observation v_k is emitted from state i, $\hat{b}_i(v_k)$
5. Estimate the initial distribution $\hat{\delta}_i$
6. Re-estimate λ′ from the $\hat{a}_{ij}$'s and $\hat{b}_i(v_k)$'s
7. If λ′ and λ are close enough, stop; otherwise, assign λ = λ′ and go back to step 2
Learning Problem –Baum-Welch Algorithm – cont’d
Define an auxiliary likelihood
$$\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid o_1, \dots, o_T, \lambda)$$
which is the probability that a transition from q_t = i to q_{t+1} = j occurs at time t, given the complete observations {o_1, o_2, …, o_T} and the model parameters λ.
Learning Problem –Baum-Welch Algorithm – cont’d
$$\begin{aligned}
\xi_t(i,j) &= P(q_t = i, q_{t+1} = j \mid o_1, \dots, o_T, \lambda) \\
&= \frac{P(q_t = i, q_{t+1} = j, o_1, \dots, o_T \mid \lambda)}{P(o_1, \dots, o_T \mid \lambda)} \\
&= \frac{P(o_1, \dots, o_t, q_t = i \mid \lambda) \, P(q_{t+1} = j \mid q_t = i, \lambda) \, P(o_{t+1} \mid q_{t+1} = j, \lambda) \, P(o_{t+2}, \dots, o_T \mid q_{t+1} = j, \lambda)}{P(o_1, \dots, o_T \mid \lambda)} \\
&= \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_T(i)}
\end{aligned}$$
Learning Problem – Baum-Welch Algorithm – cont'd
Estimate the parameters:
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(q_t = i, q_{t+1} = j \mid o_1, \dots, o_T, \lambda)}{\sum_{t=1}^{T-1} P(q_t = i \mid o_1, \dots, o_T, \lambda)} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i,k)}$$
$$\hat{b}_i(v_k) = \frac{\sum_{t=1}^{T-1} P(o_t = v_k, q_t = i \mid o_1, \dots, o_T, \lambda)}{\sum_{t=1}^{T-1} P(q_t = i \mid o_1, \dots, o_T, \lambda)} = \frac{\sum_{t:\, o_t = v_k} \sum_{j=1}^{N} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i,j)}$$
$$\hat{\delta}_i = P(q_1 = i \mid o_1, \dots, o_T, \lambda) = \sum_{k=1}^{N} P(q_1 = i, q_2 = k \mid o_1, \dots, o_T, \lambda) = \sum_{k=1}^{N} \xi_1(i,k)$$
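One re-estimation pass using these formulas can be sketched as follows (Python/NumPy; the toy model and observation sequence are hypothetical, and the forward/backward passes are repeated inline for self-containment):

```python
import numpy as np

# Hypothetical toy model and observations (illustrative only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])
O = [0, 0, 1, 1, 0]
T, N = len(O), A.shape[0]

# Forward and backward passes.
alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = delta * B[:, O[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
p_O = alpha[-1].sum()                               # P(O|lambda)

# xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O|lambda)
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1]) / p_O

gamma = xi.sum(axis=2)                              # sum_k xi_t(i,k) = P(q_t = i | O, lambda)
A_new = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]             # a_ij estimate
B_new = np.zeros_like(B)
for k in range(B.shape[1]):
    mask = np.array(O[:T - 1]) == k                 # time steps t <= T-1 with o_t = v_k
    B_new[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)   # b_i(v_k) estimate
delta_new = gamma[0]                                # delta_i estimate = sum_k xi_1(i,k)
```

Iterating this pass, with λ replaced by the new estimates until λ′ ≈ λ, gives the Baum-Welch (EM) procedure.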
Learning Problem – Baum-Welch Algorithm – cont'd
How can we measure whether two models, λ′ and λ, are close enough?
Compare the probabilities each model assigns to the observation sequences,
$$p = \sum_{O^{(d)}} P(O^{(d)} \mid \lambda)$$
using the cross entropy between the two distributions
$$\sum_{x} p_1(x) \log \frac{p_1(x)}{p_2(x)}$$
HMM Applications
HMM has been applied in many fields that are based on the analysis of discrete-valued time series, such as
speech recognition (Rabiner, 1989)
image recognition
genetic profiling and classification (Eddy, 1998)
Kalman Filter
The Kalman filter can be regarded as a special case of the HMM
It is also known as the Gaussian linear state-space model
The states depend linearly on their history, subject to white process noise:
$$\mathbf{x}_t = \mathbf{C} \mathbf{x}_{t-1} + \mathbf{w}_t$$
The observations also depend linearly on the states, subject to measurement noise:
$$\mathbf{y}_t = \mathbf{D} \mathbf{x}_t + \mathbf{v}_t$$
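The two state-space equations can be simulated directly. A minimal sketch with hypothetical constant-velocity matrices C and D and assumed noise levels (all numbers are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[1.0, 1.0],        # hypothetical transition: position += velocity
              [0.0, 1.0]])
D = np.array([[1.0, 0.0]])       # hypothetical observation: position only
w_std, v_std = 0.05, 0.5         # assumed process / measurement noise std devs

x = np.zeros(2)
states, observations = [], []
for _ in range(20):
    x = C @ x + w_std * rng.standard_normal(2)   # x_t = C x_{t-1} + w_t
    y = D @ x + v_std * rng.standard_normal(1)   # y_t = D x_t + v_t
    states.append(x.copy())
    observations.append(y.copy())
```

Unlike the discrete HMM above, the hidden state here is continuous and Gaussian, which is what makes exact filtering tractable in closed form.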
Summary
The hidden Markov model is a generalization of the Markov chain
Further Readings
Rabiner L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): 257-286
Eddy S.R. (1998) Profile hidden Markov models. Bioinformatics, 14(9): 755-763
Dempster A.P., Laird N.M., and Rubin D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1): 1-38
Wu C.F.J. (1983) On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1): 95-103
HMM software packages: http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html