Stats 318: Lecture # 18
Agenda: Hidden Markov Models
- Hidden Markov models
- Forward algorithm for state estimation (filtering)
- Forward-backward algorithm for state estimation (smoothing)
- Viterbi algorithm for most likely explanation
- EM algorithm (Baum-Welch) for parameter estimation
Hidden Markov model
- {H_t} & {Y_t} discrete-time stochastic processes
- {H_t} a Markov chain, not directly observable ("hidden")
- {Y_t} directly observable, with

  P(Y_t | H_{1:T}) = P(Y_t | H_{1:t}) = P(Y_t | H_t)

- Terminology: transition probabilities P(H_{t+1} | H_t),
  emission probabilities P(Y_t | H_t)
- Homogeneous if the above probabilities are time-independent (assumed henceforth)
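The generative story above can be made concrete by sampling from a homogeneous HMM. A minimal numpy sketch; the two-state matrices A, B and initial distribution π below are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Illustrative two-state HMM (assumed parameters, not from the lecture)
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.2, 0.8]])    # A[i, j] = P(H_{t+1}=j | H_t=i)
B = np.array([[0.7, 0.3], [0.1, 0.9]])    # B[i, k] = P(Y_t=k | H_t=i)
pi = np.array([0.5, 0.5])                 # distribution of H_1

def sample_hmm(T):
    """Draw (hidden, observed) sequences of length T from the HMM."""
    h = rng.choice(2, p=pi)
    hidden, observed = [], []
    for _ in range(T):
        hidden.append(int(h))
        observed.append(int(rng.choice(2, p=B[h])))  # emit given current state
        h = rng.choice(2, p=A[h])                    # Markov transition
    return hidden, observed

hidden, observed = sample_hmm(5)
```

Only the hidden list is the Markov chain; the observed list is what the inference algorithms below get to see.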
Examples
- Speech recognition
- Finance forecasting
- DNA motif discovery
- ...
Speech recognition
Copy number variations
How many duplications do we have? Do we have deletions?
Inference problems
1. What is the probability/likelihood of an observed sequence y_{1:T}?
2. What is the probability of a latent variable given Y?
3. What is the most likely value of a latent variable? ("decoding problem")
4. Given one or several observed sequences, how can we estimate the model parameters,
   i.e. the transition & emission probabilities (and the distribution of the initial latent variable)?
Probability of an observed sequence
P(y_{1:T}) = Σ_{h_{1:T}} P(y_{1:T}, h_{1:T})
           = Σ_{h_{1:T}} P(h_1) ∏_{t=2}^{T} P(h_t | h_{t-1}) ∏_{t=1}^{T} P(y_t | h_t)    (∗)

(∗) = Σ_{h_T} P(y_T | h_T) Σ_{h_{1:T-1}} P(h_T | h_{T-1}) P(h_1) ∏_{t=2}^{T-1} P(h_t | h_{t-1}) ∏_{t=1}^{T-1} P(y_t | h_t)

This suggests a dynamic-programming solution.
Forward algorithm
- Forward probabilities: α_t(h_t) = P(y_{1:t}, h_t)    [y_{1:T} is given throughout]
- Recursion: α_1(h_1) = P(h_1) P(y_1 | h_1), and

  α_{t+1}(h_{t+1}) = Σ_{h_t} P(y_{t+1} | h_{t+1}, h_t, y_{1:t}) P(h_{t+1} | h_t, y_{1:t}) α_t(h_t)
                   = P(y_{t+1} | h_{t+1}) Σ_{h_t} P(h_{t+1} | h_t) α_t(h_t)

  One matrix-vector product per time step!
- Likelihood of an observed sequence: P(y_{1:T}) = Σ_h α_T(h)
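The recursion above can be sketched in a few lines of numpy. The parameters A, B, π and the observation sequence y are illustrative assumptions; the recursion itself follows the slide.

```python
import numpy as np

# Illustrative two-state HMM (assumed parameters, not from the lecture)
A = np.array([[0.9, 0.1], [0.2, 0.8]])    # A[i, j] = P(H_{t+1}=j | H_t=i)
B = np.array([[0.7, 0.3], [0.1, 0.9]])    # B[i, k] = P(Y_t=k | H_t=i)
pi = np.array([0.5, 0.5])                 # distribution of H_1
y = [0, 0, 1, 0]                          # observed sequence (assumed)

def forward(y, A, B, pi):
    """alpha[t, i] = P(y_{1:t+1}, H_{t+1}=i) with 0-based indexing."""
    T, n = len(y), len(pi)
    alpha = np.zeros((T, n))
    alpha[0] = pi * B[:, y[0]]            # alpha_1(h) = P(h) P(y_1 | h)
    for t in range(1, T):
        # one matrix-vector product per time step
        alpha[t] = B[:, y[t]] * (A.T @ alpha[t - 1])
    return alpha

alpha = forward(y, A, B, pi)
likelihood = alpha[-1].sum()              # P(y_{1:T}) = sum_h alpha_T(h)
```

Filtering drops out for free: normalizing each row of alpha gives P(h_t | y_{1:t}).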
Probability of latent variables
Interested in conditional distribution of latent variables:
(a) Filtering : P(ht | y1:t) [forward algorithm]
(b) Smoothing (hindsight) : P(ht | y1:T ) [forward-backward algorithm]
(c) Most likely explanation : arg max_{h_{1:T}} P(h_{1:T} | y_{1:T}) [Viterbi algorithm]
Filtering : P(ht | y1:t)
Forward probabilities α_t(h_t) = P(h_t, y_{1:t}) give

P(h_t | y_{1:t}) = α_t(h_t) / Σ_h α_t(h)
Smoothing : P(ht | y1:T)
Conditional independence of y_{t+1:T} & y_{1:t} given h_t, plus Bayes' rule, gives

P(h_t | y_{1:T}) = P(h_t | y_{1:t}, y_{t+1:T}) ∝ P(y_{t+1:T} | h_t) · P(h_t | y_{1:t})
                                                 [backward prob]     [∼ forward prob]

Backward probabilities :
β_T(h_T) = 1
β_t(h_t) = P(y_{t+1:T} | h_t)

Recursion:
β_t(h_t) = Σ_{h_{t+1}} P(y_{t+1:T}, h_{t+1} | h_t) = Σ_{h_{t+1}} P(h_{t+1} | h_t) P(y_{t+1} | h_{t+1}) β_{t+1}(h_{t+1})
One matrix-vector product per time step!
Forward-backward algorithm
- Compute forward probabilities : α_t(h_t) = P(y_{1:t}, h_t)
- Compute backward probabilities : β_t(h_t) = P(y_{t+1:T} | h_t)

P(h_t | y_{1:T}) = P(y_{1:t}, h_t) P(y_{t+1:T} | h_t) / P(y_{1:T}) = α_t(h_t) β_t(h_t) / Σ_h α_T(h) = γ_t(h_t)

Complexity of the algorithm is O(T n²), where n is the number of hidden states.
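The backward recursion and the smoothed posteriors γ_t can be sketched alongside the forward pass. A, B, π, y are the same illustrative assumptions as in the earlier sketches.

```python
import numpy as np

# Illustrative two-state HMM (assumed parameters, not from the lecture)
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
y = [0, 0, 1, 0]

def forward(y, A, B, pi):
    T, n = len(y), len(pi)
    alpha = np.zeros((T, n))
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = B[:, y[t]] * (A.T @ alpha[t - 1])
    return alpha

def backward(y, A, B):
    T, n = len(y), A.shape[0]
    beta = np.ones((T, n))                     # beta_T(h) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(h) = sum_{h'} A[h, h'] B[h', y_{t+1}] beta_{t+1}(h')
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

alpha = forward(y, A, B, pi)
beta = backward(y, A, B)
gamma = alpha * beta / alpha[-1].sum()         # gamma_t(h) = P(H_t=h | y_{1:T})
```

A useful sanity check: for every t, Σ_h α_t(h) β_t(h) recovers the same likelihood P(y_{1:T}).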
Most likely explanation : arg max_{h_{1:T}} P(h_{1:T} | y_{1:T})

Value function : V_1(j) = P(y_1, h_1 = j) = P(y_1 | h_1 = j) P(h_1 = j)

V_t(j) = max_{h_{1:t-1}} P(h_{1:t-1}, y_{1:t}, h_t = j)

Recursion:
V_{t+1}(j) = max_{h_{1:t}} P(h_{1:t-1}, y_{1:t}, h_t, y_{t+1}, h_{t+1} = j)
           = max_i max_{h_{1:t-1}} P(h_{1:t-1}, y_{1:t}, h_t = i) P(h_{t+1} = j | h_t = i) P(y_{t+1} | h_{t+1} = j)
           = max_i V_t(i) P(h_{t+1} = j | h_t = i) P(y_{t+1} | h_{t+1} = j)    (∗)

Keeping track of the optimal state : ψ_t(j) = arg max_i in (∗); ψ_T( ) = arg max_j V_T(j)
Viterbi algorithm
- Compute V_t for t = 1, . . . , T
- Compute ψ_t along the way
- Backtrack for the most likely sequence :
  H_T = ψ_T( )
  For t = T − 1, T − 2, . . . , 1 : H_t = ψ_t(H_{t+1})
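The algorithm can be sketched in log space (products of many probabilities underflow in floating point). A, B, π, y are illustrative assumptions as before.

```python
import numpy as np

# Illustrative two-state HMM (assumed parameters, not from the lecture)
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
y = [0, 0, 1, 1]

def viterbi(y, A, B, pi):
    """Most likely hidden sequence, computed with log probabilities."""
    T, n = len(y), len(pi)
    logA, logB = np.log(A), np.log(B)
    V = np.zeros((T, n))
    psi = np.zeros((T, n), dtype=int)          # psi[t, j]: best predecessor of state j
    V[0] = np.log(pi) + logB[:, y[0]]          # V_1(j) = log P(y_1, H_1=j)
    for t in range(1, T):
        scores = V[t - 1][:, None] + logA      # scores[i, j] = V_{t-1}(i) + log A[i, j]
        psi[t] = scores.argmax(axis=0)
        V[t] = scores.max(axis=0) + logB[:, y[t]]
    # backtrack from the best final state
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

path = viterbi(y, A, B, pi)
```

Note that with the sticky transition matrix assumed here (0.9 self-transition), the decoded path can stay in state 0 even after emissions that favor state 1: Viterbi maximizes the joint probability of the whole sequence, not each state marginally.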
Parameter estimation (learning)
Given observed sequence(s) y_{1:T}, how can we estimate the model parameters?
Parameters (collectively denoted by θ), writing X for the hidden chain from here on:
- π : distribution of X_1
- T(x, x′) = P(X_{t+1} = x′ | X_t = x) transition probabilities
- E(x, y) = P(Y_t = y | X_t = x) emission probabilities
Solution via maximum likelihood & the Baum-Welch (EM) algorithm
Baum-Welch algorithm
- Complete likelihood:

  P(X_{1:T}, Y_{1:T}) = P(X_1) ∏_{t=2}^{T} P(X_t | X_{t-1}) ∏_{t=1}^{T} P(Y_t | X_t)

- Parameter estimation from the complete likelihood is easy (why?)
- EM iteration from the current parameter value θ_k:
  E-step: compute Q(θ || θ_k) = E_X[log P_θ(X, Y) | Y, θ_k]
  M-step: θ_{k+1} = arg max_θ Q(θ || θ_k)
E-step
- γ_t(x) = P(X_t = x | y_{1:T}, θ_k)    [forward-backward algorithm]
- ξ_t(x, x′) = P(X_t = x, X_{t+1} = x′ | y_{1:T}, θ_k) ∝ α_t(x) T^{(k)}(x, x′) β_{t+1}(x′) E^{(k)}(x′, y_{t+1})

The above relation is not hard to show. These probabilities use the current guesses of the model parameters:
T^{(k)}(x, x′) = P(X_{t+1} = x′ | X_t = x, θ_k),   E^{(k)}(x, y) = P(Y_t = y | X_t = x, θ_k)

- Log-likelihood:

  log P(X, Y) = log π(X_1) + Σ_{t=1}^{T-1} log T(X_t, X_{t+1}) + Σ_{t=1}^{T} log E(X_t, y_t)
E-step

Q(θ || θ_k) = Σ_x γ_1(x) log π(x) + Σ_{t=1}^{T-1} Σ_{x,x′} ξ_t(x, x′) log T(x, x′) + Σ_{t=1}^{T} Σ_x γ_t(x) log E(x, y_t)
Q-step
Exchanging the order of summation:

Q(θ || θ_k) = Σ_x γ_1(x) log π(x) + Σ_{x,x′} Σ_{t=1}^{T-1} ξ_t(x, x′) log T(x, x′) + Σ_x Σ_{t=1}^{T} γ_t(x) log E(x, y_t)
Q-step: update parameters π(·), T(·, ·), E(·, ·)
- π⁺(x) = γ_1(x)
- T⁺(x, x′) = Σ_{t=1}^{T-1} ξ_t(x, x′) / Σ_z Σ_{t=1}^{T-1} ξ_t(x, z) = Σ_{t=1}^{T-1} ξ_t(x, x′) / Σ_{t=1}^{T-1} γ_t(x)
- E⁺(x, v) = Σ_{t=1}^{T} 1{y_t = v} γ_t(x) / Σ_{t=1}^{T} γ_t(x)
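A single Baum-Welch iteration can be sketched by combining the forward-backward quantities with the updates above. A, B, π, y are illustrative assumptions, and a real implementation would iterate until the likelihood stops increasing.

```python
import numpy as np

# Illustrative two-state HMM (assumed current guesses, not from the lecture)
A = np.array([[0.9, 0.1], [0.2, 0.8]])    # T^(k)(x, x')
B = np.array([[0.7, 0.3], [0.1, 0.9]])    # E^(k)(x, y)
pi = np.array([0.5, 0.5])                 # pi^(k)
y = [0, 0, 1, 0, 1, 1]
T, n = len(y), len(pi)

# Forward-backward pass under the current parameters
alpha = np.zeros((T, n)); alpha[0] = pi * B[:, y[0]]
for t in range(1, T):
    alpha[t] = B[:, y[t]] * (A.T @ alpha[t - 1])
beta = np.ones((T, n))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
like = alpha[-1].sum()                    # P(y_{1:T} | theta_k)

# E-step: gamma_t(x) and xi_t(x, x')
gamma = alpha * beta / like
xi = np.array([np.outer(alpha[t], B[:, y[t + 1]] * beta[t + 1]) * A / like
               for t in range(T - 1)])    # xi[t, x, x']

# M-step: reestimated parameters
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
obs = np.eye(B.shape[1])[y]               # one-hot observations, obs[t, v] = 1{y_t = v}
B_new = gamma.T @ obs / gamma.sum(axis=0)[:, None]
```

Each update is a ratio of expected counts, which is why the new π⁺, rows of T⁺, and rows of E⁺ are automatically valid probability distributions.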
That’s all folks!Enjoy the summer