Ch 13. Sequential Data (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by Kim Jin-young
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents
13.1 Markov Models
13.2 Hidden Markov Models
  13.2.1 Maximum likelihood for the HMM
  13.2.2 The forward-backward algorithm
  13.2.3 The sum-product algorithm for the HMM
  13.2.4 Scaling factors
  13.2.5 The Viterbi algorithm
  13.2.6 Extensions of the HMM
Sequential Data
Data dependency exists according to a sequence
  Weather data, DNA, characters in a sentence
  The i.i.d. assumption doesn't hold
Sequential distributions
  Stationary vs. nonstationary
Markov model
  No latent variable
State space models
  Hidden Markov model (discrete latent variables)
  Linear dynamical systems
Markov Models
Markov Chain

p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1})

First-order:
p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})

Second-order:
p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2})

State Space Model (HMM)

p(x_1, \ldots, x_N, z_1, \ldots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n)

(free of Markov assumption of any order with a reasonable no. of extra parameters)
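To make the first-order factorization concrete, here is a minimal numpy sketch; the two-state chain, pi, and A are illustrative placeholders, not values from the text:

```python
import numpy as np

# Hypothetical 2-state weather chain: states 0 = sunny, 1 = rainy.
pi = np.array([0.6, 0.4])          # p(x_1)
A = np.array([[0.7, 0.3],          # A[i, j] = p(x_n = j | x_{n-1} = i)
              [0.4, 0.6]])

def markov_log_prob(x):
    """ln p(x_1, ..., x_N) under the first-order Markov factorization."""
    logp = np.log(pi[x[0]])
    for prev, cur in zip(x[:-1], x[1:]):
        logp += np.log(A[prev, cur])
    return logp

print(markov_log_prob([0, 0, 1, 1]))  # e.g. sunny, sunny, rainy, rainy
```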
Hidden Markov Model (overview)
Overview
  Introduction of discrete latent vars. (based on prior knowledge)
Examples
  Coin toss, urn and ball
Three issues (given an observation sequence)
  Parameter estimation
  Probability of the observation sequence
  Most likely sequence of latent variables
Hidden Markov Model (example)
Lattice Representation
Left-to-right HMM (Handwriting Recognition)
Hidden Markov Model
Given the following (observation, latent var, model parameters):

X = \{x_1, \ldots, x_N\}, \quad Z = \{z_1, \ldots, z_N\}, \quad \theta = \{\pi, A, \phi\}

the joint prob. dist. for the HMM is:

p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)

whose elements are (initial state, state transition, emission):

p(z_1 \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}} \quad (initial latent node)
p(z_n \mid z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j}\, z_{nk}} \quad (cond. dist. among latent vars)
p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{z_{nk}} \quad (emission prob.)

K: number of states / N: total number of time steps
z_{n-1,j}\, z_{nk} = 1 indicates being in state j at time n-1 and transitioning to state k at time n
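A minimal sketch of evaluating the joint for a known state path, assuming categorical emissions as a stand-in for the generic p(x_n | phi_k); pi, A, and B are illustrative placeholders:

```python
import numpy as np

# Illustrative parameters (K = 2 latent states, 3 discrete symbols).
pi = np.array([0.5, 0.5])                  # initial state dist.
A = np.array([[0.8, 0.2], [0.3, 0.7]])     # A[j, k] = p(z_n=k | z_{n-1}=j)
B = np.array([[0.6, 0.3, 0.1],             # B[k, v] = p(x_n=v | z_n=k)
              [0.1, 0.3, 0.6]])

def hmm_joint_log_prob(x, z):
    """ln p(X, Z | theta) = ln p(z_1) + sum ln p(z_n|z_{n-1}) + sum ln p(x_n|z_n)."""
    logp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for n in range(1, len(x)):
        logp += np.log(A[z[n-1], z[n]]) + np.log(B[z[n], x[n]])
    return logp
```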
EM Revisited (slide by Seok Ho-sik)
General EM: maximizing the log likelihood function
Given a joint distribution p(X, Z|Θ) over observed variables X and latent variables Z, governed by parameters Θ:
1. Choose an initial setting for the parameters Θold
2. E step: evaluate p(Z|X, Θold), the posterior dist. of latent vars
3. M step: evaluate Θnew given by Θnew = argmaxΘ Q(Θ, Θold), where
   Q(Θ, Θold) = ΣZ p(Z|X, Θold) ln p(X, Z|Θ)
4. If the convergence criterion is not satisfied, let Θold ← Θnew and return to step 2
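As a sketch, the general loop might look like this in Python; e_step, m_step, and the convergence test on Q are assumed interfaces, not code from the slides:

```python
def em(x, theta, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E and M steps until the improvement
    falls below tol.  e_step(x, theta) -> (posterior stats, objective);
    m_step(x, stats) -> new theta.  (Interfaces are illustrative.)"""
    prev_q = float("-inf")
    for _ in range(max_iter):
        stats, q = e_step(x, theta)   # E: evaluate p(Z|X, theta_old)
        theta = m_step(x, stats)      # M: maximize Q(theta, theta_old)
        if q - prev_q < tol:          # convergence criterion
            break
        prev_q = q
    return theta
```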
Estimation of HMM Parameters
The Likelihood Function

p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta) \quad (marginalization over latent var Z)

where the complete-data likelihood is

p(X, Z \mid \theta) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n)

Using EM Algorithm: E-Step

Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)

\gamma(z_n) = p(z_n \mid X, \theta^{old}), \qquad \gamma(z_{nk}) = \mathbb{E}[z_{nk}]
\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X, \theta^{old}), \qquad \xi(z_{n-1,j}, z_{nk}) = \mathbb{E}[z_{n-1,j}\, z_{nk}]

Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk} + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)
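Given the E-step statistics, Q(θ, θ^old) is a sum of three weighted log terms; a minimal numpy sketch (array shapes are assumptions of this sketch):

```python
import numpy as np

def q_function(gamma, xi, log_pi, log_A, log_emission):
    """Q(theta, theta_old) assembled from E-step statistics.
    gamma: (N, K) posteriors gamma(z_nk); xi: (N-1, K, K) pairwise posteriors;
    log_emission: (N, K) table of ln p(x_n | phi_k)."""
    return (gamma[0] @ log_pi                 # sum_k gamma(z_1k) ln pi_k
            + np.sum(xi * log_A)              # sum_n sum_j sum_k xi * ln A_jk
            + np.sum(gamma * log_emission))   # sum_n sum_k gamma * ln p(x_n|phi_k)
```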
Estimation of HMM Parameters
M-Step (given Gaussian emission density p(x \mid \phi_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k))

Initial:
\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}

Transition:
A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}

Emission:
\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}, \qquad \Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{T}}{\sum_{n=1}^{N} \gamma(z_{nk})}
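A hedged numpy sketch of these updates, assuming gamma and xi have already been computed in the E-step (shapes noted in the docstring):

```python
import numpy as np

def m_step(x, gamma, xi):
    """M-step updates under the Gaussian emission density above.
    Assumed shapes: x (N, D), gamma (N, K), xi (N-1, K, K)."""
    pi = gamma[0] / gamma[0].sum()
    xi_sum = xi.sum(axis=0)                         # (K, K)
    A = xi_sum / xi_sum.sum(axis=1, keepdims=True)  # normalize over l per row j
    Nk = gamma.sum(axis=0)                          # effective counts per state
    mu = (gamma.T @ x) / Nk[:, None]                # (K, D)
    Sigma = np.empty((len(Nk), x.shape[1], x.shape[1]))
    for k in range(len(Nk)):
        d = x - mu[k]                               # centered data (N, D)
        Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k]
    return pi, A, mu, Sigma
```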
Forward-backward Algorithm
Probability for a single latent var (used for parameter estimation):

\gamma(z_n) = p(z_n \mid X) = \frac{p(X \mid z_n)\, p(z_n)}{p(X)} = \frac{p(x_1, \ldots, x_n, z_n)\, p(x_{n+1}, \ldots, x_N \mid z_n)}{p(X)} = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)}

Probability for two successive latent vars:

\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X) = \frac{p(X \mid z_{n-1}, z_n)\, p(z_{n-1}, z_n)}{p(X)}
= \frac{p(x_1, \ldots, x_{n-1}, z_{n-1})\, p(x_n \mid z_n)\, p(x_{n+1}, \ldots, x_N \mid z_n)\, p(z_n \mid z_{n-1})}{p(X)}
= \frac{\alpha(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \beta(z_n)}{p(X)}
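Assuming alpha and beta are available (the recursions are defined on the next slide), gamma and xi follow by direct evaluation; a minimal numpy sketch:

```python
import numpy as np

def posteriors(alpha, beta, A, emis):
    """gamma(z_n) and xi(z_{n-1}, z_n) from unscaled alpha/beta.
    Assumed shapes: alpha, beta, emis all (N, K), with
    emis[n, k] = p(x_n | z_n = k) and A[j, k] = p(z_n = k | z_{n-1} = j)."""
    pX = alpha[-1].sum()                 # p(X) = sum over z_N of alpha(z_N)
    gamma = alpha * beta / pX            # (N, K)
    # xi[n-1, j, k] = alpha(z_{n-1,j}) p(x_n|z_nk) A_jk beta(z_nk) / p(X)
    xi = alpha[:-1, :, None] * A[None] * (emis[1:] * beta[1:])[:, None, :] / pX
    return gamma, xi
```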
Forward & Backward Variable
Defining alpha & beta recursively:

\alpha(z_n) \equiv p(x_1, \ldots, x_n, z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1})

\beta(z_n) \equiv p(x_{n+1}, \ldots, x_N \mid z_n) = \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)

Probability of observation:

p(X \mid \theta) = \sum_{z_n} \alpha(z_n)\, \beta(z_n), \qquad p(X \mid \theta) = \sum_{z_N} \alpha(z_N)
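A minimal numpy sketch of the unscaled recursions; the precomputed table emis[n, k] = p(x_n | z_n = k) is an assumption of this sketch:

```python
import numpy as np

def forward_backward(pi, A, emis):
    """Unscaled alpha/beta recursions.  Returns arrays of shape (N, K)."""
    N, K = emis.shape
    alpha = np.zeros((N, K))
    beta = np.ones((N, K))                 # beta(z_N) = 1
    alpha[0] = pi * emis[0]                # alpha(z_1) = p(z_1) p(x_1|z_1)
    for n in range(1, N):
        # alpha(z_n) = p(x_n|z_n) sum_{z_{n-1}} alpha(z_{n-1}) p(z_n|z_{n-1})
        alpha[n] = emis[n] * (alpha[n-1] @ A)
    for n in range(N - 2, -1, -1):
        # beta(z_n) = sum_{z_{n+1}} beta p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n)
        beta[n] = A @ (emis[n+1] * beta[n+1])
    return alpha, beta
```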
Sum-product Algorithm
Factor graph representation (an alternative to the forward-backward algo.):

h(z_1) = p(z_1)\, p(x_1 \mid z_1)
f_n(z_{n-1}, z_n) = p(z_n \mid z_{n-1})\, p(x_n \mid z_n)

Message passing gives the same result as before:

\mu_{h \to z_1}(z_1) = h(z_1)
\mu_{f_n \to z_n}(z_n) = \sum_{z_{n-1}} f_n(z_{n-1}, z_n)\, \mu_{f_{n-1} \to z_{n-1}}(z_{n-1})
\alpha(z_n) = \mu_{f_n \to z_n}(z_n), \qquad \beta(z_n) = \mu_{f_{n+1} \to z_n}(z_n)

p(z_n, X) = \mu_{f_n \to z_n}(z_n)\, \mu_{f_{n+1} \to z_n}(z_n) = \alpha(z_n)\, \beta(z_n)

\gamma(z_n) = \frac{p(z_n, X)}{p(X)} = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)} \quad (we condition on x_1, x_2, \ldots, x_N)
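Since the factor-graph messages reduce to exactly the alpha/beta recursions, a quick numerical check (reusing the hypothetical forward_backward sketch and the placeholder parameters pi, A, emis from above) confirms the two derivations agree:

```python
import numpy as np

# mu_{f_n -> z_n} and mu_{f_{n+1} -> z_n} are exactly alpha and beta,
# so gamma(z_n) = alpha(z_n) beta(z_n) / p(X) via either route.
alpha, beta = forward_backward(pi, A, emis)
gamma = alpha * beta / alpha[-1].sum()      # p(z_n | x_1, ..., x_N)
assert np.allclose(gamma.sum(axis=1), 1.0)  # each posterior normalizes to 1
```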
Scaling Factors
Alpha & beta variables can go to zero exponentially quickly. (Implementation issue)

What if we rescale alpha & beta so that their values remain of order unity?

\hat{\alpha}(z_n) \equiv p(z_n \mid x_1, \ldots, x_n) = \frac{\alpha(z_n)}{p(x_1, \ldots, x_n)}

c_n \equiv p(x_n \mid x_1, \ldots, x_{n-1})

c_n\, \hat{\alpha}(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \hat{\alpha}(z_{n-1})\, p(z_n \mid z_{n-1})

\hat{\beta}(z_n) \equiv \frac{p(x_{n+1}, \ldots, x_N \mid z_n)}{p(x_{n+1}, \ldots, x_N \mid x_1, \ldots, x_n)}

c_{n+1}\, \hat{\beta}(z_n) = \sum_{z_{n+1}} \hat{\beta}(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)
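A sketch of the scaled forward pass under the same assumed emis table; as a byproduct the scaling factors give ln p(X) = Σ_n ln c_n, avoiding underflow:

```python
import numpy as np

def scaled_forward(pi, A, emis):
    """Scaled recursion: alpha_hat(z_n) = p(z_n | x_1..x_n) stays O(1)."""
    N, K = emis.shape
    alpha_hat = np.zeros((N, K))
    c = np.zeros(N)
    a = pi * emis[0]                         # unnormalized alpha(z_1)
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = emis[n] * (alpha_hat[n-1] @ A)   # equals c_n * alpha_hat(z_n)
        c[n] = a.sum()                       # c_n = p(x_n | x_1..x_{n-1})
        alpha_hat[n] = a / c[n]
    return alpha_hat, c, np.log(c).sum()     # last value is ln p(X)
```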
The Viterbi Algorithm
From the max-sum algorithm (finds the most likely state sequence):

\omega(z_1) = \ln p(x_1 \mid z_1) + \ln p(z_1)
\omega(z_{n+1}) = \ln p(x_{n+1} \mid z_{n+1}) + \max_{z_n} \{ \ln p(z_{n+1} \mid z_n) + \omega(z_n) \}

Joint dist. along the most probable path:

\omega(z_n) = \max_{z_1, \ldots, z_{n-1}} \ln p(x_1, \ldots, x_n, z_1, \ldots, z_n)

Backtracking the most probable path (Eq. 13.68 revised):

k_n^{\max} = \psi_{n+1}(k_{n+1}^{\max})
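A minimal numpy sketch of Viterbi in log space, with psi recording the argmax for the backtracking step; variable names and the emis table are illustrative:

```python
import numpy as np

def viterbi(pi, A, emis):
    """Most probable state path via the max-sum recursion in log space."""
    N, K = emis.shape
    log_A = np.log(A)
    omega = np.zeros((N, K))
    psi = np.zeros((N, K), dtype=int)        # argmax bookkeeping
    omega[0] = np.log(pi) + np.log(emis[0])
    for n in range(1, N):
        scores = omega[n-1][:, None] + log_A  # (prev state, current state)
        psi[n] = scores.argmax(axis=0)
        omega[n] = np.log(emis[n]) + scores.max(axis=0)
    # Backtrack: k_n^max = psi_{n+1}(k_{n+1}^max)
    path = np.zeros(N, dtype=int)
    path[-1] = omega[-1].argmax()
    for n in range(N - 2, -1, -1):
        path[n] = psi[n+1][path[n+1]]
    return path, omega[-1].max()              # path and ln of its joint prob.
```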
Extensions of HMM
Autoregressive HMM: considers long-term time dependency
Input-output HMM: for supervised learning
Factorial HMM: for decoding multiple bits of info.