Maximum Likelihood and Parameter Estimation for HMM
Lecture #8
Background Readings: Chapter 3.3 in Biological Sequence Analysis, Durbin et al., 2001. © Shlomo Moran, following Danny Geiger and Nir Friedman
2
Parameter Estimation for HMM
An HMM model is defined by the parameters: mkl and ek(b), for all states k,l and all symbols b. Let θ denote the collection of these parameters:
[Figure: an HMM with hidden states s_1, s_2, ..., s_{L-1}, s_L and emitted symbols X_1, X_2, ..., X_{L-1}, X_L; each transition s_{i-1} → s_i carries probability m_kl and each emission s_i → X_i carries probability e_k(b).]

$$\theta = \{\, m_{kl} : k, l \text{ are states} \,\} \cup \{\, e_k(b) : k \text{ is a state},\ b \text{ is a letter} \,\}$$
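A minimal way to hold these parameters in code is a pair of nested dictionaries. The sketch below is purely illustrative and not part of the lecture: the two states "F"/"L" and symbols "H"/"T" (a fair/loaded coin) are hypothetical placeholders.

```python
# Hypothetical container for theta: transition probabilities m[k][l] and
# emission probabilities e[k][b], for an illustrative two-state HMM.
theta = {
    "m": {  # m[k][l] = probability of moving from state k to state l
        "F": {"F": 0.95, "L": 0.05},
        "L": {"F": 0.10, "L": 0.90},
    },
    "e": {  # e[k][b] = probability that state k emits symbol b
        "F": {"H": 0.5, "T": 0.5},   # "fair" state
        "L": {"H": 0.9, "T": 0.1},   # "loaded" state
    },
}

# Each transition row and each emission table must sum to 1.
for table in theta.values():
    for k, row in table.items():
        assert abs(sum(row.values()) - 1.0) < 1e-9
```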
3
Parameter Estimation for HMM
To determine the values of the parameters in θ, we use a training set {x^1,...,x^n}, where each x^j is a sequence assumed to fit the model. Given the parameters θ, each sequence x^j has an assigned probability p(x^j|θ).
4
Maximum Likelihood Parameter Estimation for HMM
The elements of the training set {x^1,...,x^n} are assumed to be independent, so p(x^1,..., x^n|θ) = ∏_j p(x^j|θ). ML parameter estimation looks for the θ which maximizes the above.
The exact method for finding or approximating this θ depends on the nature of the training set used.
5
Data for HMM
The training set is characterized by:
1. For each x^j, whether the states s^j_i are known (the symbols x^j_i are usually known).
2. Its size (number of sequences).
6
Case 1: ML when Sequences are fully known
We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set: p(x^1,...,x^n|θ*) = max_θ p(x^1,...,x^n|θ).
7
Case 1: Sequences are fully known
For each x^j we have:
$$p(x^j\mid\theta)=\prod_{i=1}^{L} m_{s_{i-1}s_i}\, e_{s_i}(x_i^j)$$
Let $M_{kl}^j = |\{\, i : s_{i-1}^j=k,\ s_i^j=l \,\}|$ (the superscript $j$ indicates the sequence $x^j$), and let $E_k^j(b) = |\{\, i : s_i^j=k,\ x_i^j=b \,\}|$. Then:
$$\mathrm{prob}(x^j\mid\theta)=\prod_{(k,l)} m_{kl}^{M_{kl}^j}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k^j(b)}$$
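As a sanity check on this identity, here is a small Python sketch (not from the lecture; the two-state model, the begin state "B" playing the role of s_0, and the particular sequences are all hypothetical) that computes the probability of a fully observed sequence both as the product over positions and via the counts M^j_kl and E^j_k(b):

```python
from collections import Counter
import math

# Hypothetical two-state HMM (states "F", "L"; symbols "H", "T"); "B" acts as the begin state s_0.
m = {"B": {"F": 0.5, "L": 0.5},
     "F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {"H": 0.5, "T": 0.5},
     "L": {"H": 0.9, "T": 0.1}}

s = ["F", "F", "L", "L", "F"]   # a known state path s_1..s_L
x = ["H", "T", "H", "H", "T"]   # the observed symbols x_1..x_L

# Direct form: product over positions of m_{s_{i-1} s_i} * e_{s_i}(x_i).
p_direct, prev = 1.0, "B"
for k, b in zip(s, x):
    p_direct *= m[prev][k] * e[k][b]
    prev = k

# Count form: prod_{(k,l)} m_kl^{M_kl} * prod_{(k,b)} e_k(b)^{E_k(b)}.
M = Counter(zip(["B"] + s[:-1], s))   # M_kl = #{i : s_{i-1}=k, s_i=l}
E = Counter(zip(s, x))                # E_k(b) = #{i : s_i=k, x_i=b}
p_counts = 1.0
for (k, l), c in M.items():
    p_counts *= m[k][l] ** c
for (k, b), c in E.items():
    p_counts *= e[k][b] ** c

assert math.isclose(p_direct, p_counts)
print(p_direct)
```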
8
Case 1 (cont)
By the independence of the x^j's, p(x^1,...,x^n|θ) = ∏_j p(x^j|θ).
Thus, if M_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set, we have:
$$\mathrm{prob}(x^1,\ldots,x^n\mid\theta)=\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)}$$
9
Case 1 (cont)
So we need to find m_kl's and e_k(b)'s which maximize:
$$\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)}$$
Subject to: for all states k,
$$\sum_{l} m_{kl}=1 \quad\text{and}\quad \sum_{b} e_k(b)=1 \qquad [\,m_{kl},\ e_k(b)\ge 0\,]$$
10
Case 1 (cont)
Rewriting, we need to maximize:
$$F=\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)} = \prod_{k}\Big[\prod_{l} m_{kl}^{M_{kl}}\Big]\times\prod_{k}\Big[\prod_{b}\big[e_k(b)\big]^{E_k(b)}\Big]$$
Subject to: for all k, $\sum_{l} m_{kl}=1$ and $\sum_{b} e_k(b)=1$.
11
Case 1 (cont)
If we maximize, for each k,
$$\prod_{l} m_{kl}^{M_{kl}} \quad\text{s.t.}\quad \sum_{l} m_{kl}=1$$
and also
$$\prod_{b}\big[e_k(b)\big]^{E_k(b)} \quad\text{s.t.}\quad \sum_{b} e_k(b)=1,$$
then we also maximize F. Each of the above is a simpler ML problem, similar to ML parameter estimation for a die, treated next.
12
ML parameter estimation for a die
Let X be a random variable with 6 values x_1,…,x_6 denoting the six outcomes of a (possibly unfair) die. Here the parameters are θ = {q_1,q_2,q_3,q_4,q_5,q_6}, with ∑_i q_i = 1. Assume that the data is one sequence:
Data = (x_6,x_1,x_1,x_3,x_2,x_2,x_3,x_4,x_5,x_2,x_6)
So we have to maximize
$$\mathrm{prob}(Data\mid\theta)= q_1^{2}\cdot q_2^{3}\cdot q_3^{2}\cdot q_4\cdot q_5\cdot q_6^{2}$$
subject to: q_1+q_2+q_3+q_4+q_5+q_6 = 1 [and q_i ≥ 0], i.e.,
$$\mathrm{prob}(Data\mid\theta)= q_1^{2}\cdot q_2^{3}\cdot q_3^{2}\cdot q_4\cdot q_5\cdot\Big(1-\sum_{i=1}^{5} q_i\Big)^{2}$$
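In code this is just counting and multiplying. A small sketch (the candidate q values below are arbitrary illustrative numbers, not from the lecture):

```python
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]        # the Data sequence above
n = Counter(data)                                # n_i = number of times side i appears
# n == Counter({2: 3, 6: 2, 1: 2, 3: 2, 4: 1, 5: 1})

q = {1: 0.1, 2: 0.3, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.2}   # a candidate theta; sums to 1

likelihood = 1.0
for side, count in n.items():
    likelihood *= q[side] ** count               # q_1^2 * q_2^3 * q_3^2 * q_4 * q_5 * q_6^2
print(likelihood)
```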
13
Side comment: Sufficient Statistics
To compute the probability of the data in the die example we only need to record the number of times n_i the die fell on side i (namely, n_1, n_2,…,n_6). We do not need to recall the entire sequence of outcomes.
{n_i | i = 1,…,6} is called a sufficient statistics for the multinomial sampling.
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\, q_2^{n_2}\, q_3^{n_3}\, q_4^{n_4}\, q_5^{n_5}\Big(1-\sum_{i=1}^{5} q_i\Big)^{n_6}$$
14
Sufficient Statistics
A sufficient statistics is a function of the data that summarizes the relevant information for the likelihood. Formally, s(Data) is a sufficient statistics if for any two datasets Data and Data′:
s(Data) = s(Data′) ⇒ P(Data|θ) = P(Data′|θ)
[Figure: many different datasets map to the same statistics.]
Exercise: Define a concise "sufficient statistics" for the HMM model when the sequences are fully known.
15
Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We will find the parameters by considering the corresponding log-likelihood function:
$$\log\big(\mathrm{prob}(Data\mid\theta)\big) = \log\Big[q_1^{n_1}\, q_2^{n_2}\, q_3^{n_3}\, q_4^{n_4}\, q_5^{n_5}\Big(1-\sum_{i=1}^{5} q_i\Big)^{n_6}\Big] = \sum_{i=1}^{5} n_i\log q_i + n_6\log\Big(1-\sum_{i=1}^{5} q_i\Big)$$
A necessary condition for a (local) maximum is:
$$\frac{\partial}{\partial q_j}\log\big(\mathrm{prob}(Data\mid\theta)\big) = \frac{n_j}{q_j} - \frac{n_6}{1-\sum_{i=1}^{5} q_i} = 0$$
16
Finding the Maximum
Rearranging terms (with $q_6 = 1-\sum_{i=1}^{5} q_i$):
$$\frac{n_j}{q_j} = \frac{n_6}{1-\sum_{i=1}^{5} q_i} = \frac{n_6}{q_6}$$
Divide the jth equation by the ith equation:
$$\frac{q_j}{q_i} = \frac{n_j}{n_i}, \qquad\text{i.e.}\qquad q_j = \frac{n_j}{n_i}\, q_i$$
Sum from j = 1 to 6:
$$1=\sum_{j=1}^{6} q_j = \sum_{j=1}^{6}\frac{n_j}{n_i}\, q_i = \frac{n}{n_i}\, q_i$$
The only local (and hence global) maximum is given by the relative frequencies:
$$q_i = \frac{n_i}{n}, \qquad i=1,\ldots,6$$
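In code, the ML estimate is just the vector of normalized counts. A quick sketch (the perturbation values are hypothetical) checking that moving any q_i away from n_i/n, while keeping the sum at 1, can only lower the likelihood:

```python
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]
n = Counter(data)
N = len(data)

q_mle = {side: n[side] / N for side in range(1, 7)}     # q_i = n_i / n

def likelihood(q):
    result = 1.0
    for side, count in n.items():
        result *= q[side] ** count
    return result

# Shift a little probability mass from side 2 to side 1 (the sum stays 1):
q_other = dict(q_mle)
q_other[1] += 0.05
q_other[2] -= 0.05

assert likelihood(q_other) < likelihood(q_mle)
print(q_mle, likelihood(q_mle))
```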
17
Generalization to any number k of outcomes
Let X be a random variable with k values x_1,…,x_k denoting the k outcomes of independent and identically distributed experiments, with parameters θ = {q_1,q_2,...,q_k} (q_i is the probability of x_i). Again, the data is one sequence of length n, in which x_i appears n_i times.
Then we have to maximize
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\, q_2^{n_2}\cdots q_k^{n_k}, \qquad (n_1+\ldots+n_k = n)$$
subject to: q_1+q_2+ ... + q_k = 1, i.e.,
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\cdots q_{k-1}^{n_{k-1}}\Big(1-\sum_{i=1}^{k-1} q_i\Big)^{n_k}$$
18
Generalization for k outcomes (cont)
By a treatment identical to the die case, the maximum is obtained when, for all i:
$$\frac{n_i}{q_i} = \frac{n_k}{q_k}$$
Hence the MLE is given by the relative frequencies:
$$q_i = \frac{n_i}{n}, \qquad i=1,\ldots,k$$
19
Fractional Exponents
Some models allow the n_i's to be fractions (e.g., if we are uncertain of a die outcome, we may count it as "6" with 20% confidence and "5" with 80%). We still have, for θ = (q_1,..,q_k):
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\, q_2^{n_2}\cdots q_k^{n_k}, \qquad (n_1+\ldots+n_k = n)$$
And the same analysis yields:
$$q_i = \frac{n_i}{n}, \qquad i=1,\ldots,k$$
20
Summary: for an experiment with k outcomes x_1,..,x_k, the ML problem is the following: Given a statistics (n_1, n_2,..,n_k) of positive real numbers (n_i is the number of observations of x_i), find parameters θ = {q_1,q_2,...,q_k} (q_i is the probability of x_i) which maximize the likelihood
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\, q_2^{n_2}\cdots q_k^{n_k}, \qquad (n_1+\ldots+n_k = n)$$
subject to: q_1+q_2+ ... + q_k = 1.
The unique θ = (q_1,..,q_k) which maximizes the likelihood is given by
$$q_i = \frac{n_i}{n}, \qquad i=1,\ldots,k$$
21
Frequency Vector of Statistics (for single random variable)
The frequency vector of a statistics (n_1, n_2,..,n_k), where n_1+...+n_k = n, is the statistics (p_1,..,p_k) obtained by letting p_i = n_i/n (i = 1,..,k).
Consider two statistics of a random variable with 4 outcomes: n = 125 with statistics (10,25,80,10), and n = 250 with statistics (20,50,160,20). Both statistics give the same frequency vector (0.08, 0.2, 0.64, 0.08), and hence are optimized by the same parameters.
We can treat a frequency vector (p1,..,pk) as any other statistics, and find the parameters which maximize its likelihood.
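The example above can be checked directly; a minimal sketch:

```python
def frequency_vector(counts):
    n = sum(counts)
    return [c / n for c in counts]

stat_a = [10, 25, 80, 10]    # n = 125
stat_b = [20, 50, 160, 20]   # n = 250

# Both statistics yield the frequency vector (0.08, 0.2, 0.64, 0.08),
# hence both are maximized by the same parameters.
assert frequency_vector(stat_a) == frequency_vector(stat_b)
print(frequency_vector(stat_a))
```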
22
ML for frequency vectors
For single-die experiments, the likelihood of any data which has frequency vector P is maximized by the same parameters which maximize the likelihood of the "statistics" P (prove!). The "ML problem for frequency vectors" is the same as the original problem, but we restrict the statistics to frequency vectors:
Given a frequency vector (p_1,..,p_k) of observed Data, find parameters θ = (q_1,q_2,..,q_k) which maximize the likelihood
$$\mathrm{prob}(Data\mid\theta)= q_1^{p_1}\cdot q_2^{p_2}\cdots q_k^{p_k}$$
23
ML for frequency vectors
Since a "frequency vector" can be viewed as a probability vector, our solution for the "single die" ML problem implies the following:
Let P = (p_1,..,p_k) be a probability k-vector. Then among all probability k-vectors Q = (q_1,..,q_k), the likelihood
$$\prod_{i=1}^{k} q_i^{p_i}$$
is maximized iff Q = P. The value of this maximum likelihood is
$$\prod_{i=1}^{k} p_i^{p_i}$$
24
Side Trip: Maximum Likelihood and Entropy
The logarithm of the maximum likelihood of the frequency vector (p_1,p_2,..,p_k) is given by
$$\log\big(p_1^{p_1}\, p_2^{p_2}\cdots p_k^{p_k}\big) = \sum_{i=1}^{k} p_i\log p_i.$$
This is a negative number, and its absolute value
$$-\sum_{i=1}^{k} p_i\log p_i$$
is known as the entropy of (p_1,..,p_k).
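A direct transcription in Python (natural logarithms; the frequency vector below is just an illustrative value):

```python
import math

def entropy(p):
    """Entropy of a probability vector p: -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.08, 0.2, 0.64, 0.08]
log_max_likelihood = sum(pi * math.log(pi) for pi in p)   # log of prod_i p_i^{p_i}, a negative number
assert math.isclose(entropy(p), -log_max_likelihood)      # the entropy is its absolute value
print(log_max_likelihood, entropy(p))
```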
25
ML and “Relative Entropy”
Thus, MLE for frequency vectors implies:
Let P = (p_1,..,p_k) be a probability vector. Then among all probability vectors Q = (q_1,..,q_k), the sum
$$\sum_{i=1}^{k} p_i\log q_i$$
attains its unique maximum when Q = P.
This is equivalent to the following: Let P = (p_1,..,p_k) be a probability vector. Then for all probability vectors Q = (q_1,..,q_k) it holds that
$$\sum_{i=1}^{k} p_i\log(p_i/q_i) \ \ge\ 0, \quad\text{with equality iff } P = Q.$$
26
Relative Entropy
For P = (p_1,..,p_k) and Q = (q_1,..,q_k) as above, the sum
$$D(P\|Q) = \sum_{i=1}^{k} p_i\log(p_i/q_i)$$
is called the relative entropy of P and Q, also known as the Kullback-Leibler distance. So the ML formula for frequency vectors implies that the relative entropy of P and Q is nonnegative, and it equals 0 iff P = Q.
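A short sketch of D(P||Q) (natural logarithms; the vectors are illustrative), checking the two properties just stated:

```python
import math

def relative_entropy(p, q):
    """D(P||Q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.08, 0.2, 0.64, 0.08]
Q = [0.25, 0.25, 0.25, 0.25]

assert relative_entropy(P, Q) > 0                    # nonnegative, here strictly positive since P != Q
assert math.isclose(relative_entropy(P, P), 0.0)     # equals 0 iff P == Q
print(relative_entropy(P, Q))
```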
27
Relative entropy of scoring functions
Recall that we have defined the scoring function of alignments via
$$\sigma(a,b) = \log\frac{P(a,b)}{Q(a)\,Q(b)} = \log\frac{P(a,b)}{Q(a,b)},$$
where P(a,b), Q(a,b) are the probabilities of (a,b) in the "Match" and "Random" models.
The relative entropy between P and Q is:
$$D(P\|Q) = \sum_{a,b} P(a,b)\log\frac{P(a,b)}{Q(a,b)} = \sum_{a,b} P(a,b)\,\sigma(a,b)$$
Large relative entropy means that P(a,b), the distribution of the "match" model, is significantly different from Q(a,b), the distribution of the "random" model. (This relative entropy is called the mutual information of P and Q, denoted I(P,Q); I(P,Q) = 0 iff P = Q.)
28
Back to ML of fully known training sets
Recall: Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to maximize:
$$\mathrm{prob}(x^1,\ldots,x^n\mid\theta)=\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)}$$
29
Rearranging terms we need to:
Maximize
$$\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)}$$
Subject to: for all states k,
$$\sum_{l} m_{kl}=1, \quad \sum_{b} e_k(b)=1, \quad m_{kl},\ e_k(b)\ge 0.$$
For this, we need to maximize the likelihoods of two dice for each state k: one for the transitions {m_kl | l = 1,..,m} and one for the emissions {e_k(b) | b ∈ Σ}.
30
Apply to HMM (cont.)
We apply the “dice likelihood” technique to get for each k the parameters {mkl|l=1,..,m} and {ek(b)|b∈Σ}:
$$m_{kl} = \frac{M_{kl}}{\sum_{l'} M_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')},$$
which gives the optimal ML parameters.
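Putting Case 1 together, here is a sketch that counts transitions and emissions over a fully labelled training set and then normalizes per state. The two-state model, the begin state "B" (absorbing the s_0 → s_1 transition), and the tiny training set are all hypothetical illustrations, not data from the lecture:

```python
from collections import defaultdict

# Fully known training set: each element is (state path s_1..s_L, emitted symbols x_1..x_L).
training_set = [
    (["F", "F", "L", "L"], ["H", "T", "H", "H"]),
    (["F", "L", "L", "F"], ["T", "H", "H", "T"]),
]

M = defaultdict(float)   # M[(k, l)] = #(transitions k -> l), including from the begin state "B"
E = defaultdict(float)   # E[(k, b)] = #(emissions of symbol b from state k)

for s, x in training_set:
    prev = "B"
    for k, b in zip(s, x):
        M[(prev, k)] += 1
        E[(k, b)] += 1
        prev = k

def normalize(counts):
    """Turn counts[(k, v)] into probabilities that sum to 1 over v for each k."""
    totals = defaultdict(float)
    for (k, _), c in counts.items():
        totals[k] += c
    return {(k, v): c / totals[k] for (k, v), c in counts.items()}

m = normalize(M)   # m_kl   = M_kl   / sum_l' M_kl'
e = normalize(E)   # e_k(b) = E_k(b) / sum_b' E_k(b')
print(m)
print(e)
```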
32
Adding pseudo counts in HMM
We may modify the actual counts using our prior knowledge/belief (e.g., when the sample set is too small): r_kl is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k.
Then:
$$m_{kl} = \frac{M_{kl}+r_{kl}}{\sum_{l'}\big(M_{kl'}+r_{kl'}\big)}, \qquad e_k(b) = \frac{E_k(b)+r_k(b)}{\sum_{b'}\big(E_k(b')+r_k(b')\big)}$$
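Pseudocounts are simply added to the raw counts before the same per-state normalization; the emission probabilities e_k(b) are smoothed in exactly the same way with r_k(b). A self-contained sketch with made-up counts and a uniform prior r_kl = 1:

```python
from collections import defaultdict

def normalize(counts):
    """Turn counts[(k, l)] into probabilities that sum to 1 over l for each k."""
    totals = defaultdict(float)
    for (k, _), c in counts.items():
        totals[k] += c
    return {(k, l): c / totals[k] for (k, l), c in counts.items()}

# Hypothetical raw transition counts for two states; "F -> L" was never observed.
M = {("F", "F"): 6.0, ("F", "L"): 0.0, ("L", "F"): 1.0, ("L", "L"): 3.0}
r = 1.0   # a uniform pseudocount r_kl = 1 for every pair (an illustrative prior)

m_raw = normalize(M)                                        # m_FL = 0: the transition is deemed impossible
m_smoothed = normalize({kl: c + r for kl, c in M.items()})  # m_kl = (M_kl + r_kl) / sum_l' (M_kl' + r_kl')
print(m_raw)
print(m_smoothed)
```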
33
Case 2: State paths are unknown
For a given θ we have: prob(x^1,..., x^n|θ) = prob(x^1|θ) ⋅⋅⋅ prob(x^n|θ)
(since the x^j are independent).
For each sequence x, prob(x|θ) = ∑_s prob(x,s|θ), the sum taken over all state paths s which emit x.
34
Thus, for the n sequences (x^1,..., x^n) we have:
$$\mathrm{prob}(x^1,\ldots,x^n\mid\theta)=\sum_{(s^1,\ldots,s^n)}\mathrm{prob}(x^1,\ldots,x^n,s^1,\ldots,s^n\mid\theta),$$
where the summation is taken over all tuples of n state paths (s^1,..., s^n) which generate (x^1,..., x^n). For simplicity, we will assume that n = 1.
35
So we need to maximize prob(x|θ) = ∑_s prob(x,s|θ), where the summation is over all state paths s which produce the output sequence x. Finding the θ* which maximizes ∑_s prob(x,s|θ) is hard. [Unlike finding the θ* which maximizes prob(x,s|θ) for a single fully known sequence (x,s).]
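To see where the difficulty comes from, here is a brute-force sketch of prob(x|θ) = ∑_s prob(x,s|θ) that enumerates every state path (the two-state model and the sequence are hypothetical). The number of paths grows as (#states)^L, and θ appears inside every term of the sum, which is why maximizing over θ is no longer a simple counting problem:

```python
from itertools import product

# Hypothetical two-state HMM; "B" acts as the begin state s_0.
m = {"B": {"F": 0.5, "L": 0.5},
     "F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {"H": 0.5, "T": 0.5},
     "L": {"H": 0.9, "T": 0.1}}
states = ["F", "L"]

def joint_prob(x, s):
    """prob(x, s | theta) = prod_i m_{s_{i-1} s_i} * e_{s_i}(x_i)."""
    p, prev = 1.0, "B"
    for k, b in zip(s, x):
        p *= m[prev][k] * e[k][b]
        prev = k
    return p

x = ["H", "H", "T", "H"]

# prob(x | theta): sum over all |states|^L state paths that could have emitted x.
prob_x = sum(joint_prob(x, s) for s in product(states, repeat=len(x)))
print(prob_x)
```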
36
ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ′ so that prob(x|θ′) > prob(x|θ).
3. Set θ = θ′.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
37
Baum-Welch training
We start with some values of m_kl and e_k(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* such that
prob(x|θ*) > prob(x|θ).
This is done by "imitating" the algorithm for Case 1, where all states are known:
38
Baum-Welch training
In Case 1 we computed the optimal values of m_kl and e_k(b) (for the optimal θ) by simply counting the number M_kl of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.
[Figure: a known transition s_{i-1} = k → s_i = l, with observed emissions x_{i-1} = b and x_i = c.]
39
Baum-Welch training
When the states are unknown, the counting process is replaced by an averaging process: for each edge s_{i-1} → s_i we compute the average number of "k to l" transitions, for all possible pairs (k,l), over this edge. Then, for each k and l, we take M_kl to be the sum over all edges.
[Figure: unknown states s_{i-1} = ?, s_i = ?, with observed emissions x_{i-1} = b and x_i = c.]
40
Baum-Welch training
Similarly, for each emission edge s_i → b and each state k, we compute the average number of times that s_i = k, which is the expected number of "k → b" emissions on this edge. Then we take E_k(b) to be the sum over all such edges. These expected values are computed assuming the current parameters θ:
[Figure: an emission edge from an unknown state s_i = ? to the observed symbol b.]
41
The expected values of Mkl and Ek(b)
Given the current distribution θ, Mkl and Ek(b) are defined by:
$$M_{kl}=\sum_{s} M^{s}_{kl}\,\mathrm{prob}(s\mid x,\theta),$$
where M^s_kl is the number of k to l transitions in the state sequence s.
$$E_k(b)=\sum_{s} E^{s}_{k}(b)\,\mathrm{prob}(s\mid x,\theta),$$
where E^s_k(b) is the number of times k emits b in the state sequence s with output x.
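A literal (exponential-time) sketch of these two formulas, weighting each path's counts by its posterior prob(s|x,θ) = prob(x,s|θ)/prob(x|θ). The model and sequence are hypothetical; in practice Baum-Welch computes the same expectations efficiently with the forward-backward algorithm rather than by enumerating paths:

```python
from collections import defaultdict
from itertools import product

# Hypothetical two-state HMM; "B" acts as the begin state s_0.
m = {"B": {"F": 0.5, "L": 0.5},
     "F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {"H": 0.5, "T": 0.5},
     "L": {"H": 0.9, "T": 0.1}}
states = ["F", "L"]

def joint_prob(x, s):
    """prob(x, s | theta)."""
    p, prev = 1.0, "B"
    for k, b in zip(s, x):
        p *= m[prev][k] * e[k][b]
        prev = k
    return p

x = ["H", "H", "T", "H"]
paths = list(product(states, repeat=len(x)))
prob_x = sum(joint_prob(x, s) for s in paths)

M = defaultdict(float)   # expected M_kl   = sum_s M^s_kl   * prob(s | x, theta)
E = defaultdict(float)   # expected E_k(b) = sum_s E^s_k(b) * prob(s | x, theta)

for s in paths:
    posterior = joint_prob(x, s) / prob_x    # prob(s | x, theta)
    prev = "B"
    for k, b in zip(s, x):
        M[(prev, k)] += posterior            # this path contributes one k -> l transition here
        E[(k, b)] += posterior               # and one emission of b from state k
        prev = k

print(dict(M))   # normalizing these as in Case 1 gives the next theta of the Baum-Welch iteration
print(dict(E))
```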