Introduction to Natural Language Processing
HMM: Hidden Markov Models
Jean-Cédric Chappelier
Artificial Intelligence Laboratory (LIA)
Objectives of this lecture
➥ Introduce fundamental concepts necessary to use HMMs for PoS tagging
Example: PoS tagging with HMM
Sentence to tag: Time flies like an arrow.
Example of HMM model:
❑ PoS tags: T = {Adj, Adv, Det, N, V, ...}
❑ Transition probabilities:
P(N|Adj) = 0.1, P(V|N) = 0.3, P(Adv|N) = 0.01, P(Adv|V) = 0.005,
P(Det|Adv) = 0.1, P(Det|V) = 0.3, P(N|Det) = 0.5
(plus all the others, such that the stochastic constraints are fulfilled)
❑ Initial probabilities: PI(Adj) = 0.01, PI(Adv) = 0.001, PI(Det) = 0.1, PI(N) = 0.2, PI(V) = 0.003 (+...)
❑ Words: L = {an, arrow, flies, like, time, ...}
❑ Emission probabilities: P(time|N) = 0.1, P(time|Adj) = 0.01, P(time|V) = 0.05, P(flies|N) = 0.1, P(flies|V) = 0.01, P(like|Adv) = 0.005, P(like|V) = 0.1, P(an|Det) = 0.3, P(arrow|N) = 0.5 (+...)
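To make the toy model concrete, here is a minimal sketch (not part of the original slides) of how these parameters could be written down as Python dictionaries; only the probabilities listed above are filled in, all the other entries are simply omitted.

```python
# Minimal sketch of the toy PoS-tagging HMM above (only the listed values).
P_init = {"Adj": 0.01, "Adv": 0.001, "Det": 0.1, "N": 0.2, "V": 0.003}

P_trans = {   # P_trans[t][t_next] = P(t_next | t)
    "Adj": {"N": 0.1},
    "N":   {"V": 0.3, "Adv": 0.01},
    "V":   {"Adv": 0.005, "Det": 0.3},
    "Adv": {"Det": 0.1},
    "Det": {"N": 0.5},
}

P_emit = {    # P_emit[t][w] = P(w | t)
    "N":   {"time": 0.1, "flies": 0.1, "arrow": 0.5},
    "Adj": {"time": 0.01},
    "V":   {"time": 0.05, "flies": 0.01, "like": 0.1},
    "Adv": {"like": 0.005},
    "Det": {"an": 0.3},
}
```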
In this example, 12 = 3 · 2 · 2 · 1 · 1 analyses are possible, for example:
P(time/N flies/V like/Adv an/Det arrow/N) = 1.13 · 10^-11
P(time/Adj flies/N like/V an/Det arrow/N) = 6.75 · 10^-10
Indeed:
P(time/N flies/V like/Adv an/Det arrow/N)
= PI(N) · P(time|N) · P(V|N) · P(flies|V) · P(Adv|V) · P(like|Adv) · P(Det|Adv) · P(an|Det) · P(N|Det) · P(arrow|N)
= 2e-1 · 1e-1 · 3e-1 · 1e-2 · 5e-3 · 5e-3 · 1e-1 · 3e-1 · 5e-1 · 5e-1
The aim is to choose the most probable tagging.
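As a sanity check, the two joint probabilities above can be reproduced with a few lines of Python, reusing the hypothetical dictionaries from the previous sketch:

```python
def joint_prob(tags, words):
    """P(t, w) = PI(t1) * P(w1|t1) * prod_{i>=2} P(ti|ti-1) * P(wi|ti)."""
    p = P_init[tags[0]] * P_emit[tags[0]][words[0]]
    for prev, tag, word in zip(tags, tags[1:], words[1:]):
        p *= P_trans[prev][tag] * P_emit[tag][word]
    return p

words = ["time", "flies", "like", "an", "arrow"]
print(joint_prob(["N", "V", "Adv", "Det", "N"], words))   # ~1.13e-11
print(joint_prob(["Adj", "N", "V", "Det", "N"], words))   # ~6.75e-10
```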
Contents
➥ HMM models, three basic problems
➥ Forward-Backward algorithms
➥ Viterbi algorithm
➥ Baum-Welch algorithm
Markov Models
Markov model: a discrete-time stochastic process T on T = {t(1), ..., t(m)} satisfying the Markov property (limited conditioning):
P(Ti | T1, ..., Ti−1) = P(Ti | Ti−k, ..., Ti−1)
k: order of the Markov model
In practice k = 1 (bigrams) or 2 (trigrams), rarely 3 or 4 (→ learning difficulties)
From a theoretical point of view, every Markov model of order k can be represented as another Markov model of order 1 (choose Yi = (Ti−k+1, ..., Ti)), as sketched below.
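As an illustration of this re-encoding, here is a small sketch (with made-up, uniform order-2 probabilities, purely for illustration) that turns an order-2 chain over tags into an order-1 chain over pairs of tags:

```python
from itertools import product

tags = ["N", "V"]
# Hypothetical order-2 transition probabilities P(c | a, b), here simply uniform.
p2 = {(a, b, c): 0.5 for a, b, c in product(tags, repeat=3)}

# Order-1 chain over the pair-states Y_i = (t_{i-1}, t_i):
# a transition (a, b) -> (b2, c) is only possible when b2 == b,
# and then has probability P(c | a, b).
pair_states = list(product(tags, repeat=2))
A1 = {(y, z): (p2[(y[0], y[1], z[1])] if z[0] == y[1] else 0.0)
      for y in pair_states for z in pair_states}
```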
Terminology:
P(T1, ..., Ti) = P(T1) · P(T2|T1) · ... · P(Ti|Ti−1)
where P(T1) is the initial probability and the factors P(Tj|Tj−1) are the transition probabilities
Hidden Markov Models (HMM)
What is hidden?
☞ The model itself (i.e. the state sequence)
What do we see then?
☞ An observation w related to the state (but not the state itself)
Formally, an HMM consists of:
❑ a set of states T = {t(1), ..., t(m)} (the PoS tags)
❑ a transition probability matrix A such that Att′ = P(Ti+1 = t′ | Ti = t), shortened to P(t′|t) (independent of i)
❑ an initial probability vector I such that It = P(T1 = t), shortened to PI(t)
✰ an alphabet L (not necessarily finite): the words
✰ m probability densities on L (emission probabilities): Bt(w) = P(Wi = w | Ti = t) (for w ∈ L), shortened to P(w|t).
Simple example of HMM
Example: a cheater tossing two hidden (unfair) coins
[Figure: two-state HMM. States Coin 1 and Coin 2; transition probabilities 0.4 (1→1), 0.6 (1→2), 0.9 (2→1), 0.1 (2→2); emission probabilities P(H|Coin 1) = 0.49, P(T|Coin 1) = 0.51, P(H|Coin 2) = 0.85, P(T|Coin 2) = 0.15; initial probabilities 0.5 and 0.5.]
States: coin 1 and coin 2: T = {1, 2}
Transition matrix:
A = | 0.4  0.6 |
    | 0.9  0.1 |
Alphabet: {H, T}
Emission probabilities: B1 = (0.49, 0.51) and B2 = (0.85, 0.15)
Initial probabilities: I = (0.5, 0.5)
☞ 5 free parameters: I1, A11, A21, B1(H), B2(H)
Observation: HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH
[ (Hidden) sequence of states: 211211211121112112111211112121121211112 ]
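Below is a small Python sketch of this two-coin HMM (same numbers as above), together with a sampler that draws an observation sequence and the hidden state sequence that produced it. The representation (parameters as lists indexed by state, emission tables as dicts) is an assumption of these notes and is reused in the later sketches.

```python
import random

states = [0, 1]                      # 0 = Coin 1, 1 = Coin 2
I = [0.5, 0.5]                       # initial probabilities
A = [[0.4, 0.6],                     # A[t][t2] = P(t2 | t)
     [0.9, 0.1]]
B = [{"H": 0.49, "T": 0.51},         # emission probabilities of Coin 1
     {"H": 0.85, "T": 0.15}]         # emission probabilities of Coin 2

def sample(n, rng=random.Random(0)):
    """Draw (observations, hidden states) of length n from the HMM."""
    t = rng.choices(states, weights=I)[0]
    obs, hidden = [], []
    for _ in range(n):
        hidden.append(t)
        obs.append(rng.choices(["H", "T"], weights=[B[t]["H"], B[t]["T"]])[0])
        t = rng.choices(states, weights=A[t])[0]
    return "".join(obs), hidden

obs, hidden = sample(39)
```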
The three basic problems for HMMs
Problems: Given an HMM and an observation sequence w = w1...wn:
➮ given the parameters θ of the HMM, what is the probability of the observation sequence: P(w|θ)?
Application: Language Identification
➮ given the parameters θ of the HMM, find the most likely state sequence t that produces w: Argmax_t P(t|w, θ)
Application: PoS Tagging, Speech recognition
➮ find the parameters that maximize the probability of producing w: Argmax_θ P(θ|w)
Application: Unsupervised learning
Remarks:
➊ θ = (I, A, B)
= (I1, ..., Im, B1(w1), B1(w2), ..., B1(wL), B2(w1), ..., B2(wL), ..., Bm(w1), ..., Bm(wL), A11, ..., A1m, ..., Am1, ..., Amm)
i.e. (m − 1) + m · (L − 1) + m · (m − 1) = m · (m + L − 1) − 1 free parameters (because of the sum-to-1 constraints), where m = |T| and L = |L| (in the finite case; otherwise L stands for the total number of parameters used to represent L)
➋ Supervised learning (i.e. Argmax_θ P(θ|w, t)) is easy
➌ WARNING! There is a difference between P(θ|w) and P(M|w)!
The model M is assumed to be known here, but not its parameters θ: the HMM design (number of states, alphabet) is already fixed; only the parameters are missing.
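A one-line check of this parameter count (the tagger sizes in the example call are made up for illustration):

```python
def n_free_params(m, L):
    """(m-1) initial + m(m-1) transition + m(L-1) emission = m(m+L-1) - 1."""
    return (m - 1) + m * (m - 1) + m * (L - 1)

assert n_free_params(2, 2) == 5       # the two-coin example: 5 free parameters
print(n_free_params(m=50, L=10000))   # a hypothetical tagger: about 0.5M parameters
```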
Contents
➥ HMM models, three basic problems
☞ Forward-Backward algorithms
➥ Viterbi algorithm
➥ Baum-Welch algorithm
Computation of P(W|θ)
Computation of P(W|θ) is mathematically trivial:
P(W|θ) = ∑_t P(W, t|θ) = ∑_t P(W|t, θ) · P(t|θ)
Practical limitation: the complexity is O(n · m^n), i.e. exponential!
Practical computation: forward/backward algorithms → complexity O(n · m²)
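The brute-force sum over all state sequences can be written directly (a sketch reusing the list/dict representation of the coin example); it is usable only for tiny n, which is exactly why the forward/backward recursions matter:

```python
from itertools import product

def p_obs_bruteforce(w, I, A, B, states):
    """P(w|theta): sum of P(w, t|theta) over all m^n state sequences, O(n * m^n)."""
    total = 0.0
    for t in product(states, repeat=len(w)):
        p = I[t[0]] * B[t[0]][w[0]]
        for i in range(1, len(w)):
            p *= A[t[i - 1]][t[i]] * B[t[i]][w[i]]
        total += p
    return total

# e.g. p_obs_bruteforce("HTTH", I, A, B, states) with the two-coin model
```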
Forward-Backward algorithms
"Forward" variable: αi(t) = P(w1, ..., wi, Ti = t | θ), t ∈ T
Iterative computation: αi+1(t′) = Bt′(wi+1) · ∑_{t∈T} ( αi(t) · Att′ )
with α1(t) = Bt(w1) · It
"Backward" variable: βi(t) = P(wi+1, ..., wn | Ti = t, θ)
βi−1(t′) = ∑_{t∈T} ( βi(t) · At′t · Bt(wi) )
with βn(t) = 1 (by convention, practical considerations)
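A direct transcription of these two recursions (a sketch in the same representation as the coin example; no rescaling is done, so very long sequences would underflow in practice):

```python
def forward(w, I, A, B, states):
    """alpha[i][t] = P(w_1..w_{i+1}, T_{i+1} = t | theta), 0-indexed positions."""
    alpha = [[I[t] * B[t][w[0]] for t in states]]
    for i in range(1, len(w)):
        alpha.append([B[t2][w[i]] * sum(alpha[-1][t] * A[t][t2] for t in states)
                      for t2 in states])
    return alpha

def backward(w, A, B, states):
    """beta[i][t] = P(w_{i+2}..w_n | T_{i+1} = t, theta), with beta[n-1][t] = 1."""
    beta = [[1.0] * len(states)]
    for i in range(len(w) - 1, 0, -1):
        beta.insert(0, [sum(beta[0][t2] * A[t][t2] * B[t2][w[i]] for t2 in states)
                        for t in states])
    return beta
```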
Forward-Backward algorithms (2)
"Forward-backward" variable: γi(t) = P(Ti = t | w, θ)
γi(t) = P(w, Ti = t | θ) / P(w|θ) = αi(t) · βi(t) / ∑_{t′∈T} αi(t′) · βi(t′)
Computation in O(n · m²) → efficient solutions to the "first problem":
P(w|θ) = ∑_{t∈T} P(w, Tn = t | θ) = ∑_{t∈T} αn(t)
P(w|θ) = ∑_{t∈T} αi(t) · βi(t)   ∀i: 1 ≤ i ≤ n
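The gamma table and both expressions for P(w|θ) then follow directly from the alpha/beta tables (a sketch continuing the previous one, using `obs` from the coin example):

```python
def gamma(alpha, beta):
    """gamma[i][t] = P(T_{i+1} = t | w, theta) = alpha_i(t) beta_i(t) / sum_t2 alpha_i(t2) beta_i(t2)."""
    g = []
    for a_i, b_i in zip(alpha, beta):
        z = sum(a * b for a, b in zip(a_i, b_i))   # equals P(w|theta) at every position
        g.append([a * b / z for a, b in zip(a_i, b_i)])
    return g

alpha, beta = forward(obs, I, A, B, states), backward(obs, A, B, states)
# P(w|theta) read off the last alpha column matches the alpha*beta sum at any position:
assert abs(sum(alpha[-1]) - sum(a * b for a, b in zip(alpha[0], beta[0]))) < 1e-12
```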
Contents
➥ HMM models, three basic problems
➥ Forward-Backward algorithms
☞ Viterbi algorithm
➥ Baum-Welch algorithm
Viterbi algorithm (1)
Efficient solution to the "second problem": find the most likely sequence of states t (knowing w and the parameters θ): Argmax_t P(t|w, θ)
⇒ maximize (in t) P(t, w|θ).
"The" lattice ☞ temporal unfolding of all possible walks through the Markov chain
[Figure: the lattice. The m states (1, 2, ..., m) are stacked vertically and repeated for each time step w1, w2, ..., wn; edges between consecutive columns represent the possible transitions.]
Viterbi algorithm (2)
Let ρi(t) = max_{t1,...,ti−1} P(t1, ..., ti−1, ti = t, w1, ..., wi | θ)
We are looking for max_{t∈T} ρn(t)
It's easy (exercise) to show that ρi(t) = max_{t′} [ P(t|t′, θ) · P(wi|t, θ) · ρi−1(t′) ]
from which the following algorithm comes:
Viterbi algorithm (3)
for all t ∈ T do
    ρ1(t) = It · Bt(w1)
for i from 2 to n do
    for all t ∈ T do
        • ρi(t) = Bt(wi) · max_{t′} ( At′t · ρi−1(t′) )
        • mark one of the transitions from t′ to t where the maximum is reached
reconstruct backwards (from tn) the best path following the marked transitions
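A compact Python version of this algorithm (a sketch; it assumes, as in the coin example, that the states are numbered 0..m-1 so that they can index the tables directly):

```python
def viterbi(w, I, A, B, states):
    """Return Argmax_t P(t, w | theta) using the lattice recursion and back-pointers."""
    rho = [I[t] * B[t][w[0]] for t in states]              # rho_1(t)
    back = []                                              # marked transitions
    for i in range(1, len(w)):
        prev, rho, ptr = rho, [], []
        for t in states:
            best = max(states, key=lambda tp: A[tp][t] * prev[tp])
            rho.append(B[t][w[i]] * A[best][t] * prev[best])
            ptr.append(best)
        back.append(ptr)
    path = [max(states, key=lambda t: rho[t])]             # best final state
    for ptr in reversed(back):                             # follow the marks backwards
        path.insert(0, ptr[path[0]])
    return path

# e.g. viterbi(obs, I, A, B, states) recovers a most likely coin sequence
```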
Contents
➥ HMM models, three basic problems
➥ Forward-Backward algorithms
➥ Viterbi algorithm
☞ Baum-Welch algorithm
Expectation-Maximization
Our goal: maximize P(θ|w)
☞ Maximum-likelihood estimation of θ → maximization of P(w|θ)
To achieve it: Expectation-Maximization (EM) algorithm
General formulation of EM:
given
• observed data w = w1...wn
• a parameterized probability distribution P (T,W|θ) where
– T = T1...Tn are unobserved data
– θ are the parameters of the model
determine θ that maximizes P(w|θ) by convergence of the iterative computation of the series θ(i) that maximizes (in θ) E_T[ log P(T, W|θ) | w, θ(i−1) ]
Expectation-Maximization (2)
To do so, define the auxiliary function
Q(θ, θ′) = E_T[ log P(T, W|θ) | w, θ′ ] = ∑_t P(t|w, θ′) · log P(t, w|θ)
as it can be shown (with Jensen's inequality) that
Q(θ, θ′) > Q(θ′, θ′) ⇒ P(w|θ) > P(w|θ′)
This is the fundamental principle of EM: if we already have an estimate θ′ of the parameters and we find another parameter configuration θ for which the first inequality (on Q) holds, then w is more probable under the parameters θ than under θ′.
Expectation-Maximization (3)
EM algorithm:
• Expectation step: compute Q(θ, θ(i))
• Maximization step: compute θ(i+1) = Argmax_θ Q(θ, θ(i))
In other words:
1. Choose θ(0) (and set i = 0)
2. Find θ(i+1) which maximizes ∑_t P(t|w, θ(i)) · log P(t, w|θ(i+1))
3. Set i ← i + 1 and go back to (2) unless some convergence test is fulfilled
Baum-Welch Algorithm
The Baum-Welch algorithm is an EM algorithm for estimating HMM parameters. It is an answer to the "third problem".
The goal is therefore to find
Argmax_θ ∑_t P(t|w, θ′) · log P(t, w|θ) = Argmax_θ ∑_t P(t, w|θ′) · log P(t, w|θ)
(the second sum is denoted Q(θ, θ′) by definition), since P(w|θ′) does not depend on θ.
What is log P(t, w|θ)?
log P(t, w|θ) = log PI(t1) + ∑_{i=2}^{n} log P(ti|ti−1) + ∑_{i=1}^{n} log P(wi|ti)
Q(θ, θ′) therefore consists of 3 terms:
Q((I, A, B), θ′) = QI(I, θ′) + QA(A, θ′) + QB(B, θ′)
Let's compute one of them:
QI(I, θ′) = ∑_t P(t, w|θ′) · log PI(t1)
 = ∑_{t1} ∑_{t2,...,tn} P(t1, w|θ′) · P(t2, ..., tn | t1, w, θ′) · log PI(t1)
 = ∑_{t∈T} P(T1 = t, w|θ′) · log PI(t) · ∑_{t2,...,tn} P(t2, ..., tn | T1 = t, w, θ′)   [this last sum equals 1]
 = ∑_{t∈T} P(T1 = t, w|θ′) · log It
Similarly we have:
QA(A, θ′) = ∑_{i=2}^{n} ∑_{t∈T} ∑_{t′∈T} P(Ti−1 = t, Ti = t′, w|θ′) · log Att′
QB(B, θ′) = ∑_{i=1}^{n} ∑_{t∈T} P(Ti = t, w|θ′) · log Bt(wi)
Therefore Q is a sum of three independent terms (e.g. QI does not depend on A nor on B), and so the maximization over θ can be carried out on the three terms separately, i.e. maximizing QI(I, θ′) over I, QA(A, θ′) over A and QB(B, θ′) over B separately.
Notice that all these three functions are sums (over i) of functions of the form:
f(x) = ∑_{j=1}^{m} yj · log xj
and all the above three functions have to be maximized under the constraint ∑_{j=1}^{m} xj = 1. (*)
This maximization under constraints is achieved using Lagrange multipliers, i.e. looking at
g(x) = f(x) − λ · ∑_{j=1}^{m} xj = ∑_{j=1}^{m} ( yj · log xj − λ · xj )
Solving ∂g/∂xj = 0, we find that λ = yj / xj.
Putting this back into the constraint we find:
xj = yj / ∑_{k=1}^{m} yk
(*) To be accurate: for Bt the constraint is ∑_{w∈L} Bt(w) = 1. This changes the formulas a bit, but not the essence of the computation.
Summarizing the obtained results, we have the following reestimation formulas (where the maximum is reached):
It = P(T1 = t, w|θ′) / ∑_{t′∈T} P(T1 = t′, w|θ′) = P(T1 = t, w|θ′) / P(w|θ′)
Att′ = ∑_{i=2}^{n} P(Ti−1 = t, Ti = t′, w|θ′) / ∑_{i=2}^{n} ∑_{τ∈T} P(Ti−1 = t, Ti = τ, w|θ′)
     = ∑_{i=2}^{n} P(Ti−1 = t, Ti = t′, w|θ′) / ∑_{i=2}^{n} P(Ti−1 = t, w|θ′)
and:
Bt(w) = ∑_{i=1, wi=w}^{n} P(Ti = t, w|θ′) / ∑_{i=1}^{n} P(Ti = t, w|θ′)
      = ∑_{i=1}^{n} P(Ti = t, w|θ′) · δ_{wi,w} / ∑_{i=1}^{n} P(Ti = t, w|θ′)
with δ_{w,w′} = 1 if w = w′ and 0 otherwise.
Baum-Welch Algorithm: effective computation
How do we compute these reestimation formulas?
Let χi(t, t′) = P(Ti = t, Ti+1 = t′ | w, θ′)
χi is easy to compute from the "forward" and "backward" variables:
χi(t, t′) = αi(t) · Att′ · Bt′(wi+1) · βi+1(t′) / ∑_{τ∈T} ∑_{τ′∈T} αi(τ) · Aττ′ · Bτ′(wi+1) · βi+1(τ′)
Notice: γi(t) = ∑_{t′∈T} χi(t, t′) for all 1 ≤ i < n
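In code (a sketch continuing the earlier forward/backward one), chi can be computed position by position from the alpha and beta tables:

```python
def chi(alpha, beta, A, B, w, states):
    """chi[i][t][t2] = P(T_{i+1} = t, T_{i+2} = t2 | w, theta), for i = 0 .. n-2."""
    out = []
    for i in range(len(w) - 1):
        num = [[alpha[i][t] * A[t][t2] * B[t2][w[i + 1]] * beta[i + 1][t2]
                for t2 in states] for t in states]
        z = sum(sum(row) for row in num)
        out.append([[v / z for v in row] for row in num])
    return out

# sanity check: sum over t2 of chi[i][t][t2] equals gamma[i][t] for every i < n-1
```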
Effective Reestimation formulas
It = γ1(t)
Att′ = ∑_{i=1}^{n−1} χi(t, t′) / ∑_{i=1}^{n−1} γi(t)
Bt(w) = ∑_{i=1, wi=w}^{n} γi(t) / ∑_{i=1}^{n} γi(t) = ∑_{i=1}^{n} γi(t) · δ_{wi,w} / ∑_{i=1}^{n} γi(t)
with δ_{w,w′} = 1 if w = w′ and 0 otherwise.
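These three formulas translate directly into an M-step (a sketch continuing the earlier ones; `vocab`, the list of observable symbols, is an added argument of these notes):

```python
def reestimate(w, g, x, states, vocab):
    """One M-step: new (I, A, B) from the expected counts gamma (g) and chi (x)."""
    I_new = g[0][:]                                          # I_t = gamma_1(t)
    A_new = [[sum(xi[t][t2] for xi in x) / sum(gi[t] for gi in g[:-1])
              for t2 in states] for t in states]
    B_new = [{v: sum(gi[t] for gi, wi in zip(g, w) if wi == v) / sum(gi[t] for gi in g)
              for v in vocab} for t in states]
    return I_new, A_new, B_new
```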
Baum-Welch Algorithm
1. Let θ(0) be an initial parameter set
2. Compute iteratively α, β and then γ and χ
3. Compute θ(t+1) with the reestimation formulas
4. If θ(t+1) ≠ θ(t), go back to (2) [or use another, weaker stopping test]
WARNING!
The algorithm converges, but only towards a local maximum of E[log P(T, W|θ)]
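Putting the pieces together gives the usual iteration (a sketch built on the previous functions; it uses a fixed number of iterations instead of a convergence test, and a real implementation would rescale alpha/beta or work in log space to avoid underflow):

```python
def baum_welch(w, I, A, B, states, vocab, n_iter=20):
    """Alternate E-step (alpha, beta, gamma, chi) and M-step (reestimation formulas)."""
    for _ in range(n_iter):
        alpha = forward(w, I, A, B, states)
        beta = backward(w, A, B, states)
        g = gamma(alpha, beta)
        x = chi(alpha, beta, A, B, w, states)
        I, A, B = reestimate(w, g, x, states, vocab)
    return I, A, B

# e.g. refitting the two-coin model from the observed H/T sequence:
# I_hat, A_hat, B_hat = baum_welch(obs, I, A, B, states, ["H", "T"])
```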
CRF versus HMM
(Linear-chain) Conditional Random Fields (CRFs) are a discriminative generalization of HMMs in which the "features" no longer need to be state-conditional probabilities (they are less constrained).
For instance (order 1):
HMM:
P(t, w) = P(t1) · P(w1|t1) · ∏_{i=2}^{n} P(wi|ti) · P(ti|ti−1)
[Figure: directed chain t1 → t2 → ... → tn, each state ti emitting its word wi.]
CRF:
P(t|w) = ∏_{i=2}^{n} P(ti−1, ti | w)
(with P(ti−1, ti | w) ∝ exp( ∑_j λj · fj(ti−1, ti, w, i) ))
[Figure: undirected chain t1 - t2 - ... - tn, where each ti is connected to the whole observation w1 ... wn.]
Keypoints
➟ HMM definitions and their applications
➟ Three basic problems for HMMs
➟ Algorithms needed to solve these problems: Forward-Backward, Viterbi, Baum-Welch
(be aware of their existence, but not the implementation details)
References
[1] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, 1989.
[2] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[3] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 1977.
[4] H. Bourlard et al., Traitement de la parole, 2000; pp. 179-200, 202-214, 232-260.
ADDENDUM
Justification of the maximization of the auxiliary function Q for finding the θ that maximizes P(w|θ):
log P(w|θ) − log P(w|θ′) = log [ P(w|θ) / P(w|θ′) ] = log ∑_t [ P(w, t|θ) / P(w|θ′) ]
 = log ∑_t P(t|w, θ′) · [ P(w, t|θ) / P(w, t|θ′) ]
 ≥ ∑_t P(t|w, θ′) · log [ P(w, t|θ) / P(w, t|θ′) ]   (Jensen's inequality)
 = E_T[ log P(T, W|θ) | w, θ′ ] − E_T[ log P(T, W|θ′) | w, θ′ ]
 = Q(θ, θ′) − Q(θ′, θ′)
Therefore:
Q(θ, θ′) > Q(θ′, θ′) ⇒ log P(w|θ) > log P(w|θ′) ⇒ P(w|θ) > P(w|θ′)