Introduction to Natural Language Processing
HMM: Hidden Markov Models
Jean-Cédric Chappelier
Artificial Intelligence Laboratory (LIA)
Objectives of this lecture
➥ Introduce fundamental concepts necessary to use HMMs for PoS tagging
Example: PoS tagging with HMM
Sentence to tag: Time flies like an arrow.
Example of HMM model:
❑ PoS tags: T = {Adj, Adv, Det, N, V, ...}
❑ Transition probabilities:
P(N|Adj) = 0.1, P(V|N) = 0.3, P(Adv|N) = 0.01, P(Adv|V) = 0.005,
P(Det|Adv) = 0.1, P(Det|V) = 0.3, P(N|Det) = 0.5
(plus all the others, such that the stochastic constraints are fulfilled)
❑ Initial probabilities: PI(Adj) = 0.01, PI(Adv) = 0.001, PI(Det) = 0.1, PI(N) = 0.2, PI(V) = 0.003 (+...)
❑ Words: L = {an, arrow, flies, like, time, ...}
❑ Emission probabilities: P(time|N) = 0.1, P(time|Adj) = 0.01, P(time|V) = 0.05, P(flies|N) = 0.1, P(flies|V) = 0.01, P(like|Adv) = 0.005, P(like|V) = 0.1, P(an|Det) = 0.3, P(arrow|N) = 0.5 (+...)
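To make the toy model concrete, here is a minimal sketch (not part of the original slides) of how these parameters could be written down as Python dictionaries; only the probabilities listed above are filled in, all the other entries are simply omitted.

```python
# Minimal sketch of the toy PoS-tagging HMM above (only the listed values).
P_init = {"Adj": 0.01, "Adv": 0.001, "Det": 0.1, "N": 0.2, "V": 0.003}

P_trans = {   # P_trans[t][t_next] = P(t_next | t)
    "Adj": {"N": 0.1},
    "N":   {"V": 0.3, "Adv": 0.01},
    "V":   {"Adv": 0.005, "Det": 0.3},
    "Adv": {"Det": 0.1},
    "Det": {"N": 0.5},
}

P_emit = {    # P_emit[t][w] = P(w | t)
    "N":   {"time": 0.1, "flies": 0.1, "arrow": 0.5},
    "Adj": {"time": 0.01},
    "V":   {"time": 0.05, "flies": 0.01, "like": 0.1},
    "Adv": {"like": 0.005},
    "Det": {"an": 0.3},
}
```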
In this example, 12 = 3 · 2 · 2 · 1 · 1 analyses are possible, for example:
P(time/N flies/V like/Adv an/Det arrow/N) = 1.13 · 10^-11
P(time/Adj flies/N like/V an/Det arrow/N) = 6.75 · 10^-10
Indeed:
P(time/N flies/V like/Adv an/Det arrow/N)
= PI(N) · P(time|N) · P(V|N) · P(flies|V) · P(Adv|V) · P(like|Adv) · P(Det|Adv) · P(an|Det) · P(N|Det) · P(arrow|N)
= 2e-1 · 1e-1 · 3e-1 · 1e-2 · 5e-3 · 5e-3 · 1e-1 · 3e-1 · 5e-1 · 5e-1
The aim is to choose the most probable tagging.
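As a sanity check, the two joint probabilities above can be reproduced with a few lines of Python, reusing the hypothetical dictionaries from the previous sketch:

```python
def joint_prob(tags, words):
    """P(t, w) = PI(t1) * P(w1|t1) * prod_{i>=2} P(ti|ti-1) * P(wi|ti)."""
    p = P_init[tags[0]] * P_emit[tags[0]][words[0]]
    for prev, tag, word in zip(tags, tags[1:], words[1:]):
        p *= P_trans[prev][tag] * P_emit[tag][word]
    return p

words = ["time", "flies", "like", "an", "arrow"]
print(joint_prob(["N", "V", "Adv", "Det", "N"], words))   # ~1.13e-11
print(joint_prob(["Adj", "N", "V", "Det", "N"], words))   # ~6.75e-10
```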
Contents
➥ HMM models, three basic problems
➥ Forward-Backward algorithms
➥ Viterbi algorithm
➥ Baum-Welch algorithm
Markov Models
Markov model: a discrete-time stochastic process T on T = {t(1), ..., t(m)} satisfying the Markov property (limited conditioning):
P(Ti | T1, ..., Ti−1) = P(Ti | Ti−k, ..., Ti−1)
k: order of the Markov model
In practice k = 1 (bigrams) or 2 (trigrams), rarely 3 or 4 (→ learning difficulties)
From a theoretical point of view, every Markov model of order k can be represented as another Markov model of order 1 (choose Yi = (Ti−k+1, ..., Ti)), as sketched below.
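As an illustration of this re-encoding, here is a small sketch (with made-up, uniform order-2 probabilities, purely for illustration) that turns an order-2 chain over tags into an order-1 chain over pairs of tags:

```python
from itertools import product

tags = ["N", "V"]
# Hypothetical order-2 transition probabilities P(c | a, b), here simply uniform.
p2 = {(a, b, c): 0.5 for a, b, c in product(tags, repeat=3)}

# Order-1 chain over the pair-states Y_i = (t_{i-1}, t_i):
# a transition (a, b) -> (b2, c) is only possible when b2 == b,
# and then has probability P(c | a, b).
pair_states = list(product(tags, repeat=2))
A1 = {(y, z): (p2[(y[0], y[1], z[1])] if z[0] == y[1] else 0.0)
      for y in pair_states for z in pair_states}
```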
Terminology:
P(T1, ..., Ti) = P(T1) · P(T2|T1) · ... · P(Ti|Ti−1)
where P(T1) is the initial probability and the factors P(Tj|Tj−1) are the transition probabilities
Hidden Markov Models (HMM)
What is hidden?
☞ The model itself (i.e. the state sequence)
What do we see then?
☞ An observation w related to the state (but not the state itself)
Formally, an HMM consists of:
❑ a set of states T = {t(1), ..., t(m)} (the PoS tags)
❑ a transition probability matrix A such that Att′ = P(Ti+1 = t′ | Ti = t), shortened to P(t′|t) (independent of i)
❑ an initial probability vector I such that It = P(T1 = t), shortened to PI(t)
✰ an alphabet L (not necessarily finite): the words
✰ m probability densities on L (emission probabilities): Bt(w) = P(Wi = w | Ti = t) (for w ∈ L), shortened to P(w|t).
Simple example of HMM
Example: a cheater tossing two hidden (unfair) coins
[Figure: two-state HMM. States Coin 1 and Coin 2; transition probabilities 0.4 (1→1), 0.6 (1→2), 0.9 (2→1), 0.1 (2→2); emission probabilities P(H|Coin 1) = 0.49, P(T|Coin 1) = 0.51, P(H|Coin 2) = 0.85, P(T|Coin 2) = 0.15; initial probabilities 0.5 and 0.5.]
States: coin 1 and coin 2: T = {1, 2}
Transition matrix:
A = | 0.4  0.6 |
    | 0.9  0.1 |
Alphabet: {H, T}
Emission probabilities: B1 = (0.49, 0.51) and B2 = (0.85, 0.15)
Initial probabilities: I = (0.5, 0.5)
☞ 5 free parameters: I1, A11, A21, B1(H), B2(H)
Observation: HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH
[ (Hidden) sequence of states: 211211211121112112111211112121121211112 ]
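Below is a small Python sketch of this two-coin HMM (same numbers as above), together with a sampler that draws an observation sequence and the hidden state sequence that produced it. The representation (parameters as lists indexed by state, emission tables as dicts) is an assumption of these notes and is reused in the later sketches.

```python
import random

states = [0, 1]                      # 0 = Coin 1, 1 = Coin 2
I = [0.5, 0.5]                       # initial probabilities
A = [[0.4, 0.6],                     # A[t][t2] = P(t2 | t)
     [0.9, 0.1]]
B = [{"H": 0.49, "T": 0.51},         # emission probabilities of Coin 1
     {"H": 0.85, "T": 0.15}]         # emission probabilities of Coin 2

def sample(n, rng=random.Random(0)):
    """Draw (observations, hidden states) of length n from the HMM."""
    t = rng.choices(states, weights=I)[0]
    obs, hidden = [], []
    for _ in range(n):
        hidden.append(t)
        obs.append(rng.choices(["H", "T"], weights=[B[t]["H"], B[t]["T"]])[0])
        t = rng.choices(states, weights=A[t])[0]
    return "".join(obs), hidden

obs, hidden = sample(39)
```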
The three basic problems for HMMs
Problems: Given an HMM and an observation sequence w = w1...wn:
➮ given the parameters θ of the HMM, what is the probability of the observation sequence: P(w|θ)?
Application: Language Identification
➮ given the parameters θ of the HMM, find the most likely state sequence t that produces w: Argmax_t P(t|w, θ)
Application: PoS Tagging, Speech recognition
➮ find the parameters that maximize the probability of producing w: Argmax_θ P(θ|w)
Application: Unsupervised learning
Remarks:
➊ θ = (I, A, B)
= (I1, ..., Im, B1(w1), B1(w2), ..., B1(wL), B2(w1), ..., B2(wL), ..., Bm(w1), ..., Bm(wL), A11, ..., A1m, ..., Am1, ..., Amm)
i.e. (m − 1) + m · (L − 1) + m · (m − 1) = m · (m + L − 1) − 1 free parameters (because of the sum-to-1 constraints), where m = |T| and L = |L| (in the finite case; otherwise L stands for the total number of parameters used to represent L)
➋ Supervised learning (i.e. Argmax_θ P(θ|w, t)) is easy
➌ WARNING! There is a difference between P(θ|w) and P(M|w)!
The model M is assumed to be known here, but not its parameters θ: the HMM design (number of states, alphabet) is already fixed; only the parameters are missing.
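A one-line check of this parameter count (the tagger sizes in the example call are made up for illustration):

```python
def n_free_params(m, L):
    """(m-1) initial + m(m-1) transition + m(L-1) emission = m(m+L-1) - 1."""
    return (m - 1) + m * (m - 1) + m * (L - 1)

assert n_free_params(2, 2) == 5       # the two-coin example: 5 free parameters
print(n_free_params(m=50, L=10000))   # a hypothetical tagger: about 0.5M parameters
```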
Contents
➥ HMM models, three basic problems
☞ Forward-Backward algorithms
➥ Viterbi algorithm
➥ Baum-Welch algorithm
Computation of P(W|θ)
Computation of P(W|θ) is mathematically trivial:
P(W|θ) = ∑_t P(W, t|θ) = ∑_t P(W|t, θ) · P(t|θ)
Practical limitation: the complexity is O(n · m^n), i.e. exponential!
Practical computation: forward/backward algorithms → complexity O(n · m²)
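The brute-force sum over all state sequences can be written directly (a sketch reusing the list/dict representation of the coin example); it is usable only for tiny n, which is exactly why the forward/backward recursions matter:

```python
from itertools import product

def p_obs_bruteforce(w, I, A, B, states):
    """P(w|theta): sum of P(w, t|theta) over all m^n state sequences, O(n * m^n)."""
    total = 0.0
    for t in product(states, repeat=len(w)):
        p = I[t[0]] * B[t[0]][w[0]]
        for i in range(1, len(w)):
            p *= A[t[i - 1]][t[i]] * B[t[i]][w[i]]
        total += p
    return total

# e.g. p_obs_bruteforce("HTTH", I, A, B, states) with the two-coin model
```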
Forward-Backward algorithms
"Forward" variable: αi(t) = P(w1, ..., wi, Ti = t | θ), t ∈ T
Iterative computation: αi+1(t′) = Bt′(wi+1) · ∑_{t∈T} ( αi(t) · Att′ )
with α1(t) = Bt(w1) · It
"Backward" variable: βi(t) = P(wi+1, ..., wn | Ti = t, θ)
βi−1(t′) = ∑_{t∈T} ( βi(t) · At′t · Bt(wi) )
with βn(t) = 1 (by convention, practical considerations)
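A direct transcription of these two recursions (a sketch in the same representation as the coin example; no rescaling is done, so very long sequences would underflow in practice):

```python
def forward(w, I, A, B, states):
    """alpha[i][t] = P(w_1..w_{i+1}, T_{i+1} = t | theta), 0-indexed positions."""
    alpha = [[I[t] * B[t][w[0]] for t in states]]
    for i in range(1, len(w)):
        alpha.append([B[t2][w[i]] * sum(alpha[-1][t] * A[t][t2] for t in states)
                      for t2 in states])
    return alpha

def backward(w, A, B, states):
    """beta[i][t] = P(w_{i+2}..w_n | T_{i+1} = t, theta), with beta[n-1][t] = 1."""
    beta = [[1.0] * len(states)]
    for i in range(len(w) - 1, 0, -1):
        beta.insert(0, [sum(beta[0][t2] * A[t][t2] * B[t2][w[i]] for t2 in states)
                        for t in states])
    return beta
```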
Forward-Backward algorithms (2)
"Forward-backward" variable: γi(t) = P(Ti = t | w, θ)
γi(t) = P(w, Ti = t | θ) / P(w|θ) = αi(t) · βi(t) / ∑_{t′∈T} αi(t′) · βi(t′)
Computation in O(n · m²) → efficient solutions to the "first problem":
P(w|θ) = ∑_{t∈T} P(w, Tn = t | θ) = ∑_{t∈T} αn(t)
P(w|θ) = ∑_{t∈T} αi(t) · βi(t)   ∀i: 1 ≤ i ≤ n
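The gamma table and both expressions for P(w|θ) then follow directly from the alpha/beta tables (a sketch continuing the previous one, using `obs` from the coin example):

```python
def gamma(alpha, beta):
    """gamma[i][t] = P(T_{i+1} = t | w, theta) = alpha_i(t) beta_i(t) / sum_t2 alpha_i(t2) beta_i(t2)."""
    g = []
    for a_i, b_i in zip(alpha, beta):
        z = sum(a * b for a, b in zip(a_i, b_i))   # equals P(w|theta) at every position
        g.append([a * b / z for a, b in zip(a_i, b_i)])
    return g

alpha, beta = forward(obs, I, A, B, states), backward(obs, A, B, states)
# P(w|theta) read off the last alpha column matches the alpha*beta sum at any position:
assert abs(sum(alpha[-1]) - sum(a * b for a, b in zip(alpha[0], beta[0]))) < 1e-12
```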
Contents
➥ HMM models, three basic problems
➥ Forward-Backward algorithms
☞ Viterbi algorithm
➥ Baum-Welch algorithm
Viterbi algorithm (1)
Efficient solution to the "second problem": find the most likely sequence of states t (knowing w and the parameters θ): Argmax_t P(t|w, θ)
⇒ maximize (in t) P(t, w|θ).
"The" lattice ☞ temporal unfolding of all possible walks through the Markov chain
[Figure: the lattice. The m states (1, 2, ..., m) are stacked vertically and repeated for each time step w1, w2, ..., wn; edges between consecutive columns represent the possible transitions.]
Viterbi algorithm (2)
Let ρi(t) = max_{t1,...,ti−1} P(t1, ..., ti−1, ti = t, w1, ..., wi | θ)
We are looking for max_{t∈T} ρn(t)
It's easy (exercise) to show that ρi(t) = max_{t′} [ P(t|t′, θ) · P(wi|t, θ) · ρi−1(t′) ]
from which the following algorithm comes:
Viterbi algorithm (3)
for all t ∈ T do
    ρ1(t) = It · Bt(w1)
for i from 2 to n do
    for all t ∈ T do
        • ρi(t) = Bt(wi) · max_{t′} ( At′t · ρi−1(t′) )
        • mark one of the transitions from t′ to t where the maximum is reached
reconstruct backwards (from tn) the best path following the marked transitions
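A compact Python version of this algorithm (a sketch; it assumes, as in the coin example, that the states are numbered 0..m-1 so that they can index the tables directly):

```python
def viterbi(w, I, A, B, states):
    """Return Argmax_t P(t, w | theta) using the lattice recursion and back-pointers."""
    rho = [I[t] * B[t][w[0]] for t in states]              # rho_1(t)
    back = []                                              # marked transitions
    for i in range(1, len(w)):
        prev, rho, ptr = rho, [], []
        for t in states:
            best = max(states, key=lambda tp: A[tp][t] * prev[tp])
            rho.append(B[t][w[i]] * A[best][t] * prev[best])
            ptr.append(best)
        back.append(ptr)
    path = [max(states, key=lambda t: rho[t])]             # best final state
    for ptr in reversed(back):                             # follow the marks backwards
        path.insert(0, ptr[path[0]])
    return path

# e.g. viterbi(obs, I, A, B, states) recovers a most likely coin sequence
```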
Contents
➥ HMM models, three basic problems
➥ Forward-Backward algorithms
➥ Viterbi algorithm
☞ Baum-Welch algorithm
Expectation-Maximization
Our goal: maximize P(θ|w)
☞ Maximum-likelihood estimation of θ → maximization of P(w|θ)
To achieve it: Expectation-Maximization (EM) algorithm
General formulation of EM:
given
• observed data w = w1...wn
• a parameterized probability distribution P (T,W|θ) where
– T = T1...Tn are unobserved data
– θ are the parameters of the model
determine θ that maximizes P(w|θ) by convergence of the iterative computation of the series θ(i) that maximizes (in θ) E_T[ log P(T, W|θ) | w, θ(i−1) ]
Expectation-Maximization (2)
To do so, define the auxiliary function
Q(θ, θ′) = E_T[ log P(T, W|θ) | w, θ′ ] = ∑_t P(t|w, θ′) · log P(t, w|θ)
as it can be shown (with Jensen's inequality) that
Q(θ, θ′) > Q(θ′, θ′) ⇒ P(w|θ) > P(w|θ′)
This is the fundamental principle of EM: if we already have an estimate θ′ of the parameters and we find another parameter configuration θ for which the first inequality (on Q) holds, then w is more probable under the parameters θ than under θ′.
Expectation-Maximization (3)
EM algorithm:
• Expectation step: compute Q(θ, θ(i))
• Maximization step: compute θ(i+1) = Argmax_θ Q(θ, θ(i))
In other words:
1. Choose θ(0) (and set i = 0)
2. Find θ(i+1) which maximizes ∑_t P(t|w, θ(i)) · log P(t, w|θ(i+1))
3. Set i ← i + 1 and go back to (2) unless some convergence test is fulfilled
Baum-Welch Algorithm
The Baum-Welch algorithm is an EM algorithm for estimating HMM parameters. It is an answer to the "third problem".
The goal is therefore to find
Argmax_θ ∑_t P(t|w, θ′) · log P(t, w|θ) = Argmax_θ ∑_t P(t, w|θ′) · log P(t, w|θ)
(the second sum is denoted Q(θ, θ′) by definition), since P(w|θ′) does not depend on θ.
What is log P(t, w|θ)?
log P(t, w|θ) = log PI(t1) + ∑_{i=2}^{n} log P(ti|ti−1) + ∑_{i=1}^{n} log P(wi|ti)
Q(θ, θ′) therefore consists of 3 terms:
Q((I, A, B), θ′) = QI(I, θ′) + QA(A, θ′) + QB(B, θ′)
Let's compute one of them:
QI(I, θ′) = ∑_t P(t, w|θ′) · log PI(t1)
 = ∑_{t1} ∑_{t2,...,tn} P(t1, w|θ′) · P(t2, ..., tn | t1, w, θ′) · log PI(t1)
 = ∑_{t∈T} P(T1 = t, w|θ′) · log PI(t) · ∑_{t2,...,tn} P(t2, ..., tn | T1 = t, w, θ′)   [this last sum equals 1]
 = ∑_{t∈T} P(T1 = t, w|θ′) · log It
Similarly we have:
QA(A, θ′) = ∑_{i=2}^{n} ∑_{t∈T} ∑_{t′∈T} P(Ti−1 = t, Ti = t′, w|θ′) · log Att′
QB(B, θ′) = ∑_{i=1}^{n} ∑_{t∈T} P(Ti = t, w|θ′) · log Bt(wi)
Therefore Q is a sum of three independent terms (e.g. QI does not depend on A nor on B), and so the maximization over θ can be carried out on the three terms separately, i.e. maximizing QI(I, θ′) over I, QA(A, θ′) over A and QB(B, θ′) over B separately.
Notice that all these three functions are sums (over i) of functions of the form:
f(x) = ∑_{j=1}^{m} yj · log xj
and all the above three functions have to be maximized under the constraint ∑_{j=1}^{m} xj = 1. (*)
This maximization under constraints is achieved using Lagrange multipliers, i.e. looking at
g(x) = f(x) − λ · ∑_{j=1}^{m} xj = ∑_{j=1}^{m} ( yj · log xj − λ · xj )
Solving ∂g/∂xj = 0, we find that λ = yj / xj.
Putting this back into the constraint we find:
xj = yj / ∑_{k=1}^{m} yk
(*) To be accurate: for Bt the constraint is ∑_{w∈L} Bt(w) = 1. This changes the formulas a bit, but not the essence of the computation.
Summarizing the obtained results, we have the following reestimation formulas (where the maximum is reached):
It = P(T1 = t, w|θ′) / ∑_{t′∈T} P(T1 = t′, w|θ′) = P(T1 = t, w|θ′) / P(w|θ′)
Att′ = ∑_{i=2}^{n} P(Ti−1 = t, Ti = t′, w|θ′) / ∑_{i=2}^{n} ∑_{τ∈T} P(Ti−1 = t, Ti = τ, w|θ′)
     = ∑_{i=2}^{n} P(Ti−1 = t, Ti = t′, w|θ′) / ∑_{i=2}^{n} P(Ti−1 = t, w|θ′)
and:
Bt(w) = ∑_{i=1, wi=w}^{n} P(Ti = t, w|θ′) / ∑_{i=1}^{n} P(Ti = t, w|θ′)
      = ∑_{i=1}^{n} P(Ti = t, w|θ′) · δ_{wi,w} / ∑_{i=1}^{n} P(Ti = t, w|θ′)
with δ_{w,w′} = 1 if w = w′ and 0 otherwise.
Baum-Welch Algorithm: effective computation
How do we compute these reestimation formulas?
Let χi(t, t′) = P(Ti = t, Ti+1 = t′ | w, θ′)
χi is easy to compute from the "forward" and "backward" variables:
χi(t, t′) = αi(t) · Att′ · Bt′(wi+1) · βi+1(t′) / ∑_{τ∈T} ∑_{τ′∈T} αi(τ) · Aττ′ · Bτ′(wi+1) · βi+1(τ′)
Notice: γi(t) = ∑_{t′∈T} χi(t, t′) for all 1 ≤ i < n
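In code (a sketch continuing the earlier forward/backward one), chi can be computed position by position from the alpha and beta tables:

```python
def chi(alpha, beta, A, B, w, states):
    """chi[i][t][t2] = P(T_{i+1} = t, T_{i+2} = t2 | w, theta), for i = 0 .. n-2."""
    out = []
    for i in range(len(w) - 1):
        num = [[alpha[i][t] * A[t][t2] * B[t2][w[i + 1]] * beta[i + 1][t2]
                for t2 in states] for t in states]
        z = sum(sum(row) for row in num)
        out.append([[v / z for v in row] for row in num])
    return out

# sanity check: sum over t2 of chi[i][t][t2] equals gamma[i][t] for every i < n-1
```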
Effective Reestimation formulas
It = γ1(t)
Att′ = ∑_{i=1}^{n−1} χi(t, t′) / ∑_{i=1}^{n−1} γi(t)
Bt(w) = ∑_{i=1, wi=w}^{n} γi(t) / ∑_{i=1}^{n} γi(t) = ∑_{i=1}^{n} γi(t) · δ_{wi,w} / ∑_{i=1}^{n} γi(t)
with δ_{w,w′} = 1 if w = w′ and 0 otherwise.
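These three formulas translate directly into an M-step (a sketch continuing the earlier ones; `vocab`, the list of observable symbols, is an added argument of these notes):

```python
def reestimate(w, g, x, states, vocab):
    """One M-step: new (I, A, B) from the expected counts gamma (g) and chi (x)."""
    I_new = g[0][:]                                          # I_t = gamma_1(t)
    A_new = [[sum(xi[t][t2] for xi in x) / sum(gi[t] for gi in g[:-1])
              for t2 in states] for t in states]
    B_new = [{v: sum(gi[t] for gi, wi in zip(g, w) if wi == v) / sum(gi[t] for gi in g)
              for v in vocab} for t in states]
    return I_new, A_new, B_new
```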
Baum-Welch Algorithm
1. Let θ(0) be an initial parameter set
2. Compute iteratively α, β and then γ and χ
3. Compute θ(t+1) with the reestimation formulas
4. If θ(t+1) ≠ θ(t), go back to (2) [or use another, weaker stopping test]
WARNING!
The algorithm converges, but only towards a local maximum of E[log P(T, W|θ)]
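Putting the pieces together gives the usual iteration (a sketch built on the previous functions; it uses a fixed number of iterations instead of a convergence test, and a real implementation would rescale alpha/beta or work in log space to avoid underflow):

```python
def baum_welch(w, I, A, B, states, vocab, n_iter=20):
    """Alternate E-step (alpha, beta, gamma, chi) and M-step (reestimation formulas)."""
    for _ in range(n_iter):
        alpha = forward(w, I, A, B, states)
        beta = backward(w, A, B, states)
        g = gamma(alpha, beta)
        x = chi(alpha, beta, A, B, w, states)
        I, A, B = reestimate(w, g, x, states, vocab)
    return I, A, B

# e.g. refitting the two-coin model from the observed H/T sequence:
# I_hat, A_hat, B_hat = baum_welch(obs, I, A, B, states, ["H", "T"])
```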
CRF versus HMM
(Linear-chain) Conditional Random Fields (CRFs) are a discriminative generalization of HMMs in which the "features" no longer need to be state-conditional probabilities (they are less constrained).
For instance (order 1):
HMM:
P(t, w) = P(t1) · P(w1|t1) · ∏_{i=2}^{n} P(wi|ti) · P(ti|ti−1)
[Figure: directed chain t1 → t2 → ... → tn, each state ti emitting its word wi.]
CRF:
P(t|w) = ∏_{i=2}^{n} P(ti−1, ti | w)
(with P(ti−1, ti | w) ∝ exp( ∑_j λj · fj(ti−1, ti, w, i) ))
[Figure: undirected chain t1 - t2 - ... - tn, where each ti is connected to the whole observation w1 ... wn.]
Keypoints
➟ HMM definitions and their applications
➟ Three basic problems for HMMs
➟ Algorithms needed to solve these problems: Forward-Backward, Viterbi, Baum-Welch
(be aware of their existence, but not the implementation details)
References
[1] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, 1989.
[2] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[3] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 1977.
[4] H. Bourlard et al., Traitement de la parole, 2000; pp. 179-200, 202-214, 232-260.
ADDENDUM
Justification of the maximization of the auxiliary function Q for finding the θ that maximizes P(w|θ):
log P(w|θ) − log P(w|θ′) = log [ P(w|θ) / P(w|θ′) ] = log ∑_t [ P(w, t|θ) / P(w|θ′) ]
 = log ∑_t P(t|w, θ′) · [ P(w, t|θ) / P(w, t|θ′) ]
 ≥ ∑_t P(t|w, θ′) · log [ P(w, t|θ) / P(w, t|θ′) ]   (Jensen's inequality)
 = E_T[ log P(T, W|θ) | w, θ′ ] − E_T[ log P(T, W|θ′) | w, θ′ ]
 = Q(θ, θ′) − Q(θ′, θ′)
Therefore:
Q(θ, θ′) > Q(θ′, θ′) ⇒ log P(w|θ) > log P(w|θ′) ⇒ P(w|θ) > P(w|θ′)