
Natural Language Processing
Hidden Markov Models

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se


Sequence Classification

• Part-of-speech tagging is a sequence classification problem
  • Input: a sequence $w_1, \ldots, w_n$ of words
  • Output: a sequence $t_1, \ldots, t_n$ of tags
• Why not just take one word at a time (see the sketch below)?
  for $i = 1$ to $n$ do $\arg\max_{t_i} P(t_i \mid w_i, t_{i-1})$
• Because the desired solution may be
  $\arg\max_{t_1, \ldots, t_n} P(t_1, \ldots, t_n \mid w_1, \ldots, w_n)$
• Two approaches:
  • Local optimization: simple and efficient, but may be suboptimal
  • Global optimization: may be computationally challenging
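A minimal sketch of the local, one-word-at-a-time approach. The dictionary `p_tag` is a hypothetical toy model (not from the slides) mapping a word and the previously chosen tag to a distribution over tags:

    # Greedy local tagging: pick the best tag for each word in isolation,
    # conditioning only on the word itself and the previously chosen tag.
    # p_tag is a hypothetical toy model: (word, previous tag) -> {tag: probability}.
    p_tag = {
        ("the", "<s>"):   {"DET": 0.9, "NOUN": 0.1},
        ("can", "DET"):   {"NOUN": 0.7, "AUX": 0.3},
        ("rust", "NOUN"): {"VERB": 0.6, "NOUN": 0.4},
    }

    def greedy_tag(words):
        tags, prev = [], "<s>"
        for w in words:
            dist = p_tag.get((w, prev), {})
            # argmax_t P(t | w, t_prev); locally optimal, possibly globally suboptimal
            best = max(dist, key=dist.get) if dist else "UNK"
            tags.append(best)
            prev = best
        return tags

    print(greedy_tag(["the", "can", "rust"]))  # ['DET', 'NOUN', 'VERB']

Because each choice is frozen before later words are seen, the sequence as a whole may not maximize the joint probability, which is exactly the motivation for the global formulation above.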


Hidden Markov Models

• Markov models are probabilistic sequence models used for problems such as:
  1. Speech recognition
  2. Spell checking
  3. Part-of-speech tagging
  4. Named entity recognition
• A Markov model traverses a sequence of states, emitting signals
• If the signal sequence does not uniquely determine the state sequence, the model is said to be hidden
• Markov models have efficient algorithms for optimization


Hidden Markov Models

• A Markov model consists of five elements:
  1. A finite set of states $\Omega = \{s_1, \ldots, s_k\}$
  2. A finite signal alphabet $\Sigma = \{\sigma_1, \ldots, \sigma_m\}$
  3. Initial probabilities $P(s)$ (for every $s \in \Omega$), defining the probability of starting in state $s$
  4. Transition probabilities $P(s_i \mid s_j)$ (for every $(s_i, s_j) \in \Omega^2$), defining the probability of going from state $s_j$ to state $s_i$
  5. Emission probabilities $P(\sigma \mid s)$ (for every $(\sigma, s) \in \Sigma \times \Omega$), defining the probability of emitting symbol $\sigma$ in state $s$
• We can get rid of the initial probabilities by adding a designated start state $s_0$, which leaves four elements
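As a concrete illustration of these elements, a tiny HMM can be written down directly as Python dictionaries. All states, symbols, and probability values below are invented for illustration only; they are not part of the slides:

    # A toy HMM with the five elements above, written as plain Python data.
    # All states, symbols, and probabilities are invented for illustration.
    states = {"NOUN", "VERB"}                      # Omega
    alphabet = {"flies", "like"}                   # Sigma
    initial = {"NOUN": 0.6, "VERB": 0.4}           # P(s)
    transition = {                                 # P(s_i | s_j), keyed as [from][to]
        "NOUN": {"NOUN": 0.3, "VERB": 0.7},
        "VERB": {"NOUN": 0.8, "VERB": 0.2},
    }
    emission = {                                   # P(sigma | s)
        "NOUN": {"flies": 0.5, "like": 0.5},
        "VERB": {"flies": 0.4, "like": 0.6},
    }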



Markov Assumptions

• State transitions are assumed to be independent of everything except the current state:
  $P(s_1, \ldots, s_n) = \prod_{i=1}^{n} P(s_i \mid s_{i-1})$
• Signal emissions are assumed to be independent of everything except the current state:
  $P(s_1, \ldots, s_n, \sigma_1, \ldots, \sigma_n) = P(s_1, \ldots, s_n) \prod_{i=1}^{n} P(\sigma_i \mid s_i)$
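Under these two assumptions the joint probability of a state/signal sequence is just a product of local transition and emission factors. A small self-contained sketch, using toy dictionaries of the kind shown earlier (all numbers invented for illustration):

    # Joint probability P(s_1..s_n, sigma_1..sigma_n) under the two Markov
    # assumptions: a product of transition and emission factors.
    def sequence_probability(states_seq, signals_seq, initial, transition, emission):
        p = 1.0
        prev = None
        for s, sig in zip(states_seq, signals_seq):
            p *= initial[s] if prev is None else transition[prev][s]  # P(s_i | s_{i-1})
            p *= emission[s][sig]                                     # P(sigma_i | s_i)
            prev = s
        return p

    initial = {"NOUN": 0.6, "VERB": 0.4}
    transition = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
    emission = {"NOUN": {"flies": 0.5, "like": 0.5}, "VERB": {"flies": 0.4, "like": 0.6}}

    print(sequence_probability(["NOUN", "VERB"], ["flies", "like"],
                               initial, transition, emission))  # 0.6*0.5*0.7*0.6 = 0.126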


Tagging with an HMM

• Modeling assumptions:
  1. States represent tags (or tag sequences)
  2. Signals represent words
• Probability model (first-order):
  $P(w_1, \ldots, w_n, t_1, \ldots, t_n) = \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
• An n-th order model conditions tags on the n preceding tags
• Ambiguity:
  • The same word (signal) can be generated by different tags (states)
  • Language is ambiguous ⇒ the Markov model is hidden
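As a small worked instance of the first-order model (the fragment is an invented example, and $t_0$ denotes the start tag), the joint probability of a/DET dog/NOUN expands to:

$P(\text{a}, \text{dog}, \text{DET}, \text{NOUN}) = P(\text{a} \mid \text{DET})\, P(\text{DET} \mid t_0)\, P(\text{dog} \mid \text{NOUN})\, P(\text{NOUN} \mid \text{DET})$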


A Simple HMM


A Versatile Model

  Problem                          State       Signal
  Part-of-speech tagging           Tag         Word
  Speech recognition               Phoneme     Sound
  Optical character recognition    Character   Pixels
  Predictive text entry            Character   Keystroke
  ...                              ...         ...


Problems for HMMs

• Decoding = finding the optimal state sequence
  $\arg\max_{s_1, \ldots, s_n} P(s_1, \ldots, s_n, \sigma_1, \ldots, \sigma_n)$
• Learning = estimating the model parameters
  $\hat{P}(s_i \mid s_j)$ for all $s_i, s_j$    $\hat{P}(\sigma \mid s)$ for all $s, \sigma$


Quiz

• In a simple substitution cipher, every letter in the alphabet is replaced by exactly one symbol, and that symbol is never used to represent another letter. For example, "mummy" would be written "abaac" in a cipher where m → a, u → b, and y → c.
• In a homophonic substitution cipher, more than one symbol may be used to replace the same letter, but never to represent another letter. For example, "mummy" could be written "abcde" in a cipher where m → a, c, or d; u → b; and y → e.
• Suppose we use HMMs to relate sequences of plaintext characters and sequences of cipher characters. Which of the following HMMs are hidden?

     Cipher        State       Signal
  1  Simple        Plaintext   Cipher
  2  Simple        Cipher      Plaintext
  3  Homophonic    Plaintext   Cipher
  4  Homophonic    Cipher      Plaintext


Natural Language Processing
HMM Decoding

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se


Decoding

• Decoding = finding the optimal state sequence
  $\arg\max_{s_1, \ldots, s_n} P(s_1, \ldots, s_n, \sigma_1, \ldots, \sigma_n)$
• Difficulty:
  • Maximizing over all possible state sequences
  • The number $|\Omega|^n$ of sequences grows exponentially
• Key observation for dynamic programming:
  • A solution of size $n$ contains a solution of size $n-1$:
  $\arg\max_{s_1, \ldots, s_n} P(s_1, \ldots, s_n, \sigma_1, \ldots, \sigma_n) = \arg\max_{s_n} \arg\max_{s_1, \ldots, s_{n-1}} P(s_1, \ldots, s_{n-1}, \sigma_1, \ldots, \sigma_{n-1})\, P(s_n \mid s_{n-1})\, P(\sigma_n \mid s_n)$


The Viterbi Algorithm

• Best probability and preceding state at step $i$ and state $s$:
  $\mathrm{viterbi}(i, s) \triangleq \max_{s'} P(\sigma_1, \ldots, \sigma_i, s_{i-1} = s', s_i = s)$
  $\mathrm{ptr}(i, s) \triangleq \arg\max_{s'} P(\sigma_1, \ldots, \sigma_i, s_{i-1} = s', s_i = s)$
• Compute for all $i$ and $s$ recursively:
  $\mathrm{viterbi}(1, s) = P(s \mid s_0)\, P(\sigma_1 \mid s)$ and $\mathrm{ptr}(1, s) = s_0$
  $\mathrm{viterbi}(i, s) = \max_{s'} \mathrm{viterbi}(i-1, s')\, P(s \mid s')\, P(\sigma_i \mid s)$
  $\mathrm{ptr}(i, s) = \arg\max_{s'} \mathrm{viterbi}(i-1, s')\, P(s \mid s')\, P(\sigma_i \mid s)$
• Find the best end state and follow the pointers backwards:
  $\max_{s} \mathrm{viterbi}(n, s) = \max_{s_1, \ldots, s_n} P(\sigma_1, \ldots, \sigma_n, s_1, \ldots, s_n)$
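A compact sketch of this recursion in Python, assuming the same kind of toy transition/emission dictionaries used earlier; the state "<s>" plays the role of $s_0$, and all names and probabilities are illustrative, not taken from the slides:

    # Viterbi decoding: V[i][s] = best probability of any state sequence ending
    # in state s after emitting the first i+1 signals; ptr[i][s] remembers the
    # best predecessor so the optimal sequence can be read off backwards.
    def viterbi_decode(signals, states, start, transition, emission):
        V = [{s: transition[start][s] * emission[s][signals[0]] for s in states}]
        ptr = [{s: start for s in states}]
        for i in range(1, len(signals)):
            V.append({})
            ptr.append({})
            for s in states:
                best_prev = max(states, key=lambda sp: V[i - 1][sp] * transition[sp][s])
                V[i][s] = V[i - 1][best_prev] * transition[best_prev][s] * emission[s][signals[i]]
                ptr[i][s] = best_prev
        # Find the best end state and follow the back-pointers.
        last = max(states, key=lambda s: V[-1][s])
        path = [last]
        for i in range(len(signals) - 1, 0, -1):
            path.append(ptr[i][path[-1]])
        return list(reversed(path)), V[-1][last]

    states = ["NOUN", "VERB"]
    transition = {"<s>": {"NOUN": 0.6, "VERB": 0.4},
                  "NOUN": {"NOUN": 0.3, "VERB": 0.7},
                  "VERB": {"NOUN": 0.8, "VERB": 0.2}}
    emission = {"NOUN": {"flies": 0.5, "like": 0.5},
                "VERB": {"flies": 0.4, "like": 0.6}}

    print(viterbi_decode(["flies", "like"], states, "<s>", transition, emission))
    # (['NOUN', 'VERB'], 0.126)

Because each cell only looks back one step, the table is filled in $O(n\,|\Omega|^2)$ time instead of enumerating all $|\Omega|^n$ sequences.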


Viterbi Example


Quiz

• Consider the following Viterbi computation:
  [figure]
• What is viterbi(4, N)?
  1. $2.6 \cdot 10^{-9}$
  2. $4.3 \cdot 10^{-6}$
  3. 0
• What is ptr(4, N)?
  1. V
  2. N
  3. ART


Natural Language Processing
HMM Learning

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se


Learning

• Learning = estimating the model parameters
  $\hat{P}(s_i \mid s_j)$ for all $s_i, s_j$    $\hat{P}(\sigma \mid s)$ for all $s, \sigma$
• Supervised learning:
  • Given a labeled training corpus, we can estimate the parameters using (smoothed) relative frequencies (see the sketch below)
• Weakly supervised learning:
  • Given a dictionary and an unlabeled training corpus, we can use Expectation-Maximization to estimate the parameters
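A minimal sketch of the supervised case: read transition and emission counts off a tagged corpus and turn them into (unsmoothed) relative frequencies. The two-sentence corpus is invented for illustration:

    # Supervised HMM learning: maximum likelihood estimates are relative
    # frequencies of transitions and emissions in a tagged corpus (no smoothing here).
    from collections import Counter, defaultdict

    corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
              [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]

    trans_counts = defaultdict(Counter)   # counts of tag -> next tag
    emit_counts = defaultdict(Counter)    # counts of tag -> word

    for sentence in corpus:
        prev = "<s>"                      # the start state s_0
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    transition = {tag: normalize(c) for tag, c in trans_counts.items()}
    emission = {tag: normalize(c) for tag, c in emit_counts.items()}

    print(transition["DET"])   # {'NOUN': 1.0}
    print(emission["NOUN"])    # {'dog': 0.5, 'cat': 0.5}

In practice the relative frequencies would be smoothed, as the slide notes, so that unseen transitions and emissions do not receive probability zero.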


Maximum Likelihood Estimation

• With labeled data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, we maximize the joint likelihood of input and output:
  $\arg\max_{\theta} \prod_{i=1}^{n} P_{\theta}(y_i)\, P_{\theta}(x_i \mid y_i)$
• With unlabeled data $D = \{x_1, \ldots, x_n\}$, we can instead maximize the marginal likelihood of the input:
  $\arg\max_{\theta} \prod_{i=1}^{n} \sum_{y \in \Omega_Y} P_{\theta}(y)\, P_{\theta}(x_i \mid y)$
• In this case, $Y$ is a hidden variable that is marginalized out


Expectation-Maximization

• Goal:
  $\arg\max_{\theta} \prod_{i=1}^{n} \sum_{y \in \Omega_Y} P_{\theta}(y)\, P_{\theta}(x_i \mid y)$
• Problem:
  • Maximizing a product of sums is mathematically hard
  • In most cases, there is no closed-form solution
• Solution:
  • Use numerical approximation methods
  • Most common approach: Expectation-Maximization (EM)


Expectation-Maximization

• Basic observations:
  • If we know $f(x, y)$, we can find the MLE $\theta$.
  • If we know $f(x)$ and the MLE $\theta$, we can derive $f(x, y)$.
  • If $P_{\theta}(y \mid x) = p$ and $f(x) = n$, then $f(x, y) = np$.
• Basic idea behind EM (sketched below):
  1. Start by guessing $\theta$
  2. Compute the expected counts $E[f(x, y)] = P_{\theta}(y \mid x)\, f(x)$
  3. Find the MLE $\theta$ given the expectations $E[f(x, y)]$
  4. Repeat steps 2 and 3 until convergence
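A small sketch of this recipe for the simplest possible case: a categorical mixture over single observations $x$ with one hidden label $y$ each (not yet an HMM; the words, tags, and initial guesses are toy values invented for illustration). Step 2 computes the expected counts $P_{\theta}(y \mid x)\,f(x)$ and step 3 re-estimates $\theta$ from them:

    # Generic EM loop for a categorical mixture: x is an observed word, y a hidden tag.
    f_x = {"can": 4, "trash": 2}                       # observed counts f(x)
    tags = ["NOUN", "VERB"]

    # Step 1: start by guessing theta = (P(y), P(x|y)).
    p_y = {"NOUN": 0.5, "VERB": 0.5}
    p_x_given_y = {"NOUN": {"can": 0.7, "trash": 0.3},
                   "VERB": {"can": 0.4, "trash": 0.6}}

    for _ in range(20):                                # repeat steps 2 and 3
        # Step 2 (E): expected counts E[f(x, y)] = P(y | x) * f(x).
        f_xy = {}
        for x, n in f_x.items():
            joint = {y: p_y[y] * p_x_given_y[y][x] for y in tags}
            z = sum(joint.values())
            for y in tags:
                f_xy[(x, y)] = n * joint[y] / z
        # Step 3 (M): MLE of theta given the expected counts.
        total = sum(f_xy.values())
        p_y = {y: sum(f_xy[(x, y)] for x in f_x) / total for y in tags}
        p_x_given_y = {y: {x: f_xy[(x, y)] / sum(f_xy[(x2, y)] for x2 in f_x)
                           for x in f_x} for y in tags}

    print(p_y, p_x_given_y)

The same loop carries over to HMMs once the expected counts are computed over state sequences, which is what the next slides address.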


EM for Hidden Markov Models

• Computing expectations:
  $E[f(s, s')] = \sum_{i=1}^{n-1} P(\sigma_1, \ldots, \sigma_n, s_i = s, s_{i+1} = s')$
  $E[f(s, \sigma)] = \sum_{i=1}^{n} P(\sigma_1, \ldots, \sigma_{i-1}, \sigma_i = \sigma, \ldots, \sigma_n, s_i = s)$
• Difficulty:
  • Summing over all possible state sequences
  • The number $|\Omega|^n$ of sequences grows exponentially
• Sounds familiar?
  • We can use dynamic programming again
  • The forward-backward algorithm


The Forward-Backward Algorithm

• Probability of the signal sequence (Forward):
  $\alpha(i, s) \triangleq P(\sigma_1, \ldots, \sigma_i, s_i = s)$
  $\alpha(1, s) = P(s \mid s_0)\, P(\sigma_1 \mid s)$
  $\alpha(i, s) = \sum_{s'} \alpha(i-1, s')\, P(s \mid s')\, P(\sigma_i \mid s)$
  $\sum_{s} \alpha(n, s) = P(\sigma_1, \ldots, \sigma_n)$
• Probability of the signal sequence (Backward):
  $\beta(i, s) \triangleq P(\sigma_{i+1}, \ldots, \sigma_n \mid s_i = s)$
  $\beta(n, s) = 1$
  $\beta(i, s) = \sum_{s'} \beta(i+1, s')\, P(s' \mid s)\, P(\sigma_{i+1} \mid s')$
  $\sum_{s} \beta(1, s)\, P(s \mid s_0)\, P(\sigma_1 \mid s) = P(\sigma_1, \ldots, \sigma_n)$
• Expected counts of hidden states (Forward-Backward):
  $E[C(s, s')] = \sum_{i=1}^{n-1} \alpha(i, s)\, P(s' \mid s)\, P(\sigma_{i+1} \mid s')\, \beta(i+1, s')$
  $E[C(s, \sigma)] = \sum_{i : \sigma_i = \sigma} \alpha(i, s)\, \beta(i, s)$
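A sketch of the forward and backward recursions in Python, using the same kind of toy dictionaries as before ("<s>" stands in for $s_0$; all probabilities are invented for illustration). Both passes should yield the same total probability $P(\sigma_1, \ldots, \sigma_n)$:

    # Forward and backward probabilities for an HMM given as toy dictionaries.
    # alpha[i][s] = P(sigma_1..sigma_{i+1}, s_{i+1} = s) with 0-based indexing;
    # beta[i][s]  = P(sigma_{i+2}..sigma_n | s_{i+1} = s).
    def forward(signals, states, start, transition, emission):
        alpha = [{s: transition[start][s] * emission[s][signals[0]] for s in states}]
        for i in range(1, len(signals)):
            alpha.append({s: sum(alpha[i - 1][sp] * transition[sp][s] for sp in states)
                             * emission[s][signals[i]] for s in states})
        return alpha

    def backward(signals, states, transition, emission):
        beta = [{s: 1.0 for s in states} for _ in signals]
        for i in range(len(signals) - 2, -1, -1):
            beta[i] = {s: sum(beta[i + 1][sp] * transition[s][sp] * emission[sp][signals[i + 1]]
                              for sp in states) for s in states}
        return beta

    states = ["NOUN", "VERB"]
    transition = {"<s>": {"NOUN": 0.6, "VERB": 0.4},
                  "NOUN": {"NOUN": 0.3, "VERB": 0.7},
                  "VERB": {"NOUN": 0.8, "VERB": 0.2}}
    emission = {"NOUN": {"flies": 0.5, "like": 0.5},
                "VERB": {"flies": 0.4, "like": 0.6}}
    signals = ["flies", "like"]

    alpha = forward(signals, states, "<s>", transition, emission)
    beta = backward(signals, states, transition, emission)
    total = sum(alpha[-1][s] for s in states)                # P(sigma_1..sigma_n)
    check = sum(beta[0][s] * transition["<s>"][s] * emission[s][signals[0]] for s in states)
    print(total, check)                                      # both print the same value

The expected transition and emission counts on the slide are then obtained by combining $\alpha$ and $\beta$ values position by position, without ever enumerating the $|\Omega|^n$ state sequences.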


Quiz

• Assume we have the following tagged corpus:
  a/DET can/NOUN can/AUX trash/VERB a/DET trash/NOUN can/NOUN
• What is the MLE of P(AUX | NOUN)?
  1. 0.0
  2. 0.5
  3. 1.0
• What is the MLE of P(trash | NOUN)?
  1. 0.33
  2. 0.5
  3. 0.67