
Natural Language Processing
Hidden Markov Models

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se


Sequence Classification

• Part-of-speech tagging is a sequence classification problem
  • Input: a sequence $w_1, \ldots, w_n$ of words
  • Output: a sequence $t_1, \ldots, t_n$ of tags
• Why not just take one word at a time (see the sketch below)?
  for $i = 1$ to $n$ do $\arg\max_{t_i} P(t_i \mid w_i, t_{i-1})$
• Because the desired solution may be
  $\arg\max_{t_1, \ldots, t_n} P(t_1, \ldots, t_n \mid w_1, \ldots, w_n)$
• Two approaches:
  • Local optimization: simple and efficient, but may be suboptimal
  • Global optimization: may be computationally challenging
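A minimal sketch of the local, one-word-at-a-time approach. The dictionary `p_tag` is a hypothetical toy model (not from the slides) mapping a word and the previously chosen tag to a distribution over tags:

    # Greedy local tagging: pick the best tag for each word in isolation,
    # conditioning only on the word itself and the previously chosen tag.
    # p_tag is a hypothetical toy model: (word, previous tag) -> {tag: probability}.
    p_tag = {
        ("the", "<s>"):   {"DET": 0.9, "NOUN": 0.1},
        ("can", "DET"):   {"NOUN": 0.7, "AUX": 0.3},
        ("rust", "NOUN"): {"VERB": 0.6, "NOUN": 0.4},
    }

    def greedy_tag(words):
        tags, prev = [], "<s>"
        for w in words:
            dist = p_tag.get((w, prev), {})
            # argmax_t P(t | w, t_prev); locally optimal, possibly globally suboptimal
            best = max(dist, key=dist.get) if dist else "UNK"
            tags.append(best)
            prev = best
        return tags

    print(greedy_tag(["the", "can", "rust"]))  # ['DET', 'NOUN', 'VERB']

Because each choice is frozen before later words are seen, the sequence as a whole may not maximize the joint probability, which is exactly the motivation for the global formulation above.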


Hidden Markov Models

• Markov models are probabilistic sequence models used for problems such as:
  1. Speech recognition
  2. Spell checking
  3. Part-of-speech tagging
  4. Named entity recognition
• A Markov model traverses a sequence of states, emitting signals
• If the signal sequence does not uniquely determine the state sequence, the model is said to be hidden
• Markov models have efficient algorithms for optimization


Hidden Markov Models

• A Markov model consists of five elements:
  1. A finite set of states $\Omega = \{s_1, \ldots, s_k\}$
  2. A finite signal alphabet $\Sigma = \{\sigma_1, \ldots, \sigma_m\}$
  3. Initial probabilities $P(s)$ (for every $s \in \Omega$), defining the probability of starting in state $s$
  4. Transition probabilities $P(s_i \mid s_j)$ (for every $(s_i, s_j) \in \Omega^2$), defining the probability of going from state $s_j$ to state $s_i$
  5. Emission probabilities $P(\sigma \mid s)$ (for every $(\sigma, s) \in \Sigma \times \Omega$), defining the probability of emitting symbol $\sigma$ in state $s$
• We can get rid of the initial probabilities by adding a designated start state $s_0$, which leaves four elements
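As a concrete illustration of these elements, a tiny HMM can be written down directly as Python dictionaries. All states, symbols, and probability values below are invented for illustration only; they are not part of the slides:

    # A toy HMM with the five elements above, written as plain Python data.
    # All states, symbols, and probabilities are invented for illustration.
    states = {"NOUN", "VERB"}                      # Omega
    alphabet = {"flies", "like"}                   # Sigma
    initial = {"NOUN": 0.6, "VERB": 0.4}           # P(s)
    transition = {                                 # P(s_i | s_j), keyed as [from][to]
        "NOUN": {"NOUN": 0.3, "VERB": 0.7},
        "VERB": {"NOUN": 0.8, "VERB": 0.2},
    }
    emission = {                                   # P(sigma | s)
        "NOUN": {"flies": 0.5, "like": 0.5},
        "VERB": {"flies": 0.4, "like": 0.6},
    }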



Markov Assumptions

• State transitions are assumed to be independent of everything except the current state:
  $P(s_1, \ldots, s_n) = \prod_{i=1}^{n} P(s_i \mid s_{i-1})$
• Signal emissions are assumed to be independent of everything except the current state:
  $P(s_1, \ldots, s_n, \sigma_1, \ldots, \sigma_n) = P(s_1, \ldots, s_n) \prod_{i=1}^{n} P(\sigma_i \mid s_i)$
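Under these two assumptions the joint probability of a state/signal sequence is just a product of local transition and emission factors. A small self-contained sketch, using toy dictionaries of the kind shown earlier (all numbers invented for illustration):

    # Joint probability P(s_1..s_n, sigma_1..sigma_n) under the two Markov
    # assumptions: a product of transition and emission factors.
    def sequence_probability(states_seq, signals_seq, initial, transition, emission):
        p = 1.0
        prev = None
        for s, sig in zip(states_seq, signals_seq):
            p *= initial[s] if prev is None else transition[prev][s]  # P(s_i | s_{i-1})
            p *= emission[s][sig]                                     # P(sigma_i | s_i)
            prev = s
        return p

    initial = {"NOUN": 0.6, "VERB": 0.4}
    transition = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
    emission = {"NOUN": {"flies": 0.5, "like": 0.5}, "VERB": {"flies": 0.4, "like": 0.6}}

    print(sequence_probability(["NOUN", "VERB"], ["flies", "like"],
                               initial, transition, emission))  # 0.6*0.5*0.7*0.6 = 0.126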


Tagging with an HMM

• Modeling assumptions:
  1. States represent tags (or tag sequences)
  2. Signals represent words
• Probability model (first-order):
  $P(w_1, \ldots, w_n, t_1, \ldots, t_n) = \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
• An n-th order model conditions tags on the n preceding tags
• Ambiguity:
  • The same word (signal) can be generated by different tags (states)
  • Language is ambiguous ⇒ the Markov model is hidden
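As a small worked instance of the first-order model (the fragment is an invented example, and $t_0$ denotes the start tag), the joint probability of a/DET dog/NOUN expands to:

$P(\text{a}, \text{dog}, \text{DET}, \text{NOUN}) = P(\text{a} \mid \text{DET})\, P(\text{DET} \mid t_0)\, P(\text{dog} \mid \text{NOUN})\, P(\text{NOUN} \mid \text{DET})$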


A Simple HMM


A Versatile Model

  Problem                          State       Signal
  Part-of-speech tagging           Tag         Word
  Speech recognition               Phoneme     Sound
  Optical character recognition    Character   Pixels
  Predictive text entry            Character   Keystroke
  ...                              ...         ...


Problems for HMMs

• Decoding = finding the optimal state sequence
  $\arg\max_{s_1, \ldots, s_n} P(s_1, \ldots, s_n, \sigma_1, \ldots, \sigma_n)$
• Learning = estimating the model parameters
  $\hat{P}(s_i \mid s_j)$ for all $s_i, s_j$    $\hat{P}(\sigma \mid s)$ for all $s, \sigma$


Quiz

• In a simple substitution cipher, every letter in the alphabet is replaced by exactly one symbol, and that symbol is never used to represent another letter. For example, "mummy" would be written "abaac" in a cipher where m → a, u → b, and y → c.
• In a homophonic substitution cipher, more than one symbol may be used to replace the same letter, but never to represent another letter. For example, "mummy" could be written "abcde" in a cipher where m → a, c, or d; u → b; and y → e.
• Suppose we use HMMs to relate sequences of plaintext characters and sequences of cipher characters. Which of the following HMMs are hidden?

     Cipher        State       Signal
  1  Simple        Plaintext   Cipher
  2  Simple        Cipher      Plaintext
  3  Homophonic    Plaintext   Cipher
  4  Homophonic    Cipher      Plaintext


Natural Language Processing
HMM Decoding

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se


Decoding

• Decoding = finding the optimal state sequence
  $\arg\max_{s_1, \ldots, s_n} P(s_1, \ldots, s_n, \sigma_1, \ldots, \sigma_n)$
• Difficulty:
  • Maximizing over all possible state sequences
  • The number $|\Omega|^n$ of sequences grows exponentially
• Key observation for dynamic programming:
  • A solution of size $n$ contains a solution of size $n-1$:
  $\arg\max_{s_1, \ldots, s_n} P(s_1, \ldots, s_n, \sigma_1, \ldots, \sigma_n) = \arg\max_{s_n} \arg\max_{s_1, \ldots, s_{n-1}} P(s_1, \ldots, s_{n-1}, \sigma_1, \ldots, \sigma_{n-1})\, P(s_n \mid s_{n-1})\, P(\sigma_n \mid s_n)$


The Viterbi Algorithm

• Best probability and preceding state at step $i$ and state $s$:
  $\mathrm{viterbi}(i, s) \triangleq \max_{s'} P(\sigma_1, \ldots, \sigma_i, s_{i-1} = s', s_i = s)$
  $\mathrm{ptr}(i, s) \triangleq \arg\max_{s'} P(\sigma_1, \ldots, \sigma_i, s_{i-1} = s', s_i = s)$
• Compute for all $i$ and $s$ recursively:
  $\mathrm{viterbi}(1, s) = P(s \mid s_0)\, P(\sigma_1 \mid s)$ and $\mathrm{ptr}(1, s) = s_0$
  $\mathrm{viterbi}(i, s) = \max_{s'} \mathrm{viterbi}(i-1, s')\, P(s \mid s')\, P(\sigma_i \mid s)$
  $\mathrm{ptr}(i, s) = \arg\max_{s'} \mathrm{viterbi}(i-1, s')\, P(s \mid s')\, P(\sigma_i \mid s)$
• Find the best end state and follow the pointers backwards:
  $\max_{s} \mathrm{viterbi}(n, s) = \max_{s_1, \ldots, s_n} P(\sigma_1, \ldots, \sigma_n, s_1, \ldots, s_n)$
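A compact sketch of this recursion in Python, assuming the same kind of toy transition/emission dictionaries used earlier; the state "<s>" plays the role of $s_0$, and all names and probabilities are illustrative, not taken from the slides:

    # Viterbi decoding: V[i][s] = best probability of any state sequence ending
    # in state s after emitting the first i+1 signals; ptr[i][s] remembers the
    # best predecessor so the optimal sequence can be read off backwards.
    def viterbi_decode(signals, states, start, transition, emission):
        V = [{s: transition[start][s] * emission[s][signals[0]] for s in states}]
        ptr = [{s: start for s in states}]
        for i in range(1, len(signals)):
            V.append({})
            ptr.append({})
            for s in states:
                best_prev = max(states, key=lambda sp: V[i - 1][sp] * transition[sp][s])
                V[i][s] = V[i - 1][best_prev] * transition[best_prev][s] * emission[s][signals[i]]
                ptr[i][s] = best_prev
        # Find the best end state and follow the back-pointers.
        last = max(states, key=lambda s: V[-1][s])
        path = [last]
        for i in range(len(signals) - 1, 0, -1):
            path.append(ptr[i][path[-1]])
        return list(reversed(path)), V[-1][last]

    states = ["NOUN", "VERB"]
    transition = {"<s>": {"NOUN": 0.6, "VERB": 0.4},
                  "NOUN": {"NOUN": 0.3, "VERB": 0.7},
                  "VERB": {"NOUN": 0.8, "VERB": 0.2}}
    emission = {"NOUN": {"flies": 0.5, "like": 0.5},
                "VERB": {"flies": 0.4, "like": 0.6}}

    print(viterbi_decode(["flies", "like"], states, "<s>", transition, emission))
    # (['NOUN', 'VERB'], 0.126)

Because each cell only looks back one step, the table is filled in $O(n\,|\Omega|^2)$ time instead of enumerating all $|\Omega|^n$ sequences.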


Viterbi Example


Quiz

• Consider the following Viterbi computation:
  [figure]
• What is viterbi(4, N)?
  1. $2.6 \cdot 10^{-9}$
  2. $4.3 \cdot 10^{-6}$
  3. 0
• What is ptr(4, N)?
  1. V
  2. N
  3. ART


Natural Language Processing
HMM Learning

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se


Learning

• Learning = estimating the model parameters
  $\hat{P}(s_i \mid s_j)$ for all $s_i, s_j$    $\hat{P}(\sigma \mid s)$ for all $s, \sigma$
• Supervised learning:
  • Given a labeled training corpus, we can estimate the parameters using (smoothed) relative frequencies (see the sketch below)
• Weakly supervised learning:
  • Given a dictionary and an unlabeled training corpus, we can use Expectation-Maximization to estimate the parameters
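A minimal sketch of the supervised case: read transition and emission counts off a tagged corpus and turn them into (unsmoothed) relative frequencies. The two-sentence corpus is invented for illustration:

    # Supervised HMM learning: maximum likelihood estimates are relative
    # frequencies of transitions and emissions in a tagged corpus (no smoothing here).
    from collections import Counter, defaultdict

    corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
              [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]

    trans_counts = defaultdict(Counter)   # counts of tag -> next tag
    emit_counts = defaultdict(Counter)    # counts of tag -> word

    for sentence in corpus:
        prev = "<s>"                      # the start state s_0
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    transition = {tag: normalize(c) for tag, c in trans_counts.items()}
    emission = {tag: normalize(c) for tag, c in emit_counts.items()}

    print(transition["DET"])   # {'NOUN': 1.0}
    print(emission["NOUN"])    # {'dog': 0.5, 'cat': 0.5}

In practice the relative frequencies would be smoothed, as the slide notes, so that unseen transitions and emissions do not receive probability zero.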


Maximum Likelihood Estimation

• With labeled data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, we maximize the joint likelihood of input and output:
  $\arg\max_{\theta} \prod_{i=1}^{n} P_{\theta}(y_i)\, P_{\theta}(x_i \mid y_i)$
• With unlabeled data $D = \{x_1, \ldots, x_n\}$, we can instead maximize the marginal likelihood of the input:
  $\arg\max_{\theta} \prod_{i=1}^{n} \sum_{y \in \Omega_Y} P_{\theta}(y)\, P_{\theta}(x_i \mid y)$
• In this case, $Y$ is a hidden variable that is marginalized out


Expectation-Maximization

• Goal:
  $\arg\max_{\theta} \prod_{i=1}^{n} \sum_{y \in \Omega_Y} P_{\theta}(y)\, P_{\theta}(x_i \mid y)$
• Problem:
  • Maximizing a product of sums is mathematically hard
  • In most cases, there is no closed-form solution
• Solution:
  • Use numerical approximation methods
  • Most common approach: Expectation-Maximization (EM)


Expectation-Maximization

• Basic observations:
  • If we know $f(x, y)$, we can find the MLE $\theta$.
  • If we know $f(x)$ and the MLE $\theta$, we can derive $f(x, y)$.
  • If $P_{\theta}(y \mid x) = p$ and $f(x) = n$, then $f(x, y) = np$.
• Basic idea behind EM (sketched below):
  1. Start by guessing $\theta$
  2. Compute the expected counts $E[f(x, y)] = P_{\theta}(y \mid x)\, f(x)$
  3. Find the MLE $\theta$ given the expectations $E[f(x, y)]$
  4. Repeat steps 2 and 3 until convergence
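A small sketch of this recipe for the simplest possible case: a categorical mixture over single observations $x$ with one hidden label $y$ each (not yet an HMM; the words, tags, and initial guesses are toy values invented for illustration). Step 2 computes the expected counts $P_{\theta}(y \mid x)\,f(x)$ and step 3 re-estimates $\theta$ from them:

    # Generic EM loop for a categorical mixture: x is an observed word, y a hidden tag.
    f_x = {"can": 4, "trash": 2}                       # observed counts f(x)
    tags = ["NOUN", "VERB"]

    # Step 1: start by guessing theta = (P(y), P(x|y)).
    p_y = {"NOUN": 0.5, "VERB": 0.5}
    p_x_given_y = {"NOUN": {"can": 0.7, "trash": 0.3},
                   "VERB": {"can": 0.4, "trash": 0.6}}

    for _ in range(20):                                # repeat steps 2 and 3
        # Step 2 (E): expected counts E[f(x, y)] = P(y | x) * f(x).
        f_xy = {}
        for x, n in f_x.items():
            joint = {y: p_y[y] * p_x_given_y[y][x] for y in tags}
            z = sum(joint.values())
            for y in tags:
                f_xy[(x, y)] = n * joint[y] / z
        # Step 3 (M): MLE of theta given the expected counts.
        total = sum(f_xy.values())
        p_y = {y: sum(f_xy[(x, y)] for x in f_x) / total for y in tags}
        p_x_given_y = {y: {x: f_xy[(x, y)] / sum(f_xy[(x2, y)] for x2 in f_x)
                           for x in f_x} for y in tags}

    print(p_y, p_x_given_y)

The same loop carries over to HMMs once the expected counts are computed over state sequences, which is what the next slides address.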


EM for Hidden Markov Models

• Computing expectations:
  $E[f(s, s')] = \sum_{i=1}^{n-1} P(\sigma_1, \ldots, \sigma_n, s_i = s, s_{i+1} = s')$
  $E[f(s, \sigma)] = \sum_{i=1}^{n} P(\sigma_1, \ldots, \sigma_{i-1}, \sigma_i = \sigma, \ldots, \sigma_n, s_i = s)$
• Difficulty:
  • Summing over all possible state sequences
  • The number $|\Omega|^n$ of sequences grows exponentially
• Sounds familiar?
  • We can use dynamic programming again
  • The forward-backward algorithm


The Forward-Backward Algorithm

• Probability of the signal sequence (Forward):
  $\alpha(i, s) \triangleq P(\sigma_1, \ldots, \sigma_i, s_i = s)$
  $\alpha(1, s) = P(s \mid s_0)\, P(\sigma_1 \mid s)$
  $\alpha(i, s) = \sum_{s'} \alpha(i-1, s')\, P(s \mid s')\, P(\sigma_i \mid s)$
  $\sum_{s} \alpha(n, s) = P(\sigma_1, \ldots, \sigma_n)$
• Probability of the signal sequence (Backward):
  $\beta(i, s) \triangleq P(\sigma_{i+1}, \ldots, \sigma_n \mid s_i = s)$
  $\beta(n, s) = 1$
  $\beta(i, s) = \sum_{s'} \beta(i+1, s')\, P(s' \mid s)\, P(\sigma_{i+1} \mid s')$
  $\sum_{s} \beta(1, s)\, P(s \mid s_0)\, P(\sigma_1 \mid s) = P(\sigma_1, \ldots, \sigma_n)$
• Expected counts of hidden states (Forward-Backward):
  $E[C(s, s')] = \sum_{i=1}^{n-1} \alpha(i, s)\, P(s' \mid s)\, P(\sigma_{i+1} \mid s')\, \beta(i+1, s')$
  $E[C(s, \sigma)] = \sum_{i : \sigma_i = \sigma} \alpha(i, s)\, \beta(i, s)$
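A sketch of the forward and backward recursions in Python, using the same kind of toy dictionaries as before ("<s>" stands in for $s_0$; all probabilities are invented for illustration). Both passes should yield the same total probability $P(\sigma_1, \ldots, \sigma_n)$:

    # Forward and backward probabilities for an HMM given as toy dictionaries.
    # alpha[i][s] = P(sigma_1..sigma_{i+1}, s_{i+1} = s) with 0-based indexing;
    # beta[i][s]  = P(sigma_{i+2}..sigma_n | s_{i+1} = s).
    def forward(signals, states, start, transition, emission):
        alpha = [{s: transition[start][s] * emission[s][signals[0]] for s in states}]
        for i in range(1, len(signals)):
            alpha.append({s: sum(alpha[i - 1][sp] * transition[sp][s] for sp in states)
                             * emission[s][signals[i]] for s in states})
        return alpha

    def backward(signals, states, transition, emission):
        beta = [{s: 1.0 for s in states} for _ in signals]
        for i in range(len(signals) - 2, -1, -1):
            beta[i] = {s: sum(beta[i + 1][sp] * transition[s][sp] * emission[sp][signals[i + 1]]
                              for sp in states) for s in states}
        return beta

    states = ["NOUN", "VERB"]
    transition = {"<s>": {"NOUN": 0.6, "VERB": 0.4},
                  "NOUN": {"NOUN": 0.3, "VERB": 0.7},
                  "VERB": {"NOUN": 0.8, "VERB": 0.2}}
    emission = {"NOUN": {"flies": 0.5, "like": 0.5},
                "VERB": {"flies": 0.4, "like": 0.6}}
    signals = ["flies", "like"]

    alpha = forward(signals, states, "<s>", transition, emission)
    beta = backward(signals, states, transition, emission)
    total = sum(alpha[-1][s] for s in states)                # P(sigma_1..sigma_n)
    check = sum(beta[0][s] * transition["<s>"][s] * emission[s][signals[0]] for s in states)
    print(total, check)                                      # both print the same value

The expected transition and emission counts on the slide are then obtained by combining $\alpha$ and $\beta$ values position by position, without ever enumerating the $|\Omega|^n$ state sequences.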


Quiz

• Assume we have the following tagged corpus:
  a/DET can/NOUN can/AUX trash/VERB a/DET trash/NOUN can/NOUN
• What is the MLE of P(AUX | NOUN)?
  1. 0.0
  2. 0.5
  3. 1.0
• What is the MLE of P(trash | NOUN)?
  1. 0.33
  2. 0.5
  3. 0.67