MARKOV MODELS & POS TAGGING
Nazife Dimililer, 23/10/2012

Overview

• Markov Models
• Hidden Markov Models
• POS Tagging

Andrei Markov

• Russian mathematician (1856–1922)
• Studied temporal probability models
• Markov assumption
  – State_t depends only on a bounded subset of State_0:t-1
• First-order Markov process
  – P(State_t | State_0:t-1) = P(State_t | State_t-1)
• Second-order Markov process
  – P(State_t | State_0:t-1) = P(State_t | State_t-2:t-1)

Markov Assumption• “The future does not depend on the past, given the

present.”• Sometimes this if called the first-order Markov

assumption.• second-order assumption would mean that each

event depends on the previous two events.– This isn’t really a crucial distinction.– What’s crucial is that there is some limit on how far we are

willing to look back.

Page 5: M ARKOV M ODELS & POS T AGGING Nazife Dimililer 23/10/2012

5

Markov Models

• The Markov assumption means there is only one probability to remember for each transition from one event type (e.g., a word) to another event type.
• Plus the probabilities of starting with a particular event.
• This lets us define a Markov model as a finite state automaton in which:
  – the states represent possible event types (e.g., the different words in our example)
  – the transitions represent the probability of one event type following another.
• It is easy to depict a Markov model as a graph.

What is a (Visible) Markov Model?

• Graphical model (can be interpreted as a Bayesian net)
• Circles indicate states
• Arrows indicate probabilistic dependencies between states
• Each state depends only on the previous state
• "The past is independent of the future given the present."

Markov Model Formalization

• {S, Π, A}
  – S : {s_1 … s_N} are the values of the states
  – Π = {π_i} are the initial state probabilities, π_i = P(X_1 = s_i)
  – A = {a_ij} are the state transition probabilities

Markov chain for weather


Markov chain for words


Markov chain = “First-order observable Markov Model”

• A set of states Q = q_1, q_2 … q_N; the state at time t is q_t
• Transition probabilities:
  – a set of probabilities A = a_01 a_02 … a_n1 … a_nn
  – each a_ij represents the probability of transitioning from state i to state j
  – the set of these is the transition probability matrix A
• Distinguished start and end states

  a_ij = P(q_t = j | q_t-1 = i),   1 ≤ i, j ≤ N
  Σ_j=1..N a_ij = 1,   1 ≤ i ≤ N

Markov chain = “First-order observable Markov Model”

• Markov Assumption: the current state depends only on the previous state

  P(q_i | q_1 … q_i-1) = P(q_i | q_i-1)

Another representation for start state

• Instead of a start state, use a special initial probability vector π
  – an initial distribution over the start states
• Constraints:

  π_i = P(q_1 = i),   1 ≤ i ≤ N
  Σ_j=1..N π_j = 1

The weather model using π


The weather model: specific example


Markov chain for weather

• What is the probability of 4 consecutive warm days?
• The sequence is warm-warm-warm-warm, i.e., the state sequence is 3-3-3-3

  P(3, 3, 3, 3) = π_3 · a_33 · a_33 · a_33 = 0.2 × (0.6)³ = 0.0432

How about?

• hot hot hot hot
• cold hot cold hot

• What does the difference in these probabilities tell you about the real world weather info encoded in the figure?


Markov Models

• Set of states: {s_1, s_2, …, s_N}
• The process moves from one state to another, generating a sequence of states s_i1, s_i2, …, s_ik, …
• Markov chain property: the probability of each subsequent state depends only on the previous state:

  P(s_ik | s_i1, s_i2, …, s_ik-1) = P(s_ik | s_ik-1)

• To define a Markov model, the following probabilities have to be specified:
  – transition probabilities a_ij = P(s_i | s_j)
  – initial probabilities π_i = P(s_i)

Example of Markov Model

• Two states: 'Rain' and 'Dry'
• Transition probabilities:
  P('Rain'|'Rain') = 0.3,  P('Dry'|'Rain') = 0.7,
  P('Rain'|'Dry') = 0.2,   P('Dry'|'Dry') = 0.8
• Initial probabilities: say P('Rain') = 0.4, P('Dry') = 0.6

Calculation of sequence probability

• By the Markov chain property, the probability of a state sequence can be found by the formula:

  P(s_i1, s_i2, …, s_ik) = P(s_ik | s_i1, …, s_ik-1) · P(s_i1, s_i2, …, s_ik-1)
                         = P(s_ik | s_ik-1) · P(s_i1, s_i2, …, s_ik-1)
                         = …
                         = P(s_ik | s_ik-1) · P(s_ik-1 | s_ik-2) · … · P(s_i2 | s_i1) · P(s_i1)

• Suppose we want to calculate the probability of the state sequence {'Dry','Dry','Rain','Rain'} in our example:

  P({'Dry','Dry','Rain','Rain'}) = P('Rain'|'Rain') · P('Rain'|'Dry') · P('Dry'|'Dry') · P('Dry')
                                 = 0.3 × 0.2 × 0.8 × 0.6 = 0.0288
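To make the calculation concrete, here is a minimal Python sketch (not part of the original slides; the names initial, transition and sequence_probability are just for illustration) that reproduces the 0.0288 result:

# Rain/Dry Markov chain from the example above, with the slide's numbers.
initial = {"Rain": 0.4, "Dry": 0.6}
transition = {                                  # keys are (previous, next)
    ("Rain", "Rain"): 0.3, ("Rain", "Dry"): 0.7,
    ("Dry", "Rain"): 0.2, ("Dry", "Dry"): 0.8,
}

def sequence_probability(states):
    """P(s_1, ..., s_k) = P(s_1) * product over t of P(s_t | s_{t-1})."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[(prev, cur)]
    return prob

print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))   # 0.0288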


Hidden Markov Model (HMM)

• Evidence can be observed, but the state is hidden
• Three components:
  – priors (initial state probabilities)
  – state transition model
  – evidence observation model
• Changes are assumed to be caused by a stationary process
  – the transition and observation models do not change over time

Hidden Markov Models

• Set of states: {s_1, s_2, …, s_N}
• The process moves from one state to another, generating a sequence of states s_i1, s_i2, …, s_ik, …
• Markov chain property: the probability of each subsequent state depends only on the previous state:

  P(s_ik | s_i1, s_i2, …, s_ik-1) = P(s_ik | s_ik-1)

• States are not visible, but each state randomly generates one of M observations (visible symbols) {v_1, v_2, …, v_M}

Hidden Markov Models

• The model is represented by M = (A, B, π).
• To define a hidden Markov model, the following probabilities have to be specified:
  – matrix of transition probabilities A = (a_ij), a_ij = P(s_i | s_j)
  – matrix of observation probabilities B = (b_i(v_m)), b_i(v_m) = P(v_m | s_i)
  – vector of initial probabilities π = (π_i), π_i = P(s_i)

What is an HMM?

• Graphical model
• Circles indicate states
• Arrows indicate probabilistic dependencies between states

HMM = Hidden Markov Model

What is an HMM?

• Green circles are hidden states
• Each depends only on the previous state
• "The past is independent of the future given the present."

What is an HMM?

• Purple nodes are observed states
• Each depends only on its corresponding hidden state

HMM Formalism

• {S, K, Π, A, B}
• S : {s_1 … s_N} are the values for the hidden states
• K : {k_1 … k_M} are the values for the observations

[Trellis: a chain of hidden states S, each emitting an observation K]

HMM Formalism

• {S, K, Π, A, B}
• Π = {π_i} are the initial state probabilities
• A = {a_ij} are the state transition probabilities
• B = {b_ik} are the observation probabilities

Note: sometimes one uses B = {b_ijk}; the output then depends on the previous state / transition as well.

Example of Hidden Markov Model

[Figure: hidden states 'Low' and 'High' with transition probabilities 0.3, 0.7, 0.2, 0.8; each state emits 'Rain' or 'Dry' with probabilities 0.6/0.4 (Low) and 0.4/0.6 (High).]

Example of Hidden Markov Model

• Two hidden states: 'Low' and 'High' atmospheric pressure
• Two observations: 'Rain' and 'Dry'
• Transition probabilities:
  P('Low'|'Low') = 0.3,  P('High'|'Low') = 0.7,
  P('Low'|'High') = 0.2, P('High'|'High') = 0.8
• Observation probabilities:
  P('Rain'|'Low') = 0.6,  P('Dry'|'Low') = 0.4,
  P('Rain'|'High') = 0.4, P('Dry'|'High') = 0.6
• Initial probabilities: say P('Low') = 0.4, P('High') = 0.6

Calculation of observation sequence probability

• Suppose we want to calculate the probability of a sequence of observations in our example, {'Dry','Rain'}.
• Consider all possible hidden state sequences:

  P({'Dry','Rain'}) = P({'Dry','Rain'}, {'Low','Low'}) + P({'Dry','Rain'}, {'Low','High'})
                    + P({'Dry','Rain'}, {'High','Low'}) + P({'Dry','Rain'}, {'High','High'})

  where the first term is:

  P({'Dry','Rain'}, {'Low','Low'}) = P({'Dry','Rain'} | {'Low','Low'}) · P({'Low','Low'})
                                   = P('Dry'|'Low') · P('Rain'|'Low') · P('Low') · P('Low'|'Low')
                                   = 0.4 × 0.6 × 0.4 × 0.3 = 0.0288
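A small illustrative Python sketch (not from the slides; names are my own) that brute-forces the same sum over all hidden state sequences, using the numbers given in the example above:

from itertools import product

states = ["Low", "High"]
initial = {"Low": 0.4, "High": 0.6}
trans = {("Low", "Low"): 0.3, ("Low", "High"): 0.7,
         ("High", "Low"): 0.2, ("High", "High"): 0.8}
emit = {("Low", "Rain"): 0.6, ("Low", "Dry"): 0.4,
        ("High", "Rain"): 0.4, ("High", "Dry"): 0.6}

def joint(obs, hidden):
    """P(observations, hidden state sequence)."""
    p = initial[hidden[0]] * emit[(hidden[0], obs[0])]
    for t in range(1, len(obs)):
        p *= trans[(hidden[t - 1], hidden[t])] * emit[(hidden[t], obs[t])]
    return p

obs = ["Dry", "Rain"]
total = sum(joint(obs, seq) for seq in product(states, repeat=len(obs)))
print(total)   # 0.232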

Main issues using HMMs:

• Evaluation problem: given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 … o_K, calculate the probability that model M has generated sequence O.
• Decoding problem: given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 … o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence O.
• Learning problem: given some training observation sequences O = o_1 o_2 … o_K and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data.

• O = o_1 … o_K denotes a sequence of observations, with o_k ∈ {v_1, …, v_M}.

HMM for Ice Cream

• You are a climatologist in the year 2799, studying global warming
• You can't find any records of the weather in Baltimore, MD for the summer of 2008
• But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
• Our job: figure out how hot it was

Hidden Markov Model

• For Markov chains, the output symbols are the same as the states.
  – See hot weather: we're in state hot
• But in named-entity or part-of-speech tagging (and speech recognition and other things):
  – the output symbols are words
  – but the hidden states are something else: part-of-speech tags, named entity tags
• So we need an extension!
• A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
• This means we don't know which state we are in.

POS Tagging : Current Performance

• How many tags are correct?
  – About 97% currently
  – But the baseline is already 90%
• Baseline is the performance of the stupidest possible method:
  – tag every word with its most frequent tag
  – tag unknown words as nouns

Input:  the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj

HMMs as POS Taggers

• Introduce Hidden Markov Models for part-of-speech tagging / sequence classification
• Cover the three fundamental questions of HMMs:
  – How do we fit the model parameters of an HMM?
  – Given an HMM, how do we efficiently calculate the likelihood of an observation w?
  – Given an HMM and an observation w, how do we efficiently calculate the most likely state sequence for w?

HMMs

• We want a model of sequences s and observations w
• Assumptions:
  – states are tag n-grams
  – usually a dedicated start and end state / word
  – the tag/state sequence is generated by a Markov model
  – words are chosen independently, conditioned only on the tag/state
  – these are totally broken assumptions: why?

[Trellis: hidden states s_0, s_1, s_2, …, s_n emitting words w_1, w_2, …, w_n]

Part-of-speech tagging

• 8 (ish) traditional parts of speech
  – noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
• This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
• Also called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS

POS examples

• N    noun         chair, bandwidth, pacing
• V    verb         study, debate, munch
• ADJ  adjective    purple, tall, ridiculous
• ADV  adverb       unfortunately, slowly
• P    preposition  of, by, to
• PRO  pronoun      I, me, mine
• DET  determiner   the, a, that, those

POS Tagging example

WORD     tag

the      DET
koala    N
put      V
the      DET
keys     N
on       P
the      DET
table    N

POS Tagging

• Words often have more than one POS: back
  – The back door = JJ
  – On my back = NN
  – Win the voters back = RB
  – Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.

(These examples from Dekang Lin)

POS tagging as a sequence classification task

• We are given a sentence (an "observation" or "sequence of observations")
  – Secretariat is expected to race tomorrow
  – She promised to back the bill
• What is the best sequence of tags which corresponds to this sequence of observations?
• Probabilistic view:
  – consider all possible sequences of tags
  – out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 … w_n.

Getting to HMM

• We want, out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest:

  t̂_1…n = argmax_{t_1 … t_n} P(t_1 … t_n | w_1 … w_n)

• The hat ^ means "our estimate of the best one"
• argmax_x f(x) means "the x such that f(x) is maximized"

Getting to HMM

• This equation is guaranteed to give us the best tag sequence
• But how do we make it operational? How do we compute this value?
• Intuition of Bayesian classification:
  – use Bayes' rule to transform it into a set of other probabilities that are easier to compute

Using Bayes Rule
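(The equation for this slide is not reproduced in the transcription; as a reconstruction, the Bayes-rule step it refers to is:)

  t̂_1…n = argmax_{t_1 … t_n} P(t_1 … t_n | w_1 … w_n)
        = argmax_{t_1 … t_n} P(w_1 … w_n | t_1 … t_n) · P(t_1 … t_n) / P(w_1 … w_n)
        = argmax_{t_1 … t_n} P(w_1 … w_n | t_1 … t_n) · P(t_1 … t_n)

since P(w_1 … w_n) is the same for every candidate tag sequence.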


Likelihood and prior

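(The equations for this slide are likewise missing from the transcription; the standard likelihood and prior approximations it refers to, consistent with the two slides that follow, are:)

  likelihood:  P(w_1 … w_n | t_1 … t_n) ≈ ∏_i=1..n P(w_i | t_i)
  prior:       P(t_1 … t_n) ≈ ∏_i=1..n P(t_i | t_i-1)

so that

  t̂_1…n ≈ argmax_{t_1 … t_n} ∏_i=1..n P(w_i | t_i) · P(t_i | t_i-1)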

Two kinds of probabilities (1)

• Tag transition probabilities P(t_i | t_i-1)
  – Determiners likely to precede adjectives and nouns
    • That/DT flight/NN
    • The/DT yellow/JJ hat/NN
    • So we expect P(NN|DT) and P(JJ|DT) to be high
    • But P(DT|JJ) to be low
  – Compute P(NN|DT) by counting in a labeled corpus:

    P(t_i | t_i-1) = C(t_i-1, t_i) / C(t_i-1),   e.g. P(NN|DT) = C(DT, NN) / C(DT)

Two kinds of probabilities (2)

• Word likelihood probabilities P(w_i | t_i)
  – VBZ (3sg pres verb) likely to be "is"
  – Compute P(is|VBZ) by counting in a labeled corpus:

    P(w_i | t_i) = C(t_i, w_i) / C(t_i),   e.g. P(is|VBZ) = C(VBZ, is) / C(VBZ)
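To make "counting in a labeled corpus" concrete, here is a small Python sketch; the toy corpus and the function names (p_trans, p_emit) are mine, purely for illustration:

from collections import Counter

tagged_sentences = [
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")],
    [("that", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
]

tag_bigrams, tag_counts, word_tag = Counter(), Counter(), Counter()
for sent in tagged_sentences:
    tags = ["<s>"] + [t for _, t in sent]
    tag_bigrams.update(zip(tags, tags[1:]))
    for word, tag in sent:
        tag_counts[tag] += 1
        word_tag[(tag, word)] += 1
tag_counts["<s>"] = len(tagged_sentences)

def p_trans(t_prev, t):    # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    return tag_bigrams[(t_prev, t)] / tag_counts[t_prev]

def p_emit(word, tag):     # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    return word_tag[(tag, word)] / tag_counts[tag]

print(p_trans("DT", "NN"), p_emit("is", "VBZ"))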

POS tagging: likelihood and prior


An Example: the verb “race”

• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR

• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

• How do we pick the right tag?


Disambiguating “race”


• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012

• P(VB|TO) · P(NR|VB) · P(race|VB) = .00000027
• P(NN|TO) · P(NR|NN) · P(race|NN) = .00000000032

• So we (correctly) choose the verb reading

Transitions between the hidden states of HMM, showing A probs


B observation likelihoods for POS HMM


The A matrix for the POS HMM


The B matrix for the POS HMM


Viterbi intuition: we are looking for the best ‘path’

[Lattice over the sentence "promised to back the bill": positions S1–S5, each with candidate tags such as VBD, VBN, TO, VB, JJ, NN, RB, DT, NNP; Viterbi searches for the best path through this lattice.]

Slide from Dekang Lin

Viterbi example


Motivations and Applications

• Part-of-speech tagging
  – The representative put chairs on the table
  – AT NN VBD NNS IN AT NN
  – AT JJ NN VBZ IN AT NN
• Some tags:
  – AT: article, NN: singular or mass noun, VBD: verb past tense, NNS: plural noun, IN: preposition, JJ: adjective

Markov Models

• Probabilistic finite state automaton
• Figure 9.1

What is the probability of a sequence of states ?

  P(X_1, …, X_T) = P(X_1) · P(X_2 | X_1) · P(X_3 | X_1, X_2) · … · P(X_T | X_1, …, X_T-1)
                 = P(X_1) · P(X_2 | X_1) · P(X_3 | X_2) · … · P(X_T | X_T-1)
                 = π_X1 · ∏_t=1..T-1 a_Xt Xt+1

Example

• Fig 9.1

  P(t, i, p) = P(X_1 = t) · P(X_2 = i | X_1 = t) · P(X_3 = p | X_2 = i)
             = 1.0 × 0.3 × 0.6 = 0.18

The crazy soft drink machine

• Fig 9.2, emission probabilities B:

  B     cola   iced tea   lemonade
  CP    0.6    0.1        0.3
  IP    0.1    0.7        0.2

Probability of {lem,ice} ?

• Sum over all paths taken through the HMM, starting in CP:
  – 1 × 0.3 × 0.7 × 0.1 (emit lemonade, stay in CP, emit iced tea)
  – + 1 × 0.3 × 0.3 × 0.7 (emit lemonade, move to IP, emit iced tea)
  = 0.021 + 0.063 = 0.084

Inference in an HMM

• Compute the probability of a given observation sequence

• Given an observation sequence, compute the most likely hidden state sequence

• Given an observation sequence and set of possible models, which model most closely fits the data?

Decoding: computing the probability of an observation sequence

Given an observation sequence O = (o_1 … o_T) and a model μ = (A, B, π), compute P(O | μ).

For a fixed state sequence X = (x_1 … x_T):

  P(O | X, μ) = b_x1(o_1) · b_x2(o_2) · … · b_xT(o_T)

  P(X | μ) = π_x1 · a_x1 x2 · a_x2 x3 · … · a_xT-1 xT

  P(O, X | μ) = P(O | X, μ) · P(X | μ)

Summing over all state sequences:

  P(O | μ) = Σ_X P(O | X, μ) · P(X | μ)
           = Σ_{x1 … xT} π_x1 · b_x1(o_1) · ∏_t=1..T-1 a_xt xt+1 · b_xt+1(o_t+1)

Complexity: O(N^T · 2T). E.g. N = 5, T = 100 gives 2 · 100 · 5^100 ≈ 10^72 operations.

Dynamic Programming

Forward Procedure

Define the forward variable

  α_i(t) = P(o_1 … o_t, x_t = i | μ)

• The special structure of the model gives us an efficient solution using dynamic programming.
• Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences.

Initialization:

  α_i(1) = P(o_1, x_1 = i | μ) = π_i · b_i(o_1)

Induction:

  α_j(t+1) = P(o_1 … o_t o_t+1, x_t+1 = j)
           = P(o_1 … o_t o_t+1 | x_t+1 = j) · P(x_t+1 = j)
           = P(o_1 … o_t | x_t+1 = j) · P(o_t+1 | x_t+1 = j) · P(x_t+1 = j)
           = P(o_1 … o_t, x_t+1 = j) · P(o_t+1 | x_t+1 = j)
           = Σ_i=1..N P(o_1 … o_t, x_t = i, x_t+1 = j) · P(o_t+1 | x_t+1 = j)
           = Σ_i=1..N P(o_1 … o_t, x_t+1 = j | x_t = i) · P(x_t = i) · P(o_t+1 | x_t+1 = j)
           = Σ_i=1..N P(o_1 … o_t, x_t = i) · P(x_t+1 = j | x_t = i) · P(o_t+1 | x_t+1 = j)
           = Σ_i=1..N α_i(t) · a_ij · b_j(o_t+1)

Dynamic Programming

  α_j(t+1) = Σ_i=1..N α_i(t) · a_ij · b_j(o_t+1)

Complexity: O(N² · T). E.g. N = 5, T = 100 gives about 3000 operations (versus ~10^72 for the direct sum).
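A compact Python sketch of the forward procedure (illustrative only; the array names pi, A, B follow the slides' notation, and the example numbers are the Low/High model from earlier):

import numpy as np

def forward(pi, A, B, obs):
    """Return alpha where alpha[t, i] = P(o_1..o_t, x_t = i)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_i(1) = pi_i * b_i(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum_i alpha_i(t) a_ij, times b_j(o_{t+1})
    return alpha

# Low/High pressure HMM; observations encoded as Dry=0, Rain=1.
pi = np.array([0.4, 0.6])                    # [Low, High]
A  = np.array([[0.3, 0.7], [0.2, 0.8]])
B  = np.array([[0.4, 0.6], [0.6, 0.4]])      # rows: Low, High; cols: Dry, Rain
print(forward(pi, A, B, [0, 1]).sum(axis=1)[-1])   # P({'Dry','Rain'}) = 0.232, as in the brute-force sum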

Backward Procedure

  β_i(t) = P(o_t+1 … o_T | x_t = i)

  β_i(T) = 1
  β_i(t) = Σ_j=1..N a_ij · b_j(o_t+1) · β_j(t+1)

The probability of the remaining observations, given the current state.
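And a matching Python sketch of the backward procedure, under the same conventions as the forward sketch above (illustrative only):

import numpy as np

def backward(A, B, obs):
    """Return beta where beta[t, i] = P(o_{t+1}..o_T | x_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                        # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])    # sum_j a_ij b_j(o_{t+1}) beta_j(t+1)
    return beta

# Combined with the forward sketch: P(O) = sum_i pi[i] * B[i, obs[0]] * beta[0, i].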

Decoding Solution

  Forward procedure:   P(O | μ) = Σ_i=1..N α_i(T)
  Backward procedure:  P(O | μ) = Σ_i=1..N π_i · b_i(o_1) · β_i(1)
  Combination:         P(O | μ) = Σ_i=1..N α_i(t) · β_i(t), for any t

Why the combination works:

  P(O, X_t = i | μ) = P(o_1 … o_t, X_t = i, o_t+1 … o_T | μ)
                    = P(o_1 … o_t, X_t = i | μ) · P(o_t+1 … o_T | o_1 … o_t, X_t = i, μ)
                    = P(o_1 … o_t, X_t = i | μ) · P(o_t+1 … o_T | X_t = i, μ)
                    = α_i(t) · β_i(t)

  Hence P(O | μ) = Σ_i=1..N α_i(t) · β_i(t).

Best State Sequence

• Find the state sequence that best explains the observations
• Two approaches:
  – individually most likely states
  – most likely sequence (Viterbi)

  X̂ = argmax_X P(X | O)

Best State Sequence (1): individually most likely states

  γ_i(t) = P(X_t = i | O, μ) = P(X_t = i, O | μ) / P(O | μ)
         = α_i(t) · β_i(t) / Σ_j=1..N α_j(t) · β_j(t)

  The most likely state at each point in time: X̂_t = argmax_i γ_i(t)

Best State Sequence (2)

• Find the state sequence that best explains the observations
• Viterbi algorithm

  X̂ = argmax_X P(X | O)

Viterbi Algorithm

Define

  δ_j(t) = max_{x1 … xt-1} P(o_1 … o_t-1, x_1 … x_t-1, x_t = j, o_t)

i.e. the probability of the best state sequence that accounts for the observations up to time t-1, lands in state j, and emits the observation at time t.

Initialization:

  δ_i(1) = π_i · b_i(o_1)
  ψ_i(1) = 0

Recursive computation:

  δ_j(t+1) = max_i δ_i(t) · a_ij · b_j(o_t+1)
  ψ_j(t+1) = argmax_i δ_i(t) · a_ij · b_j(o_t+1)

Backtracking (compute the most likely state sequence by working backwards):

  X̂_T = argmax_i δ_i(T)
  X̂_t = ψ_{X̂_t+1}(t+1)
  P(X̂) = max_i δ_i(T)
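A short Viterbi sketch in Python following the recursion above (illustrative only; delta and psi mirror the slides' δ and ψ, and the example model is the Low/High HMM from before):

import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely hidden state sequence for obs."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # delta_i(1) = pi_i * b_i(o_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_i(t) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # times b_j(o_{t+1})
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path))

pi = np.array([0.4, 0.6])                    # [Low, High]
A = np.array([[0.3, 0.7], [0.2, 0.8]])
B = np.array([[0.4, 0.6], [0.6, 0.4]])       # cols: Dry, Rain
print(viterbi(pi, A, B, [0, 1, 1, 0]))       # [1, 1, 1, 1]: 'High' throughout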

HMMs and Bayesian Nets (1)

  P(x_1 … x_T, o_1 … o_T) = P(x_1) · P(o_1 | x_1) · ∏_i=1..T-1 P(x_i+1 | x_i) · P(o_i+1 | x_i+1)
                          = π_x1 · b_x1(o_1) · ∏_t=1..T-1 a_xt xt+1 · b_xt+1(o_t+1)

HMM and Bayesian Nets (2)

• Given the present state x_t, the future (x_t+1, o_t+1, …) is conditionally independent of the past (x_t-1, o_t-1, …)
• Because of d-separation
• "The past is independent of the future given the present."

Inference in an HMM

• Compute the probability of a given observation sequence

• Given an observation sequence, compute the most likely hidden state sequence

• Given an observation sequence and set of possible models, which model most closely fits the data?


Dynamic Programming


Parameter Estimation

• Given an observation sequence, find the model that is most likely to produce that sequence.
• There is no analytic method.
• Instead, given a model and an observation sequence, iteratively update the model parameters to better fit the observations:

  argmax_μ P(O_training | μ)

Recall:

  α_i(t) = P(o_1 … o_t, x_t = i | μ)
  β_i(t) = P(o_t+1 … o_T | x_t = i)
  P(O | μ) = Σ_i=1..N α_i(t) · β_i(t)

Define:

  p_t(i, j) = P(X_t = i, X_t+1 = j | O, μ) = P(X_t = i, X_t+1 = j, O | μ) / P(O | μ)

Parameter Estimation

  p_t(i, j) = α_i(t) · a_ij · b_j(o_t+1) · β_j(t+1) / Σ_m=1..N α_m(t) · β_m(t)
      (probability of traversing the arc from state i to state j at time t)

  γ_i(t) = Σ_j=1..N p_t(i, j)
      (probability of being in state i at time t)

Parameter Estimation

Now we can compute the new estimates of the model parameters:

  π̂_i = γ_i(1)

  â_ij = Σ_t=1..T-1 p_t(i, j) / Σ_t=1..T-1 γ_i(t)

  b̂_ik = Σ_{t : o_t = k} γ_i(t) / Σ_t=1..T γ_i(t)
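A condensed Python sketch of one Baum-Welch re-estimation step built from the quantities above (single observation sequence, no smoothing or convergence check; names and structure are my own, purely illustrative):

import numpy as np

def baum_welch_step(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                               # P(O | mu)

    # p_t(i,j) and gamma_i(t) as defined on the slides.
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_obs
    gamma = alpha * beta / p_obs

    # Re-estimated parameters.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B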

Instance of Expectation Maximization

• We have that P(O | μ̂) ≥ P(O | μ), where μ̂ is the re-estimated model
• We may get stuck in a local maximum (or even a saddle point)
• Nevertheless, Baum-Welch is usually effective

Some Variants

• So far, ergodic models
  – all states are connected
  – not always wanted
• Epsilon or null transitions
  – not all states/transitions emit output symbols
• Parameter tying
  – assuming that certain parameters are shared
  – reduces the number of parameters that have to be estimated
• Logical HMMs (Kersting, De Raedt, Raiko)
  – working with structured states and observation symbols
• Working with log probabilities and addition instead of multiplication of probabilities (typically done)

The Most Important Thing

We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.