Review: Hidden Markov Models
• Efficient dynamic programming algorithms exist for
  – Finding Pr(S)
  – The highest-probability path P that maximizes Pr(S,P) (Viterbi)
• Training the model
  – State sequence known: MLE + smoothing
  – Otherwise: Baum-Welch algorithm
[Figure: four-state HMM with states S1–S4, transition probabilities 0.9, 0.5, 0.5, 0.8, 0.2, 0.1 on the arcs, and a per-state emission distribution over {A, C} (0.6/0.4, 0.3/0.7, 0.5/0.5, 0.9/0.1).]
HMM for Segmentation
• Simplest Model: One state per entity type
HMM Learning
• Manually pick the HMM’s graph (e.g., simple model, fully connected)
• Learn transition probabilities: Pr(si|sj)
• Learn emission probabilities: Pr(w|si)
Learning model parameters
• When training data defines a unique path through the HMM:
  – Transition probabilities:
    Pr(transition from state i to state j) = (number of transitions from i to j) / (total transitions from state i)
  – Emission probabilities:
    Pr(emitting symbol k from state i) = (number of times k is generated from i) / (total symbols generated from i)
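In the unique-path case the estimates above reduce to counting. A minimal sketch, assuming a toy input format of (symbol, state) pairs — the format and example data are made up for illustration, not from the slides:

```python
from collections import Counter, defaultdict

def normalize(counts):
    # Turn a dict of Counters into conditional probability tables.
    return {ctx: {x: n / sum(row.values()) for x, n in row.items()}
            for ctx, row in counts.items()}

def mle_hmm(tagged_seqs):
    """tagged_seqs: list of [(symbol, state), ...]; states are observed."""
    trans = defaultdict(Counter)  # trans[i][j] = # transitions i -> j
    emit = defaultdict(Counter)   # emit[i][k]  = # times state i emits k
    for seq in tagged_seqs:
        for sym, s in seq:
            emit[s][sym] += 1
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[s1][s2] += 1
    return normalize(trans), normalize(emit)

data = [[("15213", "House"), ("Butler", "Road"), ("Greenville", "City")],
        [("15213", "House"), ("Main", "Road"), ("Greenville", "City")]]
T, E = mle_hmm(data)
```

With this data every transition out of House goes to Road, so T["House"]["Road"] is 1.0, while E["Road"] splits 0.5/0.5 between "Butler" and "Main". Smoothing (as the review slide notes) would be added on top of these raw counts.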
What is a “symbol” ???
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
4601 => “4601”, “9999”, “9+”, “number”, … ?
[Figure: symbol-abstraction taxonomy. All → Numbers {3-digits (000…999), 5-digits (00000…99999), Others (0…99, 0000…9999, 000000…)}, Chars (A…z), Words {Multi-letter (aa…)}, Delimiters (. , / - + ? #).]
Datamold: choose best abstraction level using holdout set
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: linear-chain graphical model with states S_{t−1}, S_t, S_{t+1} emitting observations O_{t−1}, O_t, O_{t+1}. Candidate features of a token such as “Wisniewski”: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …]
We can extend the HMM model so that each state generates multiple “features” – but they should be independent.
Borthwick et al solution
We could use YFCL: an SVM, logistic regression, a decision tree, …. We’ll be talking about logistic regression.
[Figure: the same state/observation diagram with per-token features, repeated.]
Instead of an HMM, classify each token. Don’t learn transition probabilities, instead constrain them at test time.
Stupid HMM tricks
[Figure: two-state HMM; from the start state, Pr(red) and Pr(green); self-loops with Pr(red|red) = 1 and Pr(green|green) = 1.]
Pr(y|x) = Pr(x|y) · Pr(y) / Pr(x)

argmax_y Pr(y|x) = argmax_y Pr(x|y) · Pr(y)
                 = argmax_y Pr(y) · Pr(x1|y) · Pr(x2|y) · … · Pr(xm|y)

Pr(“I voted for Ralph Nader” | ggggg)
  = Pr(g) · Pr(I|g) · Pr(voted|g) · Pr(for|g) · Pr(Ralph|g) · Pr(Nader|g)
From NB to Maxent

Naïve Bayes as a conditional model:

  Pr(y|x) = (1/Z) Pr(y) ∏_{j : w_j is a word in x} Pr(w_j|y)

Define one indicator feature per (j,k) combination i:

  f_i(doc, y) = [word w_j appears at some position of doc, and y is class k ? 1 : 0]

with one parameter λ_i per feature f_i(x, y), plus a bias λ_0.
From NB to Maxent

Or:

  log Pr(y|x) = λ_0 + Σ_i λ_i f_i(x, y)   (up to the normalizer)
Idea: keep the same functional form as naïve Bayes, but pick the parameters to optimize performance on training data.
One possible definition of performance is conditional log likelihood of the data:
  Σ_t log Pr(y_t | x_t)
MaxEnt Comments
– Implementation:
  • All methods are iterative
  • For NLP-like problems with many features, modern gradient-like or Newton-like methods work well
  • Thursday I’ll derive the gradient for CRFs
– Smoothing:
  • Typically maxent will overfit data if there are many infrequent features
  • Old-school solutions: discard low-count features; early stopping with a holdout set; …
  • Modern solutions: penalize large parameter values with a prior centered on zero to limit the size of the alphas (i.e., optimize log likelihood − sum of alphas); other regularization techniques
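A minimal sketch of maxent training as just described: gradient ascent on the conditional log likelihood Σ_t log Pr(y_t|x_t), with a zero-centered L2 penalty playing the role of the prior. The word-indicator features, toy data, and step size are all invented for illustration:

```python
import math

def features(x, y):
    # f_{(w,y)}(x, y) = 1 if word w appears in doc x with label y
    return {(w, y): 1.0 for w in x}

def prob(lam, x, classes):
    # Pr(y|x) proportional to exp(sum of weights of active features)
    scores = {y: sum(lam.get(f, 0.0) for f in features(x, y)) for y in classes}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def train(data, classes, rate=0.1, l2=0.01, iters=1000):
    lam = {}
    for _ in range(iters):
        grad = {f: -l2 * v for f, v in lam.items()}  # penalty gradient
        for x, y in data:
            p = prob(lam, x, classes)
            for f in features(x, y):                 # empirical count
                grad[f] = grad.get(f, 0.0) + 1.0
            for y2 in classes:                       # minus expected count
                for f in features(x, y2):
                    grad[f] = grad.get(f, 0.0) - p[y2]
        for f, g in grad.items():
            lam[f] = lam.get(f, 0.0) + rate * g
    return lam

data = [({"won", "game"}, "sports"), ({"vote", "senate"}, "politics"),
        ({"game", "score"}, "sports"), ({"senate", "bill"}, "politics")]
lam = train(data, ["sports", "politics"])
p = prob(lam, {"game"}, ["sports", "politics"])
```

The gradient has the classic maxent form (empirical feature counts minus expected counts under the model), and the L2 term keeps the weights from diverging on features that appear only with one class.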
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: the same state/observation diagram with per-token features, repeated.]
Borthwick et al idea
[Figure: the same state/observation diagram with per-token features, repeated.]
Idea: replace generative model in HMM with a maxent model, where state depends on observations
  Pr(s_t | x_t, …)
Another idea….
[Figure: the same state/observation diagram with per-token features, repeated.]
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state
  Pr(s_t | x_t, s_{t−1}, …)
MaxEnt taggers and MEMMs
[Figure: the same state/observation diagram with per-token features, repeated.]
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history
  Pr(s_t | x_t, s_{t−1}, s_{t−2}, …)
Learning does not change – you’ve just added a few additional features that are the previous labels.
Classification is trickier – we don’t know the previous-label features at test time – so we will need to search for the best sequence of labels (like for an HMM).
Partial history of the idea
• Sliding-window classifiers
  – Sejnowski’s NETtalk, mid 1980’s
• Recurrent neural networks and other “recurrent” sliding-window classifiers
  – Late 1980’s and 1990’s
• Ratnaparkhi’s thesis
  – Mid-late 1990’s
• Freitag, McCallum & Pereira, ICML 2000
  – Formalize the notion of MEMM
• OpenNLP
  – Based largely on MaxEnt taggers; Apache open source
Ratnaparkhi’s MXPOST
• Sequential learning problem: predict POS tags of words
• Uses the MaxEnt model described above
• Rich feature set
• To smooth, discard features occurring < 10 times
MXPOST
MXPOST: learning & inference
• GIS (generalized iterative scaling)
• Feature selection
Using the HMM to segment
• Find highest-probability path through the HMM
• Viterbi: quadratic dynamic-programming algorithm

[Figure: lattice with states House, Road, City, Pin at each token of “15213 Butler Highway Greenville 21578”; Viterbi finds the best path through it.]
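The Viterbi recursion over that lattice can be sketched as below. The states and tokens follow the slide’s address example, but every probability is invented for illustration:

```python
import math

def viterbi(tokens, states, log_start, log_trans, log_emit):
    # best[t][s]: log-prob of the best path ending in state s at position t
    best = [{s: log_start[s] + log_emit(s, tokens[0]) for s in states}]
    back = []
    for tok in tokens[1:]:
        prev, col, ptr = best[-1], {}, {}
        for s in states:
            p, arg = max((prev[s0] + log_trans[s0][s], s0) for s0 in states)
            col[s] = p + log_emit(s, tok)
            ptr[s] = arg
        best.append(col)
        back.append(ptr)
    s = max(best[-1], key=best[-1].get)   # best final state
    path = [s]
    for ptr in reversed(back):            # follow back-pointers
        s = ptr[s]
        path.append(s)
    return path[::-1]

STATES = ["House", "Road", "City", "Pin"]
START = {"House": 0.7, "Road": 0.1, "City": 0.1, "Pin": 0.1}
TRANS = {"House": {"House": 0.03, "Road": 0.9,  "City": 0.03, "Pin": 0.04},
         "Road":  {"House": 0.05, "Road": 0.5,  "City": 0.4,  "Pin": 0.05},
         "City":  {"House": 0.05, "Road": 0.05, "City": 0.1,  "Pin": 0.8},
         "Pin":   {"House": 0.2,  "Road": 0.2,  "City": 0.2,  "Pin": 0.4}}

def log_emit(s, tok):
    # toy emissions: digit tokens favor House and Pin, words favor Road/City
    p = 0.9 if (tok.isdigit() == (s in ("House", "Pin"))) else 0.1
    return math.log(p)

log_start = {s: math.log(p) for s, p in START.items()}
log_trans = {a: {b: math.log(p) for b, p in r.items()} for a, r in TRANS.items()}

path = viterbi(["15213", "Butler", "Highway", "Greenville", "21578"],
               STATES, log_start, log_trans, log_emit)
```

Each position costs |states|² max operations, which is the “quadratic” in the slide’s description.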
Inference for MENE (Borthwick et al system)
[Lattice: candidate tags B, I, O at each token of “When will prof Cohen post the notes …”]

Goal: best legal path through the lattice (i.e., the path that runs through the most black ink; like Viterbi, but the costs of possible transitions are ignored).
Inference for MXPOST
[Lattice: candidate tags B, I, O at each token of “When will prof Cohen post the notes …”]

  Pr(y_i | x) = Pr(y_i | x_i, y_{i−1}, …, y_1)
              ≈ Pr(y_i | x_i, y_{i−1}, …, y_{i−k})   (window of k tags)
              = Pr(y_i | x_i, y_{i−1})               (k = 1)

(Approximate view): find the best path; weights are now on arcs from state to state.
Inference for MXPOST
[Same B/I/O lattice over “When will prof Cohen post the notes …”]

More accurately: find the total flow to each node; weights are now on arcs from state to state:

  α_t(y) = Σ_{y′} α_{t−1}(y′) · Pr(Y_t = y | x, Y_{t−1} = y′)
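The flow recursion above is a few lines of code. Here the locally-normalized conditional Pr(y | x_t, y′) is a hypothetical hand-built table, standing in for a trained MaxEnt classifier:

```python
def local(y_prev, tok):
    # toy Pr(y | token, previous tag): I is only reachable from B or I.
    # (tok is unused in this stand-in table; a real model would use it.)
    if y_prev in (None, "O"):
        return {"B": 0.5, "I": 0.0, "O": 0.5}
    return {"B": 0.2, "I": 0.5, "O": 0.3}

def forward(tokens, tags):
    # alpha_t(y) = sum_{y'} alpha_{t-1}(y') * Pr(y | x_t, y')
    alpha = [local(None, tokens[0])]
    for tok in tokens[1:]:
        prev = alpha[-1]
        alpha.append({y: sum(prev[y0] * local(y0, tok)[y] for y0 in tags)
                      for y in tags})
    return alpha

alpha = forward("When will prof Cohen post the notes".split(), ["B", "I", "O"])
```

Because each local table sums to 1, every column of α sums to 1: total flow is conserved at every position, which is exactly the property the later “label bias” slides turn into a criticism.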
Inference for MXPOST
[Same B/I/O lattice; now each tag depends on the two previous tags.]

  Pr(y_i | x) = Pr(y_i | x_i, y_{i−1}, …, y_1)
              ≈ Pr(y_i | x_i, y_{i−1}, …, y_{i−k})   (window of k tags)
              = Pr(y_i | x_i, y_{i−1}, y_{i−2})      (k = 2)

Find best path? tree? Weights are on hyperedges.
Inference for MxPOST
[Lattice with pair states that encode the previous tag — iI, iO, oI, oO — at each token of “When will prof Cohen post the notes …”]
Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.
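That procedure can be sketched directly. The scoring function here is a hypothetical stand-in for a MaxEnt tagger’s log Pr(tag | token, previous tags):

```python
import math

def beam_search(tokens, tags, score, n=2):
    beam = [((), 0.0)]  # (tag history, cumulative log-score)
    for tok in tokens:
        # extend every hypothesis on the beam with every tag ...
        children = [(hist + (t,), s + score(hist, t, tok))
                    for hist, s in beam for t in tags]
        # ... then score and discard all but the top n
        children.sort(key=lambda c: c[1], reverse=True)
        beam = children[:n]
    return beam[0][0]   # best surviving complete hypothesis

def score(hist, tag, tok):
    # toy local model: capitalized tokens prefer B, others prefer O
    table = ({"B": 0.8, "I": 0.15, "O": 0.05} if tok[0].isupper()
             else {"B": 0.05, "I": 0.15, "O": 0.8})
    return math.log(table[tag])

tags = beam_search(["Cohen", "will", "post"], ["B", "I", "O"], score)
```

Unlike Viterbi, beam search is approximate: a prefix pruned early can never be recovered, but the cost per token drops from quadratic in the state count to n × |tags|.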
MXPost results
• State-of-the-art accuracy (for 1996)
• Same approach used successfully for several other sequential classification steps of a stochastic parser (also state of the art)
• Same (or similar) approaches used for NER by Borthwick, Malouf, Manning, and others
MEMMs
• Basic difference from ME tagging:
  – ME tagging: previous state is a feature of the MaxEnt classifier
  – MEMM: build a separate MaxEnt classifier for each state
    • Can build any HMM architecture you want, e.g. parallel nested HMMs, etc.
    • Data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun”
  – Mostly a difference in viewpoint
  – MEMM does allow the possibility of “hidden” states and Baum-Welch-like training
MEMM task: FAQ parsing
MEMM features
MEMMs
Looking forward
• HMMs
  – Easy-to-train generative model
  – Features for a state must be independent (−)
• MaxEnt tagger/MEMM
  – Multiple cascaded classifiers
  – Features can be arbitrary (+)
  – Have we given anything up?
HMM inference

[Lattice: states House, Road, City, Pin at each token of “15213 Butler … 21578”]

• Total probability of transitions out of a state must sum to 1
• But… they can all lead to “unlikely” states
• So… a state can be a (probable) “dead end” in the lattice
Inference for MXPOST
[Same B/I/O lattice over “When will prof Cohen post the notes …”]

More accurately: find the total flow to each node; weights are now on arcs from state to state:

  α_t(y) = Σ_{y′} α_{t−1}(y′) · Pr(Y_t = y | x, Y_{t−1} = y′)

Flow out of each node is always fixed:

  ∀ y′:  Σ_y Pr(Y_t = y | x, Y_{t−1} = y′) = 1
Label Bias Problem (Lafferty, McCallum, Pereira ICML 2001)
• Consider this MEMM, and enough training data to perfectly model it:
[Figure: MEMM with start state 0 and final state 3; path 0→1→2→3 spells r-o-b, path 0→4→5→3 spells r-i-b.]

Pr(0123|rob) = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1
Pr(0453|rib) = Pr(4|0,r)/Z1′ · Pr(5|4,i)/Z2′ · Pr(3|5,b)/Z3′ = 0.5 · 1 · 1

But states 1, 2, 4 and 5 each have only one outgoing transition, so per-state normalization forces that transition to probability 1 regardless of the observation: Pr(0123|rib) = Pr(0123|rob) and Pr(0453|rob) = Pr(0453|rib). The model cannot prefer “rob” over “rib”.
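The label-bias example can be checked numerically. Below, the per-state transition tables are the ones a locally-normalized (MEMM-style) model would learn from training data containing “rob” → 0,1,2,3 and “rib” → 0,4,5,3; the table layout is my own sketch of that model:

```python
def path_prob(word, path, tables):
    # product over positions of Pr(next state | current state, observed char)
    p = 1.0
    for ch, (s0, s1) in zip(word, zip(path, path[1:])):
        p *= tables[s0][ch][s1]
    return p

tables = {
    0: {"r": {1: 0.5, 4: 0.5}},          # 'r' precedes both branches
    1: {"o": {2: 1.0}, "i": {2: 1.0}},   # single successor: probability 1
    2: {"b": {3: 1.0}},                  # regardless of the observation
    4: {"i": {5: 1.0}, "o": {5: 1.0}},
    5: {"b": {3: 1.0}},
}

p_rob = path_prob("rob", [0, 1, 2, 3], tables)
p_rib = path_prob("rib", [0, 1, 2, 3], tables)
```

Both words assign probability 0.5 to the 0→1→2→3 path: after the first branch, local normalization throws the observations away.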
Another max-flow scheme
[Same B/I/O lattice over “When will prof Cohen post the notes …”]

More accurately: find the total flow to each node; weights are now on arcs from state to state:

  α_t(y) = Σ_{y′} α_{t−1}(y′) · Pr(Y_t = y | x, Y_{t−1} = y′)

Flow out of a node is always fixed:

  ∀ y′:  Σ_y Pr(Y_t = y | x, Y_{t−1} = y′) = 1
Another max-flow scheme: MRFs
[Same B/I/O lattice over “When will prof Cohen post the notes …”]

Goal is to learn how to weight edges in the graph:
• weight(y_i, y_{i+1}) = 2·[(y_i = B or I) and isCap(x_i)]
                        + 1·[y_i = B and isFirstName(x_i)]
                        − 5·[y_{i+1} ≠ B and isLower(x_i) and isUpper(x_{i+1})]
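Scoring a tag sequence under such hand-set edge weights is just a sum over edges — the unnormalized MRF view. The lexicon and feature tests below are hypothetical stand-ins:

```python
FIRST_NAMES = {"cohen"}  # toy first-name lexicon

def is_cap(w):
    return w[0].isupper()

def edge_weight(y0, y1, x0, x1):
    w = 0.0
    if y0 in ("B", "I") and is_cap(x0):
        w += 2.0                               # 2*[(yi=B or I) and isCap(xi)]
    if y0 == "B" and x0.lower() in FIRST_NAMES:
        w += 1.0                               # 1*[yi=B and isFirstName(xi)]
    if y1 != "B" and x0.islower() and is_cap(x1):
        w -= 5.0                               # -5*[yi+1 != B and isLower(xi)
    return w                                   #      and isUpper(xi+1)]

def sequence_score(tags, tokens):
    # total score of a labeling = sum of edge weights along the path
    return sum(edge_weight(tags[i], tags[i + 1], tokens[i], tokens[i + 1])
               for i in range(len(tags) - 1))

tokens = ["prof", "Cohen", "post"]
```

On these tokens, tagging “Cohen” as B scores 3.0 while leaving everything O scores −5.0 (it pays the penalty for not opening an entity at a capitalized word). Unlike an MEMM, nothing here is locally normalized; learning (next slides) adjusts the edge weights globally.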
Another max-flow scheme: MRFs
[Same B/I/O lattice over “When will prof Cohen post the notes …”]

Find the total flow to each node; weights are now on edges from state to state. Goal is to learn how to weight edges in the graph, given features from the examples.
Another view of label bias [Sha & Pereira]
So what’s the alternative?