Introduction to Markov Models
Estimating the probability of phrases of words, sentences, etc.…
But first:
A few preliminaries
What counts as a word? A tricky question….
How to find Sentences??
Q1: How to estimate the probability of a given sentence W?

A crucial step in speech recognition (and lots of other applications)
First guess: products of unigrams

    $\hat{P}(W) = \prod_{w \in W} P(w)$

Given word lattice:

    form    subsidy     for
    farm    subsidies   far

Unigram counts (in 1.7 × 10^6 words of AP text):

    form 183     subsidy 15      for 18185
    farm 74      subsidies 55    far 570

Not quite right…
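A minimal sketch (not from the original slides) of scoring each lattice path with this unigram product, using the AP counts above; N = 1.7 × 10^6 is the slide's corpus size:

```python
# Score each path through the lattice as a product of unigram probabilities.
# Counts are the AP-corpus figures from the slide; N is the corpus size.
N = 1_700_000
unigram_counts = {
    "form": 183, "subsidy": 15, "for": 18185,
    "farm": 74, "subsidies": 55, "far": 570,
}

def unigram_path_prob(words):
    """P̂(W) = product of P(w) = Count(w)/N over the words in the path."""
    p = 1.0
    for w in words:
        p *= unigram_counts[w] / N
    return p

print(unigram_path_prob(["form", "subsidy", "for"]))
print(unigram_path_prob(["farm", "subsidies", "far"]))
```

Note that the path built from the most frequent individual words wins regardless of whether the words fit together, which is one sense in which this guess is "not quite right".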
Predicting a word sequence II

Next guess: products of bigrams
• For W = w1 w2 w3 … wn,

    $\hat{P}(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$

Given word lattice:

    form    subsidy     for
    farm    subsidies   far

Bigram counts (in 1.7 × 10^6 words of AP text):

    form subsidy 0       subsidy for 2
    form subsidies 0     subsidy far 0
    farm subsidy 0       subsidies for 6
    farm subsidies 4     subsidies far 0

Better (if not quite right) … (But the counts are tiny! Why?)
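A sketch of the corresponding bigram scoring (again not from the slides), combining the bigram counts here with the unigram counts from the previous slide via the MLE Count(w_{i-1} w_i) / Count(w_{i-1}); the choice of the first word is ignored for brevity:

```python
# Score a lattice path as a product of MLE bigram probabilities.
unigram_counts = {"form": 183, "subsidy": 15, "for": 18185,
                  "farm": 74, "subsidies": 55, "far": 570}
bigram_counts = {("form", "subsidy"): 0, ("subsidy", "for"): 2,
                 ("form", "subsidies"): 0, ("subsidy", "far"): 0,
                 ("farm", "subsidy"): 0, ("subsidies", "for"): 6,
                 ("farm", "subsidies"): 4, ("subsidies", "far"): 0}

def bigram_path_prob(words):
    """Product over i of P(w_i | w_{i-1}) = Count(w_{i-1} w_i) / Count(w_{i-1})."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_counts.get((prev, word), 0) / unigram_counts[prev]
    return p

print(bigram_path_prob(["farm", "subsidies", "for"]))  # nonzero bigram counts
print(bigram_path_prob(["form", "subsidy", "far"]))    # hits zero counts -> 0.0
```

The zero entries in the table are exactly why the counts are so tiny: with only 1.7 million words, most perfectly reasonable bigrams never occur even once.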
How can we estimate P correctly?

Problem: Naïve Bayes model for bigrams violates independence assumptions.

Let's do this right….

Let W = w1 w2 w3 … wn. Then, by the chain rule,

    $P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdot \ldots \cdot P(w_n \mid w_1 \ldots w_{n-1})$

We can estimate P(w2|w1) by the Maximum Likelihood Estimator

    $\frac{Count(w_1 w_2)}{Count(w_1)}$

and P(w3|w1w2) by

    $\frac{Count(w_1 w_2 w_3)}{Count(w_1 w_2)}$

and so on…
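As a concrete instance of the chain-rule decomposition, for one three-word path from the lattice above:

    $P(\text{farm subsidies far}) = P(\text{farm}) \cdot P(\text{subsidies} \mid \text{farm}) \cdot P(\text{far} \mid \text{farm subsidies})$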
and finally, Estimating P(wn|w1 w2 … wn-1)

Again, we can estimate P(wn|w1 w2 … wn-1) with the MLE

    $\frac{Count(w_1 w_2 \ldots w_n)}{Count(w_1 w_2 \ldots w_{n-1})}$

So to decide pat vs. pot in "Heat up the oil in a large p?t", compute for pot

    $\frac{Count(\text{"Heat up the oil in a large pot"})}{Count(\text{"Heat up the oil in a large"})}$

But in the 1.7 × 10^6-word AP corpus, both of these counts are 0.
Hmm… The Web Changes Things (2008 or so)
Even the web in 2008 yields low counts!
Statistics and the Web II
So, P(“pot”|”heat up the oil in a large___”) = 8/49 ≈ 0.16
But the web has grown!!!
Now the same ratio is 165/891 ≈ 0.185
So….

A larger corpus won't help much unless it's HUGE …. but the web is!!!

But what if we only have 100 million words for our estimates??
A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with 100 million words of training data? (BOTEC: back-of-the-envelope calculation)

Assuming (for now) a uniform distribution over only 5000 words:

So even with 10^8 words of data, for even trigrams we encounter the sparse data problem…..
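Spelling out the back-of-the-envelope arithmetic under the slide's assumption of a uniform distribution over 5000 words:

    bigram types:  $5000^2 = 2.5 \times 10^7$
    trigram types: $5000^3 = 1.25 \times 10^{11}$

A $10^8$-word corpus contains at most about $10^8$ trigram tokens, so on average there is less than one observation per thousand possible trigram types; most trigram counts are necessarily zero.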
The Markov Assumption:
Only the Immediate Past Matters
The Markov Assumption: Estimation
We estimate the probability of each wi given previous context by

    $P(w_i \mid w_1 w_2 \ldots w_{i-1}) = P(w_i \mid w_{i-1})$

which can be estimated by

    $\frac{Count(w_{i-1} w_i)}{Count(w_{i-1})}$

So we're back to counting only unigrams and bigrams!!

AND we have a correct practical estimation method for P(W) given the Markov assumption!
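A minimal sketch (function names and the toy corpus are illustrative, not from the slides) of the counting this implies:

```python
from collections import Counter

def train_bigram_mle(tokens):
    """Return P(w_i | w_{i-1}) estimated as Count(w_{i-1} w_i) / Count(w_{i-1})."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

# Toy corpus, purely for illustration
tokens = "the cat sat on the mat the cat ate".split()
p = train_bigram_mle(tokens)
print(p("the", "cat"))   # Count("the cat") / Count("the") = 2/3
```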
Markov Models
Visualizing an n-gram based language model:
the Shannon/Miller/Selfridge method

To generate a sequence of n words given unigram estimates:
• Fix some ordering of the vocabulary v1 v2 v3 … vk.
• For each word wi, 1 ≤ i ≤ n:
  — Choose a random value ri between 0 and 1
  — wi = the first vj such that $\sum_{m=1}^{j} P(v_m) \geq r_i$
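A small sketch of this unigram sampler (the vocabulary and probabilities are made up for illustration):

```python
import random

def sample_unigram(vocab_probs):
    """Return the first v_j whose cumulative probability reaches a random r in [0, 1)."""
    r = random.random()
    cumulative = 0.0
    for word, p in vocab_probs:          # fixed ordering v_1 ... v_k
        cumulative += p
        if cumulative >= r:
            return word
    return vocab_probs[-1][0]            # guard against floating-point rounding

vocab_probs = [("the", 0.4), ("cat", 0.3), ("sat", 0.2), ("mat", 0.1)]
print(" ".join(sample_unigram(vocab_probs) for _ in range(5)))
```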
Visualizing an n-gram based language model:
the Shannon/Miller/Selfridge method

To generate a sequence of n words given a 1st order Markov model (i.e. conditioned on one previous word):
• Fix some ordering of the vocabulary v1 v2 v3 … vk.
• Use the unigram method to generate an initial word w1
• For each remaining wi, 2 ≤ i ≤ n:
  — Choose a random value ri between 0 and 1
  — wi = the first vj such that $\sum_{m=1}^{j} P(v_m \mid w_{i-1}) \geq r_i$
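And a sketch of the first-order version, conditioning each draw on the previous word; for brevity a start symbol replaces the slide's unigram-initialized first word, and the bigram table is a toy example:

```python
import random

def sample_next(bigram_probs, prev):
    """Return the first v_j whose cumulative P(v_j | prev) reaches a random r in [0, 1)."""
    r = random.random()
    cumulative = 0.0
    for word, p in bigram_probs[prev]:
        cumulative += p
        if cumulative >= r:
            return word
    return bigram_probs[prev][-1][0]

# Toy conditional distributions P(. | prev)
bigram_probs = {
    "<s>": [("the", 0.7), ("a", 0.3)],
    "the": [("cat", 0.5), ("mat", 0.5)],
    "a":   [("cat", 1.0)],
    "cat": [("sat", 0.6), ("ate", 0.4)],
    "mat": [("sat", 1.0)],
    "sat": [("the", 1.0)],
    "ate": [("the", 1.0)],
}

words, prev = [], "<s>"
for _ in range(6):
    prev = sample_next(bigram_probs, prev)
    words.append(prev)
print(" ".join(words))
```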
The Shannon/Miller/Selfridge method trained on Shakespeare
(This and next two slides from Jurafsky)
Wall Street Journal just isn’t Shakespeare
Shakespeare as corpus
N=884,647 tokens, V=29,066
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
• So 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
The Sparse Data Problem Again
How likely is a 0 count? Much more likely than I let on!!!
English word frequencies well described
by Zipf’s Law
Zipf (1949) characterized the relation between word frequency and rank as:

    $f \cdot r = C$   (for constant $C$)
    $f = C / r$
    $\log(f) = \log(C) - \log(r)$

Purely Zipfian data plots as a straight line on a log-log scale.

*Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
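A quick numerical illustration (the frequency figure is hypothetical, not from the slides): if the rank-1 word of some corpus occurs 70,000 times, then C ≈ 70,000 and Zipf's law predicts about 70,000/100 = 700 occurrences for the rank-100 word and 70,000/1000 = 70 for the rank-1000 word. Individual frequencies fall off quickly, but the huge number of rare words means the tail still carries a lot of total probability mass.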
Word frequency & rank in Brown Corpus vs Zipf

[Figure: word frequency vs. rank in the Brown Corpus compared against the Zipf prediction, on a log-log plot. From: Interactive Mathematics, http://www.intmath.com. Annotation: lots of area under the tail of this curve!]
Zipf’s law for the Brown corpus
Smoothing
"This black art is why NLP is taught in the engineering school" – Jason Eisner
Smoothing
At least one unknown word likely per sentence given Zipf!!
To fix 0’s caused by this, we can smooth the data.
• Assume we know how many types never occur in the data.
• Steal probability mass from types that occur at least once.
• Distribute this probability mass over the types that never occur.
Smoothing
….is like Robin Hood:
• it steals from the rich
• and gives to the poor
Review: Add-One Smoothing
Estimate probabilities by assuming every possible word type v ∈ V actually occurred one extra time (as if by appending an unabridged dictionary).

So if there were N words in our corpus, then instead of estimating

    $\hat{P}(w) = \frac{Count(w)}{N}$

we estimate

    $\hat{P}(w) = \frac{Count(w) + 1}{N + V}$
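A minimal sketch of add-one (Laplace) smoothing for unigram probabilities, following the formula above; the toy tokens and vocabulary are illustrative:

```python
from collections import Counter

def add_one_unigram_probs(tokens, vocab):
    """P̂(w) = (Count(w) + 1) / (N + V): every vocabulary type gets one extra pseudo-count."""
    counts = Counter(tokens)
    N, V = len(tokens), len(vocab)
    return {w: (counts[w] + 1) / (N + V) for w in vocab}

tokens = "the cat sat on the mat".split()
vocab = set(tokens) | {"dog"}            # "dog" never occurs but still gets probability mass
probs = add_one_unigram_probs(tokens, vocab)
print(probs["the"], probs["dog"])        # (2+1)/(6+6) = 0.25, (0+1)/(6+6) ≈ 0.083
```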
Add-One Smoothing (again)
Pro: Very simple technique
Cons:
• Probability of frequent n-grams is underestimated
• Probability of rare (or unseen) n-grams is overestimated
• Therefore, too much probability mass is shifted towards unseen n-grams
• All unseen n-grams are smoothed in the same way
Using a smaller added count improves things, but only somewhat
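One standard way to write the smaller-added-count idea is add-k (Lidstone) smoothing with 0 < k < 1 (a common generalization of add-one, not spelled out on the slide itself):

    $\hat{P}(w) = \frac{Count(w) + k}{N + kV}$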
More advanced techniques (Kneser–Ney, Witten–Bell) use properties of the component (n−1)-grams and the like...
(Hint for this homework!)