CPSC 503 Computational Linguistics
Lecture 5
Giuseppe Carenini
Today 23/9
• Min Edit Distance
• n-grams
• Model evaluation
Minimum Edit Distance
• Def. The minimum number of edit operations (insertion, deletion and substitution) needed to transform one string into another.
• Example: gumbo -> gumb (delete o) -> gum (delete b) -> gam (substitute u by a)
Minimum Edit Distance Algorithm
• Dynamic programming (a very common technique in NLP)
• High-level description:
  – Fills in a matrix of partial comparisons
  – The value of each cell is computed as a "simple" function of the surrounding cells
  – Output: not only the number of edit operations but also the sequence of operations
Minimum Edit Distance Algorithm: Details
ed[i, j] = minimum distance between the first i characters of the source and the first j characters of the target
Costs: del-cost = 1, ins-cost = 1, sub-cost = 2
Update rule (fill the matrix cell by cell):
ed[i, j] = MIN( ed[i-1, j] + 1,                                   (deletion)
                ed[i, j-1] + 1,                                   (insertion)
                ed[i-1, j-1] + (0 if the characters are equal, else 2) )   (substitution or equal)
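A minimal sketch of the dynamic program described above, using the costs from the slide (insert = 1, delete = 1, substitute = 2); the function name and strings are illustrative, not from the course materials.

def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """ed[i][j] = distance between the first i chars of source and the first j chars of target."""
    n, m = len(source), len(target)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        ed[i][0] = ed[i - 1][0] + del_cost          # delete all source characters
    for j in range(1, m + 1):
        ed[0][j] = ed[0][j - 1] + ins_cost          # insert all target characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            ed[i][j] = min(
                ed[i - 1][j] + del_cost,                        # deletion
                ed[i][j - 1] + ins_cost,                        # insertion
                ed[i - 1][j - 1] + (0 if same else sub_cost),   # substitution or equal
            )
    return ed[n][m]

print(min_edit_distance("gumbo", "gam"))  # 4: delete 'o', delete 'b', substitute 'u' by 'a'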
Min edit distance and alignment
See demo
Spelling: the problem(s)
• Non-word, isolated
  – Detection: is the word in the vocabulary (w ∈ V)?
  – Correction: find the most likely correct word (funn -> funny, funnel, ...)
• Non-word, in context
  – Correction: find the most likely correct word ...in this context (trust funn / a lot of funn)
• Real-word, isolated
  – Detection: ?!
• Real-word, in context
  – Detection: is it an impossible (or very unlikely) word in this context?
  – Correction: find the most likely substitute word in this context (I want too go their)
Key Transition
• Up to this point we’ve mostly been discussing words in isolation
• Now we’re switching to sequences of words
• And we’re going to worry about assigning probabilities to sequences of words
Knowledge-Formalisms Map (including probabilistic formalisms)
[Diagram relating formalisms to levels of analysis. Formalisms: State Machines and probabilistic versions (Finite State Automata, Finite State Transducers, Markov Models); Rule systems and probabilistic versions (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners. Levels: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue.]
Only Spelling?
A. Assign a probability to a sentence
  • Part-of-speech tagging
  • Word-sense disambiguation
  • Probabilistic parsing
B. Predict the next word
  • Speech recognition
  • Handwriting recognition
  • Augmentative communication for the disabled
P(w_1, ..., w_n) = ?  Impossible to estimate!
Impossible to estimate!
P(w_1, ..., w_n) = ?
Assuming 10^5 word types and an average sentence of 10 words -> what is the sample space?
• Most sentences will not appear, or will appear only once
• Google language model update (22 Sept. 2006): based on a corpus with 95,119,665,584 sentences
Key point in statistical NLP: your corpus should be much bigger (>>) than your sample space!
Decompose: apply the chain rule
Chain rule:
P(A_1, ..., A_n) = ∏_{i=1}^{n} P(A_i | A_1, ..., A_{i-1})
Applied to a word sequence from position 1 to n:
P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})
         = P(w_1) ∏_{k=2}^{n} P(w_k | w_1^{k-1})
Example
• Sequence: "The big red dog barks"
• P(The big red dog barks) = P(The) x P(big|the) x P(red|the big) x P(dog|the big red) x P(barks|the big red dog)
Note: P(The) is better expressed as P(The|<beginning of sentence>), written P(The|<S>).
Not a satisfying solution
Even for small n (e.g., 6) we would need a far too large corpus to estimate P(w_6 | w_1, ..., w_5).
Markov assumption: the entire prefix history isn't necessary.
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
• N = 1 (unigram):  P(w_n | w_1^{n-1}) ≈ P(w_n)
• N = 2 (bigram):   P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})
• N = 3 (trigram):  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-2}, w_{n-1})
Prob of a sentence: N-Grams
Chain rule: P(w_1^n) = P(w_1) ∏_{k=2}^{n} P(w_k | w_1^{k-1})
Simplifications:
• Unigram:  P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k)
• Bigram:   P(w_1^n) ≈ P(w_1) ∏_{k=2}^{n} P(w_k | w_{k-1})
• Trigram:  P(w_1^n) ≈ P(w_1) P(w_2 | w_1) ∏_{k=3}^{n} P(w_k | w_{k-2}, w_{k-1})
Bigram
<S> The big red dog barks
Bigram: P(w_1^n) ≈ P(w_1 | <S>) ∏_{k=2}^{n} P(w_k | w_{k-1})
P(The big red dog barks) = P(The|<S>) x P(big|The) x P(red|big) x P(dog|red) x P(barks|dog)
Trigram?
Estimates for N-Grams
Bigram:
P(w_n | w_{n-1}) = P(w_{n-1}, w_n) / P(w_{n-1})
                 = [ C(w_{n-1}, w_n) / N_pairs ] / [ C(w_{n-1}) / N_words ]
                 ≈ C(w_{n-1}, w_n) / C(w_{n-1})
...in general:
P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1}, w_n) / C(w_{n-N+1}^{n-1})
Estimates for Bigrams
P(red | big) = P(big, red) / P(big)
             = [ C(big, red) / N_pairs ] / [ C(big) / N_words ]
             ≈ C(big, red) / C(big)
Silly corpus: "<S> The big red dog barks against the big pink dog"
Word types vs. word tokens?
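As a rough illustration of these count-based estimates, the following sketch builds unigram and bigram counts from the silly corpus above (lowercased for counting) and computes the MLE bigram probability; the names are illustrative, not from the course materials.

from collections import Counter

tokens = "<S> the big red dog barks against the big pink dog".split()

unigrams = Counter(tokens)                   # C(w), word-token counts
bigrams = Counter(zip(tokens, tokens[1:]))   # C(w_{n-1}, w_n)

def p_mle(prev, word):
    """MLE bigram estimate: P(word | prev) ~= C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("big", "red"))                             # C(big, red) / C(big) = 1/2 = 0.5
print(len(unigrams), "types,", len(tokens), "tokens")  # 8 word types vs. 11 word tokens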
Berkeley ____________ Project (1994)
Table: counts C(w_{n-1}, w_n)
P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
Corpus: ~10,000 sentences, 1616 word types. What domain? Dialog? Reviews?
BERP Table: P(w_n | w_{n-1})
[bigram probability table, rows w_{n-1}, columns w_n]
BERP Table Comparison
Counts C(w_{n-1}, w_n) vs. probabilities P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
Some Observations
• What's being captured here?
  – P(want|I) = .32
  – P(to|want) = .65
  – P(eat|to) = .26
  – P(food|Chinese) = .56
  – P(lunch|eat) = .055
Some Observations
• P(I | I), P(want | I), P(I | food)
• "I I I want", "I want I want to", "The food I want is"
A speech-based restaurant consultant!
Generation
• Choose N-grams according to their probabilities and string them together (see the sketch after this list)
• I want -> want to -> to eat -> eat lunch
• I want -> want to -> to eat -> eat Chinese -> Chinese food
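A toy sketch of this generation procedure; the bigram table below is hand-made with illustrative probabilities (not the BERP values).

import random

# Toy bigram table: P(next word | previous word). Probabilities are made up.
bigram_probs = {
    "<S>":     {"i": 1.0},
    "i":       {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"lunch": 0.5, "chinese": 0.5},
    "chinese": {"food": 1.0},
}

def generate(start="<S>", max_len=10):
    """String words together by repeatedly sampling from P(w | previous word)."""
    word, out = start, []
    while word in bigram_probs and len(out) < max_len:
        candidates = list(bigram_probs[word])
        weights = list(bigram_probs[word].values())
        word = random.choices(candidates, weights=weights)[0]
        out.append(word)
    return " ".join(out)

print(generate())  # e.g. "i want to eat lunch" or "i want to eat chinese food"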
Two problems with applying
P(w_1^n) ≈ P(w_1) ∏_{k=2}^{n} P(w_k | w_{k-1})
to the bigram table above
Problem (1)
• We may need to multiply many very small numbers (underflow!)
• Easy solution (sketched below):
  – Convert probabilities to logs and then do additions (sums of logs instead of products of probabilities)
  – To get the real probability (if you need it), go back to the antilog
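A minimal illustration of the log trick, with made-up component probabilities:

import math

# Component n-gram probabilities (made-up values); multiplying many of these underflows.
probs = [0.1, 0.003, 0.0007, 0.02]

log_p = sum(math.log(p) for p in probs)   # work in log space: sums instead of products
print(log_p)                              # log of the sequence probability
print(math.exp(log_p))                    # antilog, only if the real probability is needed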
Problem (2)
• The probability matrix for n-grams is sparse
• How can we assign a probability to a sequence when one of its component n-grams has a probability of zero?
• Solutions:
  – Add-one smoothing
  – Good-Turing discounting
  – Backoff and deleted interpolation
Add-One
• Make the zero counts 1.
• Rationale: if you had seen these "rare" events, chances are you would only have seen them once.
Unigram:
  MLE:             P(w) = C(w) / N
  Add-one:         P*(w) = (C(w) + 1) / (N + V)
  Adjusted counts: C*(w) = (C(w) + 1) · N / (N + V)
  Discount:        d_c = C*(w) / C(w)
Add-One: Bigram
  MLE:             P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
  Add-one:         P*(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + V)
  Adjusted counts: C*(w_{n-1}, w_n) = (C(w_{n-1}, w_n) + 1) · C(w_{n-1}) / (C(w_{n-1}) + V)
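A small sketch of add-one smoothing for bigrams, reusing the silly corpus from before (V is the number of word types); the names are illustrative:

from collections import Counter

tokens = "<S> the big red dog barks against the big pink dog".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)                                  # vocabulary size (word types)

def p_add_one(prev, word):
    """Add-one smoothed bigram: (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_add_one("big", "red"))    # seen bigram:   (1 + 1) / (2 + 8) = 0.2
print(p_add_one("big", "barks"))  # unseen bigram: (0 + 1) / (2 + 8) = 0.1, no longer zero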
BERP: Original vs. Add-One Smoothed Counts
[table of original and add-one smoothed bigram counts C(w_{n-1}, w_n)]
Add-One Smoothing: Problems
• An excessive amount of probability mass is moved to the zero counts
• When compared empirically with MLE or other smoothing techniques, it performs poorly
• -> Not used in practice
• Detailed discussion in [Gale and Church 1994]
Better smoothing techniques
• Good-Turing discounting (clear example in the textbook)
More advanced: Backoff and Interpolation
• To estimate an n-gram, use "lower order" n-grams:
  – Backoff: rely on the lower-order n-grams when we have zero counts for a higher-order one; e.g., use unigrams to estimate zero-count bigrams
  – Interpolation: mix the probability estimates from all the n-gram estimators; e.g., a weighted interpolation of trigram, bigram and unigram counts (sketched below)
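A rough sketch of the interpolation idea (not backoff), with made-up lambda weights:

def interpolated_trigram(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Weighted interpolation of unigram, bigram and trigram estimates.
    The lambda weights here are illustrative; in practice they are tuned
    on held-out data and must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# A zero-count trigram still receives probability mass from the lower orders:
print(interpolated_trigram(p_uni=0.001, p_bi=0.01, p_tri=0.0))  # 0.0031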
N-Grams Summary: final
• P(w_1, ..., w_n) = ? Impossible to estimate! The sample space is much bigger than any realistic corpus.
• The chain rule alone does not help.
• Markov assumption: unigram... sample space? bigram... sample space? trigram... sample space?
• Sparse matrix: smoothing techniques
• Look at practical issues: sec. 4.8
You still need a big corpus...
• The biggest one is the Web!
P(w_3 | w_1, w_2) ≈ C_web(w_1, w_2, w_3) / C_web(w_1, w_2)
• Impractical to download every page from the Web to count the n-grams =>
• Rely on page counts (which are only approximations)
  – A page can contain an n-gram multiple times
  – Search engines round off their counts
• Such "noise" is tolerable in practice
Today 23/9
• Min Edit Distance
• n-grams
• Model evaluation
Model Evaluation: Goal
You may want to compare:
• 2-grams with 3-grams
• two different smoothing techniques (given the same n-grams)
...on a given corpus.
Model Evaluation: Key Ideas
A: Split the corpus into a training set and a testing set (w_1^N)
B: Train the models Q1 and Q2 on the training set (counting, frequencies, smoothing)
C: Apply the models to the testing set and compare the results
Entropy
• Def. 1: a measure of uncertainty
• Def. 2: a measure of the information we need to resolve an uncertain situation
  – Let p(x) = P(X = x), where x ∈ X.
  – H(p) = H(X) = -Σ_{x ∈ X} p(x) log2 p(x)
  – It is normally measured in bits.
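A small sketch of this definition:

import math

def entropy(p):
    """H(p) = -sum over x of p(x) * log2 p(x), measured in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

fair_coin  = {"H": 0.5, "T": 0.5}
skewed_die = {"1": 0.5, "2": 0.25, "3": 0.125, "4": 0.125}

print(entropy(fair_coin))   # 1.0 bit
print(entropy(skewed_die))  # 1.75 bits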
Model Evaluation
• Actual distribution: P(w_1, ..., w_n)
• Our approximation: Q(w_1, ..., w_n)
• How different are they?
Relative entropy (KL divergence):
D(p || q) = Σ_{x ∈ X} p(x) log( p(x) / q(x) )
Entropy of P(w_1, ..., w_n)
H(P) = H(w_1^n) = -Σ_{w_1^n ∈ L} P(w_1^n) log P(w_1^n)
Entropy rate: (1/n) H(w_1^n)
Entropy of the language L:
H(L) = lim_{n→∞} -(1/n) Σ_{w_1^n ∈ L} P(w_1^n) log P(w_1^n)
Assumptions: ergodic and stationary (Shannon-McMillan-Breiman):
H(L) = lim_{n→∞} -(1/n) log P(w_1^n)
=> Entropy can be computed by taking the average log probability of a looooong sample.
Cross-Entropy
Between a probability distribution P and another distribution Q (a model for P):
H(P, Q) = H(P) + D(P || Q) = -Σ_x P(x) log Q(x)
H(P, Q) ≥ H(P)
Between two models Q1 and Q2, the more accurate one is the one with the lower cross-entropy (closer to the true H(P)).
Applied to language:
H(P, Q) = lim_{n→∞} -(1/n) Σ_{w_1^n ∈ L} P(w_1^n) log Q(w_1^n)
        = lim_{n→∞} -(1/n) log Q(w_1^n)
Model Evaluation: In Practice
A: Split the corpus into a training set and a testing set (w_1^N)
B: Train the models Q1 and Q2 on the training set (counting, frequencies, smoothing)
C: Apply the models to the testing set and compare their cross-perplexities:
H(P, Q) ≈ -(1/N) log Q(w_1^N)
2^{H(P, Q1)} vs. 2^{H(P, Q2)}
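A sketch of how this test-set comparison can be computed, assuming a hypothetical smoothed bigram model exposed as model(prev, word) that never returns zero (e.g., the add-one sketch from earlier):

import math

def cross_entropy(model, test_tokens):
    """H(P, Q) estimated as -(1/n) log2 Q(w_1..w_n) on a long test sample,
    where Q is decomposed into bigram probabilities model(prev, word)."""
    log_q = sum(math.log2(model(prev, word))
                for prev, word in zip(test_tokens, test_tokens[1:]))
    return -log_q / (len(test_tokens) - 1)

def perplexity(model, test_tokens):
    """Perplexity = 2 ** cross-entropy; lower is better."""
    return 2 ** cross_entropy(model, test_tokens)

# Compare two trained models on the same test set: the more accurate one
# has the lower cross-entropy (and lower perplexity), e.g.
# perplexity(q1_bigram, test_tokens) vs. perplexity(q2_bigram, test_tokens)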
k-Fold Cross-Validation and t-Test
• Randomly divide the corpus into k subsets of equal size
• Use each subset for testing (and all the others for training): in practice you do k times what we saw on the previous slide
• Now for each model you have k perplexities
• Compare the models' average perplexities with a t-test (see the sketch below)
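A sketch of this procedure; the fold-splitting helper and the per-fold perplexity numbers below are purely illustrative, and the paired t-test comes from scipy:

import numpy as np
from scipy import stats

def k_fold_indices(n_items, k):
    """Randomly split item indices (e.g. sentence indices) into k folds."""
    return np.array_split(np.random.permutation(n_items), k)

# Hypothetical per-fold test perplexities for two models Q1 and Q2 (k = 10):
ppl_q1 = np.array([210, 205, 198, 220, 215, 208, 202, 211, 207, 213])
ppl_q2 = np.array([190, 188, 195, 205, 200, 192, 189, 198, 194, 199])

# Paired t-test on the per-fold perplexities: is the difference significant?
t_stat, p_value = stats.ttest_rel(ppl_q1, ppl_q2)
print(ppl_q1.mean(), ppl_q2.mean(), t_stat, p_value)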
Knowledge-Formalisms Map (including probabilistic formalisms)
[Diagram relating formalisms to levels of analysis. Formalisms: State Machines and probabilistic versions (Finite State Automata, Finite State Transducers, Markov Models); Rule systems and probabilistic versions (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners. Levels: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue.]
Next Time
• Hidden Markov Models (HMMs)
• Part-of-Speech (POS) tagging