Page 1:

CPSC 503 Computational Linguistics

Lecture 5, Giuseppe Carenini

Page 2:

Today 22/9

• Finish spelling
• n-grams
• Model evaluation

Page 3:

Spelling: the problem(s)

Non-word, isolated
• Detection: is w in V?
• Correction: find the most likely correct word (funn -> funny, funnel, ...)

Non-word, in context
• Correction: find the most likely correct word in this context (trust funn, a lot of funn)

Real-word, isolated
• Detection / Correction: ?!

Real-word, in context
• Detection: is it an impossible (or very unlikely) word in this context? (I want too go their)
• Correction: find the most likely substitute word in this context
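A minimal sketch of the correction step for the in-context cases: rank candidate replacements by the probability a bigram model assigns to them given the word on their left. The `bigram_prob` table and the candidate lists are invented placeholders, not part of the lecture material.

```python
# Minimal sketch: pick the most likely correction given the left-context word.
# The bigram probabilities and candidates below are invented for illustration.
bigram_prob = {
    ("trust", "fund"): 0.012,
    ("trust", "funny"): 0.0001,
    ("of", "fun"): 0.02,
    ("of", "funnel"): 0.0005,
}

def best_correction(prev_word, candidates):
    """Return the candidate with the highest P(candidate | prev_word)."""
    return max(candidates, key=lambda c: bigram_prob.get((prev_word, c), 0.0))

print(best_correction("trust", ["funny", "fund", "funnel"]))  # -> 'fund'
print(best_correction("of", ["funny", "fun", "funnel"]))      # -> 'fun'
```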

Page 4:

Key Transition

• Up to this point we’ve mostly been discussing words in isolation

• Now we’re switching to sequences of words

• And we’re going to worry about assigning probabilities to sequences of words

Page 5:

Knowledge-Formalisms Map (including probabilistic formalisms)

• State Machines (and prob. versions): Finite State Automata, Finite State Transducers, Markov Models
• Rule systems (and prob. versions), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners

Linguistic levels: Morphology, Syntax, Semantics, Pragmatics (Discourse and Dialogue)

Page 6:

Only Spelling?

A. Assign a probability to a sentence
• Part-of-speech tagging
• Word-sense disambiguation
• Probabilistic parsing

B. Predict the next word
• Speech recognition
• Hand-writing recognition
• Augmentative communication for the disabled

Both A and B need $P(w_1, \dots, w_n)$ = ?   Impossible to estimate!

Page 7:

Impossible to estimate!

$P(w_1, \dots, w_n) = ?$

Assuming $10^4$ word types and an average sentence of 10 words -> sample space?

• Google language model update (22 Sept. 2006) was based on a corpus with 95,119,665,584 sentences.

Most sentences will not appear, or appear only once.

Key point in Stat. NLP: your corpus should be >> than your sample space!

Page 8:

Decompose: apply chain rule

Chain Rule:

$P(A_1, \dots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1})$

Applied to a word sequence from position 1 to n (written $w_1^n$):

$P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = P(w_1) \prod_{k=2}^{n} P(w_k \mid w_1^{k-1})$

Page 9:

Example

• Sequence: “The big red dog barks”
• P(The big red dog barks) = P(The) * P(big|The) * P(red|The big) * P(dog|The big red) * P(barks|The big red dog)

Note: P(The) is better expressed as P(The | <beginning of sentence>), written as P(The | <S>).

Page 10:

Not a satisfying solution

Even for small n (e.g., 6) we would need a far too large corpus to estimate:

$P(w_6 \mid w_1^5)$

Markov Assumption: the entire prefix history isn’t necessary.

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$

• N=1 (unigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n)$
• N=2 (bigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$
• N=3 (trigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1}, w_{n-2})$

Page 11:

Prob of a sentence: N-Grams

Chain rule:
$P(w_1, \dots, w_n) = P(w_1) \prod_{k=2}^{n} P(w_k \mid w_1^{k-1})$

Simplifications:

• unigram: $P(w_1, \dots, w_n) \approx P(w_1) \prod_{k=2}^{n} P(w_k)$
• bigram: $P(w_1, \dots, w_n) \approx P(w_1) \prod_{k=2}^{n} P(w_k \mid w_{k-1})$
• trigram: $P(w_1, \dots, w_n) \approx P(w_1) \prod_{k=2}^{n} P(w_k \mid w_{k-1}, w_{k-2})$

Page 12:

Bigram

Sequence: <S> The big red dog barks

$P(w_1, \dots, w_n) \approx P(w_1 \mid \langle S \rangle) \prod_{k=2}^{n} P(w_k \mid w_{k-1})$

P(The big red dog barks) = P(The|<S>) * P(big|The) * P(red|big) * P(dog|red) * P(barks|dog)

Trigram?
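A short sketch of the bigram factorization above; the probability table is a made-up placeholder, and a real model would need smoothing for unseen pairs (discussed later).

```python
from functools import reduce

# Hypothetical bigram probabilities P(w_k | w_{k-1}); values are invented.
bigram_prob = {
    ("<S>", "The"): 0.2, ("The", "big"): 0.1, ("big", "red"): 0.05,
    ("red", "dog"): 0.3, ("dog", "barks"): 0.4,
}

def sentence_prob(words, probs):
    """P(w_1..w_n) under the bigram (Markov) assumption, with <S> as the start symbol."""
    padded = ["<S>"] + words
    pairs = zip(padded, padded[1:])
    return reduce(lambda acc, pair: acc * probs.get(pair, 0.0), pairs, 1.0)

print(sentence_prob("The big red dog barks".split(), bigram_prob))  # 0.2*0.1*0.05*0.3*0.4
```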

Page 13:

Estimates for N-Grams

bigram:
$P(w_n \mid w_{n-1}) = \frac{P(w_{n-1}, w_n)}{P(w_{n-1})} = \frac{C(w_{n-1}, w_n)/N_{pairs}}{C(w_{n-1})/N_{words}} \approx \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$

...in general:
$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}, w_n)}{C(w_{n-N+1}^{n-1})}$

Page 14:

Estimates for Bigrams

$P(red \mid big) = \frac{P(big, red)}{P(big)} = \frac{C(big, red)/N_{pairs}}{C(big)/N_{words}} \approx \frac{C(big, red)}{C(big)}$

Silly corpus: “<S> The big red dog barks against the big pink dog”

Word types vs. word tokens
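A minimal sketch of these counts on the silly corpus (lower-casing the text is an assumption of this sketch, not of the slides):

```python
from collections import Counter

corpus = "<S> The big red dog barks against the big pink dog".lower().split()

unigram_counts = Counter(corpus)                  # C(w)
bigram_counts = Counter(zip(corpus, corpus[1:]))  # C(w_{n-1}, w_n)

def bigram_mle(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("big", "red"))   # C(big, red)=1, C(big)=2 -> 0.5
print(bigram_mle("the", "big"))   # C(the, big)=2, C(the)=2 -> 1.0
print(len(corpus), "tokens,", len(unigram_counts), "types")
```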

Page 15:

Berkeley ____________ Project (1994)

Table: bigram counts $C(w_{n-1}, w_n)$   [count table not reproduced here]

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$

Corpus: ~10,000 sentences, 1616 word types. What domain? Dialog? Reviews?

Page 16:

BERP Table: bigram probabilities $P(w_n \mid w_{n-1})$   [table not reproduced here]

Page 17:

BERP Table Comparison

Counts $C(w_{n-1}, w_n)$ vs. probabilities $\frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$   [tables not reproduced here]

Should each row of the probability table sum to 1?

Page 18:

Some Observations

• What’s being captured here?
  – P(want | I) = .32
  – P(to | want) = .65
  – P(eat | to) = .26
  – P(food | Chinese) = .56
  – P(lunch | eat) = .055

Page 19:

Some Observations

• P(I | I)
• P(want | I)
• P(I | food)

• I I I want
• I want I want to
• The food I want is

A speech-based restaurant consultant!

Page 20:

Generation

• Choose N-grams according to their probabilities and string them together

• I want -> want to -> to eat -> eat lunch
• I want -> want to -> to eat -> eat Chinese -> Chinese food
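A minimal sketch of this generation loop, sampling the next word from a bigram distribution; the probability table here is a small invented example, not the BERP model.

```python
import random

# Hypothetical bigram distributions P(next | current); values are invented.
bigram_dist = {
    "I":       {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"lunch": 0.5, "Chinese": 0.5},
    "Chinese": {"food": 1.0},
}

def generate(start, max_len=10):
    """String bigrams together by repeatedly sampling the next word."""
    words = [start]
    while words[-1] in bigram_dist and len(words) < max_len:
        nxt = bigram_dist[words[-1]]
        words.append(random.choices(list(nxt), weights=list(nxt.values()))[0])
    return " ".join(words)

print(generate("I"))  # e.g. "I want to eat lunch" or "I want to eat Chinese food"
```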

Page 21:

Two problems with applying

$P(w_1^n) \approx P(w_1) \prod_{k=2}^{n} P(w_k \mid w_{k-1})$

to the bigram estimates obtained from a corpus (e.g., the BERP table):

Page 22:

Problem (1)

• We may need to multiply many very small numbers (underflows!)

• Easy solution:
  – Convert probabilities to logs and then do ……
  – To get the real probability (if you need it) go back to the antilog.
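A small sketch of this trick, assuming the blank above is filled with addition (summing log probabilities):

```python
import math

probs = [0.001, 0.0002, 0.00005, 0.003]   # example conditional probabilities

# The naive product can underflow for long sequences; in log space we add instead.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)             # sum of logs
print(math.exp(log_prob))   # antilog recovers the (tiny) probability: 3e-14
```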

Page 23:

Problem (2)

• The probability matrix for n-grams is sparse
• How can we assign a probability to a sequence where one of the component n-grams has a value of zero?

• Solutions:
  – Add-one smoothing
  – Good-Turing
  – Backoff and deleted interpolation

Page 24:

Add-One

• Make the zero counts 1.
• Rationale: if you had seen these “rare” events, chances are you would only have seen them once.

unigram:
$P(w) = \frac{C(w)}{N}$

$P^{*}(w) = \frac{C(w)+1}{N+V}$

Adjusted count: $C^{*}(w) = (C(w)+1)\frac{N}{N+V}$

Discount: $d_c = \frac{C^{*}(w)}{C(w)}$

Page 25:

Add-One: Bigram

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$

$P^{*}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)+1}{C(w_{n-1})+V}$

$C^{*}(w_{n-1}, w_n) = \frac{(C(w_{n-1}, w_n)+1)\, C(w_{n-1})}{C(w_{n-1})+V}$

[Smoothed counts table not reproduced here]
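A minimal sketch of add-one (Laplace) smoothing on bigram counts, reusing the silly-corpus counters from the earlier sketch; V is the number of word types.

```python
# Reuses unigram_counts / bigram_counts from the earlier silly-corpus sketch.
V = len(unigram_counts)  # vocabulary size (word types)

def bigram_addone(prev, word):
    """Add-one smoothed estimate P*(word | prev) = (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

def adjusted_count(prev, word):
    """Adjusted count C*(prev, word) = (C(prev, word) + 1) * C(prev) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) * unigram_counts[prev] / (unigram_counts[prev] + V)

print(bigram_addone("big", "red"))    # (1 + 1) / (2 + 8) = 0.2
print(bigram_addone("big", "barks"))  # unseen pair gets a small non-zero probability: 0.1
```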

Page 26:

BERP Original vs. Add-One Smoothed Counts   [count tables not reproduced here]

Page 27:

Add-One Smoothing: Problems

• An excessive amount of probability mass is moved to the zero counts
• When compared empirically with MLE or other smoothing techniques, it performs poorly
• -> Not used in practice
• Detailed discussion in [Gale and Church 1994]

Page 28:

Better smoothing techniques

• Good-Turing discounting (clear example in the textbook)

More advanced: Backoff and Interpolation

• To estimate an n-gram, use “lower order” n-grams
  – Backoff: rely on the lower-order n-grams when we have zero counts for a higher-order one; e.g., use unigrams to estimate zero-count bigrams
  – Interpolation: mix the probability estimates from all the n-gram estimators; e.g., do a weighted interpolation of trigram, bigram and unigram counts (see the sketch below)
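A minimal sketch of simple linear interpolation of unigram, bigram and trigram estimates; the lambda weights and the three component estimators are placeholders (in practice the weights are tuned, e.g., on held-out data).

```python
# Hypothetical component estimators; in practice these come from corpus counts.
def p_unigram(w):                return 0.001
def p_bigram(w, prev):           return 0.01
def p_trigram(w, prev2, prev1):  return 0.0   # zero count for this trigram

# Interpolation weights (must sum to 1); values here are illustrative only.
L1, L2, L3 = 0.1, 0.3, 0.6

def p_interpolated(w, prev2, prev1):
    """P(w | prev2, prev1) as a weighted mix of trigram, bigram and unigram estimates."""
    return (L3 * p_trigram(w, prev2, prev1)
            + L2 * p_bigram(w, prev1)
            + L1 * p_unigram(w))

print(p_interpolated("food", "want", "Chinese"))  # non-zero even though the trigram count is zero
```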

Page 29:

N-Grams Summary: final

• $P(w_1, \dots, w_n) = ?$ Impossible to estimate! Sample space much bigger than any realistic corpus.
• Chain rule does not help.
• Markov assumption: unigram … sample space? bigram … sample space? trigram … sample space?
• Sparse matrix: smoothing techniques
• Look at practical issues: sec. 4.8

Page 30:

You still need a big corpus…

• The biggest one is the Web!

$P(w_3 \mid w_1, w_2) \approx \frac{C_{Web}(w_1, w_2, w_3)}{C_{Web}(w_1, w_2)}$

• Impractical to download every page from the Web to count the n-grams =>
• Rely on page counts (which are only approximations)
  – A page can contain an n-gram multiple times
  – Search engines round off their counts
• Such “noise” is tolerable in practice

Page 31:

Today 22/9

• Finish spelling
• n-grams
• Model evaluation

Page 32:

Entropy

• Def1. Measure of uncertainty
• Def2. Measure of the information that we need to resolve an uncertain situation

– Let $p(x) = P(X = x)$, where $x \in X$.
– $H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$
– It is normally measured in bits.
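A small sketch computing this quantity for a toy distribution (the distributions themselves are just examples):

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2 p(x), in bits; zero-probability outcomes contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

fair_coin   = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}

print(entropy(fair_coin))    # 1.0 bit (maximum uncertainty)
print(entropy(biased_coin))  # ~0.47 bits (less uncertain)
```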

Page 33:

Model Evaluation

$P(w_1, \dots, w_n)$: actual distribution
$Q(w_1, \dots, w_n)$: our approximation

How different are they?

Relative entropy (KL divergence):
$D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$
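A quick sketch of this divergence for two toy distributions over the same outcomes (values invented; assumes q(x) > 0 wherever p(x) > 0):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)); 0 iff p and q are identical."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # "actual" distribution
q = {"a": 0.4, "b": 0.4, "c": 0.2}   # model approximation

print(kl_divergence(p, p))  # 0.0
print(kl_divergence(p, q))  # > 0, grows as q diverges from p
```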

Page 34:

Entropy of $P(w_1, \dots, w_n)$

$H(P) = H(w_1^n) = -\sum_{w_1^n \in L} P(w_1^n) \log P(w_1^n)$

Entropy rate: $\frac{1}{n} H(w_1^n)$

Entropy of the language L:
$H(L) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^n \in L} P(w_1^n) \log P(w_1^n)$

Assumptions: ergodic and stationary (NL?)

Shannon-McMillan-Breiman:
$H(L) = \lim_{n \to \infty} -\frac{1}{n} \log P(w_1^n)$

Entropy can be computed by taking the average log probability of a looooong sample.

Page 35:

Cross-Entropy

Between a probability distribution P and another distribution Q (a model for P):

$H(P, Q) = H(P) + D(P \| Q) = -\sum_{x} P(x) \log Q(x)$

$H(P, Q) \ge H(P)$

Between two models Q1 and Q2, the more accurate is the one that assigns higher probability to the data => lower cross-entropy => lower perplexity.

Applied to language:

$H(P, Q) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^n \in L} P(w_1^n) \log Q(w_1^n) = \lim_{n \to \infty} -\frac{1}{n} \log Q(w_1^n)$

Page 36:

Model Evaluation: In practice

A. Split the corpus into a training set and a testing set
B. Train the model Q on the training set (counting, frequencies, smoothing)
C. Apply the model to the testing set $w_1^N$ and compute the cross-entropy and (cross-)perplexity:

$H(P, Q) \approx -\frac{1}{N} \log Q(w_1^N)$

Perplexity: $2^{H(P, Q)}$
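A minimal end-to-end sketch of steps A-C on a toy corpus: train add-one bigram estimates on a training split, then compute cross-entropy and perplexity on a held-out sentence. Everything here (corpus, splits) is invented for illustration.

```python
import math
from collections import Counter

train = "<S> I want to eat lunch <S> I want Chinese food <S> I want to eat".split()
test  = "<S> I want to eat Chinese food".split()

uni = Counter(train)
bi  = Counter(zip(train, train[1:]))
V   = len(uni)

def p_addone(prev, w):
    # Add-one smoothed bigram estimate trained on the training split.
    return (bi[(prev, w)] + 1) / (uni[prev] + V)

# Cross-entropy H(P, Q) ~ -(1/N) * sum log2 Q(w_k | w_{k-1}) over the test set.
log_prob = sum(math.log2(p_addone(prev, w)) for prev, w in zip(test, test[1:]))
cross_entropy = -log_prob / (len(test) - 1)
perplexity = 2 ** cross_entropy

print(round(cross_entropy, 2), "bits; perplexity", round(perplexity, 2))
```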

Page 37:

Knowledge-Formalisms Map (including probabilistic formalisms)

• State Machines (and prob. versions): Finite State Automata, Finite State Transducers, Markov Models
• Rule systems (and prob. versions), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners

Linguistic levels: Morphology, Syntax, Semantics, Pragmatics (Discourse and Dialogue)

Page 38:

Next Time

• Hidden Markov Models (HMM)
• Part-of-Speech (POS) tagging