
Page 1

Statistical Inference: n-gram Models over Sparse Data

Chia-Hao Lee

References:
1. Foundations of Statistical Natural Language Processing
2. The lecture slides of Professor Berlin Chen
3. A Bit of Progress in Language Modeling, Extended Version, Joshua T. Goodman, 2001

Page 2

Outline

• N-gram models
• MLE
• Smoothing
  – Add-one
  – Lidstone's Law
  – Witten-Bell
  – Good-Turing
• Back-off smoothing
  – Simple linear interpolation
  – General linear interpolation
  – Katz
  – Kneser-Ney
• Evaluation

Page 3

Introduction

• Statistical NLP aims to do statistical inference for the field of natural language.

• In general, statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about this distribution.
  – For example, it can be used to predict prepositional phrase attachment.

• A running example of statistical estimation: language modeling

Page 4

Reliability vs. Discrimination

• In order to do inference about one feature, we wish to find other features of the model that predict it.
  – Stationary model: we assume the probabilities do not change over time.

• Based on various classificatory features, we try to predict the target feature.

• We use equivalence classing to help predict the value of the target feature.
  – Independence assumption: the features are independent.

Page 5

Reliability vs. Discrimination

• The more classificatory features we identify, the more finely we can condition on context when predicting the target feature.

• Dividing the data into many bins gives us greater discrimination.

• Using a lot of bins, a particular bin may contain no training instances, or very few, so that we cannot do reliable statistical estimation.

• Is there a good compromise between the two criteria?

Page 6

N-gram models

• The task of predicting the next word can be stated as attempting to estimate the probability function $P(w_n \mid w_1, \ldots, w_{n-1})$.

• History: classification of the previous words

• Markov assumption: only the last few words affect the next word

• The same n−1 words are placed in the same equivalence class:
  – (n−1)th order Markov model, or n-gram model

Page 7

N-gram models

• Naming:
  – gram is a Greek root, so it should be combined with a Greek number prefix.
  – Shannon actually used the term digram, but this usage has not survived.
  – Now bigram is always used instead of digram.

Page 8

N-gram models

• For example: She swallowed the large green ___ .
  – "swallowed" influences the next word more strongly than "the large green" does.

• However, if we divide the data into too many bins, there are a lot of parameters to estimate.

Page 9

N-gram models

Page 10

N-gram models

• A five-gram model that we thought would be useful may well not be practical, even if we have a very large corpus.

• One way of reducing the number of parameters is to reduce the value of n.

• Other ways:
  – Removing the inflectional endings from words (stemming)
  – Grouping words into semantic classes
  – Or … (ref. Ch12, Ch14)

Page 11

Building n-gram models

• Corpus: Jane Austen's novels
  – Freely available and not too large

• As our corpus for building models: Emma, Mansfield Park, Northanger Abbey, Pride and Prejudice, and Sense and Sensibility, reserving Persuasion for testing

• Preprocessing
  – Remove punctuation, leaving white-space
  – Add SGML tags <s> and </s>

• N = 617,091 words, V = 14,585 word types

Page 12

Statistical Estimators

• Find out how to derive a good probability estimate for the target feature, using the following function:

• This can be reduced to having good solutions for simply estimating the unknown probability distribution of n-grams (all in one bin, with no classificatory features).
  – bigram: h1a, h2a, h3a, h4b, h5b, … reduce to a and b

$P(w_n \mid w_1, \ldots, w_{n-1}) = \dfrac{P(w_1, \ldots, w_n)}{P(w_1, \ldots, w_{n-1})}$

Page 13

Statistical Estimators

• We assume that the training text consists of N words.

• We append n−1 dummy start symbols to the beginning of the text.
  – This gives N n-grams, with a uniform amount of conditioning available for the next word in all cases.

Page 14

Maximum Likelihood Estimation

• MLE estimates from relative frequencies.
  – Predict: comes across __?__
  – Using a trigram model: 10 instances of the history in the training data
  – Using relative frequencies:

  $P(as) = 0.8, \quad P(more) = 0.1, \quad P(a) = 0.1, \quad P(x) = 0$ for all other $x$

$P_{MLE}(w_1, \ldots, w_n) = \dfrac{C(w_1, \ldots, w_n)}{N}$

$P_{MLE}(w_n \mid w_1, \ldots, w_{n-1}) = \dfrac{C(w_1, \ldots, w_n)}{C(w_1, \ldots, w_{n-1})}$
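To make the estimator concrete, here is a minimal sketch in Python; it is not from the slides, and the toy corpus and function names are illustrative:

```python
from collections import Counter

def mle_ngram_probs(tokens, n):
    """Relative-frequency (MLE) estimates of P(w_n | w_1..w_{n-1})."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hists = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    # P_MLE(w_n | history) = C(history, w_n) / C(history)
    return {g: c / hists[g[:-1]] for g, c in ngrams.items()}

tokens = "comes across as comes across as comes across more".split()
print(mle_ngram_probs(tokens, 3)[("comes", "across", "as")])  # 2/3
```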

Page 15

Smoothing

• Sparseness
  – Standard n-gram models must be trained from some corpus, which is necessarily finite.
  – A large number of putative 'zero probability' n-grams should really have some non-zero probability.

• Smoothing
  – Re-evaluating some of the zero or low probabilities in the n-gram model.

Page 16

Laplace’s law

• To rectify the failure of the MLE, the oldest solution is to employ Laplace's law (also called add-one):

• For sparse sets of data over large vocabularies, such as n-grams, Laplace's law actually gives far too much of the probability space to unseen events.

$P_{Lap}(w_1, \ldots, w_n) = \dfrac{C(w_1, \ldots, w_n) + 1}{N + B}$

where B is the number of bins the training instances are divided into.

Page 17

Add-One Smoothing

• Add one to the counts before normalizing.

• Ordinary unigram MLE:

  $P(w_x) = \dfrac{C(w_x)}{\sum_i C(w_i)} = \dfrac{C(w_x)}{N}$

• Using add-one, the probability estimate for an n-gram seen r times is:

  $P(w_i) = \dfrac{r + 1}{N + B}$

• So the frequency estimate becomes:

  $f(w_i) = \dfrac{(r + 1)\, N}{N + B}$

Page 18

Add-One Smoothing

• The alternative view: discounting
  – Lowering some non-zero counts to obtain the probability mass that will be assigned to the zero counts.

  $d_c = 1 - \dfrac{c^*}{c}$

Page 19

Add-One Smoothing

• Unigram example:

  $V = \{A, B, C, D, E\}$, $|V| = 5$
  $S = \{A,A,A,A,A,B,B,B,C,C\}$, $N = |S| = 10$
  Counts: 5 for 'A', 3 for 'B', 2 for 'C', 0 for 'D' and 'E'

  $P(A) = \dfrac{5+1}{10+5} = 0.4$
  $P(B) = \dfrac{3+1}{10+5} \approx 0.27$
  $P(C) = \dfrac{2+1}{10+5} = 0.2$
  $P(D) = P(E) = \dfrac{0+1}{10+5} \approx 0.067$
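A short sketch (standard library only) that reproduces these add-one numbers:

```python
from collections import Counter

V = ["A", "B", "C", "D", "E"]      # vocabulary, B = |V| = 5
S = list("AAAAABBBCC")             # training data, N = 10
counts = Counter(S)
N, B = len(S), len(V)

for w in V:
    p = (counts[w] + 1) / (N + B)  # Laplace / add-one estimate
    print(w, round(p, 3))          # A 0.4, B 0.267, C 0.2, D 0.067, E 0.067
```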

Page 20

Add-One Smoothing

• Bigram MLE:

  $P(w_n \mid w_{n-1}) = \dfrac{C[w_{n-1} w_n]}{C[w_{n-1}]}$

• Smoothed:

  $P^*(w_n \mid w_{n-1}) = \dfrac{C[w_{n-1} w_n] + 1}{C[w_{n-1}] + V}$

Page 21

Add-One Smoothing: Example

For example, bigram counts:
N(want) = 1215
N(want, want) = 0
N(want, to) = 786

Page 22

Add-One Smoothing: Example

P(want|want) = 0/1215 = 0
P(to|want) = 786/1215 ≈ 0.65

Page 23

Add-One Smoothing: Example

P′(want|want) = (0+1)/(1215+1616) ≈ 0.00035
P′(to|want) = (786+1)/(1215+1616) ≈ 0.28

Page 24

Add-One Smoothing

• P(want|want) changes from 0 to 0.00035
• P(to|want) changes from 0.65 to 0.28

• The sharp change occurs because too much probability mass is moved to all the zeros.

• Gale and Church conclude that add-one smoothing is worse at predicting the actual probabilities than the unsmoothed MLE.

Page 25

Lidstone’s Law and Jeffreys-Perks Law

• Lidstone's Law:
  – Add some positive value λ, normally smaller than 1

• Jeffreys-Perks Law:
  – The case λ = 1/2, which can be viewed as linear interpolation between the MLE and a uniform prior
  – Also called 'Expected Likelihood Estimation'

$P_{Lid}(w_1, \ldots, w_n) = \dfrac{C(w_1, \ldots, w_n) + \lambda}{N + B\lambda}$

$P_{Lid}(w_1, \ldots, w_n) = \mu\, \dfrac{C(w_1, \ldots, w_n)}{N} + (1 - \mu)\, \dfrac{1}{B}$, where $\mu = \dfrac{N}{N + B\lambda}$
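A sketch of Lidstone's law as a function of λ (λ = 1 recovers Laplace, λ = 0.5 the Jeffreys-Perks/ELE estimate, λ → 0 the MLE); the function name is mine:

```python
def lidstone(count, N, B, lam):
    """P_Lid = (C + lambda) / (N + B * lambda)."""
    return (count + lam) / (N + B * lam)

# The toy unigram data from the add-one example: C('A') = 5, N = 10, B = 5
print(lidstone(5, 10, 5, 1.0))  # Laplace:        6 / 15    = 0.4
print(lidstone(5, 10, 5, 0.5))  # Jeffreys-Perks: 5.5/12.5  = 0.44
print(lidstone(5, 10, 5, 0.0))  # MLE:            5 / 10    = 0.5
```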

Page 26

Held out Estimation

• A cardinal sin in Statistical NLP is to test on the training data. Why?
  – Overtraining
  – Models memorize the training text

• Test data must be independent of the training data.

Page 27

Held out Estimation

• When starting to work with some data, one should always separate it into a training portion and a testing portion.
  – Separate the data immediately into training and test data (5–10% for testing is reliable).
  – Divide the training data into two again:
  – Held out (validation) data (10%)
    • Independent of the primary training and test data
    • Involves many fewer parameters
    • Sufficient data for estimating those parameters

• Research practice:
  – Write an algorithm, train it, and test it, repeatedly (X)
    • Subtly probes the test set
  – Separate into a development test set and a final test set (O)

Page 28

Held out Estimation

• How to select test/held out data?
  – Randomly, or set aside large contiguous chunks

• Comparing average scores is not enough
  – Divide the test data into several parts
  – t-test

Page 29

Held out Estimation: t-test

                                        System 1                  System 2
Scores                                  71, 61, 55, 60, 68, 49,   42, 55, 75, 45, 54, 51,
                                        42, 72, 76, 55, 64        55, 36, 58, 55, 67
Total                                   673                       593
n                                       11                        11
Mean                                    61.2                      53.9
$s_i^2 = \sum_j (x_{ij} - \bar{x}_i)^2$  1081.6                    1186.9
df                                      10                        10

Pooled: $s^2 = \dfrac{1081.6 + 1186.9}{10 + 10} \approx 113.4$

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{2 s^2 / n}} = \dfrac{61.2 - 53.9}{\sqrt{2 \times 113.4 / 11}} \approx 1.60$

t = 1.60 < 1.725, so the data fail the significance test (at the 0.05 level, df = 20).
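The same computation as a short Python check (pure standard library; the scores are those in the table):

```python
import math

sys1 = [71, 61, 55, 60, 68, 49, 42, 72, 76, 55, 64]
sys2 = [42, 55, 75, 45, 54, 51, 55, 36, 58, 55, 67]

def ssd(xs):
    """Sum of squared deviations from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

n = len(sys1)
s2 = (ssd(sys1) + ssd(sys2)) / (2 * (n - 1))          # pooled s^2 ~ 113.4
t = (sum(sys1) / n - sum(sys2) / n) / math.sqrt(2 * s2 / n)
print(round(t, 2))                                    # 1.6 < 1.725
```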

Page 30

Held out Estimation

• The held out estimator (for n-grams):

  $C_1(w_1, \ldots, w_n)$ = frequency of $w_1 \ldots w_n$ in the training data
  $C_2(w_1, \ldots, w_n)$ = frequency of $w_1 \ldots w_n$ in the held out data

  $T_r = \sum_{\{w_1 \ldots w_n\,:\, C_1(w_1 \ldots w_n) = r\}} C_2(w_1, \ldots, w_n)$

  $T_r$: the total number of times that all n-grams that appeared r times in the training text appeared in the held out data

• The probability of one of these n-grams:

  $P_{ho}(w_1, \ldots, w_n) = \dfrac{T_r}{N_r\, N}$

  where $N_r$ is the number of n-grams with frequency r in the training text.
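A minimal sketch of the held-out estimator, assuming tokenized training and held-out word lists, and assuming N is taken as the number of n-grams in the held-out sample; all names are illustrative, and unseen n-grams (r = 0) are ignored here:

```python
from collections import Counter

def held_out_probs(train, heldout, n=2):
    grams = lambda toks: [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    c1, c2 = Counter(grams(train)), Counter(grams(heldout))
    N = len(grams(heldout))               # size of the held-out data
    Nr, Tr = Counter(), Counter()         # N_r and T_r, keyed by r
    for g, r in c1.items():
        Nr[r] += 1                        # n-gram types with training count r
        Tr[r] += c2[g]                    # their total held-out count
    # P_ho for any n-gram seen r times in training: T_r / (N_r * N)
    return {r: Tr[r] / (Nr[r] * N) for r in Nr}
```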

Page 31

Cross-validation

• Divide the training data into two parts.
  – First, estimates are built by doing counts on one part.
  – Second, we use the other pool of held out data to refine those estimates.

• Two-way cross-validation (deleted estimation):

  $P_{del}(w_1, \ldots, w_n) = \dfrac{T_r^{01} + T_r^{10}}{N\, (N_r^0 + N_r^1)}$

  $N_r^0$: the number of n-grams occurring r times in the 0th part of the training data
  $T_r^{01}$: the total occurrences of those n-grams from the 0th part in the 1st part

Page 32

Cross-validation

• Leaving-one-out
  – Training corpus: N − 1 words
  – Held out data: 1 word
  – Repeated N times

Page 33

Witten-Bell Discounting

• A much better smoothing method that is only slightly more complex than add-one.

• Treat a zero-frequency word or n-gram as one that just hasn't happened yet.
  – It can be modeled by the probability of seeing an n-gram for the first time.

• Key: things seen once!

• The count of 'first time' n-grams is just the number of n-gram types we saw in the data.

Page 34

Witten-Bell Discounting

• Total probability of unseen (zero-count) n-grams:

  $\sum_{i:\, c_i = 0} p_i^* = \dfrac{T}{N + T}$

  – T is the number of types we have already seen
  – T differs from V (V is the total number of types we might see)

• Divide this mass up among all the zero-count n-grams
  – Divided equally:

  $p_i^* = \dfrac{T}{Z\, (N + T)}$

  where $Z = \sum_{i:\, c_i = 0} 1$ (the number of n-gram types with zero counts)

Page 35

Witten-Bell Discounting

• Discounted probability of the seen n-grams:

  $p_i^* = \dfrac{c_i}{N + T} \quad \text{if } c_i > 0$

  ($c_i$: the count of a seen n-gram i)

• Another formulation (in terms of frequency counts):

  $c_i^* = \begin{cases} \dfrac{T}{Z} \cdot \dfrac{N}{N + T}, & \text{if } c_i = 0 \\[4pt] c_i \cdot \dfrac{N}{N + T}, & \text{if } c_i > 0 \end{cases}$

Page 36

Witten-Bell Discounting

• For example (unigram modeling):

  $V = \{A, B, C, D, E\}$, $|V| = 5$
  $S = \{A,A,A,A,A,B,B,B,C,C\}$, $N = |S| = 10$
  Counts: 5 for 'A', 3 for 'B', 2 for 'C', 0 for 'D' and 'E'
  $T = |\{A, B, C\}| = 3$, $Z = 2$

  $P(A) = \dfrac{5}{10 + 3} \approx 0.385$
  $P(B) = \dfrac{3}{10 + 3} \approx 0.23$
  $P(C) = \dfrac{2}{10 + 3} \approx 0.154$
  $P(D) = P(E) = \dfrac{3}{10 + 3} \cdot \dfrac{1}{2} \approx 0.115$
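A sketch reproducing the Witten-Bell unigram numbers above (standard library only):

```python
from collections import Counter

V = ["A", "B", "C", "D", "E"]
S = list("AAAAABBBCC")
counts = Counter(S)
N, T = len(S), len(counts)        # N = 10, T = 3 seen types
Z = len(V) - T                    # Z = 2 zero-count types

for w in V:
    c = counts[w]
    p = c / (N + T) if c > 0 else T / (Z * (N + T))
    print(w, round(p, 3))         # A 0.385, B 0.231, C 0.154, D/E 0.115
```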

Page 37

Witten-Bell Discounting

• Bigram
  – Consider bigrams with the history word $w_x$.

• For a zero-count bigram (with $w_x$ as the history):

  $\sum_{i:\, C(w_x w_i) = 0} p^*(w_i \mid w_x) = \dfrac{T(w_x)}{C(w_x) + T(w_x)}$

  $p^*(w_i \mid w_x) = \dfrac{T(w_x)}{Z(w_x)\,(C(w_x) + T(w_x))}$

  where $Z(w_x) = \sum_{i:\, C(w_x w_i) = 0} 1$

  – $C(w_x)$: frequency count of the word $w_x$ in the corpus
  – $T(w_x)$: number of types of nonzero-count bigrams with $w_x$ as the history
  – $Z(w_x)$: number of types of zero-count bigrams with $w_x$ as the history

Page 38

Witten-Bell Discounting

• Bigram
  – For a nonzero-count bigram:

  $p^*(w_i \mid w_x) = \dfrac{C(w_x w_i)}{C(w_x) + T(w_x)}$

Page 39

Good-Turing estimation

• A method slightly more complex than Witten-Bell.
  – Re-estimates zero or low counts from the higher counts.

• Good-Turing estimation:
  – For any n-gram that occurs r times, we pretend it occurs $r^*$ times:

  $r^* = (r + 1)\, \dfrac{n_{r+1}}{n_r}$  (a new frequency count)

  where $n_r$ is the number of n-grams that occur exactly r times in the training data.

  – The probability estimate for an n-gram with count r:

  $P_{GT}(x) = \dfrac{r^*}{N}$

  where N is the size of the training data.

Page 40

Good-Turing estimation

• The size (word count) of the training data remains the same:
  – We have $\sum_{r=1}^{\infty} r \cdot n_r = N$, and, setting $r' = r + 1$:

  $\tilde{N} = \sum_{r=0}^{\infty} n_r\, r^* = \sum_{r=0}^{\infty} (r + 1)\, n_{r+1} = \sum_{r'=1}^{\infty} r'\, n_{r'} = N$

• Unseen mass: $N_1 / N$. Why?
  – $0^* = (0 + 1) \cdot n_1 / n_0$, and the number of zero-frequency n-grams is $n_0$,
  – so the total probability is $\dfrac{(n_1 / n_0) \cdot n_0}{N} = \dfrac{n_1}{N}$ (the MLE mass of the once-seen n-grams).

Page 41

Good-Turing estimation: Example

• Imagine you are fishing. You have caught 10 carp, 3 cod, 2 tuna, 1 trout, 1 salmon, 1 eel.

• How likely is it that the next species is new?
  – $P_0 = n_1 / N = 3/18 = 1/6$

• How likely is eel? Use $1^*$:
  – $n_1 = 3$, $n_2 = 1$
  – $1^* = (1 + 1) \times 1/3 = 2/3$
  – $P(eel) = 1^*/N = (2/3)/18 = 1/27$

• How likely is tuna? Use $2^*$:
  – $n_2 = 1$, $n_3 = 1$
  – $2^* = (2 + 1) \times 1/1 = 3$
  – $P(tuna) = 2^*/N = 3/18 = 1/6$

• But how likely is cod? We need $3^*$:
  – $n_4 = 0$, so we need to smooth the $n_r$ in advance.
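A small sketch of these Good-Turing calculations (the catch counts are those of the example):

```python
from collections import Counter

catch = {"carp": 10, "cod": 3, "tuna": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())            # 18
n = Counter(catch.values())        # n_r: number of species caught r times

def r_star(r):
    """Good-Turing adjusted count: r* = (r+1) * n_{r+1} / n_r."""
    return (r + 1) * n[r + 1] / n[r]

print(n[1] / N)        # P(new species)     = 3/18 = 1/6
print(r_star(1) / N)   # P(eel)  = (2/3)/18 = 1/27
print(r_star(2) / N)   # P(tuna) = 3/18     = 1/6
# r_star(3) gives 0 because n_4 = 0: the n_r must be smoothed first
```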

Page 42

Good-Turing estimation

• The problems with the Good-Turing estimate are that $n_{r+1}$ may be 0, and that we may get $P(r^*) > P((r+1)^*)$:
  – The choice of a cutoff k may overcome the second problem.
  – Experimentally, 4 ≤ k ≤ 8 (Katz); the parameter k requires $n_{k+1} \neq 0$.

  $\hat{P}_{GT}(a_k) < \hat{P}_{GT}(a_{k+1}) \Rightarrow (k+1)\, n_{k+1}^2 - (k+2)\, n_k\, n_{k+2} < 0$

Page 43

Combining Estimators

• Because we give the same estimate to all n-grams that never appeared, we hope to produce better estimates by looking at the (n−1)-grams.

• Combine multiple probability estimates from various different models:
  – Simple linear interpolation
  – Katz back-off
  – General linear interpolation

Page 44

Simple linear interpolation

• Also called a mixture model

• How to get the weights:
  – Expectation Maximization (EM) algorithm
  – Powell's algorithm

• The method works quite well; Chen and Goodman use it as a baseline model.

  $P_{li}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1})$

  where $0 \le \lambda_i \le 1$ and $\sum_i \lambda_i = 1$

Page 45

General linear interpolation

• The difference between GLI and SLI is that the weights $\lambda_i$ of the GLI are a function of the history:

  $P_{li}(w \mid h) = \sum_{i=1}^{k} \lambda_i(h)\, P_i(w \mid h)$

  where $\forall h:\ 0 \le \lambda_i(h) \le 1$ and $\sum_i \lambda_i(h) = 1$

• SLI can make bad use of its component models:
  – e.g., the unigram estimate is always combined in with the same weight, regardless of whether the trigram estimate is good or bad.

Page 46

Katz Back-off

• Extends the intuition of the GT estimate by combining higher-order language models with lower-order ones.

• Larger counts are not discounted, because they are taken to be reliable (r > k).

• Lower counts are discounted, as in Good-Turing (r ≤ k).

Page 47

Katz Back-off

• We take the bigram (n-gram, n = 2) counts for example:

1. $C^*[w_{i-1} w_i] = \begin{cases} r & \text{if } r > k \\ d_r\, r & \text{if } k \ge r > 0 \\ \beta(w_{i-1})\, P_{Katz}(w_i) & \text{if } r = 0 \end{cases}$, where $r = C[w_{i-1} w_i]$

2. $d_r = \dfrac{\dfrac{r^*}{r} - \dfrac{(k+1)\, n_{k+1}}{n_1}}{1 - \dfrac{(k+1)\, n_{k+1}}{n_1}}$

3. $\beta(w_{i-1}) = \dfrac{\sum_{w_i:\, C[w_{i-1} w_i] > 0} C[w_{i-1} w_i] \;-\; \sum_{w_i:\, C[w_{i-1} w_i] > 0} C^*[w_{i-1} w_i]}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P_{Katz}(w_i)}$

Page 48

Katz Back-off

$\beta(w_{i-1}) = \dfrac{\overbrace{\sum_{w_i:\, C[w_{i-1} w_i] > 0} C[w_{i-1} w_i]}^{\text{count before discount}} - \overbrace{\sum_{w_i:\, C[w_{i-1} w_i] > 0} C^*[w_{i-1} w_i]}^{\text{count after discount}}}{\underbrace{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P_{Katz}(w_i)}_{\text{unigram weight}}}$

Page 49

Katz Back-off

• Derivation of $d_r$:
  – The $d_r$ have to satisfy two equations:

  1. $\sum_{r=1}^{k} n_r (1 - d_r)\, r = n_1$  (the discounted count mass equals the mass given to the unseen n-grams)
  2. $1 - d_r = \mu \left(1 - \dfrac{r^*}{r}\right)$  (the discounts are proportional to the Good-Turing discounts), for some constant $\mu$

• A useful identity:

  $\sum_{r=1}^{k} (r - r^*)\, n_r = \sum_{r=1}^{k} r\, n_r - \sum_{r=1}^{k} (r+1)\, n_{r+1} = n_1 - (k+1)\, n_{k+1}$   (1)

• Substituting 2. into 1.:

  $\mu \sum_{r=1}^{k} n_r (r - r^*) = n_1$   (2)

Page 50

Katz Back-off

• Combining (1) and (2) gives $\mu = \dfrac{n_1}{n_1 - (k+1)\, n_{k+1}}$, so

  $d_r = 1 - \mu\left(1 - \dfrac{r^*}{r}\right) = \dfrac{n_1 \dfrac{r^*}{r} - (k+1)\, n_{k+1}}{n_1 - (k+1)\, n_{k+1}}$

• Dividing the numerator and denominator by $n_1$:

  $d_r = \dfrac{\dfrac{r^*}{r} - \dfrac{(k+1)\, n_{k+1}}{n_1}}{1 - \dfrac{(k+1)\, n_{k+1}}{n_1}}$

Page 51

Katz Back-off

• Take the conditional probabilities of bigrams (n-gram, n = 2) for example:

1. $P_{Katz}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]} & \text{if } r > k \\[6pt] d_r\, \dfrac{C[w_{i-1} w_i]}{C[w_{i-1}]} & \text{if } k \ge r > 0 \\[6pt] \alpha(w_{i-1})\, P_{Katz}(w_i) & \text{if } r = 0 \end{cases}$

  with $d_r = \dfrac{\dfrac{r^*}{r} - \dfrac{(k+1)\, n_{k+1}}{n_1}}{1 - \dfrac{(k+1)\, n_{k+1}}{n_1}}$

2. $\alpha(w_{i-1}) = \dfrac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} P_{Katz}(w_i \mid w_{i-1})}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P_{Katz}(w_i)}$

Page 52

Katz Back-off: Example

• A small vocabulary consists of only five words, i.e., $V = \{w_1, w_2, \ldots, w_5\}$. The frequency counts for word pairs starting with $w_1$ are:

  $C[w_1, w_2] = 3,\ C[w_1, w_3] = 2,\ C[w_1, w_4] = 1,\ C[w_1, w_1] = C[w_1, w_5] = 0$

  and the word frequency counts are:

  $C[w_1] = 6,\ C[w_2] = 8,\ C[w_3] = 10,\ C[w_4] = 6,\ C[w_5] = 4$

  Katz back-off smoothing with the Good-Turing estimate is used here for word pairs with frequency counts equal to or less than two (k = 2). Show the conditional probabilities of word bigrams starting with $w_1$, i.e., $P_{Katz}(w_1 \mid w_1), P_{Katz}(w_2 \mid w_1), \ldots, P_{Katz}(w_5 \mid w_1)$.

Page 53

Katz Back-off: Example

$r^* = (r+1)\, \dfrac{n_{r+1}}{n_r}$, where $n_r$ is the number of n-grams that occur exactly r times in the training data. Here $n_1 = n_2 = n_3 = 1$.

For r = 3 > k: $P_{Katz}(w_2 \mid w_1) = P_{ML}(w_2 \mid w_1) = \dfrac{3}{6} = \dfrac{1}{2}$

$1^* = (1+1) \cdot \dfrac{1}{1} = 2 \qquad 2^* = (2+1) \cdot \dfrac{1}{1} = 3$

$d_2 = \dfrac{\dfrac{3}{2} - \dfrac{3 \cdot 1}{1}}{1 - \dfrac{3 \cdot 1}{1}} = \dfrac{3}{4} \qquad d_1 = \dfrac{\dfrac{2}{1} - \dfrac{3 \cdot 1}{1}}{1 - \dfrac{3 \cdot 1}{1}} = \dfrac{1}{2}$

For r = 2: $P_{Katz}(w_3 \mid w_1) = d_2\, P_{ML}(w_3 \mid w_1) = \dfrac{3}{4} \cdot \dfrac{2}{6} = \dfrac{1}{4}$

For r = 1: $P_{Katz}(w_4 \mid w_1) = d_1\, P_{ML}(w_4 \mid w_1) = \dfrac{1}{2} \cdot \dfrac{1}{6} = \dfrac{1}{12}$

$\alpha(w_1) = \dfrac{1 - \dfrac{1}{2} - \dfrac{1}{4} - \dfrac{1}{12}}{\dfrac{6 + 4}{34}} = \dfrac{1}{6} \cdot \dfrac{34}{10} = \dfrac{17}{30}$

For r = 0: $P_{Katz}(w_1 \mid w_1) = \alpha(w_1)\, P_{ML}(w_1) = \dfrac{17}{30} \cdot \dfrac{6}{34} = \dfrac{1}{10}$

$P_{Katz}(w_5 \mid w_1) = \alpha(w_1)\, P_{ML}(w_5) = \dfrac{17}{30} \cdot \dfrac{4}{34} = \dfrac{1}{15}$

And $P_{Katz}(w_1 \mid w_1) + P_{Katz}(w_2 \mid w_1) + \cdots + P_{Katz}(w_5 \mid w_1) = 1$
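A sketch that reproduces this worked example with exact fractions; the data structures and helper names are mine:

```python
from fractions import Fraction as F

big = {"w1": 0, "w2": 3, "w3": 2, "w4": 1, "w5": 0}   # C[w1, w]
uni = {"w1": 6, "w2": 8, "w3": 10, "w4": 6, "w5": 4}  # C[w]
k = 2
n = {1: 1, 2: 1, 3: 1}                                # n_r for these bigrams

def d(r):
    """Katz discount via Good-Turing: r* = (r+1) n_{r+1} / n_r."""
    r_star = F(r + 1) * n.get(r + 1, 0) / n[r]
    cut = F(k + 1) * n.get(k + 1, 0) / n[1]
    return (r_star / r - cut) / (1 - cut)

N = sum(uni.values())                                 # 34
p_ml = lambda w: F(uni[w], N)

seen = {w: (F(r, uni["w1"]) if r > k else d(r) * F(r, uni["w1"]))
        for w, r in big.items() if r > 0}
alpha = (1 - sum(seen.values())) / sum(p_ml(w) for w, r in big.items() if r == 0)
p = {w: seen.get(w, alpha * p_ml(w)) for w in big}
print(p)                 # w1: 1/10, w2: 1/2, w3: 1/4, w4: 1/12, w5: 1/15
print(sum(p.values()))   # 1
```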

Page 54

Kneser-Ney Back-off smoothing

• Absolute discounting

• The lower-order (back-off) n-gram probability is made proportional not to the number of occurrences of a word, but instead to the number of different words that it follows.

Page 55

Kneser-Ney Back-off smoothing

• Take the conditional probabilities of bigrams for example:

  $P_{KN}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{\max\{C[w_{i-1} w_i] - D,\ 0\}}{C[w_{i-1}]} & \text{if } C[w_{i-1} w_i] > 0 \\[6pt] \alpha(w_{i-1})\, P_{KN}(w_i) & \text{otherwise} \end{cases}$, with $0 < D < 1$

1. $P_{KN}(w_i) = \dfrac{C[\bullet\, w_i]}{\sum_{w_j} C[\bullet\, w_j]}$, where $C[\bullet\, w_i]$ is the number of different words preceding $w_i$

2. $\alpha(w_{i-1}) = \dfrac{1 - \sum_{w_i:\, C[w_{i-1} w_i] > 0} \dfrac{\max\{C[w_{i-1} w_i] - D,\ 0\}}{C[w_{i-1}]}}{\sum_{w_i:\, C[w_{i-1} w_i] = 0} P_{KN}(w_i)}$

Page 56

Kneser-Ney Back-off smoothing: Example

• Given a text sequence: S A B C A A B B C S (S is the sequence start/end marker)

  Show the corresponding unigram conditional probabilities:

  $C[\bullet\, A] = 3 \qquad C[\bullet\, B] = 2 \qquad C[\bullet\, C] = 1 \qquad C[\bullet\, S] = 1$

  $\Rightarrow P_{KN}(A) = \dfrac{3}{7} \qquad P_{KN}(B) = \dfrac{2}{7} \qquad P_{KN}(C) = \dfrac{1}{7} \qquad P_{KN}(S) = \dfrac{1}{7}$
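A sketch of the continuation counts behind this example; the variable names are mine:

```python
seq = ["S", "A", "B", "C", "A", "A", "B", "B", "C", "S"]

# C[. w]: the set of *different* words that precede w
preceders = {}
for prev, w in zip(seq, seq[1:]):
    preceders.setdefault(w, set()).add(prev)

cont = {w: len(s) for w, s in preceders.items()}   # C[. A]=3, C[. B]=2, ...
total = sum(cont.values())                         # 7
for w in sorted(cont):
    print(w, f"{cont[w]}/{total}")                 # A 3/7, B 2/7, C 1/7, S 1/7
```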

Page 57

Evaluation

• Cross entropy:

  $H(X, q) = H(X) + D(p \| q) = -\sum_x p(x) \log q(x)$

• Perplexity = $2^{\text{Entropy}}$

• A LM that assigned probability 1/100 to each of 100 words would have perplexity 100:

  $\text{Entropy} = -\sum_{i=1}^{100} \dfrac{1}{100} \log_2 \dfrac{1}{100} = \sum_{i=1}^{100} \dfrac{1}{100} \log_2 100 = \log_2 100$

  $\text{perplexity} = 2^{\log_2 100} = 100$

Page 58

Evaluation

• In general, the perplexity of a LM is equal to the geometric average of the inverse probability of the words measured on test data:

$\text{perplexity} = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$
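A minimal sketch of computing perplexity from per-word conditional probabilities (taken from any LM; the names are illustrative):

```python
import math

def perplexity(word_probs):
    """word_probs: P(w_i | w_1..w_{i-1}) for each word of the test data."""
    H = -sum(math.log2(p) for p in word_probs) / len(word_probs)  # entropy
    return 2 ** H

# Uniform 1/100 on every word reproduces the earlier example: perplexity 100
print(round(perplexity([1 / 100] * 50), 3))   # 100.0
```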

Page 59

Evaluation

$\begin{aligned}
\text{perplexity} &= \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_1 \ldots w_{i-1})}} \\
&= \left(\prod_{i=1}^{N} P(w_i \mid w_1 \ldots w_{i-1})\right)^{-\frac{1}{N}} \\
&= 2^{-\frac{1}{N} \log_2 \prod_{i=1}^{N} P(w_i \mid w_1 \ldots w_{i-1})} \\
&= 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1 \ldots w_{i-1})} = 2^{\text{Entropy}}
\end{aligned}$

using $P(w_1^N) = \prod_i P(w_i \mid w_1 \ldots w_{i-1})$, so that $\log P(w_1^N) = \sum_i \log P(w_i \mid w_1 \ldots w_{i-1})$.

Page 60

Evaluation

• The "true" model for any data source will have the lowest possible perplexity.

• The lower the perplexity of our model, the closer it is, in some sense, to the true model.

• Entropy is simply log2 of perplexity.

• Entropy is the average number of bits per word that would be necessary to encode the test data using an optimal coder.

Page 61

Evaluation

entropy reduction (bits):  .01    .1    .16   .2    .3    .4    .5    .75   1
perplexity reduction:      0.69%  6.7%  10%   13%   19%   24%   29%   41%   50%

• Entropy reduced from 5 to 4 bits: perplexity drops from 32 to 16, a 50% reduction.
• Entropy reduced from 5 to 4.5 bits: perplexity drops from 32 to $2^{4.5} \approx 22.6$, a 29.3% reduction.

Page 62

Conclusions

• A number of smoothing methods are available, and they often offer similar, good performance.

• More powerful combining methods?