Lecture 4: Language Model Evaluation and Advanced Methods
Kai-Wei Chang, CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
- Kneser-Ney smoothing
- Discriminative language models
- Neural language models
- Evaluation: cross-entropy and perplexity
Recap: Smoothing
- Add-one smoothing
- Add-λ smoothing
  - parameters tuned by cross-validation
- Witten-Bell smoothing
  - T: # of word types, N: # of tokens
  - T/(N+T): total prob. mass for unseen words
  - N/(N+T): total prob. mass for observed tokens
- Good-Turing
  - Reallocate the probability mass of n-grams that occur r+1 times to n-grams that occur r times
Recap: Back-off and interpolation
- Idea: even if we've never seen "red glasses", we know it is more likely to occur than "red abacus"
- Interpolation:
  p_average(z | x y) = µ3 p(z | x y) + µ2 p(z | y) + µ1 p(z),
  where µ3 + µ2 + µ1 = 1 and all are ≥ 0
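As a concrete illustration, here is a minimal Python sketch of this interpolation, assuming trigram/bigram/unigram count dictionaries have already been collected from the training data (all names and the example weights are illustrative, not from the lecture):

```python
def p_interpolated(z, x, y, tri_counts, bi_counts, uni_counts, total_tokens,
                   mu3=0.6, mu2=0.3, mu1=0.1):
    """p_average(z | x y) = mu3*p(z | x y) + mu2*p(z | y) + mu1*p(z),
    with mu3 + mu2 + mu1 = 1 and each mu >= 0."""
    p_tri = tri_counts.get((x, y, z), 0) / max(bi_counts.get((x, y), 0), 1)
    p_bi = bi_counts.get((y, z), 0) / max(uni_counts.get(y, 0), 1)
    p_uni = uni_counts.get(z, 0) / total_tokens
    return mu3 * p_tri + mu2 * p_bi + mu1 * p_uni
```

In practice the weights µ are tuned on held-out data rather than fixed by hand.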
Absolute Discounting
- Save ourselves some time and just subtract 0.75 (or some d)!
- But should we really just use the regular unigram P(w)?
P_AbsoluteDiscounting(w_i | w_{i−1}) = ( c(w_{i−1}, w_i) − d ) / c(w_{i−1}) + λ(w_{i−1}) P(w_i)

  ( c(w_{i−1}, w_i) − d ) / c(w_{i−1}): the discounted bigram
  λ(w_{i−1}): the interpolation weight
  P(w_i): the unigram
Kneser-Ney Smoothing
- Better estimate for probabilities of lower-order unigrams!
- Shannon game: I can't see without my reading ___________?
- "Francisco" is more common than "glasses"
- ... but "Francisco" always follows "San"
Kneser-Ney Smoothing
- Instead of P(w): "How likely is w?"
- P_continuation(w): "How likely is w to appear as a novel continuation?"
- For each word, count the number of bigram types it completes
- Every bigram type was a novel continuation the first time it was seen

  P_CONTINUATION(w) ∝ |{w_{i−1} : c(w_{i−1}, w) > 0}|
Kneser-Ney Smoothing
- How many times does w appear as a novel continuation?
  P_CONTINUATION(w) ∝ |{w_{i−1} : c(w_{i−1}, w) > 0}|
- Normalized by the total number of word bigram types:
  P_CONTINUATION(w) = |{w_{i−1} : c(w_{i−1}, w) > 0}| / |{(w_{j−1}, w_j) : c(w_{j−1}, w_j) > 0}|
Kneser-Ney Smoothing
- Alternative metaphor: the number of word types seen to precede w: |{w_{i−1} : c(w_{i−1}, w) > 0}|
- normalized by the number of word types preceding all words:
  P_CONTINUATION(w) = |{w_{i−1} : c(w_{i−1}, w) > 0}| / Σ_{w′} |{w′_{i−1} : c(w′_{i−1}, w′) > 0}|
- A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
Kneser-Ney Smoothing
P_KN(w_i | w_{i−1}) = max(c(w_{i−1}, w_i) − d, 0) / c(w_{i−1}) + λ(w_{i−1}) P_CONTINUATION(w_i)

λ(w_{i−1}) = ( d / c(w_{i−1}) ) · |{w : c(w_{i−1}, w) > 0}|

λ is a normalizing constant: the probability mass we've discounted.
d / c(w_{i−1}) is the normalized discount; |{w : c(w_{i−1}, w) > 0}| is the number of word types that can follow w_{i−1} = # of word types we discounted = # of times we applied the normalized discount.
Kneser-Ney Smoothing: Recursive formulation
P_KN(w_i | w_{i−n+1}^{i−1}) = max(c_KN(w_{i−n+1}^{i}) − d, 0) / c_KN(w_{i−n+1}^{i−1}) + λ(w_{i−n+1}^{i−1}) P_KN(w_i | w_{i−n+2}^{i−1})

c_KN(·) = count(·) for the highest order; continuation count(·) for lower orders

Continuation count = the number of unique single-word contexts for ·
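To make the bigram case concrete, here is a small Python sketch of interpolated Kneser-Ney with absolute discount d = 0.75; the function names and data structures are illustrative, not from the lecture:

```python
from collections import defaultdict

def train_kneser_ney_bigram(tokens, d=0.75):
    """Return p(w | prev) for an interpolated Kneser-Ney bigram model."""
    bigram = defaultdict(int)      # c(prev, w)
    context = defaultdict(int)     # c(prev), counted as a bigram context
    followers = defaultdict(set)   # word types that follow prev
    preceders = defaultdict(set)   # word types that precede w
    for prev, w in zip(tokens, tokens[1:]):
        bigram[(prev, w)] += 1
        context[prev] += 1
        followers[prev].add(w)
        preceders[w].add(prev)
    num_bigram_types = len(bigram)

    def p(w, prev):
        # P_CONTINUATION(w): fraction of bigram types that w completes
        p_cont = len(preceders[w]) / num_bigram_types
        if context[prev] == 0:     # unseen context: fall back to the continuation prob.
            return p_cont
        discounted = max(bigram[(prev, w)] - d, 0) / context[prev]
        lam = d * len(followers[prev]) / context[prev]
        return discounted + lam * p_cont
    return p

# Example: p = train_kneser_ney_bigram("the red glasses are red".split()); p("red", "the")
```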
Practical issue: Huge web-scale n-grams

- How to deal with, e.g., the Google N-gram corpus?
- Pruning
  - Only store N-grams with count > threshold
  - Remove singletons of higher-order n-grams
Huge web-scale n-grams
- Efficiency
  - Efficient data structures, e.g., tries
  - Store words as indexes, not strings
  - Quantize probabilities (4-8 bits instead of an 8-byte float)
https://en.wikipedia.org/wiki/Trie
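The sketch below illustrates two of these tricks, word indexes and coarse probability quantization; the bucket range and bit width are assumptions for illustration, not values from any particular corpus:

```python
vocab = {}  # word -> integer index

def word_id(w):
    return vocab.setdefault(w, len(vocab))

def quantize(logprob, lo=-20.0, hi=0.0, bits=8):
    """Map a log-probability into one of 2^bits buckets (lossy but compact)."""
    levels = (1 << bits) - 1
    clipped = min(max(logprob, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)

def dequantize(bucket, lo=-20.0, hi=0.0, bits=8):
    levels = (1 << bits) - 1
    return lo + (bucket / levels) * (hi - lo)

# Store an n-gram as a tuple of word ids plus an 8-bit quantized log-probability.
table = {tuple(word_id(w) for w in "the red glasses".split()): quantize(-4.7)}
```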
Smoothing
This dark art is why NLP is taught in the engineering school.
There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique.
Conditional Modeling
- Generative language model (trigram model):
  P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_{n−2}, w_{n−1})
- Then, we compute the conditional probabilities by maximum likelihood estimation.
- Can we model P(w_i | w_{i−2}, w_{i−1}) directly?
- Given a context x, which outcomes y are likely in that context?
  P(NextWord = y | PrecedingWords = x)
Modeling conditional probabilities
- Let's assume
  P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')),
  where y: NextWord, x: PrecedingWords
- P(y | x) is high ⇔ score(x, y) is high
- This is called the soft-max
- It requires that P(y | x) ≥ 0 and Σ_y P(y | x) = 1; this is not true of score(x, y) itself
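A minimal sketch of this soft-max, assuming score(x, y) has already been computed for each candidate y (subtracting the max is just for numerical stability):

```python
import math

def softmax(scores):
    """scores: dict mapping each candidate y to score(x, y).
    Returns P(y | x) = exp(score(x, y)) / sum over y' of exp(score(x, y'))."""
    m = max(scores.values())                       # subtract max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Example: softmax({"glasses": 2.0, "abacus": -1.0})
```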
Linear Scoring
- Score(x, y): How well does y go with x?
- Simplest option: a linear function of (x, y). But (x, y) isn't a number ⇒ describe it by some numbers (i.e., numeric features), then just use a linear function of those numbers:

  Score(x, y) = Σ_k θ_k f_k(x, y)

- k ranges over all features. f_k(x, y) can be whether (x, y) has feature k (0 or 1), how many times the feature fires (≥ 0), or how strongly it fires (a real number).
- θ_k: weight of the k-th feature, to be learned.
What features should we use?
- Model p(w_i | w_{i−2}, w_{i−1}): features f_k("w_{i−2} w_{i−1}", "w_i") for Score("w_{i−2} w_{i−1}", "w_i") can be:
  - # of times "w_{i−1}" appears in the training corpus
  - 1 if "w_i" is an unseen word; 0 otherwise
  - 1 if "w_{i−2} w_{i−1}" = "a red"; 0 otherwise
  - 1 if "w_{i−1}" belongs to the "color" category; 0 otherwise
What features should we use?
- Model p("glasses" | "a red"): features f_k("a red", "glasses") for Score("a red", "glasses") can be:
  - # of times "red" appears in the training corpus
  - 1 if "a" is an unseen word; 0 otherwise
  - 1 if "a red" = "a red"; 0 otherwise
  - 1 if "red" belongs to the "color" category; 0 otherwise
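Putting the two slides together, a hypothetical feature function and linear score might look like this; the feature set, names, and example weights are made up for illustration:

```python
COLORS = {"red", "blue", "green", "yellow"}

def features(context, y):
    """context: (w_{i-2}, w_{i-1}); y: candidate next word."""
    w2, w1 = context
    return {
        "context=a red": 1.0 if (w2, w1) == ("a", "red") else 0.0,
        "prev_is_color": 1.0 if w1 in COLORS else 0.0,
        "next=" + y: 1.0,
    }

def score(context, y, theta):
    """Linear score: sum over features k of theta_k * f_k(x, y)."""
    return sum(theta.get(k, 0.0) * v for k, v in features(context, y).items())

# Example: score(("a", "red"), "glasses", {"prev_is_color": 0.5, "next=glasses": 1.2})
```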
Log-Linear Conditional Probability
p(y | x) = (1 / Z(x)) exp(Σ_k θ_k f_k(x, y))

- exp(Σ_k θ_k f_k(x, y)) is an unnormalized probability (at least it's positive!)
- We choose Z(x) = Σ_{y'} exp(Σ_k θ_k f_k(x, y')) to ensure that Σ_y p(y | x) = 1; Z(x) is thus called the partition function.
Training θ

- n training examples
- feature functions f1, f2, ...
- Want to maximize p(training data | θ)
- Easier to maximize the log of that: Σ_i log p(y_i | x_i; θ)
- Alas, some weights θ_i may be optimal at −∞ or +∞. When would this happen? What's going "wrong"?

This version is "discriminative training": to learn to predict y from x, maximize p(y | x). Whereas in "generative models", we learn to model x too, by maximizing p(x, y).
Generalization via Regularization
- n training examples
- feature functions f1, f2, ...
- Want to maximize p(training data | θ) ⋅ p_prior(θ)
- Easier to maximize the log of that
- Encourages weights close to 0
- "L2 regularization": corresponds to a Gaussian prior p(θ) ∝ e^{−‖θ‖² / (2σ²)}
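Written out (a standard form, assuming an independent Gaussian prior with variance σ² on each weight), the regularized objective becomes:

\max_\theta \; \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) \;-\; \sum_k \frac{\theta_k^2}{2\sigma^2}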
Gradient-based training
- Gradually adjust θ in a direction that improves the objective.

Gradient ascent to gradually increase f(θ):

  while (∇f(θ) ≠ 0)       // not at a local max or min
      θ = θ + η ∇f(θ)      // for some small η > 0

Remember: ∇f(θ) = (∂f(θ)/∂θ1, ∂f(θ)/∂θ2, ...), so the update means θ_k += η ∂f(θ)/∂θ_k.
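A runnable version of the pseudocode above, assuming we are handed a function that returns ∇f(θ); the step size η and stopping tolerance are arbitrary illustrative choices:

```python
import numpy as np

def gradient_ascent(grad_f, theta, eta=0.01, tol=1e-6, max_iters=100000):
    """Repeat theta += eta * grad_f(theta) until the gradient (nearly) vanishes."""
    for _ in range(max_iters):
        g = grad_f(theta)
        if np.linalg.norm(g) < tol:   # approximately at a stationary point
            break
        theta = theta + eta * g
    return theta

# Example: maximize f(theta) = -(theta - 3)^2, whose gradient is -2*(theta - 3):
# gradient_ascent(lambda t: -2 * (t - 3), np.array([0.0]))  -> approximately [3.0]
```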
Gradient-based training
- Gradually adjust θ in a direction that improves the objective
- Gradient w.r.t. θ (observed minus expected feature counts):

  ∂/∂θ_k Σ_i log p(y_i | x_i; θ) = Σ_i [ f_k(x_i, y_i) − Σ_y p(y | x_i; θ) f_k(x_i, y) ]
More complex assumption?
- P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')),
  where y: NextWord, x: PrecedingWords
- Assume we saw:
  red glasses; yellow glasses; green glasses; blue glasses
  red shoes; yellow shoes; green shoes;
  What is P(shoes | blue)?
- Can we learn categories of words (representations) automatically?
- Can we build a high-order n-gram model without blowing up the model size?
Neural language model
- Model P(y | x) with a neural network
- Example 1: a one-hot vector, where each component of the vector represents one word, e.g., [0, 0, 1, 0, 0]
- Example 2: word embeddings
Neural language model
- Model P(y | x) with a neural network:
  - Learned matrices project the input vectors
  - The projected vectors are concatenated
  - A non-linear function is applied, e.g., h = tanh(W c + b)
  - Obtain P(y | x) by performing a softmax
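A minimal NumPy sketch of one step of such a feed-forward neural LM; the matrix names and shapes are assumptions for illustration (E: embedding/projection matrix, W and b: hidden layer, U and c: output layer):

```python
import numpy as np

def neural_lm_step(context_ids, E, W, b, U, c):
    """Return P(y | x) over the vocabulary for a fixed-size context."""
    x = np.concatenate([E[i] for i in context_ids])  # project and concatenate input words
    h = np.tanh(W @ x + b)                           # non-linear hidden layer
    logits = U @ h + c                               # one score per vocabulary word
    logits -= logits.max()                           # numerical stability
    p = np.exp(logits)
    return p / p.sum()                               # softmax

# Example shapes: |V| = 5, embedding size 3, hidden size 4, two context words:
# E = np.random.randn(5, 3); W = np.random.randn(4, 6); b = np.zeros(4)
# U = np.random.randn(5, 4); c = np.zeros(5); neural_lm_step([0, 2], E, W, b, U, c)
```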
Why?
- Potentially generalizes to unseen contexts
- Example: P("red" | "the", "shoes", "are")
  - This does not occur in the training corpus, but ["the", "glasses", "are", "red"] does.
  - If the word representations of "red" and "blue" are similar, then the model can generalize.
- Why are "red" and "blue" similar?
  - Because the NN saw "red skirt", "blue skirt", "red pen", "blue pen", etc.
Training neural language models
- Can use gradient ascent as well
- Use the chain rule to derive the gradient, a.k.a. back-propagation
- More complex NN architectures can be used, e.g., LSTMs, character-based models
Language model evaluation
- How to compare models? We need an unseen test set. Why?
- Information theory: the study of the resolution of uncertainty
- Perplexity: measures how well a probability distribution predicts a sample
Cross-Entropy
- A common measure of model quality
  - Task-independent
  - Continuous: slight improvements show up here even if they don't change the # of right answers on a task
- Just measure the probability of (enough) test data
  - Higher probability means the model better predicts the future
- There's a limit to how well you can predict random stuff
  - The limit depends on "how random" the dataset is (easier to predict weather than headlines, especially in Arizona)
Cross-Entropy (“xent”)
- Want the probability of the test data to be high:
  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ...
  = 1/8 * 1/8 * 1/8 * 1/16 ...
- high prob → low xent, by 3 cosmetic improvements:
  - Take the logarithm (base 2) to prevent underflow:
    log(1/8 * 1/8 * 1/8 * 1/16 ...) = log 1/8 + log 1/8 + log 1/8 + log 1/16 ... = (−3) + (−3) + (−3) + (−4) + ...
  - Negate to get a positive value in bits: 3 + 3 + 3 + 4 + ...
  - Divide by the length of the text → 3.25 bits per letter (or per word)
- Average? Geometric average of 1/2³, 1/2³, 1/2³, 1/2⁴ = 1/2^3.25 ≈ 1/9.5
Cross-Entropy (“xent”)
- Want the probability of the test data to be high:
  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ...
  = 1/8 * 1/8 * 1/8 * 1/16 ...
- Cross-entropy → 3.25 bits per letter (or per word)
- Want this to be small (equivalent to wanting good compression!)
- The lower limit is called entropy: obtained in principle as the cross-entropy of the true model measured on an infinite amount of data
- perplexity = 2^xent (meaning ≈ 9.5 choices)
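The computation above in a few lines of Python, using the same per-word probabilities from the example:

```python
import math

probs = [1/8, 1/8, 1/8, 1/16]                            # per-token model probabilities
xent = -sum(math.log2(p) for p in probs) / len(probs)    # (3 + 3 + 3 + 4) / 4 = 3.25 bits
perplexity = 2 ** xent                                    # 2^3.25, about 9.5 effective choices
```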
More math: Entropy H(X)
- The entropy H(p) of a discrete random variable X is the expected negative log probability:
  H(p) = −Σ_x p(x) log₂ p(x)
- Entropy is a measure of uncertainty
Entropy of coin tossing
- Toss a coin: P(H) = p, P(T) = 1 − p
- H(p) = −p log₂ p − (1 − p) log₂(1 − p)
  - p = 0.5: H(p) = 1
  - p = 1: H(p) = 0
How many bits to encode messages
- Consider four letters (A, B, C, D):
- If p = (½, ½, 0, 0), how many bits per letter on average to encode a message ~ p?
  - Encode A as 0, B as 1; AAABBBAA ⇒ 00011100
- If p = (¼, ¼, ¼, ¼):
  - A: 00, B: 01, C: 10, D: 11; ABDA ⇒ 00011100
- How about p = (½, ¼, ¼, 0)?
  - A: 0, B: 10, C: 11; AAACBA ⇒ 00011100
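For p = (½, ¼, ¼, 0), the variable-length code above uses ½·1 + ¼·2 + ¼·2 = 1.5 bits per letter on average, which equals H(p) = 1.5 bits.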
More math: Cross Entropy
- Cross-entropy: the avg. # of bits to encode events ~ p(x) using a coding scheme m(x):
  H(p, m) = −Σ_x p(x) log₂ m(x)
- Not symmetric: H(p, m) ≠ H(m, p)
- Lower bounded by H(p)
- Let p = (½, ¼, ¼, 0) and encode A: 00, B: 01, C: 10, D: 11 (i.e., m = (¼, ¼, ¼, ¼))
- AAACBA? ⇒ 000000100100
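Worked out: H(p, m) = −(½ log₂ ¼ + ¼ log₂ ¼ + ¼ log₂ ¼) = ½·2 + ¼·2 + ¼·2 = 2 bits per letter, so the uniform code m wastes 0.5 bits per letter compared with H(p) = 1.5 bits; indeed AAACBA ⇒ 000000100100 takes 12 bits (2 bits per letter) instead of the 8 needed with the code matched to p on the previous slide.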
Perplexity and geometric mean
Perplexity
Language model m is better than m′ if it assigns lower perplexity (i.e., lower cross-entropy, and hence higher probability) to the test corpus w1 ... wN:

  Perplexity(w1 ... wN) = 2^H(w1 ... wN)
                        = 2^(−(1/N) log₂ m(w1 ... wN))
                        = m(w1 ... wN)^(−1/N)
                        = the N-th root of 1 / m(w1 ... wN)
An experiment
- Train: 38M words of WSJ text, |V| = 20k
- Test: 1.5M words of WSJ text
- Word-level LSTM: perplexity ≈ 85
- Char-level model: ≈ 79
An experiment

- Models: unigram, bigram, trigram (with Good-Turing smoothing)
- Training data: 38M words of WSJ text (vocabulary: 20K types)
- Test data: 1.5M words of WSJ text
- Results:

  Model       Unigram   Bigram   Trigram
  Perplexity  962       170      109