Lecture 4: Language Model Evaluation and Advanced Methods
Kai-Wei Chang, CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
- Kneser-Ney smoothing
- Discriminative language models
- Neural language models
- Evaluation: cross-entropy and perplexity
Recap: Smoothing
- Add-one smoothing
- Add-λ smoothing
  - parameters tuned by cross-validation
- Witten-Bell smoothing
  - T: # of word types, N: # of tokens
  - T/(N+T): total prob. mass for unseen words
  - N/(N+T): total prob. mass for observed tokens
- Good-Turing
  - Reallocate the probability mass of n-grams that occur r+1 times to n-grams that occur r times
Recap: Back-off and interpolation
- Idea: even if we've never seen "red glasses", we know it is more likely to occur than "red abacus"
- Interpolation:
  p_average(z | x y) = µ3 p(z | x y) + µ2 p(z | y) + µ1 p(z),
  where µ3 + µ2 + µ1 = 1 and all are ≥ 0
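As a concrete illustration, here is a minimal Python sketch of this interpolation, assuming trigram/bigram/unigram count dictionaries have already been collected from the training data (all names and the example weights are illustrative, not from the lecture):

```python
def p_interpolated(z, x, y, tri_counts, bi_counts, uni_counts, total_tokens,
                   mu3=0.6, mu2=0.3, mu1=0.1):
    """p_average(z | x y) = mu3*p(z | x y) + mu2*p(z | y) + mu1*p(z),
    with mu3 + mu2 + mu1 = 1 and each mu >= 0."""
    p_tri = tri_counts.get((x, y, z), 0) / max(bi_counts.get((x, y), 0), 1)
    p_bi = bi_counts.get((y, z), 0) / max(uni_counts.get(y, 0), 1)
    p_uni = uni_counts.get(z, 0) / total_tokens
    return mu3 * p_tri + mu2 * p_bi + mu1 * p_uni
```

In practice the weights µ are tuned on held-out data rather than fixed by hand.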
Absolute Discounting
- Save ourselves some time and just subtract 0.75 (or some d)!
- But should we really just use the regular unigram P(w)?
P_AbsoluteDiscounting(w_i | w_{i−1}) = ( c(w_{i−1}, w_i) − d ) / c(w_{i−1}) + λ(w_{i−1}) P(w_i)

  ( c(w_{i−1}, w_i) − d ) / c(w_{i−1}): the discounted bigram
  λ(w_{i−1}): the interpolation weight
  P(w_i): the unigram
Kneser-Ney Smoothing
- Better estimate for probabilities of lower-order unigrams!
- Shannon game: I can't see without my reading ___________?
- "Francisco" is more common than "glasses"
- ... but "Francisco" always follows "San"
Kneser-Ney Smoothing
- Instead of P(w): "How likely is w?"
- P_continuation(w): "How likely is w to appear as a novel continuation?"
- For each word, count the number of bigram types it completes
- Every bigram type was a novel continuation the first time it was seen

  P_CONTINUATION(w) ∝ |{w_{i−1} : c(w_{i−1}, w) > 0}|
Kneser-Ney Smoothing
- How many times does w appear as a novel continuation?
  P_CONTINUATION(w) ∝ |{w_{i−1} : c(w_{i−1}, w) > 0}|
- Normalized by the total number of word bigram types:
  P_CONTINUATION(w) = |{w_{i−1} : c(w_{i−1}, w) > 0}| / |{(w_{j−1}, w_j) : c(w_{j−1}, w_j) > 0}|
Kneser-Ney Smoothing
- Alternative metaphor: the number of word types seen to precede w: |{w_{i−1} : c(w_{i−1}, w) > 0}|
- normalized by the number of word types preceding all words:
  P_CONTINUATION(w) = |{w_{i−1} : c(w_{i−1}, w) > 0}| / Σ_{w′} |{w′_{i−1} : c(w′_{i−1}, w′) > 0}|
- A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
Kneser-Ney Smoothing
P_KN(w_i | w_{i−1}) = max(c(w_{i−1}, w_i) − d, 0) / c(w_{i−1}) + λ(w_{i−1}) P_CONTINUATION(w_i)

λ(w_{i−1}) = ( d / c(w_{i−1}) ) · |{w : c(w_{i−1}, w) > 0}|

λ is a normalizing constant: the probability mass we've discounted.
d / c(w_{i−1}) is the normalized discount; |{w : c(w_{i−1}, w) > 0}| is the number of word types that can follow w_{i−1} = # of word types we discounted = # of times we applied the normalized discount.
Kneser-Ney Smoothing: Recursive formulation
P_KN(w_i | w_{i−n+1}^{i−1}) = max(c_KN(w_{i−n+1}^{i}) − d, 0) / c_KN(w_{i−n+1}^{i−1}) + λ(w_{i−n+1}^{i−1}) P_KN(w_i | w_{i−n+2}^{i−1})

c_KN(·) = count(·) for the highest order; continuation count(·) for lower orders

Continuation count = the number of unique single-word contexts for ·
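To make the bigram case concrete, here is a small Python sketch of interpolated Kneser-Ney with absolute discount d = 0.75; the function names and data structures are illustrative, not from the lecture:

```python
from collections import defaultdict

def train_kneser_ney_bigram(tokens, d=0.75):
    """Return p(w | prev) for an interpolated Kneser-Ney bigram model."""
    bigram = defaultdict(int)      # c(prev, w)
    context = defaultdict(int)     # c(prev), counted as a bigram context
    followers = defaultdict(set)   # word types that follow prev
    preceders = defaultdict(set)   # word types that precede w
    for prev, w in zip(tokens, tokens[1:]):
        bigram[(prev, w)] += 1
        context[prev] += 1
        followers[prev].add(w)
        preceders[w].add(prev)
    num_bigram_types = len(bigram)

    def p(w, prev):
        # P_CONTINUATION(w): fraction of bigram types that w completes
        p_cont = len(preceders[w]) / num_bigram_types
        if context[prev] == 0:     # unseen context: fall back to the continuation prob.
            return p_cont
        discounted = max(bigram[(prev, w)] - d, 0) / context[prev]
        lam = d * len(followers[prev]) / context[prev]
        return discounted + lam * p_cont
    return p

# Example: p = train_kneser_ney_bigram("the red glasses are red".split()); p("red", "the")
```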
Practical issue: Huge web-scale n-grams

- How to deal with, e.g., the Google N-gram corpus?
- Pruning
  - Only store N-grams with count > threshold
  - Remove singletons of higher-order n-grams
Huge web-scale n-grams
- Efficiency
  - Efficient data structures, e.g., tries
  - Store words as indexes, not strings
  - Quantize probabilities (4-8 bits instead of an 8-byte float)
https://en.wikipedia.org/wiki/Trie
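The sketch below illustrates two of these tricks, word indexes and coarse probability quantization; the bucket range and bit width are assumptions for illustration, not values from any particular corpus:

```python
vocab = {}  # word -> integer index

def word_id(w):
    return vocab.setdefault(w, len(vocab))

def quantize(logprob, lo=-20.0, hi=0.0, bits=8):
    """Map a log-probability into one of 2^bits buckets (lossy but compact)."""
    levels = (1 << bits) - 1
    clipped = min(max(logprob, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)

def dequantize(bucket, lo=-20.0, hi=0.0, bits=8):
    levels = (1 << bits) - 1
    return lo + (bucket / levels) * (hi - lo)

# Store an n-gram as a tuple of word ids plus an 8-bit quantized log-probability.
table = {tuple(word_id(w) for w in "the red glasses".split()): quantize(-4.7)}
```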
Smoothing
This dark art is why NLP is taught in the engineering school.
There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique.
Conditional Modeling
- Generative language model (trigram model):
  P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_{n−2}, w_{n−1})
- Then, we compute the conditional probabilities by maximum likelihood estimation.
- Can we model P(w_i | w_{i−2}, w_{i−1}) directly?
- Given a context x, which outcomes y are likely in that context?
  P(NextWord = y | PrecedingWords = x)
Modeling conditional probabilities
- Let's assume
  P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')),
  where y: NextWord, x: PrecedingWords
- P(y | x) is high ⇔ score(x, y) is high
- This is called the soft-max
- It requires that P(y | x) ≥ 0 and Σ_y P(y | x) = 1; this is not true of score(x, y) itself
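A minimal sketch of this soft-max, assuming score(x, y) has already been computed for each candidate y (subtracting the max is just for numerical stability):

```python
import math

def softmax(scores):
    """scores: dict mapping each candidate y to score(x, y).
    Returns P(y | x) = exp(score(x, y)) / sum over y' of exp(score(x, y'))."""
    m = max(scores.values())                       # subtract max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Example: softmax({"glasses": 2.0, "abacus": -1.0})
```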
Linear Scoring
- Score(x, y): How well does y go with x?
- Simplest option: a linear function of (x, y). But (x, y) isn't a number ⇒ describe it by some numbers (i.e., numeric features), then just use a linear function of those numbers:

  Score(x, y) = Σ_k θ_k f_k(x, y)

- k ranges over all features. f_k(x, y) can be whether (x, y) has feature k (0 or 1), how many times the feature fires (≥ 0), or how strongly it fires (a real number).
- θ_k: weight of the k-th feature, to be learned.
What features should we use?
- Model p(w_i | w_{i−2}, w_{i−1}): features f_k("w_{i−2} w_{i−1}", "w_i") for Score("w_{i−2} w_{i−1}", "w_i") can be:
  - # of times "w_{i−1}" appears in the training corpus
  - 1 if "w_i" is an unseen word; 0 otherwise
  - 1 if "w_{i−2} w_{i−1}" = "a red"; 0 otherwise
  - 1 if "w_{i−1}" belongs to the "color" category; 0 otherwise
What features should we use?
- Model p("glasses" | "a red"): features f_k("a red", "glasses") for Score("a red", "glasses") can be:
  - # of times "red" appears in the training corpus
  - 1 if "a" is an unseen word; 0 otherwise
  - 1 if "a red" = "a red"; 0 otherwise
  - 1 if "red" belongs to the "color" category; 0 otherwise
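Putting the two slides together, a hypothetical feature function and linear score might look like this; the feature set, names, and example weights are made up for illustration:

```python
COLORS = {"red", "blue", "green", "yellow"}

def features(context, y):
    """context: (w_{i-2}, w_{i-1}); y: candidate next word."""
    w2, w1 = context
    return {
        "context=a red": 1.0 if (w2, w1) == ("a", "red") else 0.0,
        "prev_is_color": 1.0 if w1 in COLORS else 0.0,
        "next=" + y: 1.0,
    }

def score(context, y, theta):
    """Linear score: sum over features k of theta_k * f_k(x, y)."""
    return sum(theta.get(k, 0.0) * v for k, v in features(context, y).items())

# Example: score(("a", "red"), "glasses", {"prev_is_color": 0.5, "next=glasses": 1.2})
```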
Log-Linear Conditional Probability
p(y | x) = (1 / Z(x)) exp(Σ_k θ_k f_k(x, y))

- exp(Σ_k θ_k f_k(x, y)) is an unnormalized probability (at least it's positive!)
- We choose Z(x) = Σ_{y'} exp(Σ_k θ_k f_k(x, y')) to ensure that Σ_y p(y | x) = 1; Z(x) is thus called the partition function.
Training θ

- n training examples
- feature functions f1, f2, ...
- Want to maximize p(training data | θ)
- Easier to maximize the log of that: Σ_i log p(y_i | x_i; θ)
- Alas, some weights θ_i may be optimal at −∞ or +∞. When would this happen? What's going "wrong"?

This version is "discriminative training": to learn to predict y from x, maximize p(y | x). Whereas in "generative models", we learn to model x too, by maximizing p(x, y).
Generalization via Regularization
- n training examples
- feature functions f1, f2, ...
- Want to maximize p(training data | θ) ⋅ p_prior(θ)
- Easier to maximize the log of that
- Encourages weights close to 0
- "L2 regularization": corresponds to a Gaussian prior p(θ) ∝ e^{−‖θ‖² / (2σ²)}
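Written out (a standard form, assuming an independent Gaussian prior with variance σ² on each weight), the regularized objective becomes:

\max_\theta \; \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) \;-\; \sum_k \frac{\theta_k^2}{2\sigma^2}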
Gradient-based training
- Gradually adjust θ in a direction that improves the objective.

Gradient ascent to gradually increase f(θ):

  while (∇f(θ) ≠ 0)       // not at a local max or min
      θ = θ + η ∇f(θ)      // for some small η > 0

Remember: ∇f(θ) = (∂f(θ)/∂θ1, ∂f(θ)/∂θ2, ...), so the update means θ_k += η ∂f(θ)/∂θ_k.
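A runnable version of the pseudocode above, assuming we are handed a function that returns ∇f(θ); the step size η and stopping tolerance are arbitrary illustrative choices:

```python
import numpy as np

def gradient_ascent(grad_f, theta, eta=0.01, tol=1e-6, max_iters=100000):
    """Repeat theta += eta * grad_f(theta) until the gradient (nearly) vanishes."""
    for _ in range(max_iters):
        g = grad_f(theta)
        if np.linalg.norm(g) < tol:   # approximately at a stationary point
            break
        theta = theta + eta * g
    return theta

# Example: maximize f(theta) = -(theta - 3)^2, whose gradient is -2*(theta - 3):
# gradient_ascent(lambda t: -2 * (t - 3), np.array([0.0]))  -> approximately [3.0]
```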
Gradient-based training
- Gradually adjust θ in a direction that improves the objective
- Gradient w.r.t. θ (observed minus expected feature counts):

  ∂/∂θ_k Σ_i log p(y_i | x_i; θ) = Σ_i [ f_k(x_i, y_i) − Σ_y p(y | x_i; θ) f_k(x_i, y) ]
More complex assumption?
- P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y')),
  where y: NextWord, x: PrecedingWords
- Assume we saw:
  red glasses; yellow glasses; green glasses; blue glasses
  red shoes; yellow shoes; green shoes;
  What is P(shoes | blue)?
- Can we learn categories of words (representations) automatically?
- Can we build a high-order n-gram model without blowing up the model size?
Neural language model
- Model P(y | x) with a neural network
- Example 1: a one-hot vector, where each component of the vector represents one word, e.g., [0, 0, 1, 0, 0]
- Example 2: word embeddings
Neural language model
- Model P(y | x) with a neural network:
  - Learned matrices project the input vectors
  - The projected vectors are concatenated
  - A non-linear function is applied, e.g., h = tanh(W c + b)
  - Obtain P(y | x) by performing a softmax
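A minimal NumPy sketch of one step of such a feed-forward neural LM; the matrix names and shapes are assumptions for illustration (E: embedding/projection matrix, W and b: hidden layer, U and c: output layer):

```python
import numpy as np

def neural_lm_step(context_ids, E, W, b, U, c):
    """Return P(y | x) over the vocabulary for a fixed-size context."""
    x = np.concatenate([E[i] for i in context_ids])  # project and concatenate input words
    h = np.tanh(W @ x + b)                           # non-linear hidden layer
    logits = U @ h + c                               # one score per vocabulary word
    logits -= logits.max()                           # numerical stability
    p = np.exp(logits)
    return p / p.sum()                               # softmax

# Example shapes: |V| = 5, embedding size 3, hidden size 4, two context words:
# E = np.random.randn(5, 3); W = np.random.randn(4, 6); b = np.zeros(4)
# U = np.random.randn(5, 4); c = np.zeros(5); neural_lm_step([0, 2], E, W, b, U, c)
```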
Why?
- Potentially generalizes to unseen contexts
- Example: P("red" | "the", "shoes", "are")
  - This does not occur in the training corpus, but ["the", "glasses", "are", "red"] does.
  - If the word representations of "red" and "blue" are similar, then the model can generalize.
- Why are "red" and "blue" similar?
  - Because the NN saw "red skirt", "blue skirt", "red pen", "blue pen", etc.
Training neural language models
- Can use gradient ascent as well
- Use the chain rule to derive the gradient, a.k.a. back-propagation
- More complex NN architectures can be used, e.g., LSTMs, character-based models
Language model evaluation
- How to compare models? We need an unseen test set. Why?
- Information theory: the study of the resolution of uncertainty
- Perplexity: measures how well a probability distribution predicts a sample
Cross-Entropy
- A common measure of model quality
  - Task-independent
  - Continuous: slight improvements show up here even if they don't change the # of right answers on a task
- Just measure the probability of (enough) test data
  - Higher probability means the model better predicts the future
- There's a limit to how well you can predict random stuff
  - The limit depends on "how random" the dataset is (easier to predict weather than headlines, especially in Arizona)
Cross-Entropy (“xent”)
- Want the probability of the test data to be high:
  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ...
  = 1/8 * 1/8 * 1/8 * 1/16 ...
- high prob → low xent, by 3 cosmetic improvements:
  - Take the logarithm (base 2) to prevent underflow:
    log(1/8 * 1/8 * 1/8 * 1/16 ...) = log 1/8 + log 1/8 + log 1/8 + log 1/16 ... = (−3) + (−3) + (−3) + (−4) + ...
  - Negate to get a positive value in bits: 3 + 3 + 3 + 4 + ...
  - Divide by the length of the text → 3.25 bits per letter (or per word)
- Average? Geometric average of 1/2³, 1/2³, 1/2³, 1/2⁴ = 1/2^3.25 ≈ 1/9.5
Cross-Entropy (“xent”)
- Want the probability of the test data to be high:
  p(h | <S>, <S>) * p(o | <S>, h) * p(r | h, o) * p(s | o, r) ...
  = 1/8 * 1/8 * 1/8 * 1/16 ...
- Cross-entropy → 3.25 bits per letter (or per word)
- Want this to be small (equivalent to wanting good compression!)
- The lower limit is called entropy: obtained in principle as the cross-entropy of the true model measured on an infinite amount of data
- perplexity = 2^xent (meaning ≈ 9.5 choices)
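The computation above in a few lines of Python, using the same per-word probabilities from the example:

```python
import math

probs = [1/8, 1/8, 1/8, 1/16]                            # per-token model probabilities
xent = -sum(math.log2(p) for p in probs) / len(probs)    # (3 + 3 + 3 + 4) / 4 = 3.25 bits
perplexity = 2 ** xent                                    # 2^3.25, about 9.5 effective choices
```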
More math: Entropy H(X)
- The entropy H(p) of a discrete random variable X is the expected negative log probability:
  H(p) = −Σ_x p(x) log₂ p(x)
- Entropy is a measure of uncertainty
Entropy of coin tossing
- Toss a coin: P(H) = p, P(T) = 1 − p
- H(p) = −p log₂ p − (1 − p) log₂(1 − p)
  - p = 0.5: H(p) = 1
  - p = 1: H(p) = 0
How many bits to encode messages
- Consider four letters (A, B, C, D):
- If p = (½, ½, 0, 0), how many bits per letter on average to encode a message ~ p?
  - Encode A as 0, B as 1; AAABBBAA ⇒ 00011100
- If p = (¼, ¼, ¼, ¼):
  - A: 00, B: 01, C: 10, D: 11; ABDA ⇒ 00011100
- How about p = (½, ¼, ¼, 0)?
  - A: 0, B: 10, C: 11; AAACBA ⇒ 00011100
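For p = (½, ¼, ¼, 0), the variable-length code above uses ½·1 + ¼·2 + ¼·2 = 1.5 bits per letter on average, which equals H(p) = 1.5 bits.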
More math: Cross Entropy
- Cross-entropy: the avg. # of bits to encode events ~ p(x) using a coding scheme m(x):
  H(p, m) = −Σ_x p(x) log₂ m(x)
- Not symmetric: H(p, m) ≠ H(m, p)
- Lower bounded by H(p)
- Let p = (½, ¼, ¼, 0) and encode A: 00, B: 01, C: 10, D: 11 (i.e., m = (¼, ¼, ¼, ¼))
- AAACBA? ⇒ 000000100100
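Worked out: H(p, m) = −(½ log₂ ¼ + ¼ log₂ ¼ + ¼ log₂ ¼) = ½·2 + ¼·2 + ¼·2 = 2 bits per letter, so the uniform code m wastes 0.5 bits per letter compared with H(p) = 1.5 bits; indeed AAACBA ⇒ 000000100100 takes 12 bits (2 bits per letter) instead of the 8 needed with the code matched to p on the previous slide.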
Perplexity and geometric mean
Perplexity
Language model m is better than m′ if it assigns lower perplexity (i.e., lower cross-entropy, and hence higher probability) to the test corpus w1 ... wN:

  Perplexity(w1 ... wN) = 2^H(w1 ... wN)
                        = 2^(−(1/N) log₂ m(w1 ... wN))
                        = m(w1 ... wN)^(−1/N)
                        = the N-th root of 1 / m(w1 ... wN)
An experiment
- Train: 38M words of WSJ text, |V| = 20k
- Test: 1.5M words of WSJ text
- Word-level LSTM: perplexity ≈ 85
- Char-level model: ≈ 79
An experiment

- Models: unigram, bigram, trigram (with Good-Turing smoothing)
- Training data: 38M words of WSJ text (vocabulary: 20K types)
- Test data: 1.5M words of WSJ text
- Results:

  Model       Unigram   Bigram   Trigram
  Perplexity  962       170      109