CPSC 503 Computational Linguistics
Lecture 5
Giuseppe Carenini
Today 23/9
• Min Edit Distance
• n-grams
• Model evaluation
Minimum Edit Distance
• Def. The minimum number of edit operations (insertion, deletion and substitution) needed to transform one string into another.
• Example: gumbo -> gumb (delete o) -> gum (delete b) -> gam (substitute u by a)
Minimum Edit Distance Algorithm
• Dynamic programming (a very common technique in NLP)
• High-level description:
  – Fills in a matrix of partial comparisons
  – The value of each cell is computed as a "simple" function of the surrounding cells
  – Output: not only the number of edit operations but also the sequence of operations
Minimum Edit Distance Algorithm: Details
ed[i, j] = minimum distance between the first i characters of the source and the first j characters of the target
Costs: del-cost = 1, ins-cost = 1, sub-cost = 2
Update rule (fill the matrix cell by cell):
ed[i, j] = MIN( ed[i-1, j] + 1,                                   (deletion)
                ed[i, j-1] + 1,                                   (insertion)
                ed[i-1, j-1] + (0 if the characters are equal, else 2) )   (substitution or equal)
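A minimal sketch of the dynamic program described above, using the costs from the slide (insert = 1, delete = 1, substitute = 2); the function name and strings are illustrative, not from the course materials.

def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """ed[i][j] = distance between the first i chars of source and the first j chars of target."""
    n, m = len(source), len(target)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        ed[i][0] = ed[i - 1][0] + del_cost          # delete all source characters
    for j in range(1, m + 1):
        ed[0][j] = ed[0][j - 1] + ins_cost          # insert all target characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            ed[i][j] = min(
                ed[i - 1][j] + del_cost,                        # deletion
                ed[i][j - 1] + ins_cost,                        # insertion
                ed[i - 1][j - 1] + (0 if same else sub_cost),   # substitution or equal
            )
    return ed[n][m]

print(min_edit_distance("gumbo", "gam"))  # 4: delete 'o', delete 'b', substitute 'u' by 'a'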
Min edit distance and alignment
See demo
Spelling: the problem(s)
• Non-word, isolated
  – Detection: is the word in the vocabulary (w ∈ V)?
  – Correction: find the most likely correct word (funn -> funny, funnel, ...)
• Non-word, in context
  – Correction: find the most likely correct word ...in this context (trust funn / a lot of funn)
• Real-word, isolated
  – Detection: ?!
• Real-word, in context
  – Detection: is it an impossible (or very unlikely) word in this context?
  – Correction: find the most likely substitute word in this context (I want too go their)
Key Transition
• Up to this point we’ve mostly been discussing words in isolation
• Now we’re switching to sequences of words
• And we’re going to worry about assigning probabilities to sequences of words
Knowledge-Formalisms Map (including probabilistic formalisms)
[Diagram relating formalisms to levels of analysis. Formalisms: State Machines and probabilistic versions (Finite State Automata, Finite State Transducers, Markov Models); Rule systems and probabilistic versions (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners. Levels: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue.]
Only Spelling?
A. Assign a probability to a sentence
  • Part-of-speech tagging
  • Word-sense disambiguation
  • Probabilistic parsing
B. Predict the next word
  • Speech recognition
  • Handwriting recognition
  • Augmentative communication for the disabled
P(w_1, ..., w_n) = ?  Impossible to estimate!
Impossible to estimate!
P(w_1, ..., w_n) = ?
Assuming 10^5 word types and an average sentence of 10 words -> what is the sample space?
• Most sentences will not appear, or will appear only once
• Google language model update (22 Sept. 2006): based on a corpus with 95,119,665,584 sentences
Key point in statistical NLP: your corpus should be much bigger (>>) than your sample space!
Decompose: apply the chain rule
Chain rule:
P(A_1, ..., A_n) = ∏_{i=1}^{n} P(A_i | A_1, ..., A_{i-1})
Applied to a word sequence from position 1 to n:
P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})
         = P(w_1) ∏_{k=2}^{n} P(w_k | w_1^{k-1})
Example
• Sequence: "The big red dog barks"
• P(The big red dog barks) = P(The) x P(big|the) x P(red|the big) x P(dog|the big red) x P(barks|the big red dog)
Note: P(The) is better expressed as P(The|<beginning of sentence>), written P(The|<S>).
Not a satisfying solution
Even for small n (e.g., 6) we would need a far too large corpus to estimate P(w_6 | w_1, ..., w_5).
Markov assumption: the entire prefix history isn't necessary.
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
• N = 1 (unigram):  P(w_n | w_1^{n-1}) ≈ P(w_n)
• N = 2 (bigram):   P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})
• N = 3 (trigram):  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-2}, w_{n-1})
Prob of a sentence: N-Grams
Chain rule: P(w_1^n) = P(w_1) ∏_{k=2}^{n} P(w_k | w_1^{k-1})
Simplifications:
• Unigram:  P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k)
• Bigram:   P(w_1^n) ≈ P(w_1) ∏_{k=2}^{n} P(w_k | w_{k-1})
• Trigram:  P(w_1^n) ≈ P(w_1) P(w_2 | w_1) ∏_{k=3}^{n} P(w_k | w_{k-2}, w_{k-1})
Bigram
<S> The big red dog barks
Bigram: P(w_1^n) ≈ P(w_1 | <S>) ∏_{k=2}^{n} P(w_k | w_{k-1})
P(The big red dog barks) = P(The|<S>) x P(big|The) x P(red|big) x P(dog|red) x P(barks|dog)
Trigram?
Estimates for N-Grams
Bigram:
P(w_n | w_{n-1}) = P(w_{n-1}, w_n) / P(w_{n-1})
                 = [ C(w_{n-1}, w_n) / N_pairs ] / [ C(w_{n-1}) / N_words ]
                 ≈ C(w_{n-1}, w_n) / C(w_{n-1})
...in general:
P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1}, w_n) / C(w_{n-N+1}^{n-1})
Estimates for Bigrams
P(red | big) = P(big, red) / P(big)
             = [ C(big, red) / N_pairs ] / [ C(big) / N_words ]
             ≈ C(big, red) / C(big)
Silly corpus: "<S> The big red dog barks against the big pink dog"
Word types vs. word tokens?
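As a rough illustration of these count-based estimates, the following sketch builds unigram and bigram counts from the silly corpus above (lowercased for counting) and computes the MLE bigram probability; the names are illustrative, not from the course materials.

from collections import Counter

tokens = "<S> the big red dog barks against the big pink dog".split()

unigrams = Counter(tokens)                   # C(w), word-token counts
bigrams = Counter(zip(tokens, tokens[1:]))   # C(w_{n-1}, w_n)

def p_mle(prev, word):
    """MLE bigram estimate: P(word | prev) ~= C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("big", "red"))                             # C(big, red) / C(big) = 1/2 = 0.5
print(len(unigrams), "types,", len(tokens), "tokens")  # 8 word types vs. 11 word tokens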
Berkeley ____________ Project (1994)
Table: counts C(w_{n-1}, w_n)
P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
Corpus: ~10,000 sentences, 1616 word types. What domain? Dialog? Reviews?
BERP Table: P(w_n | w_{n-1})
[bigram probability table, rows w_{n-1}, columns w_n]
BERP Table Comparison
Counts C(w_{n-1}, w_n) vs. probabilities P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
Some Observations
• What's being captured here?
  – P(want|I) = .32
  – P(to|want) = .65
  – P(eat|to) = .26
  – P(food|Chinese) = .56
  – P(lunch|eat) = .055
Some Observations
• P(I | I), P(want | I), P(I | food)
• "I I I want", "I want I want to", "The food I want is"
A speech-based restaurant consultant!
Generation
• Choose N-grams according to their probabilities and string them together (see the sketch after this list)
• I want -> want to -> to eat -> eat lunch
• I want -> want to -> to eat -> eat Chinese -> Chinese food
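A toy sketch of this generation procedure; the bigram table below is hand-made with illustrative probabilities (not the BERP values).

import random

# Toy bigram table: P(next word | previous word). Probabilities are made up.
bigram_probs = {
    "<S>":     {"i": 1.0},
    "i":       {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"lunch": 0.5, "chinese": 0.5},
    "chinese": {"food": 1.0},
}

def generate(start="<S>", max_len=10):
    """String words together by repeatedly sampling from P(w | previous word)."""
    word, out = start, []
    while word in bigram_probs and len(out) < max_len:
        candidates = list(bigram_probs[word])
        weights = list(bigram_probs[word].values())
        word = random.choices(candidates, weights=weights)[0]
        out.append(word)
    return " ".join(out)

print(generate())  # e.g. "i want to eat lunch" or "i want to eat chinese food"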
Two problems with applying
P(w_1^n) ≈ P(w_1) ∏_{k=2}^{n} P(w_k | w_{k-1})
to the bigram table above
Problem (1)
• We may need to multiply many very small numbers (underflow!)
• Easy solution (sketched below):
  – Convert probabilities to logs and then do additions (sums of logs instead of products of probabilities)
  – To get the real probability (if you need it), go back to the antilog
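A minimal illustration of the log trick, with made-up component probabilities:

import math

# Component n-gram probabilities (made-up values); multiplying many of these underflows.
probs = [0.1, 0.003, 0.0007, 0.02]

log_p = sum(math.log(p) for p in probs)   # work in log space: sums instead of products
print(log_p)                              # log of the sequence probability
print(math.exp(log_p))                    # antilog, only if the real probability is needed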
Problem (2)
• The probability matrix for n-grams is sparse
• How can we assign a probability to a sequence when one of its component n-grams has a probability of zero?
• Solutions:
  – Add-one smoothing
  – Good-Turing discounting
  – Backoff and deleted interpolation
Add-One
• Make the zero counts 1.
• Rationale: if you had seen these "rare" events, chances are you would only have seen them once.
Unigram:
  MLE:             P(w) = C(w) / N
  Add-one:         P*(w) = (C(w) + 1) / (N + V)
  Adjusted counts: C*(w) = (C(w) + 1) · N / (N + V)
  Discount:        d_c = C*(w) / C(w)
Add-One: Bigram
  MLE:             P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
  Add-one:         P*(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + V)
  Adjusted counts: C*(w_{n-1}, w_n) = (C(w_{n-1}, w_n) + 1) · C(w_{n-1}) / (C(w_{n-1}) + V)
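A small sketch of add-one smoothing for bigrams, reusing the silly corpus from before (V is the number of word types); the names are illustrative:

from collections import Counter

tokens = "<S> the big red dog barks against the big pink dog".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)                                  # vocabulary size (word types)

def p_add_one(prev, word):
    """Add-one smoothed bigram: (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_add_one("big", "red"))    # seen bigram:   (1 + 1) / (2 + 8) = 0.2
print(p_add_one("big", "barks"))  # unseen bigram: (0 + 1) / (2 + 8) = 0.1, no longer zero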
BERP: Original vs. Add-One Smoothed Counts
[table of original and add-one smoothed bigram counts C(w_{n-1}, w_n)]
Add-One Smoothing: Problems
• An excessive amount of probability mass is moved to the zero counts
• When compared empirically with MLE or other smoothing techniques, it performs poorly
• -> Not used in practice
• Detailed discussion in [Gale and Church 1994]
Better smoothing techniques
• Good-Turing discounting (clear example in the textbook)
More advanced: Backoff and Interpolation
• To estimate an n-gram, use "lower order" n-grams:
  – Backoff: rely on the lower-order n-grams when we have zero counts for a higher-order one; e.g., use unigrams to estimate zero-count bigrams
  – Interpolation: mix the probability estimates from all the n-gram estimators; e.g., a weighted interpolation of trigram, bigram and unigram counts (sketched below)
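A rough sketch of the interpolation idea (not backoff), with made-up lambda weights:

def interpolated_trigram(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Weighted interpolation of unigram, bigram and trigram estimates.
    The lambda weights here are illustrative; in practice they are tuned
    on held-out data and must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# A zero-count trigram still receives probability mass from the lower orders:
print(interpolated_trigram(p_uni=0.001, p_bi=0.01, p_tri=0.0))  # 0.0031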
N-Grams Summary: final
• P(w_1, ..., w_n) = ? Impossible to estimate! The sample space is much bigger than any realistic corpus.
• The chain rule alone does not help.
• Markov assumption: unigram... sample space? bigram... sample space? trigram... sample space?
• Sparse matrix: smoothing techniques
• Look at practical issues: sec. 4.8
You still need a big corpus...
• The biggest one is the Web!
P(w_3 | w_1, w_2) ≈ C_web(w_1, w_2, w_3) / C_web(w_1, w_2)
• Impractical to download every page from the Web to count the n-grams =>
• Rely on page counts (which are only approximations)
  – A page can contain an n-gram multiple times
  – Search engines round off their counts
• Such "noise" is tolerable in practice
Today 23/9
• Min Edit Distance
• n-grams
• Model evaluation
Model Evaluation: Goal
You may want to compare:
• 2-grams with 3-grams
• two different smoothing techniques (given the same n-grams)
...on a given corpus.
Model Evaluation: Key Ideas
A: Split the corpus into a training set and a testing set (w_1^N)
B: Train the models Q1 and Q2 on the training set (counting, frequencies, smoothing)
C: Apply the models to the testing set and compare the results
Entropy
• Def. 1: a measure of uncertainty
• Def. 2: a measure of the information we need to resolve an uncertain situation
  – Let p(x) = P(X = x), where x ∈ X.
  – H(p) = H(X) = -Σ_{x ∈ X} p(x) log2 p(x)
  – It is normally measured in bits.
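A small sketch of this definition:

import math

def entropy(p):
    """H(p) = -sum over x of p(x) * log2 p(x), measured in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

fair_coin  = {"H": 0.5, "T": 0.5}
skewed_die = {"1": 0.5, "2": 0.25, "3": 0.125, "4": 0.125}

print(entropy(fair_coin))   # 1.0 bit
print(entropy(skewed_die))  # 1.75 bits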
Model Evaluation
• Actual distribution: P(w_1, ..., w_n)
• Our approximation: Q(w_1, ..., w_n)
• How different are they?
Relative entropy (KL divergence):
D(p || q) = Σ_{x ∈ X} p(x) log( p(x) / q(x) )
Entropy of P(w_1, ..., w_n)
H(P) = H(w_1^n) = -Σ_{w_1^n ∈ L} P(w_1^n) log P(w_1^n)
Entropy rate: (1/n) H(w_1^n)
Entropy of the language L:
H(L) = lim_{n→∞} -(1/n) Σ_{w_1^n ∈ L} P(w_1^n) log P(w_1^n)
Assumptions: ergodic and stationary (Shannon-McMillan-Breiman):
H(L) = lim_{n→∞} -(1/n) log P(w_1^n)
=> Entropy can be computed by taking the average log probability of a looooong sample.
Cross-Entropy
Between a probability distribution P and another distribution Q (a model for P):
H(P, Q) = H(P) + D(P || Q) = -Σ_x P(x) log Q(x)
H(P, Q) ≥ H(P)
Between two models Q1 and Q2, the more accurate one is the one with the lower cross-entropy (closer to the true H(P)).
Applied to language:
H(P, Q) = lim_{n→∞} -(1/n) Σ_{w_1^n ∈ L} P(w_1^n) log Q(w_1^n)
        = lim_{n→∞} -(1/n) log Q(w_1^n)
Model Evaluation: In Practice
A: Split the corpus into a training set and a testing set (w_1^N)
B: Train the models Q1 and Q2 on the training set (counting, frequencies, smoothing)
C: Apply the models to the testing set and compare their cross-perplexities:
H(P, Q) ≈ -(1/N) log Q(w_1^N)
2^{H(P, Q1)} vs. 2^{H(P, Q2)}
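A sketch of how this test-set comparison can be computed, assuming a hypothetical smoothed bigram model exposed as model(prev, word) that never returns zero (e.g., the add-one sketch from earlier):

import math

def cross_entropy(model, test_tokens):
    """H(P, Q) estimated as -(1/n) log2 Q(w_1..w_n) on a long test sample,
    where Q is decomposed into bigram probabilities model(prev, word)."""
    log_q = sum(math.log2(model(prev, word))
                for prev, word in zip(test_tokens, test_tokens[1:]))
    return -log_q / (len(test_tokens) - 1)

def perplexity(model, test_tokens):
    """Perplexity = 2 ** cross-entropy; lower is better."""
    return 2 ** cross_entropy(model, test_tokens)

# Compare two trained models on the same test set: the more accurate one
# has the lower cross-entropy (and lower perplexity), e.g.
# perplexity(q1_bigram, test_tokens) vs. perplexity(q2_bigram, test_tokens)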
k-Fold Cross-Validation and t-Test
• Randomly divide the corpus into k subsets of equal size
• Use each subset for testing (and all the others for training): in practice you do k times what we saw on the previous slide
• Now for each model you have k perplexities
• Compare the models' average perplexities with a t-test (see the sketch below)
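A sketch of this procedure; the fold-splitting helper and the per-fold perplexity numbers below are purely illustrative, and the paired t-test comes from scipy:

import numpy as np
from scipy import stats

def k_fold_indices(n_items, k):
    """Randomly split item indices (e.g. sentence indices) into k folds."""
    return np.array_split(np.random.permutation(n_items), k)

# Hypothetical per-fold test perplexities for two models Q1 and Q2 (k = 10):
ppl_q1 = np.array([210, 205, 198, 220, 215, 208, 202, 211, 207, 213])
ppl_q2 = np.array([190, 188, 195, 205, 200, 192, 189, 198, 194, 199])

# Paired t-test on the per-fold perplexities: is the difference significant?
t_stat, p_value = stats.ttest_rel(ppl_q1, ppl_q2)
print(ppl_q1.mean(), ppl_q2.mean(), t_stat, p_value)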
Knowledge-Formalisms Map (including probabilistic formalisms)
[Diagram relating formalisms to levels of analysis. Formalisms: State Machines and probabilistic versions (Finite State Automata, Finite State Transducers, Markov Models); Rule systems and probabilistic versions (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners. Levels: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue.]
Next Time
• Hidden Markov Models (HMMs)
• Part-of-Speech (POS) tagging