Lecture 5:
Spell checkers, n-grams
Ling 1330/2330 Intro to Computational Linguistics
Na-Rae Han, 9/3/2020
Objectives
L&C ch.2: Writers' aids, spell checkers
Types of "writers' aids" utilities
Spell checkers
Edit distance
n-gram as context
Character-level n-grams
Word-level n-grams
Frequent n-grams in English
NLTK: Building n-grams
n-gram frequency distribution
9/3/2020 2
Writers' aids in the wild
Types of NLP-based writing helper utilities:
Spell checkers
Grammar checkers
Built-in dictionaries & thesauri
Anything else?
What works well and what doesn't?
How spell checkers operate 1
Real-time spell checkers
Spell checker detects errors as you type.
May make suggestions for correction
→Writer can accept or reject them
Some systems auto-correct without your approval
→ Predominantly on mobile platform
Must run in the background with a "real-time" response: the system must be light and fast
[Screenshot: suggestion list, e.g. bright, birth, broth, births, brat]
How spell checkers operate 2
Global spell checkers
You run a checker on the whole document or a selected region
System has access to wider context (the whole paragraph, etc.)
It finds errors and corrects them, often automatically
A human may or may not proofread results afterwards
Adaptive spell checkers
"Learns" the language of the user, adjusting lexicon/rules
Manual: the user has the option to add to or remove from the lexicon
Automatic: adjusts lexicon and rules in the background, often using context and statistical data
The modus operandi of most mobile platforms
Detection vs. correction
There are two distinct tasks:
Error detection
Simply find the misspelled words
Error correction
Correct the misspelled words
It is EASY to tell that briht is a misspelled word;
But what is the CORRECT word?
bright? birth? births? brat?
Why not Brit or brought?
We need a way to measure the degree of similarity between source and target words
Measure of string similarity
How is a mistyped word related to the intended word?
Types of errors
Insertion: A letter has been added to a word
ex. "arguement" instead of "argument"
Deletion: A letter has been omitted from a word
ex. "pychology" instead of "psychology"
Substitution: One letter has been replaced by another
ex. "miopic" instead of "myopic"
Transposition: Two adjacent letters have been switched
ex. "concsious" instead of "conscious"
Minimum edit distance
In order to rank possible spelling corrections, it is useful to calculate the minimum edit distance (= minimum number of operations it would take to convert word1 to word2).
Edit distance without transposition is also known as Levenshtein distance
Example: briht
bright? birth? births? brat? Brit? brought?
briht → bright   (1 insertion)
briht → brit     (1 deletion)
briht → birth    (2 transpositions — NOT 2 deletions & 2 insertions!)
briht → brunt    (2 substitutions)
briht → brat     (1 substitution + 1 deletion = 2)
briht → brought  (1 substitution + 2 insertions = 3)
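The rankings above can be computed with dynamic programming. Below is a minimal sketch (not from the lecture) of the edit-distance variant that also counts adjacent transpositions; the function name edit_distance is chosen for illustration:

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, substitutions,
    and adjacent transpositions needed to turn s into t."""
    m, n = len(s), len(t)
    # d[i][j] = distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution (or match)
            # adjacent transposition, e.g. "ht" <-> "th"
            if i > 1 and j > 1 and s[i-1] == t[j-2] and s[i-2] == t[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)
    return d[m][n]

print(edit_distance('briht', 'bright'))  # 1 (one insertion)
print(edit_distance('briht', 'birth'))   # 2 (two transpositions)
```

Without the transposition clause, this is exactly the Levenshtein distance, and briht → birth would cost 3 instead of 2.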
Minimum edit distance: is that enough?
Any other considerations in ranking these candidates?
word frequency
context
probability of error type
keyboard layout
Increasingly important as spell checkers grow more intelligent
MS Word considers word contexts
suggests "form" → "from"
suggests "your" → "you're" instead of "from" → "form"
Comparing BIGRAMS:
"your form"  "your from"  "you form"  "you from"  "you're from"
n-grams: word-level
n-gram: a stretch of text n units long
unigrams, bigrams, trigrams, 4-grams, 5-grams, …
'Colorless green ideas sleep furiously.'
Word bigrams:
[('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously'), ('furiously', '.')]
Word trigrams:
[('colorless', 'green', 'ideas'), ('green', 'ideas', 'sleep'), ('ideas', 'sleep', 'furiously'), ('sleep', 'furiously', '.')]
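These n-gram lists can be built in plain Python with zip; a small sketch (the helper name ngrams here is illustrative):

```python
def ngrams(tokens, n):
    """Return a list of n-grams: tuples of n consecutive tokens."""
    # zip over n staggered views of the token list
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['colorless', 'green', 'ideas', 'sleep', 'furiously', '.']
print(ngrams(tokens, 2))  # the 5 bigrams listed above
print(ngrams(tokens, 3))  # the 4 trigrams listed above
```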
Word-level n-grams
How likely do you think these n-grams are in English?
Putting it in terms of conditional probability:
After a user types 'are you', what is the most likely next word?
How about 'in the'? 'in the middle'?

are you        46622
is you          4441
are you so       428
are you also      26
are you does       -
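A conditional probability for the next word given a typed context is just a ratio of n-gram counts; a sketch using the counts from the table above (the dictionary layout is illustrative):

```python
# Corpus counts from the table above
counts = {
    ('are', 'you'): 46622,
    ('are', 'you', 'so'): 428,
    ('are', 'you', 'also'): 26,
}

def p_next(context, word):
    """P(word | context) = count(context + word) / count(context)."""
    return counts[context + (word,)] / counts[context]

print(round(p_next(('are', 'you'), 'so'), 4))    # 0.0092
print(round(p_next(('are', 'you'), 'also'), 4))  # 0.0006
```

So after 'are you', 'so' is a far more likely continuation than 'also', even though both occur.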
For fun: most frequent bigrams?
2551888 of the
1887475 in the
1041011 to the
861798 on the
676658 and the
648408 to be
578806 for the
561171 at the
498217 in a
479627 do n't
455367 with the
451460 from the
443547 of a
395939 that the
362176 is a
361879 going to
335255 by the
330828 as a
319846 with a
317431 I think
Source: http://www.ngrams.info/download_coca.asp
Most frequent trigrams?
198630 I do n't
140305 one of the
129406 a lot of
117289 the United States
79825 do n't know
76782 out of the
75015 as well as
73540 going to be
61373 I did n't
61132 to be a
Source: http://www.ngrams.info/download_coca.asp
4-grams? 5-grams?
54647 I do n't know
43766 I do n't think
33975 in the United States
29848 the end of the
27176 do n't want to
12663 I do n't want to
10663 at the end of the
8484 in the middle of the
8038 I do n't know what
6446 I do n't know if
Source: http://www.ngrams.info/download_coca.asp
Building n-grams with NLTK
>>> chom = 'colorless green ideas sleep furiously'.split()
>>> chom
['colorless', 'green', 'ideas', 'sleep', 'furiously']
>>> nltk.bigrams(chom)
<generator object bigrams at 0x000001C432AEAA98>
>>> list(nltk.bigrams(chom))
[('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously')]
>>> nltk.ngrams(chom, 2)
<generator object ngrams at 0x000001C432AEAA20>
>>> list(nltk.ngrams(chom, 2))
[('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously')]
>>> list(nltk.ngrams(chom, 3))
[('colorless', 'green', 'ideas'), ('green', 'ideas', 'sleep'), ('ideas', 'sleep', 'furiously')]
>>> chom3grams = list(nltk.ngrams(chom, 3))

nltk.bigrams()
nltk.ngrams(list, n)
These return a generator object. Cast it into a list for multiple uses.
Careful with NLTK n-grams
>>> rtoks
['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
>>> nltk.ngrams(rtoks, 2)
<generator object ngrams at 0x0A18B0C0>
>>> for gram in nltk.ngrams(rtoks, 2):
        print(gram)

('Rose', 'is')
('is', 'a')
('a', 'rose')
('rose', 'is')
('is', 'a')
('a', 'rose')
('rose', 'is')
('is', 'a')
('a', 'rose')
('rose', '.')
>>> r2grams = nltk.ngrams(rtoks, 2)
>>> nltk.FreqDist(r2grams)
FreqDist({('is', 'a'): 3, ('a', 'rose'): 3, ('rose', 'is'): 2, ('rose', '.'): 1, ('Rose', 'is'): 1})

nltk.ngrams() returns a generator object: it cannot be viewed as a whole, and it can be iterated over only ONCE.
Feed it to nltk.FreqDist() to obtain a bigram frequency distribution.
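The one-pass behavior is ordinary Python generator behavior; a quick sketch with a plain generator expression (nltk.ngrams() behaves the same way, and collections.Counter stands in for nltk.FreqDist here):

```python
from collections import Counter

rtoks = ['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

# A generator of bigrams, analogous to nltk.ngrams(rtoks, 2)
r2grams = (pair for pair in zip(rtoks, rtoks[1:]))

first_pass = list(r2grams)   # consumes the generator
second_pass = list(r2grams)  # already exhausted!
print(len(first_pass))       # 10
print(second_pass)           # []

# Cast to a list once, then reuse freely:
r2grams = list(zip(rtoks, rtoks[1:]))
fd = Counter(r2grams)
print(fd[('is', 'a')])       # 3
```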
Practice with Gettysburg
Process The Gettysburg Address (gettysburg_address.txt)
Build word-level bigrams from tokens.
How many times does the bigram ('to', 'be') occur?
What are the top 10 most frequent bigrams?
Hint: feed bigrams into nltk.FreqDist()
10 minutes
>>> g2grams = list(nltk.bigrams(gtoks))
>>> g2grams[-30:]
[(',', 'shall'), ('shall', 'have'), ('have', 'a'), ('a', 'new'), ('new', 'birth'), ('birth', 'of'), ('of', 'freedom'), ('freedom', '-'), ('-', 'and'), ('and', 'that'), ('that', 'government'), ('government', 'of'), ('of', 'the'), ('the', 'people'), ('people', ','), (',', 'by'), ('by', 'the'), ('the', 'people'), ('people', ','), (',', 'for'), ('for', 'the'), ('the', 'people'), ('people', ','), (',', 'shall'), ('shall', 'not'), ('not', 'perish'), ('perish', 'from'), ('from', 'the'), ('the', 'earth'), ('earth', '.')]
>>> g2gramfd = nltk.FreqDist(g2grams)
>>> g2gramfd[('to', 'be')]
2
>>> g2gramfd.most_common(10)
[(('nation', ','), 4), (('to', 'the'), 3), (('.', 'It'), 3), (('It', 'is'), 3), ((',', 'we'), 3), (('we', 'can'), 3), (('can', 'not'), 3), (('-', 'that'), 3), (('the', 'people'), 3), (('people', ','), 3)]
Wrap-up
Homework #2 out
Process the Bible + Jane Austen
Due next Tuesday 4pm
Next class (Tue): n-gram context
conditional probability, conditional frequency distribution
Review the NLTK Book, chapters 1 through 3.