Lecture 5:
Spell checkers, n-grams
Ling 1330/2330 Intro to Computational Linguistics
Na-Rae Han, 9/3/2020
Objectives
L&C ch.2: Writers' aids, spell checkers
Types of "writers' aids" utilities
Spell checkers
Edit distance
n-gram as context
Character-level n-grams
Word-level n-grams
Frequent n-grams in English
NLTK: Building n-grams
n-gram frequency distribution
9/3/2020 2
Writers' aids in the wild
Types of NLP-based writing helper utilities:
Spell checkers
Grammar checkers
Built-in dictionaries & thesauri
Anything else?
What works well and what doesn't?
How spell checkers operate 1
Real-time spell checkers
Spell checker detects errors as you type.
May make suggestions for correction
→Writer can accept or reject them
Some systems auto-correct without your approval
→ Predominantly on mobile platform
Must run in the background with a "real-time" response: the system must be light and fast
[Screenshot: suggestion list, e.g. bright, birth, broth, births, brat]
How spell checkers operate 2
Global spell checkers
You run a checker on the whole document or a selected region
System has access to wider context (the whole paragraph, etc.)
It finds errors and corrects them, often automatically
A human may or may not proofread results afterwards
Adaptive spell checkers
"Learns" the language of the user, adjusting lexicon/rules
Manual: the user has the option to add to or remove from the lexicon
Automatic: adjusts lexicon and rules in the background, often using context and statistical data
The modus operandi of most mobile platforms
Detection vs. correction
There are two distinct tasks:
Error detection
Simply find the misspelled words
Error correction
Correct the misspelled words
It is EASY to tell that briht is a misspelled word;
But what is the CORRECT word?
bright? birth? births? brat?
Why not Brit or brought?
We need a way to measure the degree of similarity between source and target words
Measure of string similarity
How is a mistyped word related to the intended word?
Types of errors
Insertion: A letter has been added to a word
ex. "arguement" instead of "argument"
Deletion: A letter has been omitted from a word
ex. "pychology" instead of "psychology"
Substitution: One letter has been replaced by another
ex. "miopic" instead of "myopic"
Transposition: Two adjacent letters have been switched
ex. "concsious" instead of "conscious"
Minimum edit distance
In order to rank possible spelling corrections, it is useful to calculate the minimum edit distance (= minimum number of operations it would take to convert word1 to word2).
Edit distance without transposition is also known as Levenshtein distance
Example: briht
bright? birth? births? brat? Brit? brought?
briht → bright   (1 insertion)
briht → brit     (1 deletion)
briht → birth    (2 transpositions — NOT 2 deletions & 2 insertions!)
briht → brunt    (2 substitutions)
briht → brat     (1 substitution + 1 deletion = 2)
briht → brought  (1 substitution + 2 insertions = 3)
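The rankings above can be computed with dynamic programming. Below is a minimal sketch (not from the lecture) of the edit-distance variant that also counts adjacent transpositions; the function name edit_distance is chosen for illustration:

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, substitutions,
    and adjacent transpositions needed to turn s into t."""
    m, n = len(s), len(t)
    # d[i][j] = distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution (or match)
            # adjacent transposition, e.g. "ht" <-> "th"
            if i > 1 and j > 1 and s[i-1] == t[j-2] and s[i-2] == t[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)
    return d[m][n]

print(edit_distance('briht', 'bright'))  # 1 (one insertion)
print(edit_distance('briht', 'birth'))   # 2 (two transpositions)
```

Without the transposition clause, this is exactly the Levenshtein distance, and briht → birth would cost 3 instead of 2.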
Minimum edit distance: is that enough?
Any other considerations in ranking these candidates?
word frequency
context
probability of error type
keyboard layout
Increasingly important as spell checkers grow more intelligent
MS Word considers word contexts
suggests "form" → "from"
suggests "your" → "you're" instead of "from" → "form"
Comparing BIGRAMS:
"your form"  "your from"  "you form"  "you from"  "you're from"
n-grams: word-level
n-gram: a stretch of text n units long
unigrams, bigrams, trigrams, 4-grams, 5-grams, …
'Colorless green ideas sleep furiously.'
Word bigrams:
[('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously'), ('furiously', '.')]
Word trigrams:
[('colorless', 'green', 'ideas'), ('green', 'ideas', 'sleep'), ('ideas', 'sleep', 'furiously'), ('sleep', 'furiously', '.')]
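These n-gram lists can be built in plain Python with zip; a small sketch (the helper name ngrams here is illustrative):

```python
def ngrams(tokens, n):
    """Return a list of n-grams: tuples of n consecutive tokens."""
    # zip over n staggered views of the token list
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['colorless', 'green', 'ideas', 'sleep', 'furiously', '.']
print(ngrams(tokens, 2))  # the 5 bigrams listed above
print(ngrams(tokens, 3))  # the 4 trigrams listed above
```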
Word-level n-grams
How likely do you think these n-grams are in English?
Putting it in terms of conditional probability:
After a user types 'are you', what is the most likely next word?
How about 'in the'? 'in the middle'?

are you        46622
is you          4441
are you so       428
are you also      26
are you does       -
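A conditional probability for the next word given a typed context is just a ratio of n-gram counts; a sketch using the counts from the table above (the dictionary layout is illustrative):

```python
# Corpus counts from the table above
counts = {
    ('are', 'you'): 46622,
    ('are', 'you', 'so'): 428,
    ('are', 'you', 'also'): 26,
}

def p_next(context, word):
    """P(word | context) = count(context + word) / count(context)."""
    return counts[context + (word,)] / counts[context]

print(round(p_next(('are', 'you'), 'so'), 4))    # 0.0092
print(round(p_next(('are', 'you'), 'also'), 4))  # 0.0006
```

So after 'are you', 'so' is a far more likely continuation than 'also', even though both occur.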
For fun: most frequent bigrams?
2551888 of the
1887475 in the
1041011 to the
861798 on the
676658 and the
648408 to be
578806 for the
561171 at the
498217 in a
479627 do n't
455367 with the
451460 from the
443547 of a
395939 that the
362176 is a
361879 going to
335255 by the
330828 as a
319846 with a
317431 I think
Source: http://www.ngrams.info/download_coca.asp
Most frequent trigrams?
198630 I do n't
140305 one of the
129406 a lot of
117289 the United States
79825 do n't know
76782 out of the
75015 as well as
73540 going to be
61373 I did n't
61132 to be a
Source: http://www.ngrams.info/download_coca.asp
4-grams? 5-grams?
54647 I do n't know
43766 I do n't think
33975 in the United States
29848 the end of the
27176 do n't want to
12663 I do n't want to
10663 at the end of the
8484 in the middle of the
8038 I do n't know what
6446 I do n't know if
Source: http://www.ngrams.info/download_coca.asp
Building n-grams with NLTK
>>> chom = 'colorless green ideas sleep furiously'.split()
>>> chom
['colorless', 'green', 'ideas', 'sleep', 'furiously']
>>> nltk.bigrams(chom)
<generator object bigrams at 0x000001C432AEAA98>
>>> list(nltk.bigrams(chom))
[('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously')]
>>> nltk.ngrams(chom, 2)
<generator object ngrams at 0x000001C432AEAA20>
>>> list(nltk.ngrams(chom, 2))
[('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously')]
>>> list(nltk.ngrams(chom, 3))
[('colorless', 'green', 'ideas'), ('green', 'ideas', 'sleep'), ('ideas', 'sleep', 'furiously')]
>>> chom3grams = list(nltk.ngrams(chom, 3))

nltk.bigrams()
nltk.ngrams(list, n)
These return a generator object. Cast it into a list for multiple uses.
Careful with NLTK n-grams
>>> rtoks
['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
>>> nltk.ngrams(rtoks, 2)
<generator object ngrams at 0x0A18B0C0>
>>> for gram in nltk.ngrams(rtoks, 2):
        print(gram)

('Rose', 'is')
('is', 'a')
('a', 'rose')
('rose', 'is')
('is', 'a')
('a', 'rose')
('rose', 'is')
('is', 'a')
('a', 'rose')
('rose', '.')
>>> r2grams = nltk.ngrams(rtoks, 2)
>>> nltk.FreqDist(r2grams)
FreqDist({('is', 'a'): 3, ('a', 'rose'): 3, ('rose', 'is'): 2, ('rose', '.'): 1, ('Rose', 'is'): 1})

nltk.ngrams() returns a generator object: it cannot be viewed as a whole, and it can be iterated over only ONCE.
Feed it to nltk.FreqDist() to obtain a bigram frequency distribution.
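The one-pass behavior is ordinary Python generator behavior; a quick sketch with a plain generator expression (nltk.ngrams() behaves the same way, and collections.Counter stands in for nltk.FreqDist here):

```python
from collections import Counter

rtoks = ['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

# A generator of bigrams, analogous to nltk.ngrams(rtoks, 2)
r2grams = (pair for pair in zip(rtoks, rtoks[1:]))

first_pass = list(r2grams)   # consumes the generator
second_pass = list(r2grams)  # already exhausted!
print(len(first_pass))       # 10
print(second_pass)           # []

# Cast to a list once, then reuse freely:
r2grams = list(zip(rtoks, rtoks[1:]))
fd = Counter(r2grams)
print(fd[('is', 'a')])       # 3
```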
Practice with Gettysburg
Process The Gettysburg Address (gettysburg_address.txt)
Build word-level bigrams from tokens.
How many times does the bigram ('to', 'be') occur?
What are the top 10 most frequent bigrams?
Hint: feed bigrams into nltk.FreqDist()
10 minutes
>>> g2grams = list(nltk.bigrams(gtoks))
>>> g2grams[-30:]
[(',', 'shall'), ('shall', 'have'), ('have', 'a'), ('a', 'new'), ('new', 'birth'), ('birth', 'of'), ('of', 'freedom'), ('freedom', '-'), ('-', 'and'), ('and', 'that'), ('that', 'government'), ('government', 'of'), ('of', 'the'), ('the', 'people'), ('people', ','), (',', 'by'), ('by', 'the'), ('the', 'people'), ('people', ','), (',', 'for'), ('for', 'the'), ('the', 'people'), ('people', ','), (',', 'shall'), ('shall', 'not'), ('not', 'perish'), ('perish', 'from'), ('from', 'the'), ('the', 'earth'), ('earth', '.')]
>>> g2gramfd = nltk.FreqDist(g2grams)
>>> g2gramfd[('to', 'be')]
2
>>> g2gramfd.most_common(10)
[(('nation', ','), 4), (('to', 'the'), 3), (('.', 'It'), 3), (('It', 'is'), 3), ((',', 'we'), 3), (('we', 'can'), 3), (('can', 'not'), 3), (('-', 'that'), 3), (('the', 'people'), 3), (('people', ','), 3)]
Wrap-up
Homework #2 out
Process the Bible + Jane Austen
Due next Tuesday 4pm
Next class (Tue): n-gram context
conditional probability, conditional frequency distribution
Review the NLTK Book, chapters 1 through 3.