
Page 1: Outline

Outline

Applications:
• Spelling correction
Formal Representation:
• Weighted FSTs
Algorithms:
• Bayesian Inference (Noisy channel model)
• Methods to determine weights
  – Hand-coded
  – Corpus-based estimation
• Dynamic Programming
  – Shortest path

Page 2: Outline

Detecting and Correcting Spelling Errors

Sources of lexical/spelling errors:
• Speech: lexical access and recognition errors (more later)
• Text: typing and cognitive errors
• OCR: recognition errors
Applications:
• Spell checking
• Handwriting recognition of zip codes, signatures, Graffiti
Issues:
• Correcting non-words in isolation (dg for dog; why not dig?)
• Correcting non-words can produce other valid words
  – Homophone substitution: “parents love there children”; “Lets order a desert after dinner”
• Correcting words in context

Page 3: Outline

Patterns of Error

Human typists make different types of errors from OCR systems -- why?
Error classification I: performance-based
• Insertion: catt
• Deletion: ct
• Substitution: car
• Transposition: cta
Error classification II: cognitive
• People don’t know how to spell (nucular/nuclear; potatoe/potato)
• Homophone errors (their/there)

Page 4: Outline

Probability: Refresher

Population: 10 Princeton students
– 4 vegetarians
– 3 CS majors
Questions:
– What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4
– That a rcs is a CS major? p(c) = 0.3
– That a rcs is a vegetarian and a CS major? p(c,v) = 0.2
– That a vegetarian is a CS major? p(c|v) = 0.5
– That a CS major is a vegetarian? p(v|c) = 0.66
– That a non-CS major is a vegetarian? p(v|c’) = ??
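The numbers above can be checked with a few lines of Python; this is just a worked example over the stated population (10 students, 4 vegetarians, 3 CS majors, with an overlap of 2 implied by p(c,v) = 0.2):

```python
# Worked check of the slide's numbers. The overlap of 2 students who are
# both vegetarian and CS majors follows from p(c,v) = 0.2 over 10 students.
N = 10
veg, cs, both = 4, 3, 2

p_v = veg / N             # 0.4
p_c = cs / N              # 0.3
p_cv = both / N           # 0.2  (joint probability)
p_c_given_v = p_cv / p_v  # 0.5   : a vegetarian is a CS major
p_v_given_c = p_cv / p_c  # 0.66… : a CS major is a vegetarian

# The "??" case: vegetarians among the 7 non-CS majors.
p_v_given_not_c = (veg - both) / (N - cs)  # 2/7 ≈ 0.2857

print(p_c_given_v, round(p_v_given_c, 2), round(p_v_given_not_c, 4))
```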

Page 5: Outline

Bayes Rule and Noisy Channel Model

• We know the joint probabilities:
  – p(c,v) = p(c) p(v|c) (chain rule)
  – p(v,c) = p(c,v) = p(v) p(c|v)
• So we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c):

  p(c|v) = p(c) p(v|c) / p(v)

• “Noisy channel” metaphor: the channel corrupts the input; the task is to recover the original.
  – think cell-phone conversations!!
  – Hearer’s challenge: decode what the speaker said (w), given a channel-corrupted observation (O):

  w* = argmax_{w ∈ V} P(w|O) = argmax_{w ∈ V} P(O|w) · P(w)

  where P(O|w) is the channel model and P(w) is the source model.

Page 6: Outline

How do we use this model to correct spelling errors?

• Simplifying assumptions:
  – We only have to correct non-word errors
  – Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, or transposition)
• Generate and Test method (Kernighan et al. 1990):
  – Generate a word using one of the insertion, deletion, substitution, or transposition operations
  – Test whether the resulting word is in the dictionary
• Example:

  Observation | Correct | Correct letter | Error letter | Position | Type of error
  caat        | cat     | -              | a            | 2        | insertion
  caat        | carat   | r              | -            | 3        | deletion
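The generate-and-test loop above can be sketched in Python. This is a minimal illustration, not Kernighan et al.'s actual implementation; the tiny dictionary is hypothetical:

```python
# Generate-and-test under the one-error assumption: produce every string
# one edit away from the typo, then keep those found in the dictionary.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
DICTIONARY = {"cat", "carat", "cart", "chat", "coat"}  # illustrative only

def one_edit_candidates(word):
    """All strings one insertion, deletion, substitution, or
    transposition away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | substitutions | inserts

def corrections(typo):
    # Generate candidates, then test them against the dictionary.
    return sorted(one_edit_candidates(typo) & DICTIONARY)

print(corrections("caat"))  # ['carat', 'cart', 'cat', 'chat', 'coat']
```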

Page 7: Outline

How do we decide which correction is most likely?

Validate the generated word in a dictionary.
• But there may be multiple valid words; how do we rank them?
• Rank them with a scoring function:
  – P(w | typo) = P(typo | w) * P(w)
  – Note there could be other scoring functions
• Propose the n-best solutions
Estimate the likelihood P(typo|w) and the prior P(w):
• Count events in a corpus to estimate these probabilities
• Labeled versus unlabeled corpus
• For spelling correction, what do we need?
  – Word occurrence information (unlabeled corpus)
  – A corpus of labeled spelling errors
  – Approximate word replacement by local letter replacement probabilities: a confusion matrix on letters

Page 8: Outline

Cat vs Carat

Estimating the prior: look at the occurrences of cat and carat in a large (50M word) AP news corpus.
• cat occurs 6500 times, so p(cat) = .00013
• carat occurs 3000 times, so p(carat) = .00006
Estimating the likelihood: is inserting an ‘a’ after an ‘a’ more likely than deleting an ‘r’ after an ‘a’? Count in a corpus of 50K corrections (p(typo|word)).
• suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a) = .1) and ‘r’ deletion after ‘a’ occurs 7500 times (p(-r) = .15)
Scoring function: p(word|typo) = p(typo|word) * p(word)
• p(cat|caat) = p(+a) * p(cat) = .1 * .00013 = .000013
• p(carat|caat) = p(-r) * p(carat) = .15 * .00006 = .000009
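The same arithmetic, spelled out in Python with exactly the counts given on the slide:

```python
# Noisy-channel scoring with the slide's counts: a 50M-word corpus for
# the priors, a 50K-correction corpus for the channel (edit) model.
p_cat   = 6500 / 50_000_000   # 0.00013
p_carat = 3000 / 50_000_000   # 0.00006
p_ins_a = 5000 / 50_000       # 0.1   P('a' inserted after 'a')
p_del_r = 7500 / 50_000       # 0.15  P('r' deleted after 'a')

score_cat   = p_ins_a * p_cat     # 1.3e-05
score_carat = p_del_r * p_carat   # 9e-06

# cat wins: its higher prior outweighs the slightly less likely edit.
best = max([("cat", score_cat), ("carat", score_carat)], key=lambda x: x[1])
print(best[0])  # cat
```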

Page 9: Outline

Encoding One-Error Correction as WFSTs

Let Σ = {c,a,r,t}.

One-edit model (figure): a transducer whose states allow the identity arcs c:c, a:a, r:r, t:t freely, plus at most one Ins arc (ε:c, ε:a, ε:r, ε:t), Del arc (c:ε, a:ε, r:ε, t:ε), or Sub arc (c:a, c:r, c:t, a:c, a:t, …).

Dictionary model (figure): an acceptor whose paths spell out the dictionary words, e.g. c·a·t and c·a·r·a·t.

One-Error spelling correction:
• Input ● Edit ● Dictionary

Page 10: Outline

Issues

What if there are no instances of carat in the corpus?
• Smoothing algorithms
The estimate of P(typo|word) may not be accurate.
• Training probabilities on typo/word pairs
What if there is more than one error per word?

Page 11: Outline

Minimum Edit Distance

How can we measure how different one word is from another?
• How many operations will it take to transform one word into another?
  caat --> cat, fplc --> fireplace (*treat abbreviations as typos??)
• Levenshtein distance: the smallest number of insertion, deletion, or substitution operations that transform one string into another (ins = del = subst = 1)
• Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent

Page 12: Outline

Computing Levenshtein Distance

d[i,j] = min( d[i-1,j] + del(s_i),
              d[i-1,j-1] + subst(s_i,t_j),
              d[i,j-1] + ins(t_j) )

Lev(s,t) = d[|s|,|t|]

• Dynamic Programming algorithm
  – The solution for a problem is a function of the solutions of its subproblems
  – d[i,j] contains the distance up to s_i and t_j
  – d[i,j] is computed by combining the distances of shorter substrings using insertion, deletion, and substitution operations
  – The optimal edit operations are recovered by storing back-pointers
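The recurrence translates directly into a small dynamic program. A minimal sketch with unit costs (back-pointers omitted for brevity):

```python
# Levenshtein distance via dynamic programming (ins = del = subst = 1).
def levenshtein(s, t):
    m, n = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + subst)  # substitution / match
            # (storing the argmin here as a back-pointer would let us
            #  recover the optimal edit sequence)
    return d[m][n]

print(levenshtein("caat", "cat"))        # 1
print(levenshtein("fplc", "fireplace"))  # 5
```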

Page 13: Outline

Edit Distance Matrix (NB: errors)

Cost = 1 for insertions and deletions; cost = 2 for substitutions.
Recompute the matrix with insertions = deletions = substitutions = 1.

Page 14: Outline

Levenshtein Distance with WFSTs

Let Σ = {c,a,r,t}.

Edit model (figure): a single-state transducer with identity arcs (c:c, a:a, r:r, t:t) at cost 0, plus Ins arcs (ε:c, ε:a, ε:r, ε:t), Del arcs (c:ε, a:ε, r:ε, t:ε), and Sub arcs (c:a, c:r, c:t, a:c, a:t, …), each weighted by its edit cost.

The two sentences to be compared are encoded as FSTs.
Levenshtein distance between two sentences:
• Dist(s1,s2) = s1 ● Edit ● s2

Page 15: Outline

Spelling Correction with WFSTs

Dictionary: FST representation of words
Isolated-word spelling correction:
• AllCorrections(w) = w ● Edit ● Dictionary
• BestCorrection(w) = Bestpath(w ● Edit ● Dictionary)
Spelling correction in context: “parents love there children”
• S = w1, w2, … wn
• Spelling correction of wi:
  – Generate possible edits for wi
  – Pick the edit that fits best in context
• Use an n-gram language model (LM) to rank the alternatives
  – “love there” vs “love their”; “there children” vs “their children”
• SentenceCorrection(S) = F(S) ● Edit ● LM
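The in-context ranking step can be sketched with a bigram language model. The counts below are hypothetical, invented purely to make the mechanics concrete; they are not from any real corpus:

```python
# Ranking homophone alternatives with a bigram LM over hypothetical counts.
from collections import defaultdict

bigram_counts = {("love", "their"): 120, ("love", "there"): 3,
                 ("their", "children"): 450, ("there", "children"): 2}
unigram_counts = defaultdict(int, {"love": 500, "their": 900, "there": 1100})

def bigram_p(w1, w2):
    # MLE with add-one smoothing (assumed vocab size 1000) so unseen
    # bigrams don't zero out the whole sentence score
    return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts[w1] + 1000)

def score(sentence):
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_p(w1, w2)
    return p

candidates = ["parents love there children", "parents love their children"]
best = max(candidates, key=score)
print(best)  # parents love their children
```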

Page 16: Outline

• Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.

Can humans understand ‘what is meant’ as opposed to ‘what is said/written’? How?
http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/

Page 17: Outline

Summary

We can apply probabilistic modeling to NL problems like spell-checking.
• Noisy channel model, Bayesian method
• Training priors and likelihoods on a corpus
Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems.
• e.g. the Minimum Edit Distance algorithm
A number of speech and language tasks can be cast in this framework:
• Generate alternatives using a generator
• Select the best / rank the alternatives using a model
• If the generator and the model are encodable as FSTs, decoding becomes composition followed by search for the best path.

Page 18: Outline

Word Classes and Tagging

Page 19: Outline

Word Classes and Tagging

Words can be grouped into classes based on a number of criteria.
• Application-independent criteria:
  – Syntactic class (Nouns, Verbs, Adjectives…)
  – Proper names (people names, country names…)
  – Dates, currencies
• Application-specific criteria:
  – Product names (Ajax, Slurpee, Lexmark 3100)
  – Service names (7-cents plan, GoldPass)
Tagging: categorizing the words of a sentence into one of the classes.

Page 20: Outline

Syntactic Classes in English: Open Class Words

Nouns:
• Defined semantically: words for people, places, things
• Defined syntactically: words that take determiners
• Count nouns: nouns that can be counted
  – one book, two computers, a hundred men
• Mass nouns: nouns that represent homogeneous groups; can occur without articles
  – snow, salt, milk, water, hair
• Proper nouns; common nouns
Verbs: words for actions and processes
• hit, love, run, fly, differ, go
Adjectives: words describing qualities and properties (modifiers) of objects
• white, black, old, young, good, bad
Adverbs: words describing modifiers of actions
• Unfortunately, John walked home extremely slowly yesterday
• Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)

Page 21: Outline

Syntactic Classes in English: Closed Class Words

Closed class words:
• a fixed set for a language
• typically high-frequency words
Prepositions: relational words describing relations among objects and events
• in, on, before, by
• Particles: looked up, throw out
Articles/Determiners: definite versus indefinite
• Indefinite: a, an
• Definite: the
Conjunctions: used to join two phrases, clauses, or sentences
• Coordinating conjunctions: and, or, but
• Subordinating conjunctions: that, since, because
Pronouns: shorthand to refer to objects and events
• Personal pronouns: he, she, it, they, us
• Possessive pronouns: my, your, ours, theirs, his, hers, its, one’s
• Wh-pronouns: whose, what, who, whom, whomever
Auxiliary verbs: used to mark tense, aspect, polarity, and mood of an action
• Tense: past, present, future
• Aspect: completed or on-going
• Polarity: negation
• Mood: possible, suggested, necessary, desired; expressed by modal verbs (can, do, have, may, might)
• Copula: “be” connects a subject to a predicate (John is a teacher)
Other word classes: interjections (ah, oh, alas); negatives (not, no); politeness markers (please, sorry); greetings (hello, goodbye).

Page 22: Outline

Tagset

Tagset: the set of tags to use; depends on the application.
• Basic tags; tags with some morphology
• Composition of a number of subtags
  – Agglutinative languages
Popular tagsets for English:
• Penn Treebank tagset: 45 tags
• CLAWS tagset: 61 tags
• C7 tagset: 146 tags
How do we decide how many tags to use?
• Application utility
• Ease of disambiguation
• Annotation consistency
  – The “IN” tag in the Penn Treebank tagset covers both subordinating conjunctions and prepositions
  – The “TO” tag represents both the preposition “to” and the infinitival marker (“to read”)
Supertags: fold syntactic information into the tagset
• on the order of 1000 tags

Page 23: Outline

Tagging: Disambiguating Words

Three different models:
• ENGTWOL model (Karlsson et al. 1995)
• Transformation-based model (Brill 1995)
• Hidden Markov Model tagger
ENGTWOL tagger:
• Constraint-based tagger
• 1,100 hand-written constraints to rule out invalid combinations of tags
  – Uses probabilistic constraints and syntactic information
Transformation-based model:
• Start with the most likely assignment
• Make note of the context when the most likely assignment is wrong
• Induce a transformation rule that corrects the most likely assignment to the correct tag in that context
• Rules can be written as α → β / δ _ γ (rewrite α as β in the context δ _ γ)
• Compilable into an FST

Page 24: Outline

Again, the Noisy Channel Model

• Input to channel: part-of-speech sequence T
• Output from channel: a word sequence W
• Decoding task: find T’ = argmax_T P(T|W)
• Using Bayes Rule:

  T’ = argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W)

• And since P(W) doesn’t change for any hypothetical T:

  T’ = argmax_T P(W|T) P(T)

• P(W|T) is the emit probability, and P(T) is the prior, or contextual, probability

Source → Noisy Channel → Decoder

Page 25: Outline

Stochastic Tagging: Markov Assumption

• The tagging model is approximated using Markov assumptions.
  – T’ = argmax_T P(T) * P(W|T)
  – Markov (first-order) assumption: P(T) ≈ ∏_i P(t_i | t_{i-1})
  – Independence assumption: P(W|T) ≈ ∏_i P(w_i | t_i)
  – Thus: T’ = argmax_T ∏_i P(w_i | t_i) * P(t_i | t_{i-1})
• The probability distributions are estimated from an annotated corpus.
  – Maximum Likelihood Estimate:
    • P(w|t) = count(w,t) / count(t)
    • P(t_i|t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})
    • Don’t forget to smooth the counts!!
  – There are other means of estimating these probabilities.
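The MLE formulas above can be sketched over a tagged corpus. The three-sentence corpus below is hand-made for illustration (no smoothing, as the slide warns against):

```python
# MLE estimation of emit and transition probabilities from a tiny
# hand-made tagged corpus (illustrative, not real Treebank data).
from collections import Counter

tagged = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
          [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("a", "DT"), ("cat", "NN"), ("runs", "VBZ")]]

emit = Counter()    # (tag, word) counts
trans = Counter()   # (prev_tag, tag) counts
tags = Counter()    # tag counts

for sent in tagged:
    prev = "BOS"
    for word, tag in sent:
        emit[(tag, word)] += 1
        trans[(prev, tag)] += 1
        tags[tag] += 1
        prev = tag

def p_emit(word, tag):
    return emit[(tag, word)] / tags[tag]         # P(w|t) = count(w,t)/count(t)

def p_trans(tag, prev):
    total = sum(c for (p, _), c in trans.items() if p == prev)
    return trans[(prev, tag)] / total            # P(t_i|t_{i-1})

print(p_emit("dog", "NN"))   # 2/3: 'dog' in 2 of 3 NN slots
print(p_trans("NN", "DT"))   # 1.0: every DT is followed by NN here
```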

Page 26: Outline

Best Path Search

Search for the best path pervades many speech and NLP problems:
• ASR: best path through a composition of acoustic, pronunciation, and language models
• Tagging: best path through a composition of lexicon and contextual models
• Edit distance: best path through a search space set up by insertion, deletion, and substitution operations
In general:
• Decisions/operations create a weighted search space
• Search for the best sequence of decisions
Dynamic programming solution:
• Sometimes only the score is relevant
• Most often the path (sequence of states; a derivation) is relevant

Page 27: Outline

Multi-stage decision problems

Sentence: BOS The dog runs . EOS
Candidate tags: The → DT; dog → NN, VB; runs → VBZ, NNS; “.” → •

Emit probabilities:
• P(the|DT) = 0.999
• P(dog|NN) = 0.99,  P(dog|VB) = 0.01
• P(runs|VBZ) = 0.37, P(runs|NNS) = 0.63
• P(.|•) = 0.999

Transition probabilities:
• P(DT|BOS) = 1
• P(NN|DT) = 0.9,  P(VB|DT) = 0.1
• P(VBZ|NN) = 0.7, P(NNS|NN) = 0.3
• P(NNS|VB) = 0.7, P(VBZ|VB) = 0.3
• P(•|VBZ) = 0.7,  P(•|NNS) = 0.3
• P(EOS|•) = 1

Page 28: Outline

Multi-stage decision problems

Find the state sequence through this space that maximizes ∏_i P(w_i|t_i) * P(t_i|t_{i-1}).

cost(BOS, EOS) = 1 * cost(DT, EOS)
cost(DT, EOS) = max{ P(the|DT) * P(NN|DT) * cost(NN, EOS),
                     P(the|DT) * P(VB|DT) * cost(VB, EOS) }
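The best path through this trellis can be found with a Viterbi-style dynamic program. A minimal sketch, using only the probabilities given on the previous slide (it stores whole paths instead of back-pointers for brevity; “.” stands for the punctuation state):

```python
# Viterbi search over the toy trellis for "The dog runs ."
emit = {("DT", "The"): 0.999, ("NN", "dog"): 0.99, ("VB", "dog"): 0.01,
        ("NNS", "runs"): 0.63, ("VBZ", "runs"): 0.37, (".", "."): 0.999}
trans = {("BOS", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
         ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7,
         ("VB", "NNS"): 0.7, ("VB", "VBZ"): 0.3,
         ("NNS", "."): 0.3, ("VBZ", "."): 0.7, (".", "EOS"): 1.0}

def viterbi(words):
    # best[t] = (probability of best partial path ending in tag t, that path)
    best = {"BOS": (1.0, [])}
    for w in words:
        nxt = {}
        for (prev, t), pt in trans.items():
            if prev in best and (t, w) in emit:
                p = best[prev][0] * pt * emit[(t, w)]
                if t not in nxt or p > nxt[t][0]:
                    nxt[t] = (p, best[prev][1] + [t])
        best = nxt
    # close off with the EOS transition
    return max(((p * trans.get((t, "EOS"), 0.0), path)
                for t, (p, path) in best.items()), key=lambda x: x[0])

prob, path = viterbi(["The", "dog", "runs", "."])
print(path, round(prob, 4))  # ['DT', 'NN', 'VBZ', '.'] 0.1612
```

DT NN VBZ wins over DT NN NNS (≈ 0.1612 vs ≈ 0.0504): the higher emit probability of P(runs|NNS) is outweighed by the transitions favoring VBZ after NN.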

Page 29: Outline

Two ways of reasoning

Forward approach (backward reasoning):
• Compute the best way to get from a state to the goal state.
Backward approach (forward reasoning):
• Compute the best way to get from the source state to a state.
A combination of these two approaches is used in unsupervised training of HMMs:
• Forward-backward algorithm (Appendix D)