
Natural Language Processing

Matilde Marcolli
MAT1509HS: Mathematical and Computational Linguistics

University of Toronto, Winter 2019, T 4-6 and W 4, BA6180


Reference

C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.


• Setting based on Probabilistic Linguistics

• Electronic Corpora:
- Linguistic Data Consortium
- European Language Resources Association
- International Computer Archive of Modern English
- Oxford Text Archive
- Child Language Data Exchange System

• Stemming: stripping off affixes and word-formation elements to extract the stems of words from a word list

• Markup: syntactic structure is marked

• Penn Treebank: Lisp-like bracketing to mark the binary tree structure of a sentence

• SGML (Standard Generalized Markup Language): HTML is a type of SGML encoding; the Text Encoding Initiative (TEI) encoding scheme is suitable for marking up parts of various texts; XML is a simplified form well suited to web applications


• Grammatical Tagging: automated tagging for categories (parts of speech: nouns, verbs, ...)

• Tag Sets: that of the American Brown Corpus (developed further to tag the Lancaster–Oslo–Bergen corpus and the British National Corpus)

• Penn Treebank tag set: most widely used in computational settings (a simplified version of the previous one)

• rule: the least marked category is used as the default whenever a word cannot be placed in any more precise subcategory with additional markings

• Example: “Adjective” is used if the word cannot be placed further into “comparatives, superlatives, numerals, ...”

• available tag sets are very different (some coarser, some more refined)


• good criterion for tag sets: use part-of-speech tags that “best predict” the behavior of nearby words

• some tags may be too coarse to be good predictors

• Example: the adverb not has a very different distribution from other adverbs, but if it is tagged in the same way the distinction is lost

• Example: should one use the same tag for auxiliaries and for other verbs? (the Penn Treebank does)

• correctly identifying proper names is an issue


• at both the syntactic and the semantic level punctuation matters...
- the panda bear eats shoots and leaves
- the panda bear eats, shoots, and leaves

• Collocations: a collection of two or more words with limited compositionality

• compositionality: the meaning of the whole expression can be deduced from the meaning of its parts

• Example: an idiom is not compositional:
I heard it through the grapevine

• Example: a combination of adjective + noun which cannot be replaced by other synonyms:
strong tea (cannot replace “strong” with other words referring to “physical strength” in the same context)

• Example: technical terms are also examples of collocation


• Collocations are identified by frequency-based search (for combinations of words)

• Collocation window: catch combinations of words that are not contiguous in a sentence

• Example: identifying “knock – door” as a collocation (one cannot replace “knock” with another term like “hit” or “beat”; it would not be correct) requires a window of several words:
they knocked repeatedly at the metal front door

• the collocation in such cases can be identified by computing the mean and variance of the signed distance between the two words over the corpus

• collocations = pairs with low deviation (for two unrelated words the variance would be higher)
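
• a minimal sketch (Python; the toy corpus and window size are illustrative) of computing the mean and standard deviation of the signed distance between two words within a collocation window:

```python
from statistics import mean, stdev

def offset_stats(sentences, w1, w2, window=8):
    """Collect signed offsets (position of w2 minus position of w1)
    whenever both words co-occur within the window, then return the
    mean and sample standard deviation of these offsets."""
    offsets = []
    for tokens in sentences:
        pos1 = [i for i, t in enumerate(tokens) if t == w1]
        pos2 = [i for i, t in enumerate(tokens) if t == w2]
        for i in pos1:
            for j in pos2:
                if 0 < abs(j - i) <= window:
                    offsets.append(j - i)
    if len(offsets) < 2:
        return None
    return mean(offsets), stdev(offsets)

# toy usage: "knocked ... door" should show a low deviation
corpus = [["they", "knocked", "repeatedly", "at", "the", "metal", "front", "door"],
          ["she", "knocked", "on", "the", "door"]]
print(offset_stats(corpus, "knocked", "door"))
```

a pair like (knocked, door) gives a small standard deviation, while two unrelated words co-occurring by chance give a larger one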


• still, high frequency and low variance can happen by accident: need hypothesis testing

• null hypothesis (no association): compute the probability of the event under this hypothesis, reject if too low (below a given significance level)

• under the null hypothesis (independence): P(w1, w2) = P(w1) · P(w2)

• t-test for collocations: Pf(w1, w2) = frequency, the number of occurrences normalized by the corpus size N

$$ t = \frac{P_f(w_1, w_2) - P(w_1)\cdot P(w_2)}{\sqrt{\sigma^2 / N}} $$

σ² = variance

• if the t-value is not larger than the critical value (for a desired “confidence level”) then the null hypothesis cannot be rejected
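
• a minimal sketch of the t-test for a candidate bigram, assuming the usual approximation σ² ≈ Pf(w1, w2)(1 − Pf(w1, w2)) for the variance of the (Bernoulli) bigram indicator; the counts are purely illustrative:

```python
from math import sqrt

def t_score(c1, c2, c12, N):
    """t-test for the bigram w1 w2: c1, c2 = unigram counts,
    c12 = bigram count, N = corpus size in bigrams."""
    x_bar = c12 / N                # observed P_f(w1, w2)
    mu = (c1 / N) * (c2 / N)       # expected value under independence
    s2 = x_bar * (1 - x_bar)       # Bernoulli variance, ~ x_bar for rare bigrams
    return (x_bar - mu) / sqrt(s2 / N)

# illustrative counts for a candidate collocation
print(t_score(c1=15828, c2=4675, c12=8, N=14307668))   # ~1.0: cannot reject H0
```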


• use for disambiguation between synonyms

• Example: strong and powerful can sometimes be used as synonyms, but not always

• look for other words that form a collocation with one but not with the other (as a way to separate their meanings)

• Examples: powerful computer; strong support

• w1 and w2 the words being compared; w another word

$$ t \sim \frac{P_f(w, w_1) - P_f(w, w_2)}{\sqrt{P_f(w, w_1) + P_f(w, w_2)}} $$

a comparison of the means of two independent normal populations


• the t-test assumes the probabilities are normally distributed

• more general test: χ²-test

• form a 2 × 2 matrix of frequencies Pf(v1, v2)

$$ (O_{ij}) = \begin{pmatrix} P_f(v_1 = w_1, v_2 = w_2) & P_f(v_1 = w_1, v_2 \neq w_2) \\ P_f(v_1 \neq w_1, v_2 = w_2) & P_f(v_1 \neq w_1, v_2 \neq w_2) \end{pmatrix} $$

then compute

$$ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

with the expected values E_{ij} under the null hypothesis

• can apply to sizes other than 2 × 2

• use χ² as a measure of corpus similarity on an N × 2 matrix: N = list of words, 2 = the two corpora being compared; record the frequencies of the words in the two corpora
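
• a minimal sketch of the χ² computation for a 2 × 2 table, using the closed-form expression for the 2 × 2 case (the counts are illustrative):

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 contingency table of bigram counts:
    o11 = count(w1, w2), o12 = count(w1, not w2),
    o21 = count(not w1, w2), o22 = count(not w1, not w2)."""
    N = o11 + o12 + o21 + o22
    num = N * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# illustrative counts; compare with the critical value 3.841
# (significance level 0.05, one degree of freedom)
print(chi_square_2x2(8, 15820, 4667, 14287173))
```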


• use χ2 for identification of translation pairs in aligned corpora

• alignment: know which sentence matches which sentence in the overall text

• count occurrences of words in matching sentences (by a 2 × 2 matrix: frequency of both occurring, of one but not the other, of neither)

• a high value of χ² (with respect to the Eij of the independence null hypothesis) reveals a matching translation pair with high confidence


• Likelihood ratio: another method for hypothesis testing

• compare the independence hypothesis, formulated as

P(w2 | w1) = P(w2 | ¬w1) = p

with the dependence case

p1 = P(w2 | w1) ≠ P(w2 | ¬w1) = p2

• occurrence counts: c1, c2, c12 for w1, w2 and the bigram w1 w2

• Example: assume a binomial distribution

$$ b(k; n, x) = \binom{n}{k}\, x^k (1 - x)^{n-k} $$

• then c12 out of c1 bigrams are w1 w2: b(c12; c1, p) in the first case; b(c12; c1, p1) in the second case

• c2 − c12 out of N − c1 bigrams are ≠ w1 w2: b(c2 − c12; N − c1, p) in the first case; b(c2 − c12; N − c1, p2) in the second


• the likelihood of the independence hypothesis is the product
L1 = b(c12; c1, p) · b(c2 − c12; N − c1, p);
the likelihood of the dependence hypothesis is the product
L2 = b(c12; c1, p1) · b(c2 − c12; N − c1, p2)

• likelihood ratio

$$ \lambda = \frac{L_1}{L_2} $$

or in logarithmic form, log λ

• this test works better than the t-test and the χ²-test for sparse data
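
• a minimal sketch of the likelihood ratio (in the form −2 log λ, the quantity usually compared against a χ² critical value); the ε guard and the illustrative counts are additions:

```python
from math import log

def log_binom_L(k, n, x):
    """log of x**k * (1-x)**(n-k); the binomial coefficient cancels
    in the ratio L1/L2, so it is omitted."""
    eps = 1e-12                     # guard against log(0)
    x = min(max(x, eps), 1 - eps)
    return k * log(x) + (n - k) * log(1 - x)

def log_likelihood_ratio(c1, c2, c12, N):
    """-2 log(lambda) for the bigram w1 w2; larger = stronger association."""
    p = c2 / N                      # P(w2) under independence
    p1 = c12 / c1                   # P(w2 | w1)
    p2 = (c2 - c12) / (N - c1)      # P(w2 | not w1)
    log_lambda = (log_binom_L(c12, c1, p) + log_binom_L(c2 - c12, N - c1, p)
                  - log_binom_L(c12, c1, p1) - log_binom_L(c2 - c12, N - c1, p2))
    return -2 * log_lambda

print(log_likelihood_ratio(c1=15828, c2=4675, c12=8, N=14307668))
```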


• Information-theoretic test for collocations: pointwise mutual information

$$ I(X, Y) = \log_2 \frac{P(X, Y)}{P(X)\, P(Y)} = \log_2 \frac{P(X \mid Y)}{P(X)} = \log_2 \frac{P(Y \mid X)}{P(Y)} $$

• how much more information one has about the occurrence of a word w2 if one knows the occurrence of a word w1

• case of total dependence

$$ I(X, Y) = \log \frac{P(X, Y)}{P(X)\, P(Y)} = \log \frac{P(X)}{P(X)\, P(Y)} = -\log P(Y) $$

• case of perfect independence

$$ I(X, Y) = \log \frac{P(X, Y)}{P(X)\, P(Y)} = \log \frac{P(X)\, P(Y)}{P(X)\, P(Y)} = 0 $$

values close to zero: approximate independence
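
• a minimal sketch of pointwise mutual information computed from corpus counts (illustrative counts again):

```python
from math import log2

def pmi(c1, c2, c12, N):
    """Pointwise mutual information of a word pair, in bits."""
    p1, p2, p12 = c1 / N, c2 / N, c12 / N
    return log2(p12 / (p1 * p2))

print(pmi(c1=15828, c2=4675, c12=8, N=14307668))   # > 0: positive association
```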


• Caveat: this measure of increased information does not explain the nature of the relation

• Example: Chambre des Communes in Canadian French translates House of Commons, so in a corpus of text from the Canadian parliament one finds high I(X, Y) for both pairs (house, chambre) and (house, communes)

• the χ²-test correctly identifies which pair is the better match (because of other unmatched occurrences)

• but I(X, Y) can give a higher value for the second pair,

$$ I(X, Y) = \frac{P(X, Y)}{P(Y \mid X) + P(Y \mid \neg X)} \cdot \frac{1}{P(X)} $$

depending on the entries P(Y | X) and P(Y | ¬X) of the test matrix (for the same X and for the two choices of Y being compared)

• Note: all these tests are not very good for low-frequency cases (rare bigrams)


• the above is pointwise mutual information, while the usual information-theoretic mutual information is averaged (an expectation value)

$$ I(X, Y) = E_{P(x,y)}\big( I(x, y) \big) = E_{P(x,y)}\left( \log \frac{P(x, y)}{P(x)\, P(y)} \right) = \sum_{x \in X,\, y \in Y} P(x, y)\, \log \frac{P(x, y)}{P(x)\, P(y)} $$

• the pointwise version measures the reduction of uncertainty about the occurrence of one word knowing about the occurrence of the other

• the averaged version has a similar meaning, but only measures it in an averaged sense


More about word disambiguation

• when the same word can have several different meanings, find a way to attribute the appropriate one in the context of the sentence

• also, the same word (in English) can often be used as a noun or as a verb: the problem of tagging it correctly as a part of speech

• supervised versus unsupervised learning: whether or not the status (label) of each item used for training is known

• large-scale training sets can be created by the use of pseudowords (artificial conflation of two different words to create a disambiguation problem in a text), as in the sketch below

• disambiguation algorithms use a limited representation of context: e.g. up to 3 words on each side of the ambiguous word
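
• a minimal sketch of building a pseudoword training set by conflating two words (the word pair and corpus are toy examples):

```python
def make_pseudoword_corpus(tokens, word_a, word_b):
    """Replace every occurrence of word_a or word_b by the conflated
    pseudoword word_a_word_b, keeping the original word as the gold
    label: a labelled disambiguation task with no hand annotation."""
    pseudo = f"{word_a}_{word_b}"
    new_tokens, labels = [], []
    for i, t in enumerate(tokens):
        if t in (word_a, word_b):
            new_tokens.append(pseudo)
            labels.append((i, t))        # position and true "sense"
        else:
            new_tokens.append(t)
    return new_tokens, labels

text = "the banana was ripe and the door was open".split()
print(make_pseudoword_corpus(text, "banana", "door"))
```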


Supervised disambiguation

• a disambiguated corpus available for training

• Bayesian classification method: Bayesian decision rule: choose s′ if P(s′ | c) > P(s | c) for all other s ≠ s′, where

$$ P(s \mid c) = \frac{P(c \mid s)}{P(c)}\, P(s) $$

prior probability P(s) (not knowing anything about the context); the other factor is the evidence about the context

$$ s' = \arg\max_s P(s \mid c) = \arg\max_s \frac{P(c \mid s)}{P(c)}\, P(s) = \arg\max_s P(c \mid s)\, P(s) = \arg\max_s \big( \log P(c \mid s) + \log P(s) \big) $$

assign the meaning s′ to the ambiguous word w


Naive Bayes classifier assumption

$$ P(c \mid s) = P(\{v_j : v_j \in c\} \mid s) = \prod_j P(v_j \mid s) $$

vj = words that occur in the context c of the ambiguous word

$$ s' = \arg\max_s \Big( \log P(s) + \sum_j \log P(v_j \mid s) \Big) $$

• the P(vj | s) and P(s) are computed from the labelled training corpus

$$ P(v_j \mid s) = \frac{C(v_j, s)}{\sum_r C(v_r, s)}, \qquad P(s) = \frac{C(s)}{C(w)} $$

C(vj, s) = count of occurrences of the pair

• the naive assumption does not work when there are strong conditional dependencies between the context words
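
• a minimal sketch of the Naive Bayes disambiguator trained on labelled contexts; the add-one smoothing (to avoid log 0) and the toy data are additions not on the slide:

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(instances):
    """instances: list of (sense, context_words). Returns log P(s) and
    log P(v | s) estimated by counting, with add-one smoothing."""
    sense_counts = Counter(s for s, _ in instances)
    word_counts = defaultdict(Counter)
    vocab = set()
    for s, ctx in instances:
        word_counts[s].update(ctx)
        vocab.update(ctx)
    V = len(vocab)
    log_prior = {s: log(c / len(instances)) for s, c in sense_counts.items()}
    log_cond = {s: {v: log((word_counts[s][v] + 1) / (sum(word_counts[s].values()) + V))
                    for v in vocab} for s in sense_counts}
    return log_prior, log_cond, vocab

def disambiguate(context, log_prior, log_cond, vocab):
    """s' = argmax_s [ log P(s) + sum_j log P(v_j | s) ]."""
    def score(s):
        return log_prior[s] + sum(log_cond[s][v] for v in context if v in vocab)
    return max(log_prior, key=score)

# toy training data for the ambiguous word "bank"
data = [("river", ["water", "fish", "shore"]),
        ("finance", ["money", "loan", "account"]),
        ("finance", ["interest", "money", "deposit"])]
model = train_naive_bayes(data)
print(disambiguate(["loan", "money"], *model))     # -> 'finance'
```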


• Information-theoretic supervised disambiguation: instead of using all words in a context window for disambiguation, it tries to identify a single feature that forces the disambiguation

• Example: want to distinguish the fact that prendre une mesure translates as to take a measurement while prendre une décision translates as to make a decision

• use (averaged) mutual information

$$ I(X, Y) = \sum_{x \in X,\, y \in Y} P(x, y)\, \log \frac{P(x, y)}{P(x)\, P(y)} $$


• Flip-flop algorithm: a linear-time algorithm computing the best partition for given values of an indicator (requires a labelled training set: supervised)

• P = {t1, . . . , tm} set of possible translations of the ambiguous word; Q = {x1, . . . , xm} set of possible values of the indicator

• start with a random partition P = P1 ∪ P2

• find the partition Q = Q1 ∪ Q2 maximizing I(P, Q)

• find the partition P = P1 ∪ P2 maximizing I(P, Q) given the partition of Q, etc.

• with each iteration I(P,Q) non-decreasing: stop when stationary


Dictionary-based disambiguation

• D1, . . . , Dk dictionary definitions of the meanings s1, . . . , sk of a word w

• list of words occurring in these definitions

• vj = words occurring in a context window c around w

• Evj = vocabulary definition of vj (list of words for all its meanings)

• score: σ(sk) = #(Dk ∩ ∪j Evj): count the overlap with the dictionary definitions

• choose the meaning that maximizes the score: s = arg maxk σ(sk)
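
• a minimal sketch of this dictionary-overlap (Lesk-style) scoring; the definitions and context are toy data:

```python
def best_sense(definitions, context_defs):
    """definitions: {sense: set of words in its dictionary definition D_k};
    context_defs: union of the definition words E_vj of the context words.
    Returns the sense whose definition overlaps the context the most."""
    return max(definitions, key=lambda s: len(definitions[s] & context_defs))

# toy example for the ambiguous word "ash" (tree vs. residue of burning)
defs = {"tree": {"tree", "hard", "wood", "grey", "bark"},
        "residue": {"powder", "burning", "fire", "remains"}}
context_words = {"fire", "smoke", "burning", "cigarette"}
print(best_sense(defs, context_words))   # -> 'residue'
```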


Unsupervised disambiguation

• “sense tagging” (assigning meanings to occurrences) cannot be completely unsupervised, but “sense discrimination” can

• cluster the contexts of an ambiguous word into a number of (unlabeled) groups

• Context-group discrimination algorithm: parameters P(vj | sk) and P(sk)

• initialize the parameters randomly

• compute the likelihood of the corpus C given the model µ

$$ L(C \mid \mu) = \sum_i \log \sum_k P(c_i \mid s_k)\, P(s_k) $$

ci = individual contexts within the corpus

$$ P(c_i \mid s_k) = \prod_j P(v_j \mid s_k)^{C(v_j \in c_i)} $$

(naive Bayes assumption)


• estimate the posterior probability

$$ h_{ik} = \frac{P(s_k)\, P(c_i \mid s_k)}{\sum_k P(s_k)\, P(c_i \mid s_k)} $$

• re-estimate the parameters

$$ P(v_j \mid s_k) = \frac{\sum_i C(v_j \in c_i)\, h_{ik}}{\sum_j \sum_i C(v_j \in c_i)\, h_{ik}}, \qquad P(s_k) = \frac{\sum_i h_{ik}}{\sum_{i,k} h_{ik}} $$

• stop when the likelihood L(C | µ) is no longer increasing significantly

• choose

$$ s' = \arg\max_k \Big( \log P(s_k) + \sum_j \log P(v_j \mid s_k) \Big) $$

• can distinguish usage types at a finer grain than those listed in dictionaries


Machine Translation

• increasingly refined levels of machine translation

• Word-by-word translation: riddled with lexical ambiguities, messes up word order

• Syntactic parsing and syntactic transfer: restores correct word order; ambiguities remain as ambiguities of the parse trees

• Semantic representation and semantic transfer: corrects for the ambiguities of syntactic parse trees

• Interlingua: a knowledge-representation formalism; instead of translation systems for each pair of languages (O(n²) of them), translate between each language and the interlingua (O(n) systems)... problem: extremely difficult to design an efficient comprehensive representation formalism


Text Alignment: the first step for Machine Translation

• matching sentences in two parallel texts in different languages

• the order of subordinate sentences can vary in different languages: need to match which sentence pairs with which

• good training corpora are legislative documents in bilingual or multilingual countries (Canada, Switzerland, Belgium, ...): a large supply of documents, accurately translated because of legal requirements

Length-based methods: want the alignment A with the highest probability

$$ \arg\max_A P(A \mid S, T) $$

given parallel texts S, T; assume $P(A \mid S, T) \approx \prod_k P(B_k)$ with the sequence Bk of alignment beads statistically independent


• estimate P(Bk) by comparing lengths of texts in Bk

• assign a cost function to each choice of alignment bead Bk and minimize (e.g. based on the square of the differences of the lengths of paragraphs/sentences), as in the sketch below

• this method breaks down easily on noisy data such as optical character recognition (OCR) output
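
• a minimal sketch of a simplified length-based alignment by dynamic programming: only 1-1, 1-0 and 0-1 beads, with the squared length difference as the cost of a 1-1 bead and a fixed penalty (a hypothetical value) for unmatched sentences; the full Gale–Church method also allows 2-1 and 1-2 beads and a probabilistic cost:

```python
def align_cost(src_lens, tgt_lens, skip_penalty=1000):
    """Minimal cost of aligning two sequences of sentence lengths,
    where a 1-1 bead costs the squared length difference and an
    unmatched sentence (1-0 or 0-1 bead) costs skip_penalty."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                cost[i][j] = min(cost[i][j],
                                 cost[i-1][j-1] + (src_lens[i-1] - tgt_lens[j-1]) ** 2)
            if i > 0:
                cost[i][j] = min(cost[i][j], cost[i-1][j] + skip_penalty)
            if j > 0:
                cost[i][j] = min(cost[i][j], cost[i][j-1] + skip_penalty)
    return cost[n][m]

# sentence lengths (in characters) of a source and a target paragraph
print(align_cost([21, 40, 15], [23, 38, 14]))   # small cost: sentences line up 1-1
```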


Offset alignment methods based on signal-processing techniques:

• induce alignment based on cognate words: works best for languages in the same family with many cognate words, but can work in other cases too, based on proper nouns, numbers, and other such strings of characters

• dot-plot: concatenate both texts and make a big matrix: place a dot at (x, y) if there is a match between positions x and y in the concatenated text... the bitext map appears in the off-diagonal quadrants

• can refine by counting the positions in the text where a word occurs and forming a vector of distances between subsequent positions, which can be compared with a candidate match: if the distances are too different, discard... works better when punctuation may be lost (OCR) and the languages are not in the same family


- W.A. Gale, K.W. Church, A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics (1993) 19(3), 75–102


Word Alignment: the next step after sentences are matched; done from a bitext using association methods like the χ²-test

Statistical Machine Translation: based on word alignment

$$ P(S_f \mid S_e) = \frac{1}{Z} \sum_{a_1, \ldots, a_m = 0}^{\ell}\; \prod_{j=1}^{m} P(w_{f,j} \mid w_{e,a_j}) $$

Se, Sf English and French sentences; ℓ = ℓ(Se), m = ℓ(Sf) their lengths in number of words; wf,j = j-th word of Sf; aj = position of Se that wf,j is aligned to, with word we,aj; and P(wf,j | we,aj) = translation probability (of finding wf,j as a translation of we,aj)

• the sum ranges over all possible alignments (aj = 0 if the word has no matching translation on the other side)
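
• a minimal sketch of this sentence translation probability (up to 1/Z): the sum over alignments factorizes as a product over target words of a sum over source positions, including the empty word; the toy translation table is hypothetical:

```python
def sentence_translation_prob(f_words, e_words, t):
    """P(S_f | S_e) up to the normalization 1/Z: for each French word,
    sum the translation probabilities over all English positions,
    with position 0 the empty word NULL (a_j = 0). t[(f, e)] = P(f | e)."""
    e_with_null = ["NULL"] + list(e_words)
    prob = 1.0
    for f in f_words:
        prob *= sum(t.get((f, e), 0.0) for e in e_with_null)
    return prob

# toy translation table (hypothetical probabilities)
t = {("maison", "house"): 0.8, ("la", "the"): 0.7, ("la", "NULL"): 0.1}
print(sentence_translation_prob(["la", "maison"], ["the", "house"], t))   # 0.8 * 0.8
```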


Decoder

$$ S'_e = \arg\max_{S_e} P(S_e \mid S_f) = \arg\max_{S_e} \frac{P(S_e)\, P(S_f \mid S_e)}{P(S_f)} = \arg\max_{S_e} P(S_e)\, P(S_f \mid S_e) $$

need a search algorithm (in principle searching over an infinite space: need to reduce it to a manageable search)

• Stack search: build the sentence S′e incrementally, keeping track of partial translation hypotheses; extend the hypotheses (more words) and prune the stack of the least likely extended hypotheses...


Translation Probabilities: estimate the P(wf |we)

• initialize P(wf |we) randomly

• count occurrences in all pairs (Se, Sf) of aligned sentences containing we and wf:

$$ z_{w_f, w_e} = \sum_{(S_e, S_f)} P(w_f \mid w_e) $$

• re-estimate the probabilities

$$ P(w_f \mid w_e) = \frac{z_{w_f, w_e}}{\sum_v z_{v, w_e}} $$
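
• a minimal sketch of this iterative re-estimation on two toy sentence pairs; uniform initialization is used instead of random for reproducibility, and a full IBM-Model-1 EM step would in addition normalize each count within its sentence pair:

```python
from collections import defaultdict

def reestimate(pairs, t, iterations=5):
    """Re-estimate translation probabilities t[(f, e)] = P(f | e) as on the
    slide: accumulate z[f, e] = sum of the current P(f | e) over all aligned
    sentence pairs containing both words, then renormalize for each e."""
    for _ in range(iterations):
        z = defaultdict(float)
        for f_sent, e_sent in pairs:
            for f in f_sent:
                for e in e_sent:
                    z[(f, e)] += t[(f, e)]
        totals = defaultdict(float)
        for (f, e), v in z.items():
            totals[e] += v
        t = {(f, e): v / totals[e] for (f, e), v in z.items()}
    return t

pairs = [(["la", "maison"], ["the", "house"]),
         (["la", "fleur"], ["the", "flower"])]
t0 = {(f, e): 1.0 for fs, es in pairs for f in fs for e in es}
print(reestimate(pairs, t0))
```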


Ethical issues in machine translation

• try Google Translate between English and German:

- math teacher → Mathlehrer
- cooking teacher → Kochlehrerin
- physics teacher → Physiklehrer
- French teacher → Französisch-Lehrerin

• because machine translators are trained by counting occurrences in available text corpora, they carry along with them whatever cultural bias exists in the society that produces those texts at the time in which they are produced

• is it possible to construct unbiased machine translators that avoid this kind of problem? It certainly would be desirable...


Information Retrieval

• developing algorithms for retrieving information from repositories of written texts

• inverted index: lists for each word all the documents that contain that word

• with additional position information for all the occurrences

• allows searching for phrases (combinations of words)

• stop list: eliminates words not useful for search (too general, such as “and”)

• stemming of words
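
• a minimal sketch of a positional inverted index with a tiny stop list (stemming is omitted):

```python
from collections import defaultdict

def build_inverted_index(documents, stop_words=frozenset({"and", "the", "of"})):
    """For each word, record the documents containing it and the positions
    of every occurrence; words on the stop list are skipped."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(documents):
        for pos, word in enumerate(text.lower().split()):
            if word not in stop_words:
                index[word][doc_id].append(pos)
    return index

docs = ["the cat sat on the mat", "the dog and the cat"]
idx = build_inverted_index(docs)
print(dict(idx["cat"]))   # {0: [1], 1: [4]}: document ids and positions
```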


Ranking of relevant documents is the main problem

• Probability ranking principle: ranking documents by decreasing probability of relevance is optimal

• view retrieval as a “greedy search” that aims at the most valuable document, with the highest P

• the method assumes that documents are independent

Vector space model:

• it uses spatial proximity as a proxy for semantic proximity

• very high-dimensional space with one dimension for each word in the document collection

• measure of vector similarity

$$ \frac{\langle v, w \rangle}{\|v\| \cdot \|w\|} = \cos(\theta(v, w)) $$


• Term weighting: the weight gives the coordinates dw of the document vectors d = (dw) in the vector space model

- term frequency: tfi,j = occurrences of word wi in document dj
- document frequency: dfi = number of documents in the collection in which wi occurs
- collection frequency: cfi = total number of occurrences of wi in the collection

• weight: if tfi,j ≠ 0,

$$ \mathrm{weight}(i, j) = (1 + \log(\mathrm{tf}_{i,j}))\, \log \frac{N}{\mathrm{df}_i} $$

and zero otherwise, with N = number of documents

• note the inverse document frequency factor: maximal weight for words occurring in only one document, zero weight for words occurring in all of them
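
• a minimal sketch of these weights together with the cosine similarity of the vector space model (document counts and frequencies are toy values):

```python
from math import log, sqrt

def tfidf_vector(doc_terms, doc_freq, N):
    """weight(i, j) = (1 + log tf) * log(N / df) for terms with tf > 0,
    as above; doc_freq maps each term to its document frequency."""
    weights = {}
    for term in set(doc_terms):
        tf = doc_terms.count(term)
        weights[term] = (1 + log(tf)) * log(N / doc_freq[term])
    return weights

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# toy collection: N documents, hypothetical document frequencies
N, df = 4, {"insurance": 2, "try": 3, "car": 1}
d1 = tfidf_vector(["insurance", "insurance", "car"], df, N)
d2 = tfidf_vector(["insurance", "try"], df, N)
print(cosine(d1, d2))
```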


Term Distribution Models: an alternative to weights; use a model of the occurrences of words (Zipf’s law, or a distribution like the Poisson) and decide how informative a word is

• the Poisson distribution is typical for events over units of a fixed size

$$ p(k; \lambda) = e^{-\lambda} \frac{\lambda^k}{k!} $$

mean and variance both equal to λ

• use with λ = cfi / N, the average number of occurrences per document

• observed deviation from a Poisson distribution can signal “clustering” of term occurrences in only certain types of documents in the collection
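
• a minimal sketch of detecting such deviation: under the Poisson model the expected fraction of documents containing the term at least once is 1 − e^(−λ), so comparing this prediction with the observed document frequency flags terms that cluster in few documents (the ratio used here is one illustrative choice of measure, not from the slides):

```python
from math import exp

def poisson_expected_df(cf, N):
    """Expected document frequency under a Poisson model with
    lambda = cf / N: N * (1 - exp(-lambda))."""
    lam = cf / N
    return N * (1 - exp(-lam))

def clustering_ratio(cf, df, N):
    """Ratio of Poisson-predicted to observed document frequency;
    values well above 1 suggest the term clusters in few documents."""
    return poisson_expected_df(cf, N) / df

# hypothetical counts: a bursty content word vs. an evenly spread word
print(clustering_ratio(cf=100, df=12, N=1000))   # ~7.9: clustered, informative
print(clustering_ratio(cf=100, df=95, N=1000))   # ~1.0: close to Poisson
```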


Other probability distributions

• two-Poisson

$$ p(k; \pi, \lambda_1, \lambda_2) = \pi\, e^{-\lambda_1} \frac{\lambda_1^k}{k!} + (1 - \pi)\, e^{-\lambda_2} \frac{\lambda_2^k}{k!} $$

separates the documents containing w into a class with a low number of occurrences and one with a high number

• K-mixture

$$ p(k) = (1 - \alpha)\, \delta_{k,0} + \frac{\alpha}{\beta + 1} \left( \frac{\beta}{\beta + 1} \right)^k $$

the parameters α, β can be fit using the observed mean λ = cf/N and the observed inverse document frequency N/df
