Linguistica

Linguistica. Powerpoint? This presentation borrows heavily from slides written by John Goldsmith who has graciously given me permission to use them. Thanks,



Linguistica


Powerpoint?

This presentation borrows heavily from slides written by John Goldsmith who has graciously given me permission to use them. Thanks, John.

He also says I should enjoy my trip, and one way to do that is to not have to write as many slides while I’m here!


Linguistica

A C++ program that runs under Windows, Mac OS X, and Linux that is available at:

http://humanities.uchicago.edu/faculty/goldsmith/

There are explanations, papers, and other downloadable tools available there.


References (for the 1st part)

Goldsmith (2001) “Unsupervised Learning of the Morphology of a Natural Language” Computational Linguistics


Overview

Look at Linguistica in action: English, French
Theoretical foundations
Underlying heuristics
Further work


Linguistica

A program that takes in a text in an “unknown” language…

…and produces a morphological analysis:
a list of stems, prefixes, suffixes;
more deeply embedded morphological structure;
regular allomorphy


Linguistica


Actions and outlines of information

Here: lists of stems, affixes, signatures, etc.

Here: some messages from the analyst to the user.


Read a corpus

Brown corpus: 1,200,000 words of typical English

French Encarta or anything else you like, in a text file.

Set the number of words you want read, then select the file.


List of stems

A stem’s signature is the list of suffixes it appears with in the corpus, in alphabetical order.

Stem      Signature  Words
abilit    ies.y      abilities, ability
aboli     tion       abolition
absen     ce.t       absence, absent
absolute  NULL.ly    absolute, absolutely
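The definition of a signature can be sketched in a few lines of Python. This is only an illustration, not Linguistica's own code: the function name `signatures` and the input format (a list of stem/suffix cuts, with "NULL" standing for the empty suffix) are assumptions made for the example.

```python
from collections import defaultdict

def signatures(cuts):
    """Map each stem to its signature: its suffixes, sorted and joined by '.'."""
    suffixes_of = defaultdict(set)
    for stem, suffix in cuts:
        suffixes_of[stem].add(suffix)
    # Plain string sorting puts the upper-case NULL before lower-case suffixes,
    # which matches the way signatures are written on the slides.
    return {stem: ".".join(sorted(sufs)) for stem, sufs in suffixes_of.items()}

# Hypothetical input: the cuts behind the four stems listed on the slide.
cuts = [
    ("abilit", "ies"), ("abilit", "y"),
    ("aboli", "tion"),
    ("absen", "ce"), ("absen", "t"),
    ("absolute", "NULL"), ("absolute", "ly"),
]
sigs = signatures(cuts)
print(sigs["abilit"], sigs["absolute"])
```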


List of signatures


Signature: NULL.ed.ing.s

for example,
account accounted accounting accounts
add added adding adds


Signature <e>ion . NULL

composite concentrate corporate détente discriminate evacuate inflate opposite participate probate prosecute tense

What is this?

composite and composition

composite → composit → composit + ion

It infers that ion deletes a stem-final ‘e’ before attaching.

We’ll see how we can find a more sophisticated signature…


Top signatures in English


Over-arching theory

The selection of a grammar, given the data, is an optimization problem.

Optimization means finding a maximum or minimum of some objective function

Minimum Description Length provides us with a means for understanding grammar selection as minimizing a function.

(We’ll get to MDL in a moment)


What’s being minimized by writing a good morphology? The number of letters is part of it

Compare:


Naive Minimum Description Length

Corpus:

jump, jumps, jumping

laugh, laughed, laughing

sing, sang, singing

the, dog, dogs

total: 61 letters

Analysis:

Stems: jump laugh sing sang dog (20 letters)

Suffixes: s ing ed (6 letters)

Unanalyzed: the (3 letters)

total: 29 letters.

Notice that the description length goes UP if we analyze sing into s+ing
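The letter counts on this slide can be checked with a short Python sketch. The alternative analysis at the end is an assumption made for illustration: if sing is cut as s+ing, the stem "s" cannot produce "singing" with the existing suffixes, so that word is treated here as unanalyzed.

```python
def letters(items):
    """Total number of letters in a collection of strings."""
    return sum(len(item) for item in items)

corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]

# The analysis from the slide.
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]
analysis_cost = letters(stems) + letters(suffixes) + letters(unanalyzed)

# Hypothetical alternative: analyze sing as s+ing and leave singing unanalyzed.
alt_stems = ["jump", "laugh", "s", "sang", "dog"]
alt_unanalyzed = ["the", "singing"]
alt_cost = letters(alt_stems) + letters(suffixes) + letters(alt_unanalyzed)

print(letters(corpus), analysis_cost, alt_cost)  # prints: 61 29 33
```

Under that assumption the shorter stem saves three letters but the unanalyzed list grows by seven, so the total rises from 29 to 33.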


Minimum Description Length (MDL)

Rissanen (1989) (not a CL paper)

The best “theory” of a set of data is the one which is simultaneously:
1. most compact or concise, and
2. provides the best modeling of the data

“Most compact” can be measured in bits, using information theory

“Best modeling” can also be measured in bits…


Essence of MDL

[Chart: the two components of description length, “Length of morphology” and “Log prob of corpus”, shown for three analyses: the best analysis, an elegant theory that works badly, and a complex theory modeled from data. Axis scale: 0 to 700,000.]


Description Length =

Conciseness: Length of the morphology. It’s almost as if you count up the number of symbols in the morphology (in the stems, the affixes, and the rules).

Length of the modeling of the data. We want a measure which gets bigger as the morphology is a worse description of the data.

Add these two lengths together = Description Length


Conciseness of the morphology

Sum all the letters, plus all the structure inherent in the description, using information theory.


Remember Entropy?

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$

Entropy was the weighted (by p(x)) sum of the information content or optimal compressed length (−log2 p(x)) of x. It’s called that because it is always possible to develop a compression scheme by which a symbol x, emitted with probability p(x), is represented by a placeholder of length −log2 p(x) bits.


Optimal Compressed Length

The reason this is mentioned is that we will have lots of pieces of information in our model, and we’d like to figure out how much “space” it takes up.

Remember, we want the smallest model possible, so we are going to want the best compression for anything in our model

Also, remember this:

$$-\log p(x) = \log \frac{1}{p(x)}$$
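As a small illustration of optimal compressed length and entropy, here is a Python sketch. The function names and the three-symbol distribution are made up for the example.

```python
import math

def code_length(p):
    """Optimal compressed length, in bits, of a symbol with probability p."""
    return -math.log2(p)

def entropy(dist):
    """Entropy in bits: the probability-weighted average code length."""
    return sum(p * code_length(p) for p in dist.values() if p > 0)

# Hypothetical distribution over three symbols.
dist = {"a": 0.5, "b": 0.25, "c": 0.25}
print(code_length(dist["a"]), code_length(dist["b"]), entropy(dist))  # prints: 1.0 2.0 1.5
```

The frequent symbol gets the short code (1 bit) and the rare ones get longer codes (2 bits each), which is what keeps the average length down.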


Conciseness of stem list and suffix list

(ii) Suffix list:

$$\sum_{f \in \text{Suffixes}} \left( \lambda \cdot |f| + \log \frac{[W_A]}{[f]} \right)$$

(iii) Stem list:

$$\sum_{t \in \text{Stems}} \left( \lambda \cdot |t| + \log \frac{[W]}{[t]} \right)$$

|t| = number of letters in stem

|f| = number of letters in suffix

log term = cost of setting up this entity: length of pointer in bits

λ = number of bits/letter < 5
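A minimal sketch of how such a list length might be computed follows. Several things here are assumptions, not taken from the slide: logs are base 2, λ is set to log2(26) (about 4.7 bits per letter for a 26-letter alphabet), the bracketed quantities are read as occurrence counts, and the counts themselves are invented toy numbers.

```python
import math

LAMBDA = math.log2(26)  # assumed bits per letter for a 26-letter alphabet (< 5)

def list_length(counts, total):
    """Sum over entries of LAMBDA * len(entry) + log2(total / count(entry))."""
    return sum(LAMBDA * len(entry) + math.log2(total / count)
               for entry, count in counts.items())

# Invented toy counts, for illustration only.
stem_counts = {"jump": 4, "laugh": 4}
total_words = 16  # plays the role of [W]
stem_bits = list_length(stem_counts, total_words)
print(round(stem_bits, 2))
```

Each entry pays once for its letters and once for a pointer, and the pointer is cheaper for frequent entries.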


Signature list length

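List of pointers to signatures:

$$\sum_{\sigma \in \text{Signatures}} \log \frac{[W]}{[\sigma]}$$

$$+ \sum_{\sigma \in \text{Signatures}} \left( \log \langle \text{stems}(\sigma) \rangle + \log \langle \text{suffixes}(\sigma) \rangle \right)$$

$$+ \sum_{\sigma \in \text{Sigs}} \left( \sum_{t \in \text{Stems}(\sigma)} \log \frac{[W]}{[t]} + \sum_{f \in \text{Suffixes}(\sigma)} \log \frac{[\sigma]}{[f \text{ in } \sigma]} \right)$$

<X> (written $\langle X \rangle$ above) indicates the number of distinct elements in X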


Length of the modeling of the data

Probabilistic morphology: the measure is -1 * log probability(data),

where the morphology assigns a probability to any data set.

This is known in information theory as the optimal compressed length of the data (given the model).


Probability of a data set?

A grammar can be used not (just) to specify what is grammatical and what is not, but to assign a probability to each string (or structure).

If we have two grammars that assign different probabilities, then the one that assigns a higher probability to the observed data is the better one.
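To make the comparison concrete, here is a toy Python sketch. It assumes the simplest possible "grammar", one that assigns each word a fixed probability and treats the words of the corpus as independent; the two models and the four-word corpus are invented.

```python
import math

def compressed_length(corpus, prob):
    """-log2 of the corpus probability, assuming words are independent draws."""
    return -sum(math.log2(prob[word]) for word in corpus)

corpus = ["the", "dog", "the", "dogs"]
model_a = {"the": 0.5, "dog": 0.25, "dogs": 0.25}
model_b = {"the": 0.25, "dog": 0.25, "dogs": 0.5}

len_a = compressed_length(corpus, model_a)
len_b = compressed_length(corpus, model_b)
print(len_a, len_b)  # prints: 6.0 7.0
```

Model A assigns the observed data a higher probability, so its compressed length is shorter (6 bits against 7) and it counts as the better of the two.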


This follows from the basic principle of rationality in the Universe:

Maximize the probability of the observed data.


From all this, it follows:

There is an objective answer to the question: which of two analyses of a given set of data is better?

However, there is no general, practical guarantee of being able to find the best analysis of a given set of data.

Hence, we need to think of (this sort of) linguistics as being divided into two parts:


An evaluator (which computes the Description Length); and

A set of heuristics, which create grammars from data, and which propose modifications of grammars, in the hopes of improving the grammar.

(Remember, these “things” are mathematical things: algorithms.)


Let’s step back for a minute

Why is this problem so hard at first?

Because figuring out the best analysis of any given word generally requires having figured out the rough outlines of the whole overall morphology. (Same is true for other parts of the grammar!)

How do we start?


You all know the answer to this question already…

We start with Zellig Harris’ successor frequency!

Although we got some good answers, we also saw that it made lots of mistakes

So…


As a boot-strapping method to construct a first approximation of the signatures, Harris’ method is pretty good.

We accept only stems of 5 letters or more;
only cuts where the SuccFreq is > 1, and where the neighboring SuccFreq is 1.

(This setup was experiment 16 from the lab on Monday.)
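The bootstrapping rule can be sketched in Python as follows. This is one possible reading, not the lab code: it assumes that end-of-word counts as a successor (marked with "#"), that "neighboring" means the prefixes one letter shorter and one letter longer, and that an empty suffix is written NULL. The word list is invented.

```python
from collections import defaultdict

END = "#"  # assumed end-of-word marker, counted as a possible successor

def successor_frequencies(words):
    """For every prefix, count the distinct symbols that can follow it."""
    successors = defaultdict(set)
    for word in words:
        marked = word + END
        for i in range(len(word) + 1):
            successors[marked[:i]].add(marked[i])
    return {prefix: len(nxt) for prefix, nxt in successors.items()}

def bootstrap_cuts(words, min_stem=5):
    """Cut after a prefix whose SuccFreq is > 1 while both neighbors have SuccFreq 1."""
    sf = successor_frequencies(words)
    cuts = set()
    for word in words:
        for i in range(min_stem, len(word) + 1):
            stem, suffix = word[:i], word[i:] or "NULL"
            left = sf[word[:i - 1]]
            # At the end of the word there is no longer prefix; treat it as 1.
            right = sf[word[:i + 1]] if i < len(word) else 1
            if sf[stem] > 1 and left == 1 and right == 1:
                cuts.add((stem, suffix))
    return cuts

words = ["laugh", "laughed", "laughing", "laughs", "table"]
cuts = bootstrap_cuts(words)
print(sorted(cuts))
```

On this word list the only peak is after "laugh", which is followed by four different symbols, so all four words are cut there and "table" is left alone.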


Let’s look at how the work is done (in the abstract), step by step...


Corpus

Pick a large corpus from a language: 5,000 to 1,000,000 words.


Corpus

Bootstrap heuristic

Feed it into the “bootstrapping” heuristic...


Corpus

Out of which comes a preliminary morphology, which need not be superb.

Morphology

Bootstrap heuristic


Corpus

Morphology

Bootstrap heuristic

incremental heuristics

Feed it to the incremental heuristics (…which we haven’t seen yet)


Corpus

Morphology

Bootstrap heuristic

incremental heuristics

modified morphology

Out comes a modified morphology.


Corpus

Morphology

Bootstrap heuristic

incremental heuristics

modified morphology

Is the modification an improvement? Ask MDL!


Corpus

Morphology

Bootstrap heuristic

modified morphology

If it is an improvement, replace the morphology...

Garbage


Corpus

Bootstrap heuristic

incremental heuristics

modified morphology

Send it back to the incremental heuristics again...


Morphology

incremental heuristics

modified morphology

Continue until there are no improvements to try.
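The loop built up over the last few slides can be sketched as a greedy search. The Python below is a schematic illustration, not Linguistica's implementation: `improve`, `heuristics`, and `description_length` are hypothetical names, and the toy run replaces a real morphology with a plain number so that the example stays runnable.

```python
def improve(morphology, heuristics, description_length):
    """Greedy loop: keep a proposed change only if it lowers the description length."""
    best = description_length(morphology)
    improved = True
    while improved:
        improved = False
        for propose in heuristics:
            candidate = propose(morphology)
            cost = description_length(candidate)
            if cost < best:
                # The modification is an improvement: replace the morphology.
                morphology, best = candidate, cost
                improved = True
            # Otherwise the candidate is discarded.
    return morphology, best

# Toy stand-ins: the "morphology" is a number, the evaluator a bowl-shaped cost,
# and the two heuristics propose a step down or a step up.
result, cost = improve(10, [lambda m: m - 1, lambda m: m + 1], lambda m: (m - 3) ** 2)
print(result, cost)  # prints: 3 0
```

The evaluator never proposes anything and the heuristics never judge anything, which is the division of labor described earlier.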


The details of learning morphology

There is nothing sacred about the particular choice of heuristic steps


Steps

Successor Frequency: strict

Extend signatures to cases where a word is composed of a known stem and a known suffix.

Loose fit: Look at all unanalyzed words. Look to see if they can be cut: stem + suffix, where the suffix already exists. Do this in all possible ways. See if any of these lead to stems with signatures that already exist. If so, take the “best” one. If not, compute the utility of the signature using MDL.
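The "extend signatures" step lends itself to a short Python sketch. The function name and the toy stem and suffix sets are assumptions; the loose-fit step, which also admits new stems and scores them with MDL, is not attempted here.

```python
def extend_known(words, stems, suffixes):
    """Find every cut of an unanalyzed word into a known stem plus a known suffix."""
    found = {}
    for word in words:
        cuts = [(word[:i], word[i:]) for i in range(1, len(word))
                if word[:i] in stems and word[i:] in suffixes]
        if cuts:
            found[word] = cuts
    return found

# Invented toy data.
stems = {"laugh", "jump", "walk"}
suffixes = {"ed", "ing", "s"}
found = extend_known(["walked", "jumps", "sing", "table"], stems, suffixes)
print(found)
```

Note that "sing" is not cut as s+ing here, because "s" is not a known stem.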


Check existing signatures: Using MDL to find best stem/suffix cut. Examples…


Check signatures (English)

on/ve → ion/ive
an/en → man/men
l/tion → al/ation
m/t → alism/alist, etc.

How?


Check signatures

Signature l/tion with stems:
federa inaugura orienta substantia

We need to compute the Description Length of the analysis as it stands versus as it would be if we shifted varying parts of the stems to the suffixes.
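The comparison can be sketched in Python for this signature. One loud simplification: the cost used here is the naive letter count from earlier in the talk, standing in for the full Description Length, and the function names are made up.

```python
def shift_letters(stems, suffixes, k):
    """Move the last k letters of every stem onto the front of every suffix.

    Returns None when the stems do not all end in the same k letters.
    """
    if k == 0:
        return list(stems), list(suffixes)
    tails = {stem[-k:] for stem in stems}
    if len(tails) != 1 or any(len(stem) <= k for stem in stems):
        return None
    tail = tails.pop()
    return [stem[:-k] for stem in stems], [tail + suffix for suffix in suffixes]

def naive_cost(stems, suffixes):
    """Stand-in for Description Length: total letters in stems and suffixes."""
    return sum(map(len, stems)) + sum(map(len, suffixes))

stems = ["federa", "inaugura", "orienta", "substantia"]
suffixes = ["l", "tion"]

candidates = []
for k in range(4):
    shifted = shift_letters(stems, suffixes, k)
    if shifted is not None:
        candidates.append((naive_cost(*shifted), k, shifted))

best_cost, best_k, (best_stems, best_suffixes) = min(candidates, key=lambda c: c[:2])
print(best_k, best_suffixes, best_cost)  # prints: 1 ['al', 'ation'] 34
```

Shifting one letter lowers the count from 36 to 34, giving al/ation; shifting two is impossible because the stems no longer share an ending.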


“Check signatures” French:

NULL nt r >> a ant ar
NULL nt >> i int
ent t >> oient oit
NULL r >> i ir
f on ve >> sif sion sive
eur ion >> seur sion
ce t >> ruce rut
se x >> ouse oux
l ux >> al aux
me te >> ume ute
eurs ion >> teurs tion
f ve >> dif dive
it nt >> ait ant
que sme >> ïque ïsme
NULL s ur >> e es eur
ient nt >> aient ant
f on >> sif sion
nt r >> ent er


100,000 tokens, 12,208 types

Step               Stems  Signatures  Suffixes
Zellig redux       1,403  140         68
Extend signatures         226
Loose fit          2,395  702         68
Check signatures   2,409  730         110
Smooth stems       2,400  735         115


Allomorphy

Find relations among stems: find principles of allomorphy, like “delete stem-final e before –ing” on the grounds that this simplifies the collection of Signatures:

Compare the signatures NULL.ing, and e.ing.


NULL.ing and e.ing

NULL.ing: its stems do not end in –e.

-ing (almost) never appears after stem-final e. (ex. singeing)

So e.ing and NULL.ing can both be subsumed under: <e>ing.NULL, where <e>ing means a suffix ing which deletes a preceding e.
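One way to read the notation operationally is the following Python sketch. The parsing of the angle-bracket prefix and the function name `attach` are assumptions made for illustration.

```python
def attach(stem, suffix):
    """Attach a suffix; a suffix written '<e>ing' first deletes a stem-final e."""
    if suffix == "NULL":
        return stem
    if suffix.startswith("<") and ">" in suffix:
        deleted, rest = suffix[1:].split(">", 1)
        if stem.endswith(deleted):
            stem = stem[: -len(deleted)]
        return stem + rest
    return stem + suffix

# One signature, <e>ing.NULL, now covers stems with and without a final e.
words = {stem: [attach(stem, s) for s in ("NULL", "<e>ing")] for stem in ("jump", "love")}
print(words)
```

With this reading, "jump" and "love" share a single signature even though only one of them loses a letter before ing.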


Find layers of affixation

Find roots (from among the Stem collection)

In other words, recursively look through our list of Stems and see if we could (or should) be analyzing them again:

readings = reading+s = read+ing+s Etc.
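The recursive re-analysis can be sketched in Python. The stopping rule is a simplification chosen for the example: a suffix is peeled off only when what remains is itself an attested word, and longer suffixes are tried first.

```python
def layers(word, suffixes, vocabulary):
    """Peel known suffixes off a word while the remainder is itself a known word."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in vocabulary:
            return layers(stem, suffixes, vocabulary) + [suffix]
    return [word]

# Invented toy vocabulary and suffix set.
vocabulary = {"read", "reading", "readings"}
parts = layers("readings", {"s", "ing"}, vocabulary)
print("+".join(parts))  # prints: read+ing+s
```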


What’s the future work?

1. Identifying suffixes through syntactic behavior (→ syntax)

2. Better allomorphy (→ phonology)

3. Languages with more morphemes/word (“rich” morphology)


“Using eigenvectors of the bigram graph to infer grammatical features and categories” (Belkin & Goldsmith 2002)


Method

Build a graph in which “similar” words are adjacent;

Compute the normalized Laplacian (linear algebra -- it just sounds fancy!) of that graph;

Compute the eigenvectors with the lowest non-zero eigenvalues; (more linear algebra)

Plot them.
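The linear algebra can be sketched without any libraries. The Python below is a toy version under stated assumptions: the graph of "similar" words is replaced by a tiny hand-made graph of six nodes (two triangles joined by one edge), the graph is assumed connected, and the eigenvector is found by simple power iteration rather than by whatever solver the paper used.

```python
import math

def normalized_laplacian(adj):
    """L = I - D^(-1/2) A D^(-1/2) for a symmetric adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[(1.0 if i == j else 0.0) - adj[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]

def second_eigenvector(adj, iterations=500):
    """Eigenvector of the lowest non-zero eigenvalue, by power iteration on 2I - L."""
    n = len(adj)
    lap = normalized_laplacian(adj)
    # The eigenvector for eigenvalue 0 is proportional to the square-rooted degrees.
    trivial = [math.sqrt(sum(row)) for row in adj]
    norm = math.sqrt(sum(x * x for x in trivial))
    trivial = [x / norm for x in trivial]
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        dot = sum(a * b for a, b in zip(v, trivial))
        v = [a - dot * b for a, b in zip(v, trivial)]  # remove the trivial direction
        v = [2 * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        size = math.sqrt(sum(x * x for x in v))
        v = [x / size for x in v]
    return v

# Toy graph: nodes 0-2 form one triangle, nodes 3-5 another, joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

v = second_eigenvector(adj)
print([round(x, 3) for x in v])
```

The signs of the entries separate the two triangles, which is the effect the word maps rely on: nodes that are close in the graph end up close in the plot.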


Map 1,000 English words by left-hand neighbors

non-finite verbs: be, do, go, make, see, get, take, go, say, put, find, give, provide, keep, run…

finite verbs: was, had, has, would, said, could, did, might, went, thought, told, knew, took, asked…

world, way, same, united, right, system, city, case, church, problem, company, past, field, cost, department, university, rate, door,

?: and, to, in, that, for, he, as, with, on, by, at, or, from…


Map 1,000 English words by right-hand neighbors

adjectives

social national white local political personal private strong medical final black French technical nuclear British

Prepositions: of in for on by at from into after through under since during against among within along across including near


End