Multilingual Language Processing

Hal Daumé III
Computer Science, University of Maryland
[email protected]

Blair Linguistics Club
19 Nov 2014

Piyush Rai (Duke)
Lyle Campbell (U Hawaii)
Sujith Ravi (Google)
Adam Teichert (JHU)

Why study O(100) languages?

➢ What makes a language a human language?
➢ What properties of “Language” can be learned from, or exploited in, text?
➢ Computational challenge of dealing with large, uncertain data sets
➢ You never know what language will be important tomorrow
➢ Pairwise models of language don't scale
➢ Hard to find linguists or translators for minority languages


Typical NLP pipeline

Analysis:   Source Words → Source Morphology → Source Syntax → Source Shallow Semantics → Source Semantics → Interlingua
Generation: Interlingua → Target Semantics → Target Shallow Semantics → Target Syntax → Target Morphology → Target Words

Example: The man ate a sandwich

Morphology:     The man eat+past a sandwich
Tagging:        DT NN VB DT NN
Parsing:        NP (The man), NP (a sandwich), VP, S
Role labeling:  Agent (The man), Theme (a sandwich)
Interpretation: ∃a ∃t ∃e man(a) & sandwich(t) & eat(e,a,t) & past(e)


A unified approach

Raw Text
Annotated Treebanks
Parallel Data
Typological Features (VO ⊃ PreP, PostP ⊃ OV)
Linguistic Features

Afrikaans, Albanian, Amuzgo, Arabic, Arabic (Syrian), Armenian, Armenian, Azerbaijani, Basque, Bulgarian, Burmese, Byzantine, Cakchiquel, Chamorro, Cherokee, Chinantec, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Gaelic, German, Greek, Gujarati, Haitian Creole, Hebrew, Hiligaynon, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Jacalteco, Kannada, K'ekchí, Klingon, Korean, Latin, Latvian, Lithuanian, Low German, Macedonian, Malagasy, Malayalam, Mam, Mam of Todos, Mandan, Mandarin, Maori, Nahuatl, Ndebele, Norwegian, Orya, Persian, Polish, Portuguese, Potawatomi, Quiché, Romanian, Romani, Russian, Serbian, Shona, Slovak, Somali, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Uma, Urdu, Uspanteco, Vietnamese


How does (e.g.) syntax work?

➢ Get some linguists to annotate text with syntactic structures (annotated treebanks)
➢ Estimate a probabilistic context-free grammar (PCFG) from those structures
➢ Use that PCFG to parse new “test” sentences
➢ Works for any language for which we have annotated text!
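The estimation step above can be sketched as relative-frequency counting over treebank productions. This is only an illustration: the nested-tuple tree encoding, the function names and the two toy trees are assumptions made for the example, not material from the talk.

```python
from collections import Counter

# Assumed encoding: a tree is (label, child, child, ...); a leaf child is a plain string.
def productions(tree):
    """Yield (lhs, rhs) for every internal node of the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for c in children:
        if not isinstance(c, str):
            yield from productions(c)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: p(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(p for tree in treebank for p in productions(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Toy treebank (hypothetical annotations).
toy_treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "man")),
          ("VP", ("VB", "ate"), ("NP", ("DT", "a"), ("NN", "sandwich")))),
    ("S", ("NP", ("NN", "people")), ("VP", ("VB", "eat"))),
]
pcfg = estimate_pcfg(toy_treebank)
```

With these two trees, `NP -> DT NN` is seen twice and `NP -> NN` once, so the estimates are 2/3 and 1/3. Parsing new sentences with the resulting grammar would need a separate algorithm such as CKY, which is not shown here.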


Multilinguality as a source of transfer

English: The man ate a tasty sandwich
         D N V D J N
French:  L' homme a mangé un sandwich savoureux
         D N A V D N J
Spanish: El hombre se comió un bocadillo sabroso
         D N A V D N J

Each sentence is parsed (NP, VP, S) by its own grammar: an English PCFG, a French PCFG and a Spanish PCFG. The three grammars are linked through shared parameters ϴ.
[Berg-Kirkpatrick & Klein; ACL10] [Iwata, Mochihashi & Sawada; ACL10]

+21% on average over 8 languages: English, Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese

See also: Snyder, Barzilay et al. ...

Typology can help language processing

Language processing can help typology

Statistics is the mediator


Implicational Universals

English: Verb-Object (VO), Prepositional (PreP)
    I eat dinner in restaurants.

French: Verb-Object (VO), Prepositional (PreP)
    je mange le dîner dans les restaurants
    I eat the dinner in the restaurants

Japanese: Object-Verb (OV), Postpositional (PostP)
    boku-wa bangohan-o resutoran-de taberu
    I-topic dinner-obj restaurants-in eat

Hindi: Object-Verb (OV), Postpositional (PostP)
    main raat ka khaana restra mein khaata hoon
    I night-of-meal restaurants in eat am

VO ⊃ PreP
PostP ⊃ OV

The Typologist's Life

        PreP   PostP
VO       16      3
OV        0     11

(Greenberg, 1963): based on 30 diversely sampled languages

Now, repeat for lots of feature pairs

Difficulties with Typical Approach

➢ A ⊃ B (99%) is uninteresting when ∅ ⊃ B (99%)
➢ Search process is tedious
➢ Sampling problem when many languages considered
➢ Process is inherently noisy
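The first difficulty can be made concrete with the Greenberg (1963) counts shown earlier (16, 0, 3, 11), read here as VO/PreP = 16, VO/PostP = 3, OV/PreP = 0, OV/PostP = 11. The sketch below compares how often an implication holds with the base rate of its right-hand side. The `prob` helper and the "lift" ratio are illustrative assumptions, not the measure used in the talk.

```python
# Language counts by word order and adposition type (30 languages in total).
counts = {("VO", "PreP"): 16, ("VO", "PostP"): 3,
          ("OV", "PreP"): 0,  ("OV", "PostP"): 11}

def prob(b, given=None):
    """P(b) when given is None, otherwise P(b | given), from the raw counts."""
    rows = [(k, n) for k, n in counts.items() if given is None or given in k]
    total = sum(n for _, n in rows)
    return sum(n for k, n in rows if b in k) / total

p_cond = prob("PreP", given="VO")   # how often VO ⊃ PreP holds: 16/19
p_base = prob("PreP")               # how often ∅ ⊃ PreP holds: 16/30
lift = p_cond / p_base              # well above 1, so VO carries information
```

An implication is only worth reporting when the conditional rate is clearly higher than the base rate. If both were 99%, the left-hand side would add nothing.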

A Typological Database

➢ 2150 languages
    ➢ 35 language families
    ➢ 275 language genera
➢ 139 features
    ➢ 11 feature categories
➢ Sparsely sampled
    ➢ 85% missing data

Typological Map: VO

Typological Map: PreP


An Initial Model

➢ Consider two features --> 2xN matrix
➢ First, generate the first column with prior probability π1
➢ Next, decide if the implication holds
➢ Finally, generate the second column:
    ➢ With probability π2 if feature 1 is not “+” or if the implication doesn't hold
    ➢ Forced to be “+” otherwise

Example data (one entry per language, “?” = unknown):
VO:   + + - + + ? + ? ? + - + ? + -
PreP: + ? + - + + + - - ? - + - + +

Problems:
➢ Cannot handle noisy data
➢ Doesn't address sampling problem

Graphical model: m, π1, π2, f1, f2
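The generative story of the initial model can be sketched as a small sampler. This is a minimal illustration under stated assumptions: `p_implication` (the prior probability that the implication holds) and the function name are hypothetical, and the real model is fitted by inference rather than by forward sampling.

```python
import random

def sample_pair(n_languages, pi1, pi2, p_implication, rng):
    """Sample one feature pair (f1, f2) for n_languages languages.

    m is drawn once per feature pair; True means the implication f1 ⊃ f2 holds.
    """
    m = rng.random() < p_implication
    rows = []
    for _ in range(n_languages):
        f1 = rng.random() < pi1          # first feature: "+" with probability pi1
        if m and f1:
            f2 = True                    # forced to "+" by the implication
        else:
            f2 = rng.random() < pi2      # otherwise "+" with probability pi2
        rows.append((f1, f2))
    return m, rows

m, rows = sample_pair(15, pi1=0.6, pi2=0.5, p_implication=0.5, rng=random.Random(0))
```

Because a true implication forces the second value, a single observed "+ -" row is impossible under `m = True`. That is the "cannot handle noisy data" problem listed above.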

Fixing the Noise Problem

➢ Assume language-specific noise
➢ Model remains unchanged, except a new variable causes “f” to be flipped

Graphical model: m, π1, π2, f1, f2 as before, plus noise variables e1, ε, e2
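One way to read the noise variables is sketched below: each language has its own noise rate ε, and e1, e2 indicate whether the observed value of f1 or f2 is flipped. That reading, the function name `observe` and the toy inputs are assumptions made for the example.

```python
import random

def observe(true_rows, eps_per_language, rng):
    """Flip each true feature value independently with that language's noise rate."""
    noisy = []
    for (f1, f2), eps in zip(true_rows, eps_per_language):
        e1 = rng.random() < eps      # flip indicator for feature 1
        e2 = rng.random() < eps      # flip indicator for feature 2
        noisy.append((f1 != e1, f2 != e2))
    return noisy

true_rows = [(True, True), (False, True), (True, True)]
observed = observe(true_rows, [0.05, 0.05, 0.05], random.Random(0))
```

With a small ε, an occasional "+ -" observation no longer rules out the implication. It is explained as a flipped value instead.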


Fixing the Sampling Problem

➢ Hierarchical Bayes prior...

The per-language model (f1, f2, e1, ε, e2) is repeated for each language, and the implication variable m is placed in a hierarchy:

m0
    mIE
        mGer
        mRom
    mAus
        mOce

Inference

➢ Binomials get Beta priors
    ➢ m ~ Uniform
    ➢ ε ~ Beta with 5% mean, 0-10% with 50% probability
➢ Everything else gets uniform priors
➢ Inference by Gibbs sampling
    ➢ Plus a rejection sampler subroutine

Three Models

Flat – All languages independent
LingHier – Typological Hierarchy
DistHier – Obtained by clustering positionally

Automatically Extracting Implications

➢ Search only over pairs with:
    ➢ At least 250 languages for which both features are known
    ➢ At least 15 languages for which both hold simultaneously
    ➢ When f1 is true, f2 is true with >50% probability
➢ Reduces space from 19,000 to 3442
➢ Sort by probability that m is true
➢ Evaluate:
    ➢ Compare the models' restoration accuracy against each other
    ➢ Compare against well-known implications
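The pre-filtering step can be sketched as follows. The data layout (one list per feature with `None` for missing values), the function names and the lowered thresholds in the toy call are assumptions made for the example. The toy columns are the VO and PreP strings from the initial-model slide.

```python
def candidate_pairs(data, min_known=250, min_both=15, min_cond=0.5):
    """Return ordered feature pairs (a, b) worth testing for the implication a ⊃ b.

    data maps a feature name to a list of True/False/None, one entry per language.
    """
    names = list(data)
    keep = []
    for a in names:
        for b in names:
            if a == b:
                continue
            known = [(x, y) for x, y in zip(data[a], data[b])
                     if x is not None and y is not None]
            both = sum(1 for x, y in known if x and y)
            a_true = sum(1 for x, _ in known if x)
            if (len(known) >= min_known and both >= min_both
                    and a_true > 0 and both / a_true > min_cond):
                keep.append((a, b))
    return keep

def parse(column):
    """Turn a string of '+', '-', '?' into True, False, None."""
    return [None if c == "?" else c == "+" for c in column]

toy = {"VO": parse("++-++?+??+-+?+-"), "PreP": parse("+?+-+++--?-+-++")}
pairs = candidate_pairs(toy, min_known=5, min_both=3)
```

With 139 features there are 139 × 138 = 19,182 ordered pairs, which matches the "19,000" on the slide. The filter keeps only pairs with enough evidence to be worth running inference on.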

Restoration Accuracy by Model

Top Implications – LingHier

Postpositions Gen-N Greenberg #2a OV Greenberg #4 OV Gen-N Greenberg #4 + Greenberg #2a Gen-Noun Greenberg #2a (converse)

OV Greenberg #2b (converse) SV Gen-N ??? Adj-N Greenberg #18 Suffixing Clear explanation VO

Appeal to economy Dem-NVO Greenberg #3 (converse)

Adj-N Dem-N Greenberg #18 Noun-AdjSV ??? VO Greenberg #3

Prefixing Greenberg #27b N-Adj ???

Labial-velars No uvulars See paperNegative word See paperStrong prefixing VO

Suffixing ??? Final Sub. Word

Many vowels See paperPlural prefix N-Gen ??? No fricatives No tones ???

See paperDem-N

PostP

PostPPostPositions

Num-NTense Suf.Noun-RelC Lehmann

Intr. verb No question prt.Num-N Hawkins XVI (for postpositional languages) PreP

PostP Lehmann PostPPreP

Init. Subord. PreP Operator-operand principle (Lehmann) PreP

Little affixation

No pron poss afxLehmann

Subord. SuffixPostP Operator-operand principle (Lehmann)

High+Mid F.V.s

Oblig. subj. pron No pron poss afxTense Suf. Operator-operand principle (Lehmann)

Notes

➢ If you think this stuff is interesting, you should read the Dunn et al. Nature paper
➢ Main claim:
    ➢ All of this typology stuff is bogus
    ➢ Once you account for “genetic” influences
➢ Directly contradicts what I've just told you
➢ Who is right?

Automatic Induction of Syntax

➢ INPUT: A pile of text
➢ OUTPUT: Syntactic structures of this text
➢ Current approaches are mostly based on dependency formalisms

The man ate a big sandwich
D N V D J N

Dependency arcs: MOD (man → The), SUBJ (ate → man), OBJ (ate → sandwich), MOD (sandwich → a), MOD (sandwich → big)

Probabilistic Models of Syntax

D N V D J N

p(V|0,r)
p(N|V,l)
p(D|N,l)
p(N|V,r)
p(D|N,l)
p(J|N,l)

p(Data) = p(V|0,r) p(N|V,l) p(D|N,l) p(N|V,r) p(J|N,l) p(D|N,l)
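The product above can be evaluated directly once each factor p(child | head, direction) has a value. In the sketch below the probability table is made up for illustration (it is not from the talk), and the sum is taken in log space, a common choice to avoid underflow on longer sentences.

```python
import math

# One (child_tag, head_tag, direction) triple per arc; head "0" is the root.
arcs = [("V", "0", "r"), ("N", "V", "l"), ("D", "N", "l"),
        ("N", "V", "r"), ("J", "N", "l"), ("D", "N", "l")]

# Hypothetical parameter values, keyed the same way as the arcs.
p = {("V", "0", "r"): 0.9, ("N", "V", "l"): 0.5, ("N", "V", "r"): 0.4,
     ("D", "N", "l"): 0.6, ("J", "N", "l"): 0.2}

def log_prob(arcs, p):
    """log p(Data) = sum over arcs of log p(child | head, direction)."""
    return sum(math.log(p[a]) for a in arcs)

p_data = math.exp(log_prob(arcs, p))   # 0.9 * 0.5 * 0.6 * 0.4 * 0.2 * 0.6
```

The factor p(D|N,l) appears twice because both determiners attach to a noun on their right, so the same parameter is reused.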

Inferring Tags from the Structure

➢ INPUT: The man ate a big sandwich
➢ OUTPUT: D N V D J N
➢ Baseline:
    ➢ Random guessing: 4% accuracy

Sources of Knowledge

➢ Seeds (frequent words for each tag)
    ➢ N: membro, milhoes, obras
    ➢ D: as [the,2f] o [the,1m] os [the,2m]
    ➢ V: afector, gasta, juntar
    ➢ P: com, como, de, em
➢ Typological rules:
    ➢ Art ← Noun
    ➢ Prp → Noun
➢ Tag knowledge:
    ➢ Open class
    ➢ Closed class

Preliminary Results

[Bar chart: No Seeds vs. Seeds, with series No O/C and Open/Closed; axis from 0 to 60]

Preliminary Results: Open/Closed

[Two bar charts, NO SEEDS and SEEDS, each comparing No Rules, Art<-N, Prp->N and Both; axis from 20 to 60]

Where does the tree come from?

Kingman's Coalescent

A standard model for the genealogy of a population
Each organism has exactly one parent (haploid)
Thus, the genealogy is a tree
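A tree can be drawn from Kingman's coalescent by working backwards in time. With k lineages remaining, the waiting time to the next merge is exponential with rate k(k-1)/2, and the pair that merges is chosen uniformly. The sketch below follows that standard construction. The tuple encoding of internal nodes and the language names used as leaves are assumptions made for the example.

```python
import random

def sample_coalescent(leaves, rng):
    """Return a tree sampled from Kingman's coalescent.

    Internal nodes are (left, right, time); time is measured backwards from the present.
    """
    lineages = list(leaves)
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)    # waiting time until the next merge
        i, j = sorted(rng.sample(range(k), 2))   # uniformly chosen pair, i < j
        right = lineages.pop(j)                  # pop the larger index first
        left = lineages.pop(i)
        lineages.append((left, right, t))
    return lineages[0]

tree = sample_coalescent(["English", "German", "French", "Spanish"], random.Random(1))
```

Each merge has exactly two children and every lineage has one parent, so the result is always a binary tree over the leaves.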

An infinite tree...

Graphical model on a coalescent

Understanding language relationships

Modeling errors

The Balkans

Linguistic areas

Classic linguistic areas

Model desiderata

Pitman-Yor Processes

Generative Story

Discovered results

Shared features

Reconstruction accuracies

Conclusions + Future Steps

➢ Can infer IUs (implicational universals) from data (WALS)
    ➢ Old ones and new ones
➢ Can handle the sampling problem
➢ Can use typology to help tagging
    ➢ Open/closed
    ➢ Simple features
➢ Infer tree structure, too
➢ Don't assume features: just IUs!
➢ Infer multiple languages simultaneously
➢ Feedback from text to IUs