Optimality Theoretic Learning of Lexical Borrowing

Yulia Tsvetkov, Waleed Ammar, Chris Dyer

MURI presentation (year 4), 2014/11/14

Resource-poor NLP

Annotation projection:
1. via word alignments
2. via cross-lingual similarities

[Figure: the annotations (VB DT NN …) of the source (src) sentence "Book the flight …" are projected onto the target (tgt) sentence]
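A minimal sketch of annotation projection via word alignments, assuming the alignment is given as (source index, target index) pairs. The function name `project_tags`, the alignment, and the target length are hypothetical; only the source tokens and tags come from the slide.

```python
def project_tags(src_tags, alignment, tgt_len, unknown="?"):
    """Copy each source tag onto the target position aligned to it."""
    tgt_tags = [unknown] * tgt_len
    for src_i, tgt_j in alignment:
        tgt_tags[tgt_j] = src_tags[src_i]
    return tgt_tags


src_tokens = ["Book", "the", "flight"]
src_tags = ["VB", "DT", "NN"]

# Hypothetical alignment for a 3-word target sentence:
# "Book" -> target word 0, "flight" -> target word 1, "the" unaligned.
alignment = [(0, 0), (2, 1)]

projected = project_tags(src_tags, alignment, tgt_len=3)
print(projected)  # ['VB', 'NN', '?']
```

Target words with no aligned source word keep the placeholder tag, which is where cross-lingual similarities could supply extra evidence.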

Outline

1. Motivation: lexical borrowing as a source of cross-lingual lexical similarities

2. A constraint-based model of lexical borrowing for Arabic-Swahili

3. A model of lexical borrowing improves Swahili-English MT

*unpublished work, in preparation for NAACL’15

Cross-lingual lexical similarities

Words that are orthographically or phonetically similar across different languages and are likely to be mutual translations

Whence cross-lingual lexical similarities?
● Chance (unrelated, false friends)
  ○ an insignificant number of words

Whence cross-lingual lexical similarities?
● Foreign words (transliterations)

[Figure: core-periphery lexicon structure (Itô & Mester ‘95)]

English New York
Yoruba Niu Yoki
Swahili New York
Russian Нью-Йорк
Arabic نیویورك

Whence cross-lingual lexical similarities?
● Foreign words (transliterations)
  ○ proper names
  ○ specialized, peripheral vocabulary

Whence cross-lingual lexical similarities?
● Foreign words (transliterations)
● Genetically related words (cognates)
  ○ words in related languages inherited from one word in a common ancestral language
  ○ content words in core language lexicon

Latin nocte
French nuit
Spanish noche
Italian notte
Portuguese noite
Romanian noapte

Whence cross-lingual lexical similarities?
● Foreign words (transliterations)
● Genetically related words (cognates)
● Borrowed words
  ○ frequent content words
  ○ of foreign origin, but aren’t perceived as foreign

Arabic سكر sukkar*
Latin zuccarum
French sucre
German Zucker
Italian zucchero
English sugar

*transliterated

This work: Lexical borrowing

● Foreign words (transliterations)
● Genetically related words (cognates)
● Borrowed words (loanwords)

Lexical borrowing: adoption and nativization of words from another language (as a result of language contact)

Borrowing is a fundamental research topic in linguistics

Yip ‘93 (Cantonese)

Davidson & Noyer ‘97 (Huave)

Jacobs & Gussenhoven ‘00

Kang ‘03 (Korean)

Kenstowicz & Suchato ‘06 (Thai)

Adler ‘06 (Hawaiian)

Rose & Demuth ‘06

Kenstowicz ‘07 (Fijian)

Schadeberg ‘09 (Swahili)

Mwita ‘09 (Swahili)

Hurskainen ‘04 (Swahili)

Adelaar ‘10 (Malagasy)

Kenstowicz ‘06 (Yoruba)

and many more...

Prior work (in NLP)

Transliteration
Knight & Graehl ‘98
Al-Onaizan & Knight ‘02
Virga & Khudanpur ‘03
Klementiev & Roth ‘06
Tao et al. ‘06
Ravi & Knight ‘09
Ammar, Dyer & Smith ‘12

Borrowing

Cognates
Mann & Yarowsky ‘01
Kondrak ‘01
Kondrak, Marcu & Knight ‘03
Bouchard-Côté et al. ‘09
Hall & Klein ‘10

Lexical borrowing graph

[Figure: borrowing graph over the following words (Haspelmath & Tadmor ‘09)]

Sanskrit पिप्पली pippalī
Persian پلپل pilpil
Arabic فلافل falāfil
Hebrew פלפל falafel’
Swahili pilipili
Gawwada parpaare

Borrowing is pervasive!

Resource-poor languages                              # speakers    Borrowed from resource-rich (% types)
Swahili, Zulu, Malagasy, Hausa, Tarifit, Yoruba      200 million   Arabic, Spanish, English, French (>40%)
Japanese, Vietnamese, Korean, Cantonese, Thai        400 million   Chinese, English (30-70%)
Hindustani, Hindi, Urdu, Bengali, Persian, Pashto    860 million   Arabic, English (>40%)
Total                                                1.4 billion

Case study: Arabic-Swahili borrowing


Arabic-Swahili borrowing: history
● 800 A.D.-1920: Indian Ocean trading
● Influence of Islam
● ~40% of Swahili types are borrowed from Arabic*

*from Standard Swahili-English dictionary (Johnson ‘39)

Arabic-Swahili borrowing: examples

English: fever
Arabic (Semitic): حمى ḥummat
Swahili (Bantu): homa
Phonological & morphological integration:
* syllable structure adaptation: CV, CVV, CVC, CVCC → V, CV
* degemination - Swahili does not allow consonant clusters
* vowel substitution

English: minister
Arabic (Semitic): الوزیر Alwzyr
Swahili (Bantu): kiuwaziri
Phonological & morphological integration:
* Arabic morphology (optionally) drops
* Swahili morphology is applied
* vowel epenthesis to keep syllables open
* vowel substitution

English: palace
Arabic (Semitic): القصر AlqSr
Swahili (Bantu): kasiri
Phonological & morphological integration:
* consonant adaptation: /tˤ/→/t/, /dˤ/→/d/, /θ/→/s/, /x/→/k/, etc.
* vowel epenthesis
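The consonant adaptation listed above amounts to a lookup over phones. The sketch below assumes the word is already a list of IPA phones; the name `adapt_consonants` and the input sequence are made up for illustration, and the table holds only the four substitutions the slide names (the slide's "etc." means the real inventory is larger).

```python
# Substitutions named on the slide; this table is deliberately incomplete.
CONSONANT_MAP = {"tˤ": "t", "dˤ": "d", "θ": "s", "x": "k"}


def adapt_consonants(phones):
    """Replace Arabic phones missing from Swahili with their nearest listed match."""
    return [CONSONANT_MAP.get(p, p) for p in phones]


# Made-up phone sequence, not an attested Arabic word.
adapted = adapt_consonants(["x", "a", "tˤ", "a"])
print(adapted)  # ['k', 'a', 't', 'a']
```

Working on a list of phones rather than a raw string keeps multi-character symbols such as /tˤ/ intact.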

Arabic-Swahili borrowing: our research goals

1. Given a Swahili vocabulary and an Arabic vocabulary, identify plausible donor-loanword candidates

2. Produce a ranked list of candidate donor-loanword pairs

3. Augment Swahili-English MT using Arabic-Swahili borrowing model

Arabic-Swahili borrowing model

1. Convert letters to phones
2. Generate loanword candidates
3. Rank loanword candidates

[Pipeline: Arabic → to IPA → generate loanword candidates → rank loanword candidates → from IPA → Swahili; components are marked as rule-based or learned]

Arabic-Swahili borrowing model: from orthographic to phonetic space

1. Convert letters to phones

[Pipeline: Arabic → to IPA → generate loanword candidates → rank loanword candidates → from IPA → Swahili]

Example (book.sg.indef):
Arabic to IPA: كتابا → kuttaba, kitaba, ...
Swahili from IPA: kitabu → kitabu

Arabic-Swahili borrowing model: generating candidate loanwords

2. Adapt Arabic words to Swahili syllable structure, morphology and phonology (Polomé ‘67; Zawawi ‘79; Schadeberg ‘09; Mwita ‘09)

[Pipeline: Arabic → to IPA → generate loanword candidates (syllabification, morphological adaptation, phonological adaptation) → rank loanword candidates → from IPA → Swahili]

Arabic-Swahili borrowing model: generating candidate loanwords

2. Adapt Arabic words to Swahili syllable structure, morphology and phonology

Example (book.sg.indef):
Arabic to IPA: كتابا → kuttaba, kitaba, ...
Arabic affix removal: kuttaba, kuttab, kitaba, kitab, ...
Syllabification: ku.tta.ba., ku.t.ta.ba., ..., ki.ta.ba., ki.ta.b.
Arabic-to-Swahili phonological adaptation:
  ku.ta.ba. [degemination]
  ku.tata.ba. [epenthesis]
  ku.ta.bu. [final vowel subst.]
  ki.ta.bu. [final vowel subst.]
  ki.ta.bu. [epenthesis]
Swahili morphological adaptation: ku.tata.ba.li.ku.tata.ba.vi.ki.ta.bu. ki.ta.bu.ki.ta.bu.
Swahili from IPA: kitabu → kitabu

(Littell, Price & Levin ‘14)
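The adaptation steps above can be sketched as a small candidate generator. This is a simplified illustration, not the transducer-based GEN used in the talk: it works on unsyllabified strings, always degeminates (the slide treats each operation as optional and also shows epenthesis inside a geminate), and the function names are hypothetical. The stems come from the affix-removal step on the slide; the final vowels /u/, /o/, /i/, /e/ are the ones the deck lists for final vowel substitution.

```python
VOWELS = set("aeiou")
FINAL_VOWELS = "uoie"  # final vowels named in the deck


def degeminate(word):
    """Collapse doubled consonants: 'kuttaba' -> 'kutaba'."""
    out = []
    for ch in word:
        if out and ch == out[-1] and ch not in VOWELS:
            continue
        out.append(ch)
    return "".join(out)


def substitute_final_vowel(word):
    """Variants whose word-final vowel is swapped for each allowed final vowel."""
    if word and word[-1] in VOWELS:
        return {word[:-1] + v for v in FINAL_VOWELS}
    return set()


def epenthesize_final(word):
    """Variants with a vowel added after a word-final consonant (open syllable)."""
    if word and word[-1] not in VOWELS:
        return {word + v for v in FINAL_VOWELS}
    return set()


def generate(stems):
    """Collect candidate forms for every stem."""
    candidates = set()
    for stem in stems:
        base = degeminate(stem)
        candidates.add(base)
        candidates |= substitute_final_vowel(base)
        candidates |= epenthesize_final(base)
    return candidates


stems = ["kuttaba", "kuttab", "kitaba", "kitab"]
candidates = generate(stems)
print("kitabu" in candidates)  # True
```

Even this toy generator overproduces, which is why a separate ranking step is needed.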

Arabic-Swahili borrowing model: learning candidate ranking

3. Produce a ranked list of candidate loanwords

[Pipeline: Arabic → to IPA → generate loanword candidates (syllabification, morphological adaptation, phonological adaptation) → ranking with Optimality Theory constraints → from IPA → Swahili]

Optimality Theory

underlying (donor) form → pronounced forms (loanword candidates) → optimal (loanword) form

● language-universal constraints*
● constraints ranked differently in donor and recipient languages

*competing, violable

Prince & Smolensky ‘08; McCarthy ‘09

Optimality Theory constraints: Faithfulness Constraints

Faithfulness constraints impose input-output correspondence

MAX-IO-MORPH    no (donor) affix deletion
MAX-IO-C        no consonant deletion
MAX-IO-V        no vowel deletion
DEP-IO-MORPH    no (recipient) affix epenthesis
DEP-IO-V        no vowel epenthesis
IDENT-IO-P      no pharyngeal consonant substitution
IDENT-IO-G      no glottal consonant substitution
IDENT-IO-E      no emphatic consonant substitution
IDENT-IO-C      no consonant substitution
IDENT-IO-F      no final vowel substitution
IDENT-IO-V      no vowel substitution
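To make "input-output correspondence" concrete, the sketch below counts a few of these violations from an explicit segment alignment between a donor form and a candidate. The alignment format, the function name, and the choice to count a final vowel change under both IDENT-IO-V and IDENT-IO-F are assumptions; the affix constraints and the pharyngeal, glottal and emphatic variants are left out.

```python
from collections import Counter

VOWELS = set("aeiou")


def faithfulness_violations(alignment):
    """Count MAX / DEP / IDENT violations.

    `alignment` is a list of (donor_segment, output_segment) pairs;
    None marks a deleted (output side) or inserted (donor side) segment.
    """
    counts = Counter()
    last = len(alignment) - 1
    for i, (src, out) in enumerate(alignment):
        if src == out:
            continue
        if out is None:  # donor segment deleted
            counts["MAX-IO-V" if src in VOWELS else "MAX-IO-C"] += 1
        elif src is None:  # segment inserted
            if out in VOWELS:
                counts["DEP-IO-V"] += 1
        elif src in VOWELS and out in VOWELS:
            counts["IDENT-IO-V"] += 1
            if i == last:  # assumption: a final vowel change also counts here
                counts["IDENT-IO-F"] += 1
        else:
            counts["IDENT-IO-C"] += 1
    return counts


# kitaba -> kitabu: final vowel substitution
subst = faithfulness_violations(list(zip("kitaba", "kitabu")))

# kuttaba -> kutaba: degemination, modelled as deleting one /t/
degem = faithfulness_violations(
    [("k", "k"), ("u", "u"), ("t", "t"), ("t", None),
     ("a", "a"), ("b", "b"), ("a", "a")]
)
print(dict(subst), dict(degem))
```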

Optimality Theory constraints: Markedness Constraints

Markedness constraints impose output well-formedness

NO-CODA       syllables must not have a coda
ONSET         syllables must have onsets
PEAK          there is only one syllabic peak
SSP           complex onsets rise in sonority
*COMPLEX-S    no consonant clusters on syllable margins
*COMPLEX-C    no consonant clusters within a syllable
*COMPLEX-V    no vowel clusters
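Markedness violations depend only on the output form, so they can be read off a syllabified candidate. The following is a rough sketch under simplifying assumptions: candidates are dot-delimited strings over plain letters, every vowel letter counts as one peak, and only four of the constraints are checked. The function name is hypothetical.

```python
VOWELS = set("aeiou")


def markedness_violations(candidate):
    """Count a few markedness violations in a form such as 'ku.tta.ba.'."""
    counts = {"NO-CODA": 0, "ONSET": 0, "PEAK": 0, "*COMPLEX-C": 0}
    for syllable in candidate.strip(".").split("."):
        shape = "".join("V" if ch in VOWELS else "C" for ch in syllable)
        if "V" in shape and shape.endswith("C"):
            counts["NO-CODA"] += 1      # consonant after the nucleus
        if shape.startswith("V"):
            counts["ONSET"] += 1        # syllable starts with a vowel
        if shape.count("V") != 1:
            counts["PEAK"] += 1         # zero or several peaks
        if "CC" in shape:
            counts["*COMPLEX-C"] += 1   # consonant cluster inside the syllable
    return counts


print(markedness_violations("ku.tta.ba."))
print(markedness_violations("ki.ta.bu."))
```

Under these assumptions, `ku.tta.ba.` incurs one `*COMPLEX-C` violation and `ki.ta.bu.` incurs none.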


Arabic-Swahili borrowing model: learning candidate ranking

3. Produce a ranked list of candidate loanwords

[Pipeline: Arabic → to IPA → generate loanword candidates (syllabification, morphological adaptation, phonological adaptation) → ranking with Optimality Theory constraints → from IPA → Swahili]

Example (book.sg.indef): كتابا → kuttaba, kitaba, ... → kitabu

Candidates: ku.tata.ba.li.ku.tata.ba.ku.tta.ba. ki.ta.bu.ki.ta.bu.

Candidates annotated with violated constraints: ku.ta<DEP-V>ta<PEAK>.ba.li<DEP-MORPH>.ku.ta<DEP-V>ta<PEAK>.ba.li.ku.tta<*COMPLEX>.ba.ki.ta.bu<IDENT-IO-V>.ki.ta.bu<DEP-V>.
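Given candidates annotated with their violations as above, ranking reduces to summing constraint weights and sorting. The sketch below assumes a weighted-sum cost where lower is better; the weights are hypothetical (the learned values are not shown on the slides), and the three candidates are adapted from the annotated example.

```python
import re


def violations(annotated):
    """Constraint names in an annotated candidate like 'ki.ta.bu<DEP-V>.'."""
    return re.findall(r"<([^>]+)>", annotated)


def surface(annotated):
    """Strip the annotations to recover the candidate form."""
    return re.sub(r"<[^>]+>", "", annotated)


def cost(annotated, weights):
    """Weighted sum of violated constraints (lower is better)."""
    return sum(weights[name] for name in violations(annotated))


# Hypothetical weights, for illustration only.
weights = {"DEP-V": 1.0, "PEAK": 3.0, "DEP-MORPH": 2.0,
           "*COMPLEX": 3.0, "IDENT-IO-V": 0.5}

candidates = [
    "ku.tta<*COMPLEX>.ba.",
    "ki.ta.bu<IDENT-IO-V>.",
    "ki.ta.bu<DEP-V>.",
]

ranked = sorted(candidates, key=lambda c: cost(c, weights))
print([surface(c) for c in ranked])  # ['ki.ta.bu.', 'ki.ta.bu.', 'ku.tta.ba.']
```

With these made-up weights the two derivations of `ki.ta.bu.` outrank the candidate that keeps the consonant cluster.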

Arabic-Swahili borrowing model

[Pipeline: Arabic words → donor words to IPA → GEN → EVAL → IPA to recipient words → Swahili words]

GEN: Generate plausible Swahili phonetic forms
● Syllabification, morphological adaptation, phonological adaptation
● Unweighted insertion/deletion/substitution transducers

EVAL: Re-rank loanword candidates to promote input-output correspondence and output well-formedness
● Ranking with Optimality Theory constraints
● Weighted identity transducers

Arabic-Swahili borrowing model: learning constraint weights

1. Extract a small training set from Arabic-English and English-Swahili parallel corpora based on phonetic and semantic similarity (cf. Kondrak ‘01, cognate identification)

2. Expand the extracted training set using an Arabic morphological analyzer

3. Learn OT constraint weights using machine learning

Training: 417 examples
Test: 73 examples (15%), manually verified by a native Arabic speaker and using a Swahili-English dictionary
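The slides do not say which learning algorithm fits the constraint weights. Purely as an illustration of what learning such weights can look like, the sketch below uses a structured perceptron, which is an assumption and not the authors' method. Each candidate is represented by its violation counts, and the toy training example is invented.

```python
def cost(violation_counts, weights):
    """Weighted sum of violation counts (lower is better)."""
    return sum(weights.get(name, 0.0) * n for name, n in violation_counts.items())


def perceptron_update(weights, gold, predicted, lr=1.0):
    """Make the wrongly preferred candidate costlier and the gold one cheaper."""
    for name in set(gold) | set(predicted):
        diff = predicted.get(name, 0) - gold.get(name, 0)
        weights[name] = weights.get(name, 0.0) + lr * diff
    return weights


def train(examples, epochs=5):
    """examples: list of (gold_violations, [candidate_violations, ...])."""
    weights = {}
    for _ in range(epochs):
        for gold, candidates in examples:
            best = min(candidates, key=lambda v: cost(v, weights))
            if best != gold:
                perceptron_update(weights, gold, best)
    return weights


# Invented toy example: the attested loanword violates IDENT-IO-V once,
# a competing candidate violates *COMPLEX once.
gold = {"IDENT-IO-V": 1}
other = {"*COMPLEX": 1}
learned = train([(gold, [other, gold])])
print(cost(gold, learned) < cost(other, learned))  # True
```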

Arabic-Swahili borrowing model: evaluation

1. Model design

                                                                 Dev    Test
Reachability (%)                                                 75     88
Ambiguity (avg. candidates per input word, baseline: 787,000)    885    857

2. Model accuracy

                Model                                            Accuracy (%)
orthographic    Levenshtein                                      8.9
orthographic    CRF (transliteration, Ammar et al. ‘12)          16.4
phonetic        Levenshtein                                      19.8
phonetic        Levenshtein-H (cognate, Mann & Yarowsky ‘01)     19.7
OT              uniform constraint weights                       29.3
OT              learned constraint weights                       52.0

3. Qualitative evaluation: OT constraint ranking is consistent with linguistic accounts
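For reference, a Levenshtein baseline of the kind compared above can be sketched as picking the donor with the smallest edit distance to the loanword. The slides do not detail how the baselines were set up, so the function names and the tiny donor list are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def nearest_donor(loanword, donors):
    """Return the donor form closest to the loanword."""
    return min(donors, key=lambda d: levenshtein(loanword, d))


donors = ["kitaba", "kuttaba", "trky"]  # illustrative donor forms
print(nearest_donor("kitabu", donors))  # kitaba
```

The same function can be applied to orthographic strings or to IPA strings, which matches the two baseline groups in the table.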

Arabic-Swahili borrowing: research goals

1. Given a Swahili vocabulary and an Arabic vocabulary, identify plausible donor-loanword candidates

2. Produce a ranked list of candidate donor-loanword pairs

3. Augment Swahili-English MT using Arabic-Swahili borrowing model

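Arabic-English MT (AR → EN): resource-rich, 5.5M sentences
Swahili-English MT (SW → EN): low-resource, 14K sentences, 5K OOV types (7.5%)

[Figure: Swahili OOV words (??? in SW → EN) → BORROWING MODEL → Arabic donor words → Arabic-English MT → English TRANSLATION CANDIDATES]

SW safari → AR یسافر ysAfr → EN travel
SW kituruki → AR تركي trky → EN turkish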
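The OOV handling in the diagram can be sketched as two chained lookups: from a Swahili OOV word to plausible Arabic donors, then from each donor to its English translations. Both dictionaries below are stand-ins for the borrowing model and the Arabic-English system, filled with the two examples from the slide; the function name is hypothetical.

```python
def translate_oov(word, borrowing_model, ar_en_lexicon):
    """Collect English translation candidates for a Swahili OOV word."""
    candidates = []
    for donor in borrowing_model.get(word, []):
        candidates.extend(ar_en_lexicon.get(donor, []))
    return candidates


# Stand-ins built from the slide's examples (Arabic in the slide's romanization).
borrowing_model = {"safari": ["ysAfr"], "kituruki": ["trky"]}
ar_en_lexicon = {"ysAfr": ["travel"], "trky": ["turkish"]}

print(translate_oov("safari", borrowing_model, ar_en_lexicon))    # ['travel']
print(translate_oov("kituruki", borrowing_model, ar_en_lexicon))  # ['turkish']
```

Words with no plausible donor simply yield no candidates and stay untranslated.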

MT experiments

BLEU

Baseline 18.0

+ OOV loanwords 18.5

Summary of contributions

1. First study on lexical borrowing in NLP

2. First study that operationalizes Optimality Theory in a downstream task

3. Swahili-English MT improvement

Future work

1. More languages

2. More MT experiments

3. Core NLP tasks: cross-lingual part-of-speech tagging

Swahili shukuru
Arabic shukran - شكرا
English thank you

Arabic-Swahili borrowing statistics*

*a study on 1,460 core words, Schadeberg ‘09

Loanwords (% within sem. field)

Semantic field     Total   Arabic   English   Other
MODERN WORLD       73.6    15.1     43.7      14.8
RELIGION           55.7    47.5     -         9.2
LAW                54.6    41.1     9.4       4.1
POSSESSION         48.1    41.4     1.9       4.9
SOCIO-POLITICAL    47.5    37.9     -         9.6
EMOTIONS           46.8    39       1.6       6.2
COGNITION          46      40.6     1.5       3.9
CLOTHING           43.4    11.1     18.8      13.5
THE HOUSE          37.5    19.3     6.6       11.7

nouns         19%
adjectives    19%
verbs         15%
adverbs       14%
func. words   15%

http://blog.oxforddictionaries.com/2014/08/which-everyday-english-words-came-from-arabic/

Arabic-Swahili borrowing model

[Figure: full pipeline from ARABIC donor words to SWAHILI loanwords, shown for كتابا (book.sg.indef) → kitabu]

Donor words: كتابا
Donor words to IPA: kuttaba, kitaba, ...

GEN
Donor affix removal: kuttaba, kuttab, kitaba, kitab, ...
Syllabification: ku.tta.ba., ku.t.ta.ba., ..., ki.ta.ba., ki.ta.b., ...
Donor-to-Recipient phonological adaptation:
  ku.ta.ba. [degemination]
  ku.tata.ba. [epenthesis]
  ku.ta.bu. [final vowel subst.]
  ki.ta.bu. [final vowel subst.]
  ki.ta.bu. [epenthesis]
  ...
Recipient morphological adaptation: ku.tata.ba.li.ku.tata.ba.vi.ki.ta.bu. ki.ta.bu.ki.ta.bu. ...

EVAL
Ranking with Optimality Theory constraints: ku.ta<DEP-V>ta<PEAK>.ba.li<DEP-MORPH>.ku.ta<DEP-V>ta<PEAK>.ba.li.ku.tta<*COMPLEX>.ba.ki.ta.bu<IDENT-IO-V>.ki.ta.bu<DEP-V>.vi<DEP-MORPH>.ki.ta.bu<IDENT-IO-V>.

IPA to Recipient words: kitabu
Loanwords: kitabu

Arabic-Swahili morphophonological adaptation

● Syllable structure
  ○ CV, CVV, CVC, CVCC → V, CV
● Morphology
  ○ Arabic affixes deletion (optional)
  ○ Swahili affixes concatenation
● Phonology
  ○ Vowel deletion – shortening of Arabic long vowels and vowel clusters
  ○ Consonant degemination – shortening of Arabic geminate consonants
  ○ Substitution of similar phones – /tˤ/→/t/, /dˤ/→/d/, /θ/→/s/, /x/→/k/, etc.
  ○ Vowel epenthesis – eliminating Arabic codas and consonant clusters
  ○ Final vowel substitution – /u/, /o/, /i/, /e/