NLP Guest Lecture: Morphological Word Embeddings PAMELA SHAPIRO



Page 1:

NLP Guest Lecture: Morphological Word Embeddings PAMELA SHAPIRO

Page 2:

Outline
• Review: Morphology, Word Embeddings
• Research: Morphological Word Embeddings
• Application: Neural Machine Translation

Page 3:

Review: Morphology
• Morphology: studies sub-word-level phenomena
  ◦ e.g. prefixes, suffixes, compounding
• Morpheme: a sub-word component that bears meaning
  ◦ e.g. play, -ing, -s, -er, re-
• Inflectional morphology: doesn't change the core content of a word (including POS); tenses/cases
  ◦ The base that inflectional morphology attaches to is called the lemma
• Derivational morphology: can change meaning and part of speech
• Compounding: combining complete words
  ◦ e.g. toothbrush
• Interacts with phonology, orthography, and syntax

Page 4:

Morphology Across Languages
• Morphological patterns vary across languages
• Analytic: few morphemes per word (e.g. Chinese, Vietnamese)
  ◦ English is moderately analytic
• Fusional: tends to "fuse" multiple features into one affix
  ◦ Includes most Indo-European and Semitic languages
  ◦ e.g. habló, where -ó denotes both past tense and 3rd person singular
• Agglutinative: tends to have more morphemes per word, more clearly demarcated and regular
  ◦ Includes Finnish, Hungarian, Turkish
  ◦ e.g. anlamadım 'I did not understand': verb root anla-, negative marker -m(a), definite past marker -d(i), and 1st person indicator -(ı)m

Page 5:

Non-concatenative Morphology
• Non-concatenative morphology: an additional property of Semitic languages; morphemes are not simply concatenated together but include "templates" into which roots are inserted

Page 6:

Review: Word Embeddings
• Skip-gram objective (word2vec)

• Learn input vectors and output vectors simultaneously, such that a word's input vector predicts the output vectors of its surrounding words

Mikolov et al. (2013)

[Figure: skip-gram architecture. INPUT w(t) → PROJECTION → OUTPUT w(t-2), w(t-1), w(t+1), w(t+2)]
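For concreteness, a minimal sketch of skip-gram training with gensim's word2vec implementation (illustrative only; the toy corpus and hyperparameters are placeholders, not the lecture's setup):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (placeholder data).
corpus = [
    ["the", "cat", "plays", "outside"],
    ["the", "cats", "played", "outside"],
]

# sg=1 selects the skip-gram objective; window=2 matches the
# w(t-2)..w(t+2) context shown in the diagram above.
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=1, negative=5)

print(model.wv.similarity("cat", "cats"))  # cosine similarity of two learned vectors
```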

Page 7:

Problem: Sparsity due to Morphology
• Rare words don't get their vectors updated enough to have meaningful representations
• In morphologically rich languages, maybe only the most common variant of a word gets updated frequently
• Solution: bring morphological variants closer together
  ◦ "Morphological word embeddings" is a very vague term that could mean many things, but it generally addresses this issue
• We try a couple of approaches to this problem in Arabic, training word vectors on Arabic Wikipedia

Page 8:

Approach #1: Bag of Character N-Grams
• From existing literature (fastText)

• No access to morphological information, but may be able to learn it implicitly

Bojanowski et al. (2017)

[Figure: fastText skip-gram. INPUT: character n-gram vectors ..., g_{i-1}(w(t)), g_i(w(t)), g_{i+1}(w(t)), ... summed (+) → PROJECTION → OUTPUT w(t-2), w(t-1), w(t+1), w(t+2)]
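A small sketch of how fastText-style character n-grams can be extracted (following the scheme described in Bojanowski et al. (2017); details such as hashing n-grams into a fixed number of buckets are omitted):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, fastText-style: the word is wrapped in
    boundary markers < and >, and the whole word is kept as one extra unit."""
    w = "<" + word + ">"
    grams = {w}
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

# "play" and "plays" share most n-grams; since a word's vector is the sum of
# its n-gram vectors, the two words share most parameters, which is how
# morphological regularities can be picked up implicitly.
print(sorted(char_ngrams("plays")))
```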

Page 9:

Approach #2: Lemma-Informed
• Lemma-informed (morph)

• Acquire lemmas using a specialized morphological analyzer (MADAMIRA); train word and lemma embeddings

[Figure: lemma-informed skip-gram. INPUT w(t) and l(t) → PROJECTION → OUTPUT w(t-2), w(t-1), w(t+1), w(t+2)]
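One plausible reading of the diagram (a hypothetical sketch; the lecture's exact formulation may differ) is that the word vector and the lemma vector are combined into a single input representation, with gradients flowing to both. A single skip-gram-with-negative-sampling update would then look like:

```python
import numpy as np

def sgns_step(W_word, W_lemma, W_out, w, l, ctx, negs, lr=0.025):
    """One skip-gram negative-sampling update where the input representation
    is the sum of the center word's vector and its lemma's vector.
    W_word, W_lemma, W_out: (vocab_size, dim) embedding matrices;
    w, l: indices of the center word and its lemma (e.g. from MADAMIRA);
    ctx: index of one context word; negs: indices of negative samples."""
    h = W_word[w] + W_lemma[l]                     # combined input vector
    grad_h = np.zeros_like(h)
    for o, label in [(ctx, 1.0)] + [(n, 0.0) for n in negs]:
        p = 1.0 / (1.0 + np.exp(-h @ W_out[o]))    # sigmoid score
        g = lr * (p - label)
        grad_h += g * W_out[o]
        W_out[o] -= g * h
    W_word[w] -= grad_h                            # gradient reaches both inputs,
    W_lemma[l] -= grad_h                           # tying variants via their lemma
```

Because every inflected variant shares the lemma's vector, updates to any variant also move the lemma, pulling the variants' representations together.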

Page 10:

Word Embeddings: Evaluation
• Intrinsic: assess similarity of words either directly or with analogies, comparing to human judgment
• Extrinsic: assess usefulness in a downstream application (e.g. neural machine translation)

Page 11:

Word Similarity
• WordSimilarity-353 test set: word pairs with human judgments of "relatedness" on a scale of 1-10
  ◦ e.g. tiger/cat 7.35, noon/string 0.54 (averaged over several human judgments)
• To evaluate success:
  ◦ Use the model's word vectors to compute cosine similarity (between -1 and 1) for each pair
  ◦ Assess correlation with human judgments, typically using Spearman's rank correlation coefficient
• We use a version of this test set that has been translated into Arabic (Hassan and Mihalcea, 2009)

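A sketch of the intrinsic evaluation just described, assuming `pairs` holds (word1, word2, human_score) triples and `vec` maps words to numpy vectors (hypothetical names, illustrative only):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def word_similarity_eval(pairs, vec):
    """Spearman correlation between human relatedness judgments and
    model cosine similarities, skipping out-of-vocabulary pairs."""
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vec and w2 in vec:
            human.append(score)
            model.append(cosine(vec[w1], vec[w2]))
    rho, _ = spearmanr(human, model)
    return rho
```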

Page 12:

Results: Word Similarity

[Figure: word-similarity results chart (numeric values not recoverable); the slide repeats the lemma-informed skip-gram diagram as an inset]

Page 13:

Results: Word Similarity

Page 14:

Extrinsic Evaluation: NMT
• Use the word embeddings to initialize the source word vectors of an NMT system
• MT bitext: Ar→En corpus of TED talks, considered a low-resource setting
  ◦ ~175k train sentences
  ◦ 2k dev
  ◦ 2k test

Page 15:

Background: Neural Machine Translation

[Figure: encoder. Source sentence "vielen dank !" → word embeddings → bidirectional LSTM/GRU]

Page 16:

Background: Neural Machine Translation

[Figure: full architecture. "vielen dank !" → word embeddings → bidirectional LSTM/GRU encoder; attention over the encoder states feeds a unidirectional LSTM/GRU decoder, which emits the first target word: "thank"]

Page 17:

Background: Neural Machine Translation

[Figure: same architecture, next decoding step. The decoder, fed its previous output "thank", attends again and emits "you"; partial output: "thank you"]

Page 18:

Background: Neural Machine Translation

[Figure: next decoding step. Given "thank you", the decoder emits "very"; partial output: "thank you very"]

Page 19:

Background: Neural Machine Translation

[Figure: next decoding step. Given "thank you very", the decoder emits "much"; partial output: "thank you very much"]

Page 20:

Background: Neural Machine Translation

[Figure: final decoding step. The completed translation of "vielen dank !" is "thank you very much !"]
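The six diagrams above can be condensed into code. Below is a compact sketch of such an encoder-decoder with attention in PyTorch; the dimensions, mean-pooled decoder initialization, dot-product attention, and teacher forcing are assumptions for illustration, not the lecture's exact system:

```python
import torch
import torch.nn as nn

class Seq2SeqAttn(nn.Module):
    """Bidirectional GRU encoder + unidirectional GRU decoder with
    dot-product attention (teacher forcing; decoding loop omitted)."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        # Source embeddings: these are what the pretrained word vectors
        # would initialize in the extrinsic evaluation above.
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, bidirectional=True, batch_first=True)
        self.decoder = nn.GRUCell(dim, 2 * dim)
        self.out = nn.Linear(4 * dim, tgt_vocab)

    def forward(self, src, tgt):
        enc, _ = self.encoder(self.src_emb(src))        # (B, S, 2*dim)
        h = enc.mean(dim=1)                             # crude decoder init
        logits = []
        for t in range(tgt.size(1)):
            h = self.decoder(self.tgt_emb(tgt[:, t]), h)
            scores = torch.bmm(enc, h.unsqueeze(2)).squeeze(2)
            alpha = torch.softmax(scores, dim=1)        # attention weights
            ctx = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1)
            logits.append(self.out(torch.cat([h, ctx], dim=1)))
        return torch.stack(logits, dim=1)               # next-word logits
```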

Page 21:

MT Evaluation: BLEU Score
• Human evaluation is expensive, so we use an automatic metric based on n-gram precision
• Not perfect, but shown to correlate with human judgment
• Higher is better
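For example, computing corpus-level BLEU with the sacrebleu package (toy strings, illustrative only):

```python
import sacrebleu

hyps = ["thank you very much !"]        # system outputs
refs = [["thank you very much !"]]      # one inner list per reference set
bleu = sacrebleu.corpus_bleu(hyps, refs)
print(bleu.score)                       # 0-100 scale; 100.0 for an exact match
```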

Page 22:

Results: MT

Page 23:

Example Sentence

Can discuss BPE at the end if time (tl;dr: it segments words with an automatic heuristic)

Page 24:

Conclusion
• Incorporating awareness of morphology into word embeddings can help in NMT
• Lemma-informed (morph) and bag-of-character-n-grams (fastText) modifications to skip-gram (word2vec):
  ◦ Both help in both intrinsic & extrinsic evaluation
  ◦ However, the lemma is more useful for word similarity
  ◦ Meanwhile, fastText does better on MT

Page 25:

Byte-Pair Encoding
• Compression algorithm (Gage, 1994; Sennrich et al., 2016) aimed at keeping frequent words intact and breaking up rare/unknown words
• Then translates these "subword units" with the same model
• Has become standard practice

Page 26:

Byte-Pair Encoding
• Begins with everything split up into characters, iteratively merges the most frequent pairs of symbols, and applies the learned merge operations to test data (doesn't cross word boundaries)
• e.g. all of the animals are going down the path
• all | of | the | animals | are | going | down | the | path

Page 27:

Byte-Pair Encoding
• all | of | the | animals | are | going | down | the | path
• 1) t h → th

Page 28:

Byte-Pair Encoding
• all | of | the | animals | are | going | down | the | path
• 1) t h → th
• 2) a l → al

Page 29:

Byte-Pair Encoding
• all | of | the | animals | are | going | down | the | path
• 1) t h → th
• 2) a l → al
• 3) th e → the

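The merge sequence on the last few slides can be reproduced with a short sketch of the BPE learning loop (close to the pseudocode in Sennrich et al. (2016); ties between equally frequent pairs are broken by first occurrence here, which happens to match the slides):

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

# Word frequencies from the example sentence, split into characters.
vocab = {"a l l": 1, "o f": 1, "t h e": 2, "a n i m a l s": 1,
         "a r e": 1, "g o i n g": 1, "d o w n": 1, "p a t h": 1}

for step in range(1, 4):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = apply_merge(best, vocab)
    print(f"{step}) {best[0]} {best[1]} -> {''.join(best)}")
# Prints:
# 1) t h -> th
# 2) a l -> al
# 3) th e -> the
```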

Page 30:

Byte-Pair Encoding
• Typical range of BPE merge operations: 15k-100k