Neural Machine Translation

Wei Xu (many slides from Greg Durrett and Emma Strubell)



Page 1: Neural Machine Translation

Wei Xu (many slides from Greg Durrett and Emma Strubell)

Page 2: Recap: Phrase-Based MT

Unlabeled English data

cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…

Language model P(e)

Phrase table P(f|e)

Noisy channel model: P(e|f) ∝ P(f|e) P(e). Combine scores from the translation model + language model to translate foreign to English: "Translate faithfully but make fluent English"
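The noisy channel decision rule can be sketched as a tiny scoring function; `tm_logprob`, `lm_logprob`, and the candidate list below are hypothetical stand-ins for a real phrase table and language model:

```python
def noisy_channel_decode(f, candidates, tm_logprob, lm_logprob):
    """Pick the English e maximizing log P(f|e) + log P(e): the translation
    model scores faithfulness, the language model scores fluency.
    All arguments are hypothetical stand-ins for real components."""
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))
```

In a real phrase-based system the maximization is over an exponentially large space of segmentations and reorderings, which is why beam search (recapped below) is needed.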

Page 3: Recap: HMM for Alignment

Brown et al. (1993)

e: Thank you, I shall do so gladly.
f: Gracias, lo haré de muy buen grado.
a: 0 2 6 5 7 7 7 7 8

‣ Sequential dependence between a's to capture monotonicity

‣ Alignment distribution parameterized by jump size

‣ Word translation table P(f_i | e_{a_i})

§ Want local monotonicity: most jumps are small
§ HMM model (Vogel, 1996)
§ Re-estimate using the forward-backward algorithm (jump sizes: -2 -1 0 1 2 3)

P(f, a | e) = ∏_{i=1}^{n} P(f_i | e_{a_i}) P(a_i | a_{i-1})
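As a concrete toy illustration of this factorization, the joint probability of a translation and its alignment can be scored given a word-translation table and a jump distribution; the table and jump probabilities are made-up inputs, not Brown et al.'s learned parameters:

```python
import math

def hmm_alignment_logprob(f, e, a, trans_table, jump_prob):
    """log P(f, a | e) = sum_i [log P(f_i | e_{a_i}) + log P(a_i | a_{i-1})].

    trans_table maps (f_word, e_word) -> P(f|e); jump_prob maps the jump
    size a_i - a_{i-1} -> probability. Both are toy inputs in this sketch.
    """
    logp = 0.0
    prev = 0  # convention for this sketch: alignment "starts" at position 0
    for f_word, ai in zip(f, a):
        logp += math.log(trans_table[(f_word, e[ai])])  # emission P(f_i | e_{a_i})
        logp += math.log(jump_prob[ai - prev])          # transition P(a_i | a_{i-1})
        prev = ai
    return logp
```

Re-estimation with forward-backward would sum over all alignments a rather than scoring a single one; this sketch only evaluates the joint probability on the slide.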

Page 4: Recap: Beam Search for Decoding

Recall: Decoding

[Figure: beam search lattice over source positions 1-9; partial hypotheses such as "Mary not" (-1.2), "Mary no" (-2.9), "… did not" (idx=2), "… not give" (idx=3), "… not slap" (idx=5, idx=6), with scores accumulated along each path]

‣ Scores from language model P(e) + translation model P(f|e)

Page 5: Recap: Evaluating MT

‣ Fluency: does it sound good in the target language?
‣ Fidelity/adequacy: does it capture the meaning of the original?

‣ BLEU score: geometric mean of 1-, 2-, 3-, and 4-gram precision vs. a reference, multiplied by a brevity penalty

‣ Typically n = 4, w_i = 1/4

‣ r = length of reference, c = length of prediction

‣ Does this capture fluency and adequacy?
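A minimal sketch of sentence-level BLEU along these lines: clipped n-gram precisions, geometric mean with w_n = 1/4, and the brevity penalty min(1, exp(1 - r/c)). No smoothing, single reference:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy single-sentence BLEU: geometric mean of clipped 1..4-gram
    precisions times the brevity penalty (a sketch, no smoothing)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped
        total = sum(cand.values())
        if overlap == 0:
            return 0.0  # any zero n-gram precision zeroes the geometric mean
        log_prec += (1.0 / max_n) * math.log(overlap / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_prec)
```

Real evaluations compute BLEU at the corpus level (pooling n-gram counts over all sentences) and with standardized tokenization, which this sketch omits.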

Page 6: Recap: seq2seq Models

[Figure: encoder reads "the movie was great"; decoder starts from <s> and emits "le", then "film"]

‣ Encoder: consumes a sequence of tokens, produces a vector. Analogous to encoders for classification/tagging tasks

‣ Decoder: separate module, single cell. Takes two inputs: hidden state (vector h or tuple (h, c)) and previous token. Outputs token + new state

Page 7: Recap: seq2seq Models

‣ Generate next word conditioned on previous word as well as hidden state

[Figure: encoder over "the movie was great", decoder starting from <s>]

‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary

‣ Decoder has separate parameters from encoder, so this can learn to be a language model (produce a plausible next word given the current one)

P(y|x) = ∏_{i=1}^{n} P(y_i | x, y_1, …, y_{i-1})

P(y_i | x, y_1, …, y_{i-1}) = softmax(W h̄)

Page 8: Recap: Beam Search for Decoding

‣ Maintain decoder state, token history in beam

[Figure: encoder input "the movie was great". From <s>: la (0.4), le (0.3), les (0.1), scored log(0.4), log(0.3), log(0.1). Expanding: la → film (0.4) gives "la film" with score log(0.4) + log(0.4); le → film (0.8) gives "le film" with score log(0.3) + log(0.8)]

‣ Keep both "film" states! Hidden state vectors are different
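Beam search as recapped here can be sketched generically; `step_fn` is a hypothetical stand-in for the decoder, returning (next token, probability, new state) triples, and each hypothesis carries its own state so two beams ending in the same word are kept apart:

```python
import math

def beam_search(step_fn, start_token, beam_size, max_len, end_token):
    """Generic beam search: hypotheses are (log prob, tokens, decoder state);
    at each step expand every live hypothesis and keep the beam_size best."""
    beams = [(0.0, [start_token], None)]  # (log prob, tokens, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, toks, state in beams:
            for tok, prob, next_state in step_fn(state, toks[-1]):
                candidates.append((logp + math.log(prob), toks + [tok], next_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            # hypotheses that emit the end token leave the beam
            (finished if cand[1][-1] == end_token else beams).append(cand)
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]
```

On the slide's toy probabilities, "le film" (log 0.3 + log 0.8) overtakes "la film" (log 0.4 + log 0.4) even though "la" was the better first word, which is exactly why more than one beam is kept.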

Page 9: Recap: NL-to-SQL Generation

Zhong et al. (2017)

‣ Convert natural language description into a SQL query against some DB

‣ How to ensure that well-formed SQL is generated?

‣ Three components

‣ How to capture column names + constants?

‣ Pointer mechanisms

Page 10: Fairseq

Page 11: Neural MT Details

Page 12: Encoder-Decoder MT

Sutskever et al. (2014)

‣ SOTA = 37.0; not all that competitive…

‣ Sutskever seq2seq paper: first major application of LSTMs to NLP

‣ Basic encoder-decoder with beam search

Page 13: Encoder-Decoder MT

‣ Better model from seq2seq lectures: encoder-decoder with attention and copying for rare words

[Figure: encoder states h1 … h4 over "the movie was great"; decoder state h̄1 and context vector c1 give a distribution over vocab + copying, emitting "le"]

e_ij = f(h̄_i, h_j)

α_ij = exp(e_ij) / Σ_{j′} exp(e_{ij′})

c_i = Σ_j α_ij h_j

P(y_i | x, y_1, …, y_{i-1}) = softmax(W [c_i; h̄_i])
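A toy version of these attention equations using a plain dot product for the scoring function f (no learned parameters) and Python lists instead of a tensor library:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_context(dec_state, enc_states):
    """Attention sketch with f(h_bar_i, h_j) = h_bar_i . h_j:
    scores e_ij -> softmax weights alpha_ij -> context c_i
    as the weighted sum of encoder states."""
    scores = [dot(dec_state, h) for h in enc_states]   # e_ij
    weights = softmax(scores)                          # alpha_ij
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states))
               for d in range(dim)]                    # c_i
    return weights, context
```

The context vector is then concatenated with the decoder state and fed through the output layer, as in the last equation above.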

Page 14: Results: WMT English-French

‣ 12M sentence pairs

Classic phrase-based system: ~33 BLEU, uses additional target-language data
Rerank with LSTMs: 36.5 BLEU (long line of work here; Devlin+ 2014)
Sutskever+ (2014) seq2seq single: 30.6 BLEU
Sutskever+ (2014) seq2seq ensemble: 34.8 BLEU
Luong+ (2015) seq2seq ensemble with attention and rare word handling: 37.5 BLEU

‣ But English-French is a really easy language pair and there's tons of data for it! Does this approach work for anything harder?

Page 15: Results: WMT English-German

‣ 4.5M sentence pairs

Classic phrase-based system: 20.7 BLEU
Luong+ (2014) seq2seq: 14 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU

‣ Not nearly as good in absolute BLEU, but not really comparable across languages

‣ French, Spanish = easiest; German, Czech = harder; Japanese, Russian = hard (grammatically different, lots of morphology…)

Page 16: MT Examples

Luong et al. (2015)

‣ NMT systems can hallucinate words, especially when not using attention; phrase-based doesn't do this

‣ best = with attention, base = no attention

Page 17: MT Examples

Luong et al. (2015)

‣ best = with attention, base = no attention

Page 18: MT Examples

Zhang et al. (2017)

‣ NMT can repeat itself if it gets confused ("pH or pH")

‣ Phrase-based MT often gets chunks right, may have more subtle ungrammaticalities

Page 19: Handling Rare Words

Wu et al. (2016)

‣ Words are a difficult unit to work with: copying can be cumbersome, word vocabularies get very large

‣ Character-level models don't work well

Input: _the_ecotax_portico_in_Pont-de-Buis…
Output: _le_portique_écotaxe_de_Pont-de-Buis

‣ Solution: "word pieces" (which may be full words but may be subwords)

‣ Can help with transliteration; captures shared linguistic characteristics between languages (e.g., transliteration, shared word roots, etc.)

Page 20: Byte Pair Encoding (BPE)

Sennrich et al. (2016)

‣ Start with every individual byte (basically character) as its own symbol

‣ Count bigram character cooccurrences

‣ Merge the most frequent pair of adjacent characters

‣ Do this either over your vocabulary (original version) or over a large corpus (more common version)

‣ Final vocabulary size is often in the 10k-30k range for each language

‣ Most SOTA NMT systems use this on both source + target
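A minimal BPE sketch in the spirit of Sennrich et al. (2016): start from characters, count adjacent symbol pairs over the (frequency-weighted) vocabulary, and repeatedly merge the most frequent pair:

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Learn BPE merges: vocab maps a word (tuple of symbols) to its
    frequency; each round merges the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # re-segment every word with the new merge applied
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab
```

At test time, the learned merge list is replayed in order on each new word, so frequent words end up as single symbols while rare words decompose into subword pieces.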

Page 21: Word Pieces

Schuster and Nakajima (2012), Wu et al. (2016), Kudo and Richardson (2018)

‣ SentencePiece library from Google: unigram LM

while voc size < target voc size:
    build a language model over your corpus
    merge pieces that lead to the highest improvement in language model perplexity

‣ Issues: what LM to use? How to make this tractable?

‣ Result: way of segmenting input appropriate for translation

Page 22: Google's NMT System

Wu et al. (2016)

‣ 8-layer LSTM encoder-decoder with attention, word piece vocabulary of 8k-32k

Page 23: Google's NMT System

Wu et al. (2016)

English-French:
Google's phrase-based system: 37.0 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 37.5 BLEU
Google's 32k word pieces: 38.95 BLEU

English-German:
Google's phrase-based system: 20.7 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
Google's 32k word pieces: 24.2 BLEU

Page 24: Human Evaluation (En-Es)

Wu et al. (2016)

‣ Similar to human-level performance on English-Spanish

Page 25: Google's NMT System

Wu et al. (2016)

‣ Gender is correct in GNMT but not in PBMT ("sled", "walker" examples)

‣ The right-most column shows the human ratings on a scale of 0 (complete nonsense) to 6 (perfect translation)

Page 26: Backtranslation

Sennrich et al. (2015)

‣ Statistical MT methods (e.g., phrase-based MT) used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T′ to train a language model. Can neural MT do the same?

‣ Approach 1: force the system to generate T′ as targets from null inputs:
  (s1, t1), (s2, t2), …, ([null], t′1), ([null], t′2), …

‣ Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation):
  (s1, t1), (s2, t2), …, (MT(t′1), t′1), (MT(t′2), t′2), …
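Approach 2 can be sketched in a few lines; `reverse_mt` is a hypothetical T->S translation function standing in for a trained backward model:

```python
def backtranslate_augment(bitext, mono_target, reverse_mt):
    """Backtranslation sketch: translate monolingual target sentences T'
    back into synthetic sources with a target->source model, then mix the
    synthetic pairs with the real bitext. `reverse_mt` is a hypothetical
    T->S translation function."""
    synthetic = [(reverse_mt(t), t) for t in mono_target]
    return list(bitext) + synthetic
```

The key property is that the target side of every synthetic pair is real, fluent text, so the noise from the backward model mostly hurts the source side, which the forward model only conditions on.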

Page 27: Backtranslation

Sennrich et al. (2015)

‣ parallel synth: backtranslate training data; makes additional noisy source sentences which could be useful

‣ Gigaword: large monolingual English corpus

Page 28: Transformers for MT

Page 29: Recall: Self-Attention

Vaswani et al. (2017)

[Figure: self-attention over "the movie was great"]

‣ Each word forms a "query" which then computes attention over each word

‣ Multiple "heads" analogous to different convolutional filters. Use parameters W_k and V_k to get different attention values + transform vectors

[Figure: x_4 → x′_4; attention weights are scalars; each output vector = sum of scalar * vector]

α_{i,j} = softmax(x_i^T x_j)        x′_i = Σ_{j=1}^{n} α_{i,j} x_j

α_{k,i,j} = softmax(x_i^T W_k x_j)    x′_{k,i} = Σ_{j=1}^{n} α_{k,i,j} V_k x_j
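A single head of the parameterized form above, sketched with plain Python lists (W is the bilinear scoring matrix, V the value transform); a multi-head layer would run several such heads with different W_k, V_k and combine their outputs:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def self_attention_head(X, W, V):
    """alpha_{i,j} = softmax_j(x_i^T W x_j); x'_i = sum_j alpha_{i,j} V x_j.
    X is a list of token vectors; W scores query-key pairs, V transforms values."""
    WX = [matvec(W, x) for x in X]  # precompute W x_j for every token
    VX = [matvec(V, x) for x in X]  # precompute V x_j for every token
    out = []
    for xi in X:
        weights = softmax([dot(xi, wx) for wx in WX])
        dim = len(VX[0])
        out.append([sum(w * vx[d] for w, vx in zip(weights, VX))
                    for d in range(dim)])
    return out
```

Real implementations factor W into separate query and key projections (and scale scores by 1/sqrt(d)); this sketch keeps the slide's bilinear form literally.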

Page 30: Neural Machine Transla/on · translation model + language model to translate foreign to English “Translate faithfully but make fluent English” } Recap: HMM for Alignment Brown

committee awards Strickland advanced opticswho

Layer p

QKV

Nobel

Self-aten/on

Pages 31-37: Self-Attention (continued)

[Animation: the query word attends over all the words in "Nobel committee awards Strickland who advanced optics"; the attention weights A are combined with the value vectors to produce the output M]

Page 38: Neural Machine Transla/on · translation model + language model to translate foreign to English “Translate faithfully but make fluent English” } Recap: HMM for Alignment Brown

Mul/-headself-aten/on

Layer p

QKV

MM1

MH

committee awards Strickland advanced opticswhoNobel

optics advanced

who Strickland

awards committee

NobelA

Page 39: Neural Machine Transla/on · translation model + language model to translate foreign to English “Translate faithfully but make fluent English” } Recap: HMM for Alignment Brown

Layer p

QKV

MH

M1

committee awards Strickland advanced opticswhoNobel

optics advanced

who Strickland

awards committee

NobelA

Mul/-headself-aten/on

Page 40: Neural Machine Transla/on · translation model + language model to translate foreign to English “Translate faithfully but make fluent English” } Recap: HMM for Alignment Brown

Layer p

QKV

MH

M1

Layer p+1

Feed Forward

Feed Forward

Feed Forward

Feed Forward

Feed Forward

Feed Forward

Feed Forward

committee awards Strickland advanced opticswhoNobel

optics advanced

who Strickland

awards committee

NobelA

Mul/-headself-aten/on

Page 41: Neural Machine Transla/on · translation model + language model to translate foreign to English “Translate faithfully but make fluent English” } Recap: HMM for Alignment Brown

committee awards Strickland advanced opticswhoNobelp+1

Mul/-headself-aten/on

Page 42: Neural Machine Transla/on · translation model + language model to translate foreign to English “Translate faithfully but make fluent English” } Recap: HMM for Alignment Brown

Multi-head self-attention + feed forward

Multi-head self-attention + feed forward

42

Layer 1

Layer p

Multi-head self-attention + feed forwardLayer J

committee awards Strickland advanced opticswhoNobel

Mul/-headself-aten/on

Page 43: Transformers

Vaswani et al. (2017)

‣ Encoder and decoder are both transformers

‣ Decoder consumes the previously generated token (and attends to input), but has no recurrent state

‣ Many other details to get it to work: residual connections, layer normalization, positional encoding, optimizer with learning rate schedule, label smoothing…

Page 44: Transformers

Vaswani et al. (2017)

‣ Augment word embedding with position embeddings; each dimension is a sine/cosine wave of a different frequency. Closer points = higher dot products

‣ Works essentially as well as just encoding position as a one-hot vector

[Figure: "the movie was great" with position embeddings emb(1), emb(2), emb(3), emb(4)]
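One common way to realize the sine/cosine scheme (a sketch; the exact dimension-to-frequency layout varies slightly by implementation):

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position embedding: dimension pair 2i uses frequency
    1 / 10000^(2i / d_model), with sin in the even slot and cos in the
    odd slot. Nearby positions end up with higher dot products."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000.0 ** (i / d_model))
        pe.append(math.sin(position * freq))
        pe.append(math.cos(position * freq))
    return pe[:d_model]  # trim the last slot if d_model is odd
```

The resulting vector is simply added to the word embedding at that position before the first self-attention layer.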

Page 45: Transformers: Residual Network

Vaswani et al. (2017)

[Figure: residual connection: input x from the previous layer goes through a non-linearity f; the output to the next layer is f(x) + x]

Page 46: Transformers

Vaswani et al. (2017)

‣ Adam optimizer with varied learning rate over the course of training

‣ Linearly increase for warmup, then decay proportionally to the inverse square root of the step number

‣ This part is very important!
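The schedule described here is usually written as lrate = d_model^-0.5 · min(step^-0.5, step · warmup^-1.5) (Vaswani et al. 2017); the default values below are illustrative:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Warmup-then-inverse-sqrt learning rate schedule:
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Linear increase for `warmup` steps, then 1/sqrt(step) decay;
    the two branches meet exactly at step == warmup."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

Before the warmup point the second term is smaller (linear ramp); after it the first term takes over (inverse-square-root decay), so the peak learning rate occurs at step = warmup.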

Page 47: Label Smoothing

‣ Instead of using a one-hot target distribution, create a distribution that has "confidence" on the correct word and the rest of the "smoothing" mass distributed throughout the vocabulary.

‣ Implemented by minimizing the KL-divergence between the smoothed ground truth probabilities and the probabilities computed by the model.

I went to class and took _____

           cats    TV      notes   took    sofa
one-hot:   0       0       1       0       0
smoothed:  0.025   0.025   0.9     0.025   0.025
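The smoothed target distribution from the example can be constructed directly: put the confidence on the gold word and spread the remaining mass uniformly over the rest of the vocabulary:

```python
def smooth_labels(vocab_size, target_index, confidence=0.9):
    """Label smoothing sketch: `confidence` on the gold word, with the
    leftover (1 - confidence) mass split evenly over the other entries."""
    smooth = (1.0 - confidence) / (vocab_size - 1)
    return [confidence if i == target_index else smooth
            for i in range(vocab_size)]
```

With vocab_size = 5, target "notes" at index 2, and confidence 0.9, this reproduces the slide's [0.025, 0.025, 0.9, 0.025, 0.025]; training then minimizes the KL-divergence between this distribution and the model's softmax output.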

Page 48: Transformers

Vaswani et al. (2017)

‣ Big = 6 layers, 1000-dim per token, 16 heads; base = 6 layers + other params halved

Page 50: Takeaways

‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers

‣ Word piece / byte pair models are really effective and easy to use

‣ Transformer is a very strong model (when data is large enough); training can be tricky

‣ State-of-the-art systems are getting pretty good, but lots of challenges remain, especially for low-resource settings

Page 51: Text Simplification

Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu. "Neural CRF Model for Sentence Alignment in Text Simplification". In ACL (2020)

Page 52: Text Simplification

94k sent. pairs / 394k sent. pairs

Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu. "Neural CRF Model for Sentence Alignment in Text Simplification". In ACL (2020)