Lecture 13: Machine Translation II
Alan Ritter (many slides from Greg Durrett)
Syntactic MT
Levels of Transfer: the Vauquois Triangle
‣ Is syntax a "better" abstraction than phrases?
Slide credit: Dan Klein
Syntactic MT
‣ Rather than use phrases, use a synchronous context-free grammar: constructs "parallel" trees in two languages simultaneously
NP → [DT1 JJ2 NN3; DT1 NN3 JJ2]
DT → [the, la]    DT → [the, le]
NN → [car, voiture]
JJ → [yellow, jaune]
[figure: parallel NP trees over "the yellow car" (DT1 JJ2 NN3) and "la voiture jaune" (DT1 NN3 JJ2)]
‣ Assumes parallel syntax up to reordering
‣ Translation = parse the input with "half" the grammar, read off the other half
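The parse-then-read-off idea can be sketched concretely. This is a toy illustration of the slide's single NP rule, not a real SCFG decoder; the function name, rule encoding, and the choice of "la" over "le" are illustrative assumptions.

```python
# Toy synchronous CFG translation: "parse" the source with the source
# half of the rule, then read the target half off the same derivation.
# Hardcodes DT -> [the, la] (ignoring the ambiguous DT -> [the, le]).

lexicon = {('DT', 'the'): 'la', ('JJ', 'yellow'): 'jaune', ('NN', 'car'): 'voiture'}

def translate_np(words):
    """NP -> [DT1 JJ2 NN3; DT1 NN3 JJ2]: the source parse is DT JJ NN;
    the target side reorders the same subtrees as DT NN JJ."""
    dt, jj, nn = words                          # parse with the source half
    subtree = {'DT': lexicon[('DT', dt)],
               'JJ': lexicon[('JJ', jj)],
               'NN': lexicon[('NN', nn)]}
    return [subtree['DT'], subtree['NN'], subtree['JJ']]  # read off target half

out = translate_np(['the', 'yellow', 'car'])    # -> ['la', 'voiture', 'jaune']
```

A real system would search over many ambiguous derivations (e.g. both DT rules) and score them with a model.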
Syntactic MT
‣ Relax this by using lexicalized rules, like "syntactic phrases"
‣ Leads to HUGE grammars; parsing is slow
Slide credit: Dan Klein
Neural MT Details
Encoder-Decoder MT
‣ Sutskever seq2seq paper: first major application of LSTMs to NLP
‣ Basic encoder-decoder with beam search
‣ SOTA = 37.0, so not all that competitive…
Sutskever et al. (2014)
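The "beam search" mentioned above can be sketched as follows. `step_fn` is a hypothetical stand-in for the decoder: given a token prefix, it returns next-token log-probabilities; a real seq2seq decoder would also condition on the encoder state.

```python
import math

def beam_search(step_fn, beam_size, max_len, eos='</s>'):
    """Keep the beam_size highest-scoring partial translations at each step."""
    beams = [(['<s>'], 0.0)]                # (token prefix, log probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:           # finished hypotheses carry over
                candidates.append((prefix, score))
                continue
            for tok, lp in step_fn(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == eos for p, _ in beams):
            break
    return beams[0]

# toy "model": a lookup table conditioned only on the last token
table = {'<s>': {'la': math.log(0.7), 'le': math.log(0.3)},
         'la': {'voiture': math.log(0.9), '</s>': math.log(0.1)},
         'voiture': {'</s>': math.log(1.0)},
         'le': {'</s>': math.log(1.0)}}
best, score = beam_search(lambda prefix: table[prefix[-1]], beam_size=2, max_len=5)
# best == ['<s>', 'la', 'voiture', '</s>']
```

With beam size 1 this reduces to greedy decoding; larger beams trade compute for better approximate search over output sequences.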
Encoder-Decoder MT
‣ Better model from seq2seq lectures: encoder-decoder with attention and copying for rare words
[figure: encoder states h1…h4 over "the movie was great"; decoder starts from <s> with state h̄1 and attention context c1, producing a distribution over vocab + copying and emitting "le" …]
Results: WMT English-French
‣ 12M sentence pairs
Classic phrase-based system: ~33 BLEU, uses additional target-language data
Rerank with LSTMs: 36.5 BLEU (long line of work here; Devlin+ 2014)
Sutskever+ (2014) seq2seq single: 30.6 BLEU
Sutskever+ (2014) seq2seq ensemble: 34.8 BLEU
Luong+ (2015) seq2seq ensemble with attention and rare word handling: 37.5 BLEU
‣ But English-French is a really easy language pair and there's tons of data for it! Does this approach work for anything harder?
Results: WMT English-German
‣ 4.5M sentence pairs
Classic phrase-based system: 20.7 BLEU
Luong+ (2014) seq2seq: 14 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
‣ Not nearly as good in absolute BLEU, but not really comparable across languages
‣ French, Spanish = easiest; German, Czech = harder; Japanese, Russian = hard (grammatically different, lots of morphology…)
MT Examples
‣ best = with attention, base = no attention
‣ NMT systems can hallucinate words, especially when not using attention; phrase-based systems don't do this
Luong et al. (2015)
MT Examples
‣ NMT can repeat itself if it gets confused (pH or pH)
‣ Phrase-based MT often gets chunks right, but may have more subtle ungrammaticalities
Zhang et al. (2017)
Rare Words: Word Piece Models
‣ Use Huffman encoding on a corpus, keep most common k (~10,000) character sequences for source and target
‣ Captures common words and parts of rare words
Input: _the _ecotax _portico _in _Pont-de-Buis…
Output: _le _portique _écotaxe _de _Pont-de-Buis
‣ Subword structure may make it easier to translate
‣ Model balances translating and transliterating without explicit switching
Wu et al. (2016)
Rare Words: Byte Pair Encoding
‣ Simpler procedure, based only on the dictionary
‣ Input: a dictionary of words represented as characters
‣ Count bigram character cooccurrences
‣ Merge the most frequent pair of adjacent characters
‣ Final size = initial vocab + num merges. Often do 10k-30k merges
‣ Most SOTA NMT systems use this on both source + target
Sennrich et al. (2016)
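The count-and-merge loop above is short enough to write out. This follows the procedure from Sennrich et al. (2016) on a toy dictionary; the word frequencies are made up for illustration.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the pair with the fused symbol
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# toy dictionary: words as space-separated characters, with frequencies
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
merges, vocab = learn_bpe(vocab, 4)
# merges: ('e','s'), ('es','t'), ('l','o'), ('lo','w')
```

At translation time, words are segmented by replaying the learned merges in order, so the final vocabulary is exactly the initial characters plus one symbol per merge.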
Google's NMT System
‣ 8-layer LSTM encoder-decoder with attention, wordpiece vocabulary of 8k-32k
English-French:
Google's phrase-based system: 37.0 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 37.5 BLEU
Google's 32k wordpieces: 38.95 BLEU
English-German:
Google's phrase-based system: 20.7 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
Google's 32k wordpieces: 24.2 BLEU
Wu et al. (2016)
Human Evaluation (En-Es)
‣ Similar to human-level performance on English-Spanish
Wu et al. (2016)
Google's NMT System
‣ Gender is correct in GNMT but not in PBMT
[figure: example translations, including "sled" and "walker"]
Wu et al. (2016)
Backtransla3on‣ ClassicalMTmethodsusedabilingualcorpusofsentencesB=(S,T)andalargemonolingualcorpusT’totrainalanguagemodel.CanneuralMTdothesame?
Sennrichetal.(2015)
Backtransla3on‣ ClassicalMTmethodsusedabilingualcorpusofsentencesB=(S,T)andalargemonolingualcorpusT’totrainalanguagemodel.CanneuralMTdothesame?
Sennrichetal.(2015)
‣ Approach1:forcethesystemtogenerateT’astargetsfromnullinputs
Backtransla3on‣ ClassicalMTmethodsusedabilingualcorpusofsentencesB=(S,T)andalargemonolingualcorpusT’totrainalanguagemodel.CanneuralMTdothesame?
Sennrichetal.(2015)
s1,t1
[null],t’1[null],t’2
s2,t2…
…
‣ Approach1:forcethesystemtogenerateT’astargetsfromnullinputs
Backtransla3on‣ ClassicalMTmethodsusedabilingualcorpusofsentencesB=(S,T)andalargemonolingualcorpusT’totrainalanguagemodel.CanneuralMTdothesame?
Sennrichetal.(2015)
s1,t1
[null],t’1[null],t’2
s2,t2…
…
‣ Approach1:forcethesystemtogenerateT’astargetsfromnullinputs
‣ Approach2:generatesynthe3csourceswithaT->Smachinetransla3onsystem(backtransla3on)
Backtransla3on‣ ClassicalMTmethodsusedabilingualcorpusofsentencesB=(S,T)andalargemonolingualcorpusT’totrainalanguagemodel.CanneuralMTdothesame?
Sennrichetal.(2015)
s1,t1
[null],t’1[null],t’2
s2,t2…
…
‣ Approach1:forcethesystemtogenerateT’astargetsfromnullinputs
‣ Approach2:generatesynthe3csourceswithaT->Smachinetransla3onsystem(backtransla3on)
s1,t1
MT(t’1),t’1
s2,t2…
…MT(t’2),t’2
Backtransla3on
Sennrichetal.(2015)
‣ parallelsynth:backtranslatetrainingdata;makesaddi3onalnoisysourcesentenceswhichcouldbeuseful
‣ Gigaword:largemonolingualEnglishcorpus
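Approach 2 is just a data-augmentation loop. In this sketch, `reverse_model` stands in for a hypothetical trained T->S system; here it is faked with a word lookup table purely so the example runs.

```python
def backtranslate(parallel_data, monolingual_targets, reverse_model):
    """Augment real (source, target) pairs with synthetic pairs whose
    sources are machine-translated from monolingual target-side text."""
    synthetic = [(reverse_model(t), t) for t in monolingual_targets]
    return parallel_data + synthetic

# toy stand-in for a French->English reverse model (NOT a real MT system)
fr_en = {'la': 'the', 'voiture': 'car', 'jaune': 'yellow'}
toy_reverse = lambda sent: ' '.join(fr_en.get(w, w) for w in sent.split())

bitext = [('the yellow car', 'la voiture jaune')]
mono = ['la voiture']
augmented = backtranslate(bitext, mono, toy_reverse)
# augmented now holds one real pair plus one synthetic pair
```

The key property: the target side of every synthetic pair is real, fluent text, so the decoder (which acts as a language model over targets) benefits even when the backtranslated sources are noisy.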
Dilated CNNs for MT
DilatedConvolu3ons‣ Standardconvolu3on:looksateverytokenunderthefilter‣ Dilatedconvolu3onwithgapd:looksateverydthtoken
Strubelletal.(2017)
DilatedConvolu3ons‣ Standardconvolu3on:looksateverytokenunderthefilter‣ Dilatedconvolu3onwithgapd:looksateverydthtoken
Strubelletal.(2017)
DilatedConvolu3ons‣ Standardconvolu3on:looksateverytokenunderthefilter‣ Dilatedconvolu3onwithgapd:looksateverydthtoken
w=2,d=2:gapinthefilter
Strubelletal.(2017)
DilatedConvolu3ons‣ Standardconvolu3on:looksateverytokenunderthefilter‣ Dilatedconvolu3onwithgapd:looksateverydthtoken
w=2,d=2:gapinthefilter
‣ Canchainsuccessivedilatedconvolu3onstogethertogetawiderecep3vefield(seealotofthesentence)
Strubelletal.(2017)
DilatedConvolu3ons‣ Standardconvolu3on:looksateverytokenunderthefilter‣ Dilatedconvolu3onwithgapd:looksateverydthtoken
w=2,d=2:gapinthefilter
‣ Canchainsuccessivedilatedconvolu3onstogethertogetawiderecep3vefield(seealotofthesentence)
Strubelletal.(2017)
w=3,d=1
w=3,d=2
w=3,d=4
DilatedConvolu3ons‣ Standardconvolu3on:looksateverytokenunderthefilter‣ Dilatedconvolu3onwithgapd:looksateverydthtoken
w=2,d=2:gapinthefilter
‣ Canchainsuccessivedilatedconvolu3onstogethertogetawiderecep3vefield(seealotofthesentence)
Strubelletal.(2017)
w=3,d=1
w=3,d=2
w=3,d=4
‣ Topnodesseelotsofthesentence,butwithdifferentprocessing
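The receptive-field growth from stacking w=3 filters with dilations 1, 2, 4 can be checked numerically. This is a bare-bones sketch (no padding, a single sum filter just to track coverage), not any particular paper's architecture.

```python
import numpy as np

def dilated_conv1d(x, w, d):
    """1D dilated convolution, no padding: output[i] combines
    x[i], x[i+d], ..., x[i + (len(w)-1)*d]."""
    k = len(w)
    span = (k - 1) * d  # distance between first and last tap
    return np.array([sum(w[j] * x[i + j * d] for j in range(k))
                     for i in range(len(x) - span)])

x = np.arange(20, dtype=float)
w = np.array([1.0, 1.0, 1.0])   # w=3 filter

# chaining dilations 1, 2, 4 widens the receptive field: 3 -> 7 -> 15
h1 = dilated_conv1d(x, w, 1)    # each h1[i] sees 3 tokens
h2 = dilated_conv1d(h1, w, 2)   # each h2[i] sees 7 tokens
h3 = dilated_conv1d(h2, w, 4)   # each h3[i] sees 15 tokens
```

Each layer adds (k-1)*d to the receptive field, so a few layers suffice to see most of a sentence, while a stack of undilated w=3 convolutions would grow only by 2 tokens per layer.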
CNNs for Machine Translation
‣ "ByteNet": operates over characters (bytes)
‣ Encode source sequence w/ dilated convolutions
‣ Predict nth target character by looking at the nth position in the source and a dilated convolution over the n-1 target tokens so far
‣ To deal with divergent lengths, the nth target character actually looks at source position nα, where α is a heuristically-chosen parameter
‣ Assumes mostly monotonic translation
Kalchbrenner et al. (2016)
Compare: CNNs vs. LSTMs
‣ LSTM: looks at previous word + hidden state, attention over input
[figure: LSTM decoder from <s> with state h̄1 and attention context c1]
‣ CNN: source encoding at this position gives us "attention", target encoding gives us decoder context
Kalchbrenner et al. (2016)
Attention from CNN
‣ Model is character-level; this visualization shows which words' characters impact the convolutional encoding the most
‣ Largely monotonic but does consult other information
Kalchbrenner et al. (2016)
Advantages of CNNs
‣ LSTM with attention is quadratic: compute attention over the whole input for each decoded token
‣ CNN is linear!
‣ CNN is shallower too in principle, but the conv layers are very sophisticated (3 layers each)
Kalchbrenner et al. (2016)
English-German MT Results
Kalchbrenner et al. (2016)
Transformers for MT
Self-Attention
‣ Each word forms a "query" which then computes attention over each word:
\alpha_{i,j} = \mathrm{softmax}_j(x_i^\top x_j)  (scalar)
x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j  (vector = sum of scalar * vector)
‣ Multiple "heads", analogous to different convolutional filters. Use parameters W_k and V_k to get different attention values + transform vectors:
\alpha_{k,i,j} = \mathrm{softmax}_j(x_i^\top W_k x_j)
x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j
[figure: attention from x_4 ("great") over "the movie was great", producing x'_4]
Vaswani et al. (2017)
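The equations above are a few lines of NumPy. This is a sketch of the slide's simplified formulation (dot-product attention without the 1/sqrt(d) scaling or output projection of the full Transformer); the dimensions and random matrices are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Single head: alpha[i,j] = softmax_j(x_i . x_j),
    x'_i = sum_j alpha[i,j] x_j."""
    alpha = softmax(X @ X.T)          # (n, n); each row is a distribution
    return alpha @ X                  # each x'_i is a weighted sum of the x_j

def multihead_self_attention(X, Ws, Vs):
    """One output per head k: alpha_k[i,j] = softmax_j(x_i^T W_k x_j),
    x'_{k,i} = sum_j alpha_k[i,j] V_k x_j."""
    outputs = []
    for W, V in zip(Ws, Vs):
        alpha = softmax(X @ W @ X.T)
        outputs.append(alpha @ X @ V.T)   # (alpha @ X) @ V.T == sum_j alpha V x_j
    return outputs

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))       # "the movie was great": 4 tokens, dim 8
out = self_attention(X)               # (4, 8)
Ws = [rng.standard_normal((8, 8)) for _ in range(2)]
Vs = [rng.standard_normal((8, 8)) for _ in range(2)]
heads = multihead_self_attention(X, Ws, Vs)   # two (4, 8) head outputs
```

Because every position attends to every other in one matrix multiply, all positions are processed in parallel, unlike the sequential LSTM decoder.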
Transformers
‣ Positional encoding: augment word embedding with position embeddings; each dim is a sine wave of a different frequency. Closer points = higher dot products
[figure: positional encodings added to the embeddings of "the movie was great"]
Vaswani et al. (2017)
Transformers
‣ Encoder and decoder are both transformers
‣ Decoder consumes the previous generated token (and attends to input), but has no recurrent state
Vaswani et al. (2017)
Transformers
‣ big = 6 layers, 1000 dim for each token, 16 heads; base = 6 layers with other params halved
Vaswani et al. (2017)
Visualization
Vaswani et al. (2017)
Takeaways
‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers
‣ Word piece / byte pair models are really effective and easy to use
‣ State of the art systems are getting pretty good, but lots of challenges remain, especially for low-resource settings