CS839: Probabilistic Graphical Models
Lecture 22: The Attention Mechanism
Theo Rekatsinas
Why Attention?

• Consider machine translation: we need to pay attention to the word we are currently translating. Is the entire sequence needed as context?
• Example: "The cat is black" -> "Le chat est noir"
• RNNs are the de-facto standard for machine translation.
• Problem: translation relies on reading the complete sentence and compressing all of its information into a fixed-length vector; a sentence with hundreds of words squeezed into one such vector will surely lead to information loss, inadequate translation, etc.
• Long-range dependencies are tricky.
Basic encoder-decoder
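The figure for this slide is not recoverable from the text, but the idea is the standard sequence-to-sequence bottleneck. A minimal PyTorch sketch (module and layer names are my own, not from the lecture) that makes the fixed-length context vector explicit:

```python
import torch
import torch.nn as nn

class BasicEncoderDecoder(nn.Module):
    """Minimal seq2seq model: the encoder compresses the whole source
    sentence into one fixed-length vector, which is all the decoder sees."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode: keep only the final hidden state (the fixed-length bottleneck).
        _, h = self.encoder(self.src_emb(src_ids))
        # Decode: every target position is conditioned on the same single vector h.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)  # logits over the target vocabulary
```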
Soft Attention for Translation

• Example: "I love coffee" -> "Me gusta el café"
• At each decoding step, the model computes a distribution over the input words and forms the context as a weighted combination of them.

Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
Soft Attention

Figure (from Y. Bengio's CVPR 2015 tutorial): a bidirectional encoder RNN, a decoder RNN, and the attention model connecting them.
Context vector (input to the decoder):
$c_i = \sum_j \alpha_{ij} h_j$

Mixture weights:
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$

Alignment score (how well do input words near position j match output words at position i):
$e_{ij} = f_{\text{att}}(s_{i-1}, h_j)$
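A minimal PyTorch sketch of these three equations, assuming the additive (tanh + linear) scoring function of Bahdanau et al.; the class and tensor names are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style soft attention: score each encoder state against the
    current decoder state, softmax the scores, and mix the encoder states."""
    def __init__(self, dec_dim, enc_dim, att_dim=128):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, att_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, dec_dim), the previous decoder state s_{i-1}
        # enc_states: (batch, src_len, enc_dim), the encoder states h_j
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_states)))
        alpha = F.softmax(e.squeeze(-1), dim=-1)           # mixture weights alpha_ij
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # c_i = sum_j alpha_ij h_j
        return context, alpha
```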
Luong, Pham, and Manning's translation system (2015); see Luong and Manning, IWSLT 2015. (Figure: translation error rate vs. human.)
Hard Attention
Monotonic Attention
Global Attention
• Blue = encoder, red = decoder (in the figure).
• The decoder attends to a context vector, so it captures global information rather than only the information from one hidden state.
• The context vector takes all of the encoder cells' outputs as input and computes a probability distribution over them for each token the decoder wants to generate.
Local Attention
• Compute a best aligned position first.
• Then compute a context vector centered at that position.
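A minimal sketch of local attention in the spirit of Luong et al.'s predictive alignment, assuming a general dot-product score and a Gaussian window around the predicted position; the names and window size are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttention(nn.Module):
    """Local attention: predict an aligned source position p_t, then attend
    mostly to a window of encoder states centered at p_t."""
    def __init__(self, dim, window=5):
        super().__init__()
        self.window = window
        self.W_p = nn.Linear(dim, dim, bias=False)
        self.v_p = nn.Linear(dim, 1, bias=False)
        self.W_a = nn.Linear(dim, dim, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dim); enc_states: (batch, src_len, dim)
        src_len = enc_states.size(1)
        # 1) Predict the best aligned position p_t in [0, src_len).
        p_t = src_len * torch.sigmoid(self.v_p(torch.tanh(self.W_p(dec_state))))
        # 2) Score all positions, then focus near p_t with a Gaussian window.
        scores = torch.bmm(enc_states, self.W_a(dec_state).unsqueeze(-1)).squeeze(-1)
        positions = torch.arange(src_len, device=enc_states.device).float().unsqueeze(0)
        gauss = torch.exp(-((positions - p_t) ** 2) / (2 * (self.window / 2) ** 2))
        alpha = F.softmax(scores, dim=-1) * gauss
        # 3) Context vector centered at the predicted position.
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)
        return context, alpha
```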
RNN for Captioning

Figure: a CNN maps the image (H x W x 3) to a single feature vector (D), which initializes the hidden state h0 (hidden state: H). At each step the RNN produces a distribution over the vocabulary (d1, d2, ...) and emits a word (y1 = first word, y2 = second word, ...), updating the hidden state (h1, h2, ...).

• The RNN only looks at the whole image, once.
• What if the RNN looks at different parts of the image at each time step?
Soft Attention for Captioning

Figure: a CNN maps the image (H x W x 3) to a grid of features (L x D). From the initial hidden state h0 the model computes a1, a distribution over the L locations, and z1, the weighted combination of the features (a D-dimensional vector). z1 and the first word y1 feed h1, which produces d1, a distribution over the vocabulary, and a2, a new distribution over locations. Then z2 (the features weighted by a2) and the second word y2 feed h2, which produces d2 and a3, and so on.

Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
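A minimal sketch of one attention step over the L x D feature grid, in the spirit of Xu et al.'s soft attention; the function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_attend(features, scores):
    """One step of soft attention over a grid of image features.
    features: (L, D) grid of CNN features; scores: (L,) unnormalized scores
    produced from the current hidden state. Returns the weighted feature
    vector z (D,) and the attention distribution a (L,)."""
    a = F.softmax(scores, dim=0)   # distribution over the L locations
    z = a @ features               # weighted combination of features: (D,)
    return z, a

# Toy usage: 196 locations (a 14x14 grid), 512-dimensional features.
features = torch.randn(196, 512)
scores = torch.randn(196)
z, a = soft_attend(features, scores)
```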
Soft vs Hard Attention

Figure: a CNN maps the image (H x W x 3) to a grid of features a, b, c, d (each D-dimensional). From the RNN we get pa, pb, pc, pd, a distribution over the grid locations with pa + pb + pc + pd = 1, and we form a context vector z (D-dimensional).

• Soft attention: summarize ALL locations, z = pa·a + pb·b + pc·c + pd·d. The derivative dz/dp is nice! Train with gradient descent.
• Hard attention: sample ONE location according to p, and set z to that location's feature vector. With argmax, dz/dp is zero almost everywhere, so we can't use gradient descent; we need reinforcement learning.

Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
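A minimal sketch of the two context-vector computations, with illustrative names; the hard branch only shows the sampling step, not the REINFORCE-style training it would require:

```python
import torch

def soft_context(features, p):
    """Soft attention: differentiable weighted sum over all locations.
    features: (L, D); p: (L,) probabilities summing to 1."""
    return p @ features            # z = sum_l p_l * feature_l

def hard_context(features, p):
    """Hard attention: sample a single location; gradients w.r.t. p do not
    flow through the sample, so training needs reinforcement learning."""
    idx = torch.multinomial(p, num_samples=1)
    return features[idx.item()]    # z = the sampled location's feature vector

features = torch.randn(4, 512)                  # locations a, b, c, d
p = torch.tensor([0.1, 0.2, 0.3, 0.4])          # pa + pb + pc + pd = 1
z_soft = soft_context(features, p)
z_hard = hard_context(features, p)
```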
Multi-headed Attention
Attention is all you need
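The figures for these slides are not recoverable here, but as a reference point, here is a minimal sketch of scaled dot-product attention run over multiple heads in the style of Vaswani et al., "Attention Is All You Need" (the learned input/output projections are omitted; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def multi_head_attention(Q, K, V, num_heads):
    """Scaled dot-product attention computed in parallel over several heads.
    Q, K, V: (batch, seq_len, d_model), with d_model divisible by num_heads."""
    batch, seq_len, d_model = Q.shape
    d_head = d_model // num_heads

    def split(x):
        # Split the model dimension into heads: (batch, heads, seq_len, d_head).
        return x.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    q, k, v = split(Q), split(K), split(V)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (batch, heads, seq, seq)
    weights = F.softmax(scores, dim=-1)                # attention distribution per head
    out = weights @ v                                  # (batch, heads, seq, d_head)
    # Concatenate the heads back into the model dimension.
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)
```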
Attention tricks
Attention Takeaways

Performance:
• Attention models can improve accuracy and reduce computation at the same time.

Complexity:
• There are many design choices.
• Those choices have a big effect on performance.
• Ensembling has unusually large benefits.
• Simplify where possible!
Explainability:
• Attention models encode explanations.
• Both locus and trajectory help understand what's going on.

Hard vs. Soft:
• Soft models are easier to train; hard models require reinforcement learning.
• They can be combined, as in Luong et al.