Question Answering and Machine Comprehension with Neural Attention
Minjoon Seo, PhD Student
Computer Science & Engineering, University of Washington
Two End-to-End Question Answering Systems with Neural Attention
• Bidirectional Attention Flow (BiDAF)
  • On the Stanford Question Answering Dataset and the CNN/DailyMail Cloze Test
• Query-Reduction Networks (QRN)
  • On the bAbI QA and dialog datasets, and DSTC2
Question Answering Task (Stanford Question Answering Dataset, 2016)
Q: Which NFL team represented the AFC at Super Bowl 50?
A: Denver Broncos
Why Neural Attention?
Q: Which NFL team represented the AFC at Super Bowl 50?
Attention allows a deep learning architecture to focus on the phrase of the context that is most relevant to the query, in a differentiable manner.
Our Model: Bi-directional Attention Flow (BiDAF)

[Figure: high-level pipeline, Attention → Modeling → MLP + softmax over the answer-span indices (here start index = 0, end index = 1). Context: "Barack Obama is the president of the U.S." Query: "Who leads the United States?"]
(Bidirectional) Attention Flow

[Figure: the BiDAF architecture. From bottom to top: Character Embed Layer (Char-CNN) and Word Embed Layer (GLOVE) over the context words x1…xT and the query words q1…qJ; Phrase Embed Layer (bidirectional LSTMs) producing h1…hT and u1…uJ; Attention Flow Layer with Context2Query (softmax) and Query2Context (max + softmax) attention; Modeling Layer (LSTMs) producing m1…mT from the query-aware context g1…gT; Output Layer predicting Start (Dense + Softmax) and End (LSTM + Softmax).]
Char/Word Embedding Layers
Character and Word Embedding
• Word embedding is fragile against unseen words
• Character embedding can't easily learn the semantics of words
• Use both!
• Character embedding as proposed by Kim (2015)
[Figure: character embedding of the word "Seattle": a CNN over the character sequence followed by max pooling, concatenated with the word embedding vector; a minimal sketch follows]
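Below is a minimal PyTorch sketch of this embedding scheme; the dimensions, vocabulary sizes, and kernel width are illustrative assumptions, not the exact configuration used in BiDAF. A character CNN is max-pooled over character positions and concatenated with a word embedding that would, in practice, be initialized from GloVe.

```python
import torch
import torch.nn as nn

class CharWordEmbedding(nn.Module):
    def __init__(self, num_chars=100, char_dim=16, char_out=100,
                 kernel_size=5, word_vocab=10000, word_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_out, kernel_size,
                                  padding=kernel_size // 2)
        # In practice this would be initialized from GloVe vectors and frozen.
        self.word_emb = nn.Embedding(word_vocab, word_dim)

    def forward(self, char_ids, word_ids):
        # char_ids: (batch, num_words, num_chars), word_ids: (batch, num_words)
        B, T, C = char_ids.shape
        c = self.char_emb(char_ids.view(B * T, C))     # (B*T, C, char_dim)
        c = self.char_cnn(c.transpose(1, 2))           # (B*T, char_out, C)
        c = c.max(dim=2).values.view(B, T, -1)         # max-pool over characters
        w = self.word_emb(word_ids)                    # (B, T, word_dim)
        return torch.cat([c, w], dim=-1)               # (B, T, char_out + word_dim)

emb = CharWordEmbedding()
x = emb(torch.randint(0, 100, (2, 7, 12)), torch.randint(0, 10000, (2, 7)))
print(x.shape)  # torch.Size([2, 7, 200])
```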
Phrase Embedding Layer
• Inputs: the char/word embeddings of the query and context words
• Outputs: word representations aware of their neighbors (phrase-aware words)
• Apply a bidirectional RNN (LSTM) to both the query and the context (a minimal sketch follows)
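A minimal sketch of this layer under assumed dimensions: a single bidirectional LSTM, shared here between context and query for simplicity, applied to the concatenated char/word embeddings.

```python
import torch
import torch.nn as nn

d = 100                                   # hidden size per direction (assumption)
phrase_lstm = nn.LSTM(input_size=200, hidden_size=d,
                      batch_first=True, bidirectional=True)

context_emb = torch.randn(2, 30, 200)     # (batch, T context words, emb dim)
query_emb = torch.randn(2, 8, 200)        # (batch, J query words, emb dim)

H, _ = phrase_lstm(context_emb)           # (batch, T, 2d) phrase-aware context
U, _ = phrase_lstm(query_emb)             # (batch, J, 2d) phrase-aware query
print(H.shape, U.shape)
```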
Attention Layer
• Inputs: phrase-aware context and query words
• Outputs: query-aware representations of the context words
• Context-to-query attention: for each (phrase-aware) context word, choose the most relevant word from the (phrase-aware) query words
• Query-to-context attention: choose the context word that is most relevant to any of the query words (a minimal sketch of both directions follows this list)
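The sketch below illustrates both attention directions under simplifying assumptions: the similarity between a context word and a query word is a plain dot product here, whereas BiDAF uses a trainable trilinear function of the two vectors and their elementwise product.

```python
import torch
import torch.nn.functional as F

def bidaf_attention(H, U):
    # H: (batch, T, d) phrase-aware context, U: (batch, J, d) phrase-aware query
    S = torch.einsum('btd,bjd->btj', H, U)            # similarity matrix (batch, T, J)

    # Context-to-query: each context word attends over all query words.
    a = F.softmax(S, dim=2)                           # (batch, T, J)
    U_tilde = torch.einsum('btj,bjd->btd', a, U)      # attended query per context word

    # Query-to-context: attend over context words by their best query match.
    b = F.softmax(S.max(dim=2).values, dim=1)         # (batch, T)
    h_tilde = torch.einsum('bt,btd->bd', b, H)        # single attended context vector
    H_tilde = h_tilde.unsqueeze(1).expand_as(H)       # tiled across T positions

    # Query-aware context representation G (fed to the modeling layer).
    return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)

G = bidaf_attention(torch.randn(2, 30, 200), torch.randn(2, 8, 200))
print(G.shape)  # torch.Size([2, 30, 800])
```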
Context-to-Query Attention (C2Q)
Q: Who leads the United States?
C: Barack Obama is the president of the USA.
For each context word, find the most relevant query word.
Query-to-Context Attention (Q2C)
While Seattle's weather is very nice in summer, its weather is very rainy in winter, making it one of the most gloomy cities in the U.S. LA is …
Q: Which city is gloomy in winter?
Modeling Layer
• Attention layer: models interactions between the query and the context
• Modeling layer: models interactions within the (query-aware) context words via an RNN (LSTM)
• Division of labor: let the attention and modeling layers each focus solely on their own task
• We experimentally show that this leads to better results than intermixing attention and modeling
Output Layer
• Predicts the answer span: the start index via a dense layer + softmax over the context positions, and the end index via an LSTM + softmax (a minimal sketch of the modeling and output layers follows)
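A minimal sketch of the modeling and output layers under assumed dimensions (d = 100, so G has 8d channels); module and variable names are illustrative. A two-layer bidirectional LSTM runs over the query-aware context G, a dense layer + softmax scores the start index, and an additional LSTM + softmax pass scores the end index.

```python
import torch
import torch.nn as nn

d = 100
model_lstm = nn.LSTM(8 * d, d, num_layers=2, batch_first=True, bidirectional=True)
end_lstm = nn.LSTM(2 * d, d, batch_first=True, bidirectional=True)
w_start = nn.Linear(10 * d, 1)   # scores [G; M] at each context position
w_end = nn.Linear(10 * d, 1)     # scores [G; M2] at each context position

G = torch.randn(2, 30, 8 * d)                                    # query-aware context
M, _ = model_lstm(G)                                             # (batch, T, 2d)
start_logits = w_start(torch.cat([G, M], dim=-1)).squeeze(-1)    # (batch, T)
M2, _ = end_lstm(M)                                              # extra pass for the end index
end_logits = w_end(torch.cat([G, M2], dim=-1)).squeeze(-1)       # (batch, T)
print(start_logits.shape, end_logits.shape)
```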
Training
• Minimize the negative log probabilities of the true start index and the true end index, averaged over all examples:

$L(\theta) = -\frac{1}{N}\sum_{i}\left[\log \mathbf{p}^{1}_{y_i^{1}} + \log \mathbf{p}^{2}_{y_i^{2}}\right]$

where $y_i^1$ and $y_i^2$ are the true start and end indices of example $i$, and $\mathbf{p}^1$ and $\mathbf{p}^2$ are the predicted probability distributions over the start and end indices.
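In code, this objective is ordinary cross-entropy on the two index distributions; the sketch below (tensor names are illustrative) sums the start and end terms and averages over the batch.

```python
import torch
import torch.nn.functional as F

def span_loss(start_logits, end_logits, y_start, y_end):
    # start_logits, end_logits: (batch, T) unnormalized scores over context positions
    # y_start, y_end: (batch,) true start/end indices
    return F.cross_entropy(start_logits, y_start) + F.cross_entropy(end_logits, y_end)

loss = span_loss(torch.randn(4, 30), torch.randn(4, 30),
                 torch.tensor([0, 3, 5, 7]), torch.tensor([1, 4, 9, 7]))
print(loss.item())
```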
Previous work
• Using neural attention as a controller (Xiong et al., 2016)
• Using neural attention within an RNN (Wang & Jiang, 2016)
• Most of these attention mechanisms are uni-directional

BiDAF (our model):
• uses neural attention as a layer,
• is separated from the modeling part (RNN),
• is bidirectional
Image Classifier and BiDAF
[Figure: the VGG-16 architecture shown alongside the BiDAF (ours) architecture]
Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
• Most popular articles from Wikipedia
• Questions and answers from Turkers
• 90k train, 10k dev, ? test (hidden)
• The answer must lie in the context
• Two metrics: Exact Match (EM) and F1
SQuAD Results (http://stanford-qa.com), as of 12pm today

                                        EM      F1
Stanford [1] (baseline)                 40.4    51.0
IBM [2]                                 62.5    71.0
CMU [3]                                 62.5    73.3
Singapore Management [4] (ensemble)     67.9    77.0
IBM Research (ensemble)                 68.2    77.2
Salesforce Research [6] (ensemble)      71.6    80.4
Microsoft Research Asia (ensemble)      72.1    79.7
Ours (ensemble)                         73.3    81.1

[1] Rajpurkar et al. (2016)  [2] Yu et al. (2016)  [3] Yang et al. (2016)  [4] Wang & Jiang (2016)  [6] Xiong et al. (2016)
Ablations on dev data
[Bar chart: EM and F1 for No Char Embedding, No Word Embedding, No C2Q Attention, No Q2C Attention, Dynamic Attention, and the Full Model; y-axis 50–80]
Interactive Demo
http://allenai.github.io/bi-att-flow/demo
Attention Visualizations

There are 13 natural reserves in Warsaw – among others, Bielany Forest, Kabaty Woods, Czerniaków Lake. About 15 kilometres (9 miles) from Warsaw, the Vistula river's environment changes strikingly and features a perfectly preserved ecosystem, with a habitat of animals that includes the otter, beaver and hundreds of bird species. There are also several lakes in Warsaw – mainly the oxbow lakes, like Czerniaków Lake, the lakes in the Łazienki or Wilanów Parks, Kamionek Lake. There are lot of small lakes in the parks, but only a few are permanent – the majority are emptied before winter to clean them of plants and sediments.

Q: How many natural reserves are there in Warsaw?

[Attention map: "How many" attends to number-like context words (hundreds, few, among, 15, several, only, 13, 9); "natural" and "reserves" attend to "natural" and "reserves"; "are there" attends to occurrences of "are" and "includes"; "Warsaw" attends to the occurrences of "Warsaw"]
Q: Where did Super Bowl 50 take place?

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

[Attention map: "Where" attends to location words (at, the, at, Stadium, Levi, in, Santa, Ana); "Super" and "Bowl" attend to every occurrence of "Super" and "Bowl"; "50" attends to "50" and "initiatives"]
Embedding Visualization at Word vs. Phrase Layers

[Visualization: at the word embedding layer, "may" sits near the month names (January, September, August, July, May); at the phrase embedding layer, contexts with the modal sense ("effect and may result in", "the state may not aid", "of these may be more") separate from contexts with the month sense ("Opening in May 1852 at", "debut on May 5 ,", "from 28 January to 25", "but by September had been")]
How does it compare with feature-based models?
CNN/DailyMail Cloze Test (Hermann et al., 2015)
• Cloze test (predicting missing words)
• Articles from CNN/DailyMail
• Human-written summaries
• Missing words are always entities
• CNN: 300k article-query pairs
• DailyMail: 1M article-query pairs

CNN/DailyMail Cloze Test Results
Some limitations of SQuAD
Two Question Answering Systems with Neural Attention
• Bidirectional Attention Flow (BiDAF)
  • On the Stanford Question Answering Dataset and the CNN/DailyMail Cloze Test
• Query-Reduction Networks (QRN)
  • On the bAbI QA and dialog datasets
Reasoning Question Answering
Dialog System
U: Can you book a table in Rome in Italian cuisine?
S: How many people in your party?
U: For four people please.
S: What price range are you looking for?
Dialog task vs. QA
• A dialog system can be considered as a QA system:
  • The last user utterance is the query
  • All previous turns of the conversation are the context for the query
  • The system's next response is the answer to the query
• It poses a few unique challenges:
  • A dialog system requires tracking states
  • A dialog system needs to look at multiple sentences in the conversation
  • Building an end-to-end dialog system is more challenging
Our approach: Query-Reduction

<START> Sandra got the apple there. Sandra dropped the apple. Daniel took the apple there. Sandra went to the hallway. Daniel journeyed to the garden.

Q: Where is the apple?

Reduced query:
Where is the apple? → Where is Sandra? → Where is Sandra? → Where is Daniel? → Where is Daniel? → Where is Daniel? → garden

A: garden
Query-Reduction Networks
• Reduce the query into an easier-to-answer query over the sequence of state-changing triggers (sentences), in vector space
[Figure: QRN unrolled over the story. Inputs, left to right: "Sandra got the apple there.", "Sandra dropped the apple.", "Daniel took the apple there.", "Sandra went to the hallway.", "Daniel journeyed to the garden.", with the question "Where is the apple?" fed in at the bottom. The reduced query after each sentence is "Where is Sandra?", "Where is Sandra?", "Where is Daniel?", "Where is Daniel?", "Where is Daniel?", and the final state decodes to the answer "garden".]
QRN Cell

[Figure: one QRN cell. Inputs: sentence vector $\mathbf{x}_t$ and query vector $\mathbf{q}_t$; recurrent input: the previous reduced query (hidden state) $\mathbf{h}_{t-1}$. The update function $\alpha$ produces the update gate $\mathbf{z}_t$, and the reduction function $\rho$ produces the candidate reduced query $\tilde{\mathbf{h}}_t$. The new reduced query is]

$\mathbf{h}_t = \mathbf{z}_t \odot \tilde{\mathbf{h}}_t + (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1}$
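A minimal sketch of one QRN layer following the cell above; the particular parameterizations of α and ρ (a sigmoid gate over the elementwise product, a tanh over the concatenation) and the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QRNLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.alpha = nn.Linear(d, d)        # update function  -> gate z_t
        self.rho = nn.Linear(2 * d, d)      # reduction function -> candidate h~_t

    def forward(self, X, q):
        # X: (batch, T, d) sentence vectors, q: (batch, d) query vector
        h = torch.zeros_like(q)             # initial reduced query
        states = []
        for t in range(X.size(1)):
            x_t = X[:, t]
            z_t = torch.sigmoid(self.alpha(x_t * q))                     # update gate
            h_cand = torch.tanh(self.rho(torch.cat([x_t, q], dim=-1)))   # candidate reduced query
            h = z_t * h_cand + (1 - z_t) * h                             # gated update
            states.append(h)
        return torch.stack(states, dim=1)   # reduced queries at every time step

layer = QRNLayer(d=32)
out = layer(torch.randn(2, 5, 32), torch.randn(2, 32))
print(out.shape)  # torch.Size([2, 5, 32])
```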
Characteristics of QRN
• The update gate can be considered as local attention
  • QRN chooses to consider or ignore each candidate reduced query
  • The decision is made locally (as opposed to global softmax attention)
• A subclass of recurrent neural networks (RNNs)
  • Two inputs, a hidden state, and a gating mechanism
  • Able to handle sequential dependencies (which attention cannot)
• The simpler recurrent update enables parallelization over time
  • The candidate hidden state (reduced query) is computed from the inputs only
  • The hidden state can be explicitly computed as a function of the inputs
Parallelization
• The candidate reduced queries are computed from the inputs only, so they can be computed trivially in parallel across time
• The hidden state can be explicitly expressed as a (geometric) weighted sum of the previous candidate hidden states:

$\mathbf{h}_t = \sum_{i=1}^{t} \Big( \prod_{j=i+1}^{t} (1 - \mathbf{z}_j) \Big) \mathbf{z}_i \tilde{\mathbf{h}}_i$
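A sketch of this parallel form, assuming the gates and candidates have already been computed from the inputs: the recurrence unrolls into a weighted sum of candidates whose weights are products of (1 − z), so all time steps can be evaluated at once.

```python
import torch

def qrn_parallel(Z, H_cand):
    # Z, H_cand: (batch, T, d) update gates and candidate reduced queries,
    # both precomputed from the inputs alone (no recurrence needed).
    T = Z.size(1)
    log_keep = torch.log(1.0 - Z + 1e-7).cumsum(dim=1)        # log prod_{j<=t} (1 - z_j)
    # weight of candidate i at time t: prod_{j=i+1..t} (1 - z_j) * z_i
    W = torch.exp(log_keep.unsqueeze(2) - log_keep.unsqueeze(1)) * Z.unsqueeze(1)
    mask = torch.ones(T, T).tril().view(1, T, T, 1)           # keep only i <= t
    return (W * mask * H_cand.unsqueeze(1)).sum(dim=2)        # (batch, T, d)

out = qrn_parallel(torch.rand(2, 5, 16), torch.randn(2, 5, 16))
print(out.shape)  # torch.Size([2, 5, 16])
```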
Characteristics of QRN
• The update gate can be considered as local attention
• A subclass of recurrent neural networks (RNNs)
• The simpler recurrent update enables parallelization over time

QRN sits between the neural attention mechanism and recurrent neural networks, taking advantage of both paradigms.
bAbI QA Dataset
• 20 different tasks
• 1k story-question pairs for each task (10k also available)
• Synthetically generated
• Many questions require looking at multiple sentences
• For end-to-end systems supervised by answers only
What’sdifferentfromSQuAD?
• Synthetic• Morethanlexical/syntacticunderstanding• Differentkindsofinferences• induction,deduction,counting,pathfinding,etc.
• Reasoningovermultiplesentences• InterestingtestbedtowardsdevelopingcomplexQAsystem(anddialogsystem)
bAbI QA Results (1k)
[Bar chart: average error (%) for LSTM, DMN+, MemN2N, GMemN2N, and QRN (ours); y-axis 0–60]
bAbI QA Results (10k)
[Bar chart: average error (%) for MemN2N, DNC, GMemN2N, DMN+, and QRN (ours); y-axis 0–4.5]
Dialog Datasets
• bAbI Dialog Dataset
  • Synthetic
  • 5 different tasks
  • 1k dialogs for each task
• DSTC2* Dataset
  • Real dataset
  • The evaluation metric differs from the original DSTC2: response generation instead of "state tracking"
  • Each dialog is 800+ utterances
  • 2407 possible responses
bAbI Dialog Results (OOV)
[Bar chart: average error (%) for MemN2N, GMemN2N, and QRN (ours); y-axis 0–35]
DSTC2* Dialog Results
[Bar chart: average error (%) for MemN2N, GMemN2N, and QRN (ours); y-axis 0–70]
bAbI QA Visualization
[Figure: z^l = local attention (update gate) at layer l]
DSTC2 (Dialog) Visualization
[Figure: z^l = local attention (update gate) at layer l]
Conclusion
• Presented two novel approaches to QA tasks using neural attention
• Bidirectional Attention Flow: uses attention as a layer, in both directions (context-to-query and query-to-context)
• Query-Reduction Networks: a sequential model that takes advantage of both attention and RNNs for reasoning over multiple sentences
Thanks!
Why do we need attention?
• RNNs have a long-term dependency problem
  • Vanishing gradients (Pascanu et al., 2013)
  • Inherently unstable over a long period of time (Weston et al., 2016)
• Attention provides shortcut access to relevant information
  • It directly retrieves the context vector from a distant location
• Critical to most modern sequence models
  • Machine translation
  • Question answering, machine comprehension
Neural Attention in Sequence Modeling (Bahdanau et al., 2015)
• Apply an RNN over the context vectors
• Apply an RNN over the query vectors
• At each time step, use neural attention to soft-select a single context vector
• Use the selected context vector, along with the current query vector and the current hidden state, to obtain the next hidden state (a minimal sketch follows)
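A generic sketch of this soft-selection step; dot-product scoring is a simplifying assumption (Bahdanau et al. use an additive scoring function). The current hidden state scores every context vector, a softmax turns the scores into weights, and the attended context vector is their weighted sum.

```python
import torch
import torch.nn.functional as F

def attend(hidden, encoder_states):
    # hidden: (batch, d) current query/decoder state
    # encoder_states: (batch, T, d) context vectors
    scores = torch.einsum('bd,btd->bt', hidden, encoder_states)    # relevance of each position
    weights = F.softmax(scores, dim=1)                             # differentiable soft selection
    return torch.einsum('bt,btd->bd', weights, encoder_states)     # attended context vector

ctx = attend(torch.randn(2, 64), torch.randn(2, 10, 64))
print(ctx.shape)  # torch.Size([2, 64])
```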