End-to-end Dialog and Question Answering Systems Presenter: Bishan Yang



Page 1

End-to-end Dialog and Question Answering Systems

Presenter: Bishan Yang

Page 2

Outline

• End-to-end Memory Networks
• Dialog-based Language Learning
• Learning through Dialogue Interactions by Asking Questions

Page 3

End-to-end Memory Networks

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus. NIPS 2015

A large part of the slides borrowed from http://www.thespermwhale.com/jaseweston/icml2016/

Page 4

Key ideas

• Long-term memory is required to read a story and then e.g. answer questions about it, respond to a dialog

• Reading a story requires retrieving relevant information from the memory and reasoning over the retrieved information
  • may need multiple lookups

Page 5

MemNNs for Question Answering

Sam moved to garden.
Mary left the garden.
Sam went to kitchen.
Mary moved to the hallway.
Sam drops apple there.
…

Where is Sam?

Page 6

MemNNs for Question Answering

Sam moved to garden.
Mary left the garden.
Sam went to kitchen.
Mary moved to the hallway.
Sam drops apple there.
…

Where is Sam? Kitchen

Page 7

Architecture

[Diagram: Input → Controller module (internal state vector) ↔ Memory Module (memory vectors, unordered) → Output, with supervision applied to the output]

Page 8

Question Answering

[Diagram: The input story is stored as memory vectors (1: Sam moved to garden, 2: Sam went to kitchen, 3: Sam drops apple there). The question "Where is Sam?" is embedded by the controller, matched against the memories with a dot product + softmax, the memories are combined with a weighted sum, and the answer "kitchen" is output.]

Page 9

Memory Vectors

E.g.) constructing memory vectors with Bag-of-Words (BoW):
1. Embed each word
2. Sum embedding vectors

E.g.) temporal structure: special words for time, included in the BoW

Memory vector = sum of the word embedding vectors, plus a time embedding:

$$\text{1: ``Sam drops apple''} \;\rightarrow\; v_{\text{Sam}} + v_{\text{drops}} + v_{\text{apple}} + v_{\text{time 1}}$$
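The construction above can be sketched in a few lines of Python (a toy illustration, not the authors' code: the vocabulary, dimensionality and the helper names `word2id` and `memory_vector` are made up, and the embedding matrix is random rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20  # embedding dimension (toy value)

# Toy vocabulary: content words plus special "time" words appended to each memory.
vocab = ["Sam", "drops", "apple", "went", "to", "kitchen", "time1", "time2"]
word2id = {w: i for i, w in enumerate(vocab)}
A = rng.normal(scale=0.1, size=(len(vocab), d))  # embedding matrix (learned in the real model)

def memory_vector(sentence_words, time_word):
    """BoW memory vector: sum the word embeddings, then add a time embedding."""
    ids = [word2id[w] for w in sentence_words] + [word2id[time_word]]
    return A[ids].sum(axis=0)

# Memory 1: "Sam drops apple", tagged with the time word "time1"
m1 = memory_vector(["Sam", "drops", "apple"], "time1")
print(m1.shape)  # (20,)
```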

Page 10

Task (1) Factoid QA with Single Supporting Fact ("where is actor")

John is in the playground.
Bob is in the office.
Where is John? A: playground

SUPPORTING FACT

Page 11

(2) Factoid QA with Two Supporting Facts ("where is actor + object")

John is in the playground.
Bob is in the office.
John picked up the football.
Bob went to the kitchen.
Where is the football? A: playground

SUPPORTING FACT

SUPPORTING FACT

Page 12

(4) Two Argument Relations: Subject vs. Object

The office is north of the bedroom.
The bedroom is north of the bathroom.
What is north of the bedroom? A: office
What is the bedroom north of? A: bathroom

Page 13

(6) Yes/No Questions (with a single supporting fact)

John is in the playground.
Daniel picks up the milk.
Is John in the classroom? A: no
Does Daniel have the milk? A: yes

Page 14

(7) Counting

Daniel picked up the football.
Daniel dropped the football.
Daniel got the milk.
Daniel took the apple.
How many objects is Daniel holding? A: two

(8) Lists/Sets

Daniel picks up the football.
Daniel drops the newspaper.
Daniel picks up the milk.
What is Daniel holding? A: milk, football

Page 15

(14) Time manipulation

In the afternoon Julie went to the park.
Yesterday Julie was at school.
Julie went to the cinema this evening.
Where did Julie go after the park? A: cinema

Page 16

(15) Basic Deduction

Sheep are afraid of wolves.
Cats are afraid of dogs.
Mice are afraid of cats.
Gertrude is a sheep.
What is Gertrude afraid of? A: wolves

Page 17

(17) Positional Reasoning

The triangle is to the right of the blue square.
The red square is on top of the blue square.
The red sphere is to the right of the blue square.
Is the red sphere to the right of the blue square? A: yes
Is the red square to the left of the triangle? A: yes

Page 18

(18) Reasoning about size

The football fits in the suitcase.
The suitcase fits in the cupboard.
The box of chocolates is smaller than the football.
Will the box of chocolates fit in the suitcase? A: yes

Tasks 3 (three supporting facts) and 6 (Yes/No) are prerequisites.

Page 19

(19) Path Finding

The kitchen is north of the hallway.
The den is east of the hallway.
How do you go from den to kitchen? A: west, north

Page 20

TASK | N-grams | LSTMs | MemN2N | Memory Networks | StructSVM+coref+srl
T1. Single supporting fact | 36 | 50 | PASS | PASS | PASS
T2. Two supporting facts | 2 | 20 | 87 | PASS | 74
T3. Three supporting facts | 7 | 20 | 60 | PASS | 17
T4. Two argument relations | 50 | 61 | PASS | PASS | PASS
T5. Three argument relations | 20 | 70 | 87 | PASS | 83
T6. Yes/no questions | 49 | 48 | 92 | PASS | PASS
T7. Counting | 52 | 49 | 83 | 85 | 69
T8. Sets | 40 | 45 | 90 | 91 | 70
T9. Simple negation | 62 | 64 | 87 | PASS | PASS
T10. Indefinite knowledge | 45 | 44 | 85 | PASS | PASS
T11. Basic coreference | 29 | 72 | PASS | PASS | PASS
T12. Conjunction | 9 | 74 | PASS | PASS | PASS
T13. Compound coreference | 26 | PASS | PASS | PASS | PASS
T14. Time reasoning | 19 | 27 | PASS | PASS | PASS
T15. Basic deduction | 20 | 21 | PASS | PASS | PASS
T16. Basic induction | 43 | 23 | PASS | PASS | 24
T17. Positional reasoning | 46 | 51 | 49 | 65 | 61
T18. Size reasoning | 52 | 52 | 89 | PASS | 62
T19. Path finding | 0 | 8 | 7 | 36 | 49
T20. Agent's motivation | 76 | 91 | PASS | PASS | PASS

Training on 1k stories. N-grams, LSTMs and MemN2N are weakly supervised; Memory Networks and StructSVM+coref+srl use supervised supporting facts.

Page 21

Attention during memory lookups

Page 22

Dialog-based Language Learning

Jason Weston. NIPS 2016

A large part of the slides borrowed from https://robonlp2017.github.io/slides/jason_weston.pptx

Page 23

Why Dialog?

• Learning from interaction with humans
  • learn from language feedback
  • learn from asking questions (which questions? when to ask?)

Page 24

Key ideas

• A chatbot can improve by understanding a teacher's feedback/response
  • Learning to predict the feedback (forward modeling)

• A chatbot can improve by asking questions

Page 25

Traditional Labeled Supervision

Mary went to the hallway.
John moved to the bathroom.
Mary travelled to the kitchen.
Where is Mary? A: playground (-)
Where is John? A: bathroom (+)

Page 26

Learning From Human Responses

Mary went to the hallway.
John moved to the bathroom.
Mary travelled to the kitchen.
Where is Mary? A: playground
No, that's incorrect.
Where is John? A: bathroom
Yes, that's right!

If you can predict this, you are most of the way to knowing how to answer correctly.

Page 27

Human Responses Give Lots of Info

Mary went to the hallway.
John moved to the bathroom.
Mary travelled to the kitchen.
Where is Mary? A: playground
No, the answer is kitchen.
Where is John? A: bathroom
Yes, that's right!

Much more signal than just "No" or zero reward.

Page 28

Human Responses Give Lots of Info

Mary went to the hallway.
John moved to the bathroom.
Mary travelled to the kitchen.
Where is Mary? A: playground
No, the answer is kitchen, because Mary went to the kitchen.
Where is John? A: bathroom
Yes, that's right!

Page 29

Forward Prediction Memory Network

"Unsupervised" Forward Model: does not require labeled supervision

[Diagram: Input → Controller module with an internal state vector (initially the query q); repeated addressing/read steps over the memory vectors m in the Memory Module; a final addressing/read step over the candidate answers, incorporating the answer (action taken); Output: predict the response to the answer, i.e. the new state of the world]

… memories relevant to the last utterance x, i.e. the most relevant have large values of $p^1_i$. The output memory representation $o_1$ is then constructed using the weighted sum of memories, i.e. weighted by $p^1$. The memory output is then added to the original input, $u_1 = R_1(o_1 + q)$, to form the new state of the controller, where $R_1$ is a $d \times d$ rotation matrix². The attention over the memory can then be repeated using $u_1$ instead of $q$ as the addressing vector, yielding:

$$o_2 = \sum_i p^2_i m_i, \qquad p^2_i = \mathrm{Softmax}(u_1^\top m_i),$$

The controller state is updated again with $u_2 = R_2(o_2 + u_1)$, where $R_2$ is another $d \times d$ matrix to be learnt. In a two-hop model the final output is then defined as:

$$\hat{a} = \mathrm{Softmax}(u_2^\top A y_1, \ldots, u_2^\top A y_C) \qquad (1)$$

where there are C candidate answers in y. In our experiments C is the set of actions that occur in the training set for the bAbI tasks, and for MovieQA it is the set of words retrieved from the KB.
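As an illustration of the two-hop computation just described, here is a minimal numpy sketch (not the authors' implementation: parameters are random stand-ins for learned ones, a single embedding matrix is shared everywhere, and all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem, n_cand = 20, 5, 4          # embedding dim, #memories, #candidate answers (toy sizes)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Pretend these were produced by embedding the story, the query and the candidates.
m = rng.normal(size=(n_mem, d))      # memory vectors m_i
q = rng.normal(size=d)               # embedded query
Ay = rng.normal(size=(n_cand, d))    # embedded candidate answers A y_1 .. A y_C
R1 = rng.normal(size=(d, d))         # learned d x d matrices in the real model
R2 = rng.normal(size=(d, d))

# Hop 1: address memories with the query, read a weighted sum, update the controller state.
p1 = softmax(m @ q)                  # p1_i = Softmax(q^T m_i)
o1 = p1 @ m                          # o1 = sum_i p1_i m_i
u1 = R1 @ (o1 + q)

# Hop 2: repeat the addressing with u1 as the addressing vector.
p2 = softmax(m @ u1)                 # p2_i = Softmax(u1^T m_i)
o2 = p2 @ m
u2 = R2 @ (o2 + u1)

# Final output (eq. 1): score each candidate answer against u2.
a_hat = softmax(Ay @ u2)
print(a_hat)                         # distribution over the C candidate answers
```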

Having described the basic architecture, we now detail the possible training strategies we can employ for our tasks.

Imitation Learning  This approach involves simply imitating one of the speakers in observed dialogs, which is essentially a supervised learning objective³. This is the setting that most existing dialog learning, as well as question answering systems, employ for learning. Examples arrive as (x, c, a) triples, where a is (assumed to be) a good response to the last utterance x given context c. In our case, the whole memory network model defined above is trained using stochastic gradient descent by minimizing a standard cross-entropy loss between the prediction $\hat{a}$ and the label a.

Reward-based Imitation  If some actions are poor choices, then one does not want to repeat them, that is, we shouldn't treat them as a supervised objective. In our setting positive reward is only obtained immediately after (some of) the correct actions, or else is zero. A simple strategy is thus to only apply imitation learning on the rewarded actions. The rest of the actions are simply discarded from the training set. This strategy is derived naturally as the degenerate case one obtains by applying policy gradient [31] in our setting where the policy is fixed (see end of Sec. 3). In more complex settings (i.e. where actions that are made lead to long-term changes in the environment and delayed rewards) applying reinforcement learning algorithms would be necessary, e.g. one could still use policy gradient to train the MemN2N but applied to the model's own policy, as used in [25].
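In code, reward-based imitation amounts to nothing more than filtering the imitation training set by reward before the usual supervised update; a sketch, assuming a hypothetical list of (context, utterance, action, reward) tuples:

```python
# episodes: list of (context, utterance, action, reward) tuples collected from dialogs.
# Reward-based imitation keeps only the actions that received positive reward and
# trains on them exactly as in plain imitation learning; the rest are discarded.
def rbi_training_set(episodes):
    return [(c, x, a) for (c, x, a, r) in episodes if r > 0]
```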

Forward Prediction  An alternative method of training is to perform forward prediction: the aim is, given an utterance x from speaker 1 and an answer a by speaker 2 (i.e., the learner), to predict x̄, the response to the answer from speaker 1. That is, in general, to predict the changed state of the world after action a, which in this case involves the new utterance x̄.

To learn from such data we propose the following modification to memory networks, also shown in Fig. 3 (b): essentially we chop off the final output from the original network of Fig. 3 (a) and replace it with some additional layers that compute the forward prediction. The first part of the network remains exactly the same and only has access to input x and context c, just as before. The computation up to $u_2 = R_2(o_2 + u_1)$ is thus exactly the same as before.

At this point we observe that the computation of the output in the original network, by scoring candidate answers in eq. (1), looks similar to the addressing of memory. Our key idea is thus to perform another "hop" of attention but over the candidate answers rather than the memories. Crucially, we also incorporate the information of which action (candidate) was actually selected in the dialog (i.e. which one is a). After this "hop", the resulting state of the controller is then used to do the forward prediction.

Concretely, we compute:

$$o_3 = \sum_i p^3_i \,(A y_i + \beta^* [a = y_i]), \qquad p^3_i = \mathrm{Softmax}(u_2^\top A y_i), \qquad (2)$$

² Optionally, different dictionaries can be used for inputs, memories and outputs instead of being shared.
³ Imitation learning algorithms are not always strictly supervised algorithms, they can also depend on the agent's actions. That is not the setting we use here, where the task is to imitate one of the speakers in a dialog.


Table 1: Test accuracy (%) on the Single Supporting Fact bAbI dataset for various supervision approaches (training with 1000 examples on each) and different policies π_acc. A task is successfully passed if ≥ 95% accuracy is obtained (shown in blue).

Columns: MemN2N imitation learning, MemN2N reward-based imitation (RBI), MemN2N forward prediction (FP), MemN2N RBI + FP; within each, π_acc = 0.5 / 0.1 / 0.01.

Supervision Type | Imitation | RBI | FP | RBI + FP
1 - Imitating an Expert Student | 100 / 100 / 100 | 100 / 100 / 100 | 23 / 30 / 29 | 99 / 99 / 100
2 - Positive and Negative Feedback | 79 / 28 / 21 | 99 / 92 / 91 | 93 / 54 / 30 | 99 / 92 / 96
3 - Answers Supplied by Teacher | 83 / 37 / 25 | 99 / 96 / 92 | 99 / 96 / 99 | 99 / 100 / 98
4 - Hints Supplied by Teacher | 85 / 23 / 22 | 99 / 91 / 90 | 97 / 99 / 66 | 99 / 100 / 100
5 - Supporting Facts Supplied by Teacher | 84 / 24 / 27 | 100 / 96 / 83 | 98 / 99 / 100 | 100 / 99 / 100
6 - Partial Feedback | 90 / 22 / 22 | 98 / 81 / 59 | 100 / 100 / 99 | 99 / 100 / 99
7 - No Feedback | 90 / 34 / 19 | 20 / 22 / 29 | 100 / 98 / 99 | 98 / 99 / 99
8 - Imitation + Feedback Mixture | 90 / 89 / 82 | 99 / 98 / 98 | 28 / 64 / 67 | 99 / 98 / 97
9 - Asking For Corrections | 85 / 30 / 22 | 99 / 89 / 83 | 23 / 15 / 21 | 95 / 90 / 84
10 - Asking For Supporting Facts | 86 / 25 / 26 | 99 / 96 / 84 | 23 / 30 / 48 | 97 / 95 / 91
Number of completed tasks (≥ 95%) | 1 / 1 / 1 | 9 / 5 / 2 | 5 / 5 / 4 | 10 / 8 / 8

where β* is a d-dimensional vector, that is also learnt, that represents in the output $o_3$ the action that was actually selected. After obtaining $o_3$, the forward prediction is then computed as:

$$\hat{\bar{x}} = \mathrm{Softmax}(u_3^\top A \bar{x}_1, \ldots, u_3^\top A \bar{x}_C)$$

where $u_3 = R_3(o_3 + u_2)$. That is, it computes the scores of the possible responses to the answer a over C possible candidates. The mechanism in eq. (2) gives the model a way to compare the most likely answers to x with the given answer a, which in terms of supervision we believe is critical. For example, in question answering, if the given answer a is incorrect and the model can assign high $p^3_i$ to the correct answer, then the output $o_3$ will contain a small amount of β*; conversely, $o_3$ has a large amount of β* if a is correct. Thus, $o_3$ informs the model of the likely response x̄ from the teacher.
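A small numpy sketch of this forward-prediction head (eq. (2) plus the response scoring), under the same simplifying assumptions as the earlier sketch (random stand-ins for learned parameters, illustrative names and sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_cand, n_resp = 20, 4, 6

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

u2 = rng.normal(size=d)              # controller state after the two memory hops
Ay = rng.normal(size=(n_cand, d))    # embedded candidate answers
Axbar = rng.normal(size=(n_resp, d)) # embedded candidate teacher responses x̄_1 .. x̄_C
beta = rng.normal(size=d)            # learned vector β* marking the selected action
R3 = rng.normal(size=(d, d))
a_idx = 2                            # index of the answer the bot actually gave

# Eq. (2): an extra attention "hop" over candidate answers, adding β* to the selected one.
p3 = softmax(Ay @ u2)                          # p3_i = Softmax(u2^T A y_i)
selected = np.zeros(n_cand)
selected[a_idx] = 1.0
o3 = p3 @ (Ay + np.outer(selected, beta))      # o3 = sum_i p3_i (A y_i + β*[a = y_i])

# Forward prediction: score the candidate teacher responses against u3.
u3 = R3 @ (o3 + u2)
xbar_hat = softmax(Axbar @ u3)
print(xbar_hat)                                # distribution over candidate responses x̄
```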

Training can then be performed using the cross-entropy loss between the prediction $\hat{\bar{x}}$ and the label x̄, similar to before. In the event of a large number of candidates C we subsample the negatives, always keeping x̄ in the set. The set of answers y can also be similarly sampled, making the method highly scalable.

A major benefit of this particular architectural design for forward prediction is that after training with the forward prediction criterion, at test time one can "chop off" the top of the model again to retrieve the original memory network model of Fig. 3 (a). One can thus use it to predict answers a given only x and c. We can thus evaluate its performance directly for that goal as well.

Finally, and importantly, if the answer to the response x̄ carries pertinent supervision information for choosing a, as for example in many of the settings of Sec. 3 (and Fig. 1), then this will be backpropagated through the model. This is simply not the case in the imitation, reward-shaping [24] or reward-based imitation learning strategies which concentrate on the x, a pairs.

Reward-based Imitation + Forward Prediction  As our reward-based imitation learning uses the architecture of Fig. 3 (a), and forward prediction uses the same architecture but with the additional layers of Fig. 3 (b), we can learn jointly with both strategies. One simply shares the weights across the two networks, and performs gradient steps for both criteria, one of each type per action. The former makes use of the reward signal – which when available is a very useful signal – but fails to use potential supervision feedback in the subsequent utterances, as described above. It also effectively ignores dialogs carrying no reward. Forward prediction in contrast makes use of dialog-based feedback and can train without any reward. On the other hand, not using rewards when available is a serious handicap. Hence, the mixture of both strategies is a potentially powerful combination.

5 Experiments

We conducted experiments on the datasets described in Section 3. As described before, for each task we consider a fixed policy for performing actions (answering questions) which gets questions correct with probability π_acc. We can thus compare the different training strategies described in Sec. 4 over each task for different values of π_acc. Hyperparameters for all methods are optimized on the validation sets. A summary of the results is reported in Table 1 for the bAbI dataset and Table 2 for MovieQA. We observed the following results:



Page 30

Figure 1: Sample dialogs with differing supervision signals (tasks 1 to 10). In each case the same example is given for simplicity. Black text is spoken by the teacher, red text denotes responses by the learner, blue text is provided by an expert student (which the learner can imitate), (+) denotes positive reward external to the dialog (e.g. feedback provided by another medium, such as a nod of the head from the teacher).

Story (the same in every task):
Mary went to the hallway.
John moved to the bathroom.
Mary travelled to the kitchen.

Task 1: Imitating an Expert Student
Where is Mary? A:kitchen
Where is John? A:bathroom

Task 2: Positive and Negative Feedback
Where is Mary? A:playground
No, that's incorrect.
Where is John? A:bathroom
Yes, that's right! (+)

Task 3: Answers Supplied by Teacher
Where is Mary? A:bedroom
No, the answer is kitchen.
Where is John? A:bathroom
Correct! (+)

Task 4: Hints Supplied by Teacher
Where is Mary? A:bathroom
No, they are downstairs.
Where is John? A:kitchen
No, they are upstairs.

Task 5: Supporting Facts Supplied by Teacher
Where is Mary? A:kitchen
Yes, that's right! (+)
Where is John? A:hallway
No, because John moved to the bathroom.

Task 6: Partial Feedback
Where is Mary? A:kitchen
Yes, that's right!
Where is John? A:bathroom
Yes, that's correct! (+)

Task 7: No Feedback
Where is Mary? A:kitchen
Yes, that's right!
Where is John? A:bathroom
Yes, that's correct!

Task 8: Imitation and Feedback Mixture
Where is Mary? A:kitchen
Where is John? A:bathroom
That's right! (+)

Task 9: Asking For Corrections
Where is Mary? A:kitchen
Yes, that's right! (+)
Where is John? A:hallway
No, that's not right. A:Can you help me?
Bathroom.

Task 10: Asking For Supporting Facts
Where is Mary? A:kitchen
Yes, that's right! (+)
Where is John? A:hallway
No, that's not right. A:Can you help me?
A relevant fact is John moved to the bathroom.

… have positive supervision. This could clearly be a problem when the learner is unskilled: it will supply incorrect answers and never (or hardly ever) receive positive responses.

Answers Supplied by Teacher  In Task 3 the teacher gives positive and negative feedback as in Task 2; however, when the learner's answer is incorrect, the teacher also responds with the correction. For example if "where is Mary?" is answered with the incorrect answer "bedroom" the teacher responds "No, the answer is kitchen", see Fig. 1 Task 3. If the learner knows how to use this extra information, it effectively has as much supervision signal as with Task 1, and much more than for Task 2.

Hints Supplied by Teacher  In Task 4, the corrections provided by the teacher do not provide the exact answer as in Task 3, but only a useful hint. This setting is meant to mimic the real life …



Page 32

Results

Table 2: Test accuracy (%) on the MovieQA dataset for various supervision approaches. Numbers in bold are the winners for that task and choice of π_acc.

Columns: MemN2N imitation learning, MemN2N reward-based imitation (RBI), MemN2N forward prediction (FP), MemN2N RBI + FP; within each, π_acc = 0.5 / 0.1 / 0.01.

Supervision Type | Imitation | RBI | FP | RBI + FP
1 - Imitating an Expert Student | 80 / 80 / 80 | 80 / 80 / 80 | 24 / 23 / 24 | 77 / 77 / 77
2 - Positive and Negative Feedback | 46 / 29 / 27 | 52 / 32 / 26 | 48 / 34 / 24 | 68 / 53 / 34
3 - Answers Supplied by Teacher | 48 / 29 / 26 | 52 / 32 / 27 | 60 / 57 / 58 | 69 / 65 / 62
4 - Hints Supplied by Teacher | 47 / 29 / 26 | 51 / 32 / 28 | 58 / 58 / 42 | 70 / 54 / 32
5 - Supporting Facts Supplied by Teacher | 47 / 28 / 26 | 51 / 32 / 26 | 43 / 44 / 33 | 66 / 53 / 40
6 - Partial Feedback | 48 / 29 / 27 | 49 / 32 / 24 | 60 / 58 / 58 | 70 / 63 / 62
7 - No Feedback | 51 / 29 / 27 | 22 / 21 / 21 | 60 / 53 / 58 | 61 / 56 / 50
8 - Imitation + Feedback Mixture | 60 / 50 / 47 | 63 / 53 / 51 | 46 / 31 / 23 | 72 / 69 / 69
9 - Asking For Corrections | 48 / 29 / 27 | 52 / 34 / 26 | 67 / 52 / 44 | 68 / 52 / 39
10 - Asking For Supporting Facts | 49 / 29 / 27 | 52 / 34 / 27 | 51 / 44 / 35 | 69 / 53 / 36
Mean Accuracy | 52 / 36 / 34 | 52 / 38 / 34 | 52 / 45 / 40 | 69 / 60 / 50

• Imitation learning, ignoring rewards, is a poor learning strategy when imitating inaccurate answers, e.g. for π_acc < 0.5. For imitating an expert however (Task 1) it is hard to beat.

• Reward-based imitation (RBI) performs better when rewards are available, particularly in Table 1, but also degrades when they are too sparse, e.g. for π_acc = 0.01.

• Forward prediction (FP) is more robust and has stable performance at different levels of π_acc. However as it only predicts answers implicitly and does not make use of rewards it is outperformed by RBI on several tasks, notably Tasks 1 and 8 (because it cannot do supervised learning) and Task 2 (because it does not take advantage of positive rewards).

• FP makes use of dialog feedback in Tasks 3-5 whereas RBI does not. This explains why FP does better with useful feedback (Tasks 3-5) than without (Task 2), whereas RBI cannot.

• Supplying full answers (Task 3) is more useful than hints (Task 4), but hints still help FP more than just yes/no answers without extra information (Task 2).

• When positive feedback is sometimes missing (Task 6) RBI suffers, especially in Table 1. FP does not, as it does not use this feedback.

• One of the most surprising results of our experiments is that FP performs well overall, given that it does not use feedback, which we will attempt to explain subsequently. This is particularly evident on Task 7 (no feedback) where RBI has no hope of succeeding as it has no positive examples. FP on the other hand learns adequately.

• Tasks 9 and 10 are harder for FP as the question is not immediately before the feedback.

• Combining RBI and FP ameliorates the failings of each, yielding the best overall results.

One of the most interesting aspects of our results is that FP works at all without any rewards. In Task 2 it does not even "know" the difference between words like "yes" or "correct" vs. words like "wrong" or "incorrect", so why should it tend to predict actions that lead to a response like "yes, that's right"? This is because there is a natural coherence to predicting true answers that leads to greater accuracy in forward prediction. That is, you cannot predict a "right" or "wrong" response from the teacher if you don't know what the right answer is. In our experiments our policies π_acc sample negative answers equally, which may make learning simpler. We thus conducted an experiment on Task 2 (positive and negative feedback) of the bAbI dataset with a much more biased policy: it is the same as π_acc = 0.5 except when the policy predicts incorrectly there is probability 0.5 of choosing a random guess as before, and 0.5 of choosing the fixed answer bathroom. In this case the FP method obtains 68% accuracy, showing the method still works in this regime, although not as well as before.

6 Conclusion

We have presented a set of evaluation datasets and models for dialog-based language learning. The ultimate goal of this line of research is to move towards a learner capable of talking to humans, such that humans are able to effectively teach it during dialog. We believe the dialog-based language learning approach we described is a small step towards that goal.


Page 33

Learning through Dialogue Interactions by Asking Questions

Li et al., ICLR '17.

A large part of the slides borrowed from https://nlp.berkeley.edu/files/2016/11/jiwei_li_berkeley.pptx

Page 34

QA about Movies

Blade Runner, directed_by, Ridley Scott
Blade Runner, release_year, 1982
Blade Runner, written_by, Philip K. Dick

What year was the movie Blade Runner released? 1982

Page 35

When does a bot need to ask questions?

Teacher question: "What is Tom Hanks in?"

1. Text Clarification - query how to interpret the text of the teacher, e.g. "What do you mean?" / "Do you mean which movie did Tom Hanks star in?"

2. Knowledge Operation - query how to perform the reasoning steps necessary to answer, e.g. "Does it have something to do with Tom Hanks starring in Forrest Gump?"

3. Knowledge Acquisition - query to gain missing knowledge necessary to answer, e.g. "I don't know, what did he star in?"

Page 36

Case 1: Question Clarification

Task 1, Task 2: Question Paraphrasing

Page 37

Case 2: Knowledge Operation

Task 3 (ask for relevant knowledge), Task 4 (verification)

Page 38

Case 3: Knowledge Acquisition

Page 39

Case 3: Knowledge Acquisition

… Other questions / Other answers

Not in the KB

Page 40

1. Offline supervised settings

Page 41

Training

Input: … Other questions / Other answers

Output

Page 42

Input: … Other questions / Other answers

Output

MemN2N + Forward Prediction

Page 43

Training Settings:
1. Never Asking Questions (TrainQA)
2. Always Asking Questions (TrainAQ)

Page 44

Test Settings:
1. Never Asking Questions (TestQA)
2. Always Asking Questions (TestAQ)

Make predictions

Page 45

Results

1. Asking questions always helps at test time.

Page 46

Results

1. Asking questions always helps at test time.
2. Only asking questions at training time does not help.

Page 47

Results

1. Asking questions always helps at test time.
2. Only asking questions at training time does not help.
3. TrainAQ + TestAQ performs the best.

Page 48

Setting 2: Reinforcement Learning

Shall I ask a question ???

Page 49

Setting 2: Reinforcement Learning

Ask a question or not?

Reward structure:
• If the bot asks a question: it pays the question-asking cost (−CostAQ), then receives +1 if its final answer is correct and −1 otherwise.
• If the bot does not ask: no cost; it receives +1 if its final answer is correct and −1 otherwise.
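A sketch of how the per-episode reward described above could be computed (the function name and the value of the question-asking cost are illustrative, not from the paper):

```python
def episode_reward(asked_question, answer_correct, cost_aq):
    """Cumulative reward r(a1, a2) for one dialog episode: pay the question-asking
    cost if the bot asked, then +1 / -1 depending on the final answer."""
    reward = -cost_aq if asked_question else 0.0
    reward += 1.0 if answer_correct else -1.0
    return reward

# e.g. ask + correct answer: 1 - cost_aq; don't ask + wrong answer: -1
print(episode_reward(True, True, cost_aq=0.2))    # 0.8
print(episode_reward(False, False, cost_aq=0.2))  # -1.0
```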


Page 54



… (the knowledge base facts that the bot has access to), and outputs a label. We refer readers to the Appendix for more details about MemN2N.

Offline Supervised Settings: The first learning strategy we adopt is the reward-based imitation strategy (denoted vanilla-MemN2N) described in (Weston, 2016), where at training time the model maximizes the log likelihood probability of the correct answers the student gave (examples with incorrect final answers are discarded). Candidate answers are words that appear in the memories, which means the bot can only predict the entities that it has seen or known before.

We also use a variation of MemN2N called "context MemN2N" (Cont-MemN2N for short) where we replace each word's embedding with the average of its embedding (random for unseen words) and the embeddings of the other words that appear around it. We use both the preceding and following words as context, and the number of context words is a hyperparameter selected on the dev set.
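A minimal sketch of the context-averaged embedding idea (the window size, names and toy embeddings are illustrative; in the paper the number of context words is a hyperparameter tuned on the dev set and unseen words get random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
embed = {w: rng.normal(size=4) for w in ["the", "movie", "was", "great"]}  # toy embeddings

def context_embedding(words, i, embed, window=2):
    """Replace word i's embedding with the average of its own embedding and the
    embeddings of the words up to `window` positions before and after it."""
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    return np.mean([embed[w] for w in words[lo:hi]], axis=0)

sentence = ["the", "movie", "was", "great"]
print(context_embedding(sentence, 1, embed).shape)  # (4,)
```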

An issue with both vanilla-MemN2N and Cont-MemN2N is that the model only makes use of the bot's answers as signals and ignores the teacher's feedback. We thus propose to use a model that jointly predicts the bot's answers and the teacher's feedback (denoted as TrainQA (+FP)). The bot's answers are predicted using a vanilla-MemN2N and the teacher's feedback is predicted using the Forward Prediction (FP) model as described in (Weston, 2016). We refer the readers to the Appendix for the FP model details. At training time, the models learn to jointly predict the teacher's feedback and the answers with positive reward. At test time, the model will only predict the bot's answer.

For the TestModelAQ setting described in Section 4, the model needs to decide the question to ask. Again, we use a vanilla-MemN2N that takes as input the question and contexts, and outputs the question the bot will ask.

Online RL Settings: A binary vanilla-MemN2N (denoted as P_RL(Question)) is used to decide whether the bot should or should not ask a question, with the teacher replying if the bot does ask something. A second MemN2N is then used to decide the bot's answer, denoted as P_RL(Answer). P_RL(Answer) for QA and AQ are two separate models, which means the bot will use different models for final-answer prediction depending on whether it chooses to ask a question or not.⁷

We use the REINFORCE algorithm (Williams, 1992) to update P_RL(Question) and P_RL(Answer). For each dialogue, the bot takes two sequential actions (a₁, a₂): to ask or not to ask a question (denoted as a₁); and guessing the final answer (denoted as a₂). Let r(a₁, a₂) denote the cumulative reward for the dialogue episode, computed using Table 1. The gradient to update the policy is given by:

$$p(a_1, a_2) = P_{RL}(\text{Question})(a_1) \cdot P_{RL}(\text{Answer})(a_2)$$
$$\nabla J(\theta) \approx \nabla \log p(a_1, a_2)\,[r(a_1, a_2) - b] \qquad (1)$$

where b is the baseline value, which is estimated using another MemN2N model that takes as input the query x and memory C, and outputs a scalar b denoting the estimation of the future reward. The baseline model is trained by minimizing the mean squared loss between the estimated reward b and the actual cumulative reward r, ||r − b||². We refer the readers to (Ranzato et al., 2015; Zaremba & Sutskever, 2015) for more details. The baseline estimator model is independent from the policy models and the error is not backpropagated back to them.
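To make eq. (1) concrete, here is a toy numpy sketch of a single REINFORCE update (simple softmax policies stand in for the two MemN2N policies, and the baseline is a fixed scalar rather than a separate MemN2N; all names, sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Stand-ins for P_RL(Question) and P_RL(Answer): logits of two softmax policies.
q_logits = rng.normal(size=2)   # a1: 0 = do not ask, 1 = ask a question
a_logits = rng.normal(size=5)   # a2: which of 5 candidate answers to give
lr, b, cost_aq = 0.1, 0.0, 0.2  # learning rate, baseline, question-asking cost (toy values)

# Sample one dialogue episode.
p_q = softmax(q_logits)
a1 = rng.choice(2, p=p_q)
p_a = softmax(a_logits)
a2 = rng.choice(5, p=p_a)
answer_correct = (a2 == 3)      # pretend index 3 is the correct answer
r = (1.0 if answer_correct else -1.0) - (cost_aq if a1 == 1 else 0.0)

# REINFORCE: grad log p(a1, a2) * (r - b); log p factorizes over the two policies,
# and for a softmax policy d log p(a) / d logits = onehot(a) - p.
q_logits += lr * (np.eye(2)[a1] - p_q) * (r - b)   # gradient ascent on expected reward
a_logits += lr * (np.eye(5)[a2] - p_a) * (r - b)
```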

In practice, we find the following training strategy yields better results: first train only P_RL(Answer), updating gradients only for the policy that predicts the final answer. After the bot's final-answer policy is sufficiently learned, train both policies in parallel.⁸ This has a real-world analogy where the bot first learns the basics of the task, and then learns to improve its performance via a question-asking policy tailored to the user's patience (represented by cost_AQ) and its own ability to answer questions.

⁷ An alternative is to train one single model for final answer prediction in both AQ and QA cases, similar to the TrainMix setting in the supervised learning setting. But we find training AQ and QA separately for the final answer prediction yields a slightly better result than the single model setting.

⁸ We implement this by running 16 epochs in total, updating only the model's policy for final answers in the first 8 epochs while updating both policies during the second 8 epochs. We pick the model that achieves the best reward on the dev set during the final 8 epochs. Due to relatively large variance for RL models, we repeat each task 5 times and keep the best model on each task.


[Diagram: one MemN2N for question asking, another MemN2N for answer prediction]

Page 55

Summary of results:

Asking questions helps improve performance.

Question asking gives better bots at question answering than not asking!

Increased question-asking cost decreases the chance of the student asking.

As the probability of question-asking declines, the accuracy for poor and medium students drops.

As the AQ cost increases gradually, good students will stop asking questions earlier than the medium and poor students.