
Multi-Level Memory for Task Oriented Dialogs



Revanth Reddy¹, Danish Contractor², Dinesh Raghu², Sachindra Joshi²

¹Indian Institute of Technology, Madras    ²IBM Research AI, New Delhi

Motivation

• End-to-end task-oriented dialog systems use neural memory architectures to incorporate external knowledge.

• Current models break down the external KB results into the form of Subject-Relation-Object triples.

• This makes it hard for the memory reader to infer relationships across otherwise connected attributes.

• Existing models like Mem2Seq use a shared memory for copying entities from the dialog context as well as from the KB results, thereby making inference harder.

Figure: Sample dialog and its corresponding external KB.


Figure: Triple store in current models.

• We separate the memory used to store tokens from the input context and the results from the knowledge base.

• We propose a novel multi-level memory architecture which encodes the natural hierarchy exhibited in the KB results.

• We store the queries, their results, and the corresponding attributes in different levels; a small code sketch after the figure below illustrates this layout.

Figure: Multi-Level memory in our model.
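
For concreteness, the sketch below shows one way the three levels could be laid out as a nested structure: queries at the top level, the result entries returned for each query at the second level, and the attribute (key-value) cells of each result at the third. The restaurant query and values are purely hypothetical illustrations, not taken from the datasets used in this work.

    # A hypothetical multi-level KB memory (illustrative values only):
    #   level 1 - queries issued to the KB,
    #   level 2 - result entries returned for each query,
    #   level 3 - attribute (key-value) cells inside each result.
    kb_memory = [
        {
            "query": {"cuisine": "italian", "area": "centre"},        # level 1
            "results": [                                               # level 2
                {"name": "trattoria_1", "price_range": "moderate",     # level 3
                 "address": "12 market street"},
                {"name": "trattoria_2", "price_range": "expensive",
                 "address": "3 mill road"},
            ],
        },
    ]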

Model Architecture

Figure: Model architecture with multi-level memory

Memory Representation

Every query $q_i$ is a set of key-value pairs $\{k^{q_i}_a : v^{q_i}_a,\; 1 \le a \le n_{q_i}\}$. $q_i$ is represented by $q^v_i$, a bag of words over the word embeddings of the values $v^{q_i}_a$ in $q_i$.

Each result $r_{ij}$ is likewise a set of slot-value pairs $\{k^{r_{ij}}_a : v^{r_{ij}}_a,\; 1 \le a \le n_{r_{ij}}\}$. $r_{ij}$ is represented by $r^v_{ij}$, a bag of words over the word embeddings of the values $v^{r_{ij}}_a$ in $r_{ij}$.
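
As a minimal sketch (not the authors' implementation), these bag-of-words representations can be computed by averaging the word embeddings of the values stored in each entry. The embedding table, its dimensions, and the kb_memory structure referenced in the trailing comment are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    EMB_DIM = 64
    vocab = {}                                      # word -> row in the embedding table
    emb_table = rng.normal(size=(10_000, EMB_DIM))  # hypothetical word-embedding matrix

    def embed(word):
        """Look up (or lazily assign) the embedding row of a single word."""
        idx = vocab.setdefault(word, len(vocab))
        return emb_table[idx]

    def bag_of_words(entry):
        """Represent a query q_i or a result r_ij by a bag of words over the
        embeddings of its values (read here as the mean of the value embeddings)."""
        vecs = [embed(tok) for value in entry.values() for tok in str(value).split()]
        return np.mean(vecs, axis=0)

    # e.g. q_i^v and r_ij^v for the hypothetical kb_memory shown earlier:
    # q_vec = bag_of_words(kb_memory[0]["query"])
    # r_vec = bag_of_words(kb_memory[0]["results"][0])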

Equations

Query Attention: The first-level attention, applied over the query representations.

$$\alpha_i = \frac{\exp\left(w_2^\top \tanh\left(W_4[d_t, h_t, q^v_i]\right)\right)}{\sum_i \exp\left(w_2^\top \tanh\left(W_4[d_t, h_t, q^v_i]\right)\right)}$$

Result Attention: The second-level attention, applied over the result representations.

$$\beta_{ij} = \frac{\exp\left(w_3^\top \tanh\left(W_5[d_t, h_t, r^v_{ij}]\right)\right)}{\sum_j \exp\left(w_3^\top \tanh\left(W_5[d_t, h_t, r^v_{ij}]\right)\right)}$$

Result Cell Attention: The third-level attention, applied over the keys in each result.

$$\gamma_{ijl} = \frac{\exp\left(w_4^\top \tanh\left(W_6[d_t, h_t, \phi^{emb}(k^{r_{ij}}_l)]\right)\right)}{\sum_l \exp\left(w_4^\top \tanh\left(W_6[d_t, h_t, \phi^{emb}(k^{r_{ij}}_l)]\right)\right)}$$
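
All three levels share the same scoring pattern: a feed-forward score over the decoder state d_t, the context vector h_t, and the representation at that level, normalised with a softmax within the level. The sketch below illustrates that shared pattern with assumed dimensions and randomly initialised weights; it is an illustration of the equations above, not the released implementation.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def level_attention(d_t, h_t, reps, w, W):
        """One memory level's attention: score_k = w^T tanh(W [d_t; h_t; rep_k]),
        normalised with a softmax within the level."""
        scores = np.array([w @ np.tanh(W @ np.concatenate([d_t, h_t, r])) for r in reps])
        return softmax(scores)

    # Hypothetical sizes, only to make the sketch runnable.
    D = 32
    rng = np.random.default_rng(1)
    d_t, h_t = rng.normal(size=D), rng.normal(size=D)
    q_reps = rng.normal(size=(3, D))                      # q_i^v for three queries
    W4, w2 = rng.normal(size=(D, 3 * D)), rng.normal(size=D)

    alpha = level_attention(d_t, h_t, q_reps, w2, W4)     # first level (queries)
    # beta_ij and gamma_ijl reuse the same pattern, with (w3, W5) over the result
    # representations and (w4, W6) over the key embeddings phi_emb(k).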

KB Copy Distribution: The product of the attention scores over the three levels gives the final attention score of the values in each result.

$$P_{kb}(y_t = w) = \sum_{ijl:\, v^{r_{ij}}_l = w} \alpha_i \, \beta_{ij} \, \gamma_{ijl}$$
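
A small sketch of this scatter step, assuming for illustration that every result cell holds a single vocabulary id:

    import numpy as np

    def kb_copy_distribution(alpha, beta, gamma, cell_value_ids, vocab_size):
        """P_kb(y_t = w): sum alpha_i * beta_ij * gamma_ijl over all cells (i, j, l)
        whose value is the word w. cell_value_ids[i][j][l] is the vocabulary id of
        the value stored in result cell (i, j, l)."""
        p_kb = np.zeros(vocab_size)
        for i, a_i in enumerate(alpha):
            for j, b_ij in enumerate(beta[i]):
                for l, g_ijl in enumerate(gamma[i][j]):
                    p_kb[cell_value_ids[i][j][l]] += a_i * b_ij * g_ijl
        return p_kb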

Context Copy Distribution: Obtained from the attention scores over the input dialog context.

$$P_{con}(y_t = w) = \sum_{ij:\, w_{ij} = w} a_{ij}$$

Copy Distribution: The copy distribution over the memory is a gated sum of the copy distributions over the KB and the context.

$$P_c(y_t) = g_2 \, P_{kb}(y_t) + (1 - g_2) \, P_{con}(y_t)$$

Output Distribution: The final output distribution is a gated sum of the generate distribution and the copy distribution over the memory.

$$P(y_t) = g_1 \, P_g(y_t) + (1 - g_1) \, P_c(y_t)$$
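
The final assembly is two convex combinations. A minimal sketch, treating the gates g_1 and g_2 as given scalars in [0, 1]:

    def output_distribution(p_gen, p_kb, p_con, g1, g2):
        """P_c(y_t) = g2 * P_kb + (1 - g2) * P_con   (copy distribution)
           P(y_t)   = g1 * P_gen + (1 - g1) * P_c    (final output distribution)
        p_gen, p_kb and p_con are distributions over the same (extended) vocabulary;
        g1 and g2 are scalar gates in [0, 1]."""
        p_copy = g2 * p_kb + (1.0 - g2) * p_con
        return g1 * p_gen + (1.0 - g1) * p_copy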

Experiments

Our experiments on three publicly available datasets show a substantial improvement of 15-25% in both entity F1 and BLEU scores compared to existing state-of-the-art approaches.

Model                    | InCar BLEU | InCar F1 | Calendar F1 | Weather F1 | Navigate F1 | CamRest BLEU | CamRest F1 | Frames BLEU | Frames F1
Attn seq2seq             | 11.3 | 28.2 | 36.9 | 35.7 | 10.1 |  7.7 | 25.3 |  3.7 | 16.2
Ptr-UNK                  |  5.4 | 20.4 | 22.1 | 24.6 | 14.6 |  5.1 | 40.3 |  5.6 | 25.8
KVRet                    | 13.2 | 48.0 | 62.9 | 47.0 | 41.3 | 13.0 | 36.5 | 10.7 | 31.7
Mem2Seq                  | 11.8 | 40.9 | 61.6 | 39.6 | 21.7 | 14.0 | 52.4 |  7.5 | 28.5
Multi-level Memory Model | 17.1 | 55.1 | 68.3 | 53.3 | 44.5 | 15.9 | 61.4 | 12.4 | 39.7

Table: Comparison of our model with baselines

We investigate the gains made by (i) using a separate memory for the context and the KB triples, and (ii) replacing KB triples with a multi-level memory.

Model                                        | InCar BLEU | InCar F1 | Calendar F1 | Weather F1 | Navigate F1 | CamRest BLEU | CamRest F1 | Frames BLEU | Frames F1
Unified Context and KB Memory (Mem2Seq)      | 11.8 | 40.9 | 61.6 | 39.6 | 21.7 | 14.0 | 52.4 |  7.5 | 28.5
Separate Context and KB Memory               | 14.3 | 44.2 | 56.9 | 54.1 | 24.0 | 14.3 | 55.0 | 12.1 | 36.5
+ Replace KB Triples with Multi-level Memory | 17.1 | 55.1 | 68.3 | 53.3 | 44.5 | 15.9 | 61.4 | 12.4 | 39.7

Table: Ablation study: Effect of separate memory and multi-level memory design.

In a human evaluation study, users were asked to score the models on the accuracy of the information in the response and the quality of the language.

Model     | CamRest Info. | CamRest Lang. | CamRest MRR | Frames Info. | Frames Lang. | Frames MRR
KVRet     | 2.49 | 4.38 | 0.57 | 2.42 | 3.31 | 0.64
Mem2Seq   | 2.48 | 3.72 | 0.51 | 1.78 | 2.55 | 0.50
Our Model | 3.62 | 4.48 | 0.76 | 2.45 | 3.93 | 0.69

We visualize the attention weights to understand how the model reasons over the memory. The figures below show the attention heatmaps when generating the word '8.86'.

(a) Comparison of the responses generated by various models on an example from the Frames dataset.

(b) Attention over the multi-level KB memory.

(c) Attention scores over words in the context, in decreasing order.

Gate                           | Value
g_1 (generate from vocabulary) | 0.08
g_2 (copy from KB memory)      | 0.99

(d) Probability values of the gates

Conclusion

• Our model separates the context and KB memory and combines the attention over them using a gating mechanism.

• The multi-level KB memory reflects the natural hierarchy present in KB results. This also allows our model to support non-sequential dialogs.

• In future work, we would like to incorporate better modeling of latent dialog frames so as to improve the attention signal on our multi-level memory.

• Model performance can be improved by capturing user intent better in the case of non-sequential dialog flow.