

How Time Matters: Learning Time-Decay Attention for Contextual Spoken Language Understanding in Dialogues

Shang-Yu Su† Pei-Chieh Yuan† Yun-Nung Chen‡

†Department of Electrical Engineering
‡Department of Computer Science and Information Engineering

National Taiwan University
{r05921117,b03901134}@ntu.edu.tw [email protected]

Abstract

Spoken language understanding (SLU) is an essential component in conversational systems. Most SLU components treat each utterance independently, and the following components then aggregate the multi-turn information in separate phases. In order to avoid error propagation and effectively utilize contexts, prior work leveraged history for contextual SLU. However, most previous models only paid attention to the related content in history utterances, ignoring their temporal information. In dialogues, it is intuitive that the most recent utterances are more important than the least recent ones; in other words, time-aware attention should be decaying. Therefore, this paper designs and investigates various types of time-decay attention at the sentence level and speaker level, and further proposes a flexible universal time-decay attention mechanism. The experiments on the benchmark Dialogue State Tracking Challenge (DSTC4) dataset show that the proposed time-decay attention mechanisms significantly improve the state-of-the-art model's contextual understanding performance.¹

1 Introduction

Spoken dialogue systems that can help users solve complex tasks, such as booking a movie ticket, have become an emerging research topic in artificial intelligence and natural language processing. With a well-designed dialogue system as an intelligent personal assistant, people can accomplish certain tasks more easily via natural language interactions. Today, there are several virtual intelligent assistants, such as Apple's Siri, Google's Home, Microsoft's Cortana, and Amazon's Echo. Recent advances in deep learning have inspired many applications of neural models to dialogue systems (Wen et al., 2017; Bordes et al., 2017; Dhingra et al., 2017; Li et al., 2017).

¹ The source code is at: https://github.com/MiuLab/Time-Decay-SLU.

A key component of a dialogue system is the spoken language understanding (SLU) module, which parses user utterances into semantic frames that capture the core meaning (Tur and De Mori, 2011). A typical SLU pipeline first decides the domain given the input utterance and, based on the domain, predicts the intent and fills the associated slots corresponding to a domain-specific semantic template, where each utterance is treated independently (Hakkani-Tur et al., 2016; Chen et al., 2016b,a; Wang et al., 2016). To overcome error propagation and further improve understanding performance, contextual information has been shown useful (Bhargava et al., 2013; Xu and Sarikaya, 2014; Chen et al., 2015; Sun et al., 2016). Prior work incorporated the dialogue history into recurrent neural networks (RNN) for improving domain classification, intent prediction, and slot filling (Xu and Sarikaya, 2014; Shi et al., 2015; Weston et al., 2015; Chen et al., 2016c). Recently, Chi et al. (2017) and Zhang et al. (2018) demonstrated that modeling speaker role information can capture the notable variance in speaking habits during conversations in order to benefit understanding.

In addition, neural models incorporating attention mechanisms have had great successes in machine translation (Bahdanau et al., 2014), image captioning (Xu et al., 2015), and various other tasks. Attentional models have been successful because they separate two different concerns: 1) deciding which input contexts are most relevant to the output and 2) actually predicting an output given the most relevant inputs. For example, the highlighted current utterance from the tourist, "uh on august", in the conversation of Figure 1 responds to a question about WHEN.


Guide: and you were saying that you wanted to come to singapore (FOL-CONFIRM; FOL-INFO)
Guide: uh maybe can i have a little bit more details like uh when will you be coming (QST-INFO; QST-WHEN)
Guide: and like who will you be coming with (QST-WHO)
Tourist: uh yes (FOL-CONFIRM)
Tourist: um i'm actually planning to visit (RES-WHEN)
Tourist: uh on august (RES-WHEN)

Figure 1: The human-human conversational utterances and their associated semantic labels from DSTC4.

The content-aware contexts that can help current understanding are the first two utterances from the guide: "and you were saying that you wanted to come to singapore" and "uh maybe can i have a little bit more details like uh when will you be coming". Previous work proposed an end-to-end time-aware attention network to leverage both contextual and temporal information for spoken language understanding and achieved significant improvement, showing that temporal attention can guide the attention effectively (Chen et al., 2017). However, their time-aware attention function is an inflexible hand-crafted setting, a fixed function of time for assessing the attention.

This paper focuses on investigating flexible time-aware attention mechanisms in neural models with contextual information and speaker role modeling for language understanding. The contributions are three-fold:

• This paper investigates different time-aware attention mechanisms and provides guidance for future research on designing time-aware attention functions.

• This paper proposes an end-to-end learnable universal time-decay mechanism with great flexibility for modeling temporal information in diverse dialogue contexts.

• The proposed model achieves state-of-the-art understanding performance on the dialogue benchmark DSTC4 dataset.

2 The Proposed Framework

The model architecture is illustrated in Figure 2. First, the previous utterances are fed into the contextual model and encoded into the history summary, and then the summary vector and the current utterance are integrated to help understanding. The contextual model leverages the attention mechanisms highlighted in red, which implement different attention functions at the sentence and speaker-role levels. The whole model is trained in an end-to-end fashion, where the history summary vector and the attention weights are automatically learned based on the downstream SLU task. The objective of the proposed model is to optimize the conditional probability of the intents given the current utterance, p(y | x), by minimizing the cross-entropy loss.

2.1 Speaker Role Contextual Language Understanding

Given the current utterance $x = \{w_t\}_{t=1}^{T}$, the goal is to predict the user intents of x, which include the speech acts and associated attributes. We apply a bidirectional long short-term memory (BLSTM) model (Schuster and Paliwal, 1997) for history encoding in order to learn the probability distribution of the user intents.

$v_{cur} = \text{BLSTM}(x, W_{his} \cdot v_{his})$,  (1)

$o = \text{sigmoid}(W_{SLU} \cdot v_{cur})$,  (2)

where $W_{his}$ is a weight matrix, $v_{his}$ is the history summary vector, $v_{cur}$ is the context-aware vector of the current utterance encoded by the BLSTM, and o is the intent distribution. Note that this is a multi-label, multi-class classification, so the sigmoid function is employed for modeling the distribution after a dense layer. The user intent labels are decided based on whether the value is higher than a threshold tuned on the development set.
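As a concrete illustration, the sketch below implements Eqs. (1)-(2) in PyTorch. The names and details are our assumptions, since the paper does not pin them down: the history summary conditions the BLSTM here as a projected initial hidden state, and `num_intents` is a placeholder label count.

```python
import torch
import torch.nn as nn

class ContextualSLU(nn.Module):
    """Sketch of Eqs. (1)-(2): a BLSTM over the current utterance,
    conditioned on the history summary, with a sigmoid multi-label
    intent layer on top."""
    def __init__(self, emb_dim=200, hid_dim=128, num_intents=64):
        super().__init__()
        self.blstm = nn.LSTM(emb_dim, hid_dim,
                             bidirectional=True, batch_first=True)
        self.W_his = nn.Linear(2 * hid_dim, hid_dim)   # W_his . v_his in Eq. (1)
        self.W_slu = nn.Linear(2 * hid_dim, num_intents)

    def forward(self, x, v_his):
        # x: (batch, T, emb_dim) word embeddings of the current utterance
        # v_his: (batch, 2 * hid_dim) history summary from the contextual model
        h0 = torch.tanh(self.W_his(v_his))
        h0 = h0.unsqueeze(0).repeat(2, 1, 1)   # one initial state per direction
        c0 = torch.zeros_like(h0)
        _, (h_n, _) = self.blstm(x, (h0, c0))
        v_cur = torch.cat([h_n[0], h_n[1]], dim=-1)   # context-aware vector
        return torch.sigmoid(self.W_slu(v_cur))       # Eq. (2): intent distribution
```

At prediction time, an intent label is emitted whenever its sigmoid score exceeds the development-set-tuned threshold, mirroring the decision rule described above.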

Considering that speaker role information has been shown to be useful for better understanding in complex dialogues (Chi et al., 2017), we follow the prior work in utilizing the contexts from two roles to learn history summary representations, $v_{his}$ in (1), in order to leverage the role-specific contextual information. Each role-dependent recurrent unit $\text{BLSTM}_{role_i}$ receives corresponding inputs, $x_{t,role_i}$, which include multiple utterances $u_i$ ($i = 1, ..., t-1$) preceding the current utterance $u_t$ from the specific role, $role_i$, that have been processed by an encoder model.


Figure 2: Illustration of the proposed time-aware attention contextual model with three types of time-decay attention functions (convex, linear, and concave).

$v_{his} = \sum_{role} v_{his,role} = \sum_{role} \text{BLSTM}_{role}(x_{t,role})$,  (3)

where $x_{t,role}$ are one-hot encoded vectors that represent the annotated intent and attribute features. Note that this model requires the ground-truth annotations of history utterances for training and testing. Therefore, each role-based contextual module focuses on modeling role-dependent goals and speaking style, and $v_{cur}$ from (1) would contain role-based contextual information.
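Continuing the sketch, Eq. (3) can be realized with one BLSTM per speaker role whose final states are summed; the dictionary-based interface below is our assumption.

```python
import torch
import torch.nn as nn

class RoleHistoryEncoder(nn.Module):
    """Sketch of Eq. (3): a role-dependent BLSTM per speaker, with the
    per-role summaries summed into the history vector v_his."""
    def __init__(self, feat_dim, hid_dim=128, roles=("guide", "tourist")):
        super().__init__()
        self.blstms = nn.ModuleDict({
            role: nn.LSTM(feat_dim, hid_dim,
                          bidirectional=True, batch_first=True)
            for role in roles
        })

    def forward(self, history):
        # history: {role: (batch, n_utts, feat_dim)} encoded history
        # utterances, e.g., one-hot intent/attribute features as in the text
        summaries = []
        for role, x_role in history.items():
            _, (h_n, _) = self.blstms[role](x_role)
            summaries.append(torch.cat([h_n[0], h_n[1]], dim=-1))  # v_his,role
        return torch.stack(summaries).sum(dim=0)                   # v_his
```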

2.2 Neural Attention Mechanism

One of the earliest works with a memory component applied to language processing is memory networks (Weston et al., 2015; Sukhbaatar et al., 2015), which encode mentioned facts into vectors and store them in the memory for question answering. The idea is to encode important knowledge and store it in memory for future usage with attention mechanisms. Attention mechanisms allow neural network models to selectively pay attention to specific parts, and various tasks have shown their effectiveness (Xiong et al., 2016; Chen et al., 2016c). Recent work showed that two attention types (content-aware and time-aware) and two attention levels (sentence-level and role-level) significantly improve the understanding performance for complex dialogues. This paper focuses on expanding the time-aware attention based on an investigation of different time-decay functions, and further on learning a universal time-decay function automatically. We apply the time-aware attention mechanisms at two levels, sentence level and role level; Section 3 details the design and analysis of time-aware attention.

For the sentence-level attention, before feeding into the contextual module, each history vector is weighted by its time-aware attention $\alpha_{u_j}$, replacing (3):

$v^{U}_{his} = \sum_{role} \text{BLSTM}_{role}(x_{t,role}, \{\alpha_{u_j} \mid u_j \in role\})$.

For the role-level attention, a dialogue is disassembled from a different perspective: which speaker's information is more important (Chi et al., 2017). The role-level attention decides how much to attend to different speaker roles' contexts ($v_{his,role}$) in order to better understand the current utterance. The importance of a speaker given the contexts can be approximated by the maximum attention value among the speaker's utterances, $\alpha_{role} = \max \alpha_{u_j}$, where $u_j$ ranges over all contextual utterances from the speaker. With the role-level attention, the sentence-level history from (3) can be rewritten as

$v^{R}_{his} = \sum_{role} \alpha_{role} \cdot v_{his,role}$  (4)

for combining role-dependent history vectors with their attention weights.
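In code, the two attention levels might look as follows; how exactly the sentence-level weights enter the BLSTM is left open by the notation for $v^{U}_{his}$, so scaling each encoded history utterance before the role BLSTM is our assumption.

```python
import torch

def weight_history(x_role, alphas):
    # Sentence-level: scale each history utterance vector by its
    # (normalized) time-decay attention weight alpha_{u_j}.
    # x_role: (batch, n_utts, feat_dim); alphas: (n_utts,)
    return x_role * alphas.view(1, -1, 1)

def combine_role_summaries(v_by_role, alphas_by_role):
    # Role-level, Eq. (4): alpha_role is the max attention among the
    # speaker's utterances; v_his^R = sum_role alpha_role * v_his,role
    return sum(alphas_by_role[role].max() * v_by_role[role]
               for role in v_by_role)
```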

2.3 End-to-End Training

The objective is to optimize SLU performance, predicting multiple speech acts and attributes as described in Section 2.1. In the proposed model, all encoders, prediction models, and attention weights can be automatically learned in an end-to-end manner.

3 Time-Decay Attention Learning

The decaying function curves can be separated into three types: convex, linear, and concave, illustrated in the top-right part of Figure 2; each type of time-decay function expresses a different time-aware perspective on dialogue contexts. Note that all attention weights are normalized such that they sum to 1.

3.1 Convex Time-Decay Attention

A convex curve is also known as "concave upward": in a simple 2D Cartesian coordinate system (x, y), a convex curve f(x) means that the slope f′(x) increases as x grows. Intuitively, recent utterances contain more salient information, and the salience decreases very quickly as the distance increases; therefore, we introduce a time-aware attention mechanism that computes attention weights explicitly according to the time of utterance occurrence. We first define the time difference between the current utterance and the preceding sentence $u_i$ as $d(u_i)$, and then simply use its reciprocal to formulate a convex time-decay function:

$\alpha^{conv}_{u_i} = \frac{1}{a \cdot d(u_i)^b}$,  (5)

where a and b are scalar parameters. The increasing slopes of the decay curve assert that the importance of utterances should be attenuated rapidly, and the importance of an earlier history sentence would be considerably compressed. Note that Chen et al. (2017) used a fixed convex time-decay function (a = 1, b = 1).

3.2 Linear Time-Decay Attention

A linearly decaying time-aware attention function should also be taken into consideration. In a simple 2D Cartesian coordinate system (x, y), the slope of a linear function remains constant as x changes. That is, the importance of preceding utterances declines linearly as the distance between the previous utterance and the target utterance becomes larger:

$\alpha^{lin}_{u_i} = \max(e \cdot d(u_i) + f,\ 0)$,  (6)

where e and f are the slope and the α-intercept of the linear function. Note that when the distance $d(u_i)$ is larger than $-f/e$, we assign the attention value 0.

3.3 Concave Time-Decay Attention

A concave curve, also called "concave downward", is the opposite of a convex curve: in a simple 2D Cartesian coordinate system (x, y), a concave curve f(x) means that the slope f′(x) decreases as x grows. Intuitively, the attention weight decreases relatively slowly as the distance increases. To implement this idea, we design a Butterworth-filter-like low-distance-pass filter (Butterworth, 1930) that resembles a concave time-decay function at the beginning of the curve:

$\alpha^{conc}_{u_i} = \frac{1}{1 + (d(u_i)/D_0)^n}$,  (7)

where $D_0$ is the cut-off distance and n is the order of the filter. The decreasing slopes of the decay curve assert that the importance of utterances should weaken gradually, while the importance of an earlier history sentence would still be considerably compressed. Moreover, this curve is more likely to preserve the information in multiple recent utterances instead of focusing only on the most recent one.
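To make the three hand-crafted curves concrete, the sketch below evaluates them over seven history turns, using the hand-crafted hyper-parameters reported in Section 4 (a = 1, b = 1, e = -0.125, f = 1, D0 = 5, n = 3) and the sum-to-1 normalization noted at the start of this section:

```python
import numpy as np

def convex(d, a=1.0, b=1.0):
    # Eq. (5): reciprocal decay; salience drops quickly with distance
    return 1.0 / (a * d ** b)

def linear(d, e=-0.125, f=1.0):
    # Eq. (6): constant-slope decay, clipped to 0 beyond d = -f/e
    return np.maximum(e * d + f, 0.0)

def concave(d, D0=5.0, n=3.0):
    # Eq. (7): Butterworth-like low-distance-pass filter; slow initial decay
    return 1.0 / (1.0 + (d / D0) ** n)

d = np.arange(1, 8, dtype=float)   # distances of 7 history utterances
for curve in (convex, linear, concave):
    w = curve(d)
    print(curve.__name__, np.round(w / w.sum(), 3))  # normalized weights
```

With these values, the convex weights concentrate on the most recent turn, while the concave weights stay nearly flat over the first few turns before dropping.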

3.4 Universal Time-Decay Attention

As mentioned previously, there are three types of decaying curves: convex, linear, and concave; each type represents a different perspective on dialogue contexts and models different contextual patterns. However, because the contextual patterns may be diverse, a single type of function may not fit the complex behavior well. Hence, we propose a flexible and universal time-decay attention function by composing the three types of attentional curves:

$\alpha^{univ}_{u_i} = w_1 \cdot \alpha^{conv}_{u_i} + w_2 \cdot \alpha^{lin}_{u_i} + w_3 \cdot \alpha^{conc}_{u_i} = \frac{w_1}{a \cdot d(u_i)^b} + w_2 \cdot (e \cdot d(u_i) + f) + \frac{w_3}{1 + (d(u_i)/D_0)^n}$,  (8)

where $w_i$ are the weights of the time-decay attention functions. Because the framework can be trained in an end-to-end manner, all parameters ($w_i$, a, b, e, f, $D_0$, n) can be automatically learned to construct a flexible time-decay function. With the combination of different curves and the adjustable weights, the proposed universal time-decay attention function has the flexibility of not being strictly decaying; that is, the model can automatically learn a properly oscillating curve in order to model the diverse and complex contextual patterns using the attention mechanism.
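A minimal end-to-end trainable sketch of Eq. (8): all seven parameters are registered as learnable tensors and initialized to the hand-crafted values used in Section 4, so that gradients from the downstream cross-entropy SLU loss shape the curve. The module interface is our assumption.

```python
import torch
import torch.nn as nn

class UniversalTimeDecay(nn.Module):
    """Eq. (8): learnable combination of convex, linear, and concave decay."""
    def __init__(self):
        super().__init__()
        # Initialized to the hand-crafted setting
        # (w_i = 1, a = 1, b = 1, e = -0.125, f = 1, D0 = 5, n = 3)
        self.w = nn.Parameter(torch.ones(3))
        self.a = nn.Parameter(torch.tensor(1.0))
        self.b = nn.Parameter(torch.tensor(1.0))
        self.e = nn.Parameter(torch.tensor(-0.125))
        self.f = nn.Parameter(torch.tensor(1.0))
        self.D0 = nn.Parameter(torch.tensor(5.0))
        self.n = nn.Parameter(torch.tensor(3.0))

    def forward(self, d):
        # d: (n_utts,) distances between the history utterances and the
        # current utterance
        conv = 1.0 / (self.a * d ** self.b)
        lin = torch.clamp(self.e * d + self.f, min=0.0)
        conc = 1.0 / (1.0 + (d / self.D0) ** self.n)
        alpha = self.w[0] * conv + self.w[1] * lin + self.w[2] * conc
        return alpha / alpha.sum()   # normalized attention weights
```

Since the weights $w_i$ are unconstrained, the learned combination need not be strictly decaying, which is exactly the flexibility described above.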


LU Model                   | Sentence-Level Attention        | Role-Level Attention
                           | Conv.   Lin.   Conc.  Univ.     | Conv.   Lin.    Conc.   Univ.
(a) DSTC4-Best²            | 61.4                            |
(b) Naïve LU               | 70.18                           |
(c) No Attention Context   | 74.52                           |
(d) Content-Aware Context  | 73.69                           | 74.28
(e) Time-Aware (Hand)      | 75.95†  74.12  74.26  76.41†    | 76.73†  76.11†  76.01†  76.68†
(f) Time-Aware (E2E)       | 76.04†  74.25  74.32  76.67†    | 76.69†  76.26†  76.08†  76.75†
(g) Content+Time (Hand)    | 74.71†  73.40  73.28  75.48†    | 76.70†  76.24†  76.03†  76.61†
(h) Content+Time (E2E)     | 74.94†  73.79  73.47  75.83†    | 76.51†  75.76†  76.22†  76.74†

Table 1: The understanding performance reported on F-measure in DSTC4, where the context length is 7 for each speaker (%). † indicates significant improvement compared to all baseline methods. Hand: hand-crafted; E2E: end-to-end trainable.


4 Experiments

To evaluate the proposed model, we conduct language understanding experiments on human-human conversational data.

4.1 Setup

The experiments are conducted using the DSTC4 dataset, which consists of 35 dialogue sessions on touristic information for Singapore, collected from Skype calls between 3 tour guides and 35 tourists; these 35 dialogues sum up to 31,034 utterances and 273,580 words (Kim et al., 2016). All recorded dialogues, with a total length of 21 hours, have been manually transcribed and annotated with speech acts and semantic labels at each turn level. The speaker information (guide and tourist) is also provided. Unlike the previous DSTC series, which collected human-computer dialogues, human-human dialogues contain rich and complex human behaviors and bring much difficulty to all the tasks. Given the complex dialogue patterns and longer contexts, DSTC4 is a suitable benchmark dataset for evaluation. We randomly selected 28 dialogues as the training set, 5 dialogues as the testing set, and 2 dialogues as the validation set.

We choose mini-batch Adam as the optimizer with a batch size of 256 examples. The size of each hidden recurrent layer is 128. We use pre-trained 200-dimensional GloVe word embeddings (Pennington et al., 2014). We apply only 30 training epochs without any early stopping approach. We focus on predicting multiple labels including intents and attributes, so the evaluation metric is an average F1 score, balancing recall and precision in each utterance. The experiments are shown in Table 1, where we report the average results over five runs. We include the best understanding performance (row (a)) from the participants of DSTC4 in IWSDS 2016 for reference (Kim et al., 2016). A one-tailed t-test is performed to validate the significance of improvement, and numbers with markers indicate significant improvement with p < 0.05.
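For reference, a sketch of the per-utterance F1 averaging we assume here; the official DSTC4 scorer may aggregate differently.

```python
def average_f1(pred_labels, gold_labels):
    """Average per-utterance F1 over label-set pairs (a sketch; the
    official DSTC4 scoring script may differ in details)."""
    scores = []
    for pred, gold in zip(pred_labels, gold_labels):
        tp = len(set(pred) & set(gold))
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / len(scores)

# e.g., average_f1([["RES-WHEN"]], [["RES-WHEN", "FOL-INFO"]]) -> 0.667
```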

4.2 Effectiveness of Time-Decay Attention

To evaluate the proposed time-decay attention, we compare the performance with the naïve LU model without any contextual information (row (b)), the contextual model without any attention mechanism (row (c)), and the one using the content-aware attention mechanism (row (d)), where the attention can be learned at the sentence and role levels. It is intuitive that the model without considering contexts (row (b)) performs much worse than the contextual ones for dialogue modeling. The proposed time-aware results are shown in rows (e)-(h), where rows (e)-(f) use only the time-aware attention while rows (g)-(h) model both content-aware and time-aware attention mechanisms together. It is obvious that almost all time-aware results are better than the three baselines.

In order to investigate the performance of various time-decay attention functions, for each curve we apply two settings: 1) Hand: hand-crafted hyper-parameters (rows (e) and (g)) and 2) E2E: end-to-end training of the parameters (rows (f) and (h)). In the hand-crafted setting, the hyper-parameters a = 1, b = 1, e = -0.125, f = 1, D0 = 5, n = 3 are adopted.³ Table 1 shows that among the three types of sentence-level time-decay attention, only the convex time-decay attention significantly outperforms the baselines, indicating that an unsuitable time-decay attention function is barely useful. For both settings, the convex functions perform best among the three types of time-decay functions. Also, the end-to-end trainable setting results in better performance in most cases.

For our proposed universal time-decay attention mechanism, the same settings are used: 1) composing fixed versions of the three types of time-decay functions weighted by learned parameters $w_i$ and 2) fully trainable parameters for all time-decay functions. These two settings provide different levels of flexibility in fitting dialogue contextual attention, and the experimental results show that both settings outperform all other time-decay attention functions.

For sentence-level attention, the end-to-end trainable universal time-decay attention achieves the best performance (rows (f) and (h)), where the flexible time-aware attention obtains a 2.9% relative improvement (76.67% vs. 74.52%) compared to the model without the attention mechanism (row (c)) and the model using content-aware attention only (row (d)). For role-level attention, all types of time-decay functions significantly improve the results. The probable reason is that modeling temporal importance for each sentence is more difficult and less accurate, while speaker roles in the dialogues provide informative cues for the model to connect the temporal importance of the same speakers together; therefore, the conversational patterns can additionally improve the understanding results. Further analysis is discussed in Section 4.3. Similarly, the best results also come from the end-to-end trainable universal time-decay function.

The significant improvement achieved by the universal functions indicates that our model can effectively learn a suitable attention function through this flexible setting and derive a proper curve that fits the temporal tendency, helping the model preserve the essence and drop unimportant parts of the dialogue contexts. To further investigate what the universal time-decay attention learns, we inspect the learned weights $w_i$ and find that the convex attention function almost dominates the whole function. In other words, our model automatically learns that the convex time-decay attention is more suitable for modeling contexts from the dialogue data than the other two types. Therefore, we can conclude that in complex dialogues, the recent utterances contain the majority of salient information for spoken language understanding, where the attention decay trend follows a convex curve.

³ The chosen parameters are based on domain knowledge about dialogue properties.

We analyze the impact of content-aware attention by comparing the results between time-aware only (rows (e)-(f)) and content- and time-aware jointly (rows (g)-(h)). The content-aware attention (row (d)) fails to focus on the important contexts for improving understanding performance in the complex dialogues and even performs slightly worse than the contextual model without attention (row (c)). Without a delicately designed attention mechanism, incorporating an additional content-aware attention is not guaranteed to bring better performance, and the experimental results show that a simple and coarse content-aware attention barely provides any usable information given the complex dialogues. Therefore, we focus on whether our time-aware attention mechanisms can compensate for the poor attention learned from the content-aware model. In other words, we are not trying to verify whether our time-aware attention mechanisms can collaborate with the content-aware attention mechanism; instead, we examine how much our proposed time-aware attention can mitigate the detriment of the content-aware attention. Comparing rows (e)-(f) with rows (g)-(h), we find that our universal time-decay attention keeps the improvement without much performance drop when the learned temporal attention is involved. Namely, our proposed attention mechanism can capture temporal information precisely, and it can therefore counteract the harmful impact of inaccurate content-aware attention.

4.3 Effectiveness of Role-Level Attention

For role-level attention, Table 1 shows that all results with various time-decay attention mechanisms are better than the one with only content-aware attention (row (d)). However, linear and concave time-decay functions do not provide additional improvement when we model the attention at the sentence level.


LU Model                  | Context Length
                          | 3       5          7
No Attention Contextual   | 74.75   74.69 (-)  74.52 (-)
Content-Aware Contextual  | 74.04   73.90 (-)  73.69 (-)
Time-Aware (Hand)         | 76.05   76.34 (+)  76.41 (+)
Time-Aware (E2E)          | 76.26   76.43 (+)  76.67 (+)
Content+Time (Hand)       | 75.16   75.27 (+)  75.48 (+)
Content+Time (E2E)        | 75.82   75.92 (+)  75.83 (-)

Table 2: The sentence-level performance reported on F1 of the proposed universal time-decay attention under different context-length settings (%). The symbols '+' and '-' indicate the performance trends.

The probable reason is that it is difficult to model attention for individual sentences given unsuitable time-decay functions. That is, if the designs of attention functions are unsuitable for dialogue contexts, the encoded sentence embeddings would be weighted by improper attention values. On the other hand, for role-level attention, each speaker role is assigned an attention value to represent its importance in the conversational interactions. Previous work (Chi et al., 2017; Chen et al., 2017) also demonstrated the effectiveness of considering speaker interactions for better understanding performance. By introducing role-level attention, the sentence-level attentional weights can be smoothed to avoid inappropriate values. Surprisingly, even though learning sentence-level temporal attention is difficult, our proposed universal time-decay attention achieves similar performance for sentence-level and role-level attention (76.67% and 76.75% from row (f)), further demonstrating its strong adaptability in fitting diverse dialogue contexts and its capability of capturing salient information.

4.4 Robustness to Context Lengths

It is intuitive that longer contexts bring richer information; however, they may obstruct attention learning and result in poor performance, because more information must be modeled and accurate estimation is not trivial. Since we have no idea how much context is enough for better understanding when modeling dialogues, robustness to varying context lengths is important for contextual model design. Here, we compare the results using different context lengths (3, 5, 7) for detailed analysis in Table 2, where the number is per speaker. The models without attention and the content-aware models become slightly worse with increasing context lengths.

Parameter | Sentence | Role
w1        |  0.758   |  1.078
w2        |  0.544   | -0.378
w3        | -0.302   |  0.300
a         |  0.888   |  0.841
b         |  0.969   |  1.084
e         | -0.320   | -0.129
f         |  0.640   |  0.993
D0        |  4.873   |  4.980
n         |  2.977   |  2.755

Table 3: The converged values of the end-to-end trainable parameters of the proposed universal time-decay attention models. The values are averaged over five runs.

However, our proposed universal time-decay attention model mostly achieves better performance when including longer contexts, demonstrating not only the flexibility of adapting to diverse contextual patterns but also the robustness to varying context lengths.

4.5 Universal Time-Decay Attention Analysis

This paper proposes a flexible time-decay attention mechanism by composing three types of time-aware attention functions with different decaying tendencies, where each decaying curve reflects a specific perspective on the distribution of salient information in dialogue contexts. The proposed universal time-decay attention shows great capability of modeling diverse dialogue patterns in the experiments and therefore demonstrates that our proposed method is a general design of time-decay attention. In our design, we endow the attention function with flexibility by employing several trainable parameters, so it can automatically learn a properly decaying curve that fits the dialogue contexts better.

To further analyze the combination of different time-decay attention functions, we inspect the converged values of the trainable parameters of the proposed universal time-decay attention models in Table 3. Under the end-to-end trainable setting, the initialization of the trainable parameters is the same as the hand-crafted ones ($w_i$ = 1, a = 1, b = 1, e = -0.125, f = 1, D0 = 5, n = 3). In the experiments, the models automatically figure out that the convex time-decay attention function should have a higher weight than the others for both sentence-level and role-level models ($w_1 > w_2$ and $w_1 > w_3$). Namely, in dialogue contexts, the recent utterances contain most of the information related to the current utterance, which aligns with our intuition.
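As a quick check, plugging the converged sentence-level values from Table 3 into Eq. (8) shows how the convex term shapes the curve; this is a sketch reusing the functional forms from Section 3, not part of the paper's evaluation.

```python
import numpy as np

# Converged sentence-level parameters from Table 3
w1, w2, w3 = 0.758, 0.544, -0.302
a, b, e, f, D0, n = 0.888, 0.969, -0.320, 0.640, 4.873, 2.977

d = np.arange(1, 8, dtype=float)   # distances of 7 history utterances
alpha = (w1 / (a * d ** b)
         + w2 * np.maximum(e * d + f, 0.0)
         + w3 / (1.0 + (d / D0) ** n))
print(np.round(alpha / alpha.sum(), 3))
# The most recent utterance (d = 1) receives the bulk of the weight, and
# the tail stays small and mildly oscillating: the convex term dominates.
```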


Figure 3: The visualization of the attention weights enhanced by the proposed time-decay function compared with the weights learned by the content-aware attention model. In the tourist-guide dialogue excerpt, for the target sentence "okay i mentioned earlier on i would like to recommend the zoo", the content-aware attention predicts FOL-INFO, while the content + universal time-decay attention predicts RES-RECOMMEND.


4.6 Qualitative Analysis

From the above experiments, the proposed time-decay attention mechanisms significantly improve the performance at both the sentence and role levels. To further understand how the time-decay attention changes the content-aware attention, we dig deeper into the learned attentional values for sentences and illustrate the visualization in Figure 3. The figure shows a partial dialogue between the tourist (left) and the guide (right), where the color shades indicate the learned attention intensities of sentences. It can be found that the learned content-aware attention (red; row (c)) focuses on the incorrect sentence ("so we can eat there" (FOL-EXPLAIN)) and hence predicts the wrong label, FOL-INFO. The reason may be that with a coarse and simple design of the content-aware attention mechanism, the attention function may not provide additional benefit for improvement. By additionally leveraging our proposed universal time-decay attention, the result (blue; row (g)) shows that the adjusted attention pays the highest attention to the most recent utterance and thereby predicts the correct intent, RES-RECOMMEND. Our proposed time-decay attention can effectively turn the attention to the correct contexts in order to correctly predict the dialogue act and attribute. Therefore, the proposed attention mechanisms are demonstrated to be effective for improving understanding performance in such complex human-human conversations.

5 Conclusion

This paper designs and investigates various time-decay attention functions based on an end-to-end contextual language understanding model, where different perspectives on dialogue contexts are analyzed and a flexible and universal time-decay attention mechanism is proposed. The experiments on a benchmark human-human dialogue dataset show that understanding performance can be boosted by simply introducing the proposed time-decay attention mechanisms, which guide the model to focus on the salient contexts following a convex curve. Moreover, the proposed universal time-decay mechanisms are easily extensible to multi-party conversations, showing the potential of leveraging temporal information in NLP tasks on dialogues.

Acknowledgements

We would like to thank the reviewers for their insightful comments on the paper. The authors are supported by the Ministry of Science and Technology of Taiwan, Google Research, Microsoft Research, and MediaTek Inc.


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Anshuman Bhargava, Asli Celikyilmaz, Dilek Hakkani-Tur, and Ruhi Sarikaya. 2013. Easy contextual intent prediction and slot detection. In Proceedings of ICASSP, pages 8337-8341.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of ICLR.

Stephen Butterworth. 1930. On the theory of filter amplifiers. Wireless Engineer 7(6):536-541.

Po-Chun Chen, Ta-Chung Chi, Shang-Yu Su, and Yun-Nung Chen. 2017. Dynamic time-aware attention to speaker roles and contexts for spoken language understanding. In Proceedings of ASRU, pages 554-560.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016a. Syntax or semantics? Knowledge-guided joint semantic frame parsing. In Proceedings of the 2016 IEEE Spoken Language Technology Workshop, pages 348-355.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016b. Knowledge as a teacher: Knowledge-guided structural attention networks. arXiv preprint arXiv:1609.03286.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Jianfeng Gao, and Li Deng. 2016c. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Proceedings of INTERSPEECH, pages 3245-3249.

Yun-Nung Chen, Ming Sun, Alexander I. Rudnicky, and Anatole Gershman. 2015. Leveraging behavioral patterns of mobile applications for personalized spoken language understanding. In Proceedings of ICMI, pages 83-86.

Ta-Chung Chi, Po-Chun Chen, Shang-Yu Su, and Yun-Nung Chen. 2017. Speaker role contextual modeling for language understanding and dialogue policy learning. In Proceedings of IJCNLP, pages 163-168.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of ACL, pages 484-495.

Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of INTERSPEECH, pages 715-719.

Seokhwan Kim, Luis Fernando D'Haro, Rafael E. Banchs, Jason D. Williams, and Matthew Henderson. 2016. The fourth dialog state tracking challenge. In Proceedings of IWSDS.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of The 8th International Joint Conference on Natural Language Processing, pages 733-743.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, volume 14, pages 1532-1543.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673-2681.

Yangyang Shi, Kaisheng Yao, Hu Chen, Yi-Cheng Pan, Mei-Yuh Hwang, and Baolin Peng. 2015. Contextual spoken language understanding using recurrent neural networks. In Proceedings of ICASSP, pages 5271-5275.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Proceedings of NIPS, pages 2431-2439.

Ming Sun, Yun-Nung Chen, and Alexander I. Rudnicky. 2016. An intelligent assistant for high-level task understanding. In Proceedings of IUI, pages 169-174.

Gokhan Tur and Renato De Mori. 2011. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.

Zhangyang Wang, Yingzhen Yang, Shiyu Chang, Qing Ling, and Thomas S. Huang. 2016. Learning a deep ℓ∞ encoder for hashing. In Proceedings of IJCAI, pages 2174-2180.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL, pages 438-449.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In Proceedings of ICLR.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. arXiv preprint arXiv:1603.01417.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML, pages 2048-2057.

Puyang Xu and Ruhi Sarikaya. 2014. Contextual domain classification in spoken language understanding systems using recurrent neural network. In Proceedings of ICASSP, pages 136-140.

Rui Zhang, Honglak Lee, Lazaros Polymenakos, and Dragomir Radev. 2018. Addressee and response selection in multi-party conversations with speaker interaction RNNs. In Proceedings of AAAI.