
Utterance-level Dialogue Understanding: An Empirical Study

Deepanway Ghosal†, Navonil Majumder†, Rada Mihalcea‡, Soujanya Poria†

† Singapore University of Technology and Design, Singapore
‡ University of Michigan, USA

[email protected]
{navonil_majumder, sporia}@sutd.edu.sg
[email protected]

Abstract

The recent abundance of conversational data on the Web and elsewhere calls for effective NLP systems for dialog understanding. Complete utterance-level understanding often requires context understanding, defined by nearby utterances. In recent years, a number of approaches have been proposed for various utterance-level dialogue understanding tasks. Most of these approaches account for the context for effective understanding. In this paper, we explore and quantify the role of context for different aspects of a dialogue, namely emotion, intent, and dialogue act identification, using state-of-the-art dialog understanding methods as baselines. Specifically, we employ various perturbations to distort the context of a given utterance and study its impact on the different tasks and baselines. This provides us with insights into the fundamental contextual controlling factors of different aspects of a dialogue. Such insights can inspire more effective dialogue understanding models, and provide support for future text generation approaches. The implementation pertaining to this work is available at https://github.com/declare-lab/dialogue-understanding.

1 Introduction

Human-like conversational systems are a long-standing goal of Artificial Intelligence (AI). However, the development of such systems is not a trivial task, as we often participate in dialogues by relying on several factors such as emotions, sentiment, prior assumptions, intent, or personality traits. In Fig. 1, we illustrate a dialogue-generation mechanism that leverages these key variables. In this illustration, P represents the personality of the speaker; S represents the speaker-state; I denotes the intent of the speaker; E refers to the speaker's emotion-aware state; and U refers to the observed utterance.

Figure 1: Dyadic conversations between person X and Y are governed by interactions between several latent factors such as intents and emotions. (The figure depicts, at turn $t$, speaker X's state $S_t^X$, intent $I_t^X$, emotion $E_t^X$, and utterance $U_t^X$, conditioned on the topic, the personality $P^X$, the preceding utterances $U_{\le t-1}^{X,Y}$, and the speaker's previous intent $I_{t-2}^X$.)

Speaker personality and the topic always influence these variables. At turn t, the speaker conceives several pragmatic concepts, such as argumentation logic, viewpoint, and inter-personal relationships, which we collectively represent using the speaker-state S (Hovy, 1987). Next, the intent I of the speaker is formulated based on the current speaker-state and the previous intent of the same speaker (at t − 2). These two factors influence the emotion of the speaker. Finally, the intent, the speaker-state, and the speaker's emotion jointly manifest as the spoken utterance. It is thus not surprising that the landscape of dialogue understanding research embraces several challenging tasks, such as emotion recognition in conversations (ERC), dialogue intent classification, user state representation, and others. These tasks are often performed at the utterance level and can be conjoined under the umbrella of utterance-level dialogue understanding.


Figure 2: Role of context in utterance-level dialogue understanding. (The figure shows two annotated dialogues, one about finding tickets and one about retrieving a phone number, in which coreferential information, inter-speaker dependencies, label dependencies, and label shifts due to the other speaker's response connect the utterances to their emotion and intent labels.)

Due to the fast-growing research interest in dialogue understanding, several novel approaches have lately been proposed (Qin et al., 2020; Rashkin et al., 2019; Xing et al., 2020; Lian et al., 2019; Wang et al., 2020; Saha et al., 2020) to address these tasks by adopting speaker-specific and contextual modeling. However, to the best of our knowledge, no unified baselines have been established for the varied utterance-level dialogue understanding tasks that would allow comparison and analysis of these tasks under the same framework. In this work, the purpose of using a unified baseline for all the utterance-level dialogue understanding tasks is to compare the characteristics of this baseline across different tasks and datasets. As a result, we can also learn interesting attributes of the datasets and tasks, which we discuss in detail in Section 6. Recently, Sankar et al. (2019) attempted to measure the efficacy of multi-turn contextual information in dialogue generation by probing models tasked to generate dialogues given multi-turn contexts. According to them, the baseline dialogue models used in their work are not capable of efficiently utilizing long multi-turn sequences for dialogue generation, as they are rarely sensitive to most perturbations; this diverges from the findings of our work.

Conversational Context Modeling. Context is at the core of NLP research. According to several recent studies (Peters et al., 2018; Devlin et al., 2018), contextual sentence and word embeddings can improve the performance of state-of-the-art NLP systems by a significant margin.

The notion of context can vary from problem to problem. For example, while calculating word representations, the surrounding words carry contextual information. Likewise, to classify a sentence in a document, the neighbouring sentences are considered as its context. In Poria et al. (2017), surrounding utterances are treated as context, and the authors experimentally show that contextual evidence indeed aids in classification.

Similarly, in tasks such as conversational emotion or intent detection, to determine the emotion of an utterance at time t, the preceding utterances at time < t can be considered as its context. However, computing this context representation often exhibits major difficulties due to emotional dynamics.

The dynamics of conversations consist of two important aspects: self and inter-personal dependencies (Morris and Keltner, 2000). Self-dependency, also known as intra-inertia, deals with the influence that speakers have on themselves during conversations (Kuppens et al., 2010). On the other hand, inter-personal dependencies relate to the influences that the counterparts induce in a speaker. Conversely, during the course of a dialogue, speakers also tend to mirror their counterparts to build rapport (Navarretta et al., 2016). This phenomenon is illustrated in Fig. 3. Here, Pa is frustrated over her long-term unemployment and seeks encouragement (u1, u3). Pb, however, is pre-occupied and replies sarcastically (u4). This enrages Pa, who responds angrily (u6). In this dialogue, emotional inertia is evident in Pb, who does not deviate from his nonchalant behavior. Pa, however, gets emotionally influenced by Pb.


u1, Person A: "I don't think I can do this anymore." [frustrated]
u2, Person B: "Well I guess you aren't trying hard enough." [neutral]
u3, Person A: "Its been three years. I have tried everything." [frustrated]
u4, Person B: "Maybe you're not smart enough." [neutral]
u5, Person B: "Just go out and keep trying." [neutral]
u6, Person A: "I am smart enough. I am really good at what I do. I just don't know how to make someone else see that." [anger]

Figure 3: An abridged dialogue from the IEMOCAP dataset.

Modeling self and inter-personal relationships and dependencies may also depend on the topic of the conversation, as well as various other factors like argument structure, the interlocutors' personalities, intents, viewpoints on the conversation, attitudes towards each other, etc. Hence, analyzing all these factors is key for true self and inter-personal dependency modeling that can lead to enriched context understanding.

The contextual information can come from both local and distant conversational history. While the importance of local context is more obvious, as stated in recent works, distant context often plays a less important role in understanding the utterances. Distant contextual information is useful mostly in scenarios where a speaker refers to earlier utterances spoken by any of the speakers in the conversational history. The usefulness of context is more prevalent in classifying short utterances, like "yeah", "okay", "no", that can express different emotions depending on the context and discourse of the dialogue.

Although the role of context is reasonably clear in dialogue generation, it may not be equally transparent in the case of utterance-level dialogue understanding. There can be occasions where contextual information does not provide any useful information. In such cases, the target utterance alone can be sufficient for necessary inferences such as intent, act, and emotion prediction. However, contextual utterances in a conversation should always help in understanding an utterance at a given time, as they provide key background information. Modeling representations of these contextual utterances is not trivial, as there can be long-chain complex coreferential or other kinds of inferences involved in the process, and sometimes various other confounding factors such as sarcasm, irony, etc. can make the task extremely challenging (see Fig. 2). An ideal context modeling approach should have the ability to understand such factors efficiently, fuse them, and perform inference accordingly. Contextual information may not be necessary to tag all the utterances in a dialogue, but we must acknowledge that full understanding of an utterance is incomplete without its conversational context. The scope of this understanding spans beyond just tagging utterances with these predefined labels (intents, dialogue acts, emotions) to large-scale reasoning that includes difficult problems like identifying the causes eliciting a particular response, slot filling, etc. Efficient and well-formed contextual representation modeling can also depend on the type of conversation, i.e., modeling a persuasive dialogue can be different from modeling a debate. Finding contextualized conversational utterance representations is an active area of research. Leveraging such contextual clues is a difficult task. Memory networks, RNNs, and attention mechanisms have been used in previous works (Qin et al., 2020) to grasp information from the context. However, these networks do not address most of the abovementioned aspects, e.g., coreference resolution and understanding long-chained inferences between the speakers. In this work, we do not attempt to propose a network that incorporates all these factors; rather, we probe existing contextual and non-contextual models in utterance-level dialogue understanding to understand their underlying working principles.

In this work, we adapt, modify, and employ two strong contextual utterance-level dialogue understanding baselines, bcLSTM (Poria et al., 2017) and DialogueRNN (Majumder et al., 2019), which we evaluate on four large dialogue classification datasets across five different tasks. As shown in Fig. 1, conversational context, inter-speaker dependencies, and speaker states can play important roles in addressing these utterance-level dialogue understanding tasks. bcLSTM and DialogueRNN are two frameworks that leverage these factors and are thus considered as the baselines in our experiments. Moreover, we present several unique probing strategies and experimental designs that evaluate the role of context in utterance-level dialogue understanding.


Person A: "Hey! I have just landed in Cambridge." (Emotion: Excited)
Person B: "Welcome to Cambridge. Hope you had a pleasant flight. How can I help you, madam?" (Emotion: Neutral; Intent: ask_clarification)
Person A: "Not really. I lost my luggage. Hope I will get the insurance coverage." (Emotion: Sad)
Person B: "Oh so sorry to hear. There is a place at the first floor where you can lodge a complaint." (Emotion: Sad (Empathetic); Intent: provide_information)
Person A: "Thanks for the information! Appreciate it. Are you a local? I'm looking for some recommendations." (Emotion: Excited; Intent: find_attraction)
Person B: "Are you looking for a place to stay, a place to eat, an attraction to visit, or do you need transportation?" (Emotion: Neutral; Intent: ask_clarification)

Figure 4: Utterance-level tagging of a dialogue.

To summarize, the purpose of this work is to decipher the role of context in utterance-level dialogue understanding by means of different probing strategies. These strategies can be easily adapted to other tasks for similar purposes.

The contribution of this work is five-fold:

• We set up contextual baselines for five different utterance-level dialogue understanding tasks with traditional word embeddings (GloVe) and recent transformer-based contextualized word embeddings (RoBERTa);

• We modify the existing strong baselines bcLSTM and DialogueRNN by introducing residual connections that improve these baselines by a significant margin;

• We showcase a detailed dataset analysis of the different tasks and present interesting frequent label transition patterns and dependencies;

• We perform an evaluation of two different mini-batch construction paradigms: utterance- and dialogue-level mini-batches;

• We propose varied probing strategies to decipher the role of context in utterance-level dialogue understanding.

2 Task Definition

Given the transcript of a conversation along with speaker information for each constituent utterance, the utterance-level dialogue understanding task aims to identify the label of each utterance from a set of pre-defined labels that can be a set of emotions, dialogue acts, intents, etc. Fig. 4 illustrates one such conversation between two people, where each utterance is labeled with the underlying emotion and intent. Formally, given the input sequence of $N$ utterances $[(u_1, p_1), (u_2, p_2), \ldots, (u_N, p_N)]$, where each utterance $u_i = [u_{i,1}, u_{i,2}, \ldots, u_{i,T}]$ consists of $T$ words $u_{i,j}$ and is spoken by party $p_i$, the task is to predict the label $e_i$ of each utterance $u_i$. In this process, the classifier can also make use of the conversational context. There are also cases where not all the utterances in a dialogue have corresponding labels. In this paper, we limit utterance-level dialogue understanding to tagging utterances with emotions, dialogue acts, and intents. However, we do not claim these are the only tasks that give us a full understanding of the utterances in dialogues. As discussed in the introduction, the scope of this research topic includes various other harder problems that we do not address in this paper, such as slot filling, identifying sarcasm, finding causes of responses, etc.
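To make the format concrete, here is a minimal illustration in Python, using utterances from Fig. 4; the variable names and the None marker for unlabeled utterances are our own conventions, not part of any dataset:

```python
# A dialogue is a sequence of (utterance, party) pairs; the task is to predict
# one label per utterance. Unlabeled utterances are marked None here.
dialogue = [
    ("Hey! I have just landed in Cambridge.", "user"),
    ("Welcome to Cambridge. How can I help you, madam?", "system"),
    ("Are you a local? I'm looking for some recommendations.", "user"),
]
labels = ["Excited", None, "Excited"]  # e.g., emotion labels; system turns may be unlabeled
```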

3 Models

We train all our classification models in an end-to-end setup. We first extract utterance-level features with either i) a CNN module with pretrained GloVe embeddings or ii) a pretrained RoBERTa model. The resulting extracted features are non-contextual in nature, as they are obtained from utterances without the surrounding context. We then classify the utterances with one of the following three models: i) Logistic Regression, ii) bcLSTM, or iii) DialogueRNN. Among these models, the Logistic Regression model is non-contextual in nature, whereas the other two are contextual. We expand on the feature extractor and the classifier in more detail next.

3.1 Utterance Feature Extractor

Utterance-level features are extracted using one of the following two methods:

GloVe CNN. A convolutional neural network (Kim, 2014) is used to extract features from the utterances of the conversation.

Figure 5: Modified bcLSTM and DialogueRNN with residual connection.

We use a single convolutional layer followed by max-pooling and a fully-connected layer to obtain the representation of the utterance. The inputs to this network are the utterances. Each word in an utterance is initialized with 300-dimensional pretrained GloVe embeddings (Pennington et al., 2014). We pass these words to convolutional filters of sizes 1, 2, and 3, each having 100 feature maps. The outputs of these filters are max-pooled across all the words of an utterance. These are then concatenated and fed to a 100-dimensional fully-connected layer followed by a ReLU activation (Nair and Hinton, 2010). The output after the activation forms the final representation of the utterance.
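A minimal PyTorch sketch of this extractor, under the assumptions above (class and argument names are ours; training details such as dropout are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GloVeCNNExtractor(nn.Module):
    """Sketch of the CNN utterance encoder: GloVe embeddings -> conv filters of
    sizes 1, 2, 3 (100 feature maps each) -> max-pool over words -> 100-d FC + ReLU."""

    def __init__(self, glove_weights):            # glove_weights: (vocab, 300) tensor
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.convs = nn.ModuleList(
            nn.Conv1d(300, 100, kernel_size=k, padding=k - 1) for k in (1, 2, 3))
        self.fc = nn.Linear(300, 100)

    def forward(self, tokens):                    # tokens: (batch, T) word indices
        x = self.emb(tokens).transpose(1, 2)      # (batch, 300, T)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]  # max over words
        return F.relu(self.fc(torch.cat(pooled, dim=1)))             # (batch, 100)
```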

RoBERTa. We employ the RoBERTa-Base model (Liu et al., 2019) to extract utterance-level feature vectors. RoBERTa-Base follows the original BERT-Base (Devlin et al., 2018) architecture, having 12 layers, 12 self-attention heads in each block, and a hidden dimension of 768, resulting in a total of 125M parameters. Let an utterance $x$ consist of a sequence of BPE-tokenized tokens $x_1, x_2, \ldots, x_N$. A special token [CLS] is prepended to the utterance to create the input sequence for the model: [CLS], $x_1, x_2, \ldots, x_N$. This sequence is passed through the model, and the activation from the last layer corresponding to the [CLS] token is used as the utterance feature.
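A sketch of this extraction with the HuggingFace transformers library (our illustration; RoBERTa's tokenizer uses <s> as its [CLS]-equivalent classification token):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

def utterance_feature(utterance: str) -> torch.Tensor:
    # The tokenizer adds <s> ... </s> around the BPE tokens automatically.
    inputs = tokenizer(utterance, return_tensors="pt")
    last_hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return last_hidden[0, 0]                          # activation at the <s>/[CLS] position
```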

3.2 Utterance Classifier

The representations obtained from the Utterance Feature Extractor are then classified using one of the following three methods:

Without Context Classifier. In this model, classification of an utterance is performed using a fully-connected multi-layer perceptron. This classification setup is non-contextual in nature, as there is no flow of information from the contextual utterances. This strategy amounts to fine-tuning the GloVe CNN or RoBERTa features in isolation w.r.t. the other utterances in the conversation, as we do not take those into account. For simplicity, we call this model GloVe CNN or RoBERTa LogReg (Logistic Regression).

bcLSTM. The Bidirectional Contextual LSTM model (bcLSTM) (Poria et al., 2017) creates context-aware utterance representations by capturing contextual content from the surrounding utterances using a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) network. The feature representations extracted by the Utterance Feature Extractor serve as the input to the LSTM network. Finally, the context-aware utterance representations from the output of the LSTM are used for label classification.


Figure 6: Label distribution of the datasets.

The contextual LSTM model is speaker-independent, as it does not model any speaker-level dependency. An illustration of the bcLSTM model is shown in Fig. 7a.
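A minimal sketch of bcLSTM on top of pre-extracted utterance features (dimensions are placeholders); passing bidirectional=False yields the cLSTM variant described next:

```python
import torch.nn as nn

class BcLSTMClassifier(nn.Module):
    """Bidirectional LSTM over the sequence of utterance features of one dialogue."""

    def __init__(self, d_feat=100, d_hidden=100, n_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(d_feat, d_hidden, batch_first=True, bidirectional=True)
        self.clf = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, utt_feats):        # utt_feats: (batch, T, d_feat)
        ctx, _ = self.lstm(utt_feats)    # context-aware representations (batch, T, 2*d_hidden)
        return self.clf(ctx)             # per-utterance logits
```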

cLSTM. Similar to bcLSTM, but without bidirectionality in the LSTM; this model is intended to ignore future utterances while classifying an utterance $u_t$.

DialogueRNN (Majumder et al., 2019) is a recurrent network based model for emotion recognition in conversations. It uses two GRUs to track individual speaker states and the global context during the conversation. Further, another GRU is employed to track the emotion state through the conversation. In this work, we consider the emotion state to be a general state that can be used for utterance-level classification (i.e., not limited to emotion classification). As in the bcLSTM model, the features extracted by the Utterance Feature Extractor are the input to the DialogueRNN network. DialogueRNN aims to model inter-speaker relations and can be applied to multiparty datasets. An illustration of the DialogueRNN model is shown in Fig. 7b, including the speaker-state modeling that uses its recurrent structure to account for both intra- and inter-speaker dependencies.
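The following simplified sketch conveys DialogueRNN's speaker-aware recurrence; it omits the attention over past global states and the listener-state updates of the original model, and all names are ours:

```python
import torch
import torch.nn as nn

class SimpleDialogueRNN(nn.Module):
    """Simplified speaker-aware recurrence: a global GRU, one running state per
    speaker, and an emotion/state GRU used for classification."""

    def __init__(self, d_feat=100, d_state=100, n_classes=6):
        super().__init__()
        self.global_cell = nn.GRUCell(d_feat, d_state)            # global state g_t
        self.party_cell = nn.GRUCell(d_feat + d_state, d_state)   # speaker state q_{s,t}
        self.emotion_cell = nn.GRUCell(d_state, d_state)          # emotion state e_t
        self.clf = nn.Linear(d_state, n_classes)
        self.d_state = d_state

    def forward(self, feats, speakers):   # feats: (T, d_feat); speakers: list of ids
        g = feats.new_zeros(1, self.d_state)
        e = feats.new_zeros(1, self.d_state)
        party = {}                        # speaker id -> current speaker state
        logits = []
        for u, s in zip(feats, speakers):
            u = u.unsqueeze(0)
            g = self.global_cell(u, g)                          # update global context
            q = party.get(s, feats.new_zeros(1, self.d_state))
            q = self.party_cell(torch.cat([u, g], dim=1), q)    # update this speaker's state
            party[s] = q
            e = self.emotion_cell(q, e)                         # update classification state
            logits.append(self.clf(e))
        return torch.cat(logits)          # (T, n_classes)
```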

cLSTM, bcLSTM and DialogueRNN with Residual Connections. Deep neural networks can often have difficulties with information propagation. Multi-layered RNN-like architectures in particular often succumb to vanishing gradients while modeling long-range sequences. Residual connections, or skip connections (He et al., 2016), are an intuitive way to tackle this problem by improving information propagation and gradient flow. Inspired by early works on residual LSTMs (Wu et al., 2006; Kim et al., 2017), we adopt a simple strategy to introduce a residual connection into our recurrent contextual models, bcLSTM and DialogueRNN. For each utterance, a residual connection is formed between the output of the feature extractor and the output of the bcLSTM/DialogueRNN module. These two vectors are added, and the final classification is performed on the resultant vector.
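A sketch of the residual variant for bcLSTM; halving the hidden size so that the bidirectional output matches the feature dimension is our choice, made so the two vectors can be summed directly:

```python
import torch.nn as nn

class ResidualBcLSTM(nn.Module):
    """bcLSTM whose output is summed with the feature extractor output (skip connection)."""

    def __init__(self, d_feat=100, n_classes=6):
        super().__init__()
        # hidden size d_feat // 2 makes the concatenated forward/backward output d_feat wide
        self.lstm = nn.LSTM(d_feat, d_feat // 2, batch_first=True, bidirectional=True)
        self.clf = nn.Linear(d_feat, n_classes)

    def forward(self, utt_feats):             # (batch, T, d_feat)
        ctx, _ = self.lstm(utt_feats)         # (batch, T, d_feat)
        return self.clf(ctx + utt_feats)      # add the skip connection, then classify
```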

4 Experimental Setup

4.1 Datasets

All the dialogue classification datasets that we consider in this work consist of two-party conversations. We benchmark the models on the following datasets:



Figure 7: (a) Original bcLSTM model architecture with GloVe CNN/RoBERTa utterance feature extractor. (b) Left: Original DialogueRNN architecture. Right: Update schemes for global, speaker, listener, and emotion states for the t-th utterance in a dialogue. Here, Person i is the speaker and Persons j ∈ [1, M], j ≠ i, are the listeners.

IEMOCAP (Busso et al., 2008) is a dataset of two-person conversations among ten unique speakers. The train and validation set dialogues come from the first eight speakers, whereas the test set dialogues are from the last two speakers. Each utterance is annotated with one of the following six emotions: happy, sad, neutral, angry, excited, and frustrated.

DailyDialog (Li et al., 2017) is a manually labelled multi-utterance dialogue dataset. The dataset covers various topics of daily life and follows natural human communication. All utterances are labeled with both emotion categories and dialogue acts (intentions). The emotion can belong to one of the following seven labels: anger, disgust, fear, joy, neutral, sadness, and surprise. The neutral label is the most frequent emotion category in this dataset, with around 83% of utterances belonging to this class; the emotion class distribution is thus highly imbalanced. In comparison, the dialogue act label distribution is relatively more balanced. The act labels belong to the following four categories: inform, question, directive, and commissive.

MultiWOZ (Budzianowski et al., 2018), or the Multi-Domain Wizard-of-Oz dataset, is a fully-labeled collection of human-human written conversations spanning multiple domains and topics.


| Dataset             | # dialogues (train / val / test) | # utterances (train / val / test) |
|---------------------|----------------------------------|-----------------------------------|
| IEMOCAP             | 108 / 12 / 31                    | 5,163 / 647 / 1,623               |
| DailyDialog         | 11,118 / 1,000 / 1,000           | 87,179 / 8,069 / 7,740            |
| MultiWOZ            | 8,438 / 1,000 / 1,000            | 113,556 / 14,748 / 14,744         |
| Persuasion For Good | 220 / 40 / 40                    | 7,902 / 1,451 / 1,511             |

| Dataset             | Classification Task | # classes | Metric                        |
|---------------------|---------------------|-----------|-------------------------------|
| IEMOCAP             | Emotion             | 6         | Weighted Avg F1               |
| DailyDialog         | Emotion             | 6*        | Weighted Avg, Macro, Micro F1 |
| DailyDialog         | Act                 | 4         | Weighted Avg, Macro F1        |
| MultiWOZ            | Intent              | 10        | Weighted Avg F1               |
| Persuasion For Good | Persuader           | 11        | Weighted Avg, Macro F1        |
| Persuasion For Good | Persuadee           | 13        | Weighted Avg, Macro F1        |

Table 1: Statistics of splits and evaluation metrics used in the different datasets. *The neutral class constitutes 83% of the DailyDialog dataset and is excluded when calculating the emotion metrics in DailyDialog.

The dataset has been created for task-oriented dialogue modelling and has 10,000 dialogues, which is at least an order of magnitude larger than previously available task-oriented corpora. The dialogues are labelled with belief states and actions. It contains conversations between a user and a system from the following seven domains: restaurant, hotel, attraction, taxi, train, hospital, and police. In this work, we focus on classifying the intent of the utterances from the user, which belongs to one of the following categories: book restaurant, book train, find restaurant, find train, find attraction, find bus, find hospital, find hotel, find police, find taxi, and None. The None utterances are not included in the evaluation. Note that utterances from the system side are not labelled and thus are not classified in our framework.

Persuasion For Good (Wang et al., 2019) is a persuasive dialogue dataset in which one participant aims to persuade the other to donate his/her earnings using different persuasion strategies. The two participants are denoted as the Persuader (ER) and the Persuadee (EE), respectively. In this work, we formulate our problem as classifying the utterances of the Persuader and Persuadee separately, using the full context of the conversation. The Persuader strategies are classified into the following eleven categories: donation-information, logical-appeal, personal-story, emotion-appeal, personal-related-inquiry, foot-in-the-door, source-related-inquiry, task-related-inquiry, credibility-appeal, self-modeling, and non-strategy-dialogue-acts. In contrast, the Persuadee strategy can belong to one of the following thirteen categories: disagree-donation-more, ask-org-info, agree-donation, provide-donation-amount, personal-related-inquiry, disagree-donation, task-related-inquiry, negative-reaction-to-donation, ask-donation-procedure, positive-reaction-to-donation, neutral-reaction-to-donation, ask-persuader-donation-intention, and other-dialogue-acts.

The dataset consists of 1,017 dialogues, of which 300 are annotated with persuasion strategies. In this work, we use the annotated dialogues and partition them into train (220 dialogues), validation (40 dialogues), and test (40 dialogues) splits to conduct our experiments.

Statistics. Statistics about the number of dialogues and utterances in the four datasets are presented in Table 1.

Label Transitions. To check whether any patterns lie in the label sequences of the datasets, in Fig. 8 and Fig. 9 we plot the frequency of label pairs $(x, y)$, where $x$ and $y$ are the labels of $U_{s_{t-1}, t-1}$ and $U_{s_t, t}$, respectively. Fig. 8 depicts the inter-speaker label transitions and Fig. 9 illustrates the intra-speaker label transitions. Both plots reveal the same emotion labels appearing in consecutive utterances with high frequency in the IEMOCAP dataset. This induces label dependencies and consistencies, and can be called the label-copying feature of the dataset.


Figure 8: The heatmap of inter-speaker label transition statistics in the datasets. The color bar represents the normalized number of inter-speaker transitions, such that the elements of each matrix add up to 1. Inter-speaker transitions are not defined in MultiWOZ, as system-side utterances are not labeled. Note: for the DailyDialog dataset, we ignore the neutral emotion in this figure.

From our empirical analysis in Section 6.7, we confirm this property of the IEMOCAP dataset. Although not as strong as in IEMOCAP, the intra-speaker label-copying feature is also prevalent in the MultiWOZ and DailyDialog (Act) datasets (refer to Fig. 8). Moreover, we observe interesting patterns in DailyDialog (Act). A directive utterance is commonly followed by a commissive utterance. This indicates that utterances with acts such as request and instruct (directive label) are followed by accepting/rejecting the request or order (commissive label). We also notice that an utterance with the act of questioning is commonly followed by utterances with the act of answering (which is quite natural). Fig. 8 also corroborates the frequent joint appearance of similar emotions in both speakers' utterances, e.g., negative emotions (anger, frustration, sadness) expressed by one speaker are replied to with a similar negative emotion by the other speaker. Interestingly, the DailyDialog dataset for emotion classification does not elicit any such patterns. We can attribute this to the scripted utterances present in IEMOCAP, which have specifically been designed to invoke more emotional content in the utterances. On the other hand, the DailyDialog dataset comprises naturalistic utterances that are more dynamic in nature, as they depend on the interlocutors' personalities. In both the IEMOCAP and DailyDialog datasets, repetitions of the same emotions can be found in consecutive utterances of a speaker. The repetition of the same or similar emotions for a speaker is frequent and often forms long chains in IEMOCAP. However, such repetitions are much less prevalent in DailyDialog. Readers are referred to Fig. 9 for a clearer view. These two different types of datasets were purposely chosen in order to study dataset-specific nuances of the same task. In DailyDialog, approximately 80% of utterances are labeled as no-emotion (see Fig. 10), which makes emotion classification particularly challenging. These two datasets also differ from each other in average dialogue length.


Figure 9: The heatmap of intra-speaker label transition statistics in the datasets. The color bar represents the normalized number of intra-speaker transitions, such that the elements of each matrix add up to 1. Note: for the DailyDialog dataset, we ignore the neutral emotion in this figure.

Figure 10: The heatmap of intra-speaker (left) and inter-speaker (right) label transition statistics in the DailyDialog dataset, including the neutral emotion. The color bar represents the normalized number of inter-speaker and intra-speaker transitions, such that the elements of each matrix add up to 1.

While the average number of utterances per dialogue in the IEMOCAP dataset is more than 50, the average number of utterances per dialogue in the DailyDialog dataset is just 8, which is much shorter.

Among other semantically plausible label transitions, we can see in Fig. 9 that the intent book restaurant is frequently followed by the intent find taxi in the MultiWOZ dataset. We believe this is potentially one of the reasons why contextual models perform so well on these three datasets and tasks compared to the rest, as we discuss in the subsequent sections. Further, label dependency and consistency can aid in filtering likely labels given the prior labels. Notably, such patterns are not visible in the other datasets. Hence, one can use a Conditional Random Field (CRF) to find any hidden label patterns and dependencies.
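Such transition statistics can be computed directly from the labeled dialogues; a sketch for the intra-speaker matrices of Fig. 9 (function and variable names are ours):

```python
import numpy as np

def intra_speaker_transitions(dialogues, label_set):
    """dialogues: list of [(speaker, label), ...]; returns a matrix whose entries
    are normalized intra-speaker transition counts summing to 1, as in Fig. 9."""
    idx = {label: i for i, label in enumerate(label_set)}
    counts = np.zeros((len(label_set), len(label_set)))
    for dialogue in dialogues:
        last = {}                              # last label seen for each speaker
        for speaker, label in dialogue:
            if speaker in last:
                counts[idx[last[speaker]], idx[label]] += 1
            last[speaker] = label
    return counts / counts.sum()
```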


4.2 Evaluation Metrics

In our experiments, we use the evaluation metrics specified in Table 1. The weighted average (W-Avg) F1 score is the main metric for IEMOCAP emotion and MultiWOZ intent classification; we report this metric for all the other tasks as well. For the other tasks (DailyDialog emotion, DailyDialog act, and Persuader and Persuadee strategy classification), the label distribution is highly imbalanced. Hence, we also report Macro F1 scores. In DailyDialog emotion classification, neutral labels are excluded (masked) while calculating the metrics. However, these utterances are still passed as input to the different recurrent models. For DailyDialog emotion classification, we also report Micro F1 scores.
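A sketch of this masked metric computation with scikit-learn (the helper and its defaults are our own illustration):

```python
from sklearn.metrics import f1_score

def masked_f1(y_true, y_pred, ignore_label="neutral", average="macro"):
    """F1 over non-neutral utterances only: neutral utterances are still fed to the
    models as context, but excluded from the metric, as done for DailyDialog emotion."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore_label]
    true, pred = zip(*kept)
    return f1_score(true, pred, average=average)  # "weighted", "macro", or "micro"
```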

4.3 Training Setup

We use the 300-dimensional pretrained 840B GloVe vectors in our CNN-based feature extractor. The GloVe pretrained embeddings are kept fixed throughout the training process. For the RoBERTa-based feature extractor, we use the pretrained RoBERTa-Base model, which is fine-tuned during training. In our bcLSTM and DialogueRNN models, we use LSTM and GRU hidden sizes of Dh (100/200/1024 in Table 2) with forward and backward state concatenation. The GloVe-based models are trained with a learning rate of 1e-3 and a batch size of 32 dialogues with the Adam optimizer. The RoBERTa-based models are trained with a learning rate of 1e-5 and a batch size of 4 dialogues with the AdamW optimizer. All models are trained for 100 epochs. A chart of the hyperparameters used is shown in Table 2.

5 Results

We report results for the IEMOCAP and DailyDialog datasets in Table 3, and for the MultiWOZ and Persuasion for Good datasets in Table 4. We ran each experiment multiple times and report the average test scores based on the best validation scores.

We observe a general trend of improvement in performance when moving from the GloVe CNN feature extractor to the RoBERTa-based feature extractor, except for the intent prediction task in the MultiWOZ dataset. As the RoBERTa model has been pre-trained on a large amount of textual data and has considerably more parameters, this improvement is expected. The results could possibly be improved even further if a RoBERTa-Large model were used instead of the RoBERTa-Base model that we use in this work.

We also observe that the contextual models, bcLSTM and DialogueRNN, perform much better than the non-contextual Logistic Regression models in most cases. Context information is crucial for emotion, act, and intent classification, and models such as bcLSTM and DialogueRNN are among the most prominent methods for modeling the contextual dependency between utterances and their labels. In IEMOCAP, DailyDialog, and MultiWOZ, there is a sharp improvement in performance with contextual models compared to the non-contextual models. However, for the strategy classification task in the Persuasion for Good dataset, the improvement from contextual models is relatively smaller. Notably, for Persuadee classification, the RoBERTa non-contextual model achieves the best result, outperforming the contextual models. Without residual connections, the GloVe cLSTM and GloVe bcLSTM baselines perform worse than the non-contextual GloVe CNN baseline on the Persuasion for Good dataset. This indicates the need for better contextual models for this dataset. To analyze the results of the different models, we look at the following aspects:

Importance of the Residual Connections in the Models. The introduction of residual connections generally improves the performance of the contextual models. We obtain better performance and improved stability during training for most of the models with residual connections. In particular, residual connections are most effective on the IEMOCAP and Persuasion for Good datasets, which comprise long dialogues. Residual connections are used in deep networks to aid information propagation and to tackle vanishing gradient problems (Wu et al., 2006; Kim et al., 2017) in RNNs by improving gradient flow. As multi-layered RNN-like architectures often find it difficult to model long-range dependencies in a sequence due to vanishing gradients (Pascanu et al., 2013), we conjecture that this could be one of the reasons why we see a large performance boost with residual connections: they help propagate key information from the CNN layers to the output of the LSTM layers that might otherwise be lost due to the long, deep sequence modeling in the LSTM layer.


| Models                                 | bcLSTM: Dh / lr / bs | DialogueRNN: Dg / Dp / De / lr / bs |
|----------------------------------------|----------------------|-------------------------------------|
| GloVe (all datasets)                   | 100 / 1e-3 / 32      | 100 / 100 / 100 / 1e-3 / 32         |
| RoBERTa (all datasets except IEMOCAP)  | 200 / 1e-5 / 4       | 200 / 200 / 200 / 1e-5 / 4          |
| RoBERTa (IEMOCAP)                      | 1024 / 1e-5 / 4      | 1024 / 1024 / 1024 / 1e-5 / 4       |

Table 2: Hyperparameter details of the experiments. Note: lr → learning rate; bs → batch size.

| Methods             | IEMOCAP Emotion (W-Avg F1) | DailyDialog Emotion (W-Avg F1 / Micro F1 / Macro F1) | DailyDialog Act (W-Avg F1 / Macro F1) |
|---------------------|----------------------------|------------------------------------------------------|---------------------------------------|
| GloVe CNN           | 52.04                      | 49.36 / 50.32 / 36.87                                | 80.71 / 72.07                         |
| GloVe cLSTM         | 59.10                      | 52.56 / 53.67 / 38.14                                | 83.90 / 78.89                         |
| w/o Residual        | 55.07                      | 52.56 / 53.26 / 38.12                                | 84.06 / 78.54                         |
| GloVe bcLSTM        | 61.74                      | 52.77 / 53.85 / 39.27                                | 84.62 / 79.12                         |
| w/o Residual        | 58.32                      | 54.74 / 56.32 / 39.24                                | 84.10 / 78.98                         |
| GloVe DialogueRNN   | 62.57                      | 55.18 / 55.95 / 41.80                                | 84.71 / 79.60                         |
| w/o Residual        | 61.32                      | 54.50 / 55.29 / 40.05                                | 83.98 / 79.16                         |
| RoBERTa LogReg      | 54.12                      | 52.63 / 52.42 / 40.02                                | 82.55 / 75.62                         |
| RoBERTa bcLSTM      | 62.72                      | 56.05 / 56.77 / 43.26                                | 85.17 / 82.16                         |
| w/o Residual        | 62.86                      | 55.92 / 57.32 / 43.03                                | 86.35 / 80.69                         |
| RoBERTa DialogueRNN | 64.12                      | 59.07 / 59.50 / 45.19                                | 86.31 / 82.20                         |
| w/o Residual        | 63.96                      | 57.57 / 57.76 / 44.25                                | 86.28 / 82.08                         |

Table 3: Classification performance on test data for emotion prediction in IEMOCAP, emotion prediction in DailyDialog, and act prediction in DailyDialog. Scores of the GloVe-based models are averaged over 20 different runs; RoBERTa-based models were run 5 times and we report the average scores. Test F1 scores are calculated at the best validation F1 scores.

Unlike IEMOCAP and Persuasion for Good, in the DailyDialog and MultiWOZ datasets the improvement in performance from residual connections is small, which can be attributed to the relatively shorter dialogues in these two datasets.

Importance of Bidirectionality in the Models. The use of bidirectionality, without residual connections (as described in Section 3), in bcLSTM improves the score by 1-3% (refer to Table 3 and Table 4) across all the datasets. Contrary to IEMOCAP, we find that bidirectionality is less useful for emotion classification in the DailyDialog dataset, which comprises very short dialogues. The difference in performance between cLSTM and bcLSTM indicates the importance of bidirectionality in the models for capturing contextual information from future utterances while classifying utterance $u_t$. To get a better idea of this difference, one should compare the cLSTM results in Table 3 and Table 4 with row no. 33 in Table 7. Although the underlying model architectures in these two settings are different, they are comparable, as both attempt to measure the importance of future utterances while classifying the utterance at time t. From these tables, a conclusive trend emerges: future utterances are very important in the IEMOCAP dataset but much less so in the rest of the datasets. We conjecture that the long dialogues and the inter-utterance label dependency and consistency in the IEMOCAP dataset are the predominant factors behind this observation. In Section 6.7, we empirically confirm the existence of this label dependence and consistency in the IEMOCAP dataset.

Does the Use of Conversational Context Help? The bcLSTM and DialogueRNN architectures were primarily developed as emotion/sentiment classifiers for utterance-level dialogue classification tasks. To explain the performance of these models across a diverse range of dialogue classification tasks, it is important to understand the nature of the task itself. In emotion classification, labels are largely dependent on other contextual utterances.


| Methods             | MultiWOZ Intent (W-Avg F1) | Persuader (W-Avg F1 / Macro F1) | Persuadee (W-Avg F1 / Macro F1) |
|---------------------|----------------------------|---------------------------------|---------------------------------|
| GloVe CNN           | 84.30                      | 67.15 / 54.33                   | 58.00 / 41.03                   |
| GloVe cLSTM         | 95.03                      | 68.75 / 54.36                   | 59.46 / 41.62                   |
| w/o Residual        | 95.12                      | 64.62 / 49.08                   | 54.87 / 36.36                   |
| GloVe bcLSTM        | 96.14                      | 69.26 / 55.27                   | 61.18 / 42.19                   |
| w/o Residual        | 96.21                      | 67.20 / 52.75                   | 55.02 / 37.72                   |
| GloVe DialogueRNN   | 96.32                      | 68.96 / 56.29                   | 61.11 / 42.18                   |
| w/o Residual        | 96.08                      | 68.77 / 54.20                   | 58.72 / 39.06                   |
| RoBERTa LogReg      | 85.70                      | 71.98 / 60.36                   | 63.45 / 51.74                   |
| RoBERTa bcLSTM      | 95.46                      | 71.85 / 61.05                   | 64.14 / 50.11                   |
| w/o Residual        | 95.61                      | 71.06 / 58.72                   | 62.73 / 44.74                   |
| RoBERTa DialogueRNN | 95.61                      | 72.91 / 62.03                   | 64.33 / 49.22                   |
| w/o Residual        | 95.29                      | 72.45 / 60.49                   | 64.21 / 49.71                   |

Table 4: Classification performance on test data for intent prediction in MultiWOZ and persuader/persuadee strategy prediction in Persuasion for Good. Scores of the GloVe-based models are averaged over 20 different runs; RoBERTa-based models were run 5 times and we report the average scores. Test F1 scores are calculated at the best validation F1 scores.

In IEMOCAP, except for some cases of emotion shift (a sudden change from a positive group of emotions to a negative group, or vice versa), labels are correlated and often appear in continuation. bcLSTM and DialogueRNN can model this inter-relation between utterances and perform much better than the non-contextual models. For intent classification in MultiWOZ, the label dependency between utterances is even stronger. MultiWOZ is a task-oriented dialogue dataset between a user and a system. When the user has a query or an information need, all utterances tend to have the same intent label until the query is resolved. Hence, all contextual models can perform this task relatively easily (evident from the very high F1 scores). We also observe this trend in the DailyDialog emotion and act classification tasks. In both of these tasks, contextual information is helpful and provides a significant improvement in performance. However, in the Persuasion for Good dataset, the introduction of context is much less helpful. The persuader and persuadee strategy labels are relatively more difficult to model, even with contextual information.

Can a Speaker-specific Model like DialogueRNN Help in Varied Dialogue Classification Tasks? DialogueRNN is a speaker-specific model that distinguishes different speakers in a conversation by keeping track of individual speaker states. Identifying speaker-level context is fundamental for the task of emotion recognition in conversations (Ghosal et al., 2019; Zhang et al., 2019). It is thus expected that DialogueRNN would perform better than the non-contextual or bcLSTM models on this task, as evidenced in Table 3. Additionally, DialogueRNN produces the best or close to the best results on the other tasks as well. In intent classification and persuasion strategy classification, we only classify utterances coming from one of the parties (the user in MultiWOZ intent classification; the persuader or persuadee in persuasion strategy classification). This might explain why the performance of DialogueRNN is occasionally lower than bcLSTM on some of these tasks. Overall, however, it can be concluded that speaker-specific modelling is indeed important and helps in various dialogue classification tasks.

Variance in the Results. As deep learning models tend to yield varying results across multiple training runs, we trained each model multiple times and report the average scores in Table 3 and Table 4. In general, we observed that the RoBERTa-based models show less variance than the GloVe-based models.

Variance in the GloVe-based models: The observed variance is higher for emotion classification in IEMOCAP and DailyDialog than for act and intent classification in DailyDialog and MultiWOZ, respectively. The two baseline models, GloVe CNN and bcLSTM, show a standard deviation of about 1.28% on the IEMOCAP dataset across different runs. In the Persuasion for Good dataset, for both the persuader's and persuadee's act classification tasks, the deviation remains around 1.6% for the Macro F1 metric. For the Weighted F1 metric, however, the performance is relatively stable: across multiple runs, the standard deviation is about 0.99% for the baselines. A similar trend is also seen in the DailyDialog dataset for emotion classification. In this task, the GloVe CNN and bcLSTM baselines show a standard deviation of about 1.19% for Weighted F1 and Micro F1. For the Macro F1 metric, however, these baselines exhibit a relatively higher standard deviation of 2.88%. This is likely a consequence of the severe label imbalance in the dataset, which has 80% neutral utterances. We have observed that a majority of these neutral samples do not exhibit neutral emotion; this labeling quality may have precipitated the large variance in the results. On the other hand, the baseline models perform consistently on the intent and act classification tasks in the MultiWOZ and DailyDialog datasets, respectively, showing a standard deviation of around 0.55% across different runs. When comparing the baselines, we found higher variance in the results obtained with GloVe CNN than with bcLSTM.

One possible reason behind the variance in the results of the GloVe-based models could be the end-to-end training setup, which renders the model deeper. The original bcLSTM and DialogueRNN models employed a two-stage training method in which the utterance feature extractor is first pretrained and then kept unchanged during the contextual model training. This setting may make those original models more stable. Similarly, we think a more sophisticated training regime could reduce the variance of the results in our end-to-end setup. For example, the utterance feature extractor could be trained only for the first few epochs and then kept frozen during subsequent epochs. Due to this high variance in the end-to-end GloVe-based models, future work on these datasets and tasks that employs this setting should report average results over multiple runs for a fair comparison of models.

Variance in the RoBERTa-based models: The RoBERTa-based models show much less variance in performance across different runs. In particular, the standard deviations in the results of the RoBERTa-based bcLSTM are 0.57 on IEMOCAP; 0.08 and 0.48 on DailyDialog for the emotion and act classification tasks, respectively; 0.07 on MultiWOZ; and 0.9 and 1.04 on Persuasion for Good for the persuader's and persuadee's act classification tasks, respectively. RoBERTa-based DialogueRNN shows a similar trend. We surmise that this is because the feature extractor's weights are initialized from a pretrained checkpoint. The feature extractor thus already provides meaningful features from the beginning of training and does not need to be trained from scratch, resulting in greater stability of the performance.

5.1 Mini-Batch Formation Technique

To understand the sensitivity of training to the utterance distribution across mini-batches, we consider two scenarios (Fig. 11):

• Dialogue distribution across mini-batches,

• Utterance distribution across mini-batches.

The first scenario keeps all the constituent utterances of a dialogue in a single mini-batch, which allows classifying them using a single RNN pass with O(n) cost (n is the number of utterances in a dialogue); this is the default mode of operation for bcLSTM and DialogueRNN. In contrast, the second scenario may distribute the target utterances to be classified across different mini-batches. This leads to one RNN pass per utterance, which is computationally considerably costlier than the previous scenario, at O(n^2). Further, the representation of a particular context utterance is likely to vary between mini-batches due to the update of the network parameters by a fraction of their gradients per mini-batch. This prevents joint training of all the constituent utterances of a dialogue, precipitating poorer performance, as evidenced by Fig. 12. We found that when the batch size is small, baselines with dialogue-level mini-batches perform better than their counterparts with the utterance-level mini-batch configuration. However, the difference gradually reduces as the batch size increases. In fact, on several datasets (as shown in Fig. 12), with a batch size equal to or greater than 16, the performance difference between the two configurations is not significant.
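A sketch contrasting the two batching schemes (our illustration; a dialogue here is simply a sequence of utterances):

```python
def dialogue_level_batches(dialogues, batch_size):
    """Whole dialogues per batch: one RNN pass classifies all n utterances, O(n)."""
    for i in range(0, len(dialogues), batch_size):
        yield dialogues[i:i + batch_size]

def utterance_level_batches(dialogues, batch_size):
    """One (dialogue, target index) pair per item: the loss is computed only at the
    target position, so each utterance needs its own RNN pass over the context, O(n^2)."""
    targets = [(d, t) for d in dialogues for t in range(len(d))]
    for i in range(0, len(targets), batch_size):
        yield targets[i:i + batch_size]
```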


Figure 11: Two different types of mini-batch construction from the input dialogues. Faded utterances in utterance-based mini-batches indicate context utterances, which are not present in the calculation of loss.

Key Consideration. When comparing two models tasked with utterance classification in dialogues, the mini-batch construction process should be paid careful attention. As explained above, the batch size is a crucial hyperparameter when juxtaposing a model with utterance-level mini-batches against a model with dialogue-level mini-batches. While we acknowledge that most GloVe embedding-based models tend to be small, so using a larger batch size for utterance-level mini-batch construction is not a problem for them, this can be a major issue for contextualized word embedding-based models such as BERT. Models like BERT have hundreds of millions of parameters, and using large batch sizes to train them can be a bottleneck. In such scenarios, we recommend using dialogue-level mini-batches, which can still perform decently with relatively small batch sizes, as can be seen in Fig. 12, while being computationally inexpensive compared to utterance-level mini-batches.

The experiments conducted in Section 6 are based on utterance-based mini-batch training. We make different contextual modifications for each target utterance in a dialogue; hence, utterance-based training and evaluation is necessary.

6 Analysis

We set up different scenarios to probe the GloVe bcLSTM and GloVe CNN baselines, because these two are conceptually much more straightforward than DialogueRNN. For example, in addition to context, DialogueRNN also tracks the speaker states based on the utterances. Thus, perturbations in the input would influence speaker modeling along with context modeling. This may result in more complex deviations than for bcLSTM, which are more difficult to analyze. Simple models are likely to be more interpretable.


Figure 12: Sensitivity of GloVe bcLSTM and GloVe CNN to Utterance Distribution across Mini-Batches.

For instance, owing to DialogueRNN's complexity, we would need to perform several levels of ablation studies, such as removing the speaker GRU and adding or removing the listener state, to explain its behavior.

We stick to GloVe word embeddings because RoBERTa-based word embeddings are trained with the masked language model (MLM) objective, which is already very powerful at modeling cross-sentential context representation, as demonstrated by other work (Liu et al., 2019; Lewis et al., 2019). Hence, to conduct a fair comparison between non-contextual and contextual models, and to more easily understand the role of contextual information in utterance-level dialogue understanding, we resort to the GloVe CNN model. Additionally, GloVe CNN is computationally more efficient. In all the experiments described from Section 6.3 onward, we use the utterance-level mini-batch setting, as it allows the contextual baseline model (bcLSTM) to accommodate context corruption and alteration. In the following subsections, we use GloVe bcLSTM and bcLSTM interchangeably.

For the analyses requiring training, we trained all the models at least 5 times and report the mean test score. For the remaining analyses, we evaluated the test results using models saved at checkpoints of different runs and report the average scores. The trends in these results, delineated in the following sections, were found to be consistent across these checkpoints; that is, although models from different runs yield varied performance on the original test data, they behave similarly under the same input perturbation. As most of the results presented in the following sections are obtained from experiments conducted in the utterance-level mini-batch setting, there is a disparity between these tables and Tables 3 and 4. In the utterance-level mini-batch setup, we obtain better performance on some of the tasks, as reported in the subsequent sections. Readers should refer to Tables 3 and 4 for the baseline results, which were obtained by averaging the outcomes of more than 20 runs. The rows in the following tables where no perturbation was applied to the inputs can be used as reference points to analyze the effect of input perturbations.

6.1 Classification in Shuffled Context

To analyze the importance of context, we shuffle the utterance order of a dialogue and try to predict the correct labels from the shuffled utterance sequence. For example, a dialogue with utterance sequence {u1, u2, u3, u4, u5} may be shuffled to {u5, u1, u4, u2, u3}. This shuffling is carried out randomly, resulting in an utterance sequence whose order differs from the original. We design three such shuffling experiments: i) dialogues in the train and validation sets are shuffled, while dialogues in the test set are kept unchanged; ii) dialogues in the train and validation sets are kept unchanged, while dialogues in the test set are shuffled; iii) dialogues in the train, validation, and test sets are all shuffled.


Train | Val | Test | OP | IEMOCAP Emotion (W-Avg F1) | DailyDialog Emotion (W-Avg F1) | DailyDialog Emotion (Micro F1) | DailyDialog Emotion (Macro F1) | DailyDialog Act (W-Avg F1) | DailyDialog Act (Macro F1)
✗ | ✗ | ✗ | ✗ | 61.74 | 52.77 | 53.85 | 39.27 | 84.62 | 79.46
✓ | ✓ | ✗ | ✗ | 59.74 | 52.29 | 51.97 | 36.87 | 81.82 | 74.88
✗ | ✗ | ✓ | ✗ | 57.63 | 48.35 | 50.32 | 34.58 | 76.65 | 66.81
✓ | ✓ | ✓ | ✗ | 59.82 | 52.17 | 52.92 | 37.69 | 81.84 | 74.62
✓ | ✓ | ✓ | ✓ | 59.47 | 49.16 | 51.67 | 32.53 | 81.29 | 73.83

Table 5: Performance of GloVe bcLSTM models on IEMOCAP and DailyDialog for various shuffling strategies. In the Train, Val, and Test columns, ✓ denotes shuffled context and ✗ denotes unchanged context. In the OP column, ✓ denotes the additional order prediction objective.

Train | Val | Test | OP | MultiWOZ Intent (W-Avg F1) | Persuader (W-Avg F1) | Persuader (Macro F1) | Persuadee (W-Avg F1) | Persuadee (Macro F1)
✗ | ✗ | ✗ | ✗ | 96.14 | 69.26 | 55.27 | 61.18 | 42.19
✓ | ✓ | ✗ | ✗ | 91.34 | 68.06 | 54.91 | 59.27 | 41.52
✗ | ✗ | ✓ | ✗ | 67.91 | 65.30 | 50.69 | 55.07 | 37.17
✓ | ✓ | ✓ | ✗ | 90.78 | 66.32 | 53.60 | 58.46 | 40.96
✓ | ✓ | ✓ | ✓ | 90.67 | 67.60 | 53.50 | 58.69 | 41.62

Table 6: Performance of GloVe bcLSTM models on MultiWOZ and Persuasion for Good for various shuffling strategies. In the Train, Val, and Test columns, ✓ denotes shuffled context and ✗ denotes unchanged context. In the OP column, ✓ denotes the additional order prediction objective.


We analyze these shuffling strategies with the GloVe bcLSTM model. In theory, the recurrent nature of bcLSTM allows it to model contextual information from the very beginning of the utterance sequence to the very end. However, when classifying an utterance, the most crucial contextual information comes from the neighbouring utterances. With an altered utterance order, the model finds it difficult to predict the correct labels, because the original neighbouring utterances may no longer be in the immediate context after shuffling. This kind of perturbation makes context modelling less effective, so performance is likely to drop compared to the non-shuffled counterparts. This is shown empirically in Table 5 and Table 6.

We observe that whenever there is some shuffling in the train, validation, or test set, performance decreases by a few points in both datasets, across all tasks and all evaluation metrics. Notably, the performance drop is largest when the dialogues in the train and validation sets are kept unchanged and the dialogues in the test set are shuffled.

6.2 Classification in Shuffled Context with Order Prediction

In some of the shuffling strategies, we add an auxiliary utterance order prediction (OP) objective to see how it affects the results. We hypothesize that if the network learns to predict how the original order of the utterances has been shuffled, this may also improve the main utterance-level dialogue classification task. In this setup, the order prediction objective is realized through the same bcLSTM network with an additional fully connected layer with softmax activation on top. In the previous example, when {u1, u2, u3, u4, u5} is shuffled to {u5, u1, u4, u2, u3}, the additional objective is to predict the original sequence order as class labels (in the new fully connected softmax layer); here, the order labels to be predicted are {5, 1, 4, 2, 3}, respectively. The network is thus trained jointly with the utterance order prediction objective and the main classification objective (emotion, act, intent, or persuasion strategy prediction), as sketched below.
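The shuffling and the derivation of the OP class labels can be sketched as follows; this is an illustrative reconstruction, not the authors' code:

```python
import random

def shuffle_with_order_labels(utterances):
    """Randomly permute a dialogue and return, for each shuffled
    position, the original (1-based) index as the OP target."""
    order = random.sample(range(len(utterances)), len(utterances))
    shuffled = [utterances[i] for i in order]
    op_labels = [i + 1 for i in order]
    return shuffled, op_labels

dialogue = ["u1", "u2", "u3", "u4", "u5"]
shuffled, labels = shuffle_with_order_labels(dialogue)
# e.g. shuffled = [u5, u1, u4, u2, u3] -> labels = [5, 1, 4, 2, 3]
```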


We report results for this additional objective with shuffling strategy iii), i.e., the train, validation, and test sets are all shuffled. Results are reported in Table 5 and Table 6. In most of the tasks, the additional order prediction objective does not help; we only observe some improvement in the strategy classification tasks in Persuasion for Good.

6.3 Controlled Context Dropping

In Table 3 and Table 4, we observe a large improvement in performance for the contextual models (bcLSTM, DialogueRNN) over the non-contextual CNN models. We now analyze why this improvement occurs and how recurrent models such as bcLSTM use contextual dialogue information from past and future utterances effectively.

To understand this effect, we make a comprehensive analysis of the GloVe bcLSTM model and design an experimental study with controlled context dropping. In the default setting of bcLSTM, the full dialogue history and future are available to the model. Through a number of different experimental settings, we vary and limit the contextual information available to the model and study how the results are affected. Please refer to Fig. 13 for a visual representation of the controlled context dropping method.

Method. For each target utterance ut in a dialogue, we control the contextual information available to it as follows (a code sketch follows the list):

• We control the availability of contextual utterances from the past, the future, or both.

• While controlling the availability of only past utterances, we vary it in one of the following ways:

– Dropping the 5 previous utterances; in other words, access only to u0, ..., ut−6.

– Dropping all previous utterances.

– Dropping all previous utterances except the previous 5; in other words, access only to ut−5, ..., ut−1.

• Similarly, we modify utterance access while controlling the future, or both the past and the future.

• We vary the control over past and future utterances in different combinations during training (train and validation sets) and evaluation (test set).
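A minimal sketch of the context-control operator, using the mode notation of Table 7 ('-5', '-ALL', '+5', or unrestricted); this is an illustration under our reading of the setup, not the authors' code:

```python
def control_context(dialogue, t, past=None, future=None):
    """Return the visible utterances for target index t. Modes: '-5'
    drops the immediate 5 utterances, '-ALL' drops all, '+5' keeps only
    the immediate 5, None keeps everything (per side)."""
    def crop(utts, mode, immediate_at_end):
        if mode == "-ALL":
            return []
        if mode is None:
            return list(utts)
        keep_only = (mode == "+5")
        if immediate_at_end:  # past side: immediate = last 5 utterances
            return utts[-5:] if keep_only else utts[:-5]
        return utts[:5] if keep_only else utts[5:]  # future side

    past_ctx = crop(dialogue[:t], past, immediate_at_end=True)
    future_ctx = crop(dialogue[t + 1:], future, immediate_at_end=False)
    return past_ctx + [dialogue[t]] + future_ctx
```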

Observations. We report the results for the controlled context dropping experiments in Table 7. We observe that long-distance context is much more important for emotion classification in IEMOCAP than for the rest of the tasks. Row 48 in Table 7 refers to the experimental setting where the full conversational context is used during training, but only 5 contextual utterances from the past and future are used on the test set during evaluation. In this setting, the performance drop in IEMOCAP emotion classification is significantly larger than in the other tasks, indicating the importance of long-distance contextual utterances.

Dropping all the future utterances further worsens the results on all the tasks; the reduction in performance is most significant in IEMOCAP (Row 47). We also observe some interesting results for the setting of training on full context and evaluating with all past context removed (Row 44) or all future context removed (Row 47). For DailyDialog emotion and act classification, there is a significant difference in performance between these two context control settings: completely removing the past context results in much poorer performance, signifying the importance of contextual information flowing from past utterances in DailyDialog.

The configuration in Row 12 removes all past utterances and keeps all future utterances during training; during evaluation, however, all future utterances are discarded and predictions are based only on past utterances. Row 30 follows a similar setup, where training uses only past utterances but evaluation uses only future utterances. This contextual disparity between training and evaluation causes the bcLSTM model to perform very poorly across all the tasks.

The Persuasion for Good dataset contains dialogues with a large number of utterances. However, we found in our experiments that a window of 5 contextual utterances is generally sufficient to produce good results (Table 7, Row 41). From Table 7, we also observe that the bcLSTM model performs better than the GloVe CNN model on all the datasets apart from Persuasion for Good under the various context control configurations. For strategy classification in Persuasion for Good, any perturbation beyond a window of 5 exposes the model to noise and causes performance to drop below the GloVe CNN baseline.



# | Train/Val Past | Train/Val Future | Test Past | Test Future | IEMOCAP Emotion | DailyDialog Emotion | DailyDialog Act | MultiWOZ Intent | Persuasion ER | Persuasion EE
1 | -5 | – | -5 | – | 60.45 | 37.38 | 73.11 | 93.97 | 54.54 | 40.44
2 | -5 | – | -ALL | – | 58.05 | 36.04 | 72.79 | 93.18 | 53.41 | 40.38
3 | -5 | – | -5 | -5 | 58.29 | 34.91 | 71.41 | 76.92 | 53.53 | 40.0
4 | -5 | – | – | -5 | 58.65 | 34.78 | 70.54 | 78.09 | 53.37 | 40.1
5 | -5 | – | – | -ALL | 50.94 | 35.05 | 70.83 | 85.72 | 51.18 | 38.11
6 | -5 | – | +5 | +5 | 55.92 | 37.63 | 72.92 | 94.06 | 53.12 | 39.09
7 | -5 | – | – | – | 60.64 | 37.09 | 72.7 | 94.51 | 54.62 | 41.56
8 | -ALL | – | -5 | – | 57.77 | 36.67 | 73.09 | 93.37 | 52.68 | 38.68
9 | -ALL | – | -ALL | – | 57.39 | 36.84 | 73.13 | 93.39 | 53.97 | 41.02
10 | -ALL | – | -5 | -5 | 55.23 | 34.34 | 71.22 | 73.28 | 51.65 | 38.62
11 | -ALL | – | – | -5 | 55.8 | 33.73 | 71.27 | 75.6 | 51.43 | 38.89
12 | -ALL | – | – | -ALL | 45.43 | 33.72 | 71.5 | 83.11 | 46.83 | 40.17
13 | -ALL | – | – | +5 | 53.9 | 36.66 | 72.96 | 93.56 | 53.9 | 41.06
14 | -ALL | – | – | – | 58.11 | 36.23 | 72.99 | 93.68 | 52.81 | 40.27
15 | -5 | -5 | -5 | – | 59.89 | 34.13 | 72.65 | 88.66 | 53.94 | 39.66
16 | -5 | -5 | -ALL | – | 55.34 | 33.13 | 72.3 | 84.76 | 51.49 | 38.58
17 | -5 | -5 | -5 | -5 | 59.31 | 36.28 | 72.67 | 88.51 | 53.73 | 40.34
18 | -5 | -5 | – | -5 | 59.26 | 36.94 | 72.73 | 89.09 | 53.52 | 40.91
19 | -5 | -5 | – | -ALL | 53.88 | 36.5 | 72.66 | 84.93 | 54.38 | 38.83
20 | -5 | -5 | +5 | +5 | 55.91 | 34.68 | 72.48 | 86.04 | 53.27 | 39.54
21 | -5 | -5 | – | – | 60.1 | 34.78 | 72.59 | 90.24 | 53.91 | 40.03
22 | – | -5 | -5 | – | 58.16 | 31.35 | 62.56 | 78.32 | 54.24 | 37.38
23 | – | -5 | -ALL | – | 52.45 | 31.3 | 60.51 | 81.57 | 55.81 | 37.77
24 | – | -5 | -5 | -5 | 58.14 | 33.21 | 62.62 | 78.66 | 54.73 | 38.82
25 | – | -5 | – | -5 | 60.79 | 38.54 | 79.11 | 95.13 | 55.57 | 40.34
26 | – | -5 | – | -ALL | 54.41 | 39.34 | 79.08 | 95.1 | 55.59 | 42.79
27 | – | -5 | +5 | +5 | 56.23 | 37.5 | 78.92 | 94.81 | 55.73 | 39.45
28 | – | -5 | – | – | 60.86 | 37.1 | 79.08 | 95.11 | 55.35 | 40.25
29 | – | -ALL | -5 | – | 53.41 | 33.82 | 62.41 | 77.66 | 53.05 | 39.05
30 | – | -ALL | -ALL | – | 44.34 | 32.74 | 60.47 | 79.51 | 53.43 | 40.63
31 | – | -ALL | -5 | -5 | 53.67 | 34.43 | 62.35 | 77.16 | 52.91 | 39.05
32 | – | -ALL | – | -5 | 56.58 | 40.22 | 78.35 | 95.06 | 55.37 | 41.4
33 | – | -ALL | – | -ALL | 56.66 | 40.02 | 78.42 | 95.09 | 56.0 | 42.41
34 | – | -ALL | +5 | +5 | 53.46 | 39.96 | 78.43 | 94.75 | 55.61 | 41.83
35 | – | -ALL | – | – | 56.61 | 40.63 | 78.4 | 95.09 | 56.01 | 41.65
36 | +5 | +5 | -5 | – | 57.62 | 34.58 | 63.44 | 88.5 | 53.63 | 41.81
37 | +5 | +5 | -ALL | – | 53.41 | 33.95 | 60.95 | 90.35 | 54.42 | 44.14
38 | +5 | +5 | -5 | -5 | 56.77 | 33.54 | 62.64 | 73.27 | 52.66 | 39.23
39 | +5 | +5 | – | -5 | 58.73 | 38.43 | 78.21 | 90.03 | 55.02 | 44.35
40 | +5 | +5 | – | -ALL | 55.03 | 38.4 | 77.68 | 93.01 | 51.26 | 41.51
41 | +5 | +5 | +5 | +5 | 58.62 | 39.94 | 79.56 | 95.86 | 56.06 | 45.46
42 | +5 | +5 | – | – | 60.01 | 39.93 | 79.71 | 95.87 | 55.21 | 45.72
43 | – | – | -5 | – | 59.17 | 36.27 | 63.24 | 87.55 | 53.77 | 43.19
44 | – | – | -ALL | – | 53.86 | 35.69 | 60.97 | 89.59 | 54.49 | 44.29
45 | – | – | -5 | -5 | 57.64 | 34.79 | 62.37 | 71.09 | 54.0 | 39.68
46 | – | – | – | -5 | 59.5 | 39.6 | 77.44 | 89.82 | 55.46 | 43.85
47 | – | – | – | -ALL | 52.56 | 39.37 | 76.9 | 93.47 | 54.84 | 42.99
48 | – | – | +5 | +5 | 57.31 | 40.74 | 78.87 | 95.81 | 54.79 | 45.25
49 | – | – | – | – | 61.9 | 41.16 | 79.46 | 96.22 | 56.28 | 44.83

Table 7: Results for the controlled context dropping experiments in different settings. ER and EE denote Persuader and Persuadee strategy classification, respectively. In the past (future) columns, -5 means dropping the immediate 5 utterances from the past (future), -ALL means dropping all utterances from the past (future), +5 means keeping only the immediate 5 utterances from the past (future), and – means keeping all utterances from the past (future). Scores are W-Avg F1 for IEMOCAP Emotion and MultiWOZ Intent; Macro F1 for the rest.

In any conversational classification setup, past contextual information is elemental for recurrent models to understand the flow of the dialogue. Additionally, from the results in Table 7, we can also conclude that future utterances provide key contextual information for the various classification tasks. If the bcLSTM model is not trained on the full context, there is a significant drop in performance even when evaluating on the full context (Rows 14, 35).

6.4 Speaker-specific Context Control

To further evaluate intra- and inter-speaker dependence and relations across the different tasks, we adopt the following two settings:

• w/o inter: when classifying a target utterance from speaker A, we drop the utterances of speaker B from the context, and vice versa.

• w/o intra: when classifying a target utterance from speaker A, we keep only the utterances of speaker B and drop all other utterances of speaker A from the context, and vice versa.


Figure 13: Illustration of controlled context drops of three utterances.


These two settings are visually illustrated in Fig. 14 and sketched in code below.
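The two settings reduce to a simple filter over the context; the sketch below is illustrative (dialogues as (speaker, utterance) pairs), not the authors' implementation:

```python
def speaker_filtered_context(dialogue, t, setting):
    """Keep the target utterance and, depending on the setting, only
    same-speaker ('w/o inter') or only other-speaker ('w/o intra')
    context utterances."""
    target_speaker = dialogue[t][0]
    kept = []
    for i, (speaker, utt) in enumerate(dialogue):
        if i == t:
            kept.append((speaker, utt))          # target is always kept
        elif setting == "w/o inter" and speaker == target_speaker:
            kept.append((speaker, utt))          # same-speaker context only
        elif setting == "w/o intra" and speaker != target_speaker:
            kept.append((speaker, utt))          # other-speaker context only
    return kept
```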

Utterances of the Non-target Speaker are Important. The first setting coerces bcLSTM to rely only on the target speaker's (the speaker of the target utterance) context for prediction. The results are reported in Tables 8 and 9. As expected, performance drops are observed for all the datasets except for emotion recognition in IEMOCAP and DailyDialog, reinforcing the fact that contextual utterances from the non-target speaker are important. The performance drop for act classification on DailyDialog is noticeably the steepest. To understand the behavior on the IEMOCAP dataset, we refer to Figs. 8 and 9. In Fig. 9, all the diagonal cells have high intensity, indicating a pattern of speakers maintaining the same emotion along a dialogue. On the other hand, Fig. 8 illustrates inter-speaker label transitions that maintain emotional consistency: the emotion of a speaker is reciprocated by the same emotion, or another emotion of the same sentiment, from the non-target speaker. Notably, Fig. 9 shows distinctly higher density in its diagonal cells than Fig. 8, suggesting that speakers in IEMOCAP tend to repeat the same emotion across consecutive utterances. This tendency overwhelms the inter-speaker patterns depicted in Fig. 8 and induces a dataset bias. Hence, removing the other interlocutor's utterances from the context makes it easier and less confusing for the bcLSTM model to learn relevant contextual representations for prediction. In contrast, although repetitions of the same or similar emotions in consecutive utterances of a speaker do exist in DailyDialog, they are less prevalent; hence, the 'w/o inter' setting does not improve emotion recognition performance there as much as it does for IEMOCAP.


Methods | IEMOCAP Emotion (W-Avg F1) | DailyDialog Emotion (W-Avg F1) | DailyDialog Emotion (Micro F1) | DailyDialog Emotion (Macro F1) | DailyDialog Act (W-Avg F1) | DailyDialog Act (Macro F1)
GloVe CNN | 52.04 | 49.36 | 50.32 | 36.87 | 80.71 | 72.07
GloVe bcLSTM | 61.74 | 52.77 | 53.85 | 39.27 | 84.62 | 79.12
w/o inter | 63.73 | 52.39 | 52.86 | 39.99 | 81.32 | 74.50
w/o intra | 56.45 | 52.81 | 53.54 | 35.93 | 83.80 | 78.69

Table 8: Classification performance on test data for emotion prediction in IEMOCAP, emotion prediction in DailyDialog, and act prediction in DailyDialog. Utterances from other speakers and from the same speaker are absent in the w/o inter and w/o intra settings, respectively. All scores are averages of 20 different runs. Test F1 scores are calculated at the best validation F1 scores.

Methods | MultiWOZ Intent (W-Avg F1) | Persuader (W-Avg F1) | Persuader (Macro F1) | Persuadee (W-Avg F1) | Persuadee (Macro F1)
GloVe CNN | 84.30 | 67.15 | 54.45 | 58.00 | 41.03
GloVe bcLSTM | 96.14 | 69.26 | 55.27 | 61.18 | 42.19
w/o inter | 95.05 | 67.81 | 53.24 | 59.44 | 40.63
w/o intra | 95.75 | 66.06 | 52.23 | 58.65 | 38.93

Table 9: Classification performance on test data for intent prediction in MultiWOZ and persuader and persuadee strategy prediction in Persuasion for Good. Utterances from other speakers and from the same speaker are absent in the w/o inter and w/o intra settings, respectively. All scores are averages of 20 different runs. Test F1 scores are calculated at the best validation F1 scores.

Utterances of the Target Speaker are also Important. The 'w/o intra' scenario reported in Tables 8 and 9 also exhibits the importance of the non-target speaker's utterances in classifying the target utterance. In DailyDialog act and MultiWOZ intent classification, even when we remove the contextual utterances of the same speaker, the utterances from the non-target speaker provide key contextual information, as evidenced by the performance in the 'w/o intra' setting. In those tasks, dropping the utterances of the non-target speaker degrades performance more than dropping the utterances of the target speaker from the target utterance's context. This observation also supports dialogue generation work (Zhou et al., 2017) that mainly considers previous utterances of the non-target speaker as the context for response generation. For emotion classification in DailyDialog and strategy classification in Persuasion for Good, the results in the 'w/o intra' setting are also lower than the baseline bcLSTM setting, confirming the higher contextual salience of the target speaker's utterances over the non-target speaker's utterances for these particular tasks. In the case of IEMOCAP emotion classification, removing the target speaker's utterances from the context causes a substantial performance dip, for the reasons stated in the previous paragraph.

Interestingly, the "w/o inter" setting on the DailyDialog dataset manifests two distinct trends for the two tasks, act classification and emotion recognition: while the non-target speaker's utterances carry little value for emotion recognition, they are extremely beneficial for act classification. This calls for task-specific context modeling techniques, which should be a focus of future work.

The Key Takeaways of this Experiment. Although both target and non-target speakers' utterances are useful in several utterance-level dialogue understanding tasks, we observe divergent trends in some of the tasks used in our experiments. Hence, we surmise that a task-agnostic unified context model may not be optimal for solving all the tasks. In the future, we should strive for task-specific contextual models, as each task can have unique features that distinguish it from the others.


Figure 14: Speaker-specific context control schemes.

One can also think of multi-task architectures in which two tasks corroborate each other to improve overall utterance-level dialogue understanding performance.

Logically, dropping contextual utterances in a dialogue leads to inconsistency in the context and should consequently degrade the performance of a model that relies on the context for inference. Hence, given an unmodified dialogue flow, an ideal contextual model is expected to draw on the right amount of contextual utterances relevant to inferring the label of a target utterance. In contrast, bcLSTM shows a performance improvement for emotion classification when utterances from the non-target speaker are dropped (see the "w/o inter" row in Table 8). Likewise, performance does not change much for dialogue act and intent classification on the DailyDialog and MultiWOZ datasets, respectively, when we drop utterances of the target speaker. These contrasting results indicate a potential drawback of the bcLSTM model in efficiently utilizing contextual utterances of both interlocutors in unmodified dialogues for the above-mentioned tasks.

6.5 Context Flipping using Style Transfer

In emotion classification tasks, the emotion of the contextual utterances plays a vital role in determining the emotion of the target utterance. We can study this effect quantitatively using a method known as textual style transfer.

Figure 15: Predictions under context flipping using style transfer.

Style transfer of text is defined as transferring a piece of text (generally a single sentence or a short paragraph) from one domain (style) to another while preserving the underlying content. In particular, we use sentiment style transfer, which flips the sentiment of the input text while preserving the main semantic content. For example, the input sentence "Hey, I've always gone hunting or fishing on vacation. I am sorry that bothers you." with negative sentiment would be transferred to "I have always gone here on vacation and had great fishing experience." with positive sentiment.


# | Past | Future | Target | Window | IEMOCAP Emotion | DailyDialog Emotion
1 | ✗ | ✗ | ✗ | - | 61.9 | 41.16
2 | ✓ | ✗ | ✗ | 3 | 59.47 | 36.69
3 | ✓ | ✗ | ✗ | 5 | 58.28 | 36.13
4 | ✓ | ✗ | ✗ | 10 | 55.58 | -
5 | ✗ | ✓ | ✗ | 3 | 60.76 | 38.52
6 | ✗ | ✓ | ✗ | 5 | 59.92 | 38.12
7 | ✗ | ✓ | ✗ | 10 | 57.16 | -
8 | ✓ | ✓ | ✗ | 3 | 56.35 | 34.51
9 | ✓ | ✓ | ✗ | 5 | 51.96 | 33.20
10 | ✓ | ✓ | ✗ | 10 | 46.28 | -

Table 10: Results for emotion classification in IEMOCAP (Weighted F1) and DailyDialog (Macro F1) with context flipping using style transfer. ✓ denotes that the corresponding utterances undergo style transfer. In DailyDialog, we constrain the window size to 3 and 5, as there is an average of 8 utterances per dialogue in the dataset. We do not change the style of the target utterance, as that would invalidate the evaluation against the gold emotion class.

Sentiment style transfer is used because sentiment is closely related to emotion, and ample parallel data is available between the positive and negative sentiment styles in both directions.

The main objective of this study is to analyze how the affect or emotion label orientation of the contextual utterances influences utterance-level emotion classification. We devise a method based on the sentiment style transfer technique to study this effect systematically.

Method. The YELP reviews dataset (Li et al., 2018) contains parallel sentences between different sentiment styles. We first fine-tune a pretrained T5 seq2seq model (Raffel et al., 2019) on the parallel source and target sentences in the dataset using teacher forcing via maximum likelihood estimation. After training, this model can change the sentiment of an input sentence from positive to negative or vice versa while preserving the main semantic content. We call this model the Style Transfer model.

To apply this model to our datasets, we consider the positive group of emotions (happy, joy, excited) to correspond to positive sentiment, and the negative group of emotions (sad, angry, frustrated, disgust, fear) to correspond to negative sentiment. We train an utterance-based bcLSTM model on unchanged train and validation data. During evaluation, we modify the test data using the following method (a code sketch follows the list):

• For each target utterance ut in a dialogue, we take a window of w neighbouring utterances (context) in the past, the future, or both.

• If a contextual utterance belongs to the positive (negative) emotion group, we flip its style to negative (positive) using the Style Transfer model. The contextual utterance is kept unchanged if it belongs to the neutral emotion category.

• The target utterance is kept unchanged.
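A sketch of this evaluation-time modification is given below. `flip_sentiment` is a stand-in for the fine-tuned T5 Style Transfer model described above; the emotion-to-sentiment grouping follows the paper, while everything else is illustrative:

```python
POSITIVE = {"happy", "joy", "excited"}
NEGATIVE = {"sad", "angry", "frustrated", "disgust", "fear"}

def flip_context(dialogue, labels, t, w, flip_sentiment,
                 past=True, future=True):
    """Rewrite non-neutral context utterances within the window around
    target index t; the target utterance itself stays intact."""
    out = list(dialogue)
    for i in range(len(dialogue)):
        in_past = past and t - w <= i < t
        in_future = future and t < i <= t + w
        if (in_past or in_future) and labels[i] in POSITIVE | NEGATIVE:
            out[i] = flip_sentiment(dialogue[i])  # positive <-> negative
    return out
```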

Observations. We analyze the results with window sizes of 3, 5, and 10 in IEMOCAP and 3 and 5 in DailyDialog in Table 10. In IEMOCAP, style transfer over progressively larger windows results in progressively bigger performance drops; with a window size of 10 in both the past and future directions, the F1-score of 46.28 is around 15 points lower than the original score of 61.9. In DailyDialog, the very high frequency of neutral labels ensures that most contextual utterances remain unchanged. We nevertheless observe a drop in performance across the various settings in Table 10. The drop when we modify past utterances is larger than when we modify future utterances, suggesting that past utterances are relatively more important for emotion classification.

The context style transfer method keeps the content of the target and contextual utterances unaltered while reversing the affect or label orientation of the contextual utterances. This process results in significantly poorer performance on both emotion classification tasks, suggesting that the label orientation of the contextual utterances plays a vital role in overall classification performance. We strengthen this label dependency hypothesis with more extensive experiments in Section 6.7.



6.6 Attacks with Context and Target Paraphrasing

Modern machine learning systems are often susceptible to attacks that slightly perturb the input without drastically changing its semantics. Although most prevalent for images, adversarial examples also exist for neural network-based NLP applications. In the context of NLP, crafting adversarial examples requires making character-, word-, or sentence-level modifications to the input text to trick the classifier into misclassification. Paraphrasing sentences is one such method for constructing effective adversarial examples (Iyyer et al., 2018). We conduct several experiments to evaluate the sensitivity of utterance-level dialogue understanding systems to input paraphrasing. Although task-specific adversarial strategies could be adopted, we chose a general set of attacking strategies in order to understand the behavior of the baselines across different tasks and datasets. This also facilitates a fair comparison among the tasks and reveals whether a confounding factor differentiates one task from another under the same attacking strategies.

Figure 16: Predictions under modified context with Paraphrasing-based Attack.

Method. We use the following scheme to analyze this effect (a code sketch follows the list):

• The input utterances are modified at either the word or the character level.

Figure 17: Predictions under modified context with Spelling-based Attack.

– For word-level modification, an average of 3 to 4 words are selected per utterance and masked. The pretrained RoBERTa model is then used to fill the masks with the most likely candidates; the utterance with the substituted words forms the new input. We call this method Paraphrasing-based Attack.

– The character-level modification is achieved by changing the spelling of, on average, 3 to 4 words per input utterance. We call this method Spelling-based Attack.

• For each utterance ut in a dialogue, we take a window of w immediate neighbouring utterances (context) on which the above modifications are performed. The window is selected as follows:

– Only the past w utterances: ut−w, ..., ut−1

– Only the future w utterances: ut+1, ..., ut+w

– Past w and future w utterances: ut−w, ..., ut−1, ut+1, ..., ut+w

– Past w, future w, and the target utterance: ut−w, ..., ut−1, ut, ut+1, ..., ut+w

– Only the target utterance: ut

In the last case, the window is empty. In the other cases, we experiment with window sizes w = 3, 5, 10.
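The two attacks can be sketched as below. The word-level attack uses a RoBERTa fill-mask head via the Hugging Face transformers pipeline; the masking rate, word selection, and helper names are illustrative, not the authors' exact procedure:

```python
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

def paraphrase_attack(utterance, n_words=3):
    """Paraphrasing-based Attack: mask a few words and substitute
    RoBERTa's most likely candidates, one word at a time."""
    words = utterance.split()
    for idx in random.sample(range(len(words)), min(n_words, len(words))):
        original = words[idx]
        words[idx] = fill_mask.tokenizer.mask_token
        prediction = fill_mask(" ".join(words))[0]["token_str"].strip()
        words[idx] = prediction or original
    return " ".join(words)

def spelling_attack(utterance, n_words=3):
    """Spelling-based Attack: corrupt a few words by swapping two
    adjacent interior characters."""
    words = utterance.split()
    for idx in random.sample(range(len(words)), min(n_words, len(words))):
        w = words[idx]
        if len(w) > 3:
            j = random.randrange(1, len(w) - 2)
            words[idx] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)
```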


# | Method | Past | Future | Target | Window | IEMOCAP Emotion | DailyDialog Emotion | DailyDialog Act | MultiWOZ Intent | Persuasion ER | Persuasion EE
1 | - | - | - | - | - | 61.9 | 41.16 | 79.46 | 96.22 | 56.28 | 44.83
2 | PA | ✓ | ✗ | ✗ | 3 | 61.09 | 40.82 | 75.81 | 95.67 | 56.46 | 43.64
3 | PA | ✓ | ✗ | ✗ | 5 | 60.93 | 38.79 | 77.23 | 95.53 | 56.41 | 41.93
4 | PA | ✓ | ✗ | ✗ | 10 | 59.83 | - | - | 95.23 | 54.89 | 39.89
5 | PA | ✗ | ✓ | ✗ | 3 | 61.58 | 39.6 | 79.11 | 95.94 | 55.83 | 43.21
6 | PA | ✗ | ✓ | ✗ | 5 | 60.99 | 39.77 | 79.17 | 95.64 | 55.43 | 40.67
7 | PA | ✗ | ✓ | ✗ | 10 | 60.72 | - | - | 95.77 | 57.12 | 43.36
8 | PA | ✓ | ✓ | ✗ | 3 | 59.43 | 37.16 | 76.61 | 94.87 | 57.44 | 42.51
9 | PA | ✓ | ✓ | ✗ | 5 | 58.36 | 38.76 | 76.53 | 94.61 | 53.32 | 43.33
10 | PA | ✓ | ✓ | ✗ | 10 | 57.29 | - | - | 94.31 | 54.36 | 43.8
11 | PA | ✗ | ✗ | ✓ | - | 58.08 | 37.16 | 75.3 | 93.78 | 50.24 | 38.78
12 | PA | ✓ | ✓ | ✓ | 3 | 56.53 | 23.46 | 73.16 | 91.47 | 47.5 | 37.39
13 | PA | ✓ | ✓ | ✓ | 5 | 53.64 | 28.59 | 73.18 | 90.98 | 45.31 | 35.16
14 | PA | ✓ | ✓ | ✓ | 10 | 51.33 | - | - | 90.58 | 49.0 | 32.49
15 | SA | ✓ | ✗ | ✗ | 3 | 59.59 | 36.5 | 76.07 | 95.63 | 56.81 | 43.14
16 | SA | ✓ | ✗ | ✗ | 5 | 59.67 | 36.86 | 76.14 | 95.49 | 57.28 | 42.63
17 | SA | ✓ | ✗ | ✗ | 10 | 59.06 | - | - | 95.44 | 54.87 | 41.77
18 | SA | ✗ | ✓ | ✗ | 3 | 61.11 | 39.3 | 79.42 | 95.94 | 56.15 | 45.46
19 | SA | ✗ | ✓ | ✗ | 5 | 61.05 | 37.53 | 79.31 | 95.87 | 56.96 | 41.2
20 | SA | ✗ | ✓ | ✗ | 10 | 59.31 | - | - | 95.93 | 56.36 | 42.49
21 | SA | ✓ | ✓ | ✗ | 3 | 59.14 | 37.04 | 75.77 | 95.34 | 55.73 | 42.37
22 | SA | ✓ | ✓ | ✗ | 5 | 56.67 | 35.46 | 76.39 | 95.12 | 56.05 | 41.05
23 | SA | ✓ | ✓ | ✗ | 10 | 54.2 | - | - | 94.98 | 55.67 | 40.51
24 | SA | ✗ | ✗ | ✓ | - | 53.91 | 30.42 | 75.63 | 94.8 | 50.44 | 38.74
25 | SA | ✓ | ✓ | ✓ | 3 | 48.11 | 22.55 | 75.77 | 93.1 | 46.55 | 35.21
26 | SA | ✓ | ✓ | ✓ | 5 | 44.32 | 20.58 | 76.39 | 92.81 | 49.08 | 32.65
27 | SA | ✓ | ✓ | ✓ | 10 | 40.22 | - | - | 92.72 | 49.04 | 35.64

Table 11: Results for PA (Paraphrasing-based Attack) and SA (Spelling-based Attack) on the utterance-based GloVe bcLSTM model. ✓ denotes that the corresponding utterances are attacked. In DailyDialog, we constrain the window size to 3 and 5, as there is an average of 8 utterances per dialogue in the dataset. Scores are W-Avg F1 for IEMOCAP Emotion and MultiWOZ Intent; Macro F1 for the rest.

We train an utterance-based GloVe bcLSTM model and a GloVe CNN model on unadulterated train and validation data. During evaluation, however, the context and target are perturbed as described above. The results of these experiments for bcLSTM and GloVe CNN are shown in Table 11 and Table 12, respectively. A few example cases are shown in Figs. 16 and 17.

Observations. We observe that the Spelling-based Attack is significantly more effective than the Paraphrasing-based Attack at fooling the classifier on the emotion classification tasks. Classification performance progressively deteriorates with larger window sizes. This is expected, since spelling changes create out-of-vocabulary words that are missing from pre-trained GloVe; models operating on sub-word vocabularies would likely be more resilient to this kind of attack.

In DailyDialog act classification, a Paraphrasing-based or Spelling-based Attack on only the future utterances barely affects the results: classification performance remains very close to the original score of 79.46%. As evidenced in Fig. 8, this task relies strongly on the label and content of the past utterance. For example, a question is likely to be followed by an inform or another question, and much less likely to be followed by a commissive utterance; an unchanged past context thus yields performance very close to the original setup. Attacking the past utterances combined with future and/or target utterances results in a relatively larger performance drop. We also notice that the drop in performance is much smaller than in the other tasks, except MultiWOZ intent classification. This is likely because act labels are mostly driven by sentence type and are hence unlikely to be affected by paraphrasing- or spelling-based perturbations. For instance, around 30% of the act labels are of type question, and our attacks are almost guaranteed not to change an utterance labeled question into something that might be classified as inform, commissive, or directive. Overall, we observe a consistent plunge in performance when the target utterance is attacked by either the paraphrase- or the spelling-based attack.


# | Model | PA | SA | IEMOCAP Emotion | DailyDialog Emotion | DailyDialog Act | MultiWOZ Intent | Persuasion ER | Persuasion EE
1 | GloVe CNN | - | - | 51.08 | 38.72 | 71.2 | 84.64 | 54.44 | 39.95
2 | GloVe CNN | ✓ | ✗ | 39.19 (↓23.27) | 23.82 (↓39.64) | 62.93 (↓13.01) | 70.34 (↓16.89) | 42.8 (↓21.38) | 33.59 (↓15.91)
3 | GloVe CNN | ✗ | ✓ | 44.68 (↓12.52) | 22.7 (↓41.37) | 61.86 (↓13.11) | 74.58 (↓11.88) | 42.95 (↓21.10) | 28.99 (↓27.43)
4 | bcLSTM | - | - | 61.9 | 41.16 | 79.46 | 96.22 | 56.28 | 44.83
5 | bcLSTM | ✓ | ✗ | 58.08 (↓6.17) | 37.16 (↓9.71) | 75.3 (↓5.23) | 93.78 (↓2.53) | 50.24 (↓10.73) | 38.78 (↓13.49)
6 | bcLSTM | ✗ | ✓ | 53.91 (↓12.90) | 30.42 (↓26.09) | 75.63 (↓4.82) | 94.8 (↓4.82) | 50.44 (↓10.37) | 38.74 (↓13.58)

Table 12: Results for PA (Paraphrasing-based Attack) and SA (Spelling-based Attack) in the GloVe CNN model, compared with the bcLSTM results from Table 11. Numbers in parentheses are relative performance drops in percent. IEMOCAP Emotion and MultiWOZ Intent: W-Avg F1; rest: Macro F1.

For intent classification in MultiWOZ, utterances often contain keywords that indicate the label (the presence of train might indicate the class label find train or book train). In these cases, if the target utterance is not paraphrased, the model is still likely to predict the correct label. Finally, in Persuasion for Good, both attack methods result in performance drops of a similar magnitude; the attacks are slightly more effective at fooling the classifier for persuadee strategy classification.

In terms of window direction, we observe that perturbations in the past or future utterances result in a similar range of performance reduction. One notable exception is act prediction in DailyDialog, where the model continues to perform near the original score of 79.46% irrespective of attacks on the future utterances in the window.

Performance Comparison for Attacks on GloVe CNN and GloVe bcLSTM. We summarize the performance of the GloVe CNN and GloVe bcLSTM models against the Paraphrasing-based and Spelling-based Attacks in Table 12. For all the tasks, we observe a very significant drop in performance for GloVe CNN. For example, in emotion classification, the drop is around 23% and 40% for the Paraphrasing-based Attack in IEMOCAP and DailyDialog, respectively, whereas for the same setting the relative decrease is only around 6% and 9% for bcLSTM. We observe the same trend on the other tasks: the bcLSTM model is much more robust to the attacks than the CNN model. This is because contextual models such as bcLSTM are harder to fool, as the context carries key information about the semantics of the target, and salient information about the target can be inferred from its context. Thus, even when the target utterance is corrupted, bcLSTM can use contextual information to predict the label correctly, and the decline in performance is consequently much smaller.

In principle, our findings in Table 12 relate to how transformer-based pretrained language models work. For example, in BERT (Devlin et al., 2018), the masked language modeling (MLM) and next sentence prediction (NSP) objectives force the model to infer or predict the target using contextual information. Such contextual models are more powerful and robust because contextual information plays a crucial role in almost every natural language processing task. An objective similar to next sentence prediction in BERT or permutation language modeling in XLNet (Yang et al., 2019) could be used for conversation-level pre-training to improve several downstream conversational tasks; such approaches have been found useful in the past (Hazarika et al., 2019).

6.7 Label-Constrained Dialogue-Level Utterance Augmentation

In utterance-level dialogue classification tasks, the labels of consecutive utterances are often interrelated, and recurrent models usually learn this pattern of label dependency during training. To analyze how large this label dependency is, we design the following experimental setup.

Method. We train an utterance-based bcLSTM model on unchanged train and validation data. During evaluation, we modify the test data in the following way (a code sketch follows the list):

• For each utterance ut in a dialogue, contextual utterances in the past and future within a window of size w are modified. The target utterance ut is kept unchanged.


Figure 18: Predictions under Label-Constrained Dialogue-Level Utterance Augmentation. We concatenate or replace contextual utterances with other utterances belonging to the same or a different class category. The examples shown are for DailyDialog emotion classification.

• First, for each contextual utterance in the window w (ut−w, ..., ut−1, ut+1, ..., ut+w), a new utterance is selected using the following constraints:

– It is drawn from some other dialogue in the test data.

– The label of the drawn utterance is either the same as or different from the original label of the respective contextual utterance.

• The modification is then performed on the test data in one of the following two ways:

– The selected utterance is concatenated at the end of the original utterance to form the new contextual utterance. We call this strategy Concat.

– The original utterance is replaced with the selected utterance, which thus forms the new contextual utterance. We call this strategy Replace.

We evaluate this experimental setup with window sizes w = 5, 10, 20, 30, 40, 50 in IEMOCAP, w = 5 in DailyDialog, w = 5, 10 in MultiWOZ, and w = 5, 10, 20 in Persuasion for Good. The results of these experiments are shown in Table 13, and a few example cases appear in Fig. 18.
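The Concat and Replace strategies can be sketched as follows; `pool` maps each label to candidate utterances drawn from other test dialogues, and all names are illustrative rather than the authors' code:

```python
import random

def augment_context(dialogue, labels, t, w, pool,
                    same_label=True, strategy="Concat"):
    """Modify every context utterance within the window around target
    index t; the target utterance itself is kept unchanged."""
    out = list(dialogue)
    for i in range(max(0, t - w), min(len(dialogue), t + w + 1)):
        if i == t:
            continue
        if same_label:                               # SL constraint
            label = labels[i]
        else:                                        # DL constraint
            label = random.choice([l for l in pool if l != labels[i]])
        drawn = random.choice(pool[label])
        out[i] = out[i] + " " + drawn if strategy == "Concat" else drawn
    return out
```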

Observations. We first look at the results on the IEMOCAP dataset, which was originally curated from actors performing improvised or scripted scenarios specifically selected to elicit emotional expressions. The scripts were written with the affective aspect of the content in mind. We believe that, to amplify and enrich the emotional content of the dialogues, the utterances in the IEMOCAP dataset were scripted with strong label dependencies among the utterances; e.g., sequences of strong negative emotions are observed in the dialogues where one of the speakers elicits anger.


# | Constraint | Strategy | Window | IEMOCAP Emotion | DailyDialog Emotion | DailyDialog Act | MultiWOZ Intent | Persuasion ER | Persuasion EE
1 | - | - | - | 61.9 | 41.16 | 79.46 | 96.22 | 56.28 | 44.83
2 | SL | Concat | 5 | 61.5 | 34.3 | 81.09 | 88.71 | 56.71 | 43.44
3 | SL | Replace | 5 | 61.31 | 30.06 | 77.72 | 77.92 | 55.56 | 43.2
4 | SL | Concat | 10 | 61.03 | - | - | 89.31 | 56.3 | 44.02
5 | SL | Replace | 10 | 59.92 | - | - | 77.23 | 53.27 | 41.9
6 | SL | Concat | 20 | 60.44 | - | - | - | 54.39 | 42.28
7 | SL | Replace | 20 | 59.74 | - | - | - | 52.28 | 41.19
8 | SL | Concat | 30 | 60.99 | - | - | - | - | -
9 | SL | Replace | 30 | 59.11 | - | - | - | - | -
10 | SL | Concat | 40 | 61.39 | - | - | - | - | -
11 | SL | Replace | 40 | 59.53 | - | - | - | - | -
12 | SL | Concat | 50 | 61.36 | - | - | - | - | -
13 | SL | Replace | 50 | 59.97 | - | - | - | - | -
14 | DL | Concat | 5 | 52.77 | 32.8 | 68.57 | 81.87 | 54.04 | 38.95
15 | DL | Replace | 5 | 47.1 | 27.68 | 60.11 | 62.51 | 54.73 | 36.74
16 | DL | Concat | 10 | 50.21 | - | - | 81.48 | 52.11 | 37.49
17 | DL | Replace | 10 | 38.87 | - | - | 61.4 | 51.02 | 39.66
18 | DL | Concat | 20 | 47.25 | - | - | - | 52.49 | 40.53
19 | DL | Replace | 20 | 37.06 | - | - | - | 51.27 | 38.25
20 | DL | Concat | 30 | 46.22 | - | - | - | - | -
21 | DL | Replace | 30 | 34.9 | - | - | - | - | -
22 | DL | Concat | 40 | 45.69 | - | - | - | - | -
23 | DL | Replace | 40 | 35.33 | - | - | - | - | -
24 | DL | Concat | 50 | 47.01 | - | - | - | - | -
25 | DL | Replace | 50 | 33.86 | - | - | - | - | -

Table 13: Results for Label-Constrained Dialogue-Level Utterance Augmentation on different datasets. We concatenate or replace contextual utterances with other utterances belonging to the same (SL) or a different (DL) class category. We constrain the window size according to the average number of utterances per dialogue in each dataset. Scores are W-Avg F1 for IEMOCAP Emotion and MultiWOZ Intent; Macro F1 for the rest.

This phenomenon is unique to IEMOCAP and missing from the DailyDialog dataset. From the experimental results in Table 13, we can argue that bcLSTM does not effectively learn to rely on the context to improve the semantic understanding of the target utterance; rather, it largely learns to mimic the affective content of the contextual utterances, owing to the label copying characteristic of the IEMOCAP training data discussed in Section 4.

Hence, we observe that in IEMOCAP there is not much of a performance drop if we modify the dialogues under the same-label constraint. Surprisingly, even in the Replace setup, the drop is only around 2 points from the original setup. This means that even if we replace 30, 40, or 50 utterances in the past and future context with totally unrelated utterances (belonging to the same emotion category) from other test dialogues, performance stays near the original F1-score of 61.9% (Table 13, Rows 9, 11, 13). For large window sizes in the Replace setup, even though the flow of the dialogue is entirely absent, the F1-score is still around 59%, demonstrating the importance of label dependency in this particular dataset. Furthermore, under the different-label constraint, the performance drop is significant even with the Concat strategy in a small window (Table 13, Rows 14, 16). Although the Concat strategy retains the original utterance, text belonging to another emotion label is concatenated, which confuses the model about the overall label orientation. In the Replace setup the performance is even poorer (Rows 15, 17, 19). All these results indicate label dependency, especially for the task of emotion classification.

We observe a similar trend on the Persuasion for Good dataset. For both persuader and persuadee strategy classification, the drop in performance is much smaller than for DailyDialog emotion or MultiWOZ intent. As observed in Table 4, current contextual models do not provide substantial improvements over non-contextual models on this dataset; hence, we conjecture that the bcLSTM model is unable to model context effectively for classifying strategy labels in this dataset.


Figure 19: Classification performance with respect to the position of the utterances. Scores are Weighted-F1 inIEMOCAP, MultiWOZ and Macro F1 in the rest.


Observing how the results change with the window size, we conclude that even when the local context is augmented with same-label (SL) utterances, long-distance context helps the strategy classification tasks. In the different-label (DL) experiments, augmenting with differently labeled utterances has an adverse effect, and performance falls below that of the GloVe CNN (context-free classifier) model in Table 4. We conclude that even though the true context is not greatly helpful for this task (compared to the other datasets), corrupting the context is unfavourable for the bcLSTM model and results in poor performance.

In DailyDialog act classification, concatenation in the same-label setup is more helpful because act labels are primarily driven by sentence type, and concatenation is likely to provide a stronger signal; for example, an utterance with act label question, concatenated with another question, is unlikely to be classified as anything else. Finally, for DailyDialog emotion and MultiWOZ intent, we conclude that the content of the contextual utterances is extremely vital, and corrupting it results in very poor performance.

6.8 Influence of Utterance Positions in the Prediction

In this setting, we investigate whether there is a general or dataset-specific relationship between the prediction F1-score and the position of the utterances. We want to examine whether utterances at the beginning of a dialogue are easier to classify than utterances in the middle or at the end. We show this trend of performance against position in Fig. 19.
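Position-wise scores of this kind can be computed by grouping predictions by each utterance's index within its dialogue; the sketch below (assuming scikit-learn) is illustrative, not the authors' evaluation code:

```python
from collections import defaultdict
from sklearn.metrics import f1_score

def f1_by_position(dialogues_true, dialogues_pred, average="weighted"):
    """Both arguments are lists of per-dialogue label sequences; returns
    an F1 score for every utterance position index."""
    by_pos_true, by_pos_pred = defaultdict(list), defaultdict(list)
    for y_true, y_pred in zip(dialogues_true, dialogues_pred):
        for pos, (t, p) in enumerate(zip(y_true, y_pred)):
            by_pos_true[pos].append(t)
            by_pos_pred[pos].append(p)
    return {pos: f1_score(by_pos_true[pos], by_pos_pred[pos],
                          average=average)
            for pos in sorted(by_pos_true)}
```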

We see a decreasing trend of performance with respect to position in all the tasks across the datasets. At the beginning of a dialogue, the initial utterances set up the flow of the conversation; naturally, they carry more information and are more self-contained and less context-dependent. As the dialogue progresses, however, utterances tend to carry less self-sufficient information, and their classification becomes largely context-dependent. This is evident from the plot of the GloVe CNN model in Fig. 19: GloVe CNN is a non-contextual model, and there is an overall decreasing trend as we move to the right along the position axis. This supports our initial hypothesis that utterances appearing later in a dialogue are generally harder to classify on their own, without the use of any contextual information.


We see a similar trend for the bcLSTM model as well. In three of the six tasks, it uses contextual information to provide a significant improvement over the GloVe CNN model. However, RNNs cannot maintain the entirety of the relevant context (e.g., coreference) across time, so the network also loses contextual information necessary for correct classification as the dialogue progresses. Nevertheless, for those three tasks, barring the extreme right end of the graph, the gap between bcLSTM and GloVe CNN widens as we move towards higher position indices. This suggests that, relative to GloVe CNN, the bcLSTM model finds utterances at the end of a dialogue less difficult to classify than those at the beginning, owing to its contextual nature.

We believe another possible reason for the decreasing trend is that there are only a few dialogues with a very large number of utterances. The scores for the rightmost position indices are thus calculated over a very small number of samples and are probably not the most accurate estimate of the overall trend. The scarcity of samples at these far-distant indices could also prevent the models from learning any intended position-specific contextual information or positional encoding.

6.9 Performance for Label Shift

As illustrated in Fig. 8 and Fig. 9, a few of our tasks of interest exhibit the label copying property: consecutive utterances from the same speaker or from different speakers often share the same emotion, act, or intent label. Inter-speaker and intra-speaker label copying is especially prevalent in the IEMOCAP emotion, DailyDialog act, and MultiWOZ intent tasks. Contextual models such as bcLSTM make correct predictions when utterances display such continuation of the same label. But what happens when the label changes? Does bcLSTM continue to perform at the same level, or is it affected by the change? To understand this occurrence in more detail, we define this event as a Label Shift and consider the following two kinds of shift that can occur over the course of a dialogue:

• Intra-Speaker Shift: the label of the utterance is different from the label of the previous utterance from the same speaker (refer to Fig. 20).

• Inter-Speaker Shift: the label of the utterance is different from the label of the previous utterance from the non-target speaker (refer to Fig. 20).

In these two scenarios, we are interested in how bcLSTM performs on the utterances where the label shift takes place (a code sketch of the shift detection follows).
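A sketch of how the two kinds of shift can be located, assuming dyadic dialogues annotated with per-utterance (speaker, label) pairs; illustrative only:

```python
def label_shifts(speakers, labels):
    """Return the indices of intra- and inter-speaker label shifts in a
    dyadic dialogue."""
    intra, inter = [], []
    last_by_speaker = {}
    for t, (spk, lab) in enumerate(zip(speakers, labels)):
        if spk in last_by_speaker and last_by_speaker[spk] != lab:
            intra.append(t)   # differs from this speaker's previous label
        other = [l for s, l in last_by_speaker.items() if s != spk]
        if other and other[0] != lab:
            inter.append(t)   # differs from the other speaker's previous label
        last_by_speaker[spk] = lab
    return intra, inter
```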

We report results for utterances in the test data that show an Intra-Speaker Shift or an Inter-Speaker Shift in Table 14. The Inter-Speaker Shift is not defined for MultiWOZ, as we do not have intent labels for system utterances. We also do not report Inter-Speaker Shift results for Persuasion for Good, as the persuader and persuadee strategy sets are different.

The emotion labels in IEMOCAP display the largest extent of label copying in Fig. 9 and of label shift in Fig. 8, and we also observe in Table 14 that label shifts occur with high frequency in IEMOCAP. These are the likely reasons why we observe a significant number of errors on utterances with a Label Shift for this task in Table 14: performance for both Intra-Speaker and Inter-Speaker Shifts stands at around 52.0%, much lower than the overall average of 61.9% on the test data. Although not as strong as in IEMOCAP, the intra-speaker label copying feature can also be seen in the MultiWOZ intent and DailyDialog act labels; for these two tasks, we again observe a drop in performance when either an Intra-Speaker or an Inter-Speaker Shift occurs.

In contrast, transitions are spread over a much larger combination of labels in DailyDialog emotion and Persuasion for Good. We observe that the results for utterances with a Label Shift in those tasks are in fact better than the overall score. In DailyDialog emotion, the scores are 44.23% and 47.77%, an improvement over the original 41.16%. The scores of 57.84% and 49.4% in Persuasion for Good likewise exceed the scores of 56.28% and 44.83% in the original setup.

Sentiment Shift. We further analyze the results for sentiment shift at the intra- and inter-speaker level. For the emotion classification tasks, we group the different emotion labels into three broad categories:


Figure 20: Examples of label shift in IEMOCAP and DailyDialog datasets, respectively.

# | Setup | IEMOCAP Emotion | DailyDialog Emotion | DailyDialog Act | MultiWOZ Intent | Persuasion ER | Persuasion EE
1 | Original | 61.9 | 41.16 | 79.46 | 96.22 | 56.28 | 44.83
2 | Intra-Speaker Shift | 52.01 (13.2) | 44.23 (1.0) | 76.18 (2.9) | 94.91 (1.6) | 57.84 (6.9) | 49.4 (4.7)
3 | Inter-Speaker Shift | 52.37 (22.0) | 47.77 (1.3) | 78.80 (4.9) | - | - | -

Table 14: Classification performance for utterances which exhibit Label Shift in the test data. Numbers in parentheses indicate the average count of the corresponding shifts per dialogue. There is no Inter-Speaker Shift in MultiWOZ or Persuasion for Good as we only classify user, persuader, or persuadee utterances. Scores are W-Avg F1 in IEMOCAP Emotion and MultiWOZ Intent; Macro F1 in the rest.

Sentiment Shift. We further analyze the results for sentiment shift at the intra- and inter-speaker level. For the emotion classification tasks, we group the different emotion labels into three broad categories: i) positive sentiment group with emotions excited, happy, surprise; ii) negative sentiment group with emotions angry, disgust, fear, frustrated, sad; and iii) neutral sentiment group with the emotion neutral. Sentiment shift is then defined as shifting from one of the three groups to any of the other two. We also define sentiment shift w/o neutral as switching from either the positive to the negative group, or the negative to the positive group.
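Under this grouping, sentiment shifts can be derived mechanically from emotion labels. The sketch below is our own illustration (the mapping covers only the emotion labels named above; dataset-specific label variants would need to be added):

```python
# Sentiment grouping used for the emotion classification tasks.
SENTIMENT_GROUP = {
    "excited": "positive", "happy": "positive", "surprise": "positive",
    "angry": "negative", "disgust": "negative", "fear": "negative",
    "frustrated": "negative", "sad": "negative",
    "neutral": "neutral",
}

def is_sentiment_shift(prev_emotion, curr_emotion, ignore_neutral=False):
    """True if two consecutive emotions fall into different sentiment groups.

    With ignore_neutral=True, only positive <-> negative switches count,
    matching the "sentiment shift w/o neutral" setting.
    """
    prev_group = SENTIMENT_GROUP[prev_emotion]
    curr_group = SENTIMENT_GROUP[curr_emotion]
    if ignore_neutral and "neutral" in (prev_group, curr_group):
        return False
    return prev_group != curr_group

print(is_sentiment_shift("frustrated", "happy"))                    # True
print(is_sentiment_shift("happy", "neutral", ignore_neutral=True))  # False
```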

The results showing the performance of bcLSTM at sentiment switches in intra- and inter-speaker scenarios are reported in Table 15. In IEMOCAP, sentiment shift is naturally less frequent than emotion shift because of the emotion grouping. However, we found that sentiment and emotion shifts result in almost similar performance, which can be attributed to the label dependency between the target speaker's utterances, as explained in earlier sections. One noteworthy exception is inter-speaker sentiment shift w/o neutral, for which the W-Avg F1 is 62.22%. One can infer that bcLSTM has low prediction accuracy when a sentiment shift from a non-neutral emotion (e.g., anger, happy, sad) to the neutral emotion takes place between two speakers' consecutive utterances.

On the other hand, due to the highly frequent occurrence of the neutral emotion in the DailyDialog dataset, the number of both emotion and sentiment shifts is very low in this dataset. The higher frequency of the neutral emotion also helps bcLSTM learn non-neutral to neutral emotion and sentiment shifts effectively. Hence, when we remove the neutral emotion, the performance dips significantly, which implies that classification of utterances at sentiment shifts between non-neutral emotions is relatively poor in both intra- and inter-speaker scenarios.


# | Setup | Mode | IEMOCAP Emotion | DailyDialog Emotion
1 | Original | - | 61.9 | 41.16
2 | Intra-Speaker Shift | Emotion Shift | 52.01 (13.2) | 44.23 (1.0)
3 | Inter-Speaker Shift | Emotion Shift | 52.37 (22.0) | 47.77 (1.3)
4 | Intra-Speaker Shift | Sentiment Shift | 53.21 (7.2) | 44.21 (1.0)
5 | Inter-Speaker Shift | Sentiment Shift | 49.09 (13.6) | 50.61 (1.3)
6 | Intra-Speaker Shift (w/o neutral) | Sentiment Shift | 51.98 (1.1) | 19.76 (0.02)
7 | Inter-Speaker Shift (w/o neutral) | Sentiment Shift | 62.22 (1.0) | 45.63 (0.04)

Table 15: Classification performance for utterances which exhibit Emotion and Sentiment Shift in the test data. Numbers in parentheses indicate the average count of the corresponding shifts per dialogue. Scores are W-Avg F1 in IEMOCAP Emotion and Macro F1 in DailyDialog.

[Figure 21 consists of six panels, one per task: IEMOCAP (Emotion), DailyDialog (Emotion), DailyDialog (Act), MultiWOZ (Intent), Persuasion for Good (ER), and Persuasion for Good (EE). Each panel plots test F-score against the percentage of occurrence in the training data, with separate series for patterns of size 2, 3, and 5.]

Figure 21: Performance of different n-gram label patterns in test data against their percentage of occurrence in the training data. We plot n = 2-, 3-, 5-gram intra-speaker label patterns here. Patterns which are more frequent in the training set are also likely to be predicted more correctly during evaluation.

6.10 Performance for n-gram Label Patterns

We perform a detailed study to understand whether n-gram label patterns that are frequently encountered by the learning algorithm during training are more likely to be predicted correctly during evaluation. In this context, n-gram label patterns are simply the ordered list of labels from n consecutive utterances in a dialogue. Contextual models such as bcLSTM will often learn the dependency of labels for patterns which they come across more often in the training set. We then intend to see how the model performs on the evaluation set for all distinct patterns observed during training. These distinct patterns appear in a range of frequencies in the training set, with some being much more common than others. By observing the test performance against the training frequency, we can understand how our models perform for label patterns occurring with different frequencies. This study is a continuation of the label dependency analysis we illustrate in Fig. 8 and 9, and the evaluation we perform in Section 6.9.

We first tabulate the most frequent intra-speaker patterns in Table 16. We report the top five most frequent patterns with n = 2- and 3-grams. The numbers reported in the table represent the percentage of occurrence. For example, in IEMOCAP, 17.09% of 2-grams in the train+validation set is {frustrated, frustrated}. For some of the tasks, the percentage of occurrence reported in Table 16 varies considerably even within the top five bracket, implying that consecutive utterances having some of those label patterns are much more frequent than others.


IEMOCAP (Emotion), size 2
  Train+Val: frustrated, frustrated (17.09); neutral, neutral (15.67); sad, sad (11.58); angry, angry (11.3); excited, excited (9.75)
  Test: neutral, neutral (17.56); frustrated, frustrated (15.69); excited, excited (15.37); sad, sad (12.55); angry, angry (6.79)

IEMOCAP (Emotion), size 3
  Train+Val: frustrated, frustrated, frustrated (12.54); neutral, neutral, neutral (12.18); sad, sad, sad (10.16); angry, angry, angry (8.74); excited, excited, excited (8.04)
  Test: neutral, neutral, neutral (14.44); excited, excited, excited (13.24); frustrated, frustrated, frustrated (11.56); sad, sad, sad (11.02); angry, angry, angry (4.62)

DailyDialog (Emotion), size 2
  Train+Val: neutral, neutral (76.33); neutral, happiness (7.72); happiness, happiness (5.49); happiness, neutral (3.6); neutral, surprise (1.54)
  Test: neutral, neutral (74.17); neutral, happiness (8.03); happiness, happiness (5.76); happiness, neutral (3.84); neutral, surprise (1.22)

DailyDialog (Emotion), size 3
  Train+Val: neutral, neutral, neutral (70.99); neutral, neutral, happiness (7.89); happiness, happiness, happiness (3.56); happiness, neutral, neutral (2.8); neutral, happiness, neutral (2.54)
  Test: neutral, neutral, neutral (67.71); neutral, neutral, happiness (8.07); happiness, neutral, neutral (3.65); happiness, happiness, happiness (3.39); neutral, happiness, neutral (2.86)

DailyDialog (Act), size 2
  Train+Val: inform, inform (27.26); question, question (14.48); question, inform (11.24); inform, question (6.64); directive, inform (6.3)
  Test: inform, inform (27.53); question, question (14.11); question, inform (12.02); inform, question (6.97); directive, inform (5.75)

DailyDialog (Act), size 3
  Train+Val: inform, inform, inform (18.36); question, question, question (8.44); question, inform, inform (5.71); question, question, inform (5.46); inform, question, inform (3.47)
  Test: inform, inform, inform (18.09); question, question, question (7.75); question, inform, inform (6.72); question, question, inform (5.94); inform, question, inform (3.62)

MultiWOZ (Intent), size 2
  Train+Val: find hospital, find hospital (16.71); find police, find police (15.75); find taxi, find taxi (15.27); book train, book train (11.69); find police, book hotel (5.73)
  Test: find taxi, find taxi (17.58); find hospital, find hospital (14.62); book train, book train (13.14); find police, find police (12.71); find taxi, book restaurant (5.93)

MultiWOZ (Intent), size 3
  Train+Val: find hospital, find hospital, find hospital (11.82); find police, find police, find police (9.9); find taxi, find taxi, find taxi (9.9); book train, book train, book train (6.39); find taxi, find taxi, book restaurant (5.75)
  Test: find taxi, find taxi, find taxi (10.57); find hospital, find hospital, find hospital (10.3); find police, find police, find police (7.32); book train, book train, book train (7.32); find taxi, find taxi, book restaurant (6.78)

Persuasion (ER), size 2
  Train+Val: non strategy dialogue acts, non strategy dialogue acts (22.68); credibility appeal, credibility appeal (9.55); credibility appeal, non strategy dialogue acts (4.85); non strategy dialogue acts, credibility appeal (4.38); non strategy dialogue acts, donation information (3.43)
  Test: non strategy dialogue acts, non strategy dialogue acts (23.04); credibility appeal, credibility appeal (8.05); credibility appeal, non strategy dialogue acts (4.78); non strategy dialogue acts, credibility appeal (4.53); donation information, non strategy dialogue acts (4.02)

Persuasion (ER), size 3
  Train+Val: non strategy dialogue acts, non strategy dialogue acts, non strategy dialogue acts (13.55); credibility appeal, credibility appeal, credibility appeal (5.34); credibility appeal, credibility appeal, non strategy dialogue acts (2.58); non strategy dialogue acts, credibility appeal, credibility appeal (2.12); credibility appeal, non strategy dialogue acts, non strategy dialogue acts (2.01)
  Test: non strategy dialogue acts, non strategy dialogue acts, non strategy dialogue acts (13.02); credibility appeal, credibility appeal, credibility appeal (3.2); non strategy dialogue acts, credibility appeal, credibility appeal (2.58); credibility appeal, credibility appeal, non strategy dialogue acts (2.43); donation information, non strategy dialogue acts, non strategy dialogue acts (2.32)

Persuasion (EE), size 2
  Train+Val: other dialogue acts, other dialogue acts (20.49); positive reaction to donation, positive reaction to donation (10.35); other dialogue acts, positive reaction to donation (5.61); positive reaction to donation, other dialogue acts (4.61); other dialogue acts, ask org info (3.87)
  Test: other dialogue acts, other dialogue acts (20.34); positive reaction to donation, positive reaction to donation (9.64); other dialogue acts, positive reaction to donation (5.91); positive reaction to donation, other dialogue acts (4.04); other dialogue acts, ask org info (3.3)

Persuasion (EE), size 3
  Train+Val: other dialogue acts, other dialogue acts, other dialogue acts (12.6); positive reaction to donation, positive reaction to donation, positive reaction to donation (6.07); other dialogue acts, positive reaction to donation, positive reaction to donation (2.89); positive reaction to donation, positive reaction to donation, other dialogue acts (2.51); other dialogue acts, other dialogue acts, positive reaction to donation (2.28)
  Test: other dialogue acts, other dialogue acts, other dialogue acts (11.51); positive reaction to donation, positive reaction to donation, positive reaction to donation (4.8); other dialogue acts, positive reaction to donation, positive reaction to donation (2.56); positive reaction to donation, positive reaction to donation, other dialogue acts (2.43); other dialogue acts, other dialogue acts, positive reaction to donation (2.11)

Table 16: Five most frequent intra-speaker n-gram label patterns in various datasets, with window sizes of 2 and 3 labels, for the train+validation set and the test set. The numbers in parentheses represent the percentage of occurrence. For example, in IEMOCAP, 17.09% of 2-grams in the train+validation set is {frustrated, frustrated}.

We also report the most frequent patterns at the inter-speaker level in Table 17.


IEMOCAP (Emotion), size 2
  Train+Val: frustrated, frustrated (13.49); sad, sad (10.43); neutral, neutral (9.55); excited, excited (8.6); angry, angry (7.84)
  Test: excited, excited (13.06); sad, sad (11.57); frustrated, frustrated (11.49); neutral, neutral (10.05); neutral, frustrated (7.42)

DailyDialog (Emotion), size 2
  Train+Val: neutral, neutral (75.81); happiness, happiness (6.6); neutral, happiness (6.3); happiness, neutral (3.52); neutral, surprise (1.61)
  Test: neutral, neutral (73.0); happiness, happiness (6.82); neutral, happiness (6.68); happiness, neutral (3.86); surprise, neutral (1.34)

DailyDialog (Act), size 2
  Train+Val: question, inform (24.31); inform, inform (19.8); inform, question (15.87); directive, commissive (10.04); inform, directive (6.26)
  Test: question, inform (24.15); inform, inform (20.44); inform, question (16.0); directive, commissive (9.93); inform, directive (6.37)

Persuasion (ER to EE), size 2
  Train+Val: non strategy dialogue acts, other dialogue acts (20.34); credibility appeal, positive reaction to donation (5.78); credibility appeal, ask org info (4.79); credibility appeal, other dialogue acts (4.49); personal related inquiry, other dialogue acts (4.29)
  Test: non strategy dialogue acts, other dialogue acts (17.39); credibility appeal, positive reaction to donation (6.68); credibility appeal, other dialogue acts (6.68); non strategy dialogue acts, positive reaction to donation (5.21); task related inquiry, other dialogue acts (4.42)

Persuasion (EE to ER), size 2
  Train+Val: other dialogue acts, non strategy dialogue acts (20.04); ask org info, credibility appeal (12.22); positive reaction to donation, non strategy dialogue acts (9.86); provide donation amount, non strategy dialogue acts (5.04); task related inquiry, non strategy dialogue acts (3.54)
  Test: other dialogue acts, non strategy dialogue acts (17.26); ask org info, credibility appeal (10.98); positive reaction to donation, non strategy dialogue acts (8.58); provide donation amount, non strategy dialogue acts (6.28); other dialogue acts, credibility appeal (4.39)

Table 17: Five most frequent inter-speaker 2-gram label patterns in various datasets, for the train+validation set and the test set. The numbers in parentheses represent the percentage of occurrence. We restrict the window size to 2 as we are reporting inter-speaker transitions here. As MultiWOZ Intent does not have labels for system utterances, inter-speaker patterns cannot be defined.

The numbers at the inter-speaker level also indicate the presence of imbalanced n-gram label patterns.

Now, we collect all such intra-speaker n-gram patterns (with n = 2, 3, 5) appearing in the training data and plot their corresponding averaged performance if and when they appear in the test data in Fig. 21. The performance score is plotted against the percentage of occurrence in the training data. Note that the percentage is reported only for the top five patterns in Table 16, but is plotted for all possible patterns in Fig. 21. The scores shown in Fig. 21 are computed from the predictions of the bcLSTM model.
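A minimal sketch of this bookkeeping is given below (our own illustration with hypothetical names; the released code may organize this differently). It counts intra-speaker n-gram label patterns and converts the counts to percentages of occurrence, as reported in Table 16:

```python
from collections import Counter

def ngram_label_patterns(dialogues, n):
    """Percentage of occurrence of intra-speaker n-gram label patterns.

    dialogues: list of dialogues, each a list of (speaker, label) pairs.
    A pattern is the ordered tuple of labels from n consecutive
    utterances of the same speaker.
    """
    counts = Counter()
    for dialogue in dialogues:
        labels_per_speaker = {}
        for speaker, label in dialogue:
            labels_per_speaker.setdefault(speaker, []).append(label)
        for labels in labels_per_speaker.values():
            for i in range(len(labels) - n + 1):
                counts[tuple(labels[i:i + n])] += 1
    total = sum(counts.values()) or 1  # guard against empty input
    return {pattern: 100.0 * c / total for pattern, c in counts.items()}

# Toy example: each speaker repeats their own label once.
toy = [[("A", "neutral"), ("B", "happy"), ("A", "neutral"), ("B", "happy")]]
print(ngram_label_patterns(toy, n=2))
# {('neutral', 'neutral'): 50.0, ('happy', 'happy'): 50.0}
```

The same tallies, restricted to patterns seen in training, can then be paired with per-pattern test F-scores to produce plots like Fig. 21.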

The scores in DailyDialog emotion are quite poor, as most of the n-gram patterns contain one or more neutral emotion labels. However, as neutral emotions are not considered in our evaluation setup, the scores are mostly in the lower range of 0-15%. Apart from that, we see that there is a strong correlation between a higher frequency of occurrence in the training set and the performance in the test set. Patterns that are encountered more during the training phase are also predicted with higher F-scores during evaluation. All n = 2-, 3-, 5-gram patterns with high frequencies show F-scores higher than, or at least on par with, the average reported for the whole dataset. A notable exception is the score of the {non strategy dialogue acts, non strategy dialogue acts} pattern in Persuasion ER classification. Even though it constitutes around 23% of 2-gram patterns in the training data, the test score is quite poor.

However, the scores for patterns that are less frequent in the training set vary considerably across the whole range of 0%-100%. Naturally, most of the less frequent patterns in the training data also appear less frequently in the test data. Hence, the scores plotted for these patterns are drawn from a very small number of samples, and thus the variance is much higher in the left-most part of the plots in Fig. 21.

6.11 Sequence Tagging using Conditional Random Field (CRF)

On the surface, the task of utterance-level dialogue understanding looks similar to sequence tagging. Are there any distinct label dependencies and patterns across the tasks that are dataset agnostic and likely to be captured by a CRF (Lafferty et al., 2001)? In the quest to answer this, we plug in three different CRF layers on top of the bcLSTM network.

Global-CRF. It is a linear-chain CRF used on top of bcLSTM. In this setting, we do not consider speaker information. It can be defined using the equations below:

P(Y \mid D) = \frac{1}{Z(D)} \prod_{i=1}^{n} \phi_T(y_{i-1}, y_i)\, \phi_E(y_i, u_i), \tag{1}

Z(D) = \sum_{y' \in \mathcal{Y}} \prod_{i=1}^{n} \phi_T(y'_{i-1}, y'_i)\, \phi_E(y'_i, u_i). \tag{2}

Global-CRFext. The linear-chain CRF is extended to include not only the transition potential from the previous label to the current label, but also from the prior-to-previous label. Concisely, the current label is predicated on the previous two labels. Therefore, the transition potential function \phi_T takes one extra argument, y_{i-2}. The advantage here is that it also considers the previous label from the target speaker, should utterance i-2 have come from the target speaker. This becomes useful in tasks where the speakers tend to retain the label of their last utterance. It can be defined using the equations below:

P(Y \mid D) = \frac{1}{Z(D)} \prod_{i=1}^{n} \phi_T(y_{i-2}, y_{i-1}, y_i)\, \phi_E(y_i, u_i), \tag{3}

Z(D) = \sum_{y' \in \mathcal{Y}} \prod_{i=1}^{n} \phi_T(y'_{i-2}, y'_{i-1}, y'_i)\, \phi_E(y'_i, u_i). \tag{4}

Speaker-CRF. In this setting, we use two distinct CRFs for the two speakers in a dialogue. Inter-speaker label dependencies and transitions are not likely to be captured by the CRFs in this setting.
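To make the Global-CRF setting concrete, the following is a minimal PyTorch sketch of Eqs. (1) and (2) in log space, where a learned transition matrix plays the role of \phi_T and per-utterance scores (e.g., bcLSTM logits) play the role of \phi_E. It is our own illustrative sketch, not the exact code used in the experiments:

```python
import torch
import torch.nn as nn

class LinearChainCRF(nn.Module):
    """Minimal global-CRF layer over per-utterance emission scores."""

    def __init__(self, num_labels):
        super().__init__()
        # transitions[i, j] is the log-potential of label i -> label j.
        self.transitions = nn.Parameter(torch.zeros(num_labels, num_labels))

    def log_partition(self, emissions):
        # Forward algorithm: log Z(D) over all possible label sequences.
        alpha = emissions[0]  # (num_labels,)
        for t in range(1, emissions.size(0)):
            # alpha'[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.transitions, dim=0) + emissions[t]
        return torch.logsumexp(alpha, dim=0)

    def sequence_score(self, emissions, labels):
        # Unnormalized log-score of one particular label sequence.
        score = emissions[0, labels[0]]
        for t in range(1, emissions.size(0)):
            score = score + self.transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
        return score

    def neg_log_likelihood(self, emissions, labels):
        # -log P(Y | D), minimized during training.
        return self.log_partition(emissions) - self.sequence_score(emissions, labels)

crf = LinearChainCRF(num_labels=6)
emissions = torch.randn(10, 6)       # e.g. bcLSTM outputs for a 10-utterance dialogue
labels = torch.randint(0, 6, (10,))  # gold labels
loss = crf.neg_log_likelihood(emissions, labels)
```

Training minimizes this negative log-likelihood jointly with the feature extractor; at test time, the highest-scoring label sequence would be recovered with Viterbi decoding.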

Negative Results. Unlike in well-known sequence tagging tasks such as Named Entity Recognition (NER) and Part-of-Speech tagging, the CRF does not improve the performance of utterance-level dialogue understanding tasks. There could be multiple reasons for this:

• As shown in Fig. 1, a dialogue is governed by multiple variables or pragmatics, e.g., topic, personal goal, past experience, expressing opinions or presenting facts based on personal knowledge, and the role of the interlocutors. Hence, the response pattern can vary depending on these variables. The personality of the speakers adds an extra layer of complexity, which causes speakers to respond differently under the same circumstances. An identical utterance can be uttered with different emotions by two different speakers. CRF relies on surface label patterns, which can vary with datasets. Due to this dynamic nature of dialogues and the presence of latent controlling variables, the label transition matrix of the CRF does not learn any distinct pattern that is complementary to what is learned by the feature extractor.

• Some of the datasets — IEMOCAP and MultiWOZ — contain distinct label-transition patterns (see Fig. 8 and Fig. 9) between the same and distinct speakers. We expected bcLSTM w/ global-CRF to outperform vanilla bcLSTM on these two datasets. However, we do not observe any statistically significant improvement using bcLSTM w/ global-CRF over bcLSTM. We posit that the evident label-transition patterns that exist in these two datasets are straightforward to capture without a CRF. In fact, we also tried GloVe CNN with a CRF layer on top, and surprisingly the result was not significantly higher than that of GloVe CNN. This can be attributed to the absence of explicit contextual and label transition-based features in the CRF.

Page 36: arXiv:2009.13902v1 [cs.CL] 29 Sep 2020b, however, is pre-occupied and replies sarcasti-cally (u 4). This enrages P a to appropriate an an-gry response (u 6). In this dialogue, emotional

[Figure 22 shows an example dialogue of 13 utterances with emotion labels (neutral, excited, happy) tagged under each configuration: the original label sequence; the Global CRF and the Extended CRF, which tag the full utterance sequence with B-/I-prefixed labels; and the Speaker CRFs, which tag each speaker's utterance subsequence separately.]

Figure 22: Different CRF configurations.


Results on the IEMOCAP, DailyDialog and Persuasion for Good Datasets. We observe a minor performance improvement in the IEMOCAP and DailyDialog datasets using speaker-CRF for emotion recognition. This observation directly correlates with the experiment under the "w/o inter" setting in Table 9. In the "w/o inter" setting, contextual utterances of speaker B are not utilized to classify utterances of speaker A, and vice versa. The results do not improve when we use a speaker-level CRF on bcLSTM under the "w/o inter" setting. From these observations, we can conclude that the CRF is not learning any distinct label dependency and transition patterns that are not already learned by the feature extractor or bcLSTM alone.

Global-CRFext shows a significant performance improvement on the Persuasion for Good dataset. Some of the key controllable factors of the dialogues in this dataset, such as the topics, are fixed and can be learned intrinsically by the classifier. The scope of the dialogues in this dataset is very limited, as there are only two possible outcomes of the dialogues – agree to donate, and disagree to donate. Hence, there can be some label transition patterns learned by the Global-CRFext using a larger label-context window in the transition potential.

Page 37: arXiv:2009.13902v1 [cs.CL] 29 Sep 2020b, however, is pre-occupied and replies sarcasti-cally (u 4). This enrages P a to appropriate an an-gry response (u 6). In this dialogue, emotional

Methods | IEMOCAP Emotion W-Avg F1 | DailyDialog Emotion W-Avg F1 | DailyDialog Emotion Micro F1 | DailyDialog Emotion Macro F1 | DailyDialog Act W-Avg F1 | DailyDialog Act Macro F1
GloVe CNN | 52.04 | 49.36 | 50.32 | 36.87 | 80.71 | 72.07
GloVe bcLSTM | 61.74 | 52.77 | 53.85 | 39.27 | 84.62 | 79.12
w/o inter | 63.73 | 52.39 | 52.86 | 39.99 | 81.32 | 74.50
w/o inter w/ speaker-CRF | 62.94 | 52.47 | 54.04 | 39.77 | 81.19 | 74.12
w/ global-CRF | 61.62 | 53.05 | 53.86 | 39.27 | 83.91 | 79.10
w/ global-CRFext | 61.64 | 53.06 | 54.40 | 39.64 | 84.27 | 79.25
w/ speaker-CRF | 62.21 | 53.16 | 54.68 | 39.74 | 84.15 | 79.20

Table 18: Classification performance in test data for emotion prediction in IEMOCAP, emotion prediction in DailyDialog, and act prediction in DailyDialog using different CRF configurations. All scores are the average of at least 10 different runs. Test F1 scores are calculated at the best validation F1 scores.

Methods | MultiWOZ Intent W-Avg F1 | Persuader W-Avg F1 | Persuader Macro F1 | Persuadee W-Avg F1 | Persuadee Macro F1
GloVe CNN | 84.30 | 67.15 | 54.45 | 58.00 | 41.03
GloVe bcLSTM | 96.14 | 69.26 | 55.27 | 61.18 | 42.19
w/o inter | 95.05 | 67.81 | 53.24 | 59.44 | 40.63
w/o inter w/ speaker-CRF | 94.11 | 68.13 | 54.45 | 58.93 | 40.16
w/ global-CRF/speaker-CRF | 95.48 | 68.59 | 55.60 | 61.24 | 42.62
w/ global-CRFext | 95.51 | 69.23 | 56.80 | 61.89 | 43.68

Table 19: Classification performance in test data for intent prediction in MultiWOZ, and persuader and persuadee strategy prediction in Persuasion for Good using different CRF configurations. All scores are the average of at least 10 different runs. Test F1 scores are calculated at the best validation F1 scores. In MultiWOZ and Persuasion for Good, the global-CRF and speaker-CRF settings are identical as we only classify utterances coming from one of the speakers (user in MultiWOZ, persuader or persuadee in Persuasion for Good).


6.12 Case Studies

The impetus behind context modeling in utterance-level dialogue understanding tasks is to capture the missing pieces necessary to understand a given utterance. Context often contains these missing pieces. Fig. 23 illustrates this point, where the intent of the target utterance can be resolved by decoding the coreference of someplace as restaurant. Similarly, in Fig. 24, the correct intent can be determined through the context given by the previous utterance from the same speaker.

[Figure 23 shows a MultiWOZ dialogue: the system welcomes the user; the user says "Hi! I'd like to find a restaurant located in the centre please." (Intent: find restaurant); the system offers de luca cucina and bar and riverside brasserie; the target utterance "Modern European, please, and I'd like someplace moderately priced." is likewise labelled Intent: find restaurant.]

Figure 23: Context dependency by implicit mention in the MultiWOZ dataset; red border indicates the target utterance.

The role of context is pivotal in the case of sarcasm. The target utterance in Fig. 25 can only be correctly construed as sarcastic through the consideration of the contextual utterances. The context is two-way in this particular case. Firstly, the non-target speaker's nonchalant attitude infuriates the target speaker. On the other hand, the previous utterance of the target speaker suggests his foul mood, which exacerbates in the target utterance. A contextual model like bcLSTM could invoke these kinds of contextual reasoning to arrive at the correct output label.

Page 38: arXiv:2009.13902v1 [cs.CL] 29 Sep 2020b, however, is pre-occupied and replies sarcasti-cally (u 4). This enrages P a to appropriate an an-gry response (u 6). In this dialogue, emotional

[Figure 24 shows a MultiWOZ dialogue: the user says "I need to find a guesthouse with a 3 star rating" (Intent: find hotel); the system asks for a preferred location; the target utterance "No, I don't. I want one that includes free wifi." carries the same Intent: find hotel, which is resolvable only through its contextual dependency on the earlier request.]

Figure 24: Context dependency in the MultiWOZ dataset; red border indicates the target utterance.


It must be noted that the possible explanations shown in these examples are contrived, whereas the labels are produced by bcLSTM.

We repeatedly observed across various utterance-level dialogue understanding tasks that GloVe CNN fails and bcLSTM succeeds in producing correct labels in such context-reliant cases. To verify the role of context, we removed the contextual utterances around the target utterances and observed similar misclassifications produced by GloVe CNN and bcLSTM alike. This indicates the likely ability of bcLSTM to capture the right context from the neighbouring utterances in a dialogue. However, the inner workings of such networks still remain veiled to this day. Thus, in the future, we should design approaches that are more explicit about the reasoning behind their output. Such explainable AI systems could pave the way to even richer systems.

7 Future Directions

Future work on utterance-level dialogue understanding could focus on a number of different directions:

• As evidenced by the results in Tables 8 and 9, current contextual models such as bcLSTM often lack the ability to make effective use of context considering inter- and intra-speaker dependencies. This is particularly true for the emotion classification tasks in IEMOCAP and DailyDialog, where we found that dropping inter- and intra-speaker specific context leads to an improvement over the results of bcLSTM using the full context. bcLSTM is thus not efficient in making use of the contextual utterances of both interlocutors, and lacks the ability to use the right amount of context. Future work in this direction could focus on the development of better contextual models that are efficient in making use of their context considering speaker-specific information. One promising direction is to use both context- and speaker-sensitive dependence, which has been shown to be effective for emotion recognition (Zhang et al., 2019; Ghosal et al., 2019).

[Figure 25 shows an IEMOCAP exchange: "Oh, maybe a little. Nothing serious" (neutral); "He let him kiss you. You said you did" (frustration); "Well, what of it. It gave him a lot of pleasure and it didn't hurt me." (neutral); and the target utterance "That's a nice point of view I must say" (anger). Arrows in the figure mark the contextual dependency and label consistency that reveal the target utterance's sarcasm.]

Figure 25: Context dependency in the IEMOCAP dataset; red border indicates the target utterance.


• Task-specific Context Modeling: Both interlocutors' utterances play a vital role in several tasks addressed in this paper (refer to Table 9): intent classification in the MultiWOZ dataset, and Persuader's and Persuadee's act classification in the Persuasion for Good dataset. However, in some other tasks, i.e., emotion classification on IEMOCAP and DailyDialog, and act classification on the DailyDialog dataset, dropping the target speaker's and the non-target speaker's utterances from the context exhibits contrasting results (refer to Table 8). In particular, for the act classification task in the DailyDialog dataset, the non-target speaker's utterances are more informative compared to the target speaker's utterances. We observe a starkly opposite outcome for the emotion classification task in the IEMOCAP and DailyDialog datasets, where removing the non-target speaker's utterances improves the overall performance. Interestingly, in the case of the DailyDialog dataset, the same contextual input yields two divergent trends in the results for two different tasks. Due to these contrasting yet interesting phenomena, we believe it may not be optimal to adopt a task-agnostic unified context modeling approach for all the tasks, which calls for task-specific context modeling approaches.

• Given an utterance, present contextual approaches to these problems cannot explain where and how contextual information is used. In particular, they fail to explain their decisions, which is a general problem of AI models. In this work, we only investigate the general trend in how contextual information can aid dialogue understanding tasks. Our probing techniques do not attempt to reason about how contextual information helps to infer the labels of target utterances, e.g., whether the model relies on coreferential or affective contextual information, and, as shown in Fig. 2, how it jointly fuses these different types of contextual information for reasoning.

• Speaker-specific modeling shows its efficacy in most of the tasks that we consider in this paper. However, in this work, we chose to exclude it from our detailed analysis. In the future, we should strive to address this issue and analyze why and how speaker-specific context modeling is effective for utterance-level dialogue understanding.

• The perturbation and adversarial attacking strategies used in this work are task agnostic, which may not be ideal as the utterance-level dialogue understanding tasks differ from each other. In the future, we plan to design task-specific probing strategies to gain further insights from these contextual models.

• Adapting the Proposed Probing Methods to Other Tasks: The proposed probing methods engineered to understand the role of context can easily be adapted or extended to other context-dependent tasks, such as summarization, dialogue generation, and document-level sentiment analysis. With ample computational support, the role of RoBERTa-like models in context modeling can be analyzed. This aligns well with the quest of explaining RoBERTa and other transformer-based models in contextual tasks by means of attention visualization and by measuring the cosine similarity between the [CLS] token and the tokens in the contextual utterances, to understand the role of these tokens in inferring the target utterances (see the sketch after this list). The latter can be very useful in explaining the case studies in Section 6.12.

• What about Multi-party Dialogues? Readers at this point may ponder the absence of multi-party dialogues in our study. The primary rationale for this is the additional complexity associated with multi-party dialogues. Multi-party dialogues involve many speakers and hence introduce complex coreferences that make inference and context modeling harder than in dyadic dialogues. The level of convolution that multi-party dialogues bring can be considered a separate topic of research. MELD (Poria et al., 2019) is one of the publicly available datasets for emotion and sentiment classification in multi-party conversations. Our preliminary experiments on this dataset with bcLSTM and DialogueRNN, reported in (Poria et al., 2019), show only a slight improvement over non-contextual models like GloVe CNN. MELD contains very short utterances, like yeah and oh, which, although they appear neutral, carry non-neutral emotions when perceived in their associated context. This solidifies the need for further research on context representation modeling to understand multi-party dialogues, and this is one of our future research goals.
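As a pointer to the probing direction above, the sketch below (our own illustration, assuming a HuggingFace roberta-base checkpoint; the input formatting is hypothetical) measures the cosine similarity between RoBERTa's <s> token, the [CLS] equivalent, and every token of a target-plus-context input:

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Target utterance followed by a contextual utterance (formatting assumed).
text = "I am smart enough. I just don't know how to show it. </s> Maybe you're not trying hard enough."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)

# Cosine similarity between the <s> token and every token in the sequence.
cls_vec = hidden[0]
sims = torch.nn.functional.cosine_similarity(cls_vec.unsqueeze(0), hidden, dim=-1)
for token, sim in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), sims):
    print(f"{token:>15s} {sim.item():.3f}")
```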

8 Conclusion

This paper establishes a unified baseline for all the utterance-level dialogue understanding subtasks. Furthermore, we probed the contextual baseline bcLSTM with different strategies engineered to understand the role of context. This consequently lends us insight into the behaviour of bcLSTM in the presence of various context perturbations. Such probes have bolstered many interesting intuitions about utterance-level dialogue understanding: the role of label dependency and future utterances; the robustness of contextual models, as opposed to their non-contextual counterparts, against adversarial probes; and the impact of the position of an utterance on its correct classification. We also compared two different mini-batch-creation schemes for training, dialogue-based and utterance-based mini-batches, and compared their performance under varied settings. We believe that these probing strategies can be straightforwardly adapted to other context-reliant tasks.

Acknowledgments

This research is supported by (i) A*STAR under its RIE 2020 Advanced Manufacturing and Engineering (AME) programmatic grant, Award No. A19E2b0098, and (ii) DSO National Laboratories, Singapore, under grant number RTDST190702 (Complex Question Answering). We are thankful to Rishabh Bhardwaj for his time and insightful comments toward this work.

References

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander Gelbukh. 2019. DialogueGCN: A graph convolutional neural network for emotion recognition in conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 154–164.

Devamanyu Hazarika, Soujanya Poria, Roger Zimmermann, and Rada Mihalcea. 2019. Emotion recognition in conversations with transfer learning from generative conversation modeling. arXiv preprint arXiv:1910.04980.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Eduard Hovy. 1987. Generating natural language under pragmatic constraints. Journal of Pragmatics, 11(6):689–719.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885.

Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee. 2017. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP 2014, pages 1746–1751.

Peter Kuppens, Nicholas B Allen, and Lisa B Sheeber. 2010. Emotional inertia and psychological maladjustment. Psychological Science, 21(7):984–991.

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1865–1874.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995.

Zheng Lian, Jianhua Tao, Bin Liu, and Jian Huang. 2019. Domain adversarial learning for emotion recognition. arXiv preprint arXiv:1910.13807.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2019. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6818–6825.

Michael W Morris and Dacher Keltner. 2000. How emotions work: The social functions of emotional expression in negotiations. Research in Organizational Behavior, 22:1–50.

Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

Costanza Navarretta, K Choukri, T Declerck, S Goggi, M Grobelnik, and B Maegaard. 2016. Mirroring facial expressions and emotions in dyadic conversations. In LREC.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–883, Vancouver, Canada. Association for Computational Linguistics.

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 527–536. Association for Computational Linguistics.

Libo Qin, Wanxiang Che, Yangming Li, Mingheng Ni, and Ting Liu. 2020. DCR-Net: A deep co-interactive relation network for joint dialog act recognition and sentiment classification. In Proceedings of the AAAI Conference on Artificial Intelligence.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the Association for Computational Linguistics.

Tulika Saha, Aditya Patra, Sriparna Saha, and Pushpak Bhattacharyya. 2020. Towards emotion-aided multi-modal dialogue act classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4361–4372.

Chinnadhurai Sankar, Sandeep Subramanian, Christopher Pal, Sarath Chandar, and Yoshua Bengio. 2019. Do neural dialog systems use the conversation history effectively? An empirical study. arXiv preprint arXiv:1906.01603.

Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu. 2019. Persuasion for good: Towards a personalized persuasive dialogue system for social good. arXiv preprint arXiv:1906.06725.

Yan Wang, Jiayu Zhang, Jun Ma, Shaojun Wang, and Jing Xiao. 2020. Contextualized emotion recognition in conversation as sequence tagging. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 186–195.

Chung-Hsien Wu, Ze-Jing Chuang, and Yu-Chung Lin. 2006. Emotion recognition from text using semantic labels and separable mixture models. ACM Transactions on Asian Language Information Processing (TALIP), 5(2):165–183.

Songlong Xing, Sijie Mai, and Haifeng Hu. 2020. Adapted dynamic memory network for emotion recognition in conversation. IEEE Transactions on Affective Computing.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Dong Zhang, Liangqing Wu, Changlong Sun, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019. Modeling both context- and speaker-sensitive dependence for emotion detection in multi-speaker conversations. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5415–5421. AAAI Press.

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2017. Emotional chatting machine: Emotional conversation generation with internal and external memory. arXiv preprint arXiv:1704.01074.