
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 121–132, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


MoEL: Mixture of Empathetic Listeners

Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, Pascale Fung
Center for Artificial Intelligence Research (CAiRE)
Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
{zlinao,amadotto,jmshinaa,pxuab}@connect.ust.hk, [email protected]

Abstract

Previous research on empathetic dialogue systems has mostly focused on generating responses given certain emotions. However, being empathetic not only requires the ability to generate emotional responses, but, more importantly, requires understanding the user emotions and replying appropriately. In this paper, we propose a novel end-to-end approach for modeling empathy in dialogue systems: Mixture of Empathetic Listeners (MoEL). Our model first captures the user emotions and outputs an emotion distribution. Based on this, MoEL softly combines the output states of the appropriate Listener(s), which are each optimized to react to certain emotions, and generates an empathetic response. Human evaluations on the empathetic-dialogues dataset (Rashkin et al., 2018) confirm that MoEL outperforms the multitask training baseline in terms of empathy, relevance, and fluency. Furthermore, a case study on the generated responses of different Listeners shows the high interpretability of our model.

1 Introduction

Neural network approaches to conversation modeling have been shown to be successful at scalable training and at generating fluent and relevant responses (Vinyals and Le, 2015). However, it has been pointed out by Li et al. (2016a,b,c); Wu et al. (2018b) that using only Maximum Likelihood Estimation as the objective function tends to lead to generic and repetitive responses like "I am sorry". Furthermore, many others have shown that incorporating additional inductive bias leads to a more engaging chatbot, such as understanding commonsense (Dinan et al., 2018) or modeling consistent persona (Li et al., 2016b; Zhang et al., 2018a; Mazare et al., 2018a).

Meanwhile, another important aspect of an engaging human conversation that has received relatively less attention is emotional understanding and empathy (Rashkin et al., 2018; Dinan et al., 2019; Wolf et al., 2019).

Emotion: Angry
Situation: I was furious when I got in my first car wreck.

Speaker: I was driving on the interstate and another car ran into the back of me.
Listener: Wow. Did you get hurt? Sounds scary.
Speaker: No, just the airbags went off and I hit my head and got a few bruises.
Listener: I am always scared about those airbags! I am so glad you are ok!

Table 1: One conversation from the empathetic-dialogues dataset: a speaker describes the situation he or she is facing, and a listener tries to understand the speaker's feelings and respond accordingly.

Intuitively, ordinary social conversations between two humans are often about their daily lives and revolve around happy or sad experiences. In such scenarios, people generally tend to respond in a way that acknowledges the feelings of their conversational partners.

Table 1 shows a conversation from the empathetic-dialogues dataset (Rashkin et al., 2018) illustrating how an empathetic person would respond to the stressful situation the Speaker has been through. However, despite the importance of empathy and emotional understanding in human conversations, it is still very challenging to train a dialogue agent able to recognize and respond with the correct emotion.

So far, to solve the problem of empathetic dialogue response generation, which is to understand the user emotion and respond appropriately (Bertero et al., 2016), there have been mainly two lines of work.


Figure 1: The proposed model, Mixture of Empathetic Listeners, which has an emotion tracker, n empathetic listeners along with a shared listener, and a meta listener that fuses the information from the listeners and produces the empathetic response.

The first is a multi-task approach that jointly trains a model to predict the current emotional state of the user and to generate an appropriate response based on that state (Lubis et al., 2018; Rashkin et al., 2018). The second line of work instead focuses on conditioning the response generation on a certain fixed emotion (Hu et al., 2017; Wang and Wan, 2018; Zhou and Wang, 2018; Zhou et al., 2018).

Both cases have succeeded in generating empathetic and emotional responses, but have neglected some crucial points in empathetic dialogue response generation. 1) The first assumes that by understanding the emotion, the model implicitly learns how to respond appropriately. However, without any additional inductive bias, a single decoder learning to respond to all emotions will not only lose interpretability in the generation process, but will also promote more generic responses. 2) The second assumes that the emotion on which to condition the generation is given as input, but we often do not know which emotion is appropriate in order to generate an empathetic response.

Therefore, in this paper, to address the above issues, we propose a novel end-to-end empathetic dialogue agent called Mixture of Empathetic Listeners1 (MoEL). Similar to Rashkin et al. (2018), we first encode the dialogue context and use it to recognize the emotional state (out of n possible emotions). However, the main difference is that our model consists of n decoders, further denoted as listeners, which are optimized to react to each context emotion accordingly. The listeners are trained along with a Meta Listener that softly combines the output decoder states of each listener according to the emotion classification distribution. Such a design allows our model to explicitly learn how to choose an appropriate reaction based on its understanding of the context emotion. A detailed illustration of MoEL is shown in Figure 1.

The proposed model is tested against several competitive baselines (Vaswani et al., 2017; Rashkin et al., 2018) and evaluated with human judges. The experimental results show that our approach outperforms the baselines in both empathy and relevance. Finally, our analysis demonstrates not only that MoEL effectively attends to the right listener, but also that each listener learns how to properly react to its corresponding emotion, hence allowing a more interpretable generative process.

1 The code will be released at https://github.com/HLTCHKUST/MoEL


2 Related Work

Conversational Models: Open-domain conversational models have been widely studied (Serban et al., 2016; Vinyals and Le, 2015; Wolf et al., 2019). A recent trend is to produce personalized responses by conditioning the generation on a persona profile to make the responses more consistent through the dialogue (Li et al., 2016b). In particular, the PersonaChat dataset (Zhang et al., 2018b; Kulikov et al., 2018) was created, and then extended in the ConvAI2 challenge (Dinan et al., 2019), to show that adding persona information as input to the model produces responses with more consistent personas. Based on this, several follow-up works have been presented (Mazare et al., 2018b; Hancock et al., 2019; Joshi et al., 2017; Kulikov et al., 2018; Yavuz et al., 2018; Zemlyanskiy and Sha, 2018; Madotto et al., 2019). However, such personalized dialogue agents focus only on modeling a consistent persona and often neglect the feelings of their conversation partners.

Another line of work combines retrieval and generation to promote response diversity (Cai et al., 2018; Weston et al., 2018; Wu et al., 2018b). However, fewer works focus on emotion (Winata et al., 2017, 2019; Xu et al., 2018; Fan et al., 2018a,c,b; Lee et al., 2019) and empathy in the context of dialogue systems (Bertero et al., 2016; Chatterjee et al., 2019a,b; Shin et al., 2019). For generating emotional dialogues, Hu et al. (2017); Wang and Wan (2018); Zhou and Wang (2018) successfully introduce frameworks for controlling the sentiment and emotion of the generated response, while Zhou and Wang (2018) also introduce a new Twitter conversation dataset and propose to distantly supervise the generative model with emojis. Meanwhile, Lubis et al. (2018); Rashkin et al. (2018) also introduce new datasets for empathetic dialogues and train multi-task models on them.

Mixture of Experts: The idea of having specialized parameters, or so-called experts, has been a widely studied topic over the last two decades (Jacobs et al., 1991; Jordan and Jacobs, 1994). Different architectures and methodologies have been used, such as SVMs (Collobert et al., 2002), Gaussian Processes (Tresp, 2001; Theis and Bethge, 2015; Deisenroth and Ng, 2015), Dirichlet Processes (Shahbaba and Neal, 2009), hierarchical experts (Yao et al., 2009), an infinite number of experts (Rasmussen and Ghahramani, 2002), and sequential expert addition (Aljundi et al., 2017). More recently, the Mixture of Experts model (Shazeer et al., 2017; Kaiser et al., 2017) was proposed, which adds a large number of experts between two LSTM (Schmidhuber, 1987) layers to enhance the capacity of the model. This idea of having independent specialized experts inspires our approach of modeling the reaction to each emotion with a separate expert.

3 Mixture of Empathetic Listeners

The dialogue context is an alternating set of utterances from the speaker and the listener. We denote the dialogue context as C = {U_1, S_1, U_2, S_2, ..., U_t} and the speaker emotion state at each utterance as Emo = {e_1, e_2, ..., e_t}, where e_i ∈ {1, ..., n} for all i. Our model aims to track the speaker emotional state e_t from the dialogue context C and to generate an empathetic response S_t.

Overall, MoEL is composed of three components: an emotion tracker, emotion-aware listeners, and a meta listener, as shown in Figure 1. The emotion tracker (which is also the context encoder) encodes C and computes a distribution over the possible user emotions. Then all the listeners independently attend to this distribution to compute their own representations. Finally, the meta listener takes the weighted sum of the representations from the listeners and generates the final response.

3.1 Embedding

We define the context embedding E_C ∈ R^{|V|×d_emb} and the response embedding E_R ∈ R^{|V|×d_emb}, which are used to convert tokens into embeddings. In multi-turn dialogues, ensuring that the model is able to distinguish among turns is essential, especially when multiple emotions are present in different turns. Hence, we incorporate a dialogue state embedding in the input, which enables the encoder to distinguish speaker utterances from listener utterances (Wolf et al., 2019). As shown in Figure 2, our context embedding E_C is the sum of the word embedding E_W, the positional embedding E_P (Vaswani et al., 2017), and the dialogue state embedding E_D:

E_C(C) = E_W(C) + E_P(C) + E_D(C)    (1)
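
For illustration, a minimal PyTorch sketch of the embedding in Equation 1 could look as follows (this is not the released implementation; the module name, MAX_LEN, and the use of a learned positional embedding are our assumptions):

```python
import torch
import torch.nn as nn

MAX_LEN = 512  # assumed maximum flattened context length

class ContextEmbedding(nn.Module):
    """E_C(C) = E_W(C) + E_P(C) + E_D(C), as in Equation 1."""
    def __init__(self, vocab_size, d_emb):
        super().__init__()
        self.word = nn.Embedding(vocab_size, d_emb)    # E_W
        self.position = nn.Embedding(MAX_LEN, d_emb)   # E_P (learned here; Vaswani et al. use sinusoids)
        self.state = nn.Embedding(2, d_emb)            # E_D: 0 = speaker turn, 1 = listener turn

    def forward(self, token_ids, turn_ids):
        # token_ids, turn_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word(token_ids) + self.position(positions) + self.state(turn_ids)
```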

Figure 2: Context embedding is computed by summing up the word embedding, dialogue state embedding, and positional embedding for each token.

3.2 Emotion Tracker

MoEL uses a standard transformer encoder (Vaswani et al., 2017) for the emotion tracker. We first flatten all dialogue turns in C and map each token into its vectorial representation using the context embedding E_C. The encoder then encodes the context sequence into a context representation. As in BERT (Devlin et al., 2018), we add a query token QRY at the beginning of each input sequence and use it to compute the weighted sum of the output tensor. Denoting a transformer encoder as TRS_Enc, the corresponding context representation becomes:

H = TRS_Enc(E_C([QRY; C]))    (2)

where [;] denotes concatenation and H ∈ R^{L×d_model}, with L the sequence length. We then define the final representation of the token QRY as

q = H_0    (3)

where q ∈ R^{d_model}, which is then used as the query for generating the emotion distribution.
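
As a sketch, the query extraction in Equations 2–3 could be implemented as below (function and argument names are ours; `encoder` and `embed` stand for the transformer encoder and the context embedding above):

```python
import torch

def encode_context(encoder, embed, context_ids, turn_ids, qry_id):
    """Equations 2-3: prepend QRY, encode, and return the states H and the query q = H_0."""
    batch = context_ids.size(0)
    qry = torch.full((batch, 1), qry_id, dtype=torch.long, device=context_ids.device)
    tokens = torch.cat([qry, context_ids], dim=1)                       # [QRY; C]
    qry_turn = torch.zeros_like(qry)                                    # assumed dialogue-state id for QRY
    H = encoder(embed(tokens, torch.cat([qry_turn, turn_ids], dim=1)))  # Eq. 2: (batch, L, d_model)
    return H, H[:, 0]                                                   # Eq. 3: q in R^{d_model}
```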

3.3 Emotion Aware Listeners

The emotion-aware listeners mainly consist of 1) a shared listener that learns shared information for all emotions and 2) n independently parameterized Transformer decoders (Vaswani et al., 2017) that learn how to react appropriately given a particular emotional state. All the listeners are modeled by a standard transformer decoder layer block, denoted as TRS_Dec, which is made of three sub-components: a multi-head self-attention over the response input embedding, a multi-head attention over the output of the emotion tracker, and a position-wise fully connected feed-forward network.

Thus, we define the set of listeners as L = [TRS_Dec^0, ..., TRS_Dec^n]. Given the target sequence shifted by one, r_{0:t−1}, each listener computes its own emotional response representation V_i:

V_i = TRS_Dec^i(H, E_R(r_{0:t−1}))    (4)

where TRS_Dec^i refers to the i-th listener, including the shared one. Conceptually, we expect the output of the shared listener, TRS_Dec^0, to be a general representation that helps the model capture the dialogue context, while each empathetic listener learns how to respond to a particular emotion. To model this behavior, we assign a different weight to each empathetic listener according to the user emotion distribution, while assigning a fixed weight of 1 to the shared listener.

To elaborate, we construct a Key-Value Memory Network (Miller et al., 2016) and represent each memory slot as a vector pair (k_i, V_i), where k_i ∈ R^{d_model} denotes the key vector and V_i comes from Equation 4. Then, the encoder-informed query q is used to address the key vectors k by performing a dot product followed by a softmax function. Thus, we have:

p_i = exp(q^T k_i) / Σ_{j=1}^{n} exp(q^T k_j)    (5)

Each p_i is the score assigned to V_i and is thus used as the weight of that listener. During training, given the speaker emotion state e_t, we supervise the weights p by maximizing the probability of the emotion state e_t with a cross-entropy loss:

L_1 = − log p_{e_t}    (6)

Finally, the combined output representation is computed as the weighted sum of the memory values V_i plus the shared listener output V_0:

V_M = V_0 + Σ_{i=1}^{n} p_i V_i    (7)
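
A minimal sketch of the key addressing and mixing in Equations 5–7 (tensor names and shapes are assumptions, not the released code):

```python
import torch
import torch.nn.functional as F

def mix_listeners(q, keys, V_shared, V_listeners):
    """q: (B, d_model); keys: (n, d_model); V_shared: (B, T, d_model); V_listeners: (B, n, T, d_model)."""
    logits = q @ keys.t()                                            # dot products q^T k_i, (B, n)
    p = F.softmax(logits, dim=-1)                                    # Eq. 5
    V_M = V_shared + torch.einsum("bn,bntd->btd", p, V_listeners)    # Eq. 7
    return V_M, logits                                               # logits also feed the loss in Eq. 6
```

The same logits can be passed to a cross-entropy loss against the gold emotion e_t, which is exactly Equation 6.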

3.4 Meta Listener

Finally, the Meta Listener is implemented with another transformer decoder layer, which further transforms the representations from the listeners and generates the final response.


Figure 3: Top-1 and Top-5 emotion detection accuracy over the 32 emotions at each turn.

Model       Params.   BLEU   Empathy   Relevance   Fluency
Gold        -         -      3.93      3.93        3.35
TRS         16.94M    3.02   3.32      3.47        3.52
Multi-TRS   16.95M    2.92   3.36      3.57        3.31
MoEL        23.1M     2.90   3.44      3.70        3.47

Table 2: Comparison between our proposed method and the baselines. All models obtain similar BLEU scores. MoEL achieves the highest Empathy and Relevance scores, while TRS achieves a better Fluency score. The number of parameters of each model is reported.

The intuition is that each listener specializes in a certain emotion, and the Meta Listener gathers the opinions generated by the multiple listeners to produce the final response. Hence, we define another decoder TRS_Dec^Meta and an affine transformation W ∈ R^{d_model×|V|} to compute:

O = TRS_Dec^Meta(H, V_M)    (8)

p(r_{1:t} | C, r_{0:t−1}) = softmax(O^T W)    (9)

where O ∈ R^{d_model×t} is the output of the meta listener and p(r_{1:t} | C, r_{0:t−1}) is a distribution over the vocabulary for the next tokens. We then use standard maximum likelihood estimation (MLE) to optimize the response prediction:

L_2 = − log p(S_t | C)    (10)

Lastly, all the parameters are jointly trained end-to-end to optimize both listener selection and response generation by minimizing the weighted sum of the two losses:

L = α L_1 + β L_2    (11)

where α and β are hyperparameters that balance the two losses.
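
As a sketch, the joint objective of Equations 6, 10, and 11 can be computed as below (variable names such as pad_id, alpha, and beta are illustrative; α = β = 1 in our experiments):

```python
import torch.nn.functional as F

# emotion_logits: (B, n) scores from the query-key dot products; emotion_labels: (B,)
# token_logits: (B, T, |V|) from Eq. 9; target_ids: (B, T); pad_id marks padding tokens
L1 = F.cross_entropy(emotion_logits, emotion_labels)                      # Eq. 6
L2 = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                     target_ids.reshape(-1), ignore_index=pad_id)         # Eq. 10 (token-level MLE)
loss = alpha * L1 + beta * L2                                             # Eq. 11
loss.backward()
```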

Model               Win      Loss     Tie
MoEL vs TRS         37.3%    18.7%    44.0%
MoEL vs Multi-TRS   36.7%    32.6%    30.7%

Table 3: Results of the human A/B test. Tests are conducted pairwise between MoEL and the baseline models.

4 Experiment

4.1 Dataset

We conduct our experiments on the empathetic-dialogues dataset (Rashkin et al., 2018), which consists of 25k one-to-one open-domain conversations grounded in emotional situations. The dataset provides 32 evenly distributed emotion labels. Table 1 shows an example from the training set. The speakers talk about their situations, and the listeners try to understand their feelings and reply accordingly. At training time the emotion labels of the speakers are given, while we hide the labels at test time to evaluate the empathy of our model.
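
For illustration, a single training instance can be thought of as a context, an emotion label, and a target response; the field names below are hypothetical, not the dataset's actual schema (the turns are taken from Table 1):

```python
example = {
    "emotion": "angry",   # one of the 32 labels, hidden at test time
    "context": [
        "I was driving on the interstate and another car ran into the back of me.",
        "Wow. Did you get hurt? Sounds scary.",
        "No just the airbags went off and I hit my head and got a few bruises.",
    ],
    "target": "I am always scared about those airbags! I am so glad you are ok!",
}
```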

4.2 Training

We train our model using the Adam optimizer (Kingma and Ba, 2014) and vary the learning rate during training following Vaswani et al. (2017). The weights of the two losses, α and β, are both set to 1 for simplicity. We use pre-trained GloVe vectors (Pennington et al., 2014) to initialize the word embedding, which is shared across the encoder and the decoder. The rest of the parameters are randomly initialized.

In the early training stage, the emotion tracker randomly assigns weights to the listeners and may send noisy gradients back to the wrong listeners, which can make convergence harder. To stabilize the learning process, we replace the distribution p over the listeners with the oracle emotion e_t with a certain probability ε_oracle, which we gradually anneal during training.


We set an annealing rate γ = 1 × 10^−3 and a threshold t_thd = 1 × 10^4; thus, at each training iteration t we compute:

ε_oracle = γ + (1 − γ) e^{−t / t_thd}    (12)
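
Equation 12 corresponds to the following schedule (a sketch; γ and t_thd as given above, and the sampling of the oracle decision is our assumption):

```python
import math
import random

def use_oracle(step, gamma=1e-3, t_thd=1e4):
    """Return True with probability eps_oracle (Eq. 12), i.e. feed the oracle emotion e_t instead of p."""
    eps_oracle = gamma + (1.0 - gamma) * math.exp(-step / t_thd)
    return random.random() < eps_oracle
```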

4.3 Baselines

We compare our model with two baselines:

Transformer (TRS) The standard Transformer model (Vaswani et al., 2017), trained to minimize the MLE loss in Equation 10.

Multitask Transformer (Multi-TRS) A multitask Transformer trained as in Rashkin et al. (2018) to incorporate additional supervised information about the emotion. The encoder of the multitask transformer is the same as our emotion tracker, and the context representation q from Equation 3 is used as input to an emotion classifier. The whole model is jointly trained by optimizing both the classification and the generation loss.

4.4 Hyperparameters

In all of our experiments we use 300-dimensional word embeddings and a hidden size of 300 everywhere. We use 2 self-attention layers, each with 2 attention heads of depth 40. We replace the position-wise feed-forward sub-layer with a 1D convolution with 50 filters of width 3. We train all models with batch size 16 and use batch size 1 at test time.
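
For reference, the hyperparameters above can be collected into a configuration sketch (the key names are ours, not the released code's):

```python
config = {
    "emb_dim": 300,          # GloVe-initialized word embeddings, shared by encoder and decoder
    "hidden_dim": 300,
    "num_layers": 2,         # self-attention layers
    "num_heads": 2,          # attention heads, each of depth 40
    "conv_filters": 50,      # 1D convolution replacing the position-wise feed-forward sub-layer
    "conv_width": 3,
    "train_batch_size": 16,
    "test_batch_size": 1,
    "alpha": 1.0,            # weight of the emotion classification loss
    "beta": 1.0,             # weight of the generation loss
}
```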

4.5 Evaluation Metrics

BLEU We compute BLEU scores (Papineni et al., 2002) to compare the generated responses against the human responses. However, in open-domain dialogue response generation, BLEU is not a good measure of generation quality (Liu et al., 2016), so we use it only as a reference.
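
For example, a corpus-level BLEU score can be computed with NLTK as follows (a sketch; not necessarily the implementation used for the reported numbers):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# references and hypotheses are assumed to be parallel lists of strings
refs = [[ref.split()] for ref in references]   # one reference per dialogue
hyps = [hyp.split() for hyp in hypotheses]
bleu = corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
```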

Human Ratings To measure the quality of the generated responses, we conduct human evaluations with Amazon Mechanical Turk. Following Rashkin et al. (2018), we first randomly sample 100 dialogues and their corresponding generations from MoEL and the baselines. For each response, we assign three human annotators to score the models on the following aspects: Empathy, Relevance, and Fluency. Note that we evaluate each metric independently, and the scores range between 1 and 5, where 1 is "not at all" and 5 is "very much".

We ask the human judges to evaluate each of the following categories on a 1 to 5 scale, where 5 is the best score.

• Empathy / Sympathy: Did the responses from the LISTENER show understanding of the feelings of the SPEAKER talking about their experience?

• Relevance: Did the responses of the LISTENER seem appropriate to the conversation? Were they on-topic?

• Fluency: Could you understand the responses from the LISTENER? Did the language seem accurate?

Human A/B Test In this human evaluation task, we aim to directly compare the generated responses with each other. We randomly sample 100 dialogues each for MoEL vs {TRS, Multi-TRS}. Three workers are given randomly ordered responses from either MoEL or {TRS, Multi-TRS} and are prompted to choose the better response. They can either choose one of the responses or select tie when the provided options are either both good or both bad.

5 Results

Emotion detection To verify whether our model can attend to the appropriate listeners, we compute the emotion detection accuracy at each turn. Our model achieves 38%, 63%, and 74% top-1, top-3, and top-5 detection accuracy over the 32 emotions. We notice that some emotions frequently appear in similar contexts (e.g., Annoyed, Angry, Furious), which may degrade the detection accuracy. Figure 3 shows the per-class accuracy on the test set. Using top-5 accuracy, the majority of the emotions reach around 80%.
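
Top-k accuracy over the 32 classes can be computed as in the following sketch (illustrative, not the actual evaluation script):

```python
import torch

def topk_accuracy(emotion_logits, labels, k):
    """emotion_logits: (N, 32); labels: (N,). Fraction of examples whose gold label is in the top-k."""
    topk = emotion_logits.topk(k, dim=-1).indices            # (N, k)
    return (topk == labels.unsqueeze(-1)).any(dim=-1).float().mean().item()
```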

Response evaluation Both automatic and human evaluation results are shown in Table 2. TRS achieves the highest BLEU and Fluency scores but the lowest Empathy and Relevance scores. This shows that the responses generated by TRS are more generic and cannot accurately capture the user emotions. With the additional supervision on user emotions, multi-task training improves both the Empathy and Relevance scores, but it still degrades Fluency. In contrast, MoEL achieves the highest Empathy and Relevance scores.


Emotion: Disgusted
Context: Speaker: i had to dog-sit for my dad about a week ago. i had to clean up dog poop at least 10 times, it was awful.
TRS: that is horrible. i hope you did not have to clean it up.
Multi-TRS: oh my goodness, that must have been so scary!
MoEL: oh wow, that is pretty gross.
Goal: ohh, cleaning dog poop is always been annoying, yet petting the dog is always amusing

Emotion: Disappointed
Context: Speaker: i was turned down for a job i really thought i would be a good fit at.
Listener: did they give you any reason for not giving you the job?
Speaker: someone else interviewed who had much more experience than i did.
TRS: that is good. i hope you have a good time.
Multi-TRS: that is a good idea. i hope it was a good idea.
MoEL: i am sorry to hear that. i am sure you will do great again
Goal: well, at least a candidate that deserved the job got it.

Speaker: so my friend's baby fell down and scraped his knee yesterday and i was trying so hard to cheer him up.
Listener (caring): that was really nice of you. I am glad you were able to help him.
Listener (terrified): oh wow! that must have been so scary.
Listener (sad): oh no, i am so sorry to hear that.
Listener (excited): that sounds like a good idea.
Listener (proud): that was awesome! did he get any good at you?

Speaker: my husband lost a job but I am hoping he can find a full-time job soon.
Listener (hopeful): what kind of job is it?
Listener (terrified): oh that sucks. you should have been together.
Listener (sad): i am so sorry. i am sure he will be fine.
Listener (excited): that sounds awesome. what kind of job did he get you for?
Listener (proud): oh wow! congratulations to him. you must be proud of him.

Table 4: Generated responses from TRS, Multi-TRS, and MoEL for two different user emotion states (top), and generations from different listeners given the same context (bottom). We apply hard attention to the Terrified, Sad, Excited, and Proud listeners.

This suggests that the multi-expert strategy helps the model capture the user emotional state and the context simultaneously, and elicit a more appropriate response. The human A/B test also confirms that the responses from our model are preferred by the human judges.

6 Analysis

To understand whether and how MoEL improves over the baselines, learns each emotion, and properly reacts to it, we conduct three different analyses: a model response comparison, a listener analysis, and a visualization of the emotion distribution p.

Model response comparison The top part of Table 4 compares the responses generated by MoEL and the two baselines for two different speaker emotional states. In the first example, MoEL captures the exact emotion of the speaker by replying that cleaning up dog poop is "pretty gross", instead of "horrible" or "scary". In the second example, both TRS and Multi-TRS fail to understand that the speaker is disappointed about the failure of his interview, and they generate inappropriate responses. On the other hand, MoEL shows an empathetic response by comforting the speaker with "I am sure you will do great again". More examples can be found in the Appendix.


Context: i recently spoke to my father for the first time in years. it reminded me of my childhood and the distant past.
Response: oh that is so sweet. i am sure you have a lot of good memories.

Context: running my first (and maybe only!) marathon felt like *such* a huge accomplishment!
Response: that is awesome! i bet you are proud!

Context: i just moved to this neighborhood and some dumb criminals shot one of my neighbors and ran into the woods!
Response: that is horrible! i am glad you are okay!

Context: i recently found out that a person i admired did not feel the same way for me. i was pretty surprised.
Response: that is so cool. i am glad you enjoyed it

Context: my husband lost a job but i am hoping he can find a full time job soon.
Response: what kind of job is it?

Context: so my friend's baby fell down and scraped his knee yesterday and i was trying so hard to cheer him up
Response: oh wow! that must have been terrible. did you get hurt?

Context: i could not wait to go to the concert.
Response: that sounds like fun.

Context: i find it odd – and annoying – when people do not respond to or even acknowledge emails.
Response: that is so annoying. i hate when that happens

Figure 4: The visualization of attention over the listeners: the left side shows each context followed by the response generated by MoEL, and the heat map illustrates the attention weights over the 32 listeners.

Listener analysis To better understand how each listener learned to react to different contexts, we compare the responses produced by different listeners. To do so, we fix the input dialogue context and manually modify the attention distribution p used to produce the response. We experiment with the correct listener and four other listeners: Listener_terrified, Listener_sad, Listener_excited, and Listener_proud. Given the same context, we expect different listeners to react differently, as this is our inductive bias. For example, Listener_sad is optimized to comfort sad people, and Listener_excited and Listener_proud share the positive emotions of the user. From the generation results in the bottom part of Table 4 we can see that the corresponding listeners produce empathetic and relevant responses when they reasonably match the speaker emotions. However, when the expected emotion label is opposite to the selected listener, as with caring and sad, the response becomes emotionally inappropriate.

Interestingly, in the last example, the sad listener actually produces a more meaningful response by encouraging the speaker. This is due to the first part of the context, which conveys a sad emotion. On the other hand, for the same example, the excited listener responds with a very relevant yet unsympathetic response. In addition, as many dialogue contexts contain multiple emotions, being able to capture them would lead to a better understanding of the speaker's emotional state.

Visualization of Emotion Distribution Finally, to understand how MoEL chooses the listener according to the context, we visualize the emotion distribution p in Figure 4. In most cases, the model attends to the proper listeners (emotions) and generates a proper response. This is also confirmed by the accuracy results shown in Figure 3. However, our model sometimes focuses only on part of the dialogue context. For example, in the fifth example in Figure 4, the model fails to detect the real emotion of the speaker, as the context contains "I was pretty surprised" in its last turn.

On the other hand, the last three rows of the heat map indicate that the model learns to leverage multiple listeners to produce an empathetic response. For example, when the speaker talks about criminals who shot one of his neighbors, MoEL successfully detects both the annoyed and afraid emotions from the context and replies with an appropriate response, "that is horrible! i am glad you are okay!", that addresses both emotions. However, in the third row, the model produces "you" instead of "he" by mistake. Although the model is able to capture relevant emotions in this case, other emotions also have non-negligible weights, which results in a smooth emotion distribution p that prevents the meta listener from accurately generating a response.

7 Conclusion & Future Work

In this paper, we propose a novel way to generate empathetic dialogue responses using a Mixture of Empathetic Listeners (MoEL). Differently from previous work, our model understands the user's feelings and responds accordingly by learning specific listeners for each emotion. We benchmark our model on the empathetic-dialogues dataset (Rashkin et al., 2018), a multi-turn open-domain conversation corpus grounded in emotional situations. Our experimental results show that MoEL achieves competitive performance on the task with the advantage of being more interpretable than other conventional models. Finally, we show that our model automatically selects the correct emotional decoder and effectively generates an empathetic response.

One possible extension of this work would be to incorporate it with persona (Zhang et al., 2018a) and task-oriented dialogue systems (Gao et al., 2018; Madotto et al., 2018; Wu et al., 2019, 2017, 2018a; Reddy et al., 2018; Raghu et al., 2019). Having a persona would allow the system to produce more consistent and personalized responses, and combining open-domain conversations with task-oriented dialogue systems would equip the system with more engaging conversational capabilities, resulting in a more versatile dialogue system.

Acknowledgments

This work has been partially funded by ITF/319/16FP and MRP/055/18 of the Innovation Technology Commission, the Hong Kong SAR Government. We sincerely thank the three anonymous reviewers for their insightful comments.

References

Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. 2017. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3366–3375.

Dario Bertero, Farhad Bin Siddique, Chien-Sheng Wu, Yan Wan, Ricky Ho Yin Chan, and Pascale Fung. 2016. Real-time speech emotion and sentiment recognition for interactive dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1042–1047.

Deng Cai, Yan Wang, Victoria Bi, Zhaopeng Tu, Xiaojiang Liu, Wai Lam, and Shuming Shi. 2018. Skeleton-to-response: Dialogue generation guided by retrieval memory. arXiv preprint arXiv:1809.05296.

Ankush Chatterjee, Umang Gupta, Manoj Kumar Chinnakotla, Radhakrishnan Srikanth, Michel Galley, and Puneet Agrawal. 2019a. Understanding emotions in text using deep learning and big data. Computers in Human Behavior, 93:309–317.

Ankush Chatterjee, Kedhar Nath Narahari, Meghana Joshi, and Puneet Agrawal. 2019b. SemEval-2019 task 3: EmoContext: Contextual emotion detection in text. In Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), Minneapolis, Minnesota.

Ronan Collobert, Samy Bengio, and Yoshua Bengio. 2002. A parallel mixture of SVMs for very large scale problems. In Advances in Neural Information Processing Systems, pages 633–640.

Marc Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian processes. In International Conference on Machine Learning, pages 1481–1490.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2019. The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. ICLR.

Yingruo Fan, Jacqueline CK Lam, and Victor OK Li. 2018a. Multi-region ensemble convolutional neural network for facial expression recognition. In International Conference on Artificial Neural Networks, pages 84–94. Springer.

Yingruo Fan, Jacqueline CK Lam, and Victor OK Li. 2018b. Unsupervised domain adaptation with generative adversarial networks for facial emotion recognition. In 2018 IEEE International Conference on Big Data (Big Data), pages 4460–4464. IEEE.

Yingruo Fan, Jacqueline CK Lam, and Victor OK Li. 2018c. Video-based emotion recognition using deeply-supervised neural networks. In Proceedings of the 2018 International Conference on Multimodal Interaction, pages 584–588. ACM.


Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374. ACM.

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, Geoffrey E Hinton, et al. 1991. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87.

Michael I Jordan and Robert A Jacobs. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214.

Chaitanya K Joshi, Fei Mi, and Boi Faltings. 2017. Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503.

Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to learn them all. arXiv preprint arXiv:1706.05137.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.

Nayeon Lee, Zihan Liu, and Pascale Fung. 2019. Team yeon-zi at SemEval-2019 task 4: Hyperpartisan news detection by de-noising weakly-labeled data. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1052–1056.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 994–1003.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016c. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Nurul Lubis, Sakriani Sakti, Koichiro Yoshino, and Satoshi Nakamura. 2018. Eliciting positive emotion through affect-sensitive dialogue response generation: A neural network approach. In Thirty-Second AAAI Conference on Artificial Intelligence.

Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 5454–5459.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1468–1478.

Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018a. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779.

Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018b. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779. Association for Computational Linguistics.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.


Dinesh Raghu, Nikhil Gupta, and Mausam. 2019. Disentangling language and knowledge in task-oriented dialogs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1239–1255, Minneapolis, Minnesota. Association for Computational Linguistics.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2018. I know the feeling: Learning to converse with empathy. arXiv preprint arXiv:1811.00207.

Carl E Rasmussen and Zoubin Ghahramani. 2002. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems, pages 881–888.

Revanth Reddy, Danish Contractor, Dinesh Raghu, and Sachindra Joshi. 2018. Multi-level memory for task oriented dialogs. arXiv preprint arXiv:1810.10647.

Jurgen Schmidhuber. 1987. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universitat Munchen, Germany, 14 May.

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016. Generative deep neural networks for dialogue: A short review. arXiv preprint arXiv:1611.06216.

Babak Shahbaba and Radford Neal. 2009. Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research, 10(Aug):1829–1850.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Jamin Shin, Peng Xu, Andrea Madotto, and Pascale Fung. 2019. HappyBot: Generating empathetic dialogue responses by improving user experience look-ahead. arXiv preprint arXiv:1906.08487.

Lucas Theis and Matthias Bethge. 2015. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pages 1927–1935.

Volker Tresp. 2001. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems, pages 654–660.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In International Conference on Machine Learning.

Ke Wang and Xiaojun Wan. 2018. SentiGAN: Generating sentimental texts via mixture adversarial networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4446–4452. AAAI Press.

Jason Weston, Emily Dinan, and Alexander Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 87–92.

Genta Indra Winata, Onno Kampman, Yang Yang, Anik Dey, and Pascale Fung. 2017. Nora the empathetic psychologist. Proc. Interspeech 2017, pages 3437–3438.

Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Jamin Shin, Yan Xu, Peng Xu, and Pascale Fung. 2019. CAiRE HKUST at SemEval-2019 task 3: Hierarchical attention for dialogue emotion classification. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 142–147.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Chien-Sheng Wu, Andrea Madotto, Genta Winata, and Pascale Fung. 2017. End-to-end recurrent entity network for entity-value independent goal-oriented dialog learning. In Dialog System Technology Challenges Workshop, DSTC6.

Chien-Sheng Wu, Andrea Madotto, Genta Indra Winata, and Pascale Fung. 2018a. End-to-end dynamic query memory network for entity-value independent task-oriented dialog. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6154–6158. IEEE.

Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019. Global-to-local memory pointer networks for task-oriented dialogue. arXiv preprint arXiv:1901.04713.

Yu Wu, Furu Wei, Shaohan Huang, Zhoujun Li, and Ming Zhou. 2018b. Response generation by context-aware prototype editing. arXiv preprint arXiv:1806.07042.

Peng Xu, Andrea Madotto, Chien-Sheng Wu, Ji Ho Park, and Pascale Fung. 2018. Emo2Vec: Learning generalized emotion representation by multi-task training. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 292–298.

Bangpeng Yao, Dirk Walther, Diane Beck, and Li Fei-Fei. 2009. Hierarchical mixture of classification experts uncovers interactions between brain regions. In Advances in Neural Information Processing Systems, pages 2178–2186.


Semih Yavuz, Abhinav Rastogi, Guanlin Chao, Dilek Hakkani-Tur, and Amazon Alexa AI. 2018. DeepCopy: Grounded response generation with hierarchical pointer networks. ConvAI Workshop@NIPS.

Yury Zemlyanskiy and Fei Sha. 2018. Aiming to know you better perhaps makes me a more engaging dialogue partner. CoNLL 2018, page 551.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018a. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018b. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213. Association for Computational Linguistics.

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Thirty-Second AAAI Conference on Artificial Intelligence.

Xianda Zhou and William Yang Wang. 2018. MojiTalk: Generating emotional responses at scale. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1128–1137.