
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 5614–5624, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


A Knowledge Regularized Hierarchical Approach for Emotion Cause Analysis

Chuang Fan 1,5, Hongyu Yan 1,5, Jiachen Du 1,5, Lin Gui 2, Lidong Bing 3, Min Yang 4, Ruifeng Xu 1,5,6*, Ruibin Mao 7

1 Harbin Institute of Technology (Shenzhen), China  2 University of Warwick, UK
3 R&D Center Singapore, Machine Intelligence Technology, Alibaba DAMO Academy
4 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
5 Joint Lab of Harbin Institute of Technology and RICOH
6 Peng Cheng Laboratory, China  7 Shenzhen Securities Information Co., Ltd, China

Abstract

Emotion cause analysis, which aims to identify the reasons behind emotions, is a key topic in sentiment analysis. A variety of neural network models have been proposed recently; however, these previous models mostly focus on the learning architecture with local textual information, ignoring the discourse and prior knowledge, which play crucial roles in human text comprehension. In this paper, we propose a new method to extract emotion causes with a hierarchical neural model and knowledge-based regularizations, which aims to incorporate discourse context information and restrain the parameters by sentiment lexicon and common knowledge. The experimental results demonstrate that our proposed method achieves state-of-the-art performance on two public datasets in different languages (Chinese and English), outperforming a number of competitive baselines by at least 2.08% in F-measure.

1 Introduction

Sentiment analysis has gained increasing popularity in recent years due to many useful applications (Pang and Lee, 2007). The goal of sentiment analysis is to classify the sentiment polarity of a given text as positive, negative, neutral, or more fine-grained classes (Kim, 2014; Li et al., 2015; Yang et al., 2016; Zhou et al., 2016; Chen et al., 2017; Qian et al., 2017; Li et al., 2018b; Yu et al., 2018). Most of these studies assume that emotion expressions are already observed and try to identify the emotion categories from text. However, in practice, such as in product comments or political reviews, we may care more about the reason why the customers or critics hold the emotion than about a simple category label, because providers can improve the quality of their products or services according to the emotion causes provided by users. Emotion cause analysis (ECA) aims to identify the reasons behind a certain emotion expression in an event text, for example:

Ex.1 When the children saw the gifts I prepared carefully, $(c_{-2})$ | they cheered happily and hugged me. $(c_{-1})$ | I was full of happiness. $(c_0)$

Here, Ex.1 shows a document with three clauses marked as $(c_{-2})$, $(c_{-1})$, and $(c_0)$. The goal of ECA is to determine which clause contains the emotion cause (e.g., $(c_{-1})$) for an emotion word (e.g., happiness in $(c_0)$).

* Corresponding Author: [email protected]

Previous approaches for emotion cause analysis mostly depend on rule-based methods (Lee et al., 2010; Chen et al., 2010) and machine learning algorithms (Ghazi et al., 2015; Gui et al., 2016; Xu et al., 2019). Most of them rely heavily on complicated linguistic rules or feature engineering, which is time-consuming and labor-intensive. Recent studies have focused on solving the task using neural models (Gui et al., 2017; Li et al., 2018a; Chen et al., 2018; Li et al., 2019) with well-designed attention mechanisms based on local text. Despite the effectiveness of neural models, there are some defects in previous studies. First, they usually consider each clause individually, ignoring the discourse context information that can impact the semantic expression among different clauses of a document. Second, prior knowledge such as sentiment lexicons and relative position information, which can provide crucial emotion cause cues, has not been fully exploited in neural models.

To alleviate these limitations, we propose a regularized hierarchical neural network (RHNN) for emotion cause analysis, which combines discourse context information and knowledge-based regularizations. Our model investigates the following intuitions. Firstly, documents exhibit a discourse structure which may carry valuable information about emotion cause cues. We employ a hierarchical learning structure to capture the mutual impacts of semantic expression among the discourse context, to help produce better clause representations. Secondly, in emotion events, emotion causes usually express a certain sentiment polarity through sentiment words. For example, in the emotion cause of Ex.1, the sentiment words cheered and happily express a positive polarity and also play a crucial role in provoking the emotion happiness. Therefore, capturing these sentiment words can enhance the causal connection between learned features and predictions. We approach this issue by designing a regularizer that incorporates linguistic knowledge (e.g., a sentiment lexicon) to enlarge the margin between the attention weights of sentiment words and non-sentiment words. Besides, humans usually write important points in particular sections. For emotion events, emotion causes generally occur at positions very close to the emotion word, and do so frequently. Ex.1 shows an anecdotal example of this behavior: the emotion cause clause $c_{-1}$ adjoins the emotion word happiness. To benefit from this phenomenon, we introduce a regularizer biased by relative position information to supervise the representation learning of text and further revise the predictive position distribution of emotion causes relative to the emotion word (in brief, the predictive distribution).

To sum up, our contributions include:

• We propose a novel discourse-aware learning structure with knowledge-based regularizations for emotion cause analysis.

• We empirically evaluate the proposed model on two public datasets in different languages (Chinese and English) and show statistically significant improvements over the state-of-the-art methods.

• To clarify the mechanism of our model, we also compare the performance of different component combinations through ablation experiments. Extensive analysis on both datasets confirms the feasibility of incorporating discourse information and restraining the parameters by sentiment lexicon and common knowledge.

2 Our Framework

In this section, we first give the task definition. Then, our proposed regularized hierarchical neural network (RHNN), as shown in Fig. 1, will be described. The two auxiliary regularizers of RHNN will be introduced in the next section.

Figure 1: The architecture of our model.

2.1 Task Definition

The formal definition of emotion cause analysis is given in (Gui et al., 2016). Formally, a document $d = \{c_1, c_2, \dots, c_n\}$ consisting of $n$ clauses contains an emotion word $e$ and at least one emotion cause clause corresponding to this emotion word. Each clause $c_i = \{w_{i1}, w_{i2}, \dots, w_{ik}\}$ consists of $k$ words and is labeled with an emotion cause-oriented label $\in \{0, 1\}$. We regard ECA as a binary classification task and aim to identify which clause contains the emotion cause.

2.2 Hierarchical Attention Network

Documents exhibit a discourse structure which can serve as useful information for clause representation generation. One simple but effective approach is to adopt a hierarchical attention network to simulate this structure. Our hierarchical attention network consists of several parts: a word encoder, a word attention layer, a clause attention layer and a clause encoder. The details of each component are described in the following paragraphs.

Word Encoder Gated Recurrent Units (GRU) have been widely adopted for text processing (Cho et al., 2014). In this work, we first map each word into a low-dimensional embedding space by Word2Vec (Mikolov et al., 2013) and then feed the whole document into a GRU-based word encoder to extract word sequence features. To summarize information from both directions, we use a bidirectional GRU to exploit two parallel passes:

$$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}_w(x_{it}), \quad t \in [1, k] \tag{1}$$

$$\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}_w(x_{it}), \quad t \in [k, 1] \tag{2}$$

where $x_{it}$ is the embedding vector of the word $w_{it}$ in clause $c_i$ at time step $t$ and $k$ is the length of clause $c_i$. Then we concatenate the hidden states of the two directions, $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$, as the representation of each word.
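The following is a minimal PyTorch sketch of the bidirectional word encoder in Eqs. (1)-(2); module and variable names are our own illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Bi-GRU word encoder, a sketch of Eqs. (1)-(2)."""
    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        # bidirectional=True runs the forward pass (Eq. 1) and backward pass (Eq. 2)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k, emb_dim) -- the k word embeddings x_it of one clause
        h, _ = self.gru(x)
        # h: (batch, k, 2 * hidden_dim), i.e., the concatenation h_it = [h_fwd; h_bwd]
        return h
```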

Word Attention We introduce an attention mechanism to extract the words that are important to the meaning of the clause and aggregate the representations of these informative words to construct the clause vector. Specifically,

$$g_{it} = w_m^{T}(h_{it} \oplus e_E) \tag{3}$$

$$\alpha_{it} = \frac{\exp(g_{it})}{\sum_{t'} \exp(g_{it'})} \tag{4}$$

$$o_{c_i} = \sum_t \alpha_{it} h_{it} \tag{5}$$

where $w_m$ is the parameter for computing attention signals and $\oplus$ is the concatenation operation. The embedding of the emotion word $e$ is denoted by $e_E$. $\alpha_{it}$ is the emotion-specific attention signal showing the importance of word $w_{it}$, and $o_{c_i}$ is the weighted sum of the word representations based on these weights.
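Below is a small PyTorch sketch of Eqs. (3)-(5) for a single clause; tensor shapes and function names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def word_attention(h: torch.Tensor, e_emb: torch.Tensor, w_m: torch.Tensor):
    """Emotion-conditioned word attention, a sketch of Eqs. (3)-(5).

    h:     (k, d)      word representations of one clause
    e_emb: (d_e,)      embedding e_E of the emotion word
    w_m:   (d + d_e,)  attention parameter vector
    """
    e_tiled = e_emb.unsqueeze(0).expand(h.size(0), -1)  # repeat e_E for each word
    g = torch.cat([h, e_tiled], dim=-1) @ w_m           # Eq. (3): scores g_it
    alpha = F.softmax(g, dim=0)                         # Eq. (4): attention weights
    o_c = (alpha.unsqueeze(-1) * h).sum(dim=0)          # Eq. (5): clause vector o_ci
    return o_c, alpha
```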

Clause Attention Intuitively, different clauses of a document are differently informative and should be assigned different importance. Targeting this problem, we design a clause attention mechanism to indicate the importance of each clause. Seen differently, the attention signals can be regarded as "prior" information that biases the clause encoder toward content that is more important for extracting the emotion cause.

In detail, we adopt a one-layer MLP to obtain the attention signals. Position information plays an important role in capturing the relative distance of a clause to the emotion word. Thus, we concatenate the clause vector $o_{c_i}$ and its position embedding as the feature to obtain the attention signal, which yields:

$$\alpha_i = \mathrm{sigmoid}(w_v(o_{c_i} \oplus l_i)) \tag{6}$$

$$o'_{c_i} = o_{c_i} \cdot \alpha_i \tag{7}$$

where $w_v$ is a parameter vector, $l_i$ is the randomly initialized position embedding, which is kept unchanged during training, and $\alpha_i$ is the weight of clause $c_i$. The clauses weighted by their importance (i.e., $o'_{c_i}$) are then fed into a clause encoder.
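A compact PyTorch rendering of the position-aware gate in Eqs. (6)-(7); again a sketch under assumed shapes, not the paper's implementation.

```python
import torch

def clause_attention(o_c: torch.Tensor, pos_emb: torch.Tensor, w_v: torch.Tensor):
    """Position-aware clause gating, a sketch of Eqs. (6)-(7).

    o_c:     (n, d)     clause vectors of one document
    pos_emb: (n, d_p)   fixed, randomly initialized position embeddings l_i
    w_v:     (d + d_p,) parameter vector
    """
    alpha = torch.sigmoid(torch.cat([o_c, pos_emb], dim=-1) @ w_v)  # Eq. (6): (n,)
    return o_c * alpha.unsqueeze(-1)                                # Eq. (7): o'_ci
```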

Clause Encoder Just as the meaning of a word is determined by its context, the semantic expression of a clause is usually impacted by its discourse context. Based on this observation, we introduce a clause encoder to model the latent semantic relations among different clauses. Analogously, we also append the relative position information to enhance the relation between a clause and its position. Formally,

$$\overrightarrow{h}_i = \overrightarrow{\mathrm{GRU}}_c(o'_{c_i} \oplus l_i), \quad i \in [1, n] \tag{8}$$

$$\overleftarrow{h}_i = \overleftarrow{\mathrm{GRU}}_c(o'_{c_i} \oplus l_i), \quad i \in [n, 1] \tag{9}$$

where $n$ is the number of clauses in a document. The two directional hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are concatenated as the final emotion-specific representation $o_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]$.
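The clause encoder of Eqs. (8)-(9) mirrors the word encoder but runs over the gated clause vectors concatenated with their position embeddings; a PyTorch sketch under the same illustrative assumptions:

```python
import torch
import torch.nn as nn

class ClauseEncoder(nn.Module):
    """Bi-GRU clause encoder, a sketch of Eqs. (8)-(9)."""
    def __init__(self, clause_dim: int, pos_dim: int, hidden_dim: int):
        super().__init__()
        self.gru = nn.GRU(clause_dim + pos_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, o_c_gated: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
        # o_c_gated: (batch, n, clause_dim) gated clause vectors o'_ci
        # pos_emb:   (batch, n, pos_dim)    position embeddings l_i
        o, _ = self.gru(torch.cat([o_c_gated, pos_emb], dim=-1))
        return o  # (batch, n, 2 * hidden_dim): emotion-specific representations o_i
```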

2.3 Model Training

The emotion-specific representation $o_i$, concatenated with its position embedding $l_i$, serves as the final feature for emotion cause prediction, and the model is trained by minimizing the cross entropy:

$$\hat{y}_i = \mathrm{softmax}(W_m(o_i \oplus l_i)) \tag{10}$$

$$L_{ce} = -\sum_i y_i \log \hat{y}_i \tag{11}$$

where $W_m$ is a parameter matrix, and $y_i$ and $\hat{y}_i$ are the target class distribution and the predictive class distribution, respectively.
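In code, Eqs. (10)-(11) amount to a linear layer followed by softmax cross entropy; a minimal PyTorch sketch with assumed dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, d_p = 256, 50                    # assumed clause-state / position dims
classifier = nn.Linear(d + d_p, 2)  # W_m of Eq. (10); 2 classes: cause / non-cause

def prediction_loss(o: torch.Tensor, pos_emb: torch.Tensor, labels: torch.Tensor):
    # o: (n, d), pos_emb: (n, d_p), labels: (n,) with values in {0, 1}
    logits = classifier(torch.cat([o, pos_emb], dim=-1))  # Eq. (10) before softmax
    return F.cross_entropy(logits, labels)                # Eq. (11): softmax + NLL
```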

3 Regularizers Based on Sentiment Lexicon and Relative Position

One crucial emotion cause cue is sentiment words, since emotion causes usually express a certain sentiment polarity through sentiment words. However, there is no effective mechanism to guarantee that the above module indeed attends to words with sentiment polarity. Beyond this, a straightforward solution to inject position information is to directly incorporate relative word position embeddings, which are a weak representation of the relative distance between the emotion word and each clause. To ease these problems, we introduce two auxiliary regularizers:

• Sentiment Regularizer (SR.): If a clause contains several words which exist in a sentiment lexicon, the calculated average weight of these words should be properly larger than that of other words. We approach this issue with an auxiliary hinge loss function.


• Position Regularizer (PR.): Relative position is a critical emotion cause indicator: in general, the closer a clause is to the emotion word, the higher the emotion cause probability it should be assigned. We approach this issue by introducing a proxy distribution and an auxiliary cross entropy function.

Formally, we disassemble the joint loss of emotion cause detection into the original cross entropy loss, a sentiment regularization loss, and a position regularization loss. The training objective is revised as:

$$J(\theta) = L_{ce} + \lambda_1 \cdot L_{sr} + \lambda_2 \cdot L_{pr} \tag{12}$$

where $L_{ce}$ is the original cross entropy loss (§2.3), $\lambda_1$ and $\lambda_2$ are hyper-parameters, $L_{sr}$ and $L_{pr}$ are the two auxiliary regularization losses (§3.1 and §3.2), and $\theta$ is the parameter set.
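Combining the three terms is a one-liner; the default weights in the sketch below are placeholders drawn from the grid-search range given in Section 4.2.

```python
def joint_loss(l_ce, l_sr, l_pr, lambda1=0.5, lambda2=0.5):
    """Eq. (12): joint objective J(theta); lambda1 and lambda2 are
    hyper-parameters searched over {0.25, 0.5, 0.75} (placeholder values)."""
    return l_ce + lambda1 * l_sr + lambda2 * l_pr
```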

3.1 Sentiment Regularizer

The weak causal connection between learned features and predictions is a major issue in emotion cause analysis. Even though sentiment words are key to clause representation generation, most existing models do not focus on sentiment words, or place little emphasis on them when producing clause representations. In other words, attention is distracted by irrelevant words with little sentiment polarity. To address this issue and fully benefit from linguistic resources, we explicitly encourage a larger margin between the attention weights of sentiment words and non-sentiment words using a sentiment lexicon. For sentiment words, the average attention weight is calculated by:

$$\mathrm{Avg}_s = \frac{1}{l_s}\sum_{t=1}^{k} S_{w_{it}} \tag{13}$$

$$S_{w_{it}} = \begin{cases}\alpha_{it} & \text{if } w_{it} \in c_i \cap S \\ 0 & \text{otherwise}\end{cases} \tag{14}$$

where $w_{it}$ is the $t$-th word of clause $c_i$, $k$ is the length of $c_i$, $l_s$ is the number of sentiment words in $c_i$, $\alpha_{it}$ is the attention weight calculated in §2.2, and $S$ is a sentiment lexicon. Correspondingly, for non-sentiment words:

$$\mathrm{Avg}_{ns} = \frac{1}{l_{ns}}\sum_{t=1}^{k} NS_{w_{it}} \tag{15}$$

$$NS_{w_{it}} = \begin{cases}\alpha_{it} & \text{if } w_{it} \in c_i - S \\ 0 & \text{otherwise}\end{cases} \tag{16}$$

Our training objective is to lead the model to pay more attention to sentiment words. Thus, the regularization term is formally expressed as:

$$L_{sr} = \sum_i \max(0, m - (\mathrm{Avg}_s - \mathrm{Avg}_{ns})) \tag{17}$$

where $m$ is a hyper-parameter for the margin.
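A PyTorch sketch of the per-clause hinge term of Eqs. (13)-(17); the lexicon mask is an assumption we introduce for illustration (the full loss sums this term over all clauses).

```python
import torch

def sentiment_regularizer(alpha: torch.Tensor, senti_mask: torch.Tensor,
                          m: float = 0.15) -> torch.Tensor:
    """Hinge loss of Eq. (17) for one clause, a sketch.

    alpha:      (k,) attention weights of one clause (Eq. 4)
    senti_mask: (k,) 1.0 where the word is in the sentiment lexicon S, else 0.0
    m:          margin, grid-searched over {0.10, 0.15, 0.20} in the paper
    """
    n_s = senti_mask.sum().clamp(min=1.0)           # l_s, number of sentiment words
    n_ns = (1.0 - senti_mask).sum().clamp(min=1.0)  # l_ns, non-sentiment words
    avg_s = (alpha * senti_mask).sum() / n_s          # Eq. (13)
    avg_ns = (alpha * (1.0 - senti_mask)).sum() / n_ns  # Eq. (15)
    return torch.clamp(m - (avg_s - avg_ns), min=0.0)   # Eq. (17), one clause
```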

3.2 Position Regularizer

Empirically, emotion causes usually occur at positions which are very close to the emotion word. However, the predictive distribution may locate clauses that are distant from, and irrelevant to, the emotion word. The goal of this regularizer is to narrow the difference between the predictive distribution and the true position distribution of emotion causes relative to the emotion word (in brief, the true distribution). Obviously, we cannot observe the true distribution. Hence, we assume that it should satisfy the following conditions: (1) it should be a normed function within [0, 1]; (2) it should be a symmetric function around a certain value. Based on these conditions, we employ a function defined as follows:

$$q_i = \begin{cases}1 - \frac{|r_i|}{n} & \text{if } -b \le r_i \le b \\ 0 & \text{otherwise}\end{cases} \tag{18}$$

where $r_i$ is the relative distance of clause $c_i$ to the emotion word, $n$ is the number of clauses in a document, and $b$ is the left and right boundary which limits the scope of the emotion cause. We then apply $q = (q_1, q_2, \dots, q_n)$ as the proxy distribution to simulate the true distribution.

Simultaneously, from Section 2.3, we obtain the predictive class distribution $\hat{y}_i$ of clause $c_i$. The probability of the emotion cause at position $i$ can then be calculated as:

$$p_i = \hat{y}_i(\mathrm{label} = 1) \tag{19}$$

Similarly, we can obtain the predictive distribution of emotion causes relative to the emotion word:

$$p = (p_1, p_2, \dots, p_n) \tag{20}$$

The goal is to enforce the model to constrain the difference between $p$ and $q$; thus, we use cross entropy to measure the difference:

$$L_{pr} = -\sum_i q_i \log(p_i) \tag{21}$$
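The proxy distribution and the cross-entropy term fit in a few lines of PyTorch; a sketch under assumed tensor layouts:

```python
import torch

def position_regularizer(p: torch.Tensor, rel_dist: torch.Tensor,
                         b: int = 3) -> torch.Tensor:
    """Position regularizer, a sketch of Eqs. (18)-(21).

    p:        (n,) predicted cause probabilities p_i = y_hat_i(label = 1)
    rel_dist: (n,) float relative distances r_i of each clause to the emotion word
    b:        boundary limiting the scope of the emotion cause, from {2, 3, 4}
    """
    n = p.size(0)
    q = torch.where(rel_dist.abs() <= b,
                    1.0 - rel_dist.abs() / n,        # Eq. (18) inside the boundary
                    torch.zeros_like(p))             # zero outside [-b, b]
    return -(q * torch.log(p.clamp(min=1e-8))).sum() # Eq. (21), clamped for stability
```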

Note that the two introduced regularizers work like L1 and L2 terms: they do not introduce any new parameters and only influence the training of the standard model parameters. The hyper-parameters $\lambda_1$ and $\lambda_2$ guide the model to achieve the best trade-off among the three types of losses.

Item       Number   Item     Number
Chinese Dataset (Chi.)
Documents  2105     Cause 1  2046
Clauses    11799    Cause 2  56
Causes     2167     Cause 3  3
English Dataset (Eng.)
Documents  2156     Cause 1  1949
Clauses    16259    Cause 2  164
Causes     2421     Cause 3  32

Table 1: Details of the two datasets. Cause 1, Cause 2 and Cause 3 represent the documents with 1, 2 and 3 cause clauses, respectively.

4 Datasets and Implementation Details

4.1 Datasets

We select two public datasets in different languages to evaluate the proposed model: the Chinese Dataset (Gui et al., 2016), collected from SINA city news (http://news.sina.com.cn/society/), and the English Dataset (Gao et al., 2017), collected from English novels. Each document in both datasets has exactly one emotion word and one or more emotion causes, and it has been ensured that the emotion and the causes are relevant. The documents are segmented into clauses manually for emotion cause analysis. The details of the two datasets are summarized in Table 1.

4.2 Implementation Details

For the Chinese dataset, there is no training/test split, so we randomly divide the documents into training/development/test sets in a ratio of 8:1:1 and segment the clauses with Jieba (https://github.com/fxsjy/jieba). For the English dataset, we randomly select 10% of the original training set as the development set and lowercase and lemmatize all tokens with NLTK (http://nltk.org). We evaluate our method 25 times with different splits and then perform a one-sample t-test on the experimental results, following (Gui et al., 2017). Precision ($P$), recall ($R$) and F-measure ($F$) are employed to measure performance on this task.

The sentiment lexicon adopted for the Chinese dataset consists of two parts. The first part is selected from the HowNet (Dong et al., 2006) sentiment analysis lexicon set (http://www.keenage.com/html/c_index.html) and the second part comes from NTUSD (Ku et al., 2006). The combination of the two parts serves as the Chinese sentiment lexicon of this research. The English sentiment lexicon comes from MPQA (Wilson et al., 2005), and we only select the words with high sentiment polarity, because they are less sensitive to contextual information and usually express sentiment polarities consistent with their prior polarity. For both sentiment lexica, we filter out the words that do not occur in the datasets. Ultimately, 2022 and 1348 sentiment words are selected for the Chinese and English datasets, respectively.

Online learning is performed with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001. The number of layers in the Bi-GRU is set to 2 and a dropout rate of 0.5 is used to avoid overfitting. The word vectors are pre-trained by word2vec (Mikolov et al., 2013) and kept unchanged during the training stage. We perform grid search over the hyper-parameters: the margin $m$ ({0.10, 0.15, 0.20}), the boundary $b$ ({2, 3, 4}), the dimensionality of the Bi-GRU ({64, 128}), and $\lambda_1$ and $\lambda_2$ (both {0.25, 0.5, 0.75}). For each corpus, the combination of these hyper-parameters with the highest F-measure on the development set is selected.

5 Experiments

In this section, we will compare our RHNN model with the following groups of methods:

• Rule-based and commonsense-based methods: The rule-based method (RB) is a traditional rule-based method proposed by Lee et al. (2010). The commonsense-based method (CB) is a knowledge-based method proposed by Russo et al. (2011); it uses the Chinese Emotion Cognition Lexicon (Xu et al., 2013) as commonsense knowledge.

• Machine learning methods: SVM is an SVM classifier trained on unigram, bigram and trigram features (Li and Xu, 2014). Word2vec is an SVM classifier trained on word representations pre-trained by Word2vec (Mikolov et al., 2013). Multi-kernel represents a document by a syntactic structure and utilizes a modified convolution kernel method to determine which clause contains the emotion cause (Gui et al., 2016). LambdaMART extracts emotion causes using learning-to-rank methods based on emotion-independent and emotion-dependent features (Xu et al., 2019).

• Deep learning methods: CNN is a convolutional neural network for sentence classification (Kim, 2014). ConvMS-Memnet considers emotion cause analysis as a reading comprehension task and designs a multiple-slot deep memory network to model context information (Gui et al., 2017). CANN uses a co-attention neural network to identify emotion causes (Li et al., 2018a). HCS is proposed by Yu et al. (2019), using a multiple-level hierarchical network to detect emotion causes. MANN is the current state-of-the-art method, employing a multi-attention-based model for emotion cause extraction (Li et al., 2019). RHNN is our proposed model.

Method           P       R       F
RB*              0.6747  0.4287  0.5243
CB*              0.2672  0.7130  0.3887
SVM*             0.4200  0.4375  0.4285
Word2vec*        0.4301  0.4233  0.4136
Multi-kernel*    0.6588  0.6972  0.6752
LambdaMART*      0.7720  0.7499  0.7608
CNN*             0.6472  0.5493  0.5915
ConvMS-Memnet*   0.7076  0.6838  0.6955
CANN             0.7721  0.6891  0.7266
HCS              0.7388  0.7154  0.7269
MANN             0.7843  0.7587  0.7706
RHNN             0.8112  0.7725  0.7914

Table 2: Experimental results on the Chinese dataset. Superscript * indicates results reported in (Gui et al., 2017); the rest are reprinted from the corresponding publications (p < 0.001).

5.1 Main Results

The experimental results on both datasets are shown in Table 2 and Table 3, respectively. RB yields high precision but low recall, while CB shows the opposite pattern. A possible reason is that these linguistic-based methods depend on cue words to identify the emotion cause, and different rules or commonsense resources may contain different cue words.

Method          P       R       F
Word2vec        0.1651  0.8673  0.2774
SVM             0.2757  0.6416  0.3856
CNN             0.7218  0.2628  0.3390
ConvMS-Memnet   0.4605  0.4177  0.4381
MANN            0.7933  0.4081  0.5328
RHNN            0.6901  0.5267  0.5975

Table 3: Experimental results on the English dataset. We follow the results reported in (Li et al., 2019), the only available results on this dataset (p < 0.001).

For the machine learning methods, SVM and Word2vec have similar performance on the Chinese dataset, but SVM outperforms Word2vec on the English dataset. The main reason is that polysemy is more prevalent in English expressions. Multi-kernel performs better by capturing context information through a syntactic tree. LambdaMART, which is based on a ranking strategy and global emotion features, performs best among the feature-based methods. However, both Multi-kernel and LambdaMART rely on expensive human-engineered features and lack expandability across datasets.

Compared with CNN, ConvMS-Memnet models the context of each word and obtains better performance on both datasets. The co-attention based CANN captures the mutual relations between the emotion clause and each candidate clause, yielding results comparable to the hierarchical HCS. MANN considers the interaction between the emotion clause and candidate clauses through a multi-attention mechanism and achieves the best performance among the baselines.

The proposed RHNN model further improves the performance on both datasets, as shown in the tables. The improvement is significant, with a p-value less than 0.001 in a one-sample t-test. Specifically, RHNN boosts performance by 3.06% in F-measure compared to LambdaMART, which shows that by restraining the parameters with knowledge-based regularizations, RHNN identifies emotion cause cues better than feature engineering. RHNN also outperforms the current best-performing method MANN by 2.08% on the Chinese dataset and 6.47% on the English dataset in F-measure. Furthermore, on the English dataset, our proposed model has more balanced precision and recall. The reason for this is that RHNN can capture more emotion cue information (e.g., sentiment words), which helps the model extract emotion causes more precisely.

Method   Applying with    Chi.    Eng.
         H.   SR.  PR.    F       F
Base     ×    ×    ×      0.7152  0.5199
H        ✓    ×    ×      0.7553  0.5668
SR       ×    ✓    ×      0.7570  0.5765
PR       ×    ×    ✓      0.7483  0.5683
HSR      ✓    ✓    ×      0.7860  0.6154
HPR      ✓    ×    ✓      0.7659  0.5777
SPR      ×    ✓    ✓      0.7600  0.5878
RHNN     ✓    ✓    ✓      0.7914  0.5975

Table 4: Effect of different components, i.e., hierarchical structure (H.), sentiment regularizer (SR.) and position regularizer (PR.). The leftmost column abbreviates the corresponding sub-model (e.g., SPR denotes the sub-model with SR. and PR. but without H.).

Method   Chi.              Eng.
         Sub.    All.      Sub.    All.
SR       0.7650  0.7570    0.6713  0.5765
HSR      0.7934  0.7860    0.7084  0.6154
SPR      0.7741  0.7600    0.6904  0.5878
RHNN     0.7997  0.7914    0.6849  0.5975

Table 5: F-measure on the sub-dataset (Sub.), which only contains the clauses with sentiment words, and on the whole dataset (All.).

5.2 Detailed Analysis

Ablations of the RHNN Model The proposed RHNN model consists of three components: the hierarchical structure (H.), the sentiment regularizer (SR.) and the position regularizer (PR.). We conduct ablation experiments to reveal the effect of each component. As illustrated in Table 4, all models with a proposed component consistently improve upon the Base model, verifying the effectiveness of the proposed approach. Compared with the H. and PR. models, SR. improves the performance most. The main reason is that 55.24% and 66.49% of emotion causes contain sentiment words on the Chinese and English datasets respectively, so enforcing the model to pay more attention to sentiment words can enhance the causal connection between learned features and predictions.

On the Chinese dataset, RHNN achieves the best performance, with a 7.62% improvement in F-score over the baseline. However, on the English dataset, the HSR model performs better than the RHNN model, which may be caused by overlap between components. Besides, performance on the English dataset is consistently about 20% lower in F-measure than on the Chinese dataset; one possible explanation is that English expressions contain more clause structures, which are difficult for the model to capture without a discourse tree.

Effect of the Sentiment Regularizer To gain more insight into our proposed model, we conduct further experiments to examine the effectiveness of the sentiment regularizer on subsets of the two datasets. The results in Table 5 show that: 1) RHNN and HSR achieve the best performance on the subsets of the two datasets respectively, with similar observations on the whole datasets; 2) with the sentiment regularizer, performance on both subsets is boosted compared to that on the whole datasets. This is consistent with our intuition, because the sentiment regularizer contributes much to picking up words with sentiment polarity, and these words are important causal indicators in clauses. Meanwhile, each clause in the subsets contains sentiment words, resulting in better performance on the subsets. 3) The performance improvement on the English subset is remarkably higher than that on the Chinese subset; one possibility is that English expressions contain more explicit emotion terms than Chinese ones.

Effect of the Position Regularizer From Eq. (18), we see that the value of $b$ limits the scope of the emotion cause. In this section, we further investigate the effect of different limited scopes. For simplicity and efficiency, we only apply the PR model in this experiment.

Figure 2: The F-measure of different limited scopes of emotion cause on the two datasets.

The results are shown in Fig. 2, where the percentages denote the coverage of emotion causes in the datasets; for instance, 72.77% means that 72.77% of emotion causes adjoin the emotion word in the English dataset. From Fig. 2, we can see that the performance trends for the two datasets are similar, and performance improves as the limited scope of the emotion cause expands. However, when the limited scope is larger than a certain value (2 or 3), performance decreases. This may be because the larger the limited scope, the higher the coverage of emotion causes in the datasets; nevertheless, when the limited scope is too large, the model is forced to allocate higher probability to clauses which are distant from and irrelevant to the emotion word, leading to performance degradation.

5.3 Case Study

Essentially, the sentiment regularizer (SR.) aims to enlarge the margin between sentiment words and non-sentiment words. The question is: will the model focus on the words with sentiment polarity? We randomly choose one example (Ex.2) from the Chinese dataset to visualize its attention distribution and compare the difference between applying the sentiment regularizer or not.

Ex.2 一个老乡来到店里问起小美的情况,(c−2)|大熊抱怨小美不好. (c−1)|她听到后感到既伤心又生气. (c0)

Ex.2 A fellow came to the store asking about XiaoMei's situation, $(c_{-2})$ | DaXiong complained XiaoMei was not good. $(c_{-1})$ | She heard it and then felt both sad and angry. $(c_0)$

Figure 3: Comparison of the attention distribution with and without the sentiment regularizer.

In this example, the emotion word sad appears in $(c_0)$. The visualization results are shown in Fig. 3. We observe that the model without the sentiment regularizer mostly focuses on non-sentiment words such as XiaoMei, she and heard it, which are inessential to provoking the emotion. However, when we add the SR. into the model, we see an obvious shift in the attention distribution.

   Emotion cause events                                   RHNN results
1  I was immediately *ashamed* of myself for my           for having assumed that he
   vanity, for having assumed that he wanted me to        wanted me to stay with him
   stay with him forever. I'm sorry, that was a little    forever
   arrogant.
2  He hopes of being *admitted* to a sight of the         of whose beauty he had
   young ladies, of whose beauty he had heard much.       heard much
   But he saw only the father.
3  I didn't know where in the hell you was, said          Null
   Ennis, four years, I about *give up* on you.

Table 6: Error instances of the RHNN model for emotion cause analysis.

More clearly, the model captures the sentiment words complained and not good, which are crucial for identifying the emotion cause. This shows that our model with the sentiment regularizer is more effective at extracting the most important keywords relating to the emotion cause. Also, better results are obtained using the sentiment regularizer, which is consistent with what we observed in Table 4.

Finally, we perform an error analysis to understand what types of errors are introduced by the proposed model, focusing on three cases from the English dataset. The results are listed in Table 6, where the first column gives the content of the emotion cause events and the second column gives the emotion causes identified by RHNN. In Table 6, the emotion causes appear in bold and the emotion word is marked between asterisks.

From Table 6, we find that two clauses contain the emotion cause in event 1; however, our model only detects one emotion cause clause. In event 2, our model makes an erroneous prediction. One possible reason is that our model is prone to treating clauses which contain sentiment words as emotion causes. RHNN extracts nothing from event 3, possibly because of the long distance between the emotion word and the emotion cause clause, which makes the causal relation difficult to capture. Our proposed model is capable of capturing rich emotion cause cues with knowledge-based regularizations; however, it also introduces some noise into emotion cause analysis.

6 Related Work

Emotion classification is an important fundamental aspect of sentiment analysis. Going one step further, emotion cause analysis (ECA), which aims to discover the reasons behind emotions, can be constructive in guiding the direction of future work, i.e., improving the quality of products or services according to the emotion causes of comments provided by users. In this section, we describe the related work on emotion cause analysis.

Lee et al. (2010) first gave the formal definition of emotion cause analysis and manually constructed a dataset from the Academia Sinica Balanced Chinese Corpus. Based on this corpus, Chen et al. (2010) designed two sets of features built on six groups of linguistic cues to detect emotion causes. Support vector machines (SVMs) and conditional random fields (CRFs) have been investigated to detect cause or non-cause text with extended rule-based features (Gui et al., 2014; Ghazi et al., 2015). Other than rule-based methods, Russo et al. (2011) proposed a crowdsourcing method to construct a commonsense knowledge base for emotion cause extraction in Italian newspaper articles, but it is challenging to extend such a knowledge base automatically. Recently, Gui et al. (2016) proposed a multi-kernel based method to identify the emotion cause on a manually annotated emotion cause corpus. Xu et al. (2019) proposed a method based on learning to re-rank candidate emotion cause clauses, extracting a number of emotion-dependent and emotion-independent features. However, these methods depend heavily on expensive human-engineered features and are difficult to apply in real-world applications.

Inspired by the success of neural network methods, deep neural models and attention mechanisms have been widely used in emotion cause analysis. Gui et al. (2017) proposed a novel deep neural network which regarded emotion cause analysis as a question-answering task; in this study, a convolution-based memory network was introduced to store the context information. Li et al. (2018a) considered the context around the emotion word, instead of only the emotion word, as a query to model the mutual impacts between each candidate clause and the emotion clause. Cheng et al. (2017) constructed a corpus based on Chinese microblogs and proposed to detect emotion causes using multiple-user structures. Besides, Yu et al. (2019) proposed a multiple-level hierarchical network-based clause selection strategy. Li et al. (2019) proposed a multi-attention-based neural model to capture the mutual influences between the emotion clause and each candidate clause, and then generate the representations for the two clauses separately; this method achieves the current best performance. However, existing approaches usually focus on local textual information, ignoring the discourse structure (Zubiaga et al., 2018) and prior knowledge such as sentiment lexicons (Qian et al., 2017) and relative position information, which can provide important emotion cues for the emotion cause analysis task.

7 Conclusion and Future Work

In this paper, we provide a regularized hierarchical neural network (RHNN) for emotion cause analysis. The proposed model aggregates discourse context information through a hierarchical learning structure and restrains the parameters with knowledge-based regularizations. We evaluate the proposed model on two public datasets in different languages. The experimental results demonstrate that our proposed method achieves state-of-the-art performance on both datasets, and extensive analysis confirms the feasibility of incorporating the discourse context and knowledge-based regularizations.

To preserve the simplicity of the proposed model, we do not consider a document as a tree structure. In the future, we will explore how to incorporate discourse parse trees or discourse relations into the emotion cause analysis task to further improve the performance.

8 Acknowledgements

This work was supported by National Natural Science Foundation of China U1636103, 61632011, 61876053, Shenzhen Foundational Research Funding JCYJ20180507183527919, JCYJ20180507183608379, Key Technologies Research and Development Program of Shenzhen JSGG20170817140856618, Joint Research Program of Shenzhen Securities Information Co., Ltd. No. JRPSSIC2018001, and the EU-H2020 (grant no. 794196).


References

Peng Chen, Zhongqian Sun, Lidong Bing, and Wei Yang. 2017. Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 452–461.

Ying Chen, Wenjun Hou, Xiyao Cheng, and Shoushan Li. 2018. Joint learning for emotion classification and emotion cause detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 646–651.

Ying Chen, Sophia Yat Mei Lee, Shoushan Li, and Chu-Ren Huang. 2010. Emotion cause detection with linguistic constructions. In COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2010, Beijing, China, pages 179–187.

Xiyao Cheng, Ying Chen, Bixiao Cheng, Shoushan Li, and Guodong Zhou. 2017. An emotion cause corpus for Chinese microblogs with multiple-user structures. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 17(1):6.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1724–1734.

Zhendong Dong, Qiang Dong, and Changling Hao. 2006. HowNet and the computation of meaning.

Qinghong Gao, Jiannan Hu, Ruifeng Xu, Lin Gui, Yulan He, Kam-Fai Wong, and Qin Lu. 2017. Overview of NTCIR-13 ECA task. In Proceedings of the NTCIR-13 Conference.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015, Proceedings, Part II, pages 152–165.

Lin Gui, Jiannan Hu, Yulan He, Ruifeng Xu, Qin Lu, and Jiachen Du. 2017. A question answering approach for emotion cause extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 1593–1602.

Lin Gui, Dongyin Wu, Ruifeng Xu, Qin Lu, and Yu Zhou. 2016. Event-driven emotion cause extraction with corpus construction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1639–1649.

Lin Gui, Li Yuan, Ruifeng Xu, Bin Liu, Qin Lu, and Yu Zhou. 2014. Emotion cause detection with linguistic construction in Chinese Weibo text. In Natural Language Processing and Chinese Computing, pages 457–464. Springer.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1746–1751.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Lun-Wei Ku, Yu-Ting Liang, and Hsin-Hsi Chen. 2006. Opinion extraction, summarization and tracking in news and blog corpora. In Computational Approaches to Analyzing Weblogs, Papers from the 2006 AAAI Spring Symposium, Technical Report SS-06-03, Stanford, California, USA, March 27-29, 2006, pages 100–107.

Sophia Yat Mei Lee, Ying Chen, and Chu-Ren Huang. 2010. A text-driven rule-based system for emotion cause detection. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 45–53. Association for Computational Linguistics.

Shoushan Li, Lei Huang, Rong Wang, and Guodong Zhou. 2015. Sentence-level emotion classification with label and context dependence. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1045–1053.

Weiyuan Li and Hua Xu. 2014. Text-based emotion classification using emotion cause extraction. Expert Systems with Applications, 41(4):1742–1749.

Xiangju Li, Shi Feng, Daling Wang, and Yifei Zhang. 2019. Context-aware emotion cause analysis with multi-attention-based neural network. Knowledge-Based Systems, 174:205–218.

Xiangju Li, Kaisong Song, Shi Feng, Daling Wang, and Yifei Zhang. 2018a. A co-attention neural network model for emotion cause analysis with emotional context awareness. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4752–4757.

Xin Li, Lidong Bing, Wai Lam, and Bei Shi. 2018b. Transformation networks for target-oriented sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 946–956.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, pages 3111–3119.

Bo Pang and Lillian Lee. 2007. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan Zhu. 2017. Linguistically regularized LSTM for sentiment classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1679–1689.

Irene Russo, Tommaso Caselli, Francesco Rubino, Ester Boldrini, and Patricio Martínez-Barco. 2011. EMOCause: An easy-adaptable approach to extract emotion cause contexts. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 153–160.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In HLT/EMNLP 2005, Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 6-8 October 2005, Vancouver, British Columbia, Canada, pages 347–354.

Bo Xu, Hongfei Lin, Yuan Lin, Yufeng Diao, Liang Yang, and Kan Xu. 2019. Extracting emotion causes using learning to rank methods from an information retrieval perspective. IEEE Access, 7:15573–15583.

Ruifeng Xu, Chengtian Zou, Yanzhen Zheng, Jun Xu, Lin Gui, Bin Liu, and Xiaolong Wang. 2013. A new emotion dictionary based on the distinguish of emotion expression and emotion cognition. Journal of Chinese Information Processing, 27(6):82–90.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 1480–1489.

Jianfei Yu, Luís Marujo, Jing Jiang, Pradeep Karuturi, and William Brendel. 2018. Improving multi-label emotion classification via sentiment classification with dual attention transfer network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1097–1102.

Xinyi Yu, Wenge Rong, Zhuo Zhang, Yuanxin Ouyang, and Zhang Xiong. 2019. Multiple level hierarchical network-based clause selection for emotion cause extraction. IEEE Access, 7:9071–9079.

Deyu Zhou, Xuan Zhang, Yin Zhou, Quan Zhao, and Xin Geng. 2016. Emotion distribution learning from texts. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 638–647.

Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, Michal Lukasik, Kalina Bontcheva, Trevor Cohn, and Isabelle Augenstein. 2018. Discourse-aware rumour stance classification in social media using sequential classifiers. Information Processing & Management, 54(2):273–290.