Upload
others
View
10
Download
0
Embed Size (px)
Citation preview
Multimodal Sentiment Analysis with Word-Level Fusion andReinforcement Learning
Minghai Chen∗
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA
Sen Wang∗
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA
Paul Pu Liang∗
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA
Tadas Baltrušaitis
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA
Amir Zadeh
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA
Louis-Philippe Morency
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA
ABSTRACTWith the increasing popularity of video sharing websites such as
YouTube and Facebook, multimodal sentiment analysis has received
increasing attention from the scientific community. Contrary to
previous works in multimodal sentiment analysis which focus on
holistic information in speech segments such as bag of words rep-
resentations and average facial expression intensity, we develop a
novel deep architecture for multimodal sentiment analysis that per-
forms modality fusion at the word level. In this paper, we propose
the Gated Multimodal Embedding LSTM with Temporal Attention
(GME-LSTM(A)) model that is composed of 2 modules. The Gated
Multimodal Embedding alleviates the difficulties of fusion when
there are noisy modalities. The LSTM with Temporal Attention per-
forms word level fusion at a finer fusion resolution between input
modalities and attends to the most important time steps. As a result,
the GME-LSTM(A) is able to better model the multimodal structure
of speech through time and perform better sentiment comprehen-
sion. We demonstrate the effectiveness of this approach on the
publicly-available Multimodal Corpus of Sentiment Intensity and
Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-
the-art sentiment classification and regression results. Qualitative
analysis on our model emphasizes the importance of the Tempo-
ral Attention Layer in sentiment prediction because the additional
acoustic and visual modalities are noisy. We also demonstrate the
effectiveness of the Gated Multimodal Embedding in selectively fil-
tering these noisy modalities out. Our results and analysis open new
areas in the study of sentiment analysis in human communication
and provide new models for multimodal fusion.
∗Equal contribution.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from [email protected].
ICMI’17, November 13–17, 2017, Glasgow, UK© 2017 Copyright held by the owner/author(s). Publication rights licensed to Associa-
tion for Computing Machinery.
ACM ISBN 978-1-4503-5543-8/17/11. . . $15.00
https://doi.org/10.1145/3136755.3136801
CCS CONCEPTS• Computing methodologies→ Artificial Intelligence; Com-puter Vision; Natural Language Processing; Machine Learn-ing; • Information systems→ Sentiment Analysis;
KEYWORDSMultimodal Sentiment Analysis, Multimodal Fusion, Human Com-
munication, Deep Learning, Reinforcement Learning
ACM Reference Format:Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh,
and Louis-Philippe Morency. 2017. Multimodal Sentiment Analysis with
Word-Level Fusion and Reinforcement Learning. In Proceedings of 19th ACMInternational Conference on Multimodal Interaction (ICMI’17). ACM, New
York, NY, USA, 9 pages. https://doi.org/10.1145/3136755.3136801
1 INTRODUCTIONMultimodal sentiment analysis is an emerging field at the intersec-
tion of natural language processing, computer vision, and speech
processing. Sentiment analysis aims to find the attitude of a speaker
or writer towards a document, topic or an event [17]. Sentiment
can be expressed by the spoken words, the emotional tone of the
delivery and the accompanying facial expressions. As a result, it
is helpful to combine visual, language, and acoustic modalities for
sentiment prediction [20]. To combine cues from different modali-
ties, previous work mainly focused on holistic video-level feature
fusion. This was done with simple features (such as bag-of-words
and average smile intensity) calculated over an entire video as rep-
resentations of verbal, visual and acoustic features [36, 46]. These
simplistic fusion approaches mostly ignore the structure of speech
by focusing on simple statistics from videos and combining modali-
ties at an abstract level.
The cornerstone of our approach is capturing the full structure
of speech using a time-dependent recurrent approach that is able
to properly perform modality fusion at every timestep. This un-
derstanding of speech structure is important due to two major
reasons: 1) There are local interactions between modalities, such
as how loud a word is being uttered which has roots in language
and acoustic modalities or whether or not a word was accompanied
by a smile which has roots in language and vision. Considering
arX
iv:1
802.
0092
4v1
[cs
.LG
] 3
Feb
201
8
ICMI’17, November 13–17, 2017, Glasgow, UK Chen, Wang, Liang, Baltrušaitis, Zadeh and Morency
local interaction helps in dealing with commonly researched prob-
lems in natural language processing such as ambiguity, sarcasm
or limited context by providing more information from the visual
and acoustic modalities. Consider the word “crazy”; this word can
have a positive sentiment if accompanied by a smile or can have a
negative sentiment if accompanied by a frown. At the same time
the word “great” accompanied by a frown implies sarcastic speech.
Also, in limited context, inference of sentiment intensity is difficult.
For example the word “good” accompanied by neutral nonverbal
behavior could mean that the utterance is positive; but the same
word accompanied with big smile could mean highly positive senti-
ment. 2) There are global interactions between modalities mostly
established based on temporal relations between modalities. Ex-
amples include a delayed laughter due to a speaker’s utterance of
words or a delayed smile because of a speech pause. Each of the
modalities also have their own intramodal interactions (such as
how different gestures happen over the utterance), which can be
characterized as the global structure of speech.
To properly model structure of speech, two key questions need
to be answered: “what modality to look for at each moment in
time?” and “what moments in speech are important in the commu-
nication?”. To address the first question, a model should be able to
“gate” the useful modalities at each moment in time. If a modality
does not contain useful information or the modality is too noisy
that negatively affects the performance of the model, the model
should be able to shut off the modality and perform inference based
on information present in the other modalities. To address the sec-
ond question, a model should be able to divert it’s “attention” to
key moments of communication such as when a polarized word
is uttered or when a smile happens. In this paper we introduce
the Gated Multimodal Embedding LSTM with Temporal Attention
model, which explicitly accounts for these two key questions by us-
ing a gated mechanism for multimodal fusion at each time step and
a Temporal Attention Layer for sentiment prediction. Our model is
able to explore the structure of speech through a stateful recurrent
mechanism and perform fusion at word level between different
modalities. This gives our model the ability to account for local
(by word level fusion of multimodal information) and global (by
stateful multimodal memory) interactions between modalities.
The remainder of the paper is as follows. In Section 2, we re-
view the related work in multimodal sentiment analysis. In Section
3, we give formal definition of our problems and present our ap-
proaches in detail. In Section 4, we describe the CMU-MOSI dataset,
experimental methodology and baseline models. The results on
CMU-MOSI dataset are presented in Section 5. A detailed analy-
sis of our model’s components is provided in Section 6. Section 7
concludes the paper.
2 RELATEDWORKDeep learning approaches have been became extremely popular in
the past few years [42]. The field of multimodal machine learning
specifically has gotten unprecedented momentum [3]. Multimodal
models have been used for sentiment analysis [28, 45, 46]; medical
purposes, such as detection of suicidal risk, PTSD and depression
[29, 34, 35, 43]; emotion recognition [23]; image captioning and
media description [8, 41]; question answering [2]; and multimodal
translation [31]. Our work is specifically connected to the following
areas:
Sentiment analysis on written text modality has been well-
studied [17] with models that predict sentiment from language.
Early works used the bag of words or n-gram representations [40]
to derive sentiment from individual words. Other approaches fo-
cused on opinionated words [26, 27, 32], and some applied more
complicated structures such as trees [30] and graphs [22]. These
structures aimed to derive the sentiment of sentence based on the
sentiment of individual words and their compositions. More recent
works used dependency-based semantic analysis [1], distributional
representations for sentiment [12] and a convolutional architec-
ture for the semantic modeling of sentences [13]. However, we are
primarily working on spoken text rather than written text, which
gives us the opportunity to integrate additional audio and visual
modalities. These modalities are helpful by providing additional
information, but sometimes may be noisy.
Multimodal sentiment analysis integrates verbal and nonver-bal behavior to predict user sentiment. Though various multimodal
datasets with sentiment labels exist [16, 21, 39], the CMU-MOSI
dataset [46] is the first dataset with opinion level sentiment labels.
Recent multimodal sentiment analysis approaches focus on deep
neural networks, including Convolutional Neural Networks (CNN)
with multiple-kernel learning [24] and Select-Additive Learning
(SAL) CNN [36] which learns generalizable features across speakers.
[5] uses an unimodal deep neural network for three modalities
(language, acoustic and visual) and explores the effectiveness of
early fusion and late fusion. The features of the three modalities
are similar to our work, but to fuse the multimodal information
their work simply uses an concatenation approach at video level
while we propose and justify the use of more advanced methods
for multimodal fusion.
Besides sentiment, other speakers’ attributes such as persuasion,
passion and confidence could also be analyzed by similar methods
[15, 18]. [6] proposes an ensemble classification approach that com-
bines two different perspectives on classifying multimodal data: the
first perspective assumes inter-modality conditional independence,
while the second perspective explicitly captures the correlation
between modalities is and recognized by a clustering based kernel
similarity approach. These methods can also be applied to multi-
modal sentiment analysis.
Our gated controller for visual and acoustic modality is inspired
by the controller used by Zoph and Le [48], where they use a Recur-
rent Neural Network (RNN) controller to determine the structure
of a CNN.
Compared to previous work, our method has two major advan-
tages. To the best of our knowledge, our model is the first to use
word level modality fusion, which means that we align each word
to corresponding video frames and audio segments. Secondly, we
are also the first to propose an attention layer and a input gate con-
troller trained by reinforcement learning to approach the problem
of noisy modalities.
3 PROPOSED APPROACHESIn this section, we give a detailed description of our proposed ap-
proach which will be divided into 2 modules: the Gated Multimodal
Multimodal Sentiment Analysis, Word-Level Fusion, Reinforcement Learning ICMI’17, November 13–17, 2017, Glasgow, UK
Embedding Layer and the LSTM with Temporal Attention model.
The Gated Multimodal Embedding Layer performs selective multi-
modal fusion at each time step (word level) using input modality
gates, and the LSTM with Temporal Attention performs sentiment
prediction with attention to each time step. Together, these mod-
ules combine to form the Gated Multimodal Embedding LSTM with
Temporal Attention (GME-LSTM(A)). The section ends with the
training details for the GME-LSTM(A).
3.1 Gated Multimodal EmbeddingThe first component is the Gated Multimodal Embedding that per-
forms multimodal fusion by learning the local interactions between
modalities. Suppose the dataset contains N video clips, each con-
taining an opinion mapped to sentiment intensity. A video clip
contains T time steps, where each time step corresponds to a word.
Each video clip is also labeled with the ground truth sentiment
value y ∈ R. We align words with their corresponding video and
audio segment using the Penn Phonetics Lab Forced Aligner (P2FA)
[33]. P2FA is a software that computes an alignment between a
speech audio file and a verbatim text transcript. Following the align-
ment, at each word level time step t , we obtain the aligned feature
vectors for language (word), acoustic and visual modalities: xwt , xat ,xvt respectively.
One problem of previous models is that of noisy modalitiesduring multimodal fusion since the textual modality may be nega-
tively affected by the visual and audio modalities. As a result, useful
textual features may be ignored because the corresponding visual
or acoustic feature is noisy and important information may be lost.
Our solution is to introduce an on/off input gate controller that
determines if the acoustic or visual modality at each time step
should contribute to the overall prediction. The reason why we
apply the input gate controller on acoustic/visual features while
always letting textual features in is that the language modality is
much more reliable for multimodal sentiment analysis than visual
or acoustic. Also visual and acoustic modalities can be noisy or un-
reliable since audio visual feature extraction is done automatically
using methods that add additional noise.
Mathematically, we formalize this with inputs xat and xvt repre-
senting the audio and visual inputs at time step t respectively. We
have a controller Ca, with weights θa , for determining the on/off
of audio modalities, and Cv , with weights θv , for determining the
on/off of visual modalities. These controllers are implemented as
deep neural networks Ca ( · ;θa ) and Cv ( · ;θv ) that take in xat and
xvt as input and outputs a binary label cat and cvt (0/1) respectively.
The binary output of these controllers mimics the act of accepting
or rejecting a modality based on its noise level. The new inputs x′atand x′vt to the network are:
x′at = cat · xat = Ca (xat ;θa ) · xat (1)
x′vt = cvt · xvt = Cv (xvt ;θa ) · xvt (2)
We concatenate features from three different modalities: visual,
audio and text to form the inputs xt to the word level LSTM with
Temporal Attention, described in the next section. By extracting
features with alignment at the word level, we exploit the temporal
correlation among different modalities xwt , x′at , x
′vt , but at the same
time the model is less affected by the impact of noisy modalities
during multimodal fusion.
xt =
xwtx′atx′vt
(3)
3.2 LSTM with Temporal AttentionThe second component is a sentiment prediction model that cap-
tures the temporal interactions on the multimodal embedding layer.
This component learns the global interactions between modalities
for sentiment prediction. To do so, we use an LSTM with Tempo-
ral Attention (LSTM(A)). The Gated Multimodal Embedding xt ispassed as input to the LSTM at each time step t . A LSTM [10] with
a forget gate [9] is used to learn global temporal information on
multimodal input data Xt :©«ifom
ª®®®®®¬=
©«siдmoid
siдmoid
siдmoid
tanh
ª®®®®®¬U
(XtWht−1
)(4)
ct = f ⊙ ct−1 + i ⊙ m (5)
ht = o ⊙ tanh(ct ) (6)
where i, f and o are the input, forget and output gates of the LSTM,
c is the LSTM memory cell and h is the LSTM output, W and Ulinearly transform Xt and ht−1 respectively into the LSTM gate
space and the ⊙ operator denotes the Hadamard product (entry-
wise product).
We use an attention model similar to [37] to selectively combine
temporal information from the input modalities by adaptively pre-
dicting the most important time steps towards sentiment of a video
clip. We expect relevant information to sentiment to have high
attention weights. For example if a person is “crying” or “laughing”,
this information is relevant to the sentiment of the opinion and
should have higher importance than a neutral word such as “movie”.
This attention mechanism also allows the modalities to act as com-
plimentary information. In cases where language is not helpful, the
model can adaptively focus on the presence of sentiment related
non-verbal behaviors such as facial gestures and tone of voice.
Mathematically, we add a soft attention layer α on top of the
sequence of LSTM hidden outputs. α is obtained by multiplying the
hidden layer at each time step t with a shared vectorw and passing
the sequence through a softmax function.
α = so f tmax
©«
w⊤h1···
w⊤hT
ª®®®®®®®¬(7)
The attention units α are used to weight the importance of each
time step’s hidden layer to final sentiment prediction. Suppose Hrepresents the matrix of all hidden units of the LSTM [h1; ...; hT ].Then the final sentiment prediction y is obtained by:
z = Hα (8)
y = Q(z) (9)
ICMI’17, November 13–17, 2017, Glasgow, UK Chen, Wang, Liang, Baltrušaitis, Zadeh and Morency
LSTM
𝑥"#
LSTM LSTM LSTM
AttentionUnits
ℎ" ℎ% ℎ& ℎ'
FC-ReLU
𝑥"() 𝑥"(* 𝑥%# 𝑥%() 𝑥%(* 𝑥&# 𝑥&() 𝑥&(* 𝑥'(*𝑥'()𝑥'#
𝐶)
𝑥%)
0/1 𝑅 = 𝑒234
…
𝑦6
…
GME
𝑥") 𝑥&)
GME
𝑥')
GME
Figure 1: Architecture of the GME-LSTM(A) model for the visual modality. Cv is the controller for the visual modality thatselectively allows visual inputs xvt to pass. FC-ReLU is a fully-connected layer with rectified linear unit (ReLU) as activation.After obtaining a sentiment prediction y and loss L, we use R = eb−L as the reward signal to train the visual input gatecontroller Cv .
where function Q represents a dense layer with a non-linear acti-
vation. We select Mean Absolute Error (MAE) as the loss function.
Though Mean Square Error (MSE) is a more popular choice for loss
function, MAE is a common criteria for sentiment analysis [45].
L =1
N
N∑i=1|yi − yi | (10)
Figure 1 shows the full structure of the GME-LSTM(A) model.
3.3 Training Details for GME-LSTM(A)To train the GME-LSTM(A), we need to know how output decisions
of the controller affects the performance of our LSTM(A) model.
Given the weights of the gate controller and input data xa1:T and
xv1:T , the controller decides whether we should reject an input or not.
The rejected inputs are replaced with 0, while the accepted inputs
are not changed. In this way we obtain the new inputs x′a1:T and x′v
1:T .
After we train the LSTM(A) with the new inputs (xw1:T , x
′a1:T , x
′v1:T ),
we get a MAE loss, L, on the validation set. Here L can be seen an
indicator of how well our controller affects the performance of the
model. Note that that lower MAE implies better performance, so
we use e−L as the reward signal to train the controllers.
Take the visual controller Cv as an example: we are maximizing
the expected reward, represented by J (θv ):J (θv ) = EP (cv
1:T |xv1:T ;θ
v )[e−L] (11)
where T is the total number of time steps in the dataset. The sen-
timent prediction MAE L in the reward signal is non-convex and
non-differentiable with respect to the parameters of the GME since
changes in the outputs of the GME change the MAE L in a discrete
manner. Straightforward gradient descent methods will not explore
all the possible regions of the function. This form of problem has
been studied in reinforcement learning where policy gradient meth-
ods balance exploration and optimization by randomly sampling
many possible outputs of the GME controller before optimizing for
best performance. Specifically, the REINFORCE algorithm [38] is
used to iteratively update θv :
∇θv J (θv ) =T∑i=1
EP (cv1:T |x
v1:T ;θ
v )[∇θv log P(cvi |xvi ;θ
v )e−L] (12)
An empirical approximation of the above quantity is to sample the
outputs of the controller [48]:
∇θv J (θv ) ≈1
n
n∑k=1
T∑i=1∇θv log P(cvi |x
vi ;θ
v )e−Lk (13)
where n is the number of different inputs datasets (xw1:T , x
′a1:T , x
′v1:T )
that the controller samples, and Lk is the MAE on the validation
dataset after the model is trained on kth inputs set.
In order to reduce variance of this estimation, we employ a
baseline function b [48]:
∇θv J (θv ) ≈1
n
n∑k=1
T∑i=1∇θv log P(cvi |x
vi ;θ
v )(eb−Lk ) (14)
where b is an exponential moving average of the previous MAEs
on the validation set.
If we take the visual input gate controller as an example, the
detailed training algorithm for the visual input gate controller is
shown in Algorithm 1. The acoustic input gate is trained in the
same manner.
Multimodal Sentiment Analysis, Word-Level Fusion, Reinforcement Learning ICMI’17, November 13–17, 2017, Glasgow, UK
Algorithm 1 Train gate controller
1: function trainGateController(Cv )2: for epoch ← 1 : epoch_num do3: for k ← 1 : n do4: for i ← 1 : T do5: p_pass = predict(Cv , xvi )6: x′vi ← 0
7: x′vi ← xvi with probability p_pass8: end for9: lossk ← trainLSTM(A)(xw
1:T , xa1:T , x
′v1:T )
10: end for11: updateController (Cv , lossk , loss_baseline)12: updateLossBaseline(lossk , loss_baseline)13: end for14: end function
4 EXPERIMENTAL METHODOLOGYIn this section, we describe the experimental methodology includ-
ing the dataset, data splits for training, validation and testing, the
input features and their preprocessing, the experimental details and
finally review the baseline models that we compare our results to.
4.1 CMU-MOSI DatasetWe test on the Multimodal Corpus of Sentiment Intensity and Sub-
jectivity Analysis (CMU-MOSI) dataset [45], which is a collection of
online videos in which a speaker is expressing his or her opinions
towards a movie. Each video is split into multiple clips, and each
clip contains one opinion expressed by one or more sentences. A
clip has one sentiment label y ∈ [−3,+3] which is a continuous
value representing speaker’s sentiment towards a certain aspect of
the movie. Figure 2 depicts a snapshot from the CMU-MOSI dataset.
The CMU-MOSI dataset consists of 93 videos / 2199 labeled clips
and training is performed on the labeled clips. Each video in the
CMU-MOSI dataset is from a different speaker. We use the train
and test sets defined in [36] which trains on 52 videos/1284 clips
(52 distinct speakers), validates on 10 videos/229 clips (10 distinct
speakers) and tests on 31 videos/686 clips (31 distinct speakers).
There is no speaker dependent contamination in our experiments so
our model is generalizable and learns speaker-independent features.
4.2 Input FeaturesWe use text, video, and audio as input modalities for our task. For
text inputs, we use pre-trained word embeddings (glove.840B.300d)
[19] to convert the transcripts of videos in the CMU-MOSI dataset
into word vectors. This is a 300 dimensional word embedding
trained on 840 billion tokens from the common crawl dataset. For
audio inputs, we use COVAREP [7] to extract acoustic features
including 12 Mel-frequency cepstral coefficients (MFCCs), pitch
tracking and voiced/unvoiced segmenting features, glottal source
parameters, peak slope parameters and maxima dispersion quo-
tients. For video inputs, we use Facet [11] and OpenFace [4, 44] to
extract a set of features including facial action units, facial land-
marks, head pose, gaze tracking and HOG features [47].
Figure 2: A snapshot from the CMU-MOSI dataset, wheretext, visual and audio features are aligned. For example, inthe bottom row of Figure 2, the first scene is labeled withtext - the speaker is currently saying the word “It”, this isaligned with the video clip of her speaking that word whereshe looks excited.
4.3 Implementation DetailsBefore training, we select the best 20 features from Facet and 5
from COVAREP using univariate linear regression tests. The se-
lected Facet and COVAREP features are linearly normalized by the
maximum absolute value in the training set.
For the LSTM(A) model, we set the number of hidden units of
the LSTM as 64. The maximum sequence length of the LSTM, T ,is 115. There are 50 units in the ReLU fully connected layer. The
model is trained using ADAM [14] with learning rate 0.0005 and
MAE (mean absolute error) as the loss function.
For the GME-LSTM(A) model, the visual and audio controllers
are each implemented as a neural network with one hidden layer of
32 units and sigmoid activation. The number of samplesn generated
from the controller at each training step is 5. Each sampled LSTM(A)
model is trained using ADAM [14] with learning rate 0.0005 and
MAE (mean absolute error) as the loss function. The input gate
controller is then trained using ADAM [14] with learning rate
0.0001.
5 EXPERIMENTAL RESULTS5.1 Baseline ModelsWe compare the performance of our methods to the following state-
of-the-art multimodal sentiment analysis models:
SAL-CNN (Selective Additive Learning CNN) [36] is a multi-
modal sentiment analysis model that attempts to prevent identity-
dependent information from being learned so as to improve gener-
alization based only on accurate indicators of sentiment.
SVM-MD (Support Vector Machine Multimodal Dictionary) [46]
is a SVM trained for classification or regression on multimodal
concatenated features for each video clip.
C-MKL (Convolutional Multiple Kernel Learning) [25] is a mul-
timodal sentiment analysis model which uses a CNN for textual
feature extraction and multiple kernel learning for prediction.
RF (Random Forest) is a baseline intended for comparison to a
non neural network approach.
ICMI’17, November 13–17, 2017, Glasgow, UK Chen, Wang, Liang, Baltrušaitis, Zadeh and Morency
Random is a baselinewhich always predicts a random sentiment
intensity between [3,−3] [46]. This is designed as a lower bound
to compare model performance.
Human performance was recorded when humans are asked to
predict the sentiment score of each opinion segment [46]. This acts
as a future target for machine learning methods.
Since sentiment analysis based on language has beenwell-studied,
we also compare our methods with following text-based models:
RNTN (Recursive Neural Tensor Network) [30] is a well-known
sentiment analysis method that leverages the sentiment of words
and their dependency structure.
DAN (Deep Average Network) [12] is a simple but efficient senti-
ment analysis model that uses information only from distributional
representation of the words.
D-CNN (DynamicCNN) [13] is among the state of the art mod-
els in text-based sentiment analysis which uses a convolutional
architecture adopted for the semantic modeling of sentences.
Finally, any model with “text” appended denotes the model
trained only on the textual modality of the CMU-MOSI video clips.
5.2 ResultsIn this section, we summarize the results on multimodal sentiment
analysis. In Table 1, we compare our proposed approaches with
previous state-of-the-art multimodal as well as language-based
baseline models for sentiment analysis (described in Section 5.1).
The multimodal section of Table 1 shows the performance of our
two proposed approaches compared to other multimodal baseline
methods. The model we proposed, GME-LSTM(A) as well as the
version without gated controller LSTM(A), both outperform multi-
modal and single modality sentiment analysis models. The GME-
LSTM(A) model gives the best result achieved across all models,
improving upon the state of the art by 4.08% in binary classification
accuracy and 13.2% in MAE. Since GME-LSTM(A) is able to attend
both in time, using soft attention as well as in input modality, using
the Gated Multimodal Embedding Layer, it is not a surprise that
this model outperforms all others.
The language section of Table 1 shows that LSTM(A) on a single
modality, language, obtains slightly worse performance than some
language-based methods. This is because these methods use more
complicated language models such as dependency-based parse tree.
However, by combining cues from audio and video with careful
multimodal fusion, GME-LSTM(A) immediately outperforms all
language-based and multimodal baseline models. This jump in per-
formance shows that good temporal attention and multimodal fu-
sion is key: our model benefit from the addition of input modalities
more so than other models did.
6 DISCUSSIONIn this section, we analyze the usefulness of our model’s different
components, demonstrating that the Temporal Attention Layer and
the Gated Multimodal Embedding over input modalities are both
crucial towards multimodal fusion and sentiment prediction.
6.1 LSTM with Temporal Attention AnalysisLanguage is most important in predicting sentiment. In both
the LSTM model (Table 2) and the LSTM(A) model (Table 3), using
Table 1: Sentiment prediction results on test set using dif-ferent text-based andmultimodal methods. Numbers are re-ported in binary classification accuracy (Acc), F-score andMAE, and the best scores are highlighted in bold (exclud-ing human performance). ∆SOTA shows improvement overthe state-of-the-art. Results for RNTN are parenthesized be-cause themodelwas trained on the Stanford Sentiment Tree-bank dataset [30] which is much larger than CMU-MOSI.
Modalities Method Acc F-score MAE
Text
RNTN [30] (73.7) (73.4) (0.990)
DAN [12] 70.0 69.4 -
D-CNN [13] 69.0 65.1 -
SAL-CNN text [36] 73.5 - -
SVM-MD text [46] 73.3 72.1 1.186
RF text [46] 57.6 57.5 -
LSTM text (ours) 67.8 51.2 1.234
LSTM(A) text (ours) 71.3 67.3 1.062
Multimodal
Random 50.2 48.7 1.880
SAL-CNN [36] 73.0 - -
SVM-MD [46] 71.6 72.3 1.100
C-MKL [25] 73.5 - -
RF [46] 57.4 59.0 -
LSTM (ours) 69.4 63.7 1.245
LSTM(A) (ours) 75.7 72.1 1.019
GME-LSTM(A) (ours) 76.5 73.4 0.955Human 85.7 87.5 0.710
∆SOTA ↑ 3.0 ↑ 1.1 ↓ 0.145
Table 2: Sentiment prediction results on test set using LSTMmodel with different combinations of modalities. Numbersare reported in binary classification accuracy (Acc), F-scoreand MAE, and the best scores are highlighted in bold.
Method Modalities Acc F-score MAE
LSTM
text 67.8 51.2 1.234
audio 44.9 61.9 1.511
video 44.9 61.9 1.505
text + audio 66.8 55.3 1.211text + video 63.0 65.6 1.302
text + audio + video 69.4 63.7 1.245
only the text modality provides a better sentiment prediction than
using unimodal audio and visual modalities.
Acoustic and visual modality are noisy. When we provide ad-
ditional modalities to the LSTM model without attention (Table
2), the performance does not improve significantly. Using all three
modalities actually leads to slightly worse performance in F-score
and MAE as compared to using fewer input modalities. This allows
us to deduce that the audio and video features are probably noisy
Multimodal Sentiment Analysis, Word-Level Fusion, Reinforcement Learning ICMI’17, November 13–17, 2017, Glasgow, UK
Table 3: Sentiment prediction results on test set usingLSTM(A) model with different combinations of modalities.Numbers are reported in binary classification accuracy (Acc),F-score andMAE, and the best scores are highlighted in bold.
Method Modalities Acc F-score MAE
LSTM(A)
text 71.3 67.3 1.062
audio 55.4 63.0 1.451
video 52.3 57.3 1.443
text + audio 73.5 70.3 1.036
text + video 74.3 69.9 1.026
text + audio + video 75.7 72.1 1.019
and may hurt the model’s performance if multimodal fusion is not
carefully performed.
Temporal Attention improves sentiment prediction. On the
other hand, whenwe use the the LSTM(A)model, Table 3 shows that
adding more modalities improves sentiment regression and classifi-
cation. The LSTM(A) (Table 3) consistently outperforms the LSTM
(Table 2) across all modality combinations. We hypothesize that by
using temporal attention, the model will assign the largest attention
weights to time steps where all 3 modalities give strong, consistent
sentiment predictions and abandon noisy frames altogether. As a
result, temporal attention improves sentiment prediction despite
the presence of noisy acoustic and visual modalities.
Successful cases of the LSTM(A) model. To obtain a better in-
sight into our model’s performance, we provide some successful
cases to demonstrate the contribution of the Temporal Attention
Layer. By further studying the attention weights α in the LSTM(A),
we can find which words/time steps the model focuses on. The
following are examples of successful cases when we look at the tex-
tual modality alone. Each line represents a single transcript and the
bold word indicates the word which the model assigns the highest
attention weight to.
I thought it was fun.And she really enjoyed the film.
But a lot of the footage was kind of unnecessary.The highlighted words are all words that are good indicators of
positive or negative sentiment.
The LSTM(A) model combines word meanings with audioand visual indicators. Figure 3 and Figure 4 are examples where
the LSTM(A) model is successful when we use all 3 modalities. In
these examples, the LSTM(A) model is able to leverage the word
level alignment of audio and visual modalities to overcome the
ambiguity in the corresponding aligned word. The LSTM(A) model
is able to determine overall video sentiment to a greater accuracy
as compared to the LSTM model without attention.
Word level fusion enables fine grained multimodal analysis.We see that the model is indeed capturing the meaning of words
and implicitly classifying them based on their sentiment: positive,
negative or neutral. For neutral words, the model correctly looks
at the aligned visual and audio modalities to make a prediction.
Therefore, the model is learning the indicators of sentiment from
facial gestures and tone of voice as well. This is a benefit of word
He’s not gonna be looking like a chirper bright young man butearly thirties really you want me to buy that.
Visual modality: Looks disappointed
LSTM sentiment prediction: 1.24LSTM(A) sentiment prediction: -0.94
Ground truth sentiment: -1.8
Figure 3: Successful case 1: Although the highest weightedword extracted from the transcript (top) is “want”, with am-biguous sentiment, the LSTM(A) leverages the visual modal-ity (center), where the speaker looks disappointed, to makea prediction on video sentimentmuch closer to ground truth(bottom).
The only actor who can really sell their lines is Erin.
Visual modality: Looks sad
LSTM sentiment prediction: 1.86LSTM(A) sentiment prediction: -0.3
Ground truth sentiment: -1.0
Figure 4: Successful case 2: Although the highest weightedword extracted from the transcript (top) is “lines”, with am-biguous sentiment, the LSTM(A) leverages the visual modal-ity (center), where the speaker looks sad, to make a predic-tion on video sentiment much closer to ground truth (bot-tom).
level fusion sincewe can examine exactlywhat themodel is learning
at a finer resolution.
6.2 Gated Multimodal Embedding AnalysisGatedMultimodal Embedding helpsmultimodal fusion. TheLSTM(A)model is still susceptible to noisymodalities. Table 4 shows
that the GME-LSTM(A) model outperforms the LSTM(A) model on
ICMI’17, November 13–17, 2017, Glasgow, UK Chen, Wang, Liang, Baltrušaitis, Zadeh and Morency
Table 4: Sentiment prediction results on test set using LSTM,LSTM(A) and GME-LSTM(A) multimodal models. Numbersare reported in binary classification accuracy (Acc), F-scoreand MAE, and the best scores are highlighted in bold.
Method Modalities Acc F-score MAE
LSTM text + audio + video 69.4 63.7 1.245
LSTM(A) text + audio + video 75.7 72.1 1.019
GME-LSTM(A) text + audio + video 76.5 73.4 0.955
LSTM(A) sentiment prediction: -2.00GME-LSTM(A) sentiment prediction: 1.48
Ground truth sentiment: 1.2
Figure 5: Successful case 1: Across the entire video, thespeaker’s facial features were rather monotonic except forone frame where she smiled brightly (left). Our visual inputgate rejects the visual input at time steps before and after,but allows this frame to pass since the speaker is display-ing obvious facial gestures. The prediction was much closerto ground truth as compared to without the input gate con-troller (right).
all metrics, indicating that there is value in attending in modalities
using the Gated Multimodal Embedding .
GME-LSTM(A)model correctly selects helpfulmodalities.Toobtain a better insight into the effect of the Gated Multimodal
Embedding Layer, a successful example is shown in Figure 5, where
the input gate controller for the visual modality correctly identifies
frames where obvious facial expressions are displayed, and rejects
those with a blank expression.
GME-LSTM(A) model correctly rejects noisy modalities. We
now revisit a failure case of the LSTM(A) model, where the speaker
is covering her mouth during the word that gives best sentiment
prediction, “cute” (Figure 6). The LSTM(A) model is focusing on an
uninformative time step and makes a poor sentiment prediction. In
other words, the model may be confused if the added visual and
audio modalities are uninformative or noisy. We found that the
Gated Multimodal Embedding correctly rejects the noisy visual
input at the time step of “cute” and the GME-LSTM(A) model gives
a sentiment prediction closer to the ground truth (Figure 6). This is
a good example where the GME-LSTM(A) model directly tackles
the problem that motivated its development: the issue of noisy
modalities that hurts performance when multimodal fusion is not
carefully performed. Specifically, the GME-LSTM(A) model was
able to learn that the visual modality was mismatched with the
textual modality, further recognizing that the visual modality was
noisywhile the correspondingwordwas a good indicator of positive
speaker sentiment.
First of all I’d like to say little James or Jimmy he’s so cute he’s so ...
Visual modality: Hands cover mouth
LSTM sentiment prediction: 1.23LSTM(A) sentiment prediction: -0.94
GME-LSTM(A) sentiment prediction: 1.57Ground truth sentiment: 3.0
Figure 6: Successful case 2: The LSTM(A) extracts the wrongword from the sentiment, extracting “little” instead of thebetter word “cute” (top). Upon inspection, the speaker is cov-ering her mouth when the word “cute" is spoken (center),which leads to less attentionweight onword “cute” since themodalities are not consistently strong at that frame. As a re-sult, the LSTM(A) model makes a prediction on video senti-ment that is further away from ground truth (bottom). How-ever, the Gated Multimodal Embedding correctly rejects thenoisy visual input at the time step of “cute” (bottom). Includ-ing the Gated Multimodal Embedding improves the senti-ment prediction back closer to ground truth.
7 CONCLUSIONIn this paper we proposed Gated Multimodal Embedding LSTM
with Temporal Attention model for multimodal sentiment analysis.
Our approach is the first of it’s kind to perform multimodal fusion
at word level. Furthermore to build a model that is suitable for
the complex structure of speech, we introduce selective word-level
fusion between modalities using gating mechanism trained using
reinforcement learning. We use attention model to divert the focus
of our model to important moments in speech. The stateful nature
of our model allows for long interactions to be captured between
different modalities. We show state of the art performance in MOSI
dataset and we bring qualitative analysis of how our model is able
to deal with various challenges of understanding communication
dynamics.
REFERENCES[1] Basant Agarwal, Soujanya Poria, Namita Mittal, Alexander Gelbukh, and Amir
Hussain. 2015. Concept-level sentiment analysis with dependency-based semantic
parsing: a novel approach. Cognitive Computation 7, 4 (2015), 487–499.
[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,
C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In
Proceedings of the IEEE International Conference on Computer Vision. 2425–2433.[3] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2017.
Multimodal Machine Learning: A Survey and Taxonomy. arXiv preprintarXiv:1705.09406 (2017).
[4] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. Openface:
an open source facial behavior analysis toolkit. In Applications of Computer Vision(WACV), 2016 IEEE Winter Conference on. IEEE, 1–10.
Multimodal Sentiment Analysis, Word-Level Fusion, Reinforcement Learning ICMI’17, November 13–17, 2017, Glasgow, UK
[5] Jayanth Koushik Louis-Philippe Morency Behnaz Nojavanasghari,
Deepak Gopinath. 2016. Deep Multimodal Fusion for Persuasiveness
Prediction.
[6] M. Chatterjee, S. Park, L.-P. Morency, and S. Scherer. 2015. Combining Two
Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits. In
Proceedings of International Conference Multimodal Interaction (ICMI 2015).[7] Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer.
2014. COVAREP—A collaborative voice analysis repository for speech technolo-
gies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE InternationalConference on. IEEE, 960–964.
[8] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach,
Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term
recurrent convolutional networks for visual recognition and description. In
Proceedings of the IEEE conference on computer vision and pattern recognition.2625–2634.
[9] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget:
Continual prediction with LSTM. Neural computation 12, 10 (2000), 2451–2471.
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory.
Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[11] iMotions. 2017. Facial Expression Analysis. (2017). goo.gl/1rh1JN
[12] Mohit Iyyer, Varun Manjunatha, Jordan L Boyd-Graber, and Hal Daumé III. 2015.
Deep Unordered Composition Rivals Syntactic Methods for Text Classification..
In ACL (1). 1681–1691.[13] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional
neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014).[14] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Opti-
mization. CoRR abs/1412.6980 (2014). http://arxiv.org/abs/1412.6980
[15] Navonil Majumder, Soujanya Poria, Alexander Gelbukh, and Erik Cambria. 2017.
Deep Learning-Based Document Modeling for Personality Detection from Text.
IEEE Intelligent Systems 32, 2 (2017), 74–79.[16] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multi-
modal sentiment analysis: Harvesting opinions from the web. In Proceedings ofthe 13th international conference on multimodal interfaces. ACM, 169–176.
[17] Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis. Foun-dations and Trends® in Information Retrieval 2, 1–2 (2008), 1–135.
[18] Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-
Philippe Morency. 2014. Computational Analysis of Persuasiveness in Social
Multimedia: ANovel Dataset andMultimodal Prediction Approach. In Proceedingsof the 16th International Conference on Multimodal Interaction (ICMI ’14). ACM,
New York, NY, USA, 50–57. https://doi.org/10.1145/2663204.2663260
[19] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove:
Global Vectors for Word Representation.
[20] Veronica Perez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013.
Utterance-Level Multimodal Sentiment Analysis. In Association for Computa-tional Linguistics (ACL). Sofia, Bulgaria. http://ict.usc.edu/pubs/Utterance-Level%20Multimodal%20Sentiment%20Analysis.pdf
[21] Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013.
Utterance-Level Multimodal Sentiment Analysis.. In ACL (1). 973–982.[22] Soujanya Poria, Basant Agarwal, Alexander Gelbukh, Amir Hussain, and Newton
Howard. 2014. Dependency-based semantic parsing for concept-level text analy-
sis. In International Conference on Intelligent Text Processing and ComputationalLinguistics. Springer, 113–127.
[23] Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. 2017. A Review of
Affective Computing: FromUnimodal Analysis toMultimodal Fusion. InformationFusion 1 (2017), 34.
[24] Soujanya Poria, Erik Cambria, and Alexander F Gelbukh. 2015. Deep Convo-
lutional Neural Network Textual Features and Multiple Kernel Learning for
Utterance-level Multimodal Sentiment Analysis.
[25] Soujanya Poria, Erik Cambria, and Alexander F. Gelbukh. 2015. Deep Convo-
lutional Neural Network Textual Features and Multiple Kernel Learning for
Utterance-level Multimodal Sentiment Analysis. In Proceedings of the 2015 Con-ference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon,Portugal, September 17-21, 2015. 2539–2544. http://aclweb.org/anthology/D/D15/D15-1303.pdf
[26] Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Federica Bisio. 2016. Sentic
LDA: Improving on LDA with semantic similarity for aspect-based sentiment
analysis. In Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE,4465–4473.
[27] Soujanya Poria, Alexander Gelbukh, Dipankar Das, and Sivaji Bandyopadhyay.
2012. Fuzzy clustering for semi-supervised learning–case study: Construction of
an emotion lexicon. In Mexican International Conference on Artificial Intelligence.Springer, 73–86.
[28] Soujanya Poria, Haiyun Peng, Amir Hussain, Newton Howard, and Erik Cambria.
2017. Ensemble application of convolutional neural networks and multiple kernel
learning for multimodal sentiment analysis. Neurocomputing (2017).
[29] Stefan Scherer, Gale M. Lucas, Jonathan Gratch, Albert Skip Rizzo, and Louis-
Philippe Morency. 2016. Self-reported symptoms of depression and PTSD are
associated with reduced vowel space in screening interviews. IEEE Transactionson Affective Computing 7, 1 (Jan. 2016), 59–73. https://doi.org/10.1109/TAFFC.
2015.2440264
[30] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Man-
ning, Andrew Y Ng, Christopher Potts, et al. 2013. Recursive deep models for
semantic compositionality over a sentiment treebank. In Proceedings of the con-ference on empirical methods in natural language processing (EMNLP), Vol. 1631.Citeseer, 1642.
[31] Lucia Specia, Stella Frank, Khalil Sima’an, and Desmond Elliott. 2016. A shared
task on multimodal machine translation and crosslingual image description.
In Proceedings of the First Conference on Machine Translation, Berlin, Germany.Association for Computational Linguistics.
[32] Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede.
2011. Lexicon-based methods for sentiment analysis. Computational linguistics37, 2 (2011), 267–307.
[33] ucbvislab. 2013. p2fa-vislab. (2013). https://github.com/ucbvislab/p2fa-vislab
[34] Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Dennis Lalanne,
Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja
Pantic. 2016. AVEC 2016: Depression, Mood, and Emotion Recognition Workshop
and Challenge. In Proceedings of the 6th International Workshop on Audio/VisualEmotion Challenge. ACM, 3–10.
[35] Verena Venek, Stefan Scherer, Louis-Philippe Morency, Albert Rizzo, and John
Pestian. 2016. Adolescent suicidal risk assessment in clinician-patient interaction.
IEEE Transactions on Affective Computing (2016).
[36] Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing.
2016. Select-Additive Learning: Improving Cross-individual Generalization in
Multimodal Sentiment Analysis. arXiv preprint arXiv:1609.05244 (2016).[37] Minlie Zhao Li Zhu-Xiaoyan Wang, Yequan Huang. 2016. Attention-based LSTM
for Aspect-level Sentiment Classification. EMNLP (2016).
[38] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for
connectionist reinforcement learning. Machine Learning 8, 3 (1992), 229–256.
https://doi.org/10.1007/BF00992696
[39] Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun,
Kenji Sagae, and Louis-Philippe Morency. 2013. Youtube movie reviews: Senti-
ment analysis in an audio-visual context. IEEE Intelligent Systems 28, 3 (2013),46–53.
[40] Bishan Yang and Claire Cardie. 2012. Extracting opinion expressions with semi-
markov conditional random fields. In Proceedings of the 2012 Joint Conference onEmpirical Methods in Natural Language Processing and Computational NaturalLanguage Learning. Association for Computational Linguistics, 1335–1345.
[41] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016.
Image captioning with semantic attention. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition. 4651–4659.
[42] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2017.
Recent Trends in Deep Learning Based Natural Language Processing. arXivpreprint arXiv:1708.02709 (2017).
[43] Zhou Yu, Stefen Scherer, David Devault, Jonathan Gratch, Giota Stratou, Louis-
Philippe Morency, and Justine Cassell. 2013. Multimodal prediction of psycholog-
ical disorders: Learning verbal and nonverbal commonalities in adjacency pairs.
In Semdial 2013 DialDam: Proceedings of the 17th Workshop on the Semantics andPragmatics of Dialogue. 160–169.
[44] Amir Zadeh, Tadas Baltrušaitis, and Louis-Philippe Morency. 2017. Convolutional
experts constrained local model for facial landmark detection. In Computer Visionand Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE,2051–2059.
[45] Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI:
Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online
Opinion Videos. arXiv preprint arXiv:1606.06259 (2016).[46] Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. Multi-
modal sentiment intensity analysis in videos: Facial gestures and verbal messages.
IEEE Intelligent Systems 31, 6 (2016), 82–88.[47] Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng, and Shai Avidan. 2006. Fast
human detection using a cascade of histograms of oriented gradients. In ComputerVision and Pattern Recognition, 2006 IEEE Computer Society Conference on, Vol. 2.IEEE, 1491–1498.
[48] Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement
learning. arXiv preprint arXiv:1611.01578 (2016).