
Video Captioning with Multi-Faceted Attention

Xiang Long
IIIS, Tsinghua University

Chuang Gan
IIIS, Tsinghua University

Gerard de Melo
Rutgers University

Abstract

Recently, video captioning has been attracting an increasing amount of interest, due to its potential for improving accessibility and information retrieval. While existing methods rely on different kinds of visual features and model structures, they do not fully exploit relevant semantic information. We present an extensible approach to jointly leverage several sorts of visual features and semantic attributes. Our novel architecture builds on LSTMs for sentence generation, with several attention layers and two multimodal layers. The attention mechanism learns to automatically select the most salient visual features or semantic attributes, and the multimodal layers yield overall representations for the input and output of the sentence generation component. Experimental results on the challenging MSVD and MSR-VTT datasets show that our framework outperforms the state-of-the-art approaches, while ground truth based semantic attributes are able to further elevate the output quality to a near-human level.

1 Introduction

The task of automatically generating captions for videos has been receiving an increasing amount of attention.

On YouTube, for example, hundreds of hours of video content are uploaded every single minute. There is no way a person could sit and watch these overwhelming amounts of video, so new techniques to search and quickly understand them are highly sought. Generating captions, i.e., short natural language descriptions, for videos is an important technique to address this challenge, while also greatly improving their accessibility for blind and visually impaired users.

Video captioning has been studied for a long time and remains challenging, given the difficulties of video interpretation, natural language generation, and the interplay between them. Understanding a video hinges on our ability to make sense of video frames and of the relationships between consecutive frames.

The output needs to be a grammatically correct sequence of words. Different parts of the output caption may pertain to different parts of the video. In previous work, 3D ConvNets (Du et al., 2015) have been proposed to capture motion information in short videos, while LSTMs (Hochreiter and Schmidhuber, 1997) can be used to generate natural language, and a variety of different visual attention models (Yao et al., 2015; Pan et al., 2015; Yu et al., 2015) have been deployed, attempting to capture the relationship between caption words and the video content.

These methods, however, only make use of visual information from the video, often with unsatisfactory results. In many real-world settings, we can easily obtain additional information related to the video. Apart from sound, there may also be a title, user-supplied tags, categories, and other metadata. Both visual video features and attributes such as tags can be imperfect and incomplete. However, by jointly considering all available signals, we may obtain complementary information that aids in generating better captions.


Figure 1: Example video with extracted visual features, semantic attributes, and the generated caption as output. (The example shows a 301-frame video, ResNet-152 temporal features, C3D motion features, the semantic attributes "Monkey" and "Pulling", and the generated caption "a monkey is playing with a dog.")

Humans, too, often benefit from additional context information when trying to understand what a video is portraying.

Incorporating these additional signals is not just a matter of adding additional features. While generating the sequence of words in the caption, we need to be able to flexibly attend to the relevant frames over time, the relevant parts within a given frame, and relevant additional signals to the extent that they pertain to a particular output word.

Based on these considerations, we propose a novel multi-faceted attention architecture that jointly considers multiple heterogeneous forms of input. This model flexibly attends to temporal information, motion features, and semantic attributes for every channel. An example of this is given in Figure 1. Each part of the attention model is an independent branch, and it is straightforward to incorporate additional branches for further kinds of features, making our model highly extensible. We present a series of experiments that highlight the contribution of attributes and yield state-of-the-art results on standard datasets.

2 Related Work

Machine Translation. Some of the first widely noted successes of deep sequence-to-sequence learning models were for the task of machine translation (Cho et al., 2014b; Cho et al., 2014a; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013; Li et al., 2015; Lin et al., 2015). In several respects, this is actually a similar task to video caption generation, just with a rather different input modality. What they share in common is that both require bridging different representations, and that often an encoder-decoder paradigm is used with a Recurrent Neural Network (RNN) decoder to generate sentences in the target language. Many techniques for video captioning are inspired by neural machine translation ones, including soft attention mechanisms to focus on different parts of the input when generating the target sentence word by word (Bahdanau et al., 2015).

Image Captioning. Image captioning can be regarded as a greatly simplified case of video captioning, with videos consisting of just a single frame. Recurrent architectures are often used here as well (Karpathy et al., 2014; Kiros et al., 2014; Chen and Zitnick, 2015; Mao et al., 2015; Vinyals et al., 2015). Spatial attention mechanisms allow for focusing on different areas of an image (Xu et al., 2015b). Recently, image captioning approaches incorporating semantic concepts have achieved inspiring results. A semantic attention approach has been proposed (You et al., 2016) to selectively attend to semantic concept proposals and fuse them into the hidden states and outputs of RNNs, but their model is difficult to extend to multiple channels. Overall, none of these methods for image captioning need to account for temporal and motion aspects.

Video Captioning. For video captioning, many works utilize a recurrent neural architecture to generate video descriptions, conditioned on either an average-pooling (Venugopalan et al., 2015b) or recurrent encoding (Xu et al., 2015a; Donahue et al., 2015; Venugopalan et al., 2015a; Venugopalan et al., 2016) of frame-level features, or on a dynamic linear combination of context vectors obtained via temporal attention (Yao et al., 2015). Recently, hierarchical recurrent neural encoders (HRNE) with an attention mechanism have been proposed to encode videos (Pan et al., 2015). A recent paper (Yu et al., 2015) additionally exploits several kinds of visual attention and relies on a multimodal layer to combine them. In our work, we present a novel attention model with more effective multimodal layers that jointly models multiple heterogeneous signals, including semantic attributes, and experimentally show the benefits of this approach over previous work.

3 The Proposed Approach

In this section, we describe our approach for combining multiple forms of attention for video captioning. Figure 2 illustrates the architecture of our model. The core of our model is a sentence generator based on Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997). Instead of a traditional sentence generator, which directly receives a previous word and selects the next word, our model relies on several attention layers to selectively focus on important parts of temporal, motion, and semantic features. The output words are generated via a softmax reading from a multimodal layer (Mao et al., 2015), which integrates information from the different attention layers. An additional multimodal layer integrates information before the input reaches the sentence generator, enabling better hidden representations in the LSTM. We first briefly review the basic LSTM and then describe our model in detail, including our novel multi-faceted attention mechanism that considers temporal, motion, and semantic attribute perspectives.

3.1 Long Short Term Memory Networks

A Recurrent Neural Network (RNN) (Elman, 1990) is a neural network that adds extra feedback connections to a feed-forward network, so as to be able to work with sequences. The network is updated not only based on the input but also based on the previous hidden state. RNNs can compute the hidden states (h_1, h_2, ..., h_m) given an input sequence (x_1, x_2, ..., x_m) based on a recurrence of the following form:

h_t = φ(W_h x_t + U_h h_{t-1} + b_h),    (1)

where the weight matrices W, U and the bias b are parameters to be learned and φ(·) is an element-wise activation function.

RNNs trained via unfolding have proven inferior at capturing long-term temporal information. LSTM units were introduced to avoid these challenges. LSTMs not only compute the hidden states but also maintain a cell state to account for relevant signals that have been observed. They have the ability to remove or add information to the cell state, modulated by gates.

Given an input sequence (x_1, x_2, ..., x_m), an LSTM unit computes the hidden states (h_1, h_2, ..., h_m) and cell states (c_1, c_2, ..., c_m) via repeated application of the following equations:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (2)

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (3)

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (4)

g_t = φ(W_g x_t + U_g h_{t-1} + b_g)    (5)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t    (6)

h_t = o_t ⊙ c_t,    (7)

where σ(·) is the sigmoid function and ⊙ denotes the element-wise multiplication of two vectors. For convenience, we denote the computation of the LSTM at each time step t as h_t, c_t = LSTM(x_t, h_{t-1}, c_{t-1}).
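To make the recurrence concrete, here is a minimal NumPy sketch of one LSTM step following Eqs. (2)-(7). The parameter layout (dictionaries keyed by gate name) and the toy initializer are our own illustrative choices, not the paper's implementation, and Eq. (7) is kept exactly as written above, i.e. without the tanh(c_t) used in some LSTM variants.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_lstm_params(d_in, d_h, scale=0.01):
    """Toy random initialization of the W, U, b parameters for the four gates."""
    gates = ["i", "f", "o", "g"]
    return {"W": {g: np.random.randn(d_h, d_in) * scale for g in gates},
            "U": {g: np.random.randn(d_h, d_h) * scale for g in gates},
            "b": {g: np.zeros(d_h) for g in gates}}

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step, Eqs. (2)-(7)."""
    W, U, b = params["W"], params["U"], params["b"]
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate, Eq. (2)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate, Eq. (3)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate, Eq. (4)
    g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate values, Eq. (5)
    c_t = f_t * c_prev + i_t * g_t                           # cell state update, Eq. (6)
    h_t = o_t * c_t                                          # hidden state, Eq. (7)
    return h_t, c_t

# Usage: params = make_lstm_params(d_in=512, d_h=512); h, c = lstm_step(x, h, c, params)
```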

3.2 Input Representations

When training a video captioning model, as a first step, we need to extract feature vectors that serve as inputs to the LSTM. For visual features, we can extract one feature vector per frame, leading to a series of what we call temporal features. We can also extract another form of feature vector from several consecutive frames, which we call motion features. Additionally, we could also extract other forms of visual features, such as features from an area of a frame, the same area across consecutive frames, etc. In this paper, we only consider temporal features, denoted by {v_i}, and motion features, denoted by {f_i}, which are commonly used in video captioning.

For semantic features, we need to extract a set of related attributes denoted by {a_i}. These can be based on the title, tags, etc., if available. Alternatively, we can also rely on techniques to extract or predict attributes that are not directly given. In particular, because we have captions for the videos in the training set, we can train different models to predict caption-related semantic features for videos in the validation and test sets. As the choice of semantic features is not the core contribution, we describe our specific experimental setups in Section 4.

After determining a set of attributes for each video, each attribute a_i occurring in any video of the entire dataset corresponds to an entry in the vocabulary, as does each word w_i in any caption of the training set.

An embedding matrix E is used to represent both words and semantic attributes.


Figure 2: Model architecture. Temporal, motion, and semantic features are weighted via an attention mechanism and aggregated, together with an input word, by a multimodal layer. Its output updates the LSTM's hidden state. A similar attention mechanism then determines the next output word via a softmax layer.

We denote by E[w] the embedding vector of a given w. Thus, we obtain attribute embedding vectors {s_i} and input word embedding vectors as:

s_i = E[a_i]    (8)

x_t = E[w_t]    (9)
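As a small illustration of Eqs. (8)-(9), the following sketch uses one shared embedding matrix for both caption words and semantic attributes; the toy vocabulary and random initialization are placeholders (the experiments use pretrained 300-dimensional GloVe vectors).

```python
import numpy as np

vocab = {"<BOS>": 0, "<EOS>": 1, "monkey": 2, "pulling": 3, "a": 4, "is": 5}  # toy vocabulary
E = np.random.randn(len(vocab), 300) * 0.01  # shared embedding matrix (GloVe-initialized in the paper)

def embed(token):
    """E[w]: look up the shared embedding of a caption word or a semantic attribute."""
    return E[vocab[token]]

s_i = embed("monkey")   # attribute embedding, Eq. (8)
x_t = embed("a")        # input-word embedding, Eq. (9)
```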

3.3 Multi-Faceted Attention

We do not directly feed x_t to the LSTM. Instead, we first apply our multi-faceted attention model to x_t.

Assuming that we have a series of multimodal feature vectors for a given video, we generate a caption word by word. At each step, we need to select relevant information from these feature vectors, which we from now on refer to as context vectors {c_1, c_2, ..., c_n}. Due to the variability of the length of videos, it is challenging to directly input all these vectors to the model at every time step. A simple strategy is to compute the average of the context vectors and input this average vector at each time step of the model:

y_t = (1/n) Σ_{i=1}^{n} c_i    (10)

However, this strategy collapses all available information into a single vector, neglecting the inherent structure, which captures the temporal progression, among other things. Thus, this sort of folding leads to a significant loss of information. Instead, we wish to focus on the most salient parts of the features at every time step. Rather than naively averaging the context vectors {c_1, c_2, ..., c_n}, a soft attention model calculates a weight α^t_i for each c_i, conditioned on the input vector x_t at each time step t. For this, we first compute basic attention scores e^t_i and then feed these through a sequential softmax layer to obtain a set of attention weights {α^t_1, α^t_2, ..., α^t_n} that quantify the relevance of {c_1, c_2, ..., c_n} for x_t:

e^t_i = x_t^T U c_i    (11)

α^t_i = exp(e^t_i) / Σ_{j=1}^{n} exp(e^t_j)    (12)

y_t = Σ_{i=1}^{n} α^t_i c_i    (13)

We obtain the corresponding output vectors y_t as weighted averages.

This soft attention model, strictly speaking, converts an entire input sequence (x_1, x_2, ..., x_m) into an entire output sequence (y_1, y_2, ..., y_m) based on all context vectors {c_1, c_2, ..., c_n}. For convenience, we denote the attention model output at a given time step t as y_t = Attention(x_t, {c_i}).
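A minimal NumPy sketch of the soft attention of Eqs. (11)-(13); the bilinear matrix U corresponds to the one in Eq. (11), while the max-subtraction for numerical stability is our only added implementation detail.

```python
import numpy as np

def attention(query, contexts, U):
    """Eqs. (11)-(13): score each context vector against the query, softmax-normalize,
    and return the attention-weighted average together with the weights."""
    e = np.array([query @ U @ c_i for c_i in contexts])   # e^t_i = x_t^T U c_i
    alpha = np.exp(e - e.max())                           # softmax (max subtracted for stability)
    alpha /= alpha.sum()
    y = sum(a * c_i for a, c_i in zip(alpha, contexts))   # y_t = sum_i alpha^t_i c_i
    return y, alpha
```

Here `query` plays the role of x_t (or, later, of the hidden state h_t) and `contexts` is one of the feature sets {s_i}, {v_i}, {f_i}.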

In particular, the attention model is applied to the temporal features {v_i}, motion features {f_i} and semantic features {s_i}:

s^x_t = Attention(x_t, {s_i})    (14)

v^x_t = Attention(x_t, {v_i})    (15)

f^x_t = Attention(x_t, {f_i})    (16)

We then obtain the input to the LSTM, m^x_t, via a multimodal layer:

m^x_t = φ(W^x [x_t, s^x_t, w^x_v ⊙ v^x_t, w^x_f ⊙ f^x_t] + b^{m,x})    (17)

Here, w^x_v and w^x_f facilitate capturing the relative importance of each dimension of the temporal and motion feature space (You et al., 2016). We apply dropout (Srivastava et al., 2014) to this multimodal layer to reduce overfitting.
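A sketch of the input-side fusion of Eq. (17), reusing the attention function from the previous sketch; phi defaults to the identity since the experiments use linear activations for the multimodal layers, and dropout is omitted here.

```python
import numpy as np

def input_multimodal(x_t, S, V, F, U_s, U_v, U_f, W_x, b_x, w_v, w_f, phi=lambda z: z):
    """Eq. (17): attend to semantic, temporal, and motion features with x_t as the query,
    rescale the two visual parts element-wise with w_v and w_f, concatenate them with the
    word embedding, and project the result to the LSTM input m^x_t."""
    s_xt, _ = attention(x_t, S, U_s)   # Eq. (14)
    v_xt, _ = attention(x_t, V, U_v)   # Eq. (15)
    f_xt, _ = attention(x_t, F, U_f)   # Eq. (16)
    z = np.concatenate([x_t, s_xt, w_v * v_xt, w_f * f_xt])
    return phi(W_x @ z + b_x)
```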

Subsequently, we can obtain h_t via the LSTM. At the first time step, the mean values of the features are used to initialize the LSTM states, yielding a general overview of the video:

m^x_0 = W^i [Mean({s_i}), Mean({v_i}), Mean({f_i})]    (18)

h_0, c_0 = LSTM(m^x_0, 0, 0)    (19)

h_t, c_t = LSTM(m^x_t, h_{t-1}, c_{t-1})    (20)

where Mean(·) denotes mean pooling of the given feature set.
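Continuing the sketches above (lstm_step and make_lstm_params come from the Section 3.1 snippet), the initialization of Eqs. (18)-(20) with mean-pooled features might look as follows; the feature counts and dimensionalities are illustrative toy values.

```python
import numpy as np

d_h, d_s, d_v, d_f = 512, 300, 2048, 4096                 # hidden size and assumed feature sizes
S = np.random.randn(2, d_s)                               # toy attribute embeddings {s_i}
V = np.random.randn(30, d_v)                              # toy temporal features {v_i}
F = np.random.randn(18, d_f)                              # toy motion features {f_i}
W_i = np.random.randn(d_h, d_s + d_v + d_f) * 0.01
params = make_lstm_params(d_in=d_h, d_h=d_h)

# Eq. (18): mean-pooled overview of the video; Eq. (19): first LSTM step from zero states.
m_x0 = W_i @ np.concatenate([S.mean(axis=0), V.mean(axis=0), F.mean(axis=0)])
h, c = lstm_step(m_x0, np.zeros(d_h), np.zeros(d_h), params)
# Eq. (20): at each later step t, h, c = lstm_step(m_xt, h, c, params) with m_xt from Eq. (17).
```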

We also apply the attention model to the hidden states h_t, and use a multimodal layer to concatenate the outputs of the attention model and map them into a feature space that has exactly the same dimensionality as the word embeddings. This multimodal layer is followed by a softmax layer with a dimensionality equal to the size of the vocabulary. The projection matrix from the multimodal layer to the softmax layer is set to be the transpose of the word embedding matrix:

s^h_t = Attention(h_t, {s_i})    (21)

v^h_t = Attention(h_t, {v_i})    (22)

f^h_t = Attention(h_t, {f_i})    (23)

m^h_t = φ(W^h [h_t, s^h_t, w^h_v ⊙ v^h_t, w^h_f ⊙ f^h_t] + b^{m,h})    (24)

p_t = Softmax(E^T m^h_t)    (25)

where Softmax(·) denotes a sequential softmax.
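The output side, Eqs. (21)-(25), can be sketched in the same style. Here attention is the function from the Section 3.3 sketch, E is the word/attribute embedding matrix with one row per vocabulary entry (so E @ m corresponds to E^T m in the paper's notation), and the bilinear matrices and fusion weights are assumed to be given.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def next_word_distribution(h_t, S, V, F, U_s, U_v, U_f, W_h, b_h, w_v, w_f, E):
    """Eqs. (21)-(25): attend with the hidden state h_t as the query, fuse the results in the
    output multimodal layer (projected to the word-embedding dimensionality), and read out
    the next-word distribution through a softmax tied to the embedding matrix."""
    s_ht, _ = attention(h_t, S, U_s)   # Eq. (21)
    v_ht, _ = attention(h_t, V, U_v)   # Eq. (22)
    f_ht, _ = attention(h_t, F, U_f)   # Eq. (23)
    z = np.concatenate([h_t, s_ht, w_v * v_ht, w_f * f_ht])
    m_ht = W_h @ z + b_h               # Eq. (24), linear phi as in the experiments
    return softmax(E @ m_ht)           # Eq. (25)
```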

By using two multimodal layers, we combine six attention layers with the core LSTM. This model is highly extensible, since we can easily add extra branches for additional features.

3.4 Training and Generation

We can interpret the output of the softmax layer p_t as a probability distribution over words:

P(w_{t+1} | w_{1:t}, V, S, Θ)    (26)

where V denotes the corresponding video, S denotes the semantic attributes and Θ denotes the model parameters. The overall loss function is defined as the negative logarithm of the likelihood, and our goal is to learn all parameters Θ of our model by minimizing the loss function over the entire training set:

min_Θ  − (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T_i} log P(w^i_{t+1} | w^i_{1:t}, V_i, S_i, Θ)    (27)

where N is the total number of captions in the training set, and T_i is the number of words in caption i. During the training phase, we add a begin-of-sentence tag 〈BOS〉 to the start of each sentence and an end-of-sentence tag 〈EOS〉 to its end. We use Stochastic Gradient Descent to find the optimum, with the gradient computed via Backpropagation Through Time (BPTT) (Werbos, 1990). Training continues until the METEOR score on the validation set stops increasing, and we optimize the hyperparameters using random search to maximize METEOR on the validation set, following previous studies that found METEOR to be more consistent with human judgments than BLEU or ROUGE (Vedantam et al., 2015).
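As a sketch, the per-caption term of Eq. (27) is just the summed negative log-probability of the reference words under the model's per-step distributions p_t; the epsilon guard against log(0) is our addition.

```python
import numpy as np

def caption_nll(step_distributions, target_word_ids, eps=1e-12):
    """Negative log-likelihood of one caption: sum of -log p_t[w_{t+1}] over its words."""
    return -sum(np.log(p_t[w] + eps) for p_t, w in zip(step_distributions, target_word_ids))

# Averaging caption_nll over all N training captions gives the objective of Eq. (27),
# which is minimized with BPTT (RMSProp in the experiments).
```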

After the parameters are learned, during the testing phase, we again have temporal and motion features extracted from the video, as well as semantic attributes, which are either already given or predicted using a model trained on the training set. Given a previous word, we can calculate the probability distribution of the next word, p_t, using the model described above. Thus, we can generate captions starting from the special symbol 〈BOS〉 using beam search (Yu et al., 2015).
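A generic beam-search sketch for this generation step; step_fn(prefix), the maximum length, and the tie-breaking details are assumptions, while the beam size of 5 matches the experimental settings.

```python
import numpy as np

def beam_search(step_fn, bos_id, eos_id, beam_size=5, max_len=20):
    """Keep the beam_size highest-scoring partial captions, expand each with its most
    probable next words, and return the best hypothesis ending in <EOS> (or the best
    partial one if none finishes within max_len). step_fn(prefix) must return p_t."""
    beams = [([bos_id], 0.0)]   # (word-id prefix, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            p_t = step_fn(prefix)
            for w in np.argsort(p_t)[-beam_size:]:        # expand only the top words
                candidates.append((prefix + [int(w)], score + float(np.log(p_t[w] + 1e-12))))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```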


Model                                BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr
LSTM-YT (Venugopalan et al., 2015b)  -       -       -       0.333   0.291   -
S2VT (Venugopalan et al., 2015a)     -       -       -       -       0.298   -
TA (Yao et al., 2015)                0.800   0.647   0.526   0.419   0.296   0.517
TA*                                  0.811   0.655   0.541   0.422   0.304   0.524
LSTM-E (Pan et al., 2016)            0.788   0.660   0.554   0.453   0.310   -
HRNE-A (Pan et al., 2015)            0.792   0.663   0.551   0.438   0.331   -
h-RNN (Yu et al., 2015)              0.815   0.704   0.604   0.499   0.326   0.658
h-RNN*                               0.824   0.711   0.610   0.504   0.329   0.675
T (Ours)                             0.813   0.698   0.605   0.504   0.322   0.698
M (Ours)                             0.807   0.682   0.583   0.473   0.308   0.656
TM (Ours)                            0.826   0.717   0.619   0.508   0.332   0.694

Table 1: Results of models using visual features only, on MSVD, where (-) indicates unknown scores.

Model      BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr  ROUGE-L
TM         0.826   0.717   0.619   0.508   0.332   0.694  0.702
TM-P-NN    0.809   0.696   0.604   0.508   0.324   0.693  0.682
TM-P-SVM   0.814   0.719   0.620   0.512   0.330   0.679  0.703
TM-P-HRNE  0.829   0.720   0.627   0.528   0.334   0.689  0.705
TM-HQ-S    0.863   0.807   0.740   0.633   0.367   0.899  0.747
TM-HQ-V    0.877   0.788   0.710   0.639   0.371   1.075  0.749
TM-HQ-SV   0.918   0.872   0.825   0.764   0.429   1.393  0.814
Human      0.891   0.776   0.681   0.583   0.436   1.322  0.761

Table 2: Results of models combining visual and semantic attention on MSVD.

4 Experimental Results

4.1 Datasets

MSVD: We evaluate our video captioning models on the Microsoft Research Video Description Corpus (Chen and Dolan, 2011). MSVD consists of 1,970 video clips, typically depicting a single activity, downloaded from YouTube. Each video clip is annotated with multiple human-generated descriptions in several languages. We only use the English descriptions, about 41 descriptions per video. In total, the dataset consists of 80,839 video/description pairs. Each description on average contains about 8 words. We use 1,200 videos for training, 100 videos for validation and 670 videos for testing, as provided by previous work (Guadarrama et al., 2013).

MSR-VTT: We also evaluate on the MSR Video-to-Text (MSR-VTT) dataset (Xu et al., 2016), a new large-scale video benchmark for video captioning. MSR-VTT provides 10,000 web video clips. Each video is annotated with about 20 natural sentences, so we have 200,000 video-caption pairs in total. Our video captioning models are trained and their hyperparameters are selected using the official training and validation sets, which consist of 6,513 and 497 video clips respectively. The models are evaluated on the test set of 2,990 video clips.

4.2 Preprocessing

Visual Features: We extract two kinds of visual features, temporal features and motion features. We use a pretrained ResNet-152 model (He et al., 2015) to extract temporal features, obtaining one fixed-length feature vector for every frame. We use a pretrained C3D model (Du et al., 2015) to extract motion features. The C3D net reads in a video and emits a fixed-length feature vector every 16 frames.

Semantic Attributes: While MSVD and MSR-VTT are standard video caption datasets, they do not come with tags, titles, or other semantic information about the videos. Nevertheless, we can reproduce a setting with semantic attributes by extracting attributes from the captions. First, we invoke the Stanford Parser (Klein and Manning, 2003) to parse the captions and choose the nsubj edges to find the subject-verb pairs for each caption. We then select the most frequent subject and verb across the captions of each video as the high-quality semantic attributes.
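A sketch of this subject-verb extraction; we use spaCy's dependency parser here purely as a stand-in for the Stanford Parser used in the paper, and the lemmatization and the example output are illustrative.

```python
from collections import Counter
import spacy  # stand-in for the Stanford Parser used in the paper

nlp = spacy.load("en_core_web_sm")

def high_quality_attributes(captions):
    """Collect (subject, verb) pairs from nsubj edges over all captions of one video and
    keep the most frequent subject and verb as its high-quality semantic attributes."""
    subjects, verbs = Counter(), Counter()
    for caption in captions:
        for tok in nlp(caption):
            if tok.dep_ == "nsubj":          # nsubj edge: tok is the subject, tok.head the verb
                subjects[tok.lemma_] += 1
                verbs[tok.head.lemma_] += 1
    return subjects.most_common(1)[0][0], verbs.most_common(1)[0][0]

# e.g. high_quality_attributes(["a monkey is pulling a dog", "a monkey pulls the dog's tail"])
# should yield something like ("monkey", "pull").
```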


Model                        BLEU-4  METEOR  CIDEr  ROUGE-L
Rank 1, Team: v2t navigator  0.408   0.282   0.448  0.609
Rank 2, Team: Aalto          0.398   0.269   0.457  0.598
Rank 3, Team: VideoLAB       0.391   0.277   0.441  0.606
T                            0.367   0.257   0.400  0.581
M                            0.361   0.253   0.392  0.577
TM                           0.386   0.265   0.439  0.596
TM-P-NN                      0.370   0.254   0.401  0.579
TM-P-SVM                     0.376   0.259   0.412  0.585
TM-P-HRNE                    0.392   0.269   0.446  0.601
TM-HQ-S                      0.421   0.286   0.478  0.610
TM-HQ-V                      0.429   0.289   0.481  0.613
TM-HQ-SV                     0.451   0.292   0.503  0.625
Human                        0.343   0.295   0.501  0.560

Table 3: Results on MSR-VTT.

These attributes can be used to evaluate our models under a high-quality attribute condition. Next, we can use the high-quality semantic attributes of the training set to train a model to predict semantic attributes for the test set. Such predicted attributes are used to evaluate our model under low-quality semantic attribute conditions.

For our experiments, we consider three models to predict semantic attributes. The first one (NN) performs a nearest-neighbor search over the frames of the training set to retrieve similar frames for every frame of each test video, based on ResNet-152 features, and selects the most frequent attributes. The second one (SVM) trains SVMs for the top 100 frequent attributes in the training set and predicts semantic attributes for test videos based on a mean pooling of ResNet-152 features. The third one (HRNE) trains two hierarchical recurrent neural encoders (Pan et al., 2015) to predict the subject and the verb separately, based on the temporal ResNet-152 features.

4.3 Evaluation Metrics

We rely on four standard metrics, BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015) and ROUGE-L (Lin, 2004), to evaluate our methods. These are commonly used in image and video captioning tasks and allow us to compare our results against previous work. We use the Microsoft COCO evaluation server (Chen et al., 2015), which is widely used in previous work, to compute the metric scores. Across all metrics, higher scores indicate that the generated captions are assessed as being closer to captions created by humans.

4.4 Experimental Settings

The numbers of hidden units in the input multimodal layer and in the LSTM are both 512. The activation function of the LSTM is tanh and the activation functions of both multimodal layers are linear. The dropout rates of both the input and output multimodal layers are set to 0.5. We use pretrained 300-dimensional GloVe (Pennington et al., 2014) vectors as our word embedding matrix. We rely on the RMSProp algorithm (Tieleman and Hinton, 2012) to update parameters for better convergence, with a learning rate of 10^-4. The beam size during sentence generation is set to 5. Our system is implemented using the Theano (Bastien et al., 2012; Bergstra et al., 2010) framework.
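For reference, the reported settings gathered in one place; the dictionary itself is just a convenience for readers, not part of the released system.

```python
CONFIG = {
    "lstm_hidden_units": 512,
    "input_multimodal_units": 512,
    "multimodal_activation": "linear",
    "lstm_activation": "tanh",
    "dropout_multimodal": 0.5,
    "word_embeddings": "GloVe, 300-d, pretrained",
    "optimizer": "RMSProp",
    "learning_rate": 1e-4,
    "beam_size": 5,
    "framework": "Theano",
}
```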

4.5 Results

Visual only: First, for comparison, we show the results of using only visual attention. Specifically, we use only the temporal features and motion features (TM), removing the semantic branch while keeping the other components of our model unchanged. To evaluate the effectiveness of different sorts of visual cues, we also report the results of using only temporal features (T) and of using only motion features (M). We compare our methods with six state-of-the-art methods: LSTM-YT (Venugopalan et al., 2015b), S2VT (Venugopalan et al., 2015a), TA (Yao et al., 2015), LSTM-E (Pan et al., 2016), HRNE-A (Pan et al., 2015), and h-RNN (Yu et al., 2015).


Figure 3: Results of adding noise to the high-quality semantic attributes of MSVD, plotting METEOR, CIDEr, BLEU-4, and ROUGE-L against the percentage of noisy attributes. The solid blue curves are the results with added noise; the dashed red lines are the corresponding results of the TM model.

Table 1 provides a comparison of these systems on the MSVD dataset. Since some of the previous work uses different features, we also run experiments for some of the methods whose source code is provided by the authors, or we re-implement the models described in their papers, and then evaluate them using our features. The corresponding extra results are marked by '*'.

We observe that even with temporal features alone, we obtain fairly good results, which implies that the attention model in our approach is useful. Combining temporal and motion features, we see that our method outperforms previous work, confirming that our attention model with multimodal layers can extract useful information from temporal and motion features effectively. In fact, the TA and LSTM-E studies also employ both temporal and motion features, but do not have a separate motion attention mechanism, and the h-RNN study only considers attention after the sentence generator. Instead, our attention mechanism operates both before and after the sentence generator, enabling it to attend to different aspects during the analysis and synthesis processes for a single sentence. The results on the MSR-VTT dataset are shown in Table 3. They are consistent in that they also show that the combined attention over temporal and motion features obtains better results.

Multi-Faceted Attention: To show the influence of our multi-faceted attention with additional semantic cues, we first consider the low-quality semantic attributes. Tables 2 and 3 provide results using low-quality attributes obtained via the NN (TM-P-NN), SVM (TM-P-SVM), and HRNE (TM-P-HRNE) methods described above. We find that the results for NN and SVM are sometimes slightly worse than only using visual attention, which means that excessively low-quality attributes do not help in improving the caption quality. It appears that these methods are rather unreliable and introduce significant noise. HRNE fares slightly better than using only visual attention, as it combines top-down and bottom-up approaches to obtain more stable and reliable results.

Then, we consider the high-quality semantic attributes, subject and verb (TM-HQ-SV), derived from the ground truth captions.


Figure 4: Examples of generated captions on MSVD. GT1 and GT2 are ground truth captions.

(1) TM (Ours): a man is playing with a car. TM-HQ-SV (Ours): a girl is jumping. GT1: a girl is jumping to a car top. GT2: a little girl jumped on top of a car.

(2) TM (Ours): a person is cooking. TM-HQ-SV (Ours): a woman is mixing rice. GT1: a woman is mixing flour and water in a bowl. GT2: a woman mixes rice and water in a small pot.

(3) TM (Ours): a monkey is walking. TM-HQ-SV (Ours): a monkey is smoking. GT1: a monkey is smoking a cigarette. GT2: a gorilla is smoking.

(4) TM (Ours): a man is exercising. TM-HQ-SV (Ours): a man is exercising. GT1: a man is lying down on a blue mat exercising. GT2: a man is doing exercise.

We also report the performance of using only the subject (TM-HQ-S) or only the verb (TM-HQ-V). These results, too, are included in Tables 2 and 3. We find that our method is able to exploit high-quality subject and verb attributes to outperform the other methods by very large margins. Even a single semantic attribute yields very strong results. Here, verb information proves slightly more informative than the subject, indicating that identifying actions in videos remains more challenging than identifying important objects in a video.

Overall, we observe that our method, with just two high-quality features, approaches human-level scores in terms of the METEOR and CIDEr metrics. To estimate human-level performance, we randomly selected one caption for each video in the test set and evaluated it after removing it from the ground truth. Although not perfect, these results (Human) can be viewed as an estimate of human-level performance. The BLEU scores of our method are in fact even greater than the human-level ones, since humans often prefer generating longer captions, which tend to obtain lower BLEU scores. Several studies, including on caption generation, have concluded that BLEU is not a sufficiently reliable metric in terms of replicating human judgment scores (Kulkarni et al., 2013; Vedantam et al., 2015). Figure 4 shows several example captions generated by our approach for MSVD videos.

To further investigate the influence of noise, we randomly select genuine high-quality subject and verb attributes and replace them with random incorrect ones. Figure 3 provides the results on MSVD. These results show that even when adding 50% noise, the results are better than just using regular visual attention. With extremely strong noise levels, the results are worse than only using visual attention, but are still maintained at a certain level. This suggests that we are likely to benefit from further semantic attributes such as tags, titles, comments, and so on, which are often available for online videos, even if they are noisy.

5 Conclusion

We have proposed a novel method for video captioning based on an extensible multi-faceted attention mechanism, outperforming previous work by large margins.

Even without semantic attributes, our method outperforms state-of-the-art approaches using visual features. With just two high-quality semantic attributes, the results become competitive with human results across a range of metrics. This opens up important new avenues for future work on exploring the large space of potential additional forms of semantic cues and attributes.


6 Acknowledgments

This work was supported in part by the National Basic Research Program of China Grants 2011CBA00300, 2011CBA00301 and the National Natural Science Foundation of China Grants 61033001, 61361136003.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, pages 65–72.

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. 2012. Theano: new features and speed improvements. In NIPS Workshop.

J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. 2010. Theano: A CPU and GPU math expression compiler.

David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In ACL.

Xinlei Chen and C. Lawrence Zitnick. 2015. Learning a recurrent visual representation for image caption generation. In CVPR.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder-decoder approaches. In SSST-8.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In CVPR.

Tran Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. C3D: Generic features for video analysis. In ICCV.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science: A Multidisciplinary Journal, pages 179–211.

S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, and S. Venugopalan. 2013. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, pages 2712–2719.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, pages 1735–1780.

N. Kalchbrenner and P. Blunsom. 2013. Recurrent continuous translation models. In EMNLP, pages 1700–1709.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and Fei-Fei Li. 2014. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. In NIPS Deep Learning Workshop.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL, pages 423–430.

G. Kulkarni, V. Premraj, V. Ordonez, and S. Dhar. 2013. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In ACL.

Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In EMNLP.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL Workshop.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR.

Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. 2015. Hierarchical recurrent neural encoder for video representation with application to captioning. arXiv preprint arXiv:1511.03476.

Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In CVPR.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.


Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, volume 4, pages 3104–3112.

T. Tieleman and G. Hinton. 2012. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In CVPR, pages 4566–4575.

Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, and Raymond Mooney. 2015a. Sequence to sequence – video to text. In ICCV, pages 4534–4542.

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In NAACL, pages 1494–1504.

Subhashini Venugopalan, Lisa Anne Hendricks, Raymond Mooney, and Kate Saenko. 2016. Improving LSTM-based video description with linguistic knowledge mined from text. In EMNLP.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164.

P. J. Werbos. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560.

Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, and Kate Saenko. 2015a. A multi-scale multiple instance video description network. CoRR, abs/1505.05914.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015b. Show, attend and tell: Neural image caption generation with visual attention. In CVPR, pages 2048–2057.

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In CVPR.

Li Yao, Atousa Torabi, Kyunghyun Cho, and Nicolas Ballas. 2015. Describing videos by exploiting temporal structure. In ICCV, pages 4507–4515.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. arXiv preprint arXiv:1603.03925.

Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2015. Video paragraph captioning using hierarchical recurrent neural networks. arXiv preprint arXiv:1510.07712v2.