
Multimodal Neural Language Models

Ryan Kiros [email protected]
Ruslan Salakhutdinov [email protected]
Richard Zemel [email protected]

Department of Computer Science, University of Toronto
Canadian Institute for Advanced Research

Abstract

We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees. While we focus on image-text modelling, our algorithms can be easily applied to other modalities such as audio.

1. Introduction

Descriptive language is almost never isolated from other modalities. Advertisements come with images of the product that is being sold, social media profiles contain both descriptions and images of the user, while multimedia websites that play audio and video have associated descriptions and opinions of the content. Consider the task of creating an advertisement to sell an item. An algorithm that can model both text descriptions and pictures of the item would allow a user to (a): search for pictures given a text description of the desired content; (b): find similar item descriptions given uploaded images; and (c): automatically generate text to describe the item given pictures. What these tasks have in common is the need to go beyond simple bag-of-word representations of text alone to model complex descriptions with an associated modality.


In this paper we introduce multimodal neural language models, models of natural language that can be conditioned on other modalities. A multimodal neural language model represents a first step towards tackling the previously described modelling challenges. Unlike most previous approaches to generating image descriptions, our model makes no use of templates, structured models, or syntactic trees. Instead, it relies on word representations learned from millions of words and on conditioning the model on high-level image features learned from deep neural networks. We introduce two methods based on the log-bilinear model of Mnih & Hinton (2007): the modality-biased log-bilinear model and the factored 3-way log-bilinear model. We then show how to learn word representations and image features together by jointly training our language models with a convolutional network. Experimentation is performed on three datasets with image-text descriptions: IAPR TC-12, Attributes Discovery, and the SBU dataset. We further illustrate the capabilities of our models through quantitative retrieval evaluation and visualizations of our results.

2. Related Work

Our related work can largely be separated into three groups: neural language models, image content description and multimodal representation learning.

Neural Language Models: A neural language model improves on n-gram language models by reducing the curse of dimensionality through the use of distributed word representations. Each word in the vocabulary is represented as a real-valued feature vector such that the cosine of the angles between these vectors is high for semantically similar words. Several models have been proposed based on feed-forward networks (Bengio et al., 2003), log-bilinear models (Mnih & Hinton, 2007), skip-gram models (Mikolov et al., 2013) and recurrent neural networks (Mikolov et al., 2010; 2011). Training can be sped up through the use of hierarchical softmax (Morin & Bengio, 2005) or noise contrastive estimation (Mnih & Teh, 2012).


Figure 1. Left two columns: sample description retrieval given images. Right two columns: description generation. Each description was initialized to 'in this picture there is' or 'this product contains a', with 50 subsequent words generated.

Image Description Generation: A growing body of research has explored how to generate realistic text descriptions given an image. Farhadi et al. (2010) consider learning an intermediate meaning space to project image and sentence features allowing them to retrieve text from images and vice versa. Kulkarni et al. (2011) construct a CRF using unary potentials from objects, attributes and prepositions and high-order potentials from text corpora, using an n-gram model for decoding and templates for constraints. To allow for more descriptive and poetic generation, Mitchell et al. (2012) propose the use of syntactic trees constructed from 700,000 Flickr images and text descriptions. For large scale description generation, Ordonez et al. (2011) showed that non-parametric approaches are effective on a dataset of one million image-text captions. More recently, Socher et al. (2014) show how to map sentence representations from recursive networks into the same space as images. We note that unlike most existing work, our generated text comes directly from language model samples without any additional templates, structure, or constraints.

Multimodal Representation Learning: Deep learning methods have been successfully used to learn representations from multiple modalities. Ngiam et al. (2011) proposed using deep autoencoders to learn features from audio and video, while Srivastava & Salakhutdinov (2012) introduced the multimodal deep Boltzmann machine as a joint model of images and text. Unlike Srivastava & Salakhutdinov (2012), our proposed models are conditional and go beyond bag-of-word features. More recently, Socher et al. (2013) and Frome et al. (2013) propose methods for mapping images into a text representation space learned from a language model that incorporates global context (Huang et al., 2012) or a skip-gram model (Mikolov et al., 2013), respectively. This allowed Socher et al. (2013); Frome et al. (2013) to perform zero-shot learning, generalizing to classes the model has never seen before. Similar to our work, the authors combine convolutional networks with a language model, but our work instead focuses on text generation and retrieval as opposed to object classification.

The remainder of the paper is structured as follows. We first review the log-bilinear model of Mnih & Hinton (2007) as it forms the foundation for our work. We then introduce our two proposed models as well as how to perform joint image-text feature learning. Finally, we describe our experiments and results.

3. The Log-Bilinear Model (LBL)

The log-bilinear language model (LBL) (Mnih & Hinton, 2007) is a deterministic model that may be viewed as a feed-forward neural network with a single linear hidden layer. As a neural language model, the LBL operates on word representation vectors. Each word w in the vocabulary is represented as a D-dimensional real-valued vector r_w ∈ R^D. Let R denote the K × D matrix of word representation vectors, where K is the vocabulary size. Let (w_1, ..., w_{n-1}) be a tuple of n-1 words, where n-1 is the context size. The LBL model makes a linear prediction of the next word representation as

$$\hat{r} = \sum_{i=1}^{n-1} C^{(i)} r_{w_i}, \qquad (1)$$

where C^(i), i = 1, ..., n-1 are D × D context parameter matrices. Thus, r̂ is the predicted representation of r_{w_n}. The conditional probability P(w_n = i | w_{1:n-1}) of w_n given w_1, ..., w_{n-1} is

$$P(w_n = i \mid w_{1:n-1}) = \frac{\exp(\hat{r}^\top r_i + b_i)}{\sum_{j=1}^{K} \exp(\hat{r}^\top r_j + b_j)}, \qquad (2)$$

where b ∈ R^K is a bias vector with a word-specific bias b_i. Eq. 2 may be seen as scoring the predicted representation r̂ of w_n against the actual representation r_{w_n} through an inner product, followed by normalization based on the inner products amongst all other word representations in the vocabulary. In the context of a feed-forward neural network, the weights between the output layer and the linear hidden layer are the word representation matrix R, where the output layer uses a softmax activation. Learning can be done with standard backpropagation.
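To make the model concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2); the vocabulary size, embedding dimension and context size are illustrative assumptions, not the authors' settings.

```python
import numpy as np

# Minimal sketch of the LBL forward pass (Eqs. 1-2); sizes are illustrative.
K, D, context = 1000, 50, 5                      # vocabulary size, embedding dim, n-1
rng = np.random.default_rng(0)

R = 0.01 * rng.standard_normal((K, D))           # word representations r_w
C = 0.01 * rng.standard_normal((context, D, D))  # context matrices C^(i)
b = np.zeros(K)                                  # per-word biases

def lbl_next_word_distribution(context_ids):
    """Return P(w_n = . | w_1:n-1) for a tuple of n-1 word indices."""
    # Eq. (1): linear prediction of the next word representation.
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    # Eq. (2): score r_hat against every word representation, then normalize.
    scores = R @ r_hat + b
    scores -= scores.max()                       # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

probs = lbl_next_word_distribution([3, 17, 42, 7, 256])
print(probs.shape, probs.sum())                  # (1000,) ~1.0
```

Training would backpropagate the cross-entropy loss of the observed next word into R, the C^(i) matrices and b, as described above.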

4. Multimodal Log-Bilinear Models

Suppose that along with each training tuple of words (w_1, ..., w_n) there is an associated vector x ∈ R^M corresponding to the feature representation of the modality to be conditioned on, such as an image. Assume for now that these features are computed in advance. In Section 5 we show how to jointly learn both text and image features.

4.1. Modality-Biased Log-Bilinear Model (MLBL-B)

Our first proposed model is the modality-biased log-bilinear model (MLBL-B), which is a straightforward extension of the LBL model. The MLBL-B model adds an additive bias to the next predicted word representation r̂, which is computed as

$$\hat{r} = \left(\sum_{i=1}^{n-1} C^{(i)} r_{w_i}\right) + C^{(m)} x, \qquad (3)$$

where C^(m) is a D × M context matrix. Given the predicted next word representation r̂, computing the conditional probability P(w_n = i | w_{1:n-1}, x) of w_n given w_1, ..., w_{n-1} and x remains unchanged from the LBL model. The MLBL-B can be viewed as a feedforward network by taking the LBL network and adding a context channel based on the modality x, as shown in Fig. 2a. This model also bears resemblance to the quadratic model of Grangier et al. (2006). Learning in this model involves a straightforward application of backpropagation, as in the LBL model.
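Relative to the LBL sketch above, the MLBL-B changes only the prediction step by adding the C^(m) x bias of Eq. (3); a sketch with an assumed image-feature dimensionality:

```python
import numpy as np

# Sketch of the MLBL-B prediction step (Eq. 3); dimensions are illustrative.
K, D, M, context = 1000, 50, 256, 5
rng = np.random.default_rng(1)

R = 0.01 * rng.standard_normal((K, D))
C = 0.01 * rng.standard_normal((context, D, D))
C_m = 0.01 * rng.standard_normal((D, M))         # modality context matrix C^(m)
b = np.zeros(K)

def mlbl_b_next_word_distribution(context_ids, x):
    # Eq. (3): the LBL prediction plus an additive bias from the image features x.
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids)) + C_m @ x
    scores = R @ r_hat + b                       # scoring is unchanged from the LBL
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

x = rng.standard_normal(M)                       # precomputed image features
print(mlbl_b_next_word_distribution([3, 17, 42, 7, 256], x).sum())  # ~1.0
```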

4.2. The Factored 3-way Log-Bilinear Model (MLBL-F)

A more powerful way to incorporate modality conditioning is to gate the word representation matrix R by the features x. By doing this, R becomes a tensor for which each feature vector x can specify its own hidden-to-output weight matrix. More specifically, let R^(1), ..., R^(M) be K × D matrices specified by feature components 1, ..., M of x. The hidden-to-output weights corresponding to x are computed as

$$R_x = \sum_{i=1}^{M} x_i R^{(i)}, \qquad (4)$$

where R_x denotes the word representations with respect to x. The motivation for using a modality-specific word representation is as follows. Suppose x is an image containing a picture of a cat, with context words (there, is, a). A language model that is trained without knowledge of image features would score the predicted next word representation r̂ highly against words such as dog, building or car. If each image has a corresponding word representation matrix R_x, the representations for attributes that are not present in the image would be modified such that the inner product of r̂ with the representation of cat would score higher than the inner product of r̂ with the representations of dog, building or car.

As is, the tensor R requires K × D × M parameters, which makes using a general 3-way tensor impractical even for modest vocabulary sizes. A common solution to this problem (Memisevic & Hinton, 2007; Krizhevsky et al., 2010) is to factor R into three lower-rank matrices W^fr ∈ R^{F×D}, W^fx ∈ R^{F×M} and W^fh ∈ R^{F×K}, such that

$$R_x = (W^{fh})^\top \cdot \mathrm{diag}(W^{fx} x) \cdot W^{fr}, \qquad (5)$$

where diag(·) denotes the matrix with its argument on the diagonal. These matrices are parametrized by F, the number of factors, as shown in Fig. 2b.

Let E = (W^fr)^T W^fh denote the D × K matrix of word embeddings. Given the context w_1, ..., w_{n-1}, the predicted next word representation r̂ is given by:

$$\hat{r} = \left(\sum_{i=1}^{n-1} C^{(i)} E(:, w_i)\right) + C^{(m)} x, \qquad (6)$$


Figure 2. Our proposed models. (a) Modality-Biased Log-Bilinear Model (MLBL-B): the predicted next word representation r̂ is a linear prediction of the word features r_{w_1}, r_{w_2}, r_{w_3}, biased by the image features x. (b) Factored 3-way Log-Bilinear Model (MLBL-F): the word representation matrix R is replaced by a factored tensor for which the hidden-to-output connections are gated by x.

where E(:, w_i) denotes the column of E for the word representation of w_i. Given a predicted next word representation r̂, the factor outputs are

$$f = (W^{fr} \hat{r}) \bullet (W^{fx} x), \qquad (7)$$

where • is a component-wise product. The conditional probability P(w_n = i | w_{1:n-1}, x) of w_n given w_1, ..., w_{n-1} and x can be written as

$$P(w_n = i \mid w_{1:n-1}, x) = \frac{\exp\left((W^{fh}(:, i))^\top f + b_i\right)}{\sum_{j=1}^{K} \exp\left((W^{fh}(:, j))^\top f + b_j\right)},$$

where W^fh(:, i) denotes the column of W^fh corresponding to word i. We call this the MLBL-F model. As with the LBL and MLBL-B models, training can be achieved using a straightforward application of backpropagation. Unlike the other models, extra care needs to be taken when adjusting the learning rates for the matrices of the factored tensor.
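Putting Eqs. (4)-(7) together, the following sketch traces the MLBL-F scoring path with illustrative sizes; note that the factored form means the full K × D × M tensor is never built explicitly.

```python
import numpy as np

# Sketch of the factored MLBL-F scoring path (Eqs. 6-7 plus the softmax above);
# K, D, M, F and the context size are illustrative assumptions.
K, D, M, F, context = 1000, 50, 256, 100, 5
rng = np.random.default_rng(2)

W_fr = 0.01 * rng.standard_normal((F, D))
W_fx = 0.01 * rng.standard_normal((F, M))
W_fh = 0.01 * rng.standard_normal((F, K))
C = 0.01 * rng.standard_normal((context, D, D))
C_m = 0.01 * rng.standard_normal((D, M))
b = np.zeros(K)
E = W_fr.T @ W_fh                                 # D x K word embeddings

def mlbl_f_next_word_distribution(context_ids, x):
    # Eq. (6): predicted representation from embedding columns plus image bias.
    r_hat = sum(C[i] @ E[:, w] for i, w in enumerate(context_ids)) + C_m @ x
    # Eq. (7): factor outputs gate the prediction by the image features.
    f = (W_fr @ r_hat) * (W_fx @ x)
    # Softmax through the factored hidden-to-output weights W^fh.
    scores = W_fh.T @ f + b
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

x = rng.standard_normal(M)
print(mlbl_f_next_word_distribution([3, 17, 42, 7, 256], x).shape)  # (1000,)
```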

It is sensible that pre-computed word embeddings could be used as a starting point to training, as opposed to random initialization of the word representations. Indeed, all of our experiments use the embeddings of Turian et al. (2010) for initialization. In the case of the LBL and MLBL-B models, each pre-trained word embedding can be used to initialize the rows of R. In the case of the MLBL-F model, where R is a factored tensor, we can let E be the D × K matrix of pre-trained embeddings. Since E = (W^fr)^T W^fh, we can initialize the MLBL-F model with pre-trained embeddings by simply applying an SVD to E.
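Since E = (W^fr)^T W^fh, one way to realize this initialization is to split a singular value decomposition of the pre-trained embedding matrix between the two factor matrices; the square-root split below is an assumption, as the text only says an SVD is applied.

```python
import numpy as np

# Sketch: initialize W^fr and W^fh so that (W^fr)^T W^fh reproduces a D x K
# matrix E of pre-trained embeddings (e.g. Turian et al. embeddings).
D, K, F = 50, 1000, 50
rng = np.random.default_rng(3)
E = rng.standard_normal((D, K))                   # stand-in for real embeddings

U, s, Vt = np.linalg.svd(E, full_matrices=False)  # E = U diag(s) Vt
U, s, Vt = U[:, :F], s[:F], Vt[:F]                # keep the top F factors

W_fr = (U * np.sqrt(s)).T                         # F x D
W_fh = np.sqrt(s)[:, None] * Vt                   # F x K

print(np.allclose(W_fr.T @ W_fh, E))              # exact when F >= rank(E)
```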

5. Joint Image-Text Feature Learning

Up until now we have not made any assumptions on the type of modality being used for the feature representation x. In this section, we consider the case where the conditioned modality consists of images and show how to jointly learn image and word features along with the model parameters.

One way of incorporating image representation learning is to use a convolutional network whose outputs are used either as an additive bias or for gating. Gradients from the loss could then be backpropagated from the language model through the convolutional network to update the filter weights. Unfortunately, learning on every image in this architecture is computationally demanding. Since each training tuple of words comes with an associated image, the number of training elements becomes large even with a modest-sized training set. For example, if the training set consisted of 10,000 images and each image had a text description of 20 words, then the number of training elements for the model becomes 200,000. For large image databases this could quickly scale to millions of training instances.

To speed up computation, we follow Wang et al. (2012); Swersky et al. (2013) and learn our convolutional networks on small feature maps learned using k-means, as opposed to the original images. We follow the pipeline of Coates & Ng (2011). Given training images, r × r patches are randomly extracted, contrast normalized and whitened. These are used for training a dictionary with spherical k-means. These filters are convolved with the image and a soft activation encoding is applied. If the image is of dimensions n_V × n_H × 3 and k_f filters are learned, the resulting feature maps are of size (n_V − r + 1) × (n_H − r + 1) × k_f. Each slice of this region is then split into a G × G grid for which features within each region are max-pooled. This results in an output of size G × G × k_f. It is these outputs that are used as inputs to the convolutional network. For all of our experiments, we use G = 9 and k_f = 128.
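A condensed sketch of the dictionary-learning step of this pipeline is given below; the random patches stand in for contrast-normalized, whitened image patches, and the triangle-style soft encoding is an assumption (Coates & Ng describe several encoders). The dense convolution over the image and the G × G grid max-pooling are omitted.

```python
import numpy as np

# Spherical k-means dictionary plus a soft "triangle" activation encoding.
rng = np.random.default_rng(4)

def spherical_kmeans(P, k=128, iters=10):
    """P: (num_patches, r*r*3) whitened patches; returns unit-norm centroids."""
    D = P[rng.choice(len(P), k, replace=False)]
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8
    for _ in range(iters):
        assign = (P @ D.T).argmax(1)              # nearest centroid by dot product
        for j in range(k):
            members = P[assign == j]
            if len(members):
                D[j] = members.sum(0)
        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8
    return D

def triangle_encode(P, D):
    """Soft activation: max(0, mean distance - distance to each centroid)."""
    dists = np.linalg.norm(P[:, None, :] - D[None, :, :], axis=2)
    return np.maximum(0.0, dists.mean(1, keepdims=True) - dists)

patches = rng.standard_normal((5000, 6 * 6 * 3))  # stand-in for whitened patches
D = spherical_kmeans(patches)
print(triangle_encode(patches[:10], D).shape)     # (10, 128) soft codes
```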

Each 9 × 9 × 128 input is convolved with 64 3 × 3 filters, resulting in feature maps of size 7 × 7 × 64. Rectified linear units (ReLUs) are used for activation, followed by a response normalization layer (Krizhevsky et al., 2012). The response-normalized feature maps are then max-pooled with a pooling window of 3 × 3 and a stride of 2, resulting in outputs of size 3 × 3 × 64. One fully-connected layer with ReLU activation is added. It is the feature responses at this layer that are used either for additive biasing or gating in the MLBL-B and MLBL-F models, respectively.
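For illustration only, this layer sequence can be written out as follows (a PyTorch sketch, not the authors' implementation; the response-normalization size and the 256-dimensional fully-connected output are assumptions, the latter matching the feature size used in Section 7.2):

```python
import torch
import torch.nn as nn

# Sketch of the small convolutional network operating on 9 x 9 x 128 inputs.
convnet = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=3),      # 9x9x128 -> 7x7x64
    nn.ReLU(),
    nn.LocalResponseNorm(size=5),           # response normalization (size assumed)
    nn.MaxPool2d(kernel_size=3, stride=2),  # 7x7x64 -> 3x3x64
    nn.Flatten(),
    nn.Linear(3 * 3 * 64, 256),             # fully-connected feature layer
    nn.ReLU(),
)

x = torch.randn(4, 128, 9, 9)               # a batch of G x G x kf inputs
print(convnet(x).shape)                      # torch.Size([4, 256])
```

These responses would then feed the additive bias C^(m) x in MLBL-B or the gating path W^fx x in MLBL-F, with gradients flowing back into the filters.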


6. Generation and Retrieval

The standard approach to evaluating language models is through perplexity:

$$\log_2 C(w_{1:N} \mid x) = -\frac{1}{N} \sum_{w_{1:n}} \log_2 P(w_n = i \mid w_{1:n-1}, x),$$

where w_{1:n-1} runs through each subsequence of length n-1 and N is the length of the sequence. Here we use perplexity not only as a measure of performance but also as a link between text and the additional modality.

First, consider the task of retrieving training images from a text query w_{1:N}. For each image x in the training set, we compute C(w_{1:N} | x) and return the images for which C(w_{1:N} | x) is lowest. Intuitively, the images which, when conditioned on by the model, achieve low perplexity are those that are a good match to the query description.

The task of retrieving text from an image query is trickier for the following reason. It is likely that there are many 'easy' sentences to which the language model will assign low perplexity independent of the query image being conditioned on. Thus, instead of retrieving the text from the training set for which C(w_{1:N} | x) is lowest conditioned on the query image x, we instead look at the ratio C(w_{1:N} | x) / C(w_{1:N} | x̄), where x̄ denotes the mean image in the training set (computed in feature space). Thus, if w_{1:N} is a good explanation of x, then C(w_{1:N} | x) < C(w_{1:N} | x̄), and we can simply retrieve the text for which this ratio is smallest.

While this leads to better search results, it is conceivable that using the image itself as a query for other images and returning their corresponding descriptions may in itself work well as a query strategy. For example, an image taken at night would ideally return a description describing this, which would be more likely to occur if we first retrieved nearby images that were also taken at night. We found the most effective way of performing description retrieval to be as follows: first retrieve the top k_r training images as a shortlist, based on the Euclidean distance between x and the images in the training set. Then retrieve the descriptions for which C(w_{1:N} | x) / C(w_{1:N} | x̄) is smallest for each description w_{1:N} in the shortlist. We found that combining these two strategies is more effective than using either alone. In the case when a convolutional network is used, we first map the images through the convolutional network and use the output representations for computing distances.
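The two retrieval rules can be summarized in a short sketch; log2_prob is a hypothetical stand-in for the trained model's total log2-probability of a description conditioned on image features, and the shortlist size is an assumed parameter.

```python
import numpy as np

# Sketch of image retrieval by perplexity and description retrieval by the
# perplexity ratio against the mean training image, using a k_r shortlist.

def perplexity(words, x, log2_prob):
    return 2.0 ** (-log2_prob(words, x) / len(words))

def retrieve_images(query_words, image_feats, log2_prob, k=5):
    """Indices of the k training images with lowest query perplexity."""
    ppl = [perplexity(query_words, x, log2_prob) for x in image_feats]
    return np.argsort(ppl)[:k]

def retrieve_descriptions(query_x, descriptions, image_feats, log2_prob,
                          k_shortlist=25, k=5):
    """Shortlist by Euclidean image distance, then rank by the perplexity ratio."""
    x_mean = image_feats.mean(axis=0)             # mean training image (feature space)
    dists = np.linalg.norm(image_feats - query_x, axis=1)
    shortlist = np.argsort(dists)[:k_shortlist]
    ratios = [perplexity(descriptions[i], query_x, log2_prob) /
              perplexity(descriptions[i], x_mean, log2_prob) for i in shortlist]
    return shortlist[np.argsort(ratios)[:k]]

# Smoke test with a dummy scoring function and random features/descriptions.
dummy = lambda words, x: -2.0 * len(words) * (1.0 + 0.01 * float(x[0]))
feats = np.random.default_rng(6).standard_normal((100, 8))
descs = [[1, 2, 3, 4]] * 100
print(retrieve_images([1, 2, 3], feats, dummy))
print(retrieve_descriptions(feats[0], descs, feats, dummy))
```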

Finally, we generate text given an image as follows. Suppose we are given an initialization w_{1:n-1}, where n − 1 is the context size. We compute P(w_n = i | w_{1:n-1}, x) and obtain a sample w from this distribution, appending w to our initialization. This procedure is then repeated for as long as desired.
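As a sketch, the sampling loop looks as follows; next_word_probs is a hypothetical stand-in for the model's conditional distribution P(w_n = i | w_{1:n-1}, x) over the vocabulary.

```python
import numpy as np

# Sketch of conditional text generation by repeated sampling.
def generate(init_ids, x, next_word_probs, num_words=50, context=5, seed=0):
    rng = np.random.default_rng(seed)
    words = list(init_ids)                          # e.g. "in this picture there is"
    for _ in range(num_words):
        p = next_word_probs(words[-context:], x)    # condition on the last n-1 words
        words.append(int(rng.choice(len(p), p=p)))  # sample the next word index
    return words

# Smoke test with a uniform stand-in distribution over a 1000-word vocabulary.
uniform = lambda ctx, x: np.full(1000, 1e-3)
print(generate([1, 2, 3, 4, 5], None, uniform, num_words=10))
```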

7. Experiments

We perform experimental evaluation of our proposed models on three publicly available datasets:

IAPR TC-12 This dataset consists of 20,000 images across various domains, such as landscapes, portraits, indoor and sports scenes. Accompanying each image is a text description of one to three sentences describing the content of the image. The dataset was initially released for cross-lingual retrieval (Grubinger et al., 2006) but has since been used extensively for other tasks such as image annotation. We used a publicly available train/test split for our experiments.

Attribute Discovery This dataset contains roughly 40,000 images related to products such as bags, clothing and shoes, as well as subcategories of each product, such as high-heels and sneakers. Each image is accompanied by a web-retrieved text description which often reads as an advertisement for the product. Unlike the IAPR dataset, the text descriptions are not guaranteed to be descriptive of the image and often contain noisy, unrelated text. This dataset was proposed as a means of discovering visual attributes from noisy text (Berg et al., 2010). We used a random train/test split for our experiments, which will be made publicly available.

SBU Captioned Photos We obtained a subset of roughly 400,000 images from the SBU dataset (Ordonez et al., 2011), which contains images and short text descriptions. This dataset is used to induce word embeddings learned from both images and text for qualitative comparison.

7.1. Details of Experiments

We perform four experiments, three of which are quantitative and one of which is qualitative:

Bleu Evaluation Our main evaluation criterion is based on Bleu (Papineni et al., 2002). Bleu was designed for automated evaluation of statistical machine translation and can be used in our setting to measure similarity of descriptions. Previous work on generating text descriptions for images uses Bleu as a means of evaluation, where the generated sentence is used as a candidate for the gold standard reference. Given the diversity of possible image descriptions, Bleu may penalize candidates which are arguably descriptive of image content, as noted by Kulkarni et al. (2011), and may not always be the most effective evaluation (Hodosh et al., 2013), though Bleu remains the standard evaluation criterion for such models. Given a model, we generate a candidate description as described in Section 6, generating as many words as there are in the reference sentence, and compute the Bleu score of the candidate with the reference. This is repeated over all test points ten times, in order to account for the variability in the generated sentences.
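The paper does not specify a particular Bleu implementation; as one hedged illustration, NLTK's sentence-level Bleu can score a generated candidate against its reference (the two sentences below are invented for the example).

```python
from nltk.translate.bleu_score import sentence_bleu

reference = "there is a white building with a red roof".split()
candidate = "there is a white house with a red door".split()

# Cumulative B-1, B-2 and B-3, the metrics reported in Tables 2 and 3.
for n in (1, 2, 3):
    weights = tuple(1.0 / n for _ in range(n))
    print(f"B-{n}:", sentence_bleu([reference], candidate, weights=weights))
```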


Table 1. Sample neighbors (by cosine similarity) of words learned from the SBU dataset. First row: neighbors from Collobert & Weston (2008) (C&W). Second row: neighbors from an LBL model (without images). Third row: neighbors from an MLBL-F model (with images).

gloomy
  C&W:    tranquil, sensuous, somber, bleak, cheerful, dreary
  LBL:    dismal, slower, feeble, realistic, brighter, strong
  MLBL-F: hazy, stormy, foggy, crisp, cloudless, dull

classroom
  C&W:    laptop, dorm, desk, computer, canteen, darkroom
  LBL:    pub, cabin, library, bedroom, office, cottage
  MLBL-F: library, desk, restroom, office, cabinet, kitchen

flower
  C&W:    bamboo, silk, gold, bark, flesh, crab
  LBL:    bird, tiger, monster, cow, fish, leaf
  MLBL-F: plant, flowers, fruit, green, plants, rose

lighthouse
  C&W:    breakwater, icefield, lagoon, nunnery, waterway, walkway
  LBL:    monument, lagoon, kingdom, mosque, skyline, truck
  MLBL-F: pier, ship, dock, castle, marina, pool

cup
  C&W:    championship, trophy, bowl, league, tournament, cups
  LBL:    cider, bottle, needle, box, fashion, shoe
  MLBL-F: bag, bottle, container, oil, net, jam

terrain
  C&W:    shorelines, topography, vegetation, convection, canyons, slopes
  LBL:    seas, paces, descent, yards, rays, floors
  MLBL-F: headland, chasm, creekbed, ranges, crest, pamagirri

For baselines, we also compare against the log-bilinear model as well as image-conditioned models conditioned on random images. This allows us to obtain further evidence of the relevance of generated text. Finally, we compare against the models of Gupta et al. (2012) and Gupta & Mannem (2012), who report Bleu scores for their models on the IAPR dataset.¹

Perplexity Evaluation Each of our proposed models is trained on both datasets and the perplexity of the language models is evaluated. As baselines, we also include the basic log-bilinear model as well as two n-gram models. To evaluate the effectiveness of using pre-trained word embeddings, we also train a log-bilinear model where the word representations are randomly initialized. We hypothesize that image-conditioned models should result in lower perplexity than models which are only trained on text, without knowledge of their associated images.

Retrieval Evaluation We quantitatively evaluate the performance of our model for doing retrieval. First consider the task of retrieving images from sentence queries. Given a test sentence, we compute the model perplexity conditioned on each test image and rank each image accordingly. Let k_r denote the number of retrieved images. We define a sentence to be correctly matched if the matching image to the sentence query is ranked in the top k_r images sorted by model perplexity. Retrieving sentences from image queries is performed equivalently. Since our models use a shortlist (see Section 6) of nearest images for retrieving sentences, we restrict our search to images within the shortlist, which the matching sentence is guaranteed to be in.

¹We note that an exact comparison cannot be made with these methods since Gupta & Mannem (2012) assume tags are given as input along with images and both methods apply 10-fold CV. The use of tags can substantially boost the relevance of generated sentences. Nonetheless, these methods provide context for our results.

For additional comparison, we include a strong image-based bag-of-words baseline to determine whether a language model (and word ordering) is necessary for image-description retrieval tasks. This model works as follows: given image features, we learn a linear transformation onto independent logistic units, one for each word in the description. Descriptions are scored as

$$-\frac{1}{N} \sum_{w_{1:n}} \log P(w_n = w \mid x).$$

For retrieving images, we project each image and rank those which result in the highest description score. For retrieving sentences, we return those which result in the highest score given the word probabilities computed from the image. Since we use a shortlist for our models when performing sentence retrieval, we also use the same shortlist (relative to the image features used) to allow for fair comparison. A validation batch was used to tune the weight decay.
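A minimal sketch of this baseline's scoring rule is shown below; the linear map would be trained with weight decay as described, which is omitted here, and the sizes are illustrative.

```python
import numpy as np

# Image-conditioned bag-of-words baseline: independent logistic units per word.
M, K = 256, 1000                                   # image feature dim, vocabulary
rng = np.random.default_rng(5)
W = 0.01 * rng.standard_normal((K, M))             # learned linear transformation
c = np.zeros(K)                                    # per-word biases

def word_probs(x):
    """One independent logistic unit for each word in the vocabulary."""
    return 1.0 / (1.0 + np.exp(-(W @ x + c)))

def description_score(word_ids, x):
    """Mean log-probability of the description's words given the image."""
    p = word_probs(x)
    return np.mean(np.log(p[word_ids] + 1e-12))

x = rng.standard_normal(M)
print(description_score([3, 17, 42, 7], x))        # higher = better match
```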

Qualitative Results We trained an LBL model and an MLBL-F model on the SBU examples. Both language models were trained on the same text, but the MLBL-F was also conditioned on images using DeCAF features (Donahue et al., 2013). Both models were trained using perplexity as a criterion for early stopping, and with the same context size and vocabulary. Table 1 shows sample nearest neighbors from both models. When trained on images and text, the MLBL-F model can learn to capture both visual and semantic similarities, resulting in very different nearest neighbors than the LBL model and the C&W embeddings. These word embeddings will be made publicly available.

We use three types of image features in our experiments: Gist (Oliva & Torralba, 2001), DeCAF (Donahue et al., 2013), and features learned jointly with a convolutional net.


Table 2. Results on IAPR TC-12. PPL refers to perplexity, while B-n indicates Bleu scored with n-grams. Back-off GTn refers to n-grams with Katz backoff and Good-Turing discounting. Models which use a convolutional network are indicated by -conv, while -conv-R indicates using random images for conditioning. skmeans refers to the features of Kiros & Szepesvari (2012).

Model type        PPL.   B-1    B-2    B-3
Back-off GT2      54.5   0.323  0.145  0.059
Back-off GT3      55.6   0.312  0.131  0.059
LBL               20.1   0.327  0.144  0.068
MLBL-B-conv-r     28.7   0.325  0.143  0.069
MLBL-B-skmeans    18.0   0.349  0.161  0.079
MLBL-F-skmeans    20.3   0.348  0.165  0.085
MLBL-B-Gist       20.8   0.348  0.164  0.083
MLBL-F-Gist       28.8   0.341  0.151  0.074
MLBL-B-conv       20.6   0.349  0.165  0.085
MLBL-F-conv       21.7   0.341  0.156  0.073
MLBL-B-DeCAF      24.7   0.373  0.187  0.098
MLBL-F-DeCAF      21.8   0.361  0.176  0.092
Gupta et al.      -      0.15   0.06   0.01
Gupta & Mannem    -      0.33   0.18   0.07

7.2. Details of Training

Each of our language models was trained using the following hyperparameters: all context matrices used a weight decay of 1.0 × 10⁻⁴, while word representations used a weight decay of 1.0 × 10⁻⁵. All other weight matrices, including the convolutional network filters, use a weight decay of 1.0 × 10⁻⁴. We used batch sizes of 20 and an initial learning rate of 0.2 (averaged over the minibatch), which was exponentially decreased at each epoch by a factor of 0.998. Gated methods used an initial learning rate of 0.02. Initial momentum was set to 0.5 and was increased linearly to 0.9 over 20 epochs. The word representation matrices were initialized to the 50-dimensional pre-trained embeddings of Turian et al. (2010). We used a context size of 5 for each of our models. Perplexity was computed starting with word C + 1 for all methods, where C is the largest context size used in comparison (5 in our experiments). Perplexity was not evaluated on descriptions shorter than C + 3 words for all models. Since the features used have varying dimensionality, an additional layer was added to map images to 256 dimensions, so that across all experiments the input size to the bias and gating units is equivalent. Note that we did not explore varying the word embedding dimensionalities, context sizes or number of factors.

For each of our experiments, we split the training set into 80% training and 20% validation. Each model was trained while monitoring the perplexity on the validation set. Once the perplexity no longer improved for 5 epochs, the objective value on the training set was recorded. The training and validation sets were then fused and training continued until the objective value on the validation batch matched the recorded training objective. At this point, training stopped and evaluation was performed on the test set.

Table 3. Results on the Attributes Discovery dataset.

Model type        PPL.    B-1    B-2    B-3
Back-off GT2      117.7   0.163  0.033  0.009
Back-off GT3       93.4   0.166  0.032  0.011
LBL                97.6   0.161  0.031  0.009
MLBL-B-conv-r     154.4   0.166  0.035  0.012
MLBL-B-Gist        95.7   0.185  0.044  0.013
MLBL-F-Gist       115.1   0.182  0.042  0.013
MLBL-B-conv        99.2   0.189  0.048  0.017
MLBL-F-conv       113.2   0.175  0.042  0.014
MLBL-B-DeCAF       98.3   0.186  0.045  0.014
MLBL-F-DeCAF      133.0   0.178  0.041  0.012

7.3. Generation and Perplexity Results

Tables 2 and 3 show results on the IAPR and Attributes datasets, respectively. On both datasets, each of our multimodal models outperforms both the log-bilinear and n-gram models on Bleu scores. Our multimodal models also outperform Gupta et al. (2012) and result in comparable performance to Gupta & Mannem (2012). It should be noted that Gupta & Mannem (2012) assume that both images and tags are given as input, where the presence of tags gives substantial information about general image content. What is perhaps most surprising is that simple language models independent of images can also achieve non-trivial Bleu scores. For further comparison, we computed Bleu scores on the convolutional MLBL-B model when random images are used for conditioning. Moreover, we also computed Bleu scores on IAPR with LBL and MLBL-B-DeCAF when stopwords are removed, obtaining (0.166, 0.052, 0.013) and (0.224, 0.082, 0.028) respectively. This gives us strong evidence that the gains in Bleu scores are obtained directly from capturing and associating word representations with image content.

One observation from our results is that perplexity does not appear to be correlated with Bleu scores.²

On the IAPR dataset, the best perplexity is obtained using the MLBL-B model with fixed features, even though the best Bleu scores are obtained with a convolutional model. Similarly, both Back-off GT3 and LBL have the lowest perplexities on the Attributes dataset but are worse with respect to Bleu. Using more than 3-grams did not improve results on either dataset. For additional comparison, we also ran an experiment training LBL on both datasets using random word initialization, achieving perplexity scores of 23.4 and 109.6. This indicates the benefit of initialization from pre-trained word representations. Perhaps unsurprisingly, perplexity is much worse on the convolutional MLBL-B model when random images are used for conditioning.

²This is likely due to high variance on held-out perplexities due to the shortness of text. We note that perplexity is lower on the training set with multimodal models.


Table 4. F-scores for retrieval on IAPR TC-12 when a text query is used to retrieve images (T → I) or when an image query is used to retrieve text (I → T). Each row corresponds to DeCAF, Conv and Gist features, respectively.

                T → I                      I → T
          bow     mlbl-b  mlbl-f     bow     mlbl-b  mlbl-f
DeCAF     0.890   0.889   0.899      0.755   0.731   0.568
Conv      0.726   0.788   0.851      0.687   0.719   0.736
Gist      0.832   0.799   0.792      0.599   0.675   0.612

7.4. Retrieval Results

Tables 4 and 5 illustrate the results of our retrieval experiments. In the majority of our experiments the multimodal models either outperform or are competitive with the bag-of-words baseline. The baseline when combined with DeCAF features is exceptionally strong. Perhaps this is unsurprising, given that these features were trained to predict object classes on ImageNet. The generality of these features also makes them effective for predicting word occurrences, particularly if they are visual. For non-DeCAF experiments, our models improve on the baseline for 6 out of 8 tasks and result in similar performance on another. The MLBL-F model performed best when combined with a convolutional net on IAPR, while the MLBL-B model performed better on the remaining tasks. All 12 retrieval curves are included in the supplementary material.

7.5. Qualitative Results

The supplementary material contains qualitative results from our models. In general, the model does a good job at retrieving text with general characteristics of a scene or retrieving the correct type of product on the Attributes Discovery dataset, being able to distinguish between different kinds of sub-products, such as shoes and boots. The most common mistakes that the model makes are retrieving text with extraneous descriptions that do not exist in the image, such as describing people that are not present. We also observed errors on shorter queries, where single words, such as sunset and lake, indicate key visual concepts that the model is not able to pick up on.

For generating text, the model was initialized with 'in this picture there is' or 'this product contains a' and proceeded to generate 50 words conditioned on the image. The model is often able to describe the general content of the image, even if it does not get specifics correct, such as colors of clothing. This gives visual confirmation of the increased Bleu scores from our models. Several additional results are included on the web page of the first author.

Table 5. F-scores for retrieval on Attributes Discovery. Each row corresponds to DeCAF, Conv and Gist features, respectively.

                T → I                      I → T
          bow     mlbl-b  mlbl-f     bow     mlbl-b  mlbl-f
DeCAF     0.808   0.852   0.835      0.579   0.580   0.504
Conv      0.730   0.839   0.815      0.607   0.590   0.576
Gist      0.826   0.844   0.818      0.555   0.621   0.579

8. Conclusion

In this paper we proposed multimodal neural language models. We described two novel language models and showed, in the context of image-text learning, how to jointly learn word representations and image features. Our models can obtain improved Bleu scores compared to existing approaches for sentence generation, while generally outperforming a strong bag-of-words baseline for description and image retrieval.

To our surprise, we found additive biasing with high-level image features to be quite effective. A key advantage of the multiplicative model, though, is speed of training: even with learning rates an order of magnitude smaller, these models typically required substantially fewer epochs to achieve the same performance. Unlike MLBL-B, MLBL-F requires additional care in early stopping and learning rate selection.

This work takes a first step towards generating image descriptions with a multimodal language model and sets a baseline when no additional structures are used. For future work, we intend to explore adding additional structures to improve syntax as well as combining our methods with a detection algorithm.

Acknowledgments

We would also like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by Google and ONR Grant N00014-14-1-0232.

References

Bengio, Yoshua, Ducharme, Rejean, Vincent, Pascal, and Janvin, Christian. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003.

Berg, Tamara L, Berg, Alexander C, and Shih, Jonathan. Automatic attribute discovery and characterization from noisy web data. In ECCV, pp. 663–676. Springer, 2010.

Coates, Adam and Ng, Andrew. The importance of encoding versus training with sparse coding and vector quantization. In ICML, pp. 921–928, 2011.

Collobert, Ronan and Weston, Jason. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008.

Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.

Farhadi, Ali, Hejrati, Mohsen, Sadeghi, Mohammad Amin, Young, Peter, Rashtchian, Cyrus, Hockenmaier, Julia, and Forsyth, David. Every picture tells a story: Generating sentences from images. In ECCV, pp. 15–29. Springer, 2010.

Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeffrey, Ranzato, Marc'Aurelio, and Mikolov, Tomas. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.

Grangier, David, Monay, Florent, and Bengio, Samy. A discriminative approach for the retrieval of images from text queries. In Machine Learning: ECML 2006, pp. 162–173. Springer, 2006.

Grubinger, Michael, Clough, Paul, Muller, Henning, and Deselaers, Thomas. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, pp. 13–23, 2006.

Gupta, Ankush and Mannem, Prashanth. From image annotation to image description. In NIP, pp. 196–204. Springer, 2012.

Gupta, Ankush, Verma, Yashaswi, and Jawahar, CV. Choosing linguistics over vision to describe images. In AAAI, 2012.

Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47(1):853–899, 2013.

Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In ACL, pp. 873–882, 2012.

Kiros, Ryan and Szepesvari, Csaba. Deep representations and codes for image auto-annotation. In NIPS, pp. 917–925, 2012.

Krizhevsky, Alex, Hinton, Geoffrey E, et al. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, pp. 621–628, 2010.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoff. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.

Kulkarni, Girish, Premraj, Visruth, Dhar, Sagnik, Li, Siming, Choi, Yejin, Berg, Alexander C, and Berg, Tamara L. Baby talk: Understanding and generating simple image descriptions. In CVPR, pp. 1601–1608. IEEE, 2011.

Memisevic, Roland and Hinton, Geoffrey. Unsupervised learning of image transformations. In CVPR, pp. 1–8. IEEE, 2007.

Mikolov, Tomas, Karafiat, Martin, Burget, Lukas, Cernocky, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010.

Mikolov, Tomas, Deoras, Anoop, Povey, Daniel, Burget, Lukas, and Cernocky, Jan. Strategies for training large scale neural network language models. In ASRU, pp. 196–201. IEEE, 2011.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch, Alyssa, Goyal, Amit, Berg, Alex, Yamaguchi, Kota, Berg, Tamara, Stratos, Karl, and Daume III, Hal. Midge: Generating image descriptions from computer vision detections. In EACL, pp. 747–756, 2012.

Mnih, Andriy and Hinton, Geoffrey. Three new graphical models for statistical language modelling. In ICML, pp. 641–648. ACM, 2007.

Mnih, Andriy and Teh, Yee Whye. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.

Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In AISTATS, pp. 246–252, 2005.

Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, and Ng, Andrew. Multimodal deep learning. In ICML, pp. 689–696, 2011.

Oliva, Aude and Torralba, Antonio. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.

Ordonez, Vicente, Kulkarni, Girish, and Berg, Tamara L. Im2Text: Describing images using 1 million captioned photographs. In NIPS, pp. 1143–1151, 2011.

Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei-Jing. Bleu: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. ACL, 2002.

Socher, Richard, Ganjoo, Milind, Manning, Christopher D, and Ng, Andrew. Zero-shot learning through cross-modal transfer. In NIPS, pp. 935–943, 2013.

Socher, Richard, Le, Quoc V, Manning, Christopher D, and Ng, Andrew Y. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.

Srivastava, Nitish and Salakhutdinov, Ruslan. Multimodal learning with deep Boltzmann machines. In NIPS, pp. 2231–2239, 2012.

Swersky, Kevin, Snoek, Jasper, and Adams, Ryan. Multi-task Bayesian optimization. In NIPS, 2013.

Turian, Joseph, Ratinov, Lev, and Bengio, Yoshua. Word representations: A simple and general method for semi-supervised learning. In ACL, pp. 384–394. Association for Computational Linguistics, 2010.

Wang, Tao, Wu, David J, Coates, Adam, and Ng, Andrew Y. End-to-end text recognition with convolutional neural networks. In ICPR, pp. 3304–3308. IEEE, 2012.