
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4346–4350, July 5–10, 2020. ©2020 Association for Computational Linguistics

Multimodal Transformer for Multimodal Machine Translation

Shaowei Yao, Xiaojun Wan
Center for Data Science, Peking University
Wangxuan Institute of Computer Technology, Peking University
The MOE Key Laboratory of Computational Linguistics, Peking University

{yaosw,wanxiaojun}@pku.edu.cn

Abstract

Multimodal Machine Translation (MMT) aims to introduce information from another modality, generally static images, to improve translation quality. Previous works propose various incorporation methods, but most of them do not consider the relative importance of multiple modalities. In MMT, treating text and images equally may encode too much irrelevant information from the images, which introduces noise. In this paper, we propose a multimodal self-attention mechanism in the Transformer to address these issues. The proposed method learns the representations of images based on the text, which avoids encoding irrelevant information from the images. Experiments and visualization analysis demonstrate that our model benefits from visual information and substantially outperforms previous works and competitive baselines in terms of various metrics.

1 Introduction

Multimodal machine translation (MMT) is a novel machine translation (MT) task which aims at designing better translation systems using context from an additional modality, usually images (see Figure 1). It was initially organized as a shared task within the First Conference on Machine Translation (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018). Current works focus on the Multi30k dataset (Elliott et al., 2016), a multilingual extension of the Flickr30k dataset with translations of the English image descriptions into different languages.

Previous works propose various incorporation methods. Calixto and Liu (2017) utilize global image features to initialize the encoder/decoder hidden states of an RNN. Elliott and Kádár (2017) model the source sentence and reconstruct the image representation jointly via multi-task learning. Recently, Ive et al. (2019) propose a translate-and-refine approach using a two-stage decoder based on the Transformer (Vaswani et al., 2017). Calixto et al. (2019) put forward a latent variable model to learn the interaction between visual and textual features. However, in multimodal tasks the different modalities are usually not equally important. For example, in MMT the text is obviously more important than the images. Although the image carries richer information, it also contains more irrelevant content. If we directly encode the image features, we may introduce a lot of noise.

Figure 1: An Example for Multimodal Machine Translation.

To address the issues above, we propose the multimodal Transformer. The proposed model does not directly encode image features. Instead, the hidden representations of the images are induced from the text under the guide of image-aware attention. Meanwhile, we introduce a better way to incorporate information from the other modality based on a graph perspective of the Transformer. Experimental results and visualization show that our model makes good use of visual information and substantially outperforms the current state of the art.

2 Methodology

Our model is adapted from the Transformer; it is also an encoder-decoder architecture, consisting of stacked encoder and decoder layers. The focus of our work is to build a powerful encoder that incorporates the information from the other modality. Thus, we first introduce the incorporation method and then detail the multimodal self-attention. The final representations of the text and the image are sent to the sequence decoder to generate the target text.

2.1 Incorporating Method

The method of incorporating information from the other modality is based on a graph perspective of the Transformer. The core of the Transformer is self-attention, which employs a multi-head mechanism. Each attention head operates on an input sequence x = (x_1, ..., x_n) of n elements, where x_i ∈ R^d, and computes a new sequence z = (z_1, ..., z_n) of the same length, where z_i ∈ R^d:

    z_i = \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V)    (1)

where \alpha_{ij} is a weight coefficient computed by a softmax function:

    \alpha_{ij} = \mathrm{softmax}\left( \frac{(x_i W^Q)(x_j W^K)^\top}{\sqrt{d}} \right)    (2)

W^V, W^Q, W^K ∈ R^{d×d} are layer-specific trainable parameter matrices.
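For concreteness, the following is a minimal NumPy sketch of a single attention head as defined by Eq. (1)-(2). It is an illustration only: the toy inputs and the parameter names W_q, W_k, W_v are our assumptions, not the paper's configuration.

import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=axis, keepdims=True)

def self_attention_head(x, W_q, W_k, W_v):
    """One attention head over x of shape (n, d), following Eq. (1)-(2)."""
    d = x.shape[-1]
    q, k, v = x @ W_q, x @ W_k, x @ W_v        # (n, d) each
    alpha = softmax(q @ k.T / np.sqrt(d))      # (n, n) weights, Eq. (2)
    return alpha @ v                           # z, Eq. (1)

# Toy usage with random inputs and parameters.
rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
z = self_attention_head(x, W_q, W_k, W_v)      # shape (n, d)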

Thus we can see that each word representation is induced from all the other words. If we consider every word to be a node, then the Transformer can be regarded as a variant of GNN which treats each sentence as a fully-connected graph with words as nodes (Battaglia et al., 2018; Yao et al., 2020). In traditional MT tasks, the source sentence graph only contains nodes with text information. If we want to incorporate information from the other modality, we should add nodes carrying that modality's information to the source graph. Therefore, just as the words are local semantic representations of the sentence, we extract the spatial features of the image, which are the semantic representations of its local spatial regions. We add the spatial features of the image as pseudo-words in the source sentence and feed the combined sequence into the multimodal self-attention layer.
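A small sketch of this graph construction, under the assumption (made explicit in Section 2.2) that the spatial features are linearly projected before being appended as pseudo-word nodes; the function and parameter names are illustrative.

import numpy as np

def build_source_graph(word_embs, region_feats, W_img):
    """word_embs: (n, d) text nodes; region_feats: (p, 2048) spatial features;
    W_img: (2048, d) projection. Returns the (n + p, d) node sequence in which
    the image regions act as pseudo-words appended to the sentence."""
    pseudo_words = region_feats @ W_img
    return np.concatenate([word_embs, pseudo_words], axis=0)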

2.2 Multimodal Self-attention

As stated before, in MMT the text and images are not equally important. Directly encoding images, which contain a lot of irrelevant content, may introduce noise. Therefore, we propose multimodal self-attention to encode multimodal information. In multimodal self-attention, the hidden representations of the image are induced from the text under the guide of image-aware attention, which provides a latent adaptation from the text to the image. A visual representation is illustrated in Figure 2.

Figure 2: Multimodal self-attention.

Formally, we consider two modalities, text and img, with entries denoted by x^text ∈ R^{n×d} and x^img W^img ∈ R^{p×d}, respectively, where W^img projects the spatial image features into the model dimension. The output of multimodal self-attention is computed as follows:

    c_i = \sum_{j=1}^{n} \alpha_{ij} \, (x^{text}_j W^V)    (3)

where \alpha_{ij} is a weight coefficient computed by a softmax function:

    \alpha_{ij} = \mathrm{softmax}\left( \frac{(x_i W^Q)(x^{text}_j W^K)^\top}{\sqrt{d}} \right)    (4)

Here x denotes the concatenation of the n text entries and the p projected image entries (the pseudo-words of Section 2.1), so i ranges over all n + p positions.

The resulting c ∈ R^{(n+p)×d} is the hidden representation of the words and the image. At the last layer, c is fed into the sequence decoder to generate the target sequence. We can see that the hidden representations of the image are induced only from the words, but under the guide of image-aware attention: the extracted spatial features of the image are not directly encoded in the model; instead, they adjust the attention over the words to compute the hidden representations of the image. In each encoder layer we also employ residual connections and layer normalization. The decoder follows the standard implementation of the Transformer.
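The following PyTorch sketch shows one single-head multimodal self-attention layer as we read Eq. (3)-(4): queries are computed from the concatenated text and projected image entries, while keys and values come from the text only. It is a simplified illustration, not the authors' released implementation; multi-head attention, residual connections, layer normalization, and the feed-forward sublayer are omitted.

import math
import torch
import torch.nn as nn

class MultimodalSelfAttention(nn.Module):
    """Single-head sketch of Eq. (3)-(4): image states are induced from the text."""

    def __init__(self, d_model, d_img):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_img = nn.Linear(d_img, d_model, bias=False)   # projects spatial features

    def forward(self, x_text, x_img):
        # x_text: (n, d_model) word states; x_img: (p, d_img) spatial image features.
        x = torch.cat([x_text, self.w_img(x_img)], dim=0)    # (n + p, d_model)
        q = self.w_q(x)                                      # queries: text + image entries
        k = self.w_k(x_text)                                 # keys: text entries only
        v = self.w_v(x_text)                                 # values: text entries only
        scores = q @ k.transpose(0, 1) / math.sqrt(q.size(-1))
        alpha = torch.softmax(scores, dim=-1)                # (n + p, n) attention weights
        return alpha @ v                                     # c: (n + p, d_model)

# Toy usage: 12 words, 49 image regions (7 x 7), 300-dimensional hidden states.
layer = MultimodalSelfAttention(d_model=300, d_img=2048)
c = layer(torch.randn(12, 300), torch.randn(49, 2048))       # shape (61, 300)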

3 Experiment

3.1 Baselines and Metrics

We compare the performance of our model with the following kinds of models: (1) sequence-to-sequence models trained only on text data (LSTM, Transformer); (2) previous works trained on both text and image data. We evaluate the translation quality of our model in terms of BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014), which have been used in most previous works.

3.2 Datasets

We build and test our model on the Multi30k dataset (Elliott et al., 2016), which consists of two multilingual expansions of the original Flickr30k dataset, referred to as M30kT and M30kC, respectively. Multi30k contains 30k images; for each image, M30kT has one of its English descriptions manually translated into German by a professional translator. M30kC has five English descriptions and five German descriptions, but the German descriptions were crowdsourced independently from their English versions. The training, validation, and test sets of Multi30k contain 29k, 1,014, and 1,000 instances, respectively. We use M30kT as the original training data and M30kC for building additional back-translated training data, following Calixto et al. (2019). We present our experimental results on the English-German (En-De) Test2016 set. We use an LSTM trained on the textual part of the M30kT dataset (De-En, the original 29k sentences) without images to build a back-translation model (Sennrich et al., 2016), and then apply this model to translate 145k monolingual German descriptions in M30kC into English as additional training data. We refer to this part of the data as back-translated data.

3.3 Settings

We preprocess the data by tokenizing and lowercasing. Word embeddings are initialized with pretrained 300-dimensional GloVe vectors. We extract spatial image features from the last convolutional layer of ResNet-50; the spatial features are 7 × 7 × 2048-dimensional vectors which represent local spatial regions of the image.
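One possible way to obtain such spatial features with torchvision, assuming a standard pretrained ResNet-50 and default ImageNet preprocessing; the exact preprocessing used in the paper may differ.

import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained ResNet-50 without the average-pooling and classification head,
# so the output of the last convolutional block is kept: (1, 2048, 7, 7).
resnet = models.resnet50(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def spatial_features(image_path):
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = feature_extractor(image)                  # (1, 2048, 7, 7)
    return fmap.squeeze(0).flatten(1).transpose(0, 1)    # (49, 2048) region features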

Our encoder and decoder both have 6 layers, with 300-dimensional word embeddings and hidden states. We employ 10 attention heads and a dropout rate of 0.1. We use the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98 and minibatches of size 32 or 128 (depending on whether the back-translated data is added). We increase the learning rate linearly for the first warmup steps and decrease it thereafter proportionally to the inverse square root of the step number, with warmup steps = 8000. A similar learning rate schedule is adopted in Vaswani et al. (2017).

Model                                  BLEU4   METEOR
LSTM                                   36.8    54.9
Transformer                            37.8    55.3
IMGD (Calixto and Liu, 2017)           37.3    55.1
NMTSRC+IMG (Calixto et al., 2017)      36.5    55.0
Transformer+Att (Ive et al., 2019)     36.9    54.5
Del+obj (Ive et al., 2019)             38.0    55.6
VMMTF (Calixto et al., 2019)           37.6    56.0
Ours                                   38.7    55.7
+ back-translated data
IMGD (Calixto and Liu, 2017)           38.5    55.9
NMTSRC+IMG (Calixto et al., 2017)      37.1    54.5
VMMTF (Calixto et al., 2019)           38.4    56.3
Ours                                   39.5    56.9

Table 1: Comparison results on the Multi30k test set. The best baseline results are underlined. Bold highlights statistically significant improvements.
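For concreteness, the warmup-then-decay schedule described in Section 3.3 can be written as a small function; coupling the scale to the hidden size follows Vaswani et al. (2017) and is an assumption here rather than a detail stated in the paper.

def transformer_lr(step, d_model=300, warmup_steps=8000):
    """Increase the learning rate linearly for `warmup_steps`, then decay it
    proportionally to the inverse square root of the step number."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example (hypothetical): plug into PyTorch via LambdaLR with base lr = 1.0.
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98))
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, transformer_lr)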

3.4 Results

The results of all methods are shown in Table 1. We can see that our Transformer baseline obtains results comparable to most previous works. When trained on the original data, our model substantially outperforms the SoTA according to BLEU and achieves a competitive result according to METEOR. Moreover, our model surpasses the text-only baseline by more than 1 BLEU point, which demonstrates that our model benefits considerably from the visual modality.

To further investigate our model's performance with more data, we also train the models with additional back-translated data; the comparison results are shown in the lower part of Table 1. We can see that almost all models improve with the additional training data, but our model obtains the largest improvement, achieving new SoTA results on all metrics. This demonstrates that our model performs even better on the larger dataset.

Figure 3: Translation cases and Visualization. Colored words represent some of the improvements.

3.5 Visualization Analysis

Figure 3 depicts translations for two cases in the test set. Colors highlight improvements. Furthermore, we visualize the contributions of different local regions of the image in different attention heads, which shows that our model can focus on the appropriate regions of the image. For example, our model pays more attention to the building and the person in the first case, and thus the model understands that the person is working on the building rather than just standing there. In the second case, most attention heads attend to the balance beam and the jean dress of the girl, avoiding errors in the translation.

3.6 Ablation Study

To further study the influence of the individual components in our model, we conduct ablation experiments to better understand their relative importance. The results are presented in Table 2. Firstly, we investigate the effect of multimodal self-attention, shown in the second row (replace with self-attention) of Table 2: if we simply concatenate the word vectors with the image features and then perform standard self-attention, we lose 0.6 BLEU points and 0.4 METEOR points. Inspired by Elliott (2018), we further examine the utility of the image via adversarial evaluation. When we replace all input images with a blank picture, the performance of the model drops considerably. When we replace all input images with a random image (whose content does not match the description in the sentence pair), the model performs even worse than the text-only model; the image here is effectively noise that distracts the translation.

Model                              BLEU4   METEOR
Full Model                         38.7    55.7
- replace with self-attention      38.1    55.3
- replace with blank images        37.1    54.8
- replace with random images       36.7    54.5

Table 2: Influence of different components in our model.
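A minimal sketch of how the two adversarial settings in Table 2 can be constructed, following the spirit of Elliott (2018); the blank picture and the random re-pairing below are our assumptions about the setup, not code from the paper.

import numpy as np
from PIL import Image

def blank_image(size=(224, 224)):
    # An all-white picture used in place of every test image.
    return Image.new("RGB", size, color=(255, 255, 255))

def mismatched_order(num_images, rng=None):
    # Pair each sentence with a randomly chosen different image, so the visual
    # context no longer matches the description in the sentence pair.
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(num_images)
    while num_images > 1 and np.any(perm == np.arange(num_images)):
        perm = rng.permutation(num_images)   # re-sample until no self-pairings
    return perm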

4 Conclusion

In this paper, we propose multimodal self-attention to account for the relative importance of different modalities in the MMT task. The hidden representations of the less important modality (the image) are induced from the more important modality (the text) under the guide of image-aware attention. The experiments and visualization show that our model makes good use of multimodal information and performs better than previous works.

There are various multimodal tasks in which the modalities have different relative importance. In future work, we would like to investigate the effectiveness of our model on these tasks.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61772036), the MSRA Collaborative Research Program, and the Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.

References

Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 304–323.

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

Iacer Calixto and Qun Liu. 2017. Incorporating global visual features into attention-based neural machine translation. In EMNLP, pages 992–1003.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In ACL, pages 1913–1924.

Iacer Calixto, Miguel Rios, and Wilker Aziz. 2019. Latent variable model for multi-modal translation. In ACL, pages 6392–6405.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380.

Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In EMNLP, pages 2974–2978.

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215–233.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74.

Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 130–141.

Julia Ive, Pranava Madhyastha, and Lucia Specia. 2019. Distilling translations with visual awareness. In ACL, pages 6525–6538.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In ACL, pages 86–96.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 543–553.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.

Shaowei Yao, Tianming Wang, and Xiaojun Wan. 2020. Heterogeneous graph transformer for graph-to-sequence learning. In Proceedings of the Association for Computational Linguistics (ACL).