All-in-One Image-Grounded Conversational Agents

Da Ju*, Kurt Shuster, Y-Lan Boureau and Jason Weston
Facebook AI Research
{daju, kshuster, ylan, jase}@fb.com
* Contact Author

Abstract

As single-task accuracy on individual language and image tasks has improved substantially in the last few years, the long-term goal of a generally skilled agent that can both see and talk becomes more feasible to explore. In this work, we focus on leveraging individual language and image tasks, along with resources that incorporate both vision and language, towards that objective. We design an architecture that combines state-of-the-art Transformer and ResNeXt modules fed into a novel attentive multimodal module to produce a combined model trained on many tasks. We provide a thorough analysis of the components of the model, and transfer performance when training on one, some, or all of the tasks. Our final models provide a single system that obtains good results on all vision and language tasks considered, and improves the state of the art in image-grounded conversational applications.

1 Introduction

A picture may be worth a thousand words, but combining pictures and words is even better. There are many ways to marry vision and language: an image can be a great conversation starter, or discussion point; text accompanying an image can be a mere descriptive caption, or some witty commentary. Humans can seamlessly blend these skills and use them for interaction, depending on the given setting and need.

In order to probe this range of skills, a large set of image-and-text tasks have been devised by researchers, covering image captioning [Young et al., 2014; Chen et al., 2015; Shuster et al., 2019a], visual question answering [Goyal et al., 2017; Das et al., 2017], and dialogue based on an image [Mostafazadeh et al., 2017; Shuster et al., 2018]. Recent years have seen tremendous progress in both vision [He et al., 2016; Girshick et al., 2018; Mahajan et al., 2018] and language [Vaswani et al., 2017a; Radford et al., 2019; Devlin et al., 2019] applications, and in all these individual tasks [Goyal et al., 2017; Shuster et al., 2018; Shuster et al., 2019a] as well, so the time is ripe for exploring the possibility of a multi-tasking agent that would do well on all of them.

In this work, we design an architecture that leverages existing state-of-the-art vision and language modules, and combines them with a novel attentive multimodal combiner module. The module can learn when and how to attend between the two modalities, depending on the particular inputs, and improves over a standard attention mechanism. Our work also provides a detailed analysis of what works and what does not. We perform multiple ablation experiments to compare what types of architectures, training objectives, and optimization strategies work best for what tasks, or for achieving the most balanced performance across the board. We thus obtain models that improve the state of the art over several individual image-grounded conversational tasks, and a single system that is capable of doing well on all the image-grounded language tasks we consider.

2 Tasks

We first detail separate language and vision tasks that are considered from prior work, and then describe the combined vision and language tasks we consider for training an entire architecture for building an image-grounded conversational agent. A summary of these tasks is also provided in Table 1.

2.1 Language-based

Large-scale text corpora are commonly used to pre-train text encoders; we use these methods that have been developed in prior work. In particular we first consider BERT-based representations [Devlin et al., 2019] from [Humeau et al., 2019], which use 150 million (context, response) pairs extracted from Wikipedia and Toronto Books. To make use of data potentially more related to dialogue and of a more colloquial nature, we also use pre-training based on pushshift.io Reddit [Mazaré et al., 2018; Humeau et al., 2019], consisting of 174 million (context, response) pairs.

2.2 Vision-based

Similarly, large-scale image datasets are commonly used to pre-train image encoders, in particular ImageNet [Deng et al., 2009] (1.28 million images), Instagram images [Mahajan et al., 2018] (3.5 billion images), and the Visual Genome (108k images with 2.8 million attributes) [Krishna et al., 2017].


Modalities | Task | Train Images | Train Utterances | Valid Images | Valid Utterances | Test Images | Test Utterances | Cands
Language | Wiki + Tor. Books | - | 150m | - | - | - | - | -
Language | pushshift.io Reddit | - | 174m | - | - | - | - | -
Vision | ImageNet | 1.28m | - | - | - | - | - | -
Vision | Instagram | 3.5b | - | - | - | - | - | -
Vision | Visual Genome | 108,077 | - | - | - | - | - | -
Vision + Language | COCO | 82,783 | 414,113 | 5,000 | 25,000 | 5,000 | 25,000 | 5,000
Vision + Language | Flickr30k | 29,000 | 145,000 | 1,014 | 5,070 | 1,000 | 5,000 | 1,000
Vision + Language | Personality-Captions | 186,858 | 186,858 | 5,000 | 5,000 | 10,000 | 50,000 | 500
Vision + Language | Image-Chat | 186,782 | 355,862 | 5,000 | 15,000 | 9,997 | 29,991 | 100
Vision + Language | Image-Chat QA | 19,702 | 19,702 | 1,129 | 1,129 | 2,224 | 2,224 | 100
Vision + Language | IGC | - | - | 1,613 | 4,839 | 2,591 | 7,773 | 100
Vision + Language | VQA | 82,783 | 443,757 | 40,504 | 214,354 | 81,834 | 447,793 | 3,129

Table 1: Dataset statistics for all relevant datasets. During evaluation, gold responses are scored against other candidates (Cands).

2.3 Vision + Language

In the combined tasks we consider, images and language are possible inputs, and the output is a text response from the agent. The goal is that the tasks, when multi-tasked, can teach an agent how to respond appropriately in different situations using different skills.

COCO Captions. The COCO Captions dataset [Chen et al., 2015] requires that a model, given an image, predicts a caption that factually summarizes the scene, for example "a large bus sitting next to a very tall building". In the dataset used for the 2015 challenge, there are about 83k training images and 414k captions, as images are captioned multiple times, and a large validation set of about 40k images. Some works have merged some or all images from that validation set into the training set (we indicate this with an asterisk in Table 9). In this work we only train on the 83k images of the original train set, to avoid training on images that also appear in the VQA validation set, and use the validation and test sets of 5k images each from [Karpathy and Fei-Fei, 2017].

Flickr30k. Flickr30k [Young et al., 2014] is also a captioning dataset with factual summaries, although it is smaller, with 29k training images and 145k captions.

Personality Captions (PC). In contrast to the previous two datasets, Personality Captions [Shuster et al., 2019a] attempts to model human style when people speak about images. While the training set also consists of (image, response) pairs, each one also has a given style label out of 215 possible styles, such as "Sympathetic", "Optimistic" or "Dramatic". The captions authored by humans then tend to be less factual in tone, and rather than simply stating what is in the image, they are more conversational, e.g. "This sandwich looks so delicious. My goodness!". It consists of about 187k training images, with one caption each.

Image Chat (IC). Image Chat [Shuster et al., 2018] is an extension of the Personality Captions dataset to full dialogue. It also uses the same 215 style traits and images as input, but human-human conversations have been collected based on the images and traits instead, with each speaker pair in a given chat being assigned a possibly different random trait. The training set consists of the same 187k images, with 356k total conversation turns.

Image Chat QA (ICQA). Image Chat QA is the extraction of all the question-answer pairs that appear in the Image Chat dataset, to evaluate performance in answering such conversational image-grounded questions. The questions have been extracted heuristically, by assuming a question contains a question mark ("?") or starts with who, what, when, where, why or how. This extracts about 20k such training questions.
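
The heuristic is simple enough to sketch in a few lines. The snippet below is an illustrative reimplementation based only on the description above; the authors' exact extraction script is not given, and the function names are ours.

```python
# Illustrative sketch of the ICQA question-extraction heuristic (not the authors' code).
WH_WORDS = ("who", "what", "when", "where", "why", "how")

def looks_like_question(utterance: str) -> bool:
    text = utterance.strip().lower()
    return "?" in text or text.startswith(WH_WORDS)

def extract_qa_pairs(dialogue_turns):
    """Return (question, answer) pairs from consecutive turns of an Image Chat dialogue."""
    pairs = []
    for question, answer in zip(dialogue_turns, dialogue_turns[1:]):
        if looks_like_question(question):
            pairs.append((question, answer))
    return pairs

# Example:
# extract_qa_pairs(["What a nice view!", "Where was this taken?", "Somewhere in the Alps."])
# -> [("Where was this taken?", "Somewhere in the Alps.")]
```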

Image-Grounded Conversations (IGCQ and IGCQA). Image-Grounded Conversations (IGC) [Mostafazadeh et al., 2017] is also a conversational dataset between pairs of humans given an image. It does not contain a training set, but only validation and test portions. The conversations are three turns each, in the format of (context, question, response) tuples. We refer to the task of forming a question given the context as IGCQ, and the task of responding to the question as IGCQA.

VQA. Visual QA [Goyal et al., 2017] is a task involving open-ended questions about images which require an understanding of vision, language, and commonsense knowledge to answer, such as "where is the child sitting?" or "who is wearing the glasses?". It contains 83k training images and 444k QA pairs. Note this line of work has also been extended to multiple questions in sequence [Das et al., 2017], but we do not consider that task here.

3 Related Work

Separately in the NLP field and in the vision field, large advancements have recently been made in terms of the quality of learnt representations.

In NLP, word embedding representations [Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013; Joulin et al., 2017] have given way to multi-sentence, multi-layer, self-attentive representations through Transformers, with pre-training on large corpora such as Wikipedia and Toronto Books [Vaswani et al., 2017a; Radford et al., 2018; Devlin et al., 2019; Bakhtin et al., 2019; Radford et al., 2019]. In dialogue, it has been shown that pre-training on utterances from large-scale conversational corpora such as pushshift.io Reddit improves over large pre-training over resources like Wikipedia, because they are more related to the task [Mazaré et al., 2018; Humeau et al., 2019; Shuster et al., 2019b]. When training on downstream tasks, multi-tasking language tasks is also starting to become a more explored area [Collobert and Weston, 2008; McCann et al., 2018; Raffel et al., 2019].

In vision, conventional convolutional neural networks [LeCun et al., 1990; Krizhevsky et al., 2012] have been upgraded and improved by deeper ResNet architectures that incorporate skip connections [He et al., 2016], trained through ImageNet [Deng et al., 2009]. On tasks such as VQA, which explicitly ask questions about object properties, Faster R-CNN features [Girshick et al., 2018], which incorporate object detection algorithms, have been shown to perform well. On tasks with large coverage of everyday images and commonsense knowledge about them, Instagram training has been shown to perform well [Mahajan et al., 2018; Shuster et al., 2018; Shuster et al., 2019a].

Given this improved performance across different modalities, a natural next step is methods that combine these approaches for multimodal tasks involving language and vision. Several recent approaches have been built with this goal, including ViLBERT [Lu et al., 2019a], VisualBERT [Li et al., 2019b], LXMERT [Tan and Bansal, 2019], Unicoder-VL [Li et al., 2019a], VL-BERT [Su et al., 2019] and UNITER [Chen et al., 2019]. A common theme is to borrow some of the pre-training ideas from BERT, but apply them to pre-training both language and vision, and then fine-tune these models on downstream tasks. Another recent work multi-tasks 12 vision and language tasks at once [Lu et al., 2019b]. Somewhat differing from our work, the end tasks considered there are not aimed at building a unified conversational agent where the output is dialogue, but include any task involving language and vision of some form, for example caption-based image retrieval, referring expressions, and region-to-phrase grounding, most of which we do not consider here. Recently, [Shuster et al., 2019b] proposed multi-tasking to build a conversational agent, but using mostly language-only tasks (10 tasks), although it does include two of the image tasks we consider here.

4 Methods

Our model is a retrieval architecture that outputs a candidate response from the training set. Like most multimodal architectures, it comprises a text encoder, an image encoder, and a way to combine the two. However, unlike recent models that use various cross-attention mechanisms to get the joint representation of the final context, our model simply uses a so-called multimodal combiner. An extra style encoder is also added to represent the different style traits inside the Personality Captions and Image Chat tasks, and to differentiate them from the other tasks. The model finally scores possible output candidates, using either a ranking or a classification head, depending on the task. An overview of the model architecture is given in Figure 1.

Figure 1: Overview of our model in its TransResNet-3AMMC (Attentive Multimodal Combiner) variant. The non-attentive variant (TransResNet-MMC) has a single Transformer combiner instead of the three split Transformers followed by a weighted sum shown here.

4.1 Text Encoders

We use two text encoders, one for the input context and one for the output candidates. The output candidate encoder encodes the candidate utterances that will be scored, and the context encoder encodes the text input. Depending on the task, the context can be the previous dialogue history or a question posed about the image, or a combination of the two. Both text encoders are pre-trained Transformers. The final output of the candidate encoder is a single vector per candidate, obtained by taking the mean of the per-token encodings. For the context encoder we retain the per-token output encodings, thus comprising the same length as the input sequence, for input into the multimodal combiner. During multimodal training we fine-tune both text encoders.
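
As a concrete illustration of the two pooling strategies described above, the following PyTorch sketch shows mean-pooling for candidates and per-token outputs for the context; tensor shapes and function names are our assumptions, not the paper's released code.

```python
import torch

# Sketch only: `transformer_out` is assumed to be [batch, seq, dim] Transformer output,
# `mask` a [batch, seq] 0/1 tensor marking non-padding tokens.

def pool_candidates(transformer_out: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean over non-padding tokens -> one vector per candidate."""
    mask = mask.unsqueeze(-1).float()                      # [batch, seq, 1]
    summed = (transformer_out * mask).sum(dim=1)           # [batch, dim]
    return summed / mask.sum(dim=1).clamp(min=1.0)

def context_tokens(transformer_out: torch.Tensor) -> torch.Tensor:
    """Keep the full per-token sequence as input to the multimodal combiner."""
    return transformer_out                                 # [batch, seq, dim]
```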

4.2 Image Encoder

We consider two possible image encoders in our model. The first is a ResNeXt-based model trained on 3.5 billion Instagram images [Mahajan et al., 2018], which we refer to as ResNeXt-IG-3.5B. This encodes the image into a 2048-dimensional vector. The weights are fixed during the subsequent training process. The second is an improved Faster R-CNN model [Girshick et al., 2018] trained on the Visual Genome dataset [Krishna et al., 2017]. We fix the network up to the fc6 layer and fine-tune the fc7 weights, as in [Jiang et al., 2018]. We extract 100 2048-dimensional vectors (100-channel Faster R-CNN features). In addition to trying these models independently, we also investigate using both of their features concatenated together as input to the multimodal combiner.
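
The sketch below illustrates, under our own naming assumptions, how the two kinds of image features could be stacked into a single sequence of image "tokens" when both encoders are used; the paper only states that the features are concatenated, so the exact layout is an assumption.

```python
import torch

# Sketch (assumption, not the released code): ResNeXt-IG-3.5B yields one 2048-d vector
# per image, Faster R-CNN yields 100 region vectors of the same size. When both are used,
# they are stacked along the sequence dimension, giving 101 image "tokens" per example.

def assemble_image_features(resnext_vec: torch.Tensor,   # [batch, 2048]
                            rcnn_feats: torch.Tensor     # [batch, 100, 2048]
                            ) -> torch.Tensor:
    return torch.cat([resnext_vec.unsqueeze(1), rcnn_feats], dim=1)  # [batch, 101, 2048]
```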

4.3 Multimodal Combiner

The Multimodal Combiner (MMC) is used to combine the encodings from different components of the model. This includes the 100-channel Faster R-CNN features, the ResNeXt-IG-3.5B features, the sequence-based context features (which depend on the text length), and the encoding from the style encoder. Prior to combination, the individual encodings are normalized with their own layer-norm layer; each is then fed into the MMC with a positional embedding to indicate which feature type it is. The Multimodal Combiner is simply a Transformer [Vaswani et al., 2017b] encoder without its embedding layer; thus self-attention is applied to all features, which then go through linear layers. A mean operation is performed in the end to get a single vectorial representation of the whole multimodal context. This joint representation is then either used for a dot product with the candidate encodings (for ranking) or sent to an additional linear layer (for classification), as detailed in Sec. 4.5.
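
A minimal PyTorch sketch of such a combiner is given below, assuming all inputs have already been projected to a common dimension; the module names, sizes, and the use of torch.nn.TransformerEncoder are our own simplifications, not the released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the (non-attentive) multimodal combiner described above.
class MultimodalCombiner(nn.Module):
    def __init__(self, dim=768, n_feature_types=4, n_layers=4, n_heads=12):
        super().__init__()
        # one LayerNorm per input feature type (image, context tokens, style, ...)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(n_feature_types)])
        # learned embedding indicating which feature type each position comes from
        self.type_emb = nn.Embedding(n_feature_types, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features):
        """features: list of [batch, seq_i, dim] tensors, one per feature type."""
        pieces = []
        for i, feat in enumerate(features):
            type_ids = torch.full(feat.shape[:2], i, dtype=torch.long, device=feat.device)
            pieces.append(self.norms[i](feat) + self.type_emb(type_ids))
        x = torch.cat(pieces, dim=1)   # concatenate all modalities along the sequence
        x = self.encoder(x)            # self-attention across all feature types
        return x.mean(dim=1)           # single joint vector for the multimodal context

# The joint vector is then scored against candidate encodings with a dot product
# (ranking) or passed through a linear layer over answer classes (VQA classification).
```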

4.4 Attentive Multimodal Combiner

We further propose an Attentive Multimodal Combiner (AMMC), shown in Fig. 1, where multiple Transformers are used and then combined through an attention mechanism, potentially allowing them to focus on different skills. We use the style encoding as the query, forward it to a linear layer of the same output dimension as the number of Transformers in the multimodal combiner (i.e., 2 to 4, denoted as 2AMMC, 3AMMC and 4AMMC), followed by a softmax. We hence use those outputs to perform a weighted sum of the outputs from all the multimodal Transformers. This attention mechanism thus learns which Transformers to rely on more for which inputs, allowing the model to switch between skills.
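
Concretely, the weighting can be sketched as follows, reusing the MultimodalCombiner sketch above; again, names and shapes are assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

# Sketch of the attentive multimodal combiner (AMMC): N parallel combiner Transformers
# whose outputs are mixed by attention weights computed from the style encoding.
class AttentiveMultimodalCombiner(nn.Module):
    def __init__(self, dim=768, n_combiners=3, **mmc_kwargs):
        super().__init__()
        self.combiners = nn.ModuleList(
            [MultimodalCombiner(dim=dim, **mmc_kwargs) for _ in range(n_combiners)]
        )
        # the style encoding acts as the query: linear layer to one logit per combiner
        self.attn = nn.Linear(dim, n_combiners)

    def forward(self, features, style_encoding):
        weights = torch.softmax(self.attn(style_encoding), dim=-1)       # [batch, N]
        outputs = torch.stack([c(features) for c in self.combiners], 1)  # [batch, N, dim]
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)              # weighted sum
```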

4.5 Output Heads and Loss

For tasks like VQA where there is one factually correct answer and many wrong answers, it has been shown that strong performance can be achieved using a classification head, considering all the possible most frequent answers as classes, and using a binary cross-entropy loss.[1] We thus also consider this approach for VQA. For open-ended problems in which there may be multiple right answers, e.g. those in Image Chat, we consider an alternative approach of using a ranking head. In this approach, the gold label is contrasted with a subsample of negative candidates during training (chosen as the labels of other examples in the batch), but still using the binary cross-entropy loss, which scales well to huge candidate sets. We compare these two methods in this work, and also consider training both at the same time. We use batch sizes of 256 / 512 and Adam for optimization. For multi-tasking, we experimented with various kinds of dataset weighting schemes, but in the end we went for simplicity and report results of sampling from tasks equally, so that the same number of updates are done per task in an epoch, which was difficult to improve upon.
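
The two objectives can be sketched as follows; this is our interpretation of the description above (in-batch negatives scored by dot product, binary cross-entropy for both heads), not the released training code.

```python
import torch
import torch.nn.functional as F

def ranking_loss(context_vecs, candidate_vecs):
    """context_vecs, candidate_vecs: [batch, dim]; row i's gold candidate is candidate i.
    The other rows of the batch serve as negative candidates."""
    scores = context_vecs @ candidate_vecs.t()                   # [batch, batch] dot products
    targets = torch.eye(scores.size(0), device=scores.device)    # 1 on the diagonal (gold)
    return F.binary_cross_entropy_with_logits(scores, targets)

def vqa_classification_loss(joint_vecs, class_head, answer_targets):
    """class_head: a linear layer over the most frequent answers;
    answer_targets: multi-hot [batch, n_answers]."""
    logits = class_head(joint_vecs)
    return F.binary_cross_entropy_with_logits(logits, answer_targets)

# For multi-tasking, batches are drawn from each task with equal probability, so every
# task receives the same number of updates per epoch.
```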

[1] As used in Pythia [Singh et al., 2019; Singh et al., 2018] (https://github.com/facebookresearch/pythia).

5 Experiments

We now describe our experiments, in which we perform analysis and ablations of the different kinds of modules and inputs we use for training, and final results on our full architecture. For all models we choose hyperparameters on the validation set(s) and report on the test set; for VQA, the numbers are reported on the test-dev set. All experiments were conducted in ParlAI [Miller et al., 2017], and we plan to make the code publicly available.

Text Encoding. We consider different possible text encodings with different pre-training schemes: starting from random weights before training on our language + vision tasks; starting from initialized word embeddings from fastText [Joulin et al., 2017] only; starting from BERT weights [Devlin et al., 2019]; and starting from two versions of pushshift.io Reddit training, i.e. Transformers with 79M parameters from [Mazaré et al., 2018] and 128M parameters from [Humeau et al., 2019]. After initialization we then fine-tune the entire TransResNet-MMC, using ResNeXt-IG-3.5B image features, on three tasks separately: COCO, Flickr30k and Image Chat. The results are given in Table 3.

We observe large improvements in accuracy with more text pre-training, for example on COCO going from 40.7 with no pre-training to 50.1 with BERT. BERT outperforms pushshift.io Reddit-128M slightly on COCO and Flickr30k, whereas pushshift.io Reddit-128M outperforms BERT on Image Chat. We hypothesize this is because the language is more formal on COCO and Flickr, matching BERT, which is trained with Wikipedia, whereas Image Chat is more colloquial, matching pushshift.io Reddit. However, the results are close, and on average pushshift.io Reddit-128M does slightly better. We thus use the latter in all subsequent experiments.[2]

Image Encoding. We next consider different possible image encodings, via different architectures and pre-training schemes: ResNeXt-IG-3.5B [Mahajan et al., 2018], Faster R-CNN features [Girshick et al., 2018], and finally a combination of ResNeXt-IG-3.5B and Faster R-CNN features. After initialization we then fine-tune the entire TransResNet-MMC on four tasks: COCO, Flickr30k, Image Chat and VQA. We evaluate these settings both with single-task fine-tuning and with multi-task training. The results are given in Table 2.

Faster R-CNN features are superior on VQA, which requires fine-grained localization of objects in order to answer questions, while ResNeXt-IG-3.5B features are superior on Flickr30k and Image Chat, which require a wide array of commonsense knowledge of different scenes. On average across the tasks (last column), however, they provide similar performance. As they provide different qualities, they are a good candidate for combination. We thus provide both as input to our model and obtain superior single-task results on COCO, Flickr30k and VQA, with results on Image Chat as good as with ResNeXt-IG-3.5B and better than with Faster R-CNN. Multi-tasking performance also improves over previous results. We thus adopt this combination strategy in subsequent experiments.

[2] Another choice would have been to combine them, but we did not do that here.

Image Encoder | Training | COCO | Flickr30k | Image Chat | VQA | Avg
ResNeXt-IG-3.5B | ST | 50.7 | 75.3 | 56.4 | 61.9 | 61.1
ResNeXt-IG-3.5B | MT | 48.0 | 77.0 | 56.2 | 62.0 | 60.8
Faster R-CNN | ST | 49.3 | 68.2 | 54.2 | 66.3 | 59.5
Faster R-CNN | MT | 52.1 | 72.4 | 53.2 | 66.3 | 61.0
ResNeXt-IG-3.5B + Faster R-CNN | ST | 57.3 | 79.7 | 56.4 | 67.0 | 65.1
ResNeXt-IG-3.5B + Faster R-CNN | MT | 51.2 | 81.7 | 55.2 | 66.4 | 63.7

Table 2: Comparison between image representations as part of our TransResNet-MMC architecture, with either single-task (ST) or multi-task (MT) training, evaluating on COCO, Flickr30k, Image Chat and VQA, and reporting average (Avg) performance across the tasks.

Text Encoder | COCO | Flickr30k | Image Chat | Avg
from scratch | 40.7 | 65.5 | 37.6 | 48.0
fastText init | 44.9 | 69.0 | 45.6 | 53.2
BERT | 50.1 | 72.0 | 52.1 | 58.1
Reddit-79M | 44.3 | 68.4 | 50.3 | 54.3
Reddit-128M | 48.8 | 71.8 | 55.2 | 58.6

Table 3: Comparison between text encoding Transformer pre-training methods when used as part of TransResNet-MMC, reporting accuracy on the respective test sets of three tasks as well as the average (Avg).

Early Stop | COCO | Fl30k | PC | IC | ICQA | VQA
COCO | 54.0 | 83.4 | 55.0 | 50.5 | 43.9 | 66.1
Fl30k | 51.4 | 83.0 | 55.9 | 53.1 | 47.2 | 60.3
IC | 52.4 | 81.3 | 58.8 | 55.9 | 51.4 | 66.5
VQA | 53.4 | 81.9 | 58.0 | 54.0 | 30.6 | 66.6
Avg | 51.2 | 81.7 | 58.0 | 55.2 | 49.9 | 66.4

Table 4: Training TransResNet-MMC on all tasks but only performing early stopping on one specific dataset, compared to stopping on the average accuracy across all datasets ("Avg").

Fine-Tune | COCO | Fl30k | PC | IC | ICQA | VQA
COCO | 59.6 | 76.5 | 34.0 | 31.8 | 30.0 | 58.2
Flickr30k | 50.7 | 84.0 | 54.2 | 52.1 | 47.1 | 60.8
IC | 52.4 | 81.3 | 58.8 | 55.9 | 51.4 | 66.5
VQA | 36.6 | 65.6 | 47.1 | 38.6 | 30.7 | 66.2
All | 51.2 | 81.7 | 58.0 | 55.2 | 49.9 | 66.4

Table 5: Training TransResNet-MMC on all tasks and then fine-tuning on each of the tasks, compared to the original best performing multi-task model (called "All").

VQA model training | class. head | ranking head
Classification head | 67.0 | n/a
Ranking head | n/a | 54.0
Multi-head training | 66.1 | 63.5

Table 6: Training VQA with either a classification head, a ranking head, or multi-tasking both. Multi-tasking both helps the ranking head improve.

Image features | Combiner | COCO | Fl30k | IC
ResNeXt-IG-3.5B | w/o MMC | 48.8 | 71.8 | 55.2
ResNeXt-IG-3.5B | with MMC | 50.7 | 75.3 | 56.6
ResNeXt-IG-3.5B + Faster R-CNN | w/o MMC | 53.6 | 75.6 | 46.9
ResNeXt-IG-3.5B + Faster R-CNN | with MMC | 57.3 | 79.7 | 56.4

Table 7: Comparison with and without (w/o) the multimodal combiner (MMC) as part of our TransResNet architecture, for COCO, Flickr30k (Fl30k) and Image Chat (IC), using either ResNeXt-IG-3.5B features alone or in combination with Faster R-CNN features. The MMC provides gains in all cases.

Multimodal Combiner. We next assess the impact of the multimodal combiner module in our architecture; we first analyze the non-attentive version. We consider either using it or replacing it with a simple sum over feature type representations, see Section 4.3. We compare these alternatives on three tasks: COCO, Flickr30k and Image Chat, and examine performance both for our best performing combination features as well as for ResNeXt-IG-3.5B alone. The results are given in Table 7. We see that without this component of the architecture, the model can still give somewhat reasonable performance. However, by combining modalities with a Transformer architecture we do see improvements across all tasks. The MMC module takes as input a sequence-based representation of the context history (token-level representation). We also experimented with giving the mean sequence representation as input to the MMC instead, which gave worse results (a 5% regression on IC). We thus report subsequent experiments using the full combiner with sequence-based inputs.

Freezing versus Fine-Tuning Encoders. We compare the performance of our models when either freezing the image and text encoders after pre-training, or fine-tuning them in addition to the multimodal combiner on the language and vision tasks. If they are frozen, only the multimodal combiner is trained. Table 8 presents the results, comparing multi-task performance across several of our tasks. There are clear wins from fine-tuning the encoders on the multimodal training data.

Ranking vs. Classification Head. We compare the performance of training VQA with the classification and ranking heads, or training both at the same time. The results are shown in Table 6.

Training with a classification head alone (first row) provides the best performance on VQA. However, transfer to other tasks, shown by evaluating them using the ranking head, gives poor results, understandably as that head has not been trained for them (see Table 10, row 4). Using a ranking head to train VQA gives far worse performance on VQA. We attribute this to the subsampling of negative candidates in the loss function, rather than considering ~3k possible candidates at once in the classification head. The last row of Table 6 shows the performance of training both heads at once and then evaluating the two heads. This dramatically improves the performance of the ranking head on VQA, as the classification head helps the model attain good weights.

Encoders | COCO | Fl30k | PC | IC | ICQA | VQA
Freeze | 27.9 | 57.2 | 40.6 | 40.6 | 37.6 | 64.5
Fine-tune | 51.2 | 81.7 | 58.0 | 55.2 | 49.9 | 66.4

Table 8: Training TransResNet-MMC on all tasks, with the text and image encoders either frozen or fine-tuned.

Single-Task Results. Using our best approach, we now report final results fine-tuned on each task independently. The results are given in Table 10. We report results across all evaluation sets for a given training target. E.g., the first row shows the model performance when training with COCO, evaluated on the test sets of COCO, Flickr30k, Personality Captions, Image Chat, Image Chat QA, IGCQ, IGCQA and VQA. As expected, we observe better results on the test set of the task being trained on than on other test sets. However, there is some transfer between some of the tasks. For example, training on COCO gives non-trivial performance on Flickr30k, and vice-versa, although Flickr30k training helps less, probably due to its smaller size.

Multi-Task Results. Results of our multi-task models are given in Table 10, last four rows. We first assess the performance of the MMC MT model (without an attentive multimodal combiner). We achieve marginally superior performance on Personality Captions and Flickr30k, and marginally inferior performance on the other tasks, compared to our best single-task (ST) models, but in a single conversational agent. The final column showing average performance across tasks makes this point clear, as those numbers are vastly superior for the multi-task models. Like our single-task counterparts, many of these results are still well above the previous state of the art, e.g. on Personality Captions and Image Chat, and within range of the state of the art on COCO and Flickr30k.

Multi-Task Results with Attentive Multimodal Combiner. We next assess the effect of multi-task training with multiple Transformers in the multimodal combiner (2, 3 or 4 instead of the single Transformer in our base architecture). The bottom four rows in Table 10 show that using the attentive multimodal combiner leads to improvements in average performance over all tasks, with 2AMMC achieving the best results on the PC and IC tasks of all methods, and 3AMMC being slightly better on average. Note that the early stopping criterion for these experiments is the average performance over all tasks, which leads to performance gains shifting between tasks among architectures, while the average itself is controlled. This could be altered by selecting a different stopping criterion, as detailed further below and in Table 4. Table 11 breaks down the performance obtained on all tasks by each of the Transformers in the 3AMMC. There are striking differences between the tasks as to how performance is split among the three MMCs: on VQA, MMC-1 and MMC-2 have near 0 performance while MMC-3 performs as well as the full system, but this comes at the expense of much worse performance on all the conversational tasks compared to MMC-1 and MMC-2. On PC, MMC-1 performs nearly as well as the full system and much better than MMC-2 and MMC-3. The overall best performance on all other tasks requires combining all three MMCs. To see if the performance gains of AMMC come just from the network being larger, we compare to MMC modules with more layers, up to an equivalent size; see Table 12. The results show that making the standard MMC larger only hurts performance. Increasing the number of MMC heads similarly degrades performance (results not shown). These results highlight the benefits of the AMMC design.

Multi-Tasking Small vs. Large Tasks. The tasks where we do see a performance gain when multi-tasking, Flickr30k and Personality-Captions, tend to be smaller tasks for which a larger related task is also in the multi-tasking set, in this case COCO and Image Chat. To investigate the effects of training set size on multi-tasking transfer, we thus conducted experiments to see if we observe the same improvement on another dataset if we downsample it. We consider adjusting the training set size of COCO to be the same size as Flickr30k, and then consider multiples of that size, and observe the change in performance with changing size. We compare single-task training on that subset to multi-task training with all other tasks and that subset. For these experiments we considered a smaller hyperparameter sweep for simplicity, with a multimodal combiner of 2 layers, and sweep across different numbers of heads for the multi-head attention, explaining the slightly lower results. We perform early stopping on COCO. The results are given in Table 14. We observe for single-task training a drop from 54.0% accuracy to 42.1% as we go from 83k examples down to 29k. Multi-tasking with the full COCO dataset also yields the same 54.0% accuracy, which makes it appear that multi-tasking is not useful for generalization. However, subsampling COCO reveals a different story: the smaller the training set, the more the multi-tasking helps, with a gap of 42.1% to 49.3% in the 29k training example case. As researchers who construct new tasks often collect large-scale datasets, this means multi-tasking will often have less effect than is observed in a few-shot setup.

Multi-Tasking + Single-Task Fine-Tuning. While our goal is to make a single agent that is good at all our tasks, we also investigate if multi-tasking can help improve performance on a single task, by either multi-tasking and early stopping on a particular task, or multi-tasking and then fine-tuning on a particular task.

Early stopping test results are shown in Table 4. We report for each task out of COCO, Flickr30k, Image Chat and VQA the performance on the test set of that task itself, as well as transfer performance to the other tasks. These results can be compared to the results of optimizing multi-task performance in the last row ("Avg"); see also Table 10 (sixth row).

Model | COCO | Flickr30k | PC | IC | ICQA | IGCQ | IGCQA | VQA
SCAN [Lee et al., 2018] | 50.4 | 67.4 | - | - | - | - | - | -
SCG [Shi et al., 2019] | 56.6 | 71.8 | - | - | - | - | - | -
Unicoder-VL [Li et al., 2019a] | 62.3 | 86.2 | - | - | - | - | - | -
Unicoder-VL w/o pre-training | - | 73.0 | - | - | - | - | - | -
UNITER Base | 63.3 | 84.7 | - | - | - | - | - | 72.3
UNITER Large | 66.6 | 88.2 | - | - | - | - | - | 73.2
HDC [Nguyen and Okatani, 2019] | 42.2 | 71.6 | - | - | - | - | - | 69.3
VisualBERT (ST) [Li et al., 2019b] | - | - | - | - | - | - | - | 70.8
VisualBERT (ST) w/o pre-training | - | - | - | - | - | - | - | 70.2
ViLBERT (ST) [Lu et al., 2019a] | - | - | - | - | - | - | - | 70.6
ViLBERT (ST) w/o pre-training | - | - | - | - | - | - | - | 69.0
Pythia [Jiang et al., 2018] [3] | - | - | - | - | - | - | - | 66.7
Pythia [3] | - | - | - | - | - | - | - | 69.2
ViLBERT (MT) [Lu et al., 2019c] | - | - | - | - | - | - | - | 72.6
ViLBERT (MT + FT) | - | - | - | - | - | - | - | 73.2
TransResNet [Shuster et al., 2018] | 44.3 | 68.4 | 53.5 | 50.3 | 49.2 | 21.7 | 22.4 | -

Table 9: Previous state-of-the-art results. * indicates results achieved by training with some or all of the validation set added to the train set, whereas we only train on the train set. Note that the results in this table correspond to a different model for each column, as the architectures are fine-tuned on each task separately rather than training a single architecture in a multi-task way. The ViLBERT (MT) model is a multi-task model, but uses image retrieval settings on COCO and Flickr30k that are not comparable to the results presented here. The two Pythia entries are the same model trained with the VQA train set and the VQA train + valid set, respectively; thus we list both numbers.

Arch | Training data | COCO | Flickr30k | PC | IC | ICQA | IGCQ | IGCQA | VQA | Avg
MMC | ST COCO | 57.2 | 69.4 | 24.0 | 16.5 | 13.1 | 13.4 | 10.3 | 0.3 | 25.5
MMC | ST Flickr30k | 27.7 | 79.7 | 23.0 | 16.3 | 13.8 | 15.8 | 12.2 | 0.2 | 23.6
MMC | ST IC | 20.0 | 40.5 | 57.3 | 56.3 | 55.2 | 35.8 | 43.3 | 0.4 | 38.6
MMC | ST VQA | 0.0 | 0.3 | 1.2 | 1.3 | 1.2 | 1.6 | 1.9 | 67.0 | 9.3
MMC | MT + FT | 59.6 | 84.0 | 58.8 | 55.9 | 51.4 | 30.2 | 41.1 | 66.5 | 56.0
MMC | MT | 51.2 | 81.7 | 58.0 | 55.2 | 49.9 | 25.7 | 38.4 | 66.4 | 53.3
2AMMC | MT | 54.2 | 82.0 | 59.5 | 56.9 | 52.3 | 28.1 | 38.1 | 65.6 | 54.6
3AMMC | MT | 52.7 | 82.9 | 58.5 | 56.1 | 52.4 | 31.4 | 39.8 | 66.9 | 55.1
4AMMC | MT | 53.2 | 81.8 | 58.7 | 56.2 | 54.5 | 31.8 | 35.8 | 65.9 | 54.8

Table 10: Multi-tasking test results of our models. The first four rows show the transfer performance of our TransResNet-MMC model trained on a single task (ST), indicated in the Training data column. The fifth row shows a multi-task model which is then fine-tuned (MT+FT) on each single task separately (each column corresponds to a separate model, so we report its average performance for reference only). The bottom four rows compare performance of single multi-task models with different types of multimodal combiners. The multi-task performance is close to single-task performance, and in some cases better, across several tasks. The attentive multimodal combiner (AMMC) obtains the best overall average performance.

Arch | COCO | Flickr30k | PC | IC | ICQA | IGCQ | IGCQA | VQA | Avg
3AMMC (MMC-1) MT | 24.7 | 61.6 | 48.9 | 45.1 | 27.8 | 24.0 | 22.1 | 1.1 | 32.0
3AMMC (MMC-2) MT | 31.1 | 50.6 | 19.5 | 26.0 | 27.8 | 27.4 | 33.5 | 0.0 | 27.0
3AMMC (MMC-3) MT | 31.0 | 61.9 | 21.3 | 13.0 | 9.5 | 8.9 | 13.0 | 66.9 | 28.2

Table 11: Results on each dataset when we evaluate our 3AMMC model by only taking a single MMC output as the context representation. The first MMC alone already gives good performance on PC and IC, and the third on VQA. All three are needed for some of the tasks.

MMC Arch (compare to) | COCO | Flickr30k | PC | IC | ICQA | IGCQ | IGCQA | VQA | Avg
4 layers (2AMMC) MT | 51.4 | 81.0 | 56.0 | 53.3 | 48.5 | 29.1 | 39.9 | 66.7 | 53.3
6 layers (3AMMC) MT | 48.6 | 78.1 | 57.1 | 53.4 | 49.7 | 30.8 | 36.8 | 66.1 | 52.6
8 layers (4AMMC) MT | 35.6 | 65.6 | 36.2 | 33.6 | 31.8 | 26.6 | 26.8 | 59.0 | 39.4

Table 12: Test results when we train our MMC models with varying numbers of layers, which we compare to our AMMC model sizes. Increasing the number of MMC layers only hurts performance.

Task: COCO | TransResNet-MMC: "there is a broken tree log on the ground"
Task: COCO | TransResNet-MMC: "A large grass covered field under a mountain"
Task: Flickr30k | TransResNet-MMC: "A chaparral landscape scene void of human residence"
Task: Flickr30k | TransResNet-MMC: "A plane flying sideways"
Task: VQA | Context: "What is the color of the mountain?" | TransResNet-MMC: "gray"
Task: VQA | Context: "Does it appear to be rainy?" | TransResNet-MMC: "no"
Task: Personality Captions (Style: Happy) | TransResNet-MMC: "Wow what a beautiful and perfect shade of pink and red. I am so captivated."
Task: Personality Captions (Style: Attractive) | TransResNet-MMC: "Wow I would love for someone to create this setting in the sand for me."
Task: Image Chat (Style: Compassionate) | Context (Round 1): "Something about the pattern calms me." | TransResNet-MMC: "The architecture calms you"
Task: Image Chat (Style: Emotional) | Context (Round 1): "Airplanes are scary to get on, you never know if it will crash or not." (Round 2): "But these are professional pilots though." | TransResNet-MMC: "They are, and for many people they mean a lot. My grandfather loved planes."

Table 13: Example outputs from our TransResNet-MMC multi-task model for different tasks. (Each example is grounded on a test image, not reproduced here.)

There are clear gains for each task that is early stopped, but at a large expense for the other tasks. For example, fine-tuning on COCO gives 54.0, compared to 51.2 when multi-tasking, but is still worse than the 57.2 when training as a single task. Transfer to Flickr30k is still good, likely as they are similar tasks, but Image Chat results are then poor. On Flickr, the early stopping result of 83.0 is superior to both the multi-task result of 81.7 and the single-task result of 79.7. This can be explained by Flickr30k being smaller than COCO, and thus benefiting from multi-tasking more, as we explained in the previous section.

Multi-tasking followed by single-task fine-tuning test results are shown in Table 5 (also summarized in Table 10). Generally these are superior to the multi-tasking per-task early stopping results. For example, fine-tuning on COCO gives 59.6, compared to 54.0 when multi-tasking and early stopping, or even 57.2 when training as a single task, so it is the best result we obtain over all methods.

COCO Train Size | Multi-Task | Single-Task
1.0x Flickr30k (29,000) | 49.3 | 42.1
1.5x Flickr30k (43,500) | 51.6 | 50.3
2.0x Flickr30k (58,000) | 53.7 | 51.9
2.5x Flickr30k (72,500) | 53.8 | 53.6
Full size (82,783) | 54.0 | 54.0

Table 14: Accuracy on the COCO test set when downsampling COCO during training to the same size as the Flickr30k training set, or multiples thereof. Smaller training sets are clearly helped by multi-tasking; eventually, there is enough data for the single task alone.

We also achieve our best results in this fashion on Flickr30k. For VQA, the validation results were higher (not shown) in this setting, but resulted in slightly inferior test numbers, so a small amount of overfitting occurs.

Comparison to Existing Results. We give results from previous work in Table 9. Our results compare favorably on the conversational tasks, i.e. PC, IC, ICQA, IGCQ and IGCQA. For the COCO, Flickr30k and VQA tasks, our results are within range of the state of the art, but are surpassed by some of the methods. We note that on COCO others used the validation set for training, whereas we did not (see Sec. 2.3; we do not want multi-task experiment train and valid data to overlap). For VQA we report the number from Pythia [3] as a comparison point, as that method uses the train set only, without VQA data augmentation from the Visual Genome, VisDial, or other data augmentations (similar to us), and we used their setup as a starting point for our implementation. Our numerical results are comparable to theirs.

Larger-Scale Cross-Module Pre-training. Some of the best-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al., 2019; Li et al., 2019a; Lu et al., 2019a], which leads to better performance but requires gigantic multimodal datasets like Conceptual Captions [Sharma et al., 2018] or Visual Genome [Krishna et al., 2017], as shown in Table 15, as well as high computing resources (e.g., 882 and 3645 V100 GPU hours for UNITER-base and UNITER-large, respectively). Pre-training on COCO alone, as done in [Li et al., 2019b], gives more limited improvement (see Table 9). Our approach combines vision and text encoders with minimal additional multimodal training resources required. Even counting all the multi-task datasets used for training adds up to only 1M image-sentence (I-S) pairs, resulting in training that takes around 40 V100 GPU hours. We expect larger-scale cross-module pre-training would also improve the performance of our models, but this is beyond the scope of this work.

Example Predictions. We show example predictions of our MMC multi-task model in Table 13. We take test images, and for COCO, Flickr30k, Personality Captions and Image Chat we show the top-ranked candidate using the ranking head, ranking all utterances from the given training set. For VQA, we show the output of the classification head.

[3] https://learnpythia.readthedocs.io/en/latest/tutorials/pretrained_models.html#pretrained-models

Model | Pre-training Dataset | Size (I-S Pairs)
UNITER | COCO, VG, CC, SBUC | 9.6M
ViLBERT | CC | 3.0M
Unicoder-VL | CC, SBUC | 3.8M

Table 15: Sizes of multimodal pre-training datasets in terms of image-sentence (I-S) pairs. Our model obtains comparable results on all tasks without any cross-module pre-training on large datasets such as Visual Genome (VG), Conceptual Captions (CC), or SBU Captions (SBUC). Thus, multi-tasking can be viewed as a strong alternative to large-scale pre-training, considering its simplicity and effectiveness in terms of computation power.

We observe that the same underlying model can produce a diverse range of outputs depending on the task, ranging from factual captioning to conversations grounded on the image.

6 Conclusion

In order to build an image-grounded conversational agent, we have assembled disparate multimodal tasks and built a single architecture for multi-task training on them, incorporating a novel attentive multimodal combination module. Through detailed analysis and ablations, we have shown that our approach can obtain strong performance across a number of tasks. Future work could investigate further how these skills are blended during interaction, rather than evaluating them as stand-alone tasks, and consider more tasks.

7 Acknowledgements

We are grateful to Amanpreet Singh and Vedanuj Goswami for providing help and advice, comparison results, and Faster R-CNN image features. We also thank Mary Williamson and Eric Smith for very useful discussions.

References

[Bakhtin et al., 2019] Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. Real or fake? Learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019.
[Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155, 2003.
[Chen et al., 2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[Chen et al., 2019] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning UNiversal Image-TExt Representations. arXiv preprint arXiv:1909.11740, 2019.
[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160-167. ACM, 2008.
[Das et al., 2017] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326-335, 2017.
[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[Girshick et al., 2018] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[Goyal et al., 2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[Humeau et al., 2019] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969, 2019.
[Jiang et al., 2018] Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: the winning entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956, 2018.
[Joulin et al., 2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427-431, Valencia, Spain, April 2017. Association for Computational Linguistics.
[Karpathy and Fei-Fei, 2017] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):664-676, April 2017.
[Krishna et al., 2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32-73, 2017.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[LeCun et al., 1990] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396-404, 1990.
[Lee et al., 2018] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. arXiv preprint arXiv:1803.08024, 2018.
[Li et al., 2019a] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019.
[Li et al., 2019b] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[Lu et al., 2019a] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
[Lu et al., 2019b] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, 2019.
[Lu et al., 2019c] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, December 2019.
[Mahajan et al., 2018] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Proceedings of the European Conference on Computer Vision, pages 185-201, Cham, 2018. Springer International Publishing.
[Mazaré et al., 2018] Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775-2779, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
[McCann et al., 2018] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[Miller et al., 2017] Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79-84, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
[Mostafazadeh et al., 2017] Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462-472, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.
[Nguyen and Okatani, 2019] Duy-Kien Nguyen and Takayuki Okatani. Multi-task learning of hierarchical vision-language representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10492-10501, 2019.
[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[Raffel et al., 2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.
[Shi et al., 2019] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5182-5189. International Joint Conferences on Artificial Intelligence Organization, July 2019.
[Shuster et al., 2018] Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945, 2018.
[Shuster et al., 2019a] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image captioning via personality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12516-12526, 2019.
[Shuster et al., 2019b] Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768, 2019.
[Singh et al., 2018] Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia - a platform for vision & language research. In SysML Workshop, NeurIPS, volume 2018, 2018.
[Singh et al., 2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Su et al., 2019] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[Tan and Bansal, 2019] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099-5110, Hong Kong, China, November 2019. Association for Computational Linguistics.
[Vaswani et al., 2017a] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc., 2017.
[Vaswani et al., 2017b] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, June 2017.
[Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67-78, 2014.


Modalities          Task                  Train Images  Train Utter.  Valid Images  Valid Utter.  Test Images  Test Utter.  Cands
Language            Wiki + Tor. Books     -             150m          -             -             -            -            -
Language            pushshift.io Reddit   -             174m          -             -             -            -            -
Vision              ImageNet              1.28m         -             -             -             -            -            -
Vision              Instagram             3.5b          -             -             -             -            -            -
Vision              Visual Genome         108,077       -             -             -             -            -            -
Vision + Language   COCO                  82,783        414,113       5,000         25,000        5,000        25,000       5,000
Vision + Language   Flickr30k             29,000        145,000       1,014         5,070         1,000        5,000        1,000
Vision + Language   Personality-Captions  186,858       186,858       5,000         5,000         10,000       50,000       500
Vision + Language   Image-Chat            186,782       355,862       5,000         15,000        9,997        29,991       100
Vision + Language   Image-Chat QA         19,702        19,702        1,129         1,129         2,224        2,224        100
Vision + Language   IGC                   -             -             1,613         4,839         2,591        7,773        100
Vision + Language   VQA                   82,783        443,757       40,504        214,354       81,834       447,793      3,129

Table 1: Dataset statistics for all relevant datasets. During evaluation, gold responses are scored against other candidates (Cands).

2.3 Vision + Language
In the combined tasks we consider, images and language are possible inputs, and the output is a text response from the agent. The goal is that the tasks, when multi-tasked, can teach an agent how to respond appropriately in different situations using different skills.

COCO Captions The COCO Captions dataset [Chen et al., 2015] requires that a model, given an image, predicts a caption that factually summarizes the scene, for example "a large bus sitting next to a very tall building". In the dataset used for the 2015 challenge, there are about 83k training images and 414k captions, as images are captioned multiple times, and a large validation set of about 40k images. Some works have merged some or all images from that validation set into the training set (we indicate this with an asterisk in Table 9). In this work we only train on the 83k images of the original train set, to avoid training on images that also appear in the VQA validation set, and use the validation and test sets of 5k images each from [Karpathy and Fei-Fei, 2017].

Flickr30k Flickr30k [Young et al., 2014] is also a captioning dataset with factual summaries, although it is smaller, with 29k training images and 145k captions.

Personality Captions (PC) In contrast to the previous two datasets, Personality Captions [Shuster et al., 2019a] attempts to model human style when people speak about images. While the training set also consists of (image, response) pairs, each one also has a given style label out of 215 possible styles, such as "Sympathetic", "Optimistic" or "Dramatic". The captions authored by humans then tend to be less factual in tone, and rather than simply stating what is in the image, they are more conversational, e.g. "This sandwich looks so delicious! My goodness!". It consists of about 187k training images, with one caption each.

Image Chat (IC) Image Chat [Shuster et al., 2018] is an extension of the Personality Captions dataset to full dialogue. It also uses the same 215 style traits and images as input, but human-human conversations have been collected based on the images and traits instead, with each speaker pair in a given chat being assigned a possibly different random trait. The training set consists of the same 187k images, with 356k total conversation turns.

Image Chat QA (ICQA) Image Chat QA is the extraction of all the question-answer pairs that appear in the Image Chat dataset, to evaluate performance in answering such conversational image-grounded questions. The questions have been extracted heuristically, by assuming a question contains a "?" or starts with who, what, when, where, why or how. This extracts about 20k such training questions.
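The exact extraction rule is not given beyond this description; a minimal sketch of such a heuristic, assuming plain string matching on lowercased utterances, might look like the following (the function names are illustrative, not from the released code).

```python
# Hypothetical sketch of the question-extraction heuristic described above:
# keep an utterance if it contains a question mark or starts with a wh-word.
WH_WORDS = ("who", "what", "when", "where", "why", "how")

def is_question(utterance: str) -> bool:
    text = utterance.strip().lower()
    return "?" in text or text.startswith(WH_WORDS)

def extract_qa_pairs(turns):
    # Pair each detected question with the turn that immediately follows it.
    return [(q, a) for q, a in zip(turns, turns[1:]) if is_question(q)]
```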

Image-Grounded Conversations (IGCQ and IGCQA) Image-Grounded Conversations (IGC) [Mostafazadeh et al., 2017] is also a conversational dataset between pairs of humans given an image. It does not contain a training set, but only validation and test portions. The conversations are three turns each, in the format of (context, question, response) tuples. We refer to the task of forming a question given the context as IGCQ, and the task of responding to the question as IGCQA.

VQA Visual QA [Goyal et al., 2017] is a task involving open-ended questions about images which require an understanding of vision, language, and commonsense knowledge to answer, such as "where is the child sitting?" or "who is wearing the glasses?". It contains 83k training images and 444k QA pairs. Note this line of work has also been extended to multiple questions in sequence [Das et al., 2017], but we do not consider that task here.

3 Related Work
Separately in the NLP and vision fields, large advances have recently been made in the quality of learnt representations.

In NLP, word embedding representations [Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013; Joulin et al., 2017] have given way to multi-sentence, multi-layer, self-attentive representations through Transformers, with pre-training on large corpora such as Wikipedia and Toronto Books [Vaswani et al., 2017a; Radford et al., 2018; Devlin et al., 2019; Bakhtin et al., 2019; Radford et al., 2019]. In dialogue, it has been shown that pre-training on utterances from large-scale conversational corpora such as pushshift.io Reddit improves over large pre-training over resources like Wikipedia, because they are more related to the task [Mazaré et al., 2018; Humeau et al., 2019; Shuster et al., 2019b]. When training on downstream tasks, multi-tasking language tasks is also starting to become a more explored area [Collobert and Weston, 2008; McCann et al., 2018; Raffel et al., 2019].

In vision, conventional convolutional neural networks [LeCun et al., 1990; Krizhevsky et al., 2012] have been upgraded and improved by deeper ResNet architectures that incorporate skip connections [He et al., 2016], trained through ImageNet [Deng et al., 2009]. On tasks such as VQA, which explicitly ask questions about object properties, Faster R-CNN features [Girshick et al., 2018], which incorporate object detection algorithms, have been shown to perform well. On tasks with large coverage of everyday images and commonsense knowledge about them, Instagram training has been shown to perform well [Mahajan et al., 2018; Shuster et al., 2018; Shuster et al., 2019a].

Given this improved performance across different modalities, a natural next step is methods that combine these approaches for multimodal tasks involving language and vision. Several recent approaches have been built with this goal, including ViLBERT [Lu et al., 2019a], VisualBERT [Li et al., 2019b], LXMERT [Tan and Bansal, 2019], Unicoder-VL [Li et al., 2019a], VL-BERT [Su et al., 2019] and UNITER [Chen et al., 2019]. A common theme is to borrow some of the pre-training ideas from BERT, but apply them to pre-training both language and vision, and then fine-tune these models on downstream tasks. Another recent work multi-tasks 12 vision and language tasks at once [Lu et al., 2019b]. Somewhat differing from our work, the end tasks considered are not aimed at building a unified conversational agent where the output is dialogue, but include any task involving language and vision of some form, for example caption-based image retrieval, referring expressions, and region-to-phrase grounding, most of which we do not consider here. Recently, [Shuster et al., 2019b] proposed multi-tasking to build a conversational agent, but using mostly language-only tasks (10 tasks), although it does include two of the image tasks we consider here.

4 Methods

Our model is a retrieval architecture that outputs a candidate response from the training set. Like most multimodal architectures, it comprises a text encoder, an image encoder, and a way to combine the two. However, unlike recent models that use various cross-attention mechanisms to get the joint representation of the final context, our model simply uses a so-called multimodal combiner. An extra style encoder is also added to represent the different style traits inside the Personality Captions and Image Chat tasks, and to differentiate them from the other tasks. The model finally scores possible output candidates, using either a ranking or a classification head, depending on the task. An overview of the model architecture is given in Figure 1.

Figure 1: Overview of our model in its TransResNet-3AMMC (Attentive Multimodal Combiner) variant. The non-attentive variant (TransResNet-MMC) has a single Transformer combiner instead of the three split Transformers followed by a weighted sum shown here.

4.1 Text Encoders
We use two text encoders, one for the input context and one for the output candidates. The output candidate encoder encodes the candidate utterances that will be scored, and the context encoder encodes the text input. Depending on the task, the context can be the previous dialogue history, a question posed about the image, or a combination of the two. Both text encoders are pre-trained Transformers. The final output of the candidate encoder is a single vector per candidate, obtained by taking the mean of the per-token encodings. For the context encoder we retain the per-token output encodings, thus comprising the same length as the input sequence, for input into the multimodal combiner. During multimodal training we fine-tune both text encoders.
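A minimal sketch of the two pooling strategies (mean-pooled candidate vectors versus per-token context outputs), assuming the pre-trained Transformer encoders are available as PyTorch modules returning [batch, sequence, dim] tensors; the class and argument names here are ours, not the paper's code.

```python
import torch
import torch.nn as nn

class TextEncoders(nn.Module):
    """Illustrative wrapper around two separate pre-trained text Transformers."""
    def __init__(self, context_encoder: nn.Module, candidate_encoder: nn.Module):
        super().__init__()
        self.context_encoder = context_encoder      # fine-tuned during multimodal training
        self.candidate_encoder = candidate_encoder  # fine-tuned during multimodal training

    def forward(self, context_tokens, candidate_tokens):
        # Context: keep per-token outputs [batch, seq_len, dim] for the multimodal combiner.
        context_out = self.context_encoder(context_tokens)
        # Candidates: mean-pool over tokens, one vector per candidate [num_cands, dim].
        cand_out = self.candidate_encoder(candidate_tokens).mean(dim=1)
        return context_out, cand_out
```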

4.2 Image Encoder
We consider two possible image encoders in our model. The first is a ResNeXt-based model trained on 3.5 billion Instagram images [Mahajan et al., 2018], which we refer to as ResNeXt-IG-3.5B. This encodes the image into a 2048-dimensional vector. The weights are fixed during the subsequent training process. The second is an improved Faster R-CNN model [Girshick et al., 2018] trained on the Visual Genome dataset [Krishna et al., 2017]. We fix the network up to the fc6 layer and fine-tune the fc7 weights, as in [Jiang et al., 2018]. We extract 100 2048-dimensional vectors (100-channel Faster R-CNN features). In addition to trying these models independently, we also investigate using both of their features concatenated together as input to the multimodal combiner.
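As a rough sketch of the resulting feature shapes when both encoders are used (the tensor sizes follow the description above; the feature-extraction pipelines themselves are not reproduced here, and the random tensors are placeholders):

```python
import torch

batch = 4
resnext_feats = torch.randn(batch, 1, 2048)    # ResNeXt-IG-3.5B: one 2048-d vector per image
rcnn_feats = torch.randn(batch, 100, 2048)     # Faster R-CNN: 100 region vectors per image

# When both encoders are used, their features are concatenated along the
# sequence dimension before entering the multimodal combiner.
image_feats = torch.cat([resnext_feats, rcnn_feats], dim=1)  # [batch, 101, 2048]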

4.3 Multimodal Combiner
The Multimodal Combiner (MMC) is used to combine the encodings from different components of the model. This includes the 100-channel Faster R-CNN features, the ResNeXt-IG-3.5B features, the sequence-based context features (which depend on the text length), and the encoding from the style encoder. Prior to combination, the individual encodings are normalized with their own layer-norm layer; each is then fed into the MMC with a positional embedding to indicate which feature type it is. The Multimodal Combiner is simply a Transformer [Vaswani et al., 2017b] encoder without its embedding layer; thus, self-attention is applied to all features, which then go through linear layers. A mean operation is performed in the end to get a single vectorial representation of the whole multimodal context. This joint representation is then either used for a dot product with the candidate encodings (for ranking) or sent to an additional linear layer (for classification), as detailed in Sec. 4.5.
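A minimal PyTorch sketch of such a combiner, assuming all feature types have already been projected to a common dimension; the per-type layer norms, the feature-type embedding, the Transformer encoder, and the final mean pooling follow the description above, while layer counts, head counts, and names are placeholders rather than the authors' settings.

```python
import torch
import torch.nn as nn

class MultimodalCombiner(nn.Module):
    """Sketch of the (non-attentive) MMC: layer-norm each feature type, add a
    type embedding, run a Transformer encoder over the concatenated sequence,
    and mean-pool to a single joint context vector."""
    def __init__(self, dim=512, n_types=4, n_layers=2, n_heads=4):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(n_types))
        self.type_emb = nn.Embedding(n_types, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feature_seqs):
        # feature_seqs: list of tensors [batch, len_i, dim], one per feature type
        parts = []
        for i, feats in enumerate(feature_seqs):
            type_ids = torch.full(feats.shape[:2], i, dtype=torch.long, device=feats.device)
            parts.append(self.norms[i](feats) + self.type_emb(type_ids))
        combined = self.encoder(torch.cat(parts, dim=1))   # self-attention over all features
        return combined.mean(dim=1)                        # [batch, dim]

def score_candidates(context_vec, cand_vecs):
    # Ranking: dot product between the joint context vector and each candidate vector.
    return context_vec @ cand_vecs.t()                     # [batch, num_cands]
```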

4.4 Attentive Multimodal Combiner
We further propose an Attentive Multimodal Combiner (AMMC), shown in Fig. 1, where multiple Transformers are used and then combined through an attention mechanism, potentially allowing them to focus on different skills. We use the style encoding as the query, forward it to a linear layer of the same output dimension as the number of Transformers in the multimodal combiner (i.e., 2 to 4, denoted as 2AMMC, 3AMMC and 4AMMC), followed by a softmax. We hence use those outputs to perform a weighted sum of the outputs from all the multimodal Transformers. This attention mechanism thus learns which Transformers to rely on more for which inputs, allowing the model to switch between skills.
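A sketch of this attention over multiple combiners, reusing the MultimodalCombiner sketch above and assuming a style-encoding vector as the query; again, names and dimensions are illustrative rather than the actual implementation.

```python
import torch
import torch.nn as nn

class AttentiveMultimodalCombiner(nn.Module):
    """Sketch of an N-AMMC: N parallel combiners whose pooled outputs are mixed
    by attention weights computed from the style encoding."""
    def __init__(self, combiners, dim=512):
        super().__init__()
        self.combiners = nn.ModuleList(combiners)   # e.g. 2, 3 or 4 MMC modules
        self.attn = nn.Linear(dim, len(combiners))  # style vector -> one logit per combiner

    def forward(self, feature_seqs, style_vec):
        # Each combiner returns [batch, dim]; stack to [batch, N, dim].
        outputs = torch.stack([c(feature_seqs) for c in self.combiners], dim=1)
        weights = torch.softmax(self.attn(style_vec), dim=-1)        # [batch, N]
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # [batch, dim]
```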

4.5 Output Heads and Loss
For tasks like VQA where there is one factually correct answer and many wrong answers, it has been shown that strong performance can be achieved using a classification head, considering all the possible most frequent answers as classes, and using a binary cross-entropy loss.¹ We thus also consider this approach for VQA. For open-ended problems in which there may be multiple right answers, e.g. those in Image Chat, we consider an alternative approach of using a ranking head. In this approach, the gold label is contrasted with a subsample of negative candidates during training (chosen as the labels of other examples in the batch), but still using the binary cross-entropy loss, which scales well to huge candidate sets. We compare these two methods in this work, and also consider training both at the same time. We use batch sizes of 256 / 512 and Adam for optimization. For multi-tasking, we experimented with various kinds of dataset weighting schemes, but in the end we went for simplicity and report results of sampling from tasks equally, so that the same number of updates is done per task in an epoch; this was difficult to improve upon.

¹As used in Pythia [Singh et al., 2019; Singh et al., 2018] (https://github.com/facebookresearch/pythia).
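To make the ranking-head training of Sec. 4.5 concrete, here is a minimal sketch of scoring with in-batch negatives and a binary cross-entropy loss. It assumes one gold candidate per context and dot-product scoring; treating the diagonal of the score matrix as the positives is our reading of the description above, not code from the paper.

```python
import torch
import torch.nn.functional as F

def ranking_loss(context_vecs, gold_cand_vecs):
    """context_vecs, gold_cand_vecs: [batch, dim]. Each row's gold candidate is the
    positive; the other rows' gold candidates act as in-batch negatives."""
    scores = context_vecs @ gold_cand_vecs.t()                  # [batch, batch] logits
    targets = torch.eye(scores.size(0), device=scores.device)   # 1 on the diagonal
    return F.binary_cross_entropy_with_logits(scores, targets)
```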

5 Experiments
We now describe our experiments, in which we perform analysis and ablations of the different kinds of modules and inputs we use for training, and final results on our full architecture. For all models we choose hyperparameters on the validation set(s) and report on the test set; for VQA, the numbers are reported on the test-dev set. All experiments were conducted in ParlAI [Miller et al., 2017], and we plan to make the code publicly available.

Text Encoding We consider different possible text encodings with different pre-training schemes: starting from random weights before training on our language + vision tasks; starting from initialized word embeddings from fastText [Joulin et al., 2017] only; starting from BERT weights [Devlin et al., 2019]; and starting from two versions of pushshift.io Reddit training, i.e., Transformers with 79M parameters from [Mazaré et al., 2018] and 128M parameters from [Humeau et al., 2019]. After initialization we then fine-tune the entire TransResNet-MMC, using ResNeXt-IG-3.5B image features, on three tasks separately: COCO, Flickr30k and Image Chat. The results are given in Table 3.

We observe large improvements in accuracy with more text pre-training, for example on COCO going from 40.7 with no pre-training to 50.1 with BERT. BERT outperforms pushshift.io Reddit-128M slightly on COCO and Flickr30k, whereas pushshift.io Reddit-128M outperforms BERT on Image Chat. We hypothesize this is because the language is more formal on COCO and Flickr, matching BERT, which is trained with Wikipedia, whereas Image Chat is more colloquial, matching pushshift.io Reddit. However, the results are close, and on average pushshift.io Reddit-128M does slightly better. We thus use the latter in all subsequent experiments.²

Image Encoding We next consider different possible image encodings, via different architectures and pre-training schemes: ResNeXt-IG-3.5B [Mahajan et al., 2018], Faster R-CNN features [Girshick et al., 2018], and finally a combination of ResNeXt-IG-3.5B and Faster R-CNN features. After initialization we then fine-tune the entire TransResNet-MMC on four tasks: COCO, Flickr30k, Image Chat and VQA. We evaluate these settings both with single-task fine-tuning and with multi-task training. The results are given in Table 2.

Faster R-CNN features are superior on VQA, which requires fine-grained localization of objects in order to answer questions, while ResNeXt-IG-3.5B features are superior on Flickr30k and Image Chat, which require a wide array of commonsense knowledge of different scenes. On average across the tasks (last column), however, they provide similar performance. As they provide different qualities, they are a good candidate for combination. We thus provide both as input to our model and obtain superior single-task results on COCO, Flickr30k and VQA, with results on Image Chat as good as with ResNeXt-IG-3.5B and better than with Faster R-CNN. Multi-tasking performance also improves over previous results. We thus adopt this combination strategy in subsequent experiments.

²Another choice would have been to combine them, but we did not do that here.

Image Encoder                         COCO  Flickr30k  Image Chat  VQA   Avg
ResNeXt-IG-3.5B (ST)                  50.7  75.3       56.4        61.9  61.1
ResNeXt-IG-3.5B (MT)                  48.0  77.0       56.2        62.0  60.8
Faster R-CNN (ST)                     49.3  68.2       54.2        66.3  59.5
Faster R-CNN (MT)                     52.1  72.4       53.2        66.3  61.0
ResNeXt-IG-3.5B + Faster R-CNN (ST)   57.3  79.7       56.4        67.0  65.1
ResNeXt-IG-3.5B + Faster R-CNN (MT)   51.2  81.7       55.2        66.4  63.7

Table 2: Comparison between image representations as part of our TransResNet-MMC architecture, with either single-task (ST) or multi-task (MT) training, evaluating on COCO, Flickr30k, Image Chat and VQA, and reporting average (Avg) performance across the tasks.

Text Encoder   COCO  Flickr30k  Image Chat  Avg
from scratch   40.7  65.5       37.6        48.0
fastText init  44.9  69.0       45.6        53.2
BERT           50.1  72.0       52.1        58.1
Reddit-79M     44.3  68.4       50.3        54.3
Reddit-128M    48.8  71.8       55.2        58.6

Table 3: Comparison between text encoding Transformer pre-training methods when used as part of TransResNet-MMC, reporting accuracy on the respective test sets of three tasks, as well as the average (Avg).

Early Stop  COCO  Fl30k  PC    IC    ICQA  VQA
COCO        54.0  83.4   55.0  50.5  43.9  66.1
Fl30k       51.4  83.0   55.9  53.1  47.2  60.3
IC          52.4  81.3   58.8  55.9  51.4  66.5
VQA         53.4  81.9   58.0  54.0  30.6  66.6
Avg         51.2  81.7   58.0  55.2  49.9  66.4

Table 4: Training TransResNet-MMC on all tasks, but only performing early stopping on one specific dataset, compared to stopping on the average accuracy across all datasets ("Avg").

Fine Tune  COCO  Fl30k  PC    IC    ICQA  VQA
COCO       59.6  76.5   34.0  31.8  30.0  58.2
Flickr30k  50.7  84.0   54.2  52.1  47.1  60.8
IC         52.4  81.3   58.8  55.9  51.4  66.5
VQA        36.6  65.6   47.1  38.6  30.7  66.2
All        51.2  81.7   58.0  55.2  49.9  66.4

Table 5: Training TransResNet-MMC on all tasks and then fine-tuning on each of the tasks, compared to the original best performing multi-task model (called "All").

VQA model training    class head  ranking head
Classification head   67.0        n/a
Ranking head          n/a         54.0
Multi-head training   66.1        63.5

Table 6: Training VQA with either a classification head, a ranking head, or multi-tasking both. Multi-tasking both helps the ranking head improve.

                                          COCO  Fl30k  IC
ResNeXt-IG-3.5B, w/o MMC                  48.8  71.8   55.2
ResNeXt-IG-3.5B, with MMC                 50.7  75.3   56.6
ResNeXt-IG-3.5B + Faster R-CNN, w/o MMC   53.6  75.6   46.9
ResNeXt-IG-3.5B + Faster R-CNN, with MMC  57.3  79.7   56.4

Table 7: Comparison of with and without (w/o) the multimodal combiner (MMC) as part of our TransResNet architecture, for COCO, Flickr30k (Fl30k) and Image Chat (IC), using either ResNeXt-IG-3.5B features alone or in combination with Faster R-CNN features. The MMC provides gains in all cases.

Multimodal Combiner We next assess the impact of the multimodal combiner module in our architecture; we first analyze the non-attentive version. We consider either using it or replacing it with a simple sum over feature type representations, see Section 4.3. We compare these alternatives on three tasks: COCO, Flickr30k and Image Chat, and examine performance both for our best performing combination features as well as for ResNeXt-IG-3.5B alone. The results are given in Table 7. We see that without this component of the architecture, the model can still give somewhat reasonable performance. However, by combining modalities with a Transformer architecture, we do see improvements across all tasks. The MMC module takes as input a sequence-based representation of the context history (token-level representation). We also experimented with giving the mean sequence representation as input to the MMC instead, which gave worse results (a 5% regression on IC). We thus report subsequent experiments using the full combiner with sequence-based inputs.

Freezing versus Fine-Tuning Encoders We compare the performance of our models when either freezing the image and text encoders after pre-training, or fine-tuning them in addition to the multimodal combiner on the language and vision tasks. If they are frozen, only the multimodal combiner is trained. Table 8 presents the results, comparing multi-task performance across several of our tasks. There are clear wins from fine-tuning the encoders on the multimodal training data.

Ranking vs. Classification Head We compare the performance of training VQA with the classification and ranking heads, or training both at the same time. The results are shown in Table 6.

           COCO  Fl30k  PC    IC    ICQA  VQA
Freeze     27.9  57.2   40.6  40.6  37.6  64.5
Fine-tune  51.2  81.7   58.0  55.2  49.9  66.4

Table 8: Training TransResNet-MMC on all tasks, with or without freezing the text and image encoders.

Training with a classification head alone (first row) provides the best performance on VQA. However, transfer to other tasks, shown by evaluating them using the ranking head, gives poor results, understandably as that has not been trained for (see Table 10, row 4). Using a ranking head to train VQA gives far worse performance on VQA. We attribute this to the subsampling of negative candidates in the loss function, rather than considering the ~3k possible candidates at once as in the classification head. The last row of Table 6 shows the performance of training both heads at once and then evaluating the two heads. This dramatically improves the performance of the ranking head on VQA, as the classification head helps the model attain good weights.

Single Task Results Using our best approach, we now report final results fine-tuned on each task independently. The results are given in Table 10. We report results across all evaluation sets for a given training target. E.g., the first row shows the model performance when training with COCO, evaluated on the test sets of COCO, Flickr30k, Personality Captions, Image Chat, Image Chat QA, IGCQ, IGCQA and VQA. As expected, we observe better results on the test set of the task being trained on than on other test sets. However, there is some transfer between some of the tasks. For example, training on COCO gives non-trivial performance on Flickr30k, and vice-versa, although Flickr30k training helps less, probably due to its smaller size.

Multi-Task Results Results of our multi-task models are given in Table 10, last four rows. We first assess the performance of the MMC MT model (without an attentive multimodal combiner). We achieve marginally superior performance on Personality Captions and Flickr30k, and marginally inferior performance on the other tasks, compared to our best single-task (ST) models, but in a single conversational agent. The final column showing average performance across tasks makes this point clear, as those numbers are vastly superior for the multi-task models. Like our single-task counterparts, many of these results are still well above previous state of the art, e.g. on Personality Captions and Image Chat, and within range of the state of the art on COCO and Flickr30k.

Multi-Task Results with Attentive Multimodal Combiner We next assess the effect of multi-task training with multiple Transformers in the multimodal combiner (2, 3 or 4 instead of the single Transformer in our base architecture). The bottom four rows in Table 10 show that using the attentive multimodal combiner leads to improvements in average performance over all tasks, with 2AMMC achieving the best results on the PC and IC tasks of all methods, and 3AMMC being slightly better on average. Note that the early stopping criterion for these experiments is the average performance over all tasks, which leads to performance gains shifting between tasks among architectures, while the average itself is controlled. This could be altered by selecting a different stopping criterion, as detailed further below and in Table 4. Table 11 breaks down the performance obtained on all tasks by each of the Transformers in the 3AMMC. There are striking differences between the tasks as to how performance is split among the three MMCs: on VQA, MMC-1 and MMC-2 have near 0 performance while MMC-3 performs as well as the full system, but this comes at the expense of much worse performance on all the conversational tasks compared to MMC-1 and MMC-2. On PC, MMC-1 performs nearly as well as the full system and much better than MMC-2 and MMC-3. The overall best performance on all other tasks requires combining all three MMCs. To see if the performance gains of AMMC come just from the network being larger, we compare to MMC modules with more layers, up to an equivalent size; see Table 12. The results show that making the standard MMC larger only hurts performance. Increasing the number of MMC heads similarly degrades performance (results not shown). These results highlight the benefits of the AMMC design.

Multi-Tasking Small vs. Large Tasks The tasks where we do see a performance gain when multi-tasking, Flickr30k and Personality-Captions, tend to be smaller tasks where there is a larger related task also in the multi-tasking set, in this case COCO and Image Chat. To investigate the effects of training set size on multi-tasking transfer, we thus conducted experiments to see if we observe the same effects of improvement on another dataset if we downsampled it. We thus consider adjusting the training set size of COCO to be the same size as Flickr30k, then consider multiples of that size, and observe the change in performance with changing size. We compare single-task training on that subset to multi-task training with all other tasks and that subset. For these experiments we considered a smaller hyperparameter sweep for simplicity, with a multimodal combiner of 2 layers, and sweep across different numbers of heads for the multi-head attention, explaining the slightly lower results. We perform early stopping on COCO. The results are given in Table 14. For single-task training, we observe a drop from 54.0 accuracy to 42.1 as we go from 83k examples down to 29k. Multi-tasking with the full COCO dataset also yields the same 54.0 accuracy, which makes it appear that multi-tasking is not useful for generalization. However, subsampling COCO reveals a different story – the smaller the training set, the more the multi-tasking helps, with a gap of 42.1 to 49.3 in the 29k training example case. As researchers who construct new tasks often collect large-scale datasets, this means multi-tasking will often have less effect than is observed in a few-shot setup.

Multi-Tasking + Single-Task Fine-Tuning While our goal is to make a single agent that is good at all our tasks, we also investigate if multi-tasking can help improve performance on a single task, by either multi-tasking and early stopping on a particular task, or multi-tasking and then fine-tuning on a particular task.

Early stopping test results are shown in Table 4. We report, for each task out of COCO, Flickr30k, Image Chat and VQA, the performance on the test set of that task itself, as well as transfer performance to the other tasks. These results can be compared to the results of optimizing multi-task

Model                               COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA
SCAN [Lee et al., 2018]             50.4  67.4       -     -     -     -     -      -
SCG [Shi et al., 2019]              56.6  71.8       -     -     -     -     -      -
Unicoder-VL [Li et al., 2019a]      62.3  86.2       -     -     -     -     -      -
Unicoder-VL w/o pre-training        -     73.0       -     -     -     -     -      -
UNITER Base                         63.3  84.7       -     -     -     -     -      72.3
UNITER Large                        66.6  88.2       -     -     -     -     -      73.2
HDC [Nguyen and Okatani, 2019]      42.2  71.6       -     -     -     -     -      69.3
VisualBERT (ST) [Li et al., 2019b]  -     -          -     -     -     -     -      70.8
VisualBERT (ST) w/o pre-training    -     -          -     -     -     -     -      70.2
ViLBERT (ST) [Lu et al., 2019a]     -     -          -     -     -     -     -      70.6
ViLBERT (ST) w/o pre-training       -     -          -     -     -     -     -      69.0
Pythia [Jiang et al., 2018]³        -     -          -     -     -     -     -      66.7
Pythia³                             -     -          -     -     -     -     -      69.2
ViLBERT (MT) [Lu et al., 2019c]     -     -          -     -     -     -     -      72.6
ViLBERT (MT + FT)                   -     -          -     -     -     -     -      73.2
TransResNet [Shuster et al., 2018]  44.3  68.4       53.5  50.3  49.2  21.7  22.4   -

Table 9: Previous state-of-the-art results. * indicates results achieved by training with some or all of the validation set added to the train set, whereas we only train on the train set. Note that the results in this table correspond to a different model for each column, as the architectures are fine-tuned on each task separately, rather than training a single architecture in a multi-task way. The ViLBERT (MT) model is a multi-task model, but uses image retrieval settings on COCO and Flickr30k that are not comparable to the results presented here. The two Pythia rows are the same model, except they are trained with the VQA train set and the VQA train + valid set respectively; thus we list both numbers.

Arch   Training data  COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA   Avg
MMC    ST COCO        57.2  69.4       24.0  16.5  13.1  13.4  10.3   0.3   25.5
MMC    ST Flickr30k   27.7  79.7       23.0  16.3  13.8  15.8  12.2   0.2   23.6
MMC    ST IC          20.0  40.5       57.3  56.3  55.2  35.8  43.3   0.4   38.6
MMC    ST VQA         0.0   0.3        1.2   1.3   1.2   1.6   1.9    67.0  9.3
MMC    MT + FT        59.6  84.0       58.8  55.9  51.4  30.2  41.1   66.5  56.0
MMC    MT             51.2  81.7       58.0  55.2  49.9  25.7  38.4   66.4  53.3
2AMMC  MT             54.2  82.0       59.5  56.9  52.3  28.1  38.1   65.6  54.6
3AMMC  MT             52.7  82.9       58.5  56.1  52.4  31.4  39.8   66.9  55.1
4AMMC  MT             53.2  81.8       58.7  56.2  54.5  31.8  35.8   65.9  54.8

Table 10: Multi-tasking test results of our models. The first four rows show the transfer performance of our TransResNet-MMC model trained on a single task (ST), indicated in the Training data column. The fifth row shows a multi-task model which is then fine-tuned (MT + FT) on each single task separately (each column corresponds to a separate model; we hence report average performance in gray italics). The bottom four rows compare performance of single multi-task models with different types of multimodal combiners. The multi-task performance is close to single-task performance, and in some cases better, across several tasks. The attentive multimodal combiner (AMMC) obtains the best overall average performance.

Arch              COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA   Avg
3AMMC (MMC-1) MT  24.7  61.6       48.9  45.1  27.8  24.0  22.1   1.1   32.0
3AMMC (MMC-2) MT  31.1  50.6       19.5  26.0  27.8  27.4  33.5   0.0   27.0
3AMMC (MMC-3) MT  31.0  61.9       21.3  13.0  9.5   8.9   13.0   66.9  28.2

Table 11: Results on each dataset when we evaluate our 3AMMC model by only taking a single MMC output as the context representation. The first MMC alone already gives good performance on PC and IC, and the third on VQA. All three are needed for some of the tasks.

MMC Arch (compare to)  COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA   Avg
4 layers (2AMMC) MT    51.4  81.0       56.0  53.3  48.5  29.1  39.9   66.7  53.3
6 layers (3AMMC) MT    48.6  78.1       57.1  53.4  49.7  30.8  36.8   66.1  52.6
8 layers (4AMMC) MT    35.6  65.6       36.2  33.6  31.8  26.6  26.8   59.0  39.4

Table 12: Test results when we train our MMC models with varying numbers of layers, which we compare to our AMMC model sizes. Increasing the number of MMC layers only hurts performance.

Task: COCO
TransResNet-MMC: there is a broken tree log on the ground

Task: COCO
TransResNet-MMC: A large grass covered field under a mountain

Task: Flickr30k
TransResNet-MMC: A chaparral landscape scene void of human residence

Task: Flickr30k
TransResNet-MMC: A plane flying sideways

Task: VQA. Context: What is the color of the mountain?
TransResNet-MMC: gray

Task: VQA. Context: Does it appear to be rainy?
TransResNet-MMC: no

Task: Personality Captions (Style: Happy)
TransResNet-MMC: Wow, what a beautiful and perfect shade of pink and red! I am so captivated.

Task: Personality Captions (Style: Attractive)
TransResNet-MMC: Wow, I would love for someone to create this setting in the sand for me.

Task: Image Chat (Style: Compassionate)
Context, Round 1: Something about the pattern calms me
TransResNet-MMC: The architecture calms you

Task: Image Chat (Style: Emotional)
Context, Round 1: Airplanes are scary to get on, you never know if it will crash or not
Context, Round 2: But these are professional pilots though
TransResNet-MMC: They are, and for many people they mean a lot. My grandfather loved planes.

Table 13: Example outputs from our TransResNet-MMC multi-task model for different tasks (the corresponding images are not shown here).

performance in the last row ("Avg"); see also Table 10 (sixth row). There are clear gains for each task that is early stopped, but at large expense for the other tasks. For example, early stopping on COCO gives 54.0, compared to 51.2 when multi-tasking, but is still worse than the 57.2 when training as a single task. Transfer to Flickr30k is still good, likely as they are similar tasks, but Image Chat results are then poor. On Flickr30k, the early stopping result of 83.0 is superior to both the multi-task result of 81.7 and the single-task result of 79.7. This can be explained by Flickr30k being smaller than COCO, and thus benefiting from multi-tasking more, as we explained in the previous section.

Multi-tasking followed by single-task fine-tuning test results are shown in Table 5 (also summarized in Table 10). Generally these are superior to the multi-tasking per-task early stopping results. For example, fine-tuning on COCO gives 59.6, compared to 54.0 when multi-tasking and early stopping, or even 57.2 when training as a single task,

COCO Train Size          Multi-Task  Single-Task
1.0x Flickr30k (29,000)  49.3        42.1
1.5x Flickr30k (43,500)  51.6        50.3
2.0x Flickr30k (58,000)  53.7        51.9
2.5x Flickr30k (72,500)  53.8        53.6
Full size (82,783)       54.0        54.0

Table 14: Accuracy on the COCO test set when downsampling COCO during training to the same size as the Flickr30k training set, or multiples thereof. Smaller training sets are clearly helped by multi-tasking. Eventually there is enough data of the single task.

so it is the best result we obtain over all methods. We also achieve our best results in this fashion on Flickr30k. For VQA, the validation results were higher (not shown) in this setting, but resulted in slightly inferior test numbers, so a small amount of overfitting occurs.

Comparison to Existing Results We give results from previous work in Table 9. Our results compare favorably on the conversational tasks, i.e., PC, IC, ICQA, IGCQ and IGCQA. For the COCO, Flickr30k and VQA tasks, our results are within range of the state of the art, but are surpassed by some of the methods. We note that on COCO, others used the validation set for training, whereas we did not (see Sec. 2.3; we do not want multi-task experiment train and valid data to overlap). For VQA we report the number from Pythia³ as a comparison point, as that method uses the train set only, without VQA data augmentation from the Visual Genome, VisDial, or other data augmentations (similar to us), and we used their setup as a starting point for our implementation. Our numerical results are comparable to theirs.

Larger-Scale Cross-Module Pre-training Some of the best-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al., 2019; Li et al., 2019a; Lu et al., 2019a], which leads to better performance but requires gigantic multimodal datasets like Conceptual Captions [Sharma et al., 2018] or Visual Genome [Krishna et al., 2017], as shown in Table 15, as well as high computing resources (e.g., 882 and 3645 V100 GPU hours for UNITER-base and UNITER-large, respectively). Pre-training on COCO alone, as done in [Li et al., 2019b], gives more limited improvement (see Table 9). Our approach combines vision and text encoders with minimal additional multimodal training resources required. Even counting all the multi-task datasets used for training adds up to only 1M image-sentence (I-S) pairs, resulting in training that takes around 40 V100 GPU hours. We expect larger-scale cross-module pre-training would also improve the performance of our models, but this is beyond the scope of this work.

Example Predictions We show example predictions of our MMC multi-task model in Table 13. We take test images, and for COCO, Flickr30k, Personality Captions and Image Chat, we show the top ranked candidate using the ranking head, ranking all utterances from the given training set. For VQA

³https://learnpythia.readthedocs.io/en/latest/tutorials/pretrained_models.html#pretrained-models

Model        Dataset              Size (I-S pairs)
UNITER       COCO, VG, CC, SBUC   9.6 M
ViLBERT      CC                   3.0 M
Unicoder-VL  CC, SBUC             3.8 M

Table 15: Sizes of multimodal pre-training datasets in terms of image-sentence (I-S) pairs. Our model obtains comparable results on all tasks without any cross-module pre-training on large datasets such as Visual Genome (VG), Conceptual Captions (CC), or SBU Captions (SBUC). Thus, multi-tasking can be viewed as a strong alternative to large-scale pre-training, considering its simplicity and effectiveness in terms of computation power.

we show the output of the classification head. We observe that the same underlying model can produce a diverse range of outputs depending on the task, ranging from factual captioning to conversations grounded on the image.

6 Conclusion
In order to build an image-grounded conversational agent, we have assembled disparate multimodal tasks and built a single architecture for multi-task training on them, incorporating a novel attentive multimodal combination module. Through detailed analysis and ablations, we have shown that our approach can obtain strong performance across a number of tasks. Future work could investigate further how these skills are blended during interaction, rather than evaluating them as stand-alone tasks, and consider more tasks.

7 Acknowledgements
We are grateful to Amanpreet Singh and Vedanuj Goswami for providing help and advice, comparison results, and Faster R-CNN image features. We also thank Mary Williamson and Eric Smith for very useful discussions.

References

[Bakhtin et al., 2019] Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. Real or fake? Learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019.

[Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[Chen et al., 2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[Chen et al., 2019] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning UNiversal Image-TExt Representations. arXiv preprint arXiv:1909.11740, Sep 2019.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

[Das et al., 2017] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[Girshick et al., 2018] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.

[Goyal et al., 2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[Humeau et al., 2019] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969, 2019.

[Jiang et al., 2018] Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: the winning entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956, Jul 2018.

[Joulin et al., 2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April 2017. Association for Computational Linguistics.

[Karpathy and Fei-Fei, 2017] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676, April 2017.

[Krishna et al., 2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[LeCun et al., 1990] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.

[Lee et al., 2018] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked Cross Attention for Image-Text Matching. arXiv preprint arXiv:1803.08024, Mar 2018.

[Li et al., 2019a] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. arXiv preprint arXiv:1908.06066, Aug 2019.

[Li et al., 2019b] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557, Aug 2019.

[Lu et al., 2019a] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.

[Lu et al., 2019b] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, 2019.

[Lu et al., 2019c] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-Task Vision and Language Representation Learning. arXiv preprint arXiv:1912.02315, Dec 2019.

[Mahajan et al., 2018] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Proceedings of the European Conference on Computer Vision, pages 185–201, Cham, 2018. Springer International Publishing.

[Mazaré et al., 2018] Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.

[McCann et al., 2018] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[Miller et al., 2017] Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Mostafazadeh et al., 2017] Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.

[Nguyen and Okatani, 2019] Duy-Kien Nguyen and Takayuki Okatani. Multi-task learning of hierarchical vision-language representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10492–10501, 2019.

[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[Raffel et al., 2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.

[Shi et al., 2019] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5182–5189. International Joint Conferences on Artificial Intelligence Organization, 7 2019.

[Shuster et al., 2018] Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945, 2018.

[Shuster et al., 2019a] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image captioning via personality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12516–12526, 2019.

[Shuster et al., 2019b] Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768, 2019.

[Singh et al., 2018] Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia: A platform for vision & language research. In SysML Workshop, NeurIPS, volume 2018, 2018.

[Singh et al., 2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[Su et al., 2019] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.

[Tan and Bansal, 2019] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China, November 2019. Association for Computational Linguistics.

[Vaswani et al., 2017a] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Vaswani et al., 2017b] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv preprint arXiv:1706.03762, Jun 2017.

[Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

  • 1 Introduction
  • 2 Tasks
    • 21 Language-based
    • 22 Vision-based
    • 23 Vision + Language
      • 3 Related Work
      • 4 Methods
        • 41 Text Encoders
        • 42 Image Encoder
        • 43 Multimodal Combiner
        • 44 Attentive Multimodal Combiner
        • 45 Output Heads and Loss
          • 5 Experiments
          • 6 Conclusion
          • 7 Acknowledgements
Page 3: All-in-One Image-Grounded Conversational Agents · that obtains good results on all vision and language tasks considered, and improves the state of the art in image-grounded conversational

pushshiftio Reddit improves over large pre-training over re-sources like Wikipedia because they are more related to thetask [Mazare et al 2018 Humeau et al 2019 Shuster et al2019b] When training on downstream tasks multi-taskinglanguage tasks is also starting to become a more exploredarea [Collobert and Weston 2008 McCann et al 2018Raffel et al 2019]

In vision conventional convolutional neural networks [Le-Cun et al 1990 Krizhevsky et al 2012] have been upgradedand improved by deeper ResNet architectures that incorporateskip connections [He et al 2016] trained through ImageNet[Deng et al 2009] On tasks such as VQA which explic-itly ask questions about object properties Faster R-CNN fea-tures [Girshick et al 2018] which incorporate object detec-tion algorithms have been shown to perform well On taskswith large coverage of everyday images and commonsenseknowledge about them Instagram training has been shownto perform well [Mahajan et al 2018 Shuster et al 2018Shuster et al 2019a]

Given this improved performance across different modal-ities a natural next step is methods that combine these ap-proaches for multimodal tasks involving language and visionSeveral recent approaches have been built with this goal in-cluding Vilbert [Lu et al 2019a] VisualBERT [Li et al2019b] LXMERT [Tan and Bansal 2019] Unicoder-vl [Liet al 2019a] Vl-bert [Su et al 2019] and UNITER [Chenet al 2019] A common theme is to borrow some of thepre-training ideas from BERT but apply them to pre-trainingboth language and vision and then fine-tune these modelson downstream tasks Another recent work multi-tasks 12vision and language tasks at once [Lu et al 2019b] Some-what differing from our work the end tasks considered arenot to aimed to build a unified conversational agent where theoutput is dialogue but include any task including languageand vision of some form for example caption-based image-retrieval referring expressions and region to phrase ground-ing most of which we do not consider here Recently [Shus-ter et al 2019b] proposed to multi-task to build a conversa-tional agent but using mostly language-only tasks (10 tasks)although it does include two of the image tasks we considerhere

4 Methods

Our model is a retrieval architecture that outputs a candidateresponse from the training set Like most multimodal archi-tectures it comprises a text encoder an image encoder and away to combine the two However unlike recent models thatuse various cross attention mechanisms to get the joint rep-resentation of the final context our model simply uses a so-called multimodal combiner An extra style encoder is alsoadded to represent the different style traits inside the Person-ality Captions and Image Chat tasks and to differentiate themfrom the other tasks The model finally scores possible out-put candidates using either a ranking or a classification headdepending on the task An overview of the model architectureis given in Figure 1

Figure 1 Overview of our model in its TransResNet-3AMMC (At-tentive Multimodal combiner) variant The non-attentive variant(TransResNet-MMC) has a single Transformer combiner instead ofthe three split Transformers followed by a weighted sum shown here

41 Text EncodersWe use two text encoders one for the input context and onefor the output candidates The output candidate encoder en-codes the candidate utterances that will be scored and thecontext encoder encodes the text input Depending on thetask the context can be the previous dialogue history or aquestion posed about the image or a combination of the twoBoth text encoders are pre-trained Transformers The finaloutput of the candidate encoder is a single vector per candi-date obtained by taking the mean of the per-token encodingsFor the context encoder we retain the per-token output encod-ings thus comprising the same length as the input sequencefor input into the multimodal combiner During multimodaltraining we fine-tune both text encoders

42 Image EncoderWe consider two possible image encoders in our model Thefirst is a ResNeXt-based model trained on 35 billion In-stagram images [Mahajan et al 2018] which we refer toas ResNeXt-IG-35B This encodes the image into a 2048-dimensional vector The weights are fixed during the sub-sequent training process The second is an improved FasterR-CNN model [Girshick et al 2018] trained on the VisualGenome dataset [Krishna et al 2017] We fix the networkup to the fc6 layer and fine-tune the fc7 weights as in [Jianget al 2018] We extract 100 2048-dimensional vectors (100-channel Faster R-CNN features) In addition to trying thesemodels independently we also investigate using both of theirfeatures concatenated together as input to the multimodalcombiner

43 Multimodal CombinerThe Multimodal Combiner (MMC) is used to combine theencodings from different components of the model This in-cludes the 100-channel Faster R-CNN features the ResNeXt-IG-35B features the sequence-based context features (whichdepend on the text length) and the encoding from the styleencoder Prior to combination the individual encodings arenormalized with their own layer-norm layer each is then fedinto the MMC with a positional embedding to indicate whichfeature type it is The Multimodal Combiner is simply aTransformer [Vaswani et al 2017b] encoder without its em-bedding layer thus self-attention is applied to all featuresthat then go through linear layers A mean operation is per-formed in the end to get a single vectorial representation ofthe whole multimodal context This joint representation isthen either used for a dot product with the candidate encod-ings (for ranking) or sent to an additional linear layer (forclassification) as detailed in Sec 45

44 Attentive Multimodal CombinerWe further propose an Attentive Multimodal Combiner(AMMC) shown in Fig 1 where multiple Transformers areused and then combined through an attention mechanism po-tentially allowing them to focus on different skills We usethe style encoding as the query forward it to a linear layer ofthe same output dimension as the number of Transformers inthe multimodal combiner (ie 2 to 4 denoted as 2AMMC3AMMC and 4AMMC) followed by a softmax We henceuse those outputs to perform a weighted sum of the outputsfrom all the multimodal Transformers This attention mech-anism thus learns which Transformers to rely on more forwhich inputs allowing the model to switch between skills

45 Output Heads and LossFor tasks like VQA where there is one factually correct an-swer and many wrong answers it has been shown that strongperformance can be achieved using a classification head con-sidering all the possible most frequent answers as classes andusing a binary cross entropy loss1 We thus also consider thisapproach for VQA For open-ended problems in which theremay be multiple right answers eg those in Image Chat weconsider an alternative approach of using a ranking head Inthis approach the gold label is contrasted with a subsampleof negative candidates during training (chosen as the labelsof other examples in the batch) but still using the binary crossentropy loss which scales well to huge candidate sets Wecompare these two methods in this work and also consider-ing training both at the same time We use batch sizes of 256 512 and adam for optimization For multi-tasking we exper-imented with various kinds of dataset weighting schemes butin the end we went for simplicity and report results of sam-pling from tasks equally so that the same number of updatesare done per task in an epoch which was difficult to improveupon

1As used in Pythia [Singh et al 2019 Singh et al 2018] (httpsgithubcomfacebookresearchpythia)

5 ExperimentsWe now describe our experiments in which we perform anal-ysis and ablations of the different kinds of modules and inputswe use for training and final results on our full architectureFor all models we choose hyperparameters on the validationset(s) and report on the test set for VQA the numbers arereported on the test-dev set All experiments were conductedin ParlAI [Miller et al 2017] and we plan to make the codepublicly available

Text Encoding We consider different possible text en-codings with different pre-training schemes starting fromrandom weights before training on our language + visiontasks starting from initialized word embeddings from fast-Text [Joulin et al 2017] only starting from BERT weights[Devlin et al 2019] and starting from two versions ofpushshiftio Reddit training ie Transformers with 79Mparameters from [Mazare et al 2018] and 128M param-eters from [Humeau et al 2019] After initialization wethen fine-tune the entire TransResNet-MMC using ResNeXt-IG-35B image features on three tasks separately COCOFlickr30k and Image Chat The results are given in Table 3

We observe large improvements in accuracy with more text pre-training, for example on COCO going from 40.7 with no pre-training to 50.1 with BERT. BERT outperforms pushshift.io Reddit-128M slightly on COCO and Flickr30k, whereas pushshift.io Reddit-128M outperforms BERT on Image Chat. We hypothesize this is because the language is more formal on COCO and Flickr30k, matching BERT, which is trained with Wikipedia, whereas Image Chat is more colloquial, matching pushshift.io Reddit. However, the results are close, and on average pushshift.io Reddit-128M does slightly better. We thus use the latter in all subsequent experiments.²

Image Encoding. We next consider different possible image encodings via different architectures and pre-training schemes: ResNeXt-IG-3.5B [Mahajan et al., 2018], Faster R-CNN features [Girshick et al., 2018], and finally a combination of ResNeXt-IG-3.5B and Faster R-CNN features. After initialization, we then fine-tune the entire TransResNet-MMC on four tasks: COCO, Flickr30k, Image Chat and VQA. We evaluate these settings both with single-task fine-tuning and with multi-task training. The results are given in Table 2.

Faster R-CNN features are superior on VQA, which requires fine-grained localization of objects in order to answer questions, while ResNeXt-IG-3.5B features are superior on Flickr30k and Image Chat, which require a wide array of commonsense knowledge of different scenes. On average across the tasks (last column), however, they provide similar performance. As they provide different qualities, they are a good candidate for combination. We thus provide both as input to our model and obtain superior single-task results on COCO, Flickr30k and VQA, with results on Image Chat as good as with ResNeXt-IG-3.5B and better than with Faster R-CNN. Multi-tasking performance also improves over previous results. We thus adopt this combination strategy in subsequent experiments.

² Another choice would have been to combine them, but we did not do that here.

Image Encoder                          COCO  Flickr30k  Image Chat  VQA   Avg
ResNeXt-IG-3.5B                  ST    50.7  75.3       56.4        61.9  61.1
                                 MT    48.0  77.0       56.2        62.0  60.8
Faster R-CNN                     ST    49.3  68.2       54.2        66.3  59.5
                                 MT    52.1  72.4       53.2        66.3  61.0
ResNeXt-IG-3.5B + Faster R-CNN   ST    57.3  79.7       56.4        67.0  65.1
                                 MT    51.2  81.7       55.2        66.4  63.7

Table 2: Comparison between image representations as part of our TransResNet-MMC architecture, with either single-task (ST) or multi-task (MT) training, evaluating on COCO, Flickr30k, Image Chat and VQA, and reporting average (Avg) performance across the tasks.

Text Encoder   COCO  Flickr30k  Image Chat  Avg
from scratch   40.7  65.5       37.6        48.0
fastText init  44.9  69.0       45.6        53.2
BERT           50.1  72.0       52.1        58.1
Reddit-79M     44.3  68.4       50.3        54.3
Reddit-128M    48.8  71.8       55.2        58.6

Table 3: Comparison between text encoding Transformer pre-training methods when used as part of TransResNet-MMC, reporting accuracy on the respective test sets of three tasks as well as the average (Avg).

Early Stop  COCO  Fl30k  PC    IC    ICQA  VQA
COCO        54.0  83.4   55.0  50.5  43.9  66.1
Fl30k       51.4  83.0   55.9  53.1  47.2  60.3
IC          52.4  81.3   58.8  55.9  51.4  66.5
VQA         53.4  81.9   58.0  54.0  30.6  66.6
Avg         51.2  81.7   58.0  55.2  49.9  66.4

Table 4: Training TransResNet-MMC on all tasks but only performing early stopping on one specific dataset, compared to stopping on the average accuracy across all datasets ("Avg").

Fine Tune   COCO  Fl30k  PC    IC    ICQA  VQA
COCO        59.6  76.5   34.0  31.8  30.0  58.2
Flickr30k   50.7  84.0   54.2  52.1  47.1  60.8
IC          52.4  81.3   58.8  55.9  51.4  66.5
VQA         36.6  65.6   47.1  38.6  30.7  66.2
All         51.2  81.7   58.0  55.2  49.9  66.4

Table 5: Training TransResNet-MMC on all tasks and then fine-tuning on each of the tasks, compared to the original best performing multi-task model (called "All").

VQA Model training   class head  ranking head
Classification head  67.0        n/a
Ranking head         n/a         54.0
Multi-head training  66.1        63.5

Table 6: Training VQA with either a classification head, a ranking head, or multi-tasking both. Multi-tasking both helps the ranking head improve.

                                 COCO  Fl30k  IC
ResNeXt-IG-3.5B      w/o MMC     48.8  71.8   55.2
                     with MMC    50.7  75.3   56.6
ResNeXt-IG-3.5B      w/o MMC     53.6  75.6   46.9
+ Faster R-CNN       with MMC    57.3  79.7   56.4

Table 7: Comparison with and without (w/o) the multimodal combiner (MMC) as part of our TransResNet architecture, for COCO, Flickr30k (Fl30k) and Image Chat (IC), using either ResNeXt-IG-3.5B features alone or in combination with Faster R-CNN features. The MMC provides gains in all cases.

Multimodal Combiner. We next assess the impact of the multimodal combiner module in our architecture; we first analyze the non-attentive version. We consider either using it or replacing it with a simple sum over feature-type representations (see Section 4.3). We compare these alternatives on three tasks, COCO, Flickr30k and Image Chat, and examine performance both for our best performing combination of features and for ResNeXt-IG-3.5B alone. The results are given in Table 7. We see that without this component of the architecture the model can still give somewhat reasonable performance. However, by combining modalities with a Transformer architecture we do see improvements across all tasks. The MMC module takes as input a sequence-based representation of the context history (token-level representation). We also experimented with giving the mean sequence representation as input to the MMC instead, which gave worse results (a 5% regression on IC). We thus report subsequent experiments using the full combiner with sequence-based inputs.

Freezing versus Fine-Tuning Encoders. We compare the performance of our models when either freezing the image and text encoders after pre-training, or fine-tuning them in addition to the multimodal combiner on the language and vision tasks. If they are frozen, only the multimodal combiner is trained. Table 8 presents the results, comparing multi-task performance across several of our tasks. There are clear wins from fine-tuning the encoders on the multimodal training data.

Ranking vs. Classification Head. We compare the performance of training VQA with the classification head, the ranking head, or training both at the same time. The results are shown in Table 6.

Training with a classification head alone (first row) provides the best performance on VQA. However, transfer to other tasks, shown by evaluating them using the ranking head, gives poor results, understandably as that head has not been trained for them (see Table 10, row 4). Using a ranking head to train VQA gives far worse performance on VQA. We attribute this to the subsampling of negative candidates in the loss function, rather than considering the ∼3k possible candidates at once as in the classification head. The last row of Table 6 shows the performance of training both heads at once and then evaluating the two heads. This dramatically improves the performance of the ranking head on VQA, as the classification head helps the model attain good weights.

           COCO  Fl30k  PC    IC    ICQA  VQA
Freeze     27.9  57.2   40.6  40.6  37.6  64.5
Fine-tune  51.2  81.7   58.0  55.2  49.9  66.4

Table 8: Training TransResNet-MMC on all tasks with the text and image encoders frozen or fine-tuned.

Single Task Results. Using our best approach, we now report final results fine-tuned on each task independently. The results are given in Table 10. We report results across all evaluation sets for a given training target. E.g., the first row shows the model performance when training with COCO, evaluated on the test sets of COCO, Flickr30k, Personality Captions, Image Chat, Image Chat QA, IGCQ, IGCQA and VQA. As expected, we observe better results on the test set of the task being trained on than on other test sets. However, there is some transfer between some of the tasks. For example, training on COCO gives non-trivial performance on Flickr30k, and vice versa, although Flickr30k training helps less, probably due to its smaller size.

Multi-Task Results. Results of our multi-task models are given in Table 10, last four rows. We first assess the performance of the MMC MT model (without an attentive multimodal combiner). We achieve marginally superior performance on Personality Captions and Flickr30k, and marginally inferior performance on the other tasks, compared to our best single-task (ST) models, but in a single conversational agent. The final column showing average performance across tasks makes this point clear, as those numbers are vastly superior for the multi-task models. Like our single-task counterparts, many of these results are still well above the previous state of the art, e.g., on Personality Captions and Image Chat, and within range of the state of the art on COCO and Flickr30k.

Multi-Task Results with Attentive Multimodal Combiner. We next assess the effect of multi-task training with multiple Transformers in the multimodal combiner (2, 3 or 4, instead of the single Transformer in our base architecture). The bottom four rows in Table 10 show that using the attentive multimodal combiner leads to improvements in average performance over all tasks, with 2AMMC achieving the best results on the PC and IC tasks of all methods, and 3AMMC being slightly better on average. Note that the early stopping criterion for these experiments is the average performance over all tasks, which leads to performance gains shifting between tasks among architectures while the average itself is controlled. This could be altered by selecting a different stopping criterion, as detailed further below and in Table 4. Table 11 breaks down the performance obtained on all tasks by each of the Transformers in the 3AMMC. There are striking differences between the tasks as to how performance is split among the three MMCs: on VQA, MMC-1 and MMC-2 have near-zero performance while MMC-3 performs as well as the full system, but this comes at the expense of much worse performance on all the conversational tasks compared to MMC-1 and MMC-2. On PC, MMC-1 performs nearly as well as the full system and much better than MMC-2 and MMC-3. The overall best performance on all other tasks requires combining all three MMCs. To see if the performance gains of AMMC come just from the network being larger, we compare to MMC modules with more layers, up to an equivalent size (see Table 12). The results show that making the standard MMC larger only hurts performance. Increasing the number of MMC heads similarly degrades performance (results not shown). These results highlight the benefits of the AMMC design.

Multi-Tasking Small vs. Large Tasks. The tasks for which we do see a performance gain when multi-tasking, Flickr30k and Personality Captions, tend to be smaller tasks where a larger related task is also in the multi-tasking set, in this case COCO and Image Chat. To investigate the effect of training set size on multi-tasking transfer, we conducted experiments to see if we would observe the same improvements on another dataset if we downsampled it. We thus consider adjusting the training set size of COCO to be the same size as Flickr30k, then consider multiples of that size, and observe the change in performance with changing size. We compare single-task training on each subset to multi-task training with all other tasks and that subset; a sketch of the setup is given below. For these experiments we considered a smaller hyperparameter sweep for simplicity, with a multimodal combiner of 2 layers, and sweep across different numbers of heads for the multi-head attention, explaining the slightly lower results. We perform early stopping on COCO. The results are given in Table 14. For single-task training we observe a drop from 54.0% accuracy to 42.1% as we go from 83k examples down to 29k. Multi-tasking with the full COCO dataset also yields the same 54.0% accuracy, which makes it appear that multi-tasking is not useful for generalization. However, subsampling COCO reveals a different story: the smaller the training set, the more the multi-tasking helps, with a gap of 42.1% to 49.3% in the 29k training example case. As researchers who construct new tasks often collect large-scale datasets, this means multi-tasking will often have less effect than is observed in a few-shot setup.
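
As a concrete illustration of this downsampling setup, the sketch below builds COCO subsets at multiples of the Flickr30k training-set size; the loading function is a placeholder, not part of our released code.

import random

FLICKR30K_TRAIN_SIZE = 29000          # size of the Flickr30k training set
MULTIPLIERS = [1.0, 1.5, 2.0, 2.5]    # subset sizes as multiples of Flickr30k

def subsample(examples, size, seed=0):
    # Draw a fixed random subset of the training examples.
    rng = random.Random(seed)
    return rng.sample(examples, min(size, len(examples)))

coco_train = load_coco_train()        # placeholder data loader (assumption)
subsets = {m: subsample(coco_train, int(m * FLICKR30K_TRAIN_SIZE)) for m in MULTIPLIERS}
subsets["full"] = coco_train          # the full 82,783-example training set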

Multi-Tasking + Single-Task Fine-Tuning. While our goal is to make a single agent that is good at all our tasks, we also investigate whether multi-tasking can help improve performance on a single task, by either multi-tasking and early stopping on a particular task, or multi-tasking and then fine-tuning on a particular task.

Early stopping test results are shown in Table 4. For each task out of COCO, Flickr30k, Image Chat and VQA, we report the performance on the test set of that task itself as well as transfer performance to the other tasks. These results can be compared to the results of optimizing multi-task performance, shown in the last row ("Avg"); see also Table 10 (sixth row).

Model                               COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA
SCAN [Lee et al., 2018]             50.4  67.4       -     -     -     -     -      -
SCG [Shi et al., 2019]              56.6  71.8       -     -     -     -     -      -
Unicoder-VL [Li et al., 2019a]      62.3  86.2       -     -     -     -     -      -
Unicoder-VL w/o pre-training        -     73.0       -     -     -     -     -      -
UNITER Base                         63.3  84.7       -     -     -     -     -      72.3
UNITER Large                        66.6  88.2       -     -     -     -     -      73.2
HDC [Nguyen and Okatani, 2019]      42.2  71.6       -     -     -     -     -      69.3
VisualBERT (ST) [Li et al., 2019b]  -     -          -     -     -     -     -      70.8
VisualBERT (ST) w/o pre-training    -     -          -     -     -     -     -      70.2
ViLBERT (ST) [Lu et al., 2019a]     -     -          -     -     -     -     -      70.6
ViLBERT (ST) w/o pre-training       -     -          -     -     -     -     -      69.0
Pythia [Jiang et al., 2018]³        -     -          -     -     -     -     -      66.7
Pythia³                             -     -          -     -     -     -     -      69.2
ViLBERT (MT) [Lu et al., 2019c]     -     -          -     -     -     -     -      72.6
ViLBERT (MT + FT)                   -     -          -     -     -     -     -      73.2
TransResNet [Shuster et al., 2018]  44.3  68.4       53.5  50.3  49.2  21.7  22.4   -

Table 9: Previous state-of-the-art results. ∗ indicates results achieved by training with some or all of the validation set added to the train set, whereas we only train on the train set. Note that the results in this table correspond to a different model for each column, as the architectures are fine-tuned on each task separately rather than training a single architecture in a multi-task way. The ViLBERT (MT) model is a multi-task model but uses image retrieval settings on COCO and Flickr30k that are not comparable to the results presented here. The two Pythia models are the same, except they are trained with the VQA train set and the VQA train + valid set respectively; thus we list both numbers.

Arch    Training data  COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA   Avg
MMC     ST COCO        57.2  69.4       24.0  16.5  13.1  13.4  10.3   0.3   25.5
MMC     ST Flickr30k   27.7  79.7       23.0  16.3  13.8  15.8  12.2   0.2   23.6
MMC     ST IC          20.0  40.5       57.3  56.3  55.2  35.8  43.3   0.4   38.6
MMC     ST VQA         0.0   0.3        1.2   1.3   1.2   1.6   1.9    67.0  9.3
MMC     MT + FT        59.6  84.0       58.8  55.9  51.4  30.2  41.1   66.5  56.0
MMC     MT             51.2  81.7       58.0  55.2  49.9  25.7  38.4   66.4  53.3
2AMMC   MT             54.2  82.0       59.5  56.9  52.3  28.1  38.1   65.6  54.6
3AMMC   MT             52.7  82.9       58.5  56.1  52.4  31.4  39.8   66.9  55.1
4AMMC   MT             53.2  81.8       58.7  56.2  54.5  31.8  35.8   65.9  54.8

Table 10: Multi-tasking test results of our models. The first four rows show the transfer performance of our TransResNet-MMC model trained on a single task (ST) indicated in the Training data column. The fifth row shows a multi-task model which is then fine-tuned (MT + FT) on each single task separately (each column corresponds to a separate model; we hence report average performance in gray italics). The bottom four rows compare the performance of single multi-task models with different types of multimodal combiners. The multi-task performance is close to single-task performance, and in some cases better, across several tasks. The attentive multimodal combiner (AMMC) obtains the best overall average performance.

Arch               COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA   Avg
3AMMC (MMC-1) MT   24.7  61.6       48.9  45.1  27.8  24.0  22.1   1.1   32.0
3AMMC (MMC-2) MT   31.1  50.6       19.5  26.0  27.8  27.4  33.5   0.0   27.0
3AMMC (MMC-3) MT   31.0  61.9       21.3  13.0  9.5   8.9   13.0   66.9  28.2

Table 11: Results on each dataset when we evaluate our 3AMMC model by only taking a single MMC output as the context representation. The first MMC alone already gives good performance on PC and IC, and the third on VQA. All three are needed for some of the tasks.

MMC Arch (Compare to)   COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA   Avg
4 Layers (2AMMC) MT     51.4  81.0       56.0  53.3  48.5  29.1  39.9   66.7  53.3
6 Layers (3AMMC) MT     48.6  78.1       57.1  53.4  49.7  30.8  36.8   66.1  52.6
8 Layers (4AMMC) MT     35.6  65.6       36.2  33.6  31.8  26.6  26.8   59.0  39.4

Table 12: Test results when we train our MMC models with varying numbers of layers, which we compare to our AMMC model sizes. Increasing the number of MMC layers only hurts performance.

Task: COCO
TransResNet-MMC: there is a broken tree log on the ground

Task: COCO
TransResNet-MMC: A large grass covered field under a mountain

Task: Flickr30k
TransResNet-MMC: A chaparral landscape scene void of human residence

Task: Flickr30k
TransResNet-MMC: A plane flying sideways

Task: VQA
Context: What is the color of the mountain?
TransResNet-MMC: gray

Task: VQA
Context: Does it appear to be rainy?
TransResNet-MMC: no

Task: Personality Captions (Style: Happy)
TransResNet-MMC: Wow what a beautiful and perfect shade of pink and red I am so captivated

Task: Personality Captions (Style: Attractive)
TransResNet-MMC: Wow I would love for someone to create this setting in the sand for me

Task: Image Chat (Style: Compassionate)
Context: Round 1: Something about the pattern calms me
TransResNet-MMC: The architecture calms you

Task: Image Chat (Style: Emotional)
Context: Round 1: Airplanes are scary to get on, you never know if it will crash or not
         Round 2: But these are professional pilots though
TransResNet-MMC: They are, and for many people they mean a lot. My grandfather loved planes

Table 13: Example output from our TransResNet-MMC multi-task model for different tasks. Each output is grounded on a test image (images not shown here).

There are clear gains for each task that is early stopped, but at large expense for the other tasks. For example, early stopping on COCO gives 54.0%, compared to 51.2% when multi-tasking, but is still worse than the 57.2% obtained when training as a single task. Transfer to Flickr30k is still good, likely as they are similar tasks, but Image Chat results are then poor. On Flickr30k, the early stopping result of 83.0% is superior to both the multi-task result of 81.7% and the single-task result of 79.7%. This can be explained by Flickr30k being smaller than COCO and thus benefiting more from multi-tasking, as we explained in the previous section.

Multi-tasking followed by single-task fine-tuning test results are shown in Table 5 (also summarized in Table 10). Generally, these are superior to the multi-tasking per-task early stopping results. For example, fine-tuning on COCO gives 59.6%, compared to 54.0% when multi-tasking and early stopping, or even 57.2% when training as a single task, so it is the best result we obtain over all methods. We also achieve our best results in this fashion on Flickr30k. For VQA, the validation results were higher in this setting (not shown), but resulted in slightly inferior test numbers, so a small amount of overfitting occurs.

COCO Train Size          Multi-Task  Single-Task
1.0x Flickr30k (29000)   49.3        42.1
1.5x Flickr30k (43500)   51.6        50.3
2.0x Flickr30k (58000)   53.7        51.9
2.5x Flickr30k (72500)   53.8        53.6
Full Size (82783)        54.0        54.0

Table 14: Accuracy on the COCO test set when downsampling COCO during training to the size of the Flickr30k training set, or multiples thereof. Smaller training sets are clearly helped by multi-tasking; eventually there is enough data for the single task alone.

Comparison to Existing Results. We give results from previous work in Table 9. Our results compare favorably on the conversational tasks, i.e., PC, IC, ICQA, IGCQ and IGCQA. For the COCO, Flickr30k and VQA tasks, our results are within range of the state of the art, but are surpassed by some of the methods. We note that on COCO, others used the validation set for training, whereas we did not (see Sec. 2.3; we do not want multi-task experiment train and valid data to overlap). For VQA, we report the number from Pythia³ as a comparison point, as that method uses the train set only, without VQA data augmentation from Visual Genome, VisDial or other data augmentations (similar to us), and we used their setup as a starting point for our implementation. Our numerical results are comparable to theirs.

Larger-Scale Cross-Module Pre-training. Some of the best-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al., 2019; Li et al., 2019a; Lu et al., 2019a], which leads to better performance but requires gigantic multimodal datasets like Conceptual Captions [Sharma et al., 2018] or Visual Genome [Krishna et al., 2017], as shown in Table 15, as well as high computing resources (e.g., 882 and 3645 V100 GPU hours for UNITER-base and UNITER-large, respectively). Pre-training on COCO alone, as done in [Li et al., 2019b], gives more limited improvement (see Table 9). Our approach combines vision and text encoders with minimal additional multimodal training resources required. Even counting all the multi-task datasets used for training only adds up to 1M image-sentence (I-S) pairs, resulting in training that takes around 40 V100 GPU hours. We expect larger-scale cross-module pre-training would also improve the performance of our models, but this is beyond the scope of this work.

Example Predictions. We show example predictions of our MMC multi-task model in Table 13. We take test images, and for COCO, Flickr30k, Personality Captions and Image Chat we show the top-ranked candidate using the ranking head, ranking all utterances from the given training set. For VQA, we show the output of the classification head.

³ https://learnpythia.readthedocs.io/en/latest/tutorials/pretrained_models.html#pretrained-models

Model        Dataset              Size (I-S Pairs)
UNITER       COCO, VG, CC, SBUC   9.6M
ViLBERT      CC                   3.0M
Unicoder-VL  CC, SBUC             3.8M

Table 15: Sizes of multimodal pre-training datasets in terms of image-sentence pairs. Our model obtains comparable results on all tasks without any cross-module pre-training on large datasets such as Visual Genome (VG), Conceptual Captions (CC), or SBU Captions (SBUC). Thus, multi-tasking can be viewed as a strong alternative to large-scale pre-training, considering its simplicity and effectiveness in terms of computation power.

We observe that the same underlying model can produce a diverse range of outputs depending on the task, ranging from factual captioning to conversations grounded on the image.

6 Conclusion
In order to build an image-grounded conversational agent, we have assembled disparate multimodal tasks and built a single architecture for multi-task training on them, incorporating a novel attentive multimodal combination module. Through detailed analysis and ablations, we have shown that our approach can obtain strong performance across a number of tasks. Future work could investigate further how these skills are blended during interaction, rather than evaluating them as stand-alone tasks, and consider more tasks.

7 Acknowledgements
We are grateful to Amanpreet Singh and Vedanuj Goswami for providing help and advice, comparison results, and Faster R-CNN image features. We also thank Mary Williamson and Eric Smith for very useful discussions.

References

[Bakhtin et al., 2019] Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. Real or fake? Learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019.

[Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[Chen et al., 2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[Chen et al., 2019] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning UNiversal Image-TExt Representations. arXiv preprint arXiv:1909.11740, 2019.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

[Das et al., 2017] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[Girshick et al., 2018] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.

[Goyal et al., 2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[Humeau et al., 2019] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969, 2019.

[Jiang et al., 2018] Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: the winning entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956, 2018.

[Joulin et al., 2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April 2017. Association for Computational Linguistics.

[Karpathy and Fei-Fei, 2017] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):664–676, April 2017.

[Krishna et al., 2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[LeCun et al., 1990] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.

[Lee et al., 2018] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. arXiv preprint arXiv:1803.08024, 2018.

[Li et al., 2019a] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019.

[Li et al., 2019b] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.

[Lu et al., 2019a] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.

[Lu et al., 2019b] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, 2019.

[Lu et al., 2019c] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, December 2019.

[Mahajan et al., 2018] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Proceedings of the European Conference on Computer Vision, pages 185–201, Cham, 2018. Springer International Publishing.

[Mazaré et al., 2018] Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.

[McCann et al., 2018] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[Miller et al., 2017] Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Mostafazadeh et al., 2017] Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.

[Nguyen and Okatani, 2019] Duy-Kien Nguyen and Takayuki Okatani. Multi-task learning of hierarchical vision-language representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10492–10501, 2019.

[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[Raffel et al., 2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.

[Shi et al., 2019] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5182–5189. International Joint Conferences on Artificial Intelligence Organization, July 2019.

[Shuster et al., 2018] Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945, 2018.

[Shuster et al., 2019a] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image captioning via personality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12516–12526, 2019.

[Shuster et al., 2019b] Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768, 2019.

[Singh et al., 2018] Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia: a platform for vision & language research. In SysML Workshop, NeurIPS, volume 2018, 2018.

[Singh et al., 2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[Su et al., 2019] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.

[Tan and Bansal, 2019] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China, November 2019. Association for Computational Linguistics.

[Vaswani et al., 2017a] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Vaswani et al., 2017b] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, June 2017.

[Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

  • 1 Introduction
  • 2 Tasks
    • 21 Language-based
    • 22 Vision-based
    • 23 Vision + Language
      • 3 Related Work
      • 4 Methods
        • 41 Text Encoders
        • 42 Image Encoder
        • 43 Multimodal Combiner
        • 44 Attentive Multimodal Combiner
        • 45 Output Heads and Loss
          • 5 Experiments
          • 6 Conclusion
          • 7 Acknowledgements
Page 4: All-in-One Image-Grounded Conversational Agents · that obtains good results on all vision and language tasks considered, and improves the state of the art in image-grounded conversational

43 Multimodal CombinerThe Multimodal Combiner (MMC) is used to combine theencodings from different components of the model This in-cludes the 100-channel Faster R-CNN features the ResNeXt-IG-35B features the sequence-based context features (whichdepend on the text length) and the encoding from the styleencoder Prior to combination the individual encodings arenormalized with their own layer-norm layer each is then fedinto the MMC with a positional embedding to indicate whichfeature type it is The Multimodal Combiner is simply aTransformer [Vaswani et al 2017b] encoder without its em-bedding layer thus self-attention is applied to all featuresthat then go through linear layers A mean operation is per-formed in the end to get a single vectorial representation ofthe whole multimodal context This joint representation isthen either used for a dot product with the candidate encod-ings (for ranking) or sent to an additional linear layer (forclassification) as detailed in Sec 45

44 Attentive Multimodal CombinerWe further propose an Attentive Multimodal Combiner(AMMC) shown in Fig 1 where multiple Transformers areused and then combined through an attention mechanism po-tentially allowing them to focus on different skills We usethe style encoding as the query forward it to a linear layer ofthe same output dimension as the number of Transformers inthe multimodal combiner (ie 2 to 4 denoted as 2AMMC3AMMC and 4AMMC) followed by a softmax We henceuse those outputs to perform a weighted sum of the outputsfrom all the multimodal Transformers This attention mech-anism thus learns which Transformers to rely on more forwhich inputs allowing the model to switch between skills

45 Output Heads and LossFor tasks like VQA where there is one factually correct an-swer and many wrong answers it has been shown that strongperformance can be achieved using a classification head con-sidering all the possible most frequent answers as classes andusing a binary cross entropy loss1 We thus also consider thisapproach for VQA For open-ended problems in which theremay be multiple right answers eg those in Image Chat weconsider an alternative approach of using a ranking head Inthis approach the gold label is contrasted with a subsampleof negative candidates during training (chosen as the labelsof other examples in the batch) but still using the binary crossentropy loss which scales well to huge candidate sets Wecompare these two methods in this work and also consider-ing training both at the same time We use batch sizes of 256 512 and adam for optimization For multi-tasking we exper-imented with various kinds of dataset weighting schemes butin the end we went for simplicity and report results of sam-pling from tasks equally so that the same number of updatesare done per task in an epoch which was difficult to improveupon

1As used in Pythia [Singh et al 2019 Singh et al 2018] (httpsgithubcomfacebookresearchpythia)

5 ExperimentsWe now describe our experiments in which we perform anal-ysis and ablations of the different kinds of modules and inputswe use for training and final results on our full architectureFor all models we choose hyperparameters on the validationset(s) and report on the test set for VQA the numbers arereported on the test-dev set All experiments were conductedin ParlAI [Miller et al 2017] and we plan to make the codepublicly available

Text Encoding We consider different possible text en-codings with different pre-training schemes starting fromrandom weights before training on our language + visiontasks starting from initialized word embeddings from fast-Text [Joulin et al 2017] only starting from BERT weights[Devlin et al 2019] and starting from two versions ofpushshiftio Reddit training ie Transformers with 79Mparameters from [Mazare et al 2018] and 128M param-eters from [Humeau et al 2019] After initialization wethen fine-tune the entire TransResNet-MMC using ResNeXt-IG-35B image features on three tasks separately COCOFlickr30k and Image Chat The results are given in Table 3

We observe large improvements in accuracy with moretext pre-training for example on COCO going from 407with no pre-training to 501 with BERT BERT outperformspushshiftio Reddit-128M slightly on COCO and Flickr30kwhereas pushshiftio Reddit-128M outperforms BERT on Im-age Chat We hypothesize this is because the language ismore formal on COCO and Flickr matching BERT that istrained with Wikipedia whereas Image Chat is more collo-quial matching pushshiftio Reddit However the results areclose and on average pushshiftio Reddit-128M does slightlybetter We thus use the latter in all subsequent experiments2

Image Encoding We next consider different possible im-age encodings via different architectures and pre-trainingschemes ResNeXt-IG-35B [Mahajan et al 2018] Faster R-CNN features [Girshick et al 2018] and finally a combina-tion of ResNeXt-IG-35B and Faster R-CNN features Afterinitialization we then fine-tune the entire TransResNet-MMCon four tasks COCO Flickr30k Image Chat and VQA Weevaluate these settings both with single task fine-tuning andwith multi-task training The results are given in Table 2

Faster R-CNN features are superior on VQA which re-quires fine-grained localization of objects in order to answerquestions while ResNeXt-IG-35B features are superior onFlickr30k and Image Chat which require a wide array ofcommonsense knowledge of different scenes On averageacross the tasks (last column) however they provide simi-lar performance As they provide different qualities they area good candidate for combination We thus provide both asinput to our model and obtain superior single-task results onCOCO Flickr30k and VQA with results on Image Chat asgood as with ResNeXt-IG-35B and better than with FasterR-CNN Multi-tasking performance also improves over pre-vious results We thus adopt this combination strategy in sub-sequent experiments

2Another choice would have been to combine them but we didnot do that here

Image Encoder COCO Flickr30k Image Chat VQA Avg

ResNeXt-IG-35B ST 507 753 564 619 611MT 480 770 562 620 608

Faster R-CNN ST 493 682 542 663 595MT 521 724 532 663 610

ResNeXt-IG-35B+ Faster R-CNN ST 573 797 564 670 651MT 512 817 552 664 637

Table 2 Comparison between image representations as part of our TransResNet-MMC architecture with either single-task (ST) or multi-task(MT) training evaluating on COCO Flickr30k Image Chat and VQA and reporting average (Avg) performance across the tasks

Text Encoder COCO Flickr30k Image Chat Avg

from scratch 407 655 376 480fastText init 449 690 456 532BERT 501 720 521 581Reddit-79M 443 684 503 543Reddit-128M 488 718 552 586

Table 3 Comparison between text encoding Transformer pre-training methods when used as part of TransResNet-MMC report-ing accuracy on the respective test sets of three tasks as well as theaverage (Avg)

Early Stop COCO Fl30k PC IC ICQA VQA

COCO 540 834 550 505 439 661Fl30k 514 830 559 531 472 603

IC 524 813 588 559 514 665VQA 534 819 580 540 306 666Avg 512 817 580 552 499 664

Table 4 Training TransResNet-MMC on all tasks but only perform-ing early stopping on one specific dataset compared to stopping onthe average accuracy across all datasets (ldquoAvgrdquo)

Fine Tune COCO Fl30k PC IC ICQA VQA

COCO 596 765 340 318 300 582Flickr30k 507 840 542 521 471 608

IC 524 813 588 559 514 665VQA 366 656 471 386 307 662All 512 817 580 552 499 664

Table 5 Training TransResNet-MMC on all tasks and then fine-tuning on each of the tasks compared to the original best performingmulti-task model (called ldquoAllrdquo)

VQA Model training class head ranking head

Classification head 670 na

Ranking head na 540

Multi-head training 661 635

Table 6 Training VQA with either a classification head a rankinghead or multi-tasking both Multi-tasking both helps the rankinghead improve

COCO Fl30k IC

ResNeXt-IG-35B wo MMC 488 718 552with MMC 507 753 566

ResNeXt-IG-35B wo MMC 536 756 469+ Faster R-CNN with MMC 573 797 564

Table 7 Comparison of with and without (wo) the multimodal com-biner (MMC) as part of our TransResNet architecture for COCOFlickr30k (Fl30k) and Image Chat (IC) using either ResNeXt-IG-35B (ResNeXt-IG) features alone or in combination with Faster R-CNN features The MMC provides gains in all cases

Multimodal Combiner We next assess the impact of themultimodal combiner module in our architecture we first an-alyze the non-attentive version We consider either using itor replacing it with a simple sum over feature type represen-tations see Section 43 We compare these alternatives onthree tasks COCO Flickr30k and Image Chat and exam-ine performance both for our best performing combinationfeatures as well as for ResNeXt-IG-35B alone The resultsare given in Table 7 We see that without this component ofthe architecture the model can still give somewhat reason-able performance However by combining modalities with aTransformer architecture we do see improvements across alltasks The MMC module takes as input a sequence-based rep-resentation of the context history (token-level representation)We also experimented with giving the mean sequence repre-sentation as input to the MMC instead which gave worse re-sults (5 regression on IC) We thus report subsequent exper-iments using the full combiner using sequence-based inputsFreezing versus Fine-Tuning Encoders We compare theperformance of our models when either freezing the imageand text encoders after pre-training or fine-tuning them inaddition to the multimodal combiner of the language and vi-sion tasks If they are frozen only the multimodal combineris trained Table 8 presents the results comparing multi-task performance across several of our tasks There are clearwins from fine-tuning the encoders on the multimodal train-ing dataRanking vs Classification Head We compare the perfor-mance of training VQA with the classification and rankingheads or training both at the same time The results areshown in Table 6

Training with a classification head alone (first row) pro-vides the best performance on VQA However transfer to

COCO Fl30k PC IC ICQA VQA

Freeze 279 572 406 406 376 645Fine-tune 512 817 580 552 499 664

Table 8 Training TransResNet-MMC on all tasks with freezing ornot the text and image encoders

other tasks shown by evaluating them using the ranking headgives poor results understandably as that has not been trainedfor (see Table 10 row 4) Using a ranking head to train VQAgives far worse performance on VQA We attribute this tothe subsampling of negative candidates in the loss functionrather than considering sim3k possible candidates at once inthe classification head The last row of Table 6 shows the per-formance of training both heads at once and then evaluatingthe two heads This dramatically improves the performanceof the ranking head on VQA as the classification head helpsthe model attain good weights

Single Task Results Using our best approach we now re-port final results fine-tuned on each task independently Theresults are given in Table 10 We report results across allevaluation sets for a given training target Eg the first rowshows the model performance when training with COCOevaluated on the test sets of COCO Flickr30k PersonalityCaptions Image Chat Image Chat QA VQA IGCQ IGCQAand VQA As expected we observe better results on the testset of the task being trained on than on other test sets How-ever there is some transfer between some of the tasks Forexample training on COCO gives non-trivial performance onFlickr30k and vice-versa although Flickr30k training helpsless probably due to its smaller size

Multi-Task Results Results of our Multi-Task models aregiven in Table 10 last four rows We first assess the per-formance of the MMC MT model (without an attentive mul-timodal combiner) We achieve marginally superior perfor-mance on Personality Captions and Flickr30k and marginallyinferior performance on the other tasks compared to our bestsingle-task (ST) models but in a single conversational agentThe final column showing average performance across tasksmakes this point clear as those numbers are vastly superiorfor the multi-task models Like our single-task counterpartsmany of these results are still well above previous state of theart eg on Personality Captions and Image Chat and withinrange of the state of the art on COCO and Flickr30k

Multi-Task Results with Attentive Multimodal CombinerWe next assess the effect of Multi-Task training with multi-ple Transformers in the multimodal combiner (2 3 or 4 in-stead of the single Transformer in our base architecture) Thebottom four rows in Table 10 show that using the attentivemultimodal combiner leads to improvements in average per-formance over all tasks with 2AMMC achieving the best re-sults on PC and IC tasks of all methods and 3AMMC beingslightly better on average Note that the early stopping cri-terion for these experiments is the average performance overall tasks which leads to performance gains shifting betweentasks among architectures while the average itself is con-trolled This could be altered by selecting a different stop-

ping criterion as detailed further below and in Table 4 Ta-ble 11 breaks down the performance obtained on all tasks byeach of the Transformers in the 3AMMC There are strikingdifferences between the tasks as to how performance is splitamong the three MMCs on VQA MMC-1 and MMC-2 havenear 0 performance while MMC-3 performs as well as the fullsystem but this comes at the expense of much worse perfor-mance on all the conversational tasks compared to MMC-1and MMC-2 On PC MMC-1 performs nearly as well asthe full system and much better than MMC-2 and MMC-3The overall best performance on all other tasks requires com-bining all three MMCs To see if the performance gains ofAMMC come just from the network being larger we com-pare to MMC modules with more layers up to an equivalentsize see Table 12 The results show that making standardMMC larger only hurts performance Increasing the numberof MMC heads similarly degrades performance (results notshown) These results highlight the benefits of the AMMCdesign

Multi-Tasking Small vs Large Tasks The tasks we dosee a performance gain in when multi-tasking Flickr30k andPersonality-Captions tend to be smaller tasks where there isa larger related task also in the multi-tasking set in this caseCOCO and Image Chat To investigate the effects of train-ing set size on multi-tasking transfer we thus conducted ex-periments to see if we observe the same effects of improve-ment on another dataset if we downsampled it We thus con-sider adjusting the training set size of COCO to be the samesize as Flickr30k and then consider multiples of that sizeand observe the change in performance with changing sizeWe compare single-task training on that subset to multi-tasktraining with all other tasks and that subset For these ex-periments we considered a smaller hyperparameter sweep forsimplicity with a multimodal combiner of 2 layers and sweepacross different number of heads for the multi-head attentionexplaining the slightly lower results We perform early stop-ping on COCO The results are given in Table 14 We observefor single-task training a drop from 54 accuracy to 421 aswe go from 83k examples down to 29k Multi-tasking withthe full COCO dataset also yields the same 54 accuracywhich makes it appear that multi-tasking is not useful for gen-eralization However subsampling COCO reveals a differentstory ndash the smaller the training set the more the multi-taskinghelps with a gap of 421 to 493 in the 29k training exam-ple case As researchers who construct new tasks often collectlarge scale datasets this means multi-tasking will often haveless effect than is observed in a few-shot setup

Multi-Tasking + Single-Task Fine-Tuning While our goalis to make a single agent that is good at all our tasks we alsoinvestigate if multi-tasking can help improve performance ona single task by either multi-tasking and early stopping ona particular task or multi-tasking and then fine-tuning on aparticular task

Early stopping test results are shown in Table 4 We re-port for each task out of COCO Flickr30k Image Chat andVQA the performance on the test set of that task itself as wellas transfer performance to the other tasks These results canbe compared to the results of optimizing multi-task perfor-

Model Training data COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA

ExistingModels

SCAN [Lee et al 2018] 504 674 - - - - - -SCG [Shi et al 2019] 566 718 - - - - - -Unicoder-VL [Li et al 2019a] 623 862 - - - - - -Unicoder-VL wo pre-training - 730 - - - - - -UNITER Base 633 847 - - - - - 723

UNITER Large 666 882 - - - - - 732

HDC [Nguyen and Okatani 2019] 422 716 - - - - - 693

VisualBERT (ST) [Li et al 2019b] - - - - - - - 708

VisualBERT (ST) wo pre-training - - - - - - - 702

ViLBERT (ST) [Lu et al 2019a] - - - - - - - 706

ViLBERT (ST) wo pre-training - - - - - - - 690

Pythia [Jiang et al 2018]3 - - - - - - - 667Pythia 3 - - - - - - - 692

ViLBERT (MT) [Lu et al 2019c] - - - - - - - 726

ViLBERT (MT + FT) - - - - - - - 732

TransResNet [Shuster et al 2018] 443 684 535 503 492 217 224 -

Table 9 Previous state-of-the-art results indicates results achieved by training with some or all of the validation set added to the train setwhereas we only train on the train set Note that the results in this table correspond to a different model for each column as the architecturesare fine-tuned on each task separately rather than training a single architecture in a multi-task way The ViLBERT (MT) model is a multi-taskmodel but uses image retrieval settings on COCO and Flickr30k that are not comparable to the results presented here The Pythia models onrows 9 and 10 are the same except they are trained with the VQA train set and VQA train + valid set respectively thus we list both numbers

Arch Training data COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA AvgMMC ST COCO 572 694 240 165 131 134 103 03 255MMC ST Flickr30k 277 797 230 163 138 158 122 02 236MMC ST IC 200 405 573 563 552 358 433 04 386MMC ST VQA 00 03 12 13 12 16 19 670 93MMC MT + FT 596 840 588 559 514 302 411 665 560MMC MT 512 817 580 552 499 257 384 664 533

2AMMC MT 542 820 595 569 523 281 381 656 5463AMMC MT 527 829 585 561 524 314 398 669 5514AMMC MT 532 818 587 562 545 318 358 659 548

Table 10 Multi-tasking test results of our models The first four rows show the transfer performance of our TransResNet-MMC model trainedon a single task (ST) indicated in the Training data column The fifth row shows a multi-task model which is then fine-tuned (MT+FT) oneach single task separately (each column corresponds to a separate model we hence report average performance in gray italics) The bottomfour rows compare performance of single multi-task models with different types of multimodal combiners The multi-task performance isclose to single-task performance and in some cases better across several tasks The attentive multimodal combiner (AMMC) obtains the bestoverall average performance

Arch COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA Avg3AMMC (MMC-1) MT 247 616 489 451 278 240 221 11 3203AMMC (MMC-2) MT 311 506 195 260 278 274 335 00 2703AMMC (MMC-3) MT 310 619 213 130 95 89 130 669 282

Table 11 Results on each dataset when we evaluate our 3AMMC model by only taking a single MMC output as the context representationThe first MMC alone already gives good performance on PC and IC and the third on VQA All three are needed for some of the tasks

MMC Arch (Compare to) COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA Avg4 Layers (2 AMMC) MT 514 810 560 533 485 291 399 667 5336 Layers (3 AMMC) MT 486 781 571 534 497 308 368 661 5268 Layers (4 AMMC) MT 356 656 362 336 318 266 268 590 394

Table 12 Test results when we train our MMC models with varying numbers of layers which we compare to our AMMC model sizesIncreasing the number of MMC layers only hurts performance

Image Output

Task Coco

TransResNet MMC there is a broken tree log on the ground

Task Coco

TransResNet MMC A large grass covered field under a mountain

Task Flickr30k

TransResNet MMC A chaparral landscape scene void of human residence

Task Flickr30k

TransResNet MMC A plane flying sideways

Task VQAContext What is the color of the mountain

TransResNet MMC gray

Task VQAContext Does it appear to be rainy

TransResNet MMC no

Task Personality Captions (Style Happy)

TransResNet MMC Wow what a beautiful and perfect shade of pink and red I am so captivated

Task Personality Captions (Style Attractive)

TransResNet MMC Wow I would love for someone to create this setting in the sand for me

Task Image Chat (Style Compassionate)

Context Round 1 Something about the pattern calms meTransResNet MMC The architecture calms you

Task Image Chat (Style Emotional)Context Round 1 Airplanes are scary to get on you never know if it will crash or not

Round 2 But these are professional pilots thoughTransResNet MMC They are and for many people they mean a lot My grandfather loved planes

Table 13 Example output from our TransResNet MMC multi-task model for different tasks

mance in the last row rdquoAvgrdquo see also Table 10 (sixth row)There are clear gains for each task that is early stopped but atlarge expense for the other tasks For example fine-tuning onCOCO gives 540 compared to 512 when multi-taskingbut is still worse than the 572 when training as a single-task Transfer to Flickr30k is still good likely as they are sim-ilar tasks but Image Chat results are then poor On Flickr theearly stopping result of 830 is superior to both the multi-task result of 817 and the single-task result of 797 This

can be explained by Flickr30k being smaller than COCO andthus benefiting from multi-tasking more as we explained inthe previous section

Multi-tasking followed by single task fine-tuning test re-sults are shown in Table 5 (also summarized in Table 10)Generally these are superior to the multi-tasking per-taskearly stopping results For example fine-tuning on COCOgives 596 compared to 540 when multi-tasking andearly stopping or even 572 when training as a single-task

COCO Train Size Multi-Task Single-Task10x Flickr30k (29000) 493 42115x Flickr30k (43500) 516 50320x Flickr30k (58000) 537 51925x Flickr30k (72500) 538 536Full Size (82783) 540 540

Table 14 Accuracy on COCO test set when downsampling COCOduring training to the same size as the Flickr30k training set ormultiples thereof Smaller training sets are clearly helped by multi-tasking Eventually there is enough data of the single task

so it is the best result we obtain over all methods We alsoachieve our best results in this fashion on Flickr30k ForVQA the validation results were higher (not shown) in thissetting but resulted in slightly inferior test numbers so asmall amount of overfitting occurs

Comparison to Existing Results. We give results from previous work in Table 9. Our results compare favorably on the conversational tasks, i.e., PC, IC, ICQA, IGCQ and IGCQA. For the COCO, Flickr30k and VQA tasks, our results are within range of the state of the art, but are surpassed by some of the methods. We note that on COCO, others used the validation set for training, whereas we did not (see Sec. 2.3: we do not want multi-task experiment train and valid data to overlap). For VQA we report the number from Pythia³ as a comparison point, as that method uses the train set only, without VQA data augmentation from the Visual Genome, VisDial, or other data augmentations (similar to us), and we used their setup as a starting point for our implementation. Our numerical results are comparable to theirs.

Larger-Scale Cross-Module Pre-training. Some of the best-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al., 2019; Li et al., 2019a; Lu et al., 2019a], which leads to better performance but requires gigantic multimodal datasets like Conceptual Captions [Sharma et al., 2018] or Visual Genome [Krishna et al., 2017], as shown in Table 15, as well as high computing resources (e.g., 882 and 3645 V100 GPU hours for UNITER-base and UNITER-large, respectively). Pre-training on COCO alone, as done in [Li et al., 2019b], gives more limited improvement (see Table 9). Our approach combines vision and text encoders with minimal additional multimodal training resources required. Even counting all the multi-task datasets used for training adds up to only 1M image-sentence (I-S) pairs, resulting in training that takes around 40 V100 GPU hours. We expect larger-scale cross-module pre-training would also improve the performance of our models, but this is beyond the scope of this work.
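For a rough sense of the gap, the figures quoted above and in Table 15 imply roughly an order of magnitude more pre-training image-sentence pairs and 20-90x more GPU hours for UNITER than for our multi-task training; a quick check (using only the numbers reported in the text, not new measurements):

```python
# Ratios implied by the figures quoted above (not new measurements):
uniter_pairs, our_pairs = 9.6e6, 1.0e6            # image-sentence pairs (Table 15 vs. ours)
uniter_gpu_hours = {"base": 882, "large": 3645}   # V100 GPU hours reported for UNITER
our_gpu_hours = 40                                # approximate V100 GPU hours for our training

print(uniter_pairs / our_pairs)                   # ~9.6x more I-S pairs
print({k: round(v / our_gpu_hours, 1) for k, v in uniter_gpu_hours.items()})  # ~22x / ~91x more compute
```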

Model       | Dataset            | Size (I-S Pairs)
UNITER      | COCO, VG, CC, SBUC | 9.6 M
ViLBERT     | CC                 | 3.0 M
Unicoder-VL | CC, SBUC           | 3.8 M

Table 15: Sizes of multimodal pre-training datasets in terms of image-sentence pairs. Our model obtains comparable results on all tasks without any cross-module pre-training on large datasets such as Visual Genome (VG), Conceptual Captions (CC), or SBU Captions (SBUC). Thus, multi-tasking can be viewed as a strong alternative to large-scale pre-training, considering its simplicity and effectiveness in terms of computation power.

³ https://learnpythia.readthedocs.io/en/latest/tutorials/pretrained_models.html#pretrained-models

Example Predictions. We show example predictions of our MMC multi-task model in Table 13. We take test images, and for COCO, Flickr30k, Personality Captions and Image Chat we show the top ranked candidate using the ranking head, ranking all utterances from the given training set. For VQA we show the output of the classification head. We observe that the same underlying model can produce a diverse range of outputs depending on the task, ranging from factual captioning to conversations grounded on the image.
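The split between the two output heads used above can be illustrated with a small sketch: a ranking head that scores the fused image-and-dialogue context against encoded candidate utterances (here via a dot product), and a classification head over a fixed VQA answer vocabulary. The hidden size and the roughly 3k answer count below are illustrative placeholders rather than the exact values of our implementation.

```python
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    """Two heads over the same fused multimodal context representation."""

    def __init__(self, dim=512, num_answers=3129):
        super().__init__()
        # Classification head: a linear layer over a fixed answer vocabulary
        # (roughly 3k frequent VQA answers; the exact size here is illustrative).
        self.vqa_classifier = nn.Linear(dim, num_answers)

    def rank(self, context_repr, candidate_reprs):
        # Ranking head: score each encoded candidate utterance against the
        # fused image + dialogue context and return the best candidate index.
        scores = candidate_reprs @ context_repr      # shape: (num_candidates,)
        return scores.argmax().item()

    def classify(self, context_repr):
        # Classification head: most probable answer index for VQA.
        return self.vqa_classifier(context_repr).argmax().item()

# Toy usage with random tensors standing in for encoder outputs:
heads = OutputHeads()
context = torch.randn(512)           # fused multimodal context representation
candidates = torch.randn(100, 512)   # e.g., 100 encoded candidate utterances
print(heads.rank(context, candidates), heads.classify(context))
```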

6 Conclusion

In order to build an image-grounded conversational agent, we have assembled disparate multimodal tasks and built a single architecture for multi-task training on them, incorporating a novel attentive multimodal combination module. Through detailed analysis and ablations, we have shown that our approach can obtain strong performance across a number of tasks. Future work could investigate further how these skills are blended during interaction, rather than evaluating them as stand-alone tasks, and consider more tasks.

7 Acknowledgements

We are grateful to Amanpreet Singh and Vedanuj Goswami for providing help and advice, comparison results, and Faster R-CNN image features. We also thank Mary Williamson and Eric Smith for very useful discussions.

References

[Bakhtin et al., 2019] Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. Real or fake? Learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019.

[Bengio et al., 2003] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[Chen et al., 2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[Chen et al., 2019] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning UNiversal Image-TExt Representations. arXiv e-prints, page arXiv:1909.11740, Sep 2019.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

[Das et al., 2017] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M.F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[Girshick et al., 2018] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollar, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.

[Goyal et al., 2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[Humeau et al., 2019] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969, 2019.

[Jiang et al., 2018] Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: The winning entry to the VQA Challenge 2018. arXiv e-prints, page arXiv:1807.09956, Jul 2018.

[Joulin et al., 2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April 2017. Association for Computational Linguistics.

[Karpathy and Fei-Fei, 2017] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676, April 2017.

[Krishna et al., 2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[LeCun et al., 1990] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.

[Lee et al., 2018] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked Cross Attention for Image-Text Matching. arXiv e-prints, page arXiv:1803.08024, Mar 2018.

[Li et al., 2019a] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. arXiv e-prints, page arXiv:1908.06066, Aug 2019.

[Li et al., 2019b] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv e-prints, page arXiv:1908.03557, Aug 2019.

[Lu et al., 2019a] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.

[Lu et al., 2019b] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, 2019.

[Lu et al., 2019c] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-Task Vision and Language Representation Learning. arXiv e-prints, page arXiv:1912.02315, Dec 2019.

[Mahajan et al., 2018] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Proceedings of the European Conference on Computer Vision, pages 185–201, Cham, 2018. Springer International Publishing.

[Mazare et al., 2018] Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.

[McCann et al., 2018] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[Miller et al., 2017] Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Mostafazadeh et al., 2017] Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.

[Nguyen and Okatani, 2019] Duy-Kien Nguyen and Takayuki Okatani. Multi-task learning of hierarchical vision-language representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10492–10501, 2019.

[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[Raffel et al., 2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.

[Shi et al., 2019] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5182–5189. International Joint Conferences on Artificial Intelligence Organization, 7 2019.

[Shuster et al., 2018] Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945, 2018.

[Shuster et al., 2019a] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image captioning via personality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12516–12526, 2019.

[Shuster et al., 2019b] Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768, 2019.

[Singh et al., 2018] Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia - a platform for vision & language research. In SysML Workshop, NeurIPS, volume 2018, 2018.

[Singh et al., 2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[Su et al., 2019] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.

[Tan and Bansal, 2019] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China, November 2019. Association for Computational Linguistics.

[Vaswani et al., 2017a] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Vaswani et al., 2017b] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv e-prints, page arXiv:1706.03762, Jun 2017.

[Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

Arch COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA Avg3AMMC (MMC-1) MT 247 616 489 451 278 240 221 11 3203AMMC (MMC-2) MT 311 506 195 260 278 274 335 00 2703AMMC (MMC-3) MT 310 619 213 130 95 89 130 669 282

Table 11 Results on each dataset when we evaluate our 3AMMC model by only taking a single MMC output as the context representationThe first MMC alone already gives good performance on PC and IC and the third on VQA All three are needed for some of the tasks

MMC Arch (Compare to) COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA Avg4 Layers (2 AMMC) MT 514 810 560 533 485 291 399 667 5336 Layers (3 AMMC) MT 486 781 571 534 497 308 368 661 5268 Layers (4 AMMC) MT 356 656 362 336 318 266 268 590 394

Table 12 Test results when we train our MMC models with varying numbers of layers which we compare to our AMMC model sizesIncreasing the number of MMC layers only hurts performance

Image Output

Task Coco

TransResNet MMC there is a broken tree log on the ground

Task Coco

TransResNet MMC A large grass covered field under a mountain

Task Flickr30k

TransResNet MMC A chaparral landscape scene void of human residence

Task Flickr30k

TransResNet MMC A plane flying sideways

Task VQAContext What is the color of the mountain

TransResNet MMC gray

Task VQAContext Does it appear to be rainy

TransResNet MMC no

Task Personality Captions (Style Happy)

TransResNet MMC Wow what a beautiful and perfect shade of pink and red I am so captivated

Task Personality Captions (Style Attractive)

TransResNet MMC Wow I would love for someone to create this setting in the sand for me

Task Image Chat (Style Compassionate)

Context Round 1 Something about the pattern calms meTransResNet MMC The architecture calms you

Task Image Chat (Style Emotional)Context Round 1 Airplanes are scary to get on you never know if it will crash or not

Round 2 But these are professional pilots thoughTransResNet MMC They are and for many people they mean a lot My grandfather loved planes

Table 13 Example output from our TransResNet MMC multi-task model for different tasks

mance in the last row rdquoAvgrdquo see also Table 10 (sixth row)There are clear gains for each task that is early stopped but atlarge expense for the other tasks For example fine-tuning onCOCO gives 540 compared to 512 when multi-taskingbut is still worse than the 572 when training as a single-task Transfer to Flickr30k is still good likely as they are sim-ilar tasks but Image Chat results are then poor On Flickr theearly stopping result of 830 is superior to both the multi-task result of 817 and the single-task result of 797 This

can be explained by Flickr30k being smaller than COCO andthus benefiting from multi-tasking more as we explained inthe previous section

Multi-tasking followed by single task fine-tuning test re-sults are shown in Table 5 (also summarized in Table 10)Generally these are superior to the multi-tasking per-taskearly stopping results For example fine-tuning on COCOgives 596 compared to 540 when multi-tasking andearly stopping or even 572 when training as a single-task

COCO Train Size Multi-Task Single-Task10x Flickr30k (29000) 493 42115x Flickr30k (43500) 516 50320x Flickr30k (58000) 537 51925x Flickr30k (72500) 538 536Full Size (82783) 540 540

Table 14 Accuracy on COCO test set when downsampling COCOduring training to the same size as the Flickr30k training set ormultiples thereof Smaller training sets are clearly helped by multi-tasking Eventually there is enough data of the single task

so it is the best result we obtain over all methods We alsoachieve our best results in this fashion on Flickr30k ForVQA the validation results were higher (not shown) in thissetting but resulted in slightly inferior test numbers so asmall amount of overfitting occurs

Comparison to Existing Results We give results from pre-vious work in Table 9 Our results compare favorably on theconversational tasks ie PC IC ICQA IGCQ and IGCQAFor the COCO Flickr30k and VQA tasks our results arewithin range of the state of the art but are surpassed by someof the methods We note that on COCO others used the vali-dation set for training whereas we did not (see Sec 23 we donot want multi-task experiment train and valid data to over-lap) For VQA we report the number from Pythia3 as a com-parison point as that method uses the train set only withoutVQA data augmentation from the Visual Genome VisDialor other data augmentations (similar to us) and we used theirsetup as a starting point for our implementation Our numer-ical results are comparable to theirs

Larger-Scale Cross-Module Pre-training Some of thebest-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al 2019 Li etal 2019a Lu et al 2019a] which leads to better perfor-mance but requires gigantic multimodal datasets like Con-ceptual Captions [Sharma et al 2018] or Visual Genome[Krishna et al 2017] as shown in Table 15 as well ashigh computing resources (eg 882 and 3645 V100 GPUhours for UNITER-base and UNITER-large respectively)Pre-training on COCO alone as done in [Li et al 2019b]gives more limited improvement (see Table 9) Our approachcombines vision and text encoders with minimal additionalmultimodal training resources required Even counting allthe multi-task datasets used for training adds up to only 1Mimage-sentence (I-S) pairs resulting in training that takesaround 40 V100 GPU hours We expect larger-scale cross-module pre-training would also improve the performance ofour models but this is beyond the scope of this work

Example Predictions We show example predictions of ourMMC multi-task model in Table 13 We take test images andfor COCO Flickr30k Personality Captions and Image Chatwe show the top ranked candidate using the ranking headranking all utterances from the given training set For VQA

3httpslearnpythiareadthedocsioenlatesttutorialspretrained modelshtmlpretrained-models

Model Dataset Size (I-S Pair)UNITER COCOVGCCSBUC 96 MViLBERT CC 30 MUnicoder-VL CC SBUC 38 M

Table 15 Sizes of multimodal pre-training datasets in terms ofimage-sentence pairs Our model obtains comparable results on alltasks without any cross-module pre-training on large datasets suchas Visual Genome (VG) Conceptual Captions (CC) or SBU Cap-tions (SBUC) Thus multi-tasking can be viewed as a strong alter-native to large-scale pre-trainining considering its simplicity andeffectiveness in terms of computation power

we show the output of the classification head We observethat the same underlying model can produce a diverse rangeof outputs depending on the task ranging from factual cap-tioning to conversations grounded on the image

6 ConclusionIn order to build an image-grounded conversational agent wehave assembled disparate multimodal tasks and built a sin-gle architecture for multi-task training on them incorporatinga novel attentive multimodal combination module Throughdetailed analysis and ablations we have shown that our ap-proach can obtain strong performance across a number oftasks Future work could investigate further how these skillsare blended during interaction rather than evaluate them asstand-alone tasks and consider more tasks

7 AcknowledgementsWe are grateful to Amanpreet Singh and Vedanuj Goswamifor providing help and advice comparison results and fasterR-CNN image features We also thank Mary Williamson andEric Smith for very useful discussions

References[Bakhtin et al 2019] Anton Bakhtin Sam Gross Myle Ott

Yuntian Deng MarcrsquoAurelio Ranzato and Arthur SzlamReal or fake learning to discriminate machine from hu-man generated text arXiv preprint arXiv1906033512019

[Bengio et al 2003] Yoshua Bengio Rejean DucharmePascal Vincent and Christian Jauvin A neural probabilis-tic language model Journal of machine learning research3(Feb)1137ndash1155 2003

[Chen et al 2015] Xinlei Chen Hao Fang Tsung-Yi LinRamakrishna Vedantam Saurabh Gupta Piotr Dollarand C Lawrence Zitnick Microsoft coco captionsData collection and evaluation server arXiv preprintarXiv150400325 2015

[Chen et al 2019] Yen-Chun Chen Linjie Li Licheng YuAhmed El Kholy Faisal Ahmed Zhe Gan Yu Cheng andJingjing Liu UNITER Learning UNiversal Image-TExtRepresentations arXiv e-prints page arXiv190911740Sep 2019

[Collobert and Weston 2008] Ronan Collobert and JasonWeston A unified architecture for natural language pro-cessing Deep neural networks with multitask learning InProceedings of the 25th international conference on Ma-chine learning pages 160ndash167 ACM 2008

[Das et al 2017] Abhishek Das Satwik Kottur KhushiGupta Avi Singh Deshraj Yadav Jose MF Moura DeviParikh and Dhruv Batra Visual dialog In Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition pages 326ndash335 2017

[Deng et al 2009] Jia Deng Wei Dong Richard Socher Li-Jia Li Kai Li and Li Fei-Fei Imagenet A large-scalehierarchical image database In 2009 IEEE conference oncomputer vision and pattern recognition pages 248ndash255Ieee 2009

[Devlin et al 2019] Jacob Devlin Ming-Wei Chang Ken-ton Lee and Kristina Toutanova BERT Pre-training ofdeep bidirectional transformers for language understand-ing In Proceedings of the 2019 Conference of the NorthAmerican Chapter of the Association for ComputationalLinguistics Human Language Technologies Volume 1(Long and Short Papers) pages 4171ndash4186 Minneapo-lis Minnesota June 2019 Association for ComputationalLinguistics

[Girshick et al 2018] Ross Girshick Ilija RadosavovicGeorgia Gkioxari Piotr Dollar and Kaiming He De-tectron httpsgithubcomfacebookresearchdetectron2018

[Goyal et al 2017] Yash Goyal Tejas Khot DouglasSummers-Stay Dhruv Batra and Devi Parikh Makingthe V in VQA matter Elevating the role of image under-standing in Visual Question Answering In Conference onComputer Vision and Pattern Recognition (CVPR) 2017

[He et al 2016] Kaiming He Xiangyu Zhang ShaoqingRen and Jian Sun Deep residual learning for image recog-nition In Proceedings of the IEEE conference on computervision and pattern recognition pages 770ndash778 2016

[Humeau et al 2019] Samuel Humeau Kurt ShusterMarie-Anne Lachaux and Jason Weston Poly-encodersTransformer architectures and pre-training strategies forfast and accurate multi-sentence scoring arXiv preprintarXiv190501969 2019

[Jiang et al 2018] Yu Jiang Vivek Natarajan Xinlei ChenMarcus Rohrbach Dhruv Batra and Devi Parikh Pythiav01 the Winning Entry to the VQA Challenge 2018arXiv e-prints page arXiv180709956 Jul 2018

[Joulin et al 2017] Armand Joulin Edouard Grave PiotrBojanowski and Tomas Mikolov Bag of tricks for effi-cient text classification In Proceedings of the 15th Con-ference of the European Chapter of the Association forComputational Linguistics Volume 2 Short Papers pages427ndash431 Valencia Spain April 2017 Association forComputational Linguistics

[Karpathy and Fei-Fei 2017] Andrej Karpathy and Li Fei-Fei Deep visual-semantic alignments for generating im-

age descriptions IEEE Trans Pattern Anal Mach Intell39(4)664ndash676 April 2017

[Krishna et al 2017] Ranjay Krishna Yuke Zhu OliverGroth Justin Johnson Kenji Hata Joshua KravitzStephanie Chen Yannis Kalantidis Li-Jia Li David AShamma et al Visual genome Connecting language andvision using crowdsourced dense image annotations Inter-national Journal of Computer Vision 123(1)32ndash73 2017

[Krizhevsky et al 2012] Alex Krizhevsky Ilya Sutskeverand Geoffrey E Hinton Imagenet classification with deepconvolutional neural networks In Advances in neural in-formation processing systems pages 1097ndash1105 2012

[LeCun et al 1990] Yann LeCun Bernhard E Boser John SDenker Donnie Henderson Richard E Howard Wayne EHubbard and Lawrence D Jackel Handwritten digitrecognition with a back-propagation network In Advancesin neural information processing systems pages 396ndash4041990

[Lee et al 2018] Kuang-Huei Lee Xi Chen Gang HuaHoudong Hu and Xiaodong He Stacked Cross At-tention for Image-Text Matching arXiv e-prints pagearXiv180308024 Mar 2018

[Li et al 2019a] Gen Li Nan Duan Yuejian Fang MingGong Daxin Jiang and Ming Zhou Unicoder-VL A Uni-versal Encoder for Vision and Language by Cross-modalPre-training arXiv e-prints page arXiv190806066 Aug2019

[Li et al 2019b] Liunian Harold Li Mark Yatskar Da YinCho-Jui Hsieh and Kai-Wei Chang VisualBERT A Sim-ple and Performant Baseline for Vision and LanguagearXiv e-prints page arXiv190803557 Aug 2019

[Lu et al 2019a] Jiasen Lu Dhruv Batra Devi Parikh andStefan Lee ViLBERT Pretraining task-agnostic visi-olinguistic representations for vision-and-language tasksarXiv preprint arXiv190802265 2019

[Lu et al 2019b] Jiasen Lu Vedanuj Goswami MarcusRohrbach Devi Parikh and Stefan Lee 12-in-1 Multi-task vision and language representation learning arXivpreprint arXiv191202315 2019

[Lu et al 2019c] Jiasen Lu Vedanuj Goswami MarcusRohrbach Devi Parikh and Stefan Lee 12-in-1 Multi-Task Vision and Language Representation Learning arXive-prints page arXiv191202315 Dec 2019

[Mahajan et al 2018] Dhruv Mahajan Ross Girshick Vig-nesh Ramanathan Kaiming He Manohar Paluri YixuanLi Ashwin Bharambe and Laurens van der Maaten Ex-ploring the limits of weakly supervised pretraining In Vit-torio Ferrari Martial Hebert Cristian Sminchisescu andYair Weiss editors Proceedings of the European Confer-ence on Computer Vision pages 185ndash201 Cham 2018Springer International Publishing

[Mazare et al 2018] Pierre-Emmanuel Mazare SamuelHumeau Martin Raison and Antoine Bordes Trainingmillions of personalized dialogue agents In Proceedings

of the 2018 Conference on Empirical Methods in Natu-ral Language Processing pages 2775ndash2779 BrusselsBelgium October-November 2018 Association forComputational Linguistics

[McCann et al 2018] Bryan McCann Nitish ShirishKeskar Caiming Xiong and Richard Socher The naturallanguage decathlon Multitask learning as questionanswering arXiv preprint arXiv180608730 2018

[Mikolov et al 2013] Tomas Mikolov Ilya Sutskever KaiChen Greg S Corrado and Jeff Dean Distributed repre-sentations of words and phrases and their compositional-ity In Advances in neural information processing systemspages 3111ndash3119 2013

[Miller et al 2017] Alexander Miller Will Feng Dhruv Ba-tra Antoine Bordes Adam Fisch Jiasen Lu Devi Parikhand Jason Weston ParlAI A dialog research softwareplatform In Proceedings of the 2017 Conference on Em-pirical Methods in Natural Language Processing SystemDemonstrations pages 79ndash84 Copenhagen DenmarkSeptember 2017 Association for Computational Linguis-tics

[Mostafazadeh et al 2017] Nasrin Mostafazadeh ChrisBrockett Bill Dolan Michel Galley Jianfeng GaoGeorgios Spithourakis and Lucy Vanderwende Image-grounded conversations Multimodal context for naturalquestion and response generation In Proceedings ofthe Eighth International Joint Conference on Natu-ral Language Processing (Volume 1 Long Papers)pages 462ndash472 Taipei Taiwan November 2017 AsianFederation of Natural Language Processing

[Nguyen and Okatani 2019] Duy-Kien Nguyen andTakayuki Okatani Multi-task learning of hierarchi-cal vision-language representation In Proceedings ofthe IEEE Conference on Computer Vision and PatternRecognition pages 10492ndash10501 2019

[Radford et al 2018] Alec Radford Karthik NarasimhanTim Salimans and Ilya Sutskever Improving languageunderstanding by generative pre-training 2018

[Radford et al 2019] Alec Radford Jeffrey Wu RewonChild David Luan Dario Amodei and Ilya SutskeverLanguage models are unsupervised multitask learnersOpenAI Blog 1(8) 2019

[Raffel et al 2019] Colin Raffel Noam Shazeer AdamRoberts Katherine Lee Sharan Narang Michael MatenaYanqi Zhou Wei Li and Peter J Liu Exploring the limitsof transfer learning with a unified text-to-text transformerarXiv preprint arXiv191010683 2019

[Sharma et al 2018] Piyush Sharma Nan Ding SebastianGoodman and Radu Soricut Conceptual captions Acleaned hypernymed image alt-text dataset for automaticimage captioning In Proceedings of ACL 2018

[Shi et al 2019] Botian Shi Lei Ji Pan Lu Zhendong Niuand Nan Duan Knowledge aware semantic concept ex-pansion for image-text matching In Proceedings of the

Twenty-Eighth International Joint Conference on Artifi-cial Intelligence IJCAI-19 pages 5182ndash5189 Interna-tional Joint Conferences on Artificial Intelligence Orga-nization 7 2019

[Shuster et al 2018] Kurt Shuster Samuel Humeau An-toine Bordes and Jason Weston Engaging image chatModeling personality in grounded dialogue arXiv preprintarXiv181100945 2018

[Shuster et al 2019a] Kurt Shuster Samuel Humeau Hexi-ang Hu Antoine Bordes and Jason Weston Engaging im-age captioning via personality In Proceedings of the IEEEConference on Computer Vision and Pattern Recognitionpages 12516ndash12526 2019

[Shuster et al 2019b] Kurt Shuster Da Ju Stephen RollerEmily Dinan Y-Lan Boureau and Jason Weston Thedialogue dodecathlon Open-domain knowledge and im-age grounded conversational agents arXiv preprintarXiv191103768 2019

[Singh et al 2018] Amanpreet Singh Vedanuj GoswamiVivek Natarajan Yu Jiang Xinlei Chen Meet Shah Mar-cus Rohrbach Dhruv Batra and Devi Parikh Pythia-aplatform for vision amp language research In SysML Work-shop NeurIPS volume 2018 2018

[Singh et al 2019] Amanpreet Singh Vivek NatarajanMeet Shah Yu Jiang Xinlei Chen Dhruv Batra DeviParikh and Marcus Rohrbach Towards vqa models thatcan read In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition 2019

[Su et al 2019] Weijie Su Xizhou Zhu Yue Cao Bin LiLewei Lu Furu Wei and Jifeng Dai Vl-bert Pre-trainingof generic visual-linguistic representations arXiv preprintarXiv190808530 2019

[Tan and Bansal 2019] Hao Tan and Mohit BansalLXMERT Learning cross-modality encoder represen-tations from transformers In Proceedings of the 2019Conference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Conferenceon Natural Language Processing (EMNLP-IJCNLP)pages 5099ndash5110 Hong Kong China November 2019Association for Computational Linguistics

[Vaswani et al 2017a] Ashish Vaswani Noam ShazeerNiki Parmar Jakob Uszkoreit Llion Jones Aidan NGomez Ł ukasz Kaiser and Illia Polosukhin Attentionis all you need In I Guyon U V Luxburg S BengioH Wallach R Fergus S Vishwanathan and R Garnetteditors Advances in Neural Information Processing Sys-tems 30 pages 5998ndash6008 Curran Associates Inc 2017

[Vaswani et al 2017b] Ashish Vaswani Noam ShazeerNiki Parmar Jakob Uszkoreit Llion Jones Aidan NGomez Lukasz Kaiser and Illia Polosukhin AttentionIs All You Need arXiv e-prints page arXiv170603762Jun 2017

[Young et al 2014] Peter Young Alice Lai Micah Hodoshand Julia Hockenmaier From image descriptions to visualdenotations New similarity metrics for semantic inference

over event descriptions Transactions of the Associationfor Computational Linguistics 267ndash78 2014

  • 1 Introduction
  • 2 Tasks
    • 21 Language-based
    • 22 Vision-based
    • 23 Vision + Language
      • 3 Related Work
      • 4 Methods
        • 41 Text Encoders
        • 42 Image Encoder
        • 43 Multimodal Combiner
        • 44 Attentive Multimodal Combiner
        • 45 Output Heads and Loss
          • 5 Experiments
          • 6 Conclusion
          • 7 Acknowledgements
Page 6: All-in-One Image-Grounded Conversational Agents · that obtains good results on all vision and language tasks considered, and improves the state of the art in image-grounded conversational

COCO Fl30k PC IC ICQA VQA

Freeze 279 572 406 406 376 645Fine-tune 512 817 580 552 499 664

Table 8 Training TransResNet-MMC on all tasks with freezing ornot the text and image encoders

other tasks shown by evaluating them using the ranking headgives poor results understandably as that has not been trainedfor (see Table 10 row 4) Using a ranking head to train VQAgives far worse performance on VQA We attribute this tothe subsampling of negative candidates in the loss functionrather than considering sim3k possible candidates at once inthe classification head The last row of Table 6 shows the per-formance of training both heads at once and then evaluatingthe two heads This dramatically improves the performanceof the ranking head on VQA as the classification head helpsthe model attain good weights

Single Task Results Using our best approach we now re-port final results fine-tuned on each task independently Theresults are given in Table 10 We report results across allevaluation sets for a given training target Eg the first rowshows the model performance when training with COCOevaluated on the test sets of COCO Flickr30k PersonalityCaptions Image Chat Image Chat QA VQA IGCQ IGCQAand VQA As expected we observe better results on the testset of the task being trained on than on other test sets How-ever there is some transfer between some of the tasks Forexample training on COCO gives non-trivial performance onFlickr30k and vice-versa although Flickr30k training helpsless probably due to its smaller size

Multi-Task Results Results of our Multi-Task models aregiven in Table 10 last four rows We first assess the per-formance of the MMC MT model (without an attentive mul-timodal combiner) We achieve marginally superior perfor-mance on Personality Captions and Flickr30k and marginallyinferior performance on the other tasks compared to our bestsingle-task (ST) models but in a single conversational agentThe final column showing average performance across tasksmakes this point clear as those numbers are vastly superiorfor the multi-task models Like our single-task counterpartsmany of these results are still well above previous state of theart eg on Personality Captions and Image Chat and withinrange of the state of the art on COCO and Flickr30k

Multi-Task Results with Attentive Multimodal CombinerWe next assess the effect of Multi-Task training with multi-ple Transformers in the multimodal combiner (2 3 or 4 in-stead of the single Transformer in our base architecture) Thebottom four rows in Table 10 show that using the attentivemultimodal combiner leads to improvements in average per-formance over all tasks with 2AMMC achieving the best re-sults on PC and IC tasks of all methods and 3AMMC beingslightly better on average Note that the early stopping cri-terion for these experiments is the average performance overall tasks which leads to performance gains shifting betweentasks among architectures while the average itself is con-trolled This could be altered by selecting a different stop-

ping criterion as detailed further below and in Table 4 Ta-ble 11 breaks down the performance obtained on all tasks byeach of the Transformers in the 3AMMC There are strikingdifferences between the tasks as to how performance is splitamong the three MMCs on VQA MMC-1 and MMC-2 havenear 0 performance while MMC-3 performs as well as the fullsystem but this comes at the expense of much worse perfor-mance on all the conversational tasks compared to MMC-1and MMC-2 On PC MMC-1 performs nearly as well asthe full system and much better than MMC-2 and MMC-3The overall best performance on all other tasks requires com-bining all three MMCs To see if the performance gains ofAMMC come just from the network being larger we com-pare to MMC modules with more layers up to an equivalentsize see Table 12 The results show that making standardMMC larger only hurts performance Increasing the numberof MMC heads similarly degrades performance (results notshown) These results highlight the benefits of the AMMCdesign

Multi-Tasking Small vs Large Tasks The tasks we dosee a performance gain in when multi-tasking Flickr30k andPersonality-Captions tend to be smaller tasks where there isa larger related task also in the multi-tasking set in this caseCOCO and Image Chat To investigate the effects of train-ing set size on multi-tasking transfer we thus conducted ex-periments to see if we observe the same effects of improve-ment on another dataset if we downsampled it We thus con-sider adjusting the training set size of COCO to be the samesize as Flickr30k and then consider multiples of that sizeand observe the change in performance with changing sizeWe compare single-task training on that subset to multi-tasktraining with all other tasks and that subset For these ex-periments we considered a smaller hyperparameter sweep forsimplicity with a multimodal combiner of 2 layers and sweepacross different number of heads for the multi-head attentionexplaining the slightly lower results We perform early stop-ping on COCO The results are given in Table 14 We observefor single-task training a drop from 54 accuracy to 421 aswe go from 83k examples down to 29k Multi-tasking withthe full COCO dataset also yields the same 54 accuracywhich makes it appear that multi-tasking is not useful for gen-eralization However subsampling COCO reveals a differentstory ndash the smaller the training set the more the multi-taskinghelps with a gap of 421 to 493 in the 29k training exam-ple case As researchers who construct new tasks often collectlarge scale datasets this means multi-tasking will often haveless effect than is observed in a few-shot setup

Multi-Tasking + Single-Task Fine-Tuning While our goalis to make a single agent that is good at all our tasks we alsoinvestigate if multi-tasking can help improve performance ona single task by either multi-tasking and early stopping ona particular task or multi-tasking and then fine-tuning on aparticular task

Early stopping test results are shown in Table 4 We re-port for each task out of COCO Flickr30k Image Chat andVQA the performance on the test set of that task itself as wellas transfer performance to the other tasks These results canbe compared to the results of optimizing multi-task perfor-

Model Training data COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA

ExistingModels

SCAN [Lee et al 2018] 504 674 - - - - - -SCG [Shi et al 2019] 566 718 - - - - - -Unicoder-VL [Li et al 2019a] 623 862 - - - - - -Unicoder-VL wo pre-training - 730 - - - - - -UNITER Base 633 847 - - - - - 723

UNITER Large 666 882 - - - - - 732

HDC [Nguyen and Okatani 2019] 422 716 - - - - - 693

VisualBERT (ST) [Li et al 2019b] - - - - - - - 708

VisualBERT (ST) wo pre-training - - - - - - - 702

ViLBERT (ST) [Lu et al 2019a] - - - - - - - 706

ViLBERT (ST) wo pre-training - - - - - - - 690

Pythia [Jiang et al 2018]3 - - - - - - - 667Pythia 3 - - - - - - - 692

ViLBERT (MT) [Lu et al 2019c] - - - - - - - 726

ViLBERT (MT + FT) - - - - - - - 732

TransResNet [Shuster et al 2018] 443 684 535 503 492 217 224 -

Table 9 Previous state-of-the-art results indicates results achieved by training with some or all of the validation set added to the train setwhereas we only train on the train set Note that the results in this table correspond to a different model for each column as the architecturesare fine-tuned on each task separately rather than training a single architecture in a multi-task way The ViLBERT (MT) model is a multi-taskmodel but uses image retrieval settings on COCO and Flickr30k that are not comparable to the results presented here The Pythia models onrows 9 and 10 are the same except they are trained with the VQA train set and VQA train + valid set respectively thus we list both numbers

Arch Training data COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA AvgMMC ST COCO 572 694 240 165 131 134 103 03 255MMC ST Flickr30k 277 797 230 163 138 158 122 02 236MMC ST IC 200 405 573 563 552 358 433 04 386MMC ST VQA 00 03 12 13 12 16 19 670 93MMC MT + FT 596 840 588 559 514 302 411 665 560MMC MT 512 817 580 552 499 257 384 664 533

2AMMC MT 542 820 595 569 523 281 381 656 5463AMMC MT 527 829 585 561 524 314 398 669 5514AMMC MT 532 818 587 562 545 318 358 659 548

Table 10 Multi-tasking test results of our models The first four rows show the transfer performance of our TransResNet-MMC model trainedon a single task (ST) indicated in the Training data column The fifth row shows a multi-task model which is then fine-tuned (MT+FT) oneach single task separately (each column corresponds to a separate model we hence report average performance in gray italics) The bottomfour rows compare performance of single multi-task models with different types of multimodal combiners The multi-task performance isclose to single-task performance and in some cases better across several tasks The attentive multimodal combiner (AMMC) obtains the bestoverall average performance

Arch               COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA   Avg
3AMMC (MMC-1) MT   24.7  61.6       48.9  45.1  27.8  24.0  22.1    1.1  32.0
3AMMC (MMC-2) MT   31.1  50.6       19.5  26.0  27.8  27.4  33.5    0.0  27.0
3AMMC (MMC-3) MT   31.0  61.9       21.3  13.0   9.5   8.9  13.0   66.9  28.2

Table 11: Results on each dataset when we evaluate our 3AMMC model by only taking a single MMC output as the context representation. The first MMC alone already gives good performance on PC and IC, and the third on VQA. All three are needed for some of the tasks.

MMC Arch (Compare to)   COCO  Flickr30k  PC    IC    ICQA  IGCQ  IGCQA  VQA   Avg
4 Layers (2 AMMC) MT    51.4  81.0       56.0  53.3  48.5  29.1  39.9   66.7  53.3
6 Layers (3 AMMC) MT    48.6  78.1       57.1  53.4  49.7  30.8  36.8   66.1  52.6
8 Layers (4 AMMC) MT    35.6  65.6       36.2  33.6  31.8  26.6  26.8   59.0  39.4

Table 12: Test results when we train our MMC models with varying numbers of layers, which we compare to our AMMC model sizes. Increasing the number of MMC layers only hurts performance.

Task: COCO
TransResNet MMC: there is a broken tree log on the ground

Task: COCO
TransResNet MMC: A large grass covered field under a mountain

Task: Flickr30k
TransResNet MMC: A chaparral landscape scene void of human residence

Task: Flickr30k
TransResNet MMC: A plane flying sideways

Task: VQA
Context: What is the color of the mountain?
TransResNet MMC: gray

Task: VQA
Context: Does it appear to be rainy?
TransResNet MMC: no

Task: Personality Captions (Style: Happy)
TransResNet MMC: Wow what a beautiful and perfect shade of pink and red I am so captivated

Task: Personality Captions (Style: Attractive)
TransResNet MMC: Wow I would love for someone to create this setting in the sand for me

Task: Image Chat (Style: Compassionate)
Context: Round 1: Something about the pattern calms me
TransResNet MMC: The architecture calms you

Task: Image Chat (Style: Emotional)
Context: Round 1: Airplanes are scary to get on you never know if it will crash or not
         Round 2: But these are professional pilots though
TransResNet MMC: They are and for many people they mean a lot My grandfather loved planes

Table 13: Example output from our TransResNet MMC multi-task model for different tasks. (The corresponding images are not reproduced here.)

There are clear gains for each task that is early stopped, but at large expense for the other tasks. For example, fine-tuning on COCO gives 54.0% compared to 51.2% when multi-tasking, but is still worse than the 57.2% when training as a single task. Transfer to Flickr30k is still good, likely as they are similar tasks, but Image Chat results are then poor. On Flickr30k, the early stopping result of 83.0% is superior to both the multi-task result of 81.7% and the single-task result of 79.7%. This can be explained by Flickr30k being smaller than COCO, and thus benefiting more from multi-tasking, as we explained in the previous section.

Multi-tasking followed by single-task fine-tuning test results are shown in Table 5 (also summarized in Table 10). Generally, these are superior to the multi-tasking per-task early stopping results. For example, fine-tuning on COCO gives 59.6%, compared to 54.0% when multi-tasking and early stopping, or even 57.2% when training as a single task, making it the best result we obtain over all methods.

COCO Train Size            Multi-Task  Single-Task
1.0x Flickr30k (29,000)    49.3        42.1
1.5x Flickr30k (43,500)    51.6        50.3
2.0x Flickr30k (58,000)    53.7        51.9
2.5x Flickr30k (72,500)    53.8        53.6
Full Size (82,783)         54.0        54.0

Table 14: Accuracy on the COCO test set when downsampling COCO during training to the same size as the Flickr30k training set, or multiples thereof. Smaller training sets are clearly helped by multi-tasking; eventually there is enough data for the single task on its own.

We also achieve our best results in this fashion on Flickr30k. For VQA, the validation results were higher (not shown) in this setting, but resulted in slightly inferior test numbers, so a small amount of overfitting occurs.

Comparison to Existing Results  We give results from previous work in Table 9. Our results compare favorably on the conversational tasks, i.e. PC, IC, ICQA, IGCQ and IGCQA. For the COCO, Flickr30k and VQA tasks, our results are within range of the state of the art, but are surpassed by some of the methods. We note that on COCO, others used the validation set for training, whereas we did not (see Sec. 2.3; we do not want multi-task experiment train and valid data to overlap). For VQA we report the number from Pythia³ as a comparison point, as that method uses the train set only, without VQA data augmentation from Visual Genome, VisDial or other data augmentations (similar to us), and we used their setup as a starting point for our implementation. Our numerical results are comparable to theirs.

Larger-Scale Cross-Module Pre-training  Some of the best-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al., 2019; Li et al., 2019a; Lu et al., 2019a], which leads to better performance but requires gigantic multimodal datasets like Conceptual Captions [Sharma et al., 2018] or Visual Genome [Krishna et al., 2017], as shown in Table 15, as well as high computing resources (e.g., 882 and 3645 V100 GPU hours for UNITER-base and UNITER-large, respectively). Pre-training on COCO alone, as done in [Li et al., 2019b], gives more limited improvement (see Table 9). Our approach combines vision and text encoders with minimal additional multimodal training resources required. Even counting all the multi-task datasets used for training adds up to only 1M image-sentence (I-S) pairs, resulting in training that takes around 40 V100 GPU hours. We expect that larger-scale cross-module pre-training would also improve the performance of our models, but this is beyond the scope of this work.

Example Predictions  We show example predictions of our MMC multi-task model in Table 13. We take test images, and for COCO, Flickr30k, Personality Captions and Image Chat we show the top-ranked candidate using the ranking head, ranking all utterances from the given training set. For VQA we show the output of the classification head.

³ https://learnpythia.readthedocs.io/en/latest/tutorials/pretrained_models.html#pretrained-models

Model        Dataset              Size (I-S pairs)
UNITER       COCO, VG, CC, SBUC   9.6 M
ViLBERT      CC                   3.0 M
Unicoder-VL  CC, SBUC             3.8 M

Table 15: Sizes of multimodal pre-training datasets in terms of image-sentence pairs. Our model obtains comparable results on all tasks without any cross-module pre-training on large datasets such as Visual Genome (VG), Conceptual Captions (CC) or SBU Captions (SBUC). Thus, multi-tasking can be viewed as a strong alternative to large-scale pre-training, considering its simplicity and effectiveness in terms of computation power.

We observe that the same underlying model can produce a diverse range of outputs depending on the task, ranging from factual captioning to conversations grounded on the image.
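The following sketch, with hypothetical encoder and head names rather than the paper's actual API, illustrates how such per-task outputs could be produced from a shared model: the retrieval tasks rank every candidate utterance from the task's training set with a ranking head, while VQA uses a classification head over a fixed answer vocabulary.

```python
# Hypothetical sketch of per-task inference with a shared multimodal model.
# The encode_*/fuse/head methods are assumed stand-ins, not the paper's API.
import torch


@torch.no_grad()
def predict(model, image, context, task, candidates=None, answer_vocab=None):
    # Fused image + dialogue-context representation from the shared trunk.
    ctx = model.fuse(model.encode_image(image), model.encode_text(context))

    if task == "vqa":
        # Classification head: pick an answer from a fixed vocabulary.
        logits = model.classification_head(ctx)            # (num_answers,)
        return answer_vocab[logits.argmax().item()]

    # Ranking head: score every candidate utterance from the task's
    # training set and return the top-ranked one.
    cand = model.encode_candidates(candidates)              # (num_candidates, dim)
    scores = model.ranking_head(ctx, cand)                  # (num_candidates,)
    return candidates[scores.argmax().item()]
```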

6 Conclusion
In order to build an image-grounded conversational agent, we have assembled disparate multimodal tasks and built a single architecture for multi-task training on them, incorporating a novel attentive multimodal combination module. Through detailed analysis and ablations, we have shown that our approach can obtain strong performance across a number of tasks. Future work could investigate further how these skills are blended during interaction, rather than evaluating them as stand-alone tasks, and consider more tasks.

7 Acknowledgements
We are grateful to Amanpreet Singh and Vedanuj Goswami for providing help and advice, comparison results, and Faster R-CNN image features. We also thank Mary Williamson and Eric Smith for very useful discussions.

References

[Bakhtin et al., 2019] Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. Real or fake? Learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019.

[Bengio et al., 2003] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[Chen et al., 2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[Chen et al., 2019] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning UNiversal Image-TExt Representations. arXiv e-prints, page arXiv:1909.11740, Sep 2019.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

[Das et al., 2017] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M.F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[Girshick et al., 2018] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollar, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.

[Goyal et al., 2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[Humeau et al., 2019] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969, 2019.

[Jiang et al., 2018] Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: the Winning Entry to the VQA Challenge 2018. arXiv e-prints, page arXiv:1807.09956, Jul 2018.

[Joulin et al., 2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April 2017. Association for Computational Linguistics.

[Karpathy and Fei-Fei, 2017] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676, April 2017.

[Krishna et al., 2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[LeCun et al., 1990] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.

[Lee et al., 2018] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked Cross Attention for Image-Text Matching. arXiv e-prints, page arXiv:1803.08024, Mar 2018.

[Li et al., 2019a] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. arXiv e-prints, page arXiv:1908.06066, Aug 2019.

[Li et al., 2019b] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv e-prints, page arXiv:1908.03557, Aug 2019.

[Lu et al., 2019a] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.

[Lu et al., 2019b] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, 2019.

[Lu et al., 2019c] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-Task Vision and Language Representation Learning. arXiv e-prints, page arXiv:1912.02315, Dec 2019.

[Mahajan et al., 2018] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Proceedings of the European Conference on Computer Vision, pages 185–201, Cham, 2018. Springer International Publishing.

[Mazare et al., 2018] Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.

[McCann et al., 2018] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[Miller et al., 2017] Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Mostafazadeh et al., 2017] Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.

[Nguyen and Okatani, 2019] Duy-Kien Nguyen and Takayuki Okatani. Multi-task learning of hierarchical vision-language representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10492–10501, 2019.

[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[Raffel et al., 2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.

[Shi et al., 2019] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5182–5189. International Joint Conferences on Artificial Intelligence Organization, 7 2019.

[Shuster et al., 2018] Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945, 2018.

[Shuster et al., 2019a] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image captioning via personality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12516–12526, 2019.

[Shuster et al., 2019b] Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768, 2019.

[Singh et al., 2018] Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia: a platform for vision & language research. In SysML Workshop, NeurIPS, volume 2018, 2018.

[Singh et al., 2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[Su et al., 2019] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.

[Tan and Bansal, 2019] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China, November 2019. Association for Computational Linguistics.

[Vaswani et al., 2017a] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Vaswani et al., 2017b] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv e-prints, page arXiv:1706.03762, Jun 2017.

[Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

  • 1 Introduction
  • 2 Tasks
    • 21 Language-based
    • 22 Vision-based
    • 23 Vision + Language
      • 3 Related Work
      • 4 Methods
        • 41 Text Encoders
        • 42 Image Encoder
        • 43 Multimodal Combiner
        • 44 Attentive Multimodal Combiner
        • 45 Output Heads and Loss
          • 5 Experiments
          • 6 Conclusion
          • 7 Acknowledgements
Page 7: All-in-One Image-Grounded Conversational Agents · that obtains good results on all vision and language tasks considered, and improves the state of the art in image-grounded conversational

Model Training data COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA

ExistingModels

SCAN [Lee et al 2018] 504 674 - - - - - -SCG [Shi et al 2019] 566 718 - - - - - -Unicoder-VL [Li et al 2019a] 623 862 - - - - - -Unicoder-VL wo pre-training - 730 - - - - - -UNITER Base 633 847 - - - - - 723

UNITER Large 666 882 - - - - - 732

HDC [Nguyen and Okatani 2019] 422 716 - - - - - 693

VisualBERT (ST) [Li et al 2019b] - - - - - - - 708

VisualBERT (ST) wo pre-training - - - - - - - 702

ViLBERT (ST) [Lu et al 2019a] - - - - - - - 706

ViLBERT (ST) wo pre-training - - - - - - - 690

Pythia [Jiang et al 2018]3 - - - - - - - 667Pythia 3 - - - - - - - 692

ViLBERT (MT) [Lu et al 2019c] - - - - - - - 726

ViLBERT (MT + FT) - - - - - - - 732

TransResNet [Shuster et al 2018] 443 684 535 503 492 217 224 -

Table 9 Previous state-of-the-art results indicates results achieved by training with some or all of the validation set added to the train setwhereas we only train on the train set Note that the results in this table correspond to a different model for each column as the architecturesare fine-tuned on each task separately rather than training a single architecture in a multi-task way The ViLBERT (MT) model is a multi-taskmodel but uses image retrieval settings on COCO and Flickr30k that are not comparable to the results presented here The Pythia models onrows 9 and 10 are the same except they are trained with the VQA train set and VQA train + valid set respectively thus we list both numbers

Arch Training data COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA AvgMMC ST COCO 572 694 240 165 131 134 103 03 255MMC ST Flickr30k 277 797 230 163 138 158 122 02 236MMC ST IC 200 405 573 563 552 358 433 04 386MMC ST VQA 00 03 12 13 12 16 19 670 93MMC MT + FT 596 840 588 559 514 302 411 665 560MMC MT 512 817 580 552 499 257 384 664 533

2AMMC MT 542 820 595 569 523 281 381 656 5463AMMC MT 527 829 585 561 524 314 398 669 5514AMMC MT 532 818 587 562 545 318 358 659 548

Table 10 Multi-tasking test results of our models The first four rows show the transfer performance of our TransResNet-MMC model trainedon a single task (ST) indicated in the Training data column The fifth row shows a multi-task model which is then fine-tuned (MT+FT) oneach single task separately (each column corresponds to a separate model we hence report average performance in gray italics) The bottomfour rows compare performance of single multi-task models with different types of multimodal combiners The multi-task performance isclose to single-task performance and in some cases better across several tasks The attentive multimodal combiner (AMMC) obtains the bestoverall average performance

Arch COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA Avg3AMMC (MMC-1) MT 247 616 489 451 278 240 221 11 3203AMMC (MMC-2) MT 311 506 195 260 278 274 335 00 2703AMMC (MMC-3) MT 310 619 213 130 95 89 130 669 282

Table 11 Results on each dataset when we evaluate our 3AMMC model by only taking a single MMC output as the context representationThe first MMC alone already gives good performance on PC and IC and the third on VQA All three are needed for some of the tasks

MMC Arch (Compare to) COCO Flickr30k PC IC ICQA IGCQ IGCQA VQA Avg4 Layers (2 AMMC) MT 514 810 560 533 485 291 399 667 5336 Layers (3 AMMC) MT 486 781 571 534 497 308 368 661 5268 Layers (4 AMMC) MT 356 656 362 336 318 266 268 590 394

Table 12 Test results when we train our MMC models with varying numbers of layers which we compare to our AMMC model sizesIncreasing the number of MMC layers only hurts performance

Image Output

Task Coco

TransResNet MMC there is a broken tree log on the ground

Task Coco

TransResNet MMC A large grass covered field under a mountain

Task Flickr30k

TransResNet MMC A chaparral landscape scene void of human residence

Task Flickr30k

TransResNet MMC A plane flying sideways

Task VQAContext What is the color of the mountain

TransResNet MMC gray

Task VQAContext Does it appear to be rainy

TransResNet MMC no

Task Personality Captions (Style Happy)

TransResNet MMC Wow what a beautiful and perfect shade of pink and red I am so captivated

Task Personality Captions (Style Attractive)

TransResNet MMC Wow I would love for someone to create this setting in the sand for me

Task Image Chat (Style Compassionate)

Context Round 1 Something about the pattern calms meTransResNet MMC The architecture calms you

Task Image Chat (Style Emotional)Context Round 1 Airplanes are scary to get on you never know if it will crash or not

Round 2 But these are professional pilots thoughTransResNet MMC They are and for many people they mean a lot My grandfather loved planes

Table 13 Example output from our TransResNet MMC multi-task model for different tasks

mance in the last row rdquoAvgrdquo see also Table 10 (sixth row)There are clear gains for each task that is early stopped but atlarge expense for the other tasks For example fine-tuning onCOCO gives 540 compared to 512 when multi-taskingbut is still worse than the 572 when training as a single-task Transfer to Flickr30k is still good likely as they are sim-ilar tasks but Image Chat results are then poor On Flickr theearly stopping result of 830 is superior to both the multi-task result of 817 and the single-task result of 797 This

can be explained by Flickr30k being smaller than COCO andthus benefiting from multi-tasking more as we explained inthe previous section

Multi-tasking followed by single task fine-tuning test re-sults are shown in Table 5 (also summarized in Table 10)Generally these are superior to the multi-tasking per-taskearly stopping results For example fine-tuning on COCOgives 596 compared to 540 when multi-tasking andearly stopping or even 572 when training as a single-task

COCO Train Size Multi-Task Single-Task10x Flickr30k (29000) 493 42115x Flickr30k (43500) 516 50320x Flickr30k (58000) 537 51925x Flickr30k (72500) 538 536Full Size (82783) 540 540

Table 14 Accuracy on COCO test set when downsampling COCOduring training to the same size as the Flickr30k training set ormultiples thereof Smaller training sets are clearly helped by multi-tasking Eventually there is enough data of the single task

so it is the best result we obtain over all methods We alsoachieve our best results in this fashion on Flickr30k ForVQA the validation results were higher (not shown) in thissetting but resulted in slightly inferior test numbers so asmall amount of overfitting occurs

Comparison to Existing Results We give results from pre-vious work in Table 9 Our results compare favorably on theconversational tasks ie PC IC ICQA IGCQ and IGCQAFor the COCO Flickr30k and VQA tasks our results arewithin range of the state of the art but are surpassed by someof the methods We note that on COCO others used the vali-dation set for training whereas we did not (see Sec 23 we donot want multi-task experiment train and valid data to over-lap) For VQA we report the number from Pythia3 as a com-parison point as that method uses the train set only withoutVQA data augmentation from the Visual Genome VisDialor other data augmentations (similar to us) and we used theirsetup as a starting point for our implementation Our numer-ical results are comparable to theirs

Larger-Scale Cross-Module Pre-training Some of thebest-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al 2019 Li etal 2019a Lu et al 2019a] which leads to better perfor-mance but requires gigantic multimodal datasets like Con-ceptual Captions [Sharma et al 2018] or Visual Genome[Krishna et al 2017] as shown in Table 15 as well ashigh computing resources (eg 882 and 3645 V100 GPUhours for UNITER-base and UNITER-large respectively)Pre-training on COCO alone as done in [Li et al 2019b]gives more limited improvement (see Table 9) Our approachcombines vision and text encoders with minimal additionalmultimodal training resources required Even counting allthe multi-task datasets used for training adds up to only 1Mimage-sentence (I-S) pairs resulting in training that takesaround 40 V100 GPU hours We expect larger-scale cross-module pre-training would also improve the performance ofour models but this is beyond the scope of this work

Example Predictions We show example predictions of ourMMC multi-task model in Table 13 We take test images andfor COCO Flickr30k Personality Captions and Image Chatwe show the top ranked candidate using the ranking headranking all utterances from the given training set For VQA

3httpslearnpythiareadthedocsioenlatesttutorialspretrained modelshtmlpretrained-models

Model Dataset Size (I-S Pair)UNITER COCOVGCCSBUC 96 MViLBERT CC 30 MUnicoder-VL CC SBUC 38 M

Table 15 Sizes of multimodal pre-training datasets in terms ofimage-sentence pairs Our model obtains comparable results on alltasks without any cross-module pre-training on large datasets suchas Visual Genome (VG) Conceptual Captions (CC) or SBU Cap-tions (SBUC) Thus multi-tasking can be viewed as a strong alter-native to large-scale pre-trainining considering its simplicity andeffectiveness in terms of computation power

we show the output of the classification head We observethat the same underlying model can produce a diverse rangeof outputs depending on the task ranging from factual cap-tioning to conversations grounded on the image

6 ConclusionIn order to build an image-grounded conversational agent wehave assembled disparate multimodal tasks and built a sin-gle architecture for multi-task training on them incorporatinga novel attentive multimodal combination module Throughdetailed analysis and ablations we have shown that our ap-proach can obtain strong performance across a number oftasks Future work could investigate further how these skillsare blended during interaction rather than evaluate them asstand-alone tasks and consider more tasks

7 AcknowledgementsWe are grateful to Amanpreet Singh and Vedanuj Goswamifor providing help and advice comparison results and fasterR-CNN image features We also thank Mary Williamson andEric Smith for very useful discussions

References[Bakhtin et al 2019] Anton Bakhtin Sam Gross Myle Ott

Yuntian Deng MarcrsquoAurelio Ranzato and Arthur SzlamReal or fake learning to discriminate machine from hu-man generated text arXiv preprint arXiv1906033512019

[Bengio et al 2003] Yoshua Bengio Rejean DucharmePascal Vincent and Christian Jauvin A neural probabilis-tic language model Journal of machine learning research3(Feb)1137ndash1155 2003

[Chen et al 2015] Xinlei Chen Hao Fang Tsung-Yi LinRamakrishna Vedantam Saurabh Gupta Piotr Dollarand C Lawrence Zitnick Microsoft coco captionsData collection and evaluation server arXiv preprintarXiv150400325 2015

[Chen et al 2019] Yen-Chun Chen Linjie Li Licheng YuAhmed El Kholy Faisal Ahmed Zhe Gan Yu Cheng andJingjing Liu UNITER Learning UNiversal Image-TExtRepresentations arXiv e-prints page arXiv190911740Sep 2019

[Collobert and Weston 2008] Ronan Collobert and JasonWeston A unified architecture for natural language pro-cessing Deep neural networks with multitask learning InProceedings of the 25th international conference on Ma-chine learning pages 160ndash167 ACM 2008

[Das et al 2017] Abhishek Das Satwik Kottur KhushiGupta Avi Singh Deshraj Yadav Jose MF Moura DeviParikh and Dhruv Batra Visual dialog In Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition pages 326ndash335 2017

[Deng et al 2009] Jia Deng Wei Dong Richard Socher Li-Jia Li Kai Li and Li Fei-Fei Imagenet A large-scalehierarchical image database In 2009 IEEE conference oncomputer vision and pattern recognition pages 248ndash255Ieee 2009

[Devlin et al 2019] Jacob Devlin Ming-Wei Chang Ken-ton Lee and Kristina Toutanova BERT Pre-training ofdeep bidirectional transformers for language understand-ing In Proceedings of the 2019 Conference of the NorthAmerican Chapter of the Association for ComputationalLinguistics Human Language Technologies Volume 1(Long and Short Papers) pages 4171ndash4186 Minneapo-lis Minnesota June 2019 Association for ComputationalLinguistics

[Girshick et al 2018] Ross Girshick Ilija RadosavovicGeorgia Gkioxari Piotr Dollar and Kaiming He De-tectron httpsgithubcomfacebookresearchdetectron2018

[Goyal et al 2017] Yash Goyal Tejas Khot DouglasSummers-Stay Dhruv Batra and Devi Parikh Makingthe V in VQA matter Elevating the role of image under-standing in Visual Question Answering In Conference onComputer Vision and Pattern Recognition (CVPR) 2017

[He et al 2016] Kaiming He Xiangyu Zhang ShaoqingRen and Jian Sun Deep residual learning for image recog-nition In Proceedings of the IEEE conference on computervision and pattern recognition pages 770ndash778 2016

[Humeau et al 2019] Samuel Humeau Kurt ShusterMarie-Anne Lachaux and Jason Weston Poly-encodersTransformer architectures and pre-training strategies forfast and accurate multi-sentence scoring arXiv preprintarXiv190501969 2019

[Jiang et al 2018] Yu Jiang Vivek Natarajan Xinlei ChenMarcus Rohrbach Dhruv Batra and Devi Parikh Pythiav01 the Winning Entry to the VQA Challenge 2018arXiv e-prints page arXiv180709956 Jul 2018

[Joulin et al 2017] Armand Joulin Edouard Grave PiotrBojanowski and Tomas Mikolov Bag of tricks for effi-cient text classification In Proceedings of the 15th Con-ference of the European Chapter of the Association forComputational Linguistics Volume 2 Short Papers pages427ndash431 Valencia Spain April 2017 Association forComputational Linguistics

[Karpathy and Fei-Fei 2017] Andrej Karpathy and Li Fei-Fei Deep visual-semantic alignments for generating im-

age descriptions IEEE Trans Pattern Anal Mach Intell39(4)664ndash676 April 2017

[Krishna et al 2017] Ranjay Krishna Yuke Zhu OliverGroth Justin Johnson Kenji Hata Joshua KravitzStephanie Chen Yannis Kalantidis Li-Jia Li David AShamma et al Visual genome Connecting language andvision using crowdsourced dense image annotations Inter-national Journal of Computer Vision 123(1)32ndash73 2017

[Krizhevsky et al 2012] Alex Krizhevsky Ilya Sutskeverand Geoffrey E Hinton Imagenet classification with deepconvolutional neural networks In Advances in neural in-formation processing systems pages 1097ndash1105 2012

[LeCun et al 1990] Yann LeCun Bernhard E Boser John SDenker Donnie Henderson Richard E Howard Wayne EHubbard and Lawrence D Jackel Handwritten digitrecognition with a back-propagation network In Advancesin neural information processing systems pages 396ndash4041990

[Lee et al 2018] Kuang-Huei Lee Xi Chen Gang HuaHoudong Hu and Xiaodong He Stacked Cross At-tention for Image-Text Matching arXiv e-prints pagearXiv180308024 Mar 2018

[Li et al 2019a] Gen Li Nan Duan Yuejian Fang MingGong Daxin Jiang and Ming Zhou Unicoder-VL A Uni-versal Encoder for Vision and Language by Cross-modalPre-training arXiv e-prints page arXiv190806066 Aug2019

[Li et al 2019b] Liunian Harold Li Mark Yatskar Da YinCho-Jui Hsieh and Kai-Wei Chang VisualBERT A Sim-ple and Performant Baseline for Vision and LanguagearXiv e-prints page arXiv190803557 Aug 2019

[Lu et al 2019a] Jiasen Lu Dhruv Batra Devi Parikh andStefan Lee ViLBERT Pretraining task-agnostic visi-olinguistic representations for vision-and-language tasksarXiv preprint arXiv190802265 2019

[Lu et al 2019b] Jiasen Lu Vedanuj Goswami MarcusRohrbach Devi Parikh and Stefan Lee 12-in-1 Multi-task vision and language representation learning arXivpreprint arXiv191202315 2019

[Lu et al 2019c] Jiasen Lu Vedanuj Goswami MarcusRohrbach Devi Parikh and Stefan Lee 12-in-1 Multi-Task Vision and Language Representation Learning arXive-prints page arXiv191202315 Dec 2019

[Mahajan et al 2018] Dhruv Mahajan Ross Girshick Vig-nesh Ramanathan Kaiming He Manohar Paluri YixuanLi Ashwin Bharambe and Laurens van der Maaten Ex-ploring the limits of weakly supervised pretraining In Vit-torio Ferrari Martial Hebert Cristian Sminchisescu andYair Weiss editors Proceedings of the European Confer-ence on Computer Vision pages 185ndash201 Cham 2018Springer International Publishing

[Mazare et al 2018] Pierre-Emmanuel Mazare SamuelHumeau Martin Raison and Antoine Bordes Trainingmillions of personalized dialogue agents In Proceedings

of the 2018 Conference on Empirical Methods in Natu-ral Language Processing pages 2775ndash2779 BrusselsBelgium October-November 2018 Association forComputational Linguistics

[McCann et al 2018] Bryan McCann Nitish ShirishKeskar Caiming Xiong and Richard Socher The naturallanguage decathlon Multitask learning as questionanswering arXiv preprint arXiv180608730 2018

[Mikolov et al 2013] Tomas Mikolov Ilya Sutskever KaiChen Greg S Corrado and Jeff Dean Distributed repre-sentations of words and phrases and their compositional-ity In Advances in neural information processing systemspages 3111ndash3119 2013

[Miller et al 2017] Alexander Miller Will Feng Dhruv Ba-tra Antoine Bordes Adam Fisch Jiasen Lu Devi Parikhand Jason Weston ParlAI A dialog research softwareplatform In Proceedings of the 2017 Conference on Em-pirical Methods in Natural Language Processing SystemDemonstrations pages 79ndash84 Copenhagen DenmarkSeptember 2017 Association for Computational Linguis-tics

[Mostafazadeh et al 2017] Nasrin Mostafazadeh ChrisBrockett Bill Dolan Michel Galley Jianfeng GaoGeorgios Spithourakis and Lucy Vanderwende Image-grounded conversations Multimodal context for naturalquestion and response generation In Proceedings ofthe Eighth International Joint Conference on Natu-ral Language Processing (Volume 1 Long Papers)pages 462ndash472 Taipei Taiwan November 2017 AsianFederation of Natural Language Processing

[Nguyen and Okatani 2019] Duy-Kien Nguyen andTakayuki Okatani Multi-task learning of hierarchi-cal vision-language representation In Proceedings ofthe IEEE Conference on Computer Vision and PatternRecognition pages 10492ndash10501 2019

[Radford et al 2018] Alec Radford Karthik NarasimhanTim Salimans and Ilya Sutskever Improving languageunderstanding by generative pre-training 2018

[Radford et al 2019] Alec Radford Jeffrey Wu RewonChild David Luan Dario Amodei and Ilya SutskeverLanguage models are unsupervised multitask learnersOpenAI Blog 1(8) 2019

[Raffel et al 2019] Colin Raffel Noam Shazeer AdamRoberts Katherine Lee Sharan Narang Michael MatenaYanqi Zhou Wei Li and Peter J Liu Exploring the limitsof transfer learning with a unified text-to-text transformerarXiv preprint arXiv191010683 2019

[Sharma et al 2018] Piyush Sharma Nan Ding SebastianGoodman and Radu Soricut Conceptual captions Acleaned hypernymed image alt-text dataset for automaticimage captioning In Proceedings of ACL 2018

[Shi et al 2019] Botian Shi Lei Ji Pan Lu Zhendong Niuand Nan Duan Knowledge aware semantic concept ex-pansion for image-text matching In Proceedings of the

Twenty-Eighth International Joint Conference on Artifi-cial Intelligence IJCAI-19 pages 5182ndash5189 Interna-tional Joint Conferences on Artificial Intelligence Orga-nization 7 2019

[Shuster et al 2018] Kurt Shuster Samuel Humeau An-toine Bordes and Jason Weston Engaging image chatModeling personality in grounded dialogue arXiv preprintarXiv181100945 2018

[Shuster et al 2019a] Kurt Shuster Samuel Humeau Hexi-ang Hu Antoine Bordes and Jason Weston Engaging im-age captioning via personality In Proceedings of the IEEEConference on Computer Vision and Pattern Recognitionpages 12516ndash12526 2019

[Shuster et al 2019b] Kurt Shuster Da Ju Stephen RollerEmily Dinan Y-Lan Boureau and Jason Weston Thedialogue dodecathlon Open-domain knowledge and im-age grounded conversational agents arXiv preprintarXiv191103768 2019

[Singh et al 2018] Amanpreet Singh Vedanuj GoswamiVivek Natarajan Yu Jiang Xinlei Chen Meet Shah Mar-cus Rohrbach Dhruv Batra and Devi Parikh Pythia-aplatform for vision amp language research In SysML Work-shop NeurIPS volume 2018 2018

[Singh et al 2019] Amanpreet Singh Vivek NatarajanMeet Shah Yu Jiang Xinlei Chen Dhruv Batra DeviParikh and Marcus Rohrbach Towards vqa models thatcan read In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition 2019

[Su et al 2019] Weijie Su Xizhou Zhu Yue Cao Bin LiLewei Lu Furu Wei and Jifeng Dai Vl-bert Pre-trainingof generic visual-linguistic representations arXiv preprintarXiv190808530 2019

[Tan and Bansal 2019] Hao Tan and Mohit BansalLXMERT Learning cross-modality encoder represen-tations from transformers In Proceedings of the 2019Conference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Conferenceon Natural Language Processing (EMNLP-IJCNLP)pages 5099ndash5110 Hong Kong China November 2019Association for Computational Linguistics

[Vaswani et al 2017a] Ashish Vaswani Noam ShazeerNiki Parmar Jakob Uszkoreit Llion Jones Aidan NGomez Ł ukasz Kaiser and Illia Polosukhin Attentionis all you need In I Guyon U V Luxburg S BengioH Wallach R Fergus S Vishwanathan and R Garnetteditors Advances in Neural Information Processing Sys-tems 30 pages 5998ndash6008 Curran Associates Inc 2017

[Vaswani et al 2017b] Ashish Vaswani Noam ShazeerNiki Parmar Jakob Uszkoreit Llion Jones Aidan NGomez Lukasz Kaiser and Illia Polosukhin AttentionIs All You Need arXiv e-prints page arXiv170603762Jun 2017

[Young et al 2014] Peter Young Alice Lai Micah Hodoshand Julia Hockenmaier From image descriptions to visualdenotations New similarity metrics for semantic inference

over event descriptions Transactions of the Associationfor Computational Linguistics 267ndash78 2014

  • 1 Introduction
  • 2 Tasks
    • 21 Language-based
    • 22 Vision-based
    • 23 Vision + Language
      • 3 Related Work
      • 4 Methods
        • 41 Text Encoders
        • 42 Image Encoder
        • 43 Multimodal Combiner
        • 44 Attentive Multimodal Combiner
        • 45 Output Heads and Loss
          • 5 Experiments
          • 6 Conclusion
          • 7 Acknowledgements
Page 8: All-in-One Image-Grounded Conversational Agents · that obtains good results on all vision and language tasks considered, and improves the state of the art in image-grounded conversational

Image Output

Task Coco

TransResNet MMC there is a broken tree log on the ground

Task Coco

TransResNet MMC A large grass covered field under a mountain

Task Flickr30k

TransResNet MMC A chaparral landscape scene void of human residence

Task Flickr30k

TransResNet MMC A plane flying sideways

Task VQAContext What is the color of the mountain

TransResNet MMC gray

Task VQAContext Does it appear to be rainy

TransResNet MMC no

Task Personality Captions (Style Happy)

TransResNet MMC Wow what a beautiful and perfect shade of pink and red I am so captivated

Task Personality Captions (Style Attractive)

TransResNet MMC Wow I would love for someone to create this setting in the sand for me

Task Image Chat (Style Compassionate)

Context Round 1 Something about the pattern calms meTransResNet MMC The architecture calms you

Task Image Chat (Style Emotional)Context Round 1 Airplanes are scary to get on you never know if it will crash or not

Round 2 But these are professional pilots thoughTransResNet MMC They are and for many people they mean a lot My grandfather loved planes

Table 13 Example output from our TransResNet MMC multi-task model for different tasks

mance in the last row rdquoAvgrdquo see also Table 10 (sixth row)There are clear gains for each task that is early stopped but atlarge expense for the other tasks For example fine-tuning onCOCO gives 540 compared to 512 when multi-taskingbut is still worse than the 572 when training as a single-task Transfer to Flickr30k is still good likely as they are sim-ilar tasks but Image Chat results are then poor On Flickr theearly stopping result of 830 is superior to both the multi-task result of 817 and the single-task result of 797 This

can be explained by Flickr30k being smaller than COCO andthus benefiting from multi-tasking more as we explained inthe previous section

Multi-tasking followed by single task fine-tuning test re-sults are shown in Table 5 (also summarized in Table 10)Generally these are superior to the multi-tasking per-taskearly stopping results For example fine-tuning on COCOgives 596 compared to 540 when multi-tasking andearly stopping or even 572 when training as a single-task

COCO Train Size Multi-Task Single-Task10x Flickr30k (29000) 493 42115x Flickr30k (43500) 516 50320x Flickr30k (58000) 537 51925x Flickr30k (72500) 538 536Full Size (82783) 540 540

Table 14 Accuracy on COCO test set when downsampling COCOduring training to the same size as the Flickr30k training set ormultiples thereof Smaller training sets are clearly helped by multi-tasking Eventually there is enough data of the single task

so it is the best result we obtain over all methods We alsoachieve our best results in this fashion on Flickr30k ForVQA the validation results were higher (not shown) in thissetting but resulted in slightly inferior test numbers so asmall amount of overfitting occurs

Comparison to Existing Results We give results from pre-vious work in Table 9 Our results compare favorably on theconversational tasks ie PC IC ICQA IGCQ and IGCQAFor the COCO Flickr30k and VQA tasks our results arewithin range of the state of the art but are surpassed by someof the methods We note that on COCO others used the vali-dation set for training whereas we did not (see Sec 23 we donot want multi-task experiment train and valid data to over-lap) For VQA we report the number from Pythia3 as a com-parison point as that method uses the train set only withoutVQA data augmentation from the Visual Genome VisDialor other data augmentations (similar to us) and we used theirsetup as a starting point for our implementation Our numer-ical results are comparable to theirs

Larger-Scale Cross-Module Pre-training Some of thebest-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al 2019 Li etal 2019a Lu et al 2019a] which leads to better perfor-mance but requires gigantic multimodal datasets like Con-ceptual Captions [Sharma et al 2018] or Visual Genome[Krishna et al 2017] as shown in Table 15 as well ashigh computing resources (eg 882 and 3645 V100 GPUhours for UNITER-base and UNITER-large respectively)Pre-training on COCO alone as done in [Li et al 2019b]gives more limited improvement (see Table 9) Our approachcombines vision and text encoders with minimal additionalmultimodal training resources required Even counting allthe multi-task datasets used for training adds up to only 1Mimage-sentence (I-S) pairs resulting in training that takesaround 40 V100 GPU hours We expect larger-scale cross-module pre-training would also improve the performance ofour models but this is beyond the scope of this work

Example Predictions We show example predictions of ourMMC multi-task model in Table 13 We take test images andfor COCO Flickr30k Personality Captions and Image Chatwe show the top ranked candidate using the ranking headranking all utterances from the given training set For VQA

3httpslearnpythiareadthedocsioenlatesttutorialspretrained modelshtmlpretrained-models

Model Dataset Size (I-S Pair)UNITER COCOVGCCSBUC 96 MViLBERT CC 30 MUnicoder-VL CC SBUC 38 M

Table 15 Sizes of multimodal pre-training datasets in terms ofimage-sentence pairs Our model obtains comparable results on alltasks without any cross-module pre-training on large datasets suchas Visual Genome (VG) Conceptual Captions (CC) or SBU Cap-tions (SBUC) Thus multi-tasking can be viewed as a strong alter-native to large-scale pre-trainining considering its simplicity andeffectiveness in terms of computation power

we show the output of the classification head We observethat the same underlying model can produce a diverse rangeof outputs depending on the task ranging from factual cap-tioning to conversations grounded on the image

6 ConclusionIn order to build an image-grounded conversational agent wehave assembled disparate multimodal tasks and built a sin-gle architecture for multi-task training on them incorporatinga novel attentive multimodal combination module Throughdetailed analysis and ablations we have shown that our ap-proach can obtain strong performance across a number oftasks Future work could investigate further how these skillsare blended during interaction rather than evaluate them asstand-alone tasks and consider more tasks

7 AcknowledgementsWe are grateful to Amanpreet Singh and Vedanuj Goswamifor providing help and advice comparison results and fasterR-CNN image features We also thank Mary Williamson andEric Smith for very useful discussions

References[Bakhtin et al 2019] Anton Bakhtin Sam Gross Myle Ott

Yuntian Deng MarcrsquoAurelio Ranzato and Arthur SzlamReal or fake learning to discriminate machine from hu-man generated text arXiv preprint arXiv1906033512019

[Bengio et al 2003] Yoshua Bengio Rejean DucharmePascal Vincent and Christian Jauvin A neural probabilis-tic language model Journal of machine learning research3(Feb)1137ndash1155 2003

[Chen et al 2015] Xinlei Chen Hao Fang Tsung-Yi LinRamakrishna Vedantam Saurabh Gupta Piotr Dollarand C Lawrence Zitnick Microsoft coco captionsData collection and evaluation server arXiv preprintarXiv150400325 2015

[Chen et al 2019] Yen-Chun Chen Linjie Li Licheng YuAhmed El Kholy Faisal Ahmed Zhe Gan Yu Cheng andJingjing Liu UNITER Learning UNiversal Image-TExtRepresentations arXiv e-prints page arXiv190911740Sep 2019

[Collobert and Weston 2008] Ronan Collobert and JasonWeston A unified architecture for natural language pro-cessing Deep neural networks with multitask learning InProceedings of the 25th international conference on Ma-chine learning pages 160ndash167 ACM 2008

[Das et al 2017] Abhishek Das Satwik Kottur KhushiGupta Avi Singh Deshraj Yadav Jose MF Moura DeviParikh and Dhruv Batra Visual dialog In Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition pages 326ndash335 2017

[Deng et al 2009] Jia Deng Wei Dong Richard Socher Li-Jia Li Kai Li and Li Fei-Fei Imagenet A large-scalehierarchical image database In 2009 IEEE conference oncomputer vision and pattern recognition pages 248ndash255Ieee 2009

[Devlin et al 2019] Jacob Devlin Ming-Wei Chang Ken-ton Lee and Kristina Toutanova BERT Pre-training ofdeep bidirectional transformers for language understand-ing In Proceedings of the 2019 Conference of the NorthAmerican Chapter of the Association for ComputationalLinguistics Human Language Technologies Volume 1(Long and Short Papers) pages 4171ndash4186 Minneapo-lis Minnesota June 2019 Association for ComputationalLinguistics

[Girshick et al 2018] Ross Girshick Ilija RadosavovicGeorgia Gkioxari Piotr Dollar and Kaiming He De-tectron httpsgithubcomfacebookresearchdetectron2018

[Goyal et al 2017] Yash Goyal Tejas Khot DouglasSummers-Stay Dhruv Batra and Devi Parikh Makingthe V in VQA matter Elevating the role of image under-standing in Visual Question Answering In Conference onComputer Vision and Pattern Recognition (CVPR) 2017

[He et al 2016] Kaiming He Xiangyu Zhang ShaoqingRen and Jian Sun Deep residual learning for image recog-nition In Proceedings of the IEEE conference on computervision and pattern recognition pages 770ndash778 2016

[Humeau et al 2019] Samuel Humeau Kurt ShusterMarie-Anne Lachaux and Jason Weston Poly-encodersTransformer architectures and pre-training strategies forfast and accurate multi-sentence scoring arXiv preprintarXiv190501969 2019

[Jiang et al 2018] Yu Jiang Vivek Natarajan Xinlei ChenMarcus Rohrbach Dhruv Batra and Devi Parikh Pythiav01 the Winning Entry to the VQA Challenge 2018arXiv e-prints page arXiv180709956 Jul 2018

[Joulin et al 2017] Armand Joulin Edouard Grave PiotrBojanowski and Tomas Mikolov Bag of tricks for effi-cient text classification In Proceedings of the 15th Con-ference of the European Chapter of the Association forComputational Linguistics Volume 2 Short Papers pages427ndash431 Valencia Spain April 2017 Association forComputational Linguistics

[Karpathy and Fei-Fei 2017] Andrej Karpathy and Li Fei-Fei Deep visual-semantic alignments for generating im-

age descriptions IEEE Trans Pattern Anal Mach Intell39(4)664ndash676 April 2017

[Krishna et al 2017] Ranjay Krishna Yuke Zhu OliverGroth Justin Johnson Kenji Hata Joshua KravitzStephanie Chen Yannis Kalantidis Li-Jia Li David AShamma et al Visual genome Connecting language andvision using crowdsourced dense image annotations Inter-national Journal of Computer Vision 123(1)32ndash73 2017

[Krizhevsky et al 2012] Alex Krizhevsky Ilya Sutskeverand Geoffrey E Hinton Imagenet classification with deepconvolutional neural networks In Advances in neural in-formation processing systems pages 1097ndash1105 2012

[LeCun et al 1990] Yann LeCun Bernhard E Boser John SDenker Donnie Henderson Richard E Howard Wayne EHubbard and Lawrence D Jackel Handwritten digitrecognition with a back-propagation network In Advancesin neural information processing systems pages 396ndash4041990

[Lee et al 2018] Kuang-Huei Lee Xi Chen Gang HuaHoudong Hu and Xiaodong He Stacked Cross At-tention for Image-Text Matching arXiv e-prints pagearXiv180308024 Mar 2018

[Li et al 2019a] Gen Li Nan Duan Yuejian Fang MingGong Daxin Jiang and Ming Zhou Unicoder-VL A Uni-versal Encoder for Vision and Language by Cross-modalPre-training arXiv e-prints page arXiv190806066 Aug2019

[Li et al 2019b] Liunian Harold Li Mark Yatskar Da YinCho-Jui Hsieh and Kai-Wei Chang VisualBERT A Sim-ple and Performant Baseline for Vision and LanguagearXiv e-prints page arXiv190803557 Aug 2019

[Lu et al 2019a] Jiasen Lu Dhruv Batra Devi Parikh andStefan Lee ViLBERT Pretraining task-agnostic visi-olinguistic representations for vision-and-language tasksarXiv preprint arXiv190802265 2019

[Lu et al 2019b] Jiasen Lu Vedanuj Goswami MarcusRohrbach Devi Parikh and Stefan Lee 12-in-1 Multi-task vision and language representation learning arXivpreprint arXiv191202315 2019


COCO Train Size            Multi-Task   Single-Task
1.0x Flickr30k (29,000)       49.3          42.1
1.5x Flickr30k (43,500)       51.6          50.3
2.0x Flickr30k (58,000)       53.7          51.9
2.5x Flickr30k (72,500)       53.8          53.6
Full Size (82,783)            54.0          54.0

Table 14: Accuracy on the COCO test set when downsampling COCO during training to the same size as the Flickr30k training set, or multiples thereof. Smaller training sets are clearly helped by multi-tasking; eventually, there is enough data for the single task on its own.
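To make the downsampling protocol behind Table 14 concrete, the following is a minimal sketch (not the authors' released code); the function name, argument names, and the assumption that the COCO training data is available as a list of (image, caption) pairs are illustrative only.

```python
import random

def downsample_coco(coco_train, flickr30k_train_size=29000, multiple=1.0, seed=0):
    """Randomly subsample the COCO training set to `multiple` times the
    Flickr30k training-set size (capped at the full COCO size), mirroring
    the ablation in Table 14."""
    target = min(int(multiple * flickr30k_train_size), len(coco_train))
    return random.Random(seed).sample(coco_train, target)

# Illustrative usage: build the 1.0x, 1.5x, 2.0x and 2.5x subsets plus the full set.
# subsets = {m: downsample_coco(coco_train, multiple=m) for m in (1.0, 1.5, 2.0, 2.5)}
# subsets["full"] = coco_train
```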

This gives the best result we obtain over all methods. We also achieve our best results in this fashion on Flickr30k. For VQA, the validation results were higher (not shown) in this setting, but resulted in slightly inferior test numbers, so a small amount of overfitting occurs.

Comparison to Existing Results. We give results from previous work in Table 9. Our results compare favorably on the conversational tasks, i.e., PC, IC, ICQA, IGCQ and IGCQA. For the COCO, Flickr30k and VQA tasks, our results are within range of the state of the art, but are surpassed by some of the methods. We note that on COCO, others used the validation set for training, whereas we did not (see Sec. 2.3: we do not want the multi-task experiment train and valid data to overlap). For VQA, we report the number from Pythia (see footnote 3) as a comparison point, as that method uses the train set only, without VQA data augmentation from Visual Genome, VisDial, or other data augmentations (similar to us), and we used their setup as a starting point for our implementation. Our numerical results are comparable to theirs.

Larger-Scale Cross-Module Pre-training. Some of the best-performing methods on a subset of tasks rely on large-scale cross-module pre-training [Chen et al., 2019; Li et al., 2019a; Lu et al., 2019a], which leads to better performance but requires gigantic multimodal datasets like Conceptual Captions [Sharma et al., 2018] or Visual Genome [Krishna et al., 2017], as shown in Table 15, as well as high computing resources (e.g., 882 and 3645 V100 GPU hours for UNITER-base and UNITER-large, respectively). Pre-training on COCO alone, as done in [Li et al., 2019b], gives more limited improvement (see Table 9). Our approach combines vision and text encoders with minimal additional multimodal training resources required. Even counting all the multi-task datasets used for training, this adds up to only 1M image-sentence (I-S) pairs, resulting in training that takes around 40 V100 GPU hours. We expect larger-scale cross-module pre-training would also improve the performance of our models, but this is beyond the scope of this work.

Model          Dataset               Size (I-S Pairs)
UNITER         COCO, VG, CC, SBUC        9.6 M
ViLBERT        CC                        3.0 M
Unicoder-VL    CC, SBUC                  3.8 M

Table 15: Sizes of multimodal pre-training datasets in terms of image-sentence pairs. Our model obtains comparable results on all tasks without any cross-module pre-training on large datasets such as Visual Genome (VG), Conceptual Captions (CC), or SBU Captions (SBUC). Thus, multi-tasking can be viewed as a strong alternative to large-scale pre-training, considering its simplicity and effectiveness in terms of computation power.

3 https://learnpythia.readthedocs.io/en/latest/tutorials/pretrained_models.html#pretrained-models

Example Predictions. We show example predictions of our MMC multi-task model in Table 13. We take test images and, for COCO, Flickr30k, Personality Captions and Image Chat, show the top-ranked candidate using the ranking head, ranking all utterances from the given training set. For VQA, we show the output of the classification head. We observe that the same underlying model can produce a diverse range of outputs depending on the task, ranging from factual captioning to conversations grounded on the image.
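As a rough illustration of the two inference modes described above (this is not the model's actual code; `score_candidates`, `classifier`, and the tensor shapes are assumptions for the sketch), the ranking head scores every candidate utterance against the fused image+context representation and returns the highest-scoring one, while the VQA classification head predicts one answer from a fixed vocabulary:

```python
import torch

def rank_top_candidate(fused_repr, candidate_reprs, candidates):
    """Ranking head (sketch): dot-product score each encoded candidate
    utterance against the fused image+context representation and return
    the top-ranked candidate string."""
    scores = candidate_reprs @ fused_repr        # shape: (num_candidates,)
    return candidates[int(torch.argmax(scores))]

def classify_vqa_answer(fused_repr, classifier, answer_vocab):
    """Classification head (sketch): map the fused representation to logits
    over a fixed answer vocabulary and return the argmax answer."""
    logits = classifier(fused_repr)              # shape: (num_answers,)
    return answer_vocab[int(torch.argmax(logits))]
```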

6 Conclusion
In order to build an image-grounded conversational agent, we have assembled disparate multimodal tasks and built a single architecture for multi-task training on them, incorporating a novel attentive multimodal combination module. Through detailed analysis and ablations, we have shown that our approach can obtain strong performance across a number of tasks. Future work could investigate further how these skills are blended during interaction, rather than evaluating them as stand-alone tasks, and consider more tasks.

7 Acknowledgements
We are grateful to Amanpreet Singh and Vedanuj Goswami for providing help and advice, comparison results, and Faster R-CNN image features. We also thank Mary Williamson and Eric Smith for very useful discussions.

References
[Bakhtin et al., 2019] Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. Real or fake? Learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019.
[Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[Chen et al., 2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[Chen et al., 2019] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning UNiversal Image-TExt Representations. arXiv e-prints, page arXiv:1909.11740, Sep 2019.
[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[Das et al., 2017] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.
[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[Girshick et al., 2018] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[Goyal et al., 2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[Humeau et al., 2019] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969, 2019.
[Jiang et al., 2018] Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: the Winning Entry to the VQA Challenge 2018. arXiv e-prints, page arXiv:1807.09956, Jul 2018.
[Joulin et al., 2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April 2017. Association for Computational Linguistics.
[Karpathy and Fei-Fei, 2017] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676, April 2017.
[Krishna et al., 2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[LeCun et al., 1990] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.
[Lee et al., 2018] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked Cross Attention for Image-Text Matching. arXiv e-prints, page arXiv:1803.08024, Mar 2018.
[Li et al., 2019a] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. arXiv e-prints, page arXiv:1908.06066, Aug 2019.
[Li et al., 2019b] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv e-prints, page arXiv:1908.03557, Aug 2019.
[Lu et al., 2019a] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
[Lu et al., 2019b] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, 2019.
[Lu et al., 2019c] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-Task Vision and Language Representation Learning. arXiv e-prints, page arXiv:1912.02315, Dec 2019.
[Mahajan et al., 2018] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Proceedings of the European Conference on Computer Vision, pages 185–201, Cham, 2018. Springer International Publishing.
[Mazaré et al., 2018] Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
[McCann et al., 2018] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[Miller et al., 2017] Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
[Mostafazadeh et al., 2017] Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.
[Nguyen and Okatani, 2019] Duy-Kien Nguyen and Takayuki Okatani. Multi-task learning of hierarchical vision-language representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10492–10501, 2019.
[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[Raffel et al., 2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.
[Shi et al., 2019] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5182–5189. International Joint Conferences on Artificial Intelligence Organization, July 2019.
[Shuster et al., 2018] Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945, 2018.
[Shuster et al., 2019a] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image captioning via personality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12516–12526, 2019.
[Shuster et al., 2019b] Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768, 2019.
[Singh et al., 2018] Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia - a platform for vision & language research. In SysML Workshop, NeurIPS, volume 2018, 2018.
[Singh et al., 2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[Su et al., 2019] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[Tan and Bansal, 2019] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China, November 2019. Association for Computational Linguistics.
[Vaswani et al., 2017a] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
[Vaswani et al., 2017b] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv e-prints, page arXiv:1706.03762, Jun 2017.
[Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
