
5910 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 29, NO. 12, DECEMBER 2018

Image-Text Surgery: Efficient Concept Learning in Image Captioning by Generating Pseudopairs

Kun Fu, Jin Li, Junqi Jin, and Changshui Zhang, Fellow, IEEE

Abstract— Image captioning aims to generate natural language sentences to describe the salient parts of a given image. Although neural networks have recently achieved promising results, a key problem is that they can only describe concepts seen in the training image-sentence pairs. Efficient learning of novel concepts has thus been a topic of recent interest to alleviate the expensive manpower of labeling data. In this paper, we propose a novel method, Image-Text Surgery, to synthesize pseudoimage-sentence pairs. The pseudopairs are generated under the guidance of a knowledge base, with syntax from a seed data set (i.e., MSCOCO) and visual information from an existing large-scale image base (i.e., ImageNet). Via pseudodata, the captioning model learns novel concepts without any corresponding human-labeled pairs. We further introduce adaptive visual replacement, which adaptively filters unnecessary visual features in pseudodata with an attention mechanism. We evaluate our approach on a held-out subset of the MSCOCO data set. The experimental results demonstrate that the proposed approach provides significant performance improvements over state-of-the-art methods in terms of F1 score and sentence quality. An ablation study and the qualitative results further validate the effectiveness of our approach.

Index Terms— Image captioning, novel concept, pseudodata, visual attention.

I. BACKGROUND

IMAGE captioning is a task in which a machine learns to generate natural language sentences to describe the salient parts of an image. It has been an important topic, as it involves understanding images and language modeling. The image and language modules are often separately learned in traditional methods (see [1], [2]). Recently, neural network methods have achieved a remarkable advancement by jointly modeling image and text. In general, they use an encoder–decoder framework that first encodes images into visual representations (e.g., [3]–[10]) or lexical representations (e.g., [11]–[13]) using a convolutional neural network (CNN) [14] and then decodes the representations into natural language using a long short-term memory network (LSTM) [15].

Manuscript received November 4, 2016; revised October 5, 2017 and February 12, 2018; accepted March 5, 2018. Date of publication April 5, 2018; date of current version November 16, 2018. This work was supported in part by the NSFC under Grant 61473167, in part by the Beijing Natural Science Foundation under Grant L172037, and in part by the German Research Foundation (DFG) in Project Crossmodal Learning under Grant NSFC 61621136008 and Grant DFG TRR-169. (Corresponding author: Kun Fu.)

The authors are with the Department of Automation, Tsinghua University, Beijing 100084, China, also with the State Key Laboratory of Intelligent Technologies and Systems, Tsinghua University, Beijing 100084, China, and also with the Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2018.2813306

Fig. 1. Encoder–decoder framework of neural image captioning models. It encodes the input image into visual or lexical representations with a CNN and then decodes the representations into a natural language sentence using an LSTM.

Fig. 1 shows the structure of the encoder–decoder framework.

The neural models aim to learn P(S|I), the conditional probability of a sentence S given an image I. By rewriting the sentence S (with length T) as a word sequence {w_1, w_2, \ldots, w_T}, we obtain the following:

P(S \mid I) = P(w_1, w_2, \ldots, w_T \mid I) = \prod_{t=1}^{T} P(w_t \mid w_{<t}, I)    (1)

where w<t are the words before wt . The conditional prob-ability, P(S|I ), is decomposed into a product of one-steptransition probabilities, which is a time series that can bemodeled by the recurrent neural networks such as LSTMs.The model is trained by maximizing the likelihood of cor-pora of human-labeled image-sentence pairs. However, such atraining process requires a large number of paired image andtext data. Moreover, the model is not capable of describingconcepts unseen in the training pairs where a neural networkcaptioner often needs thousands of sentences written by humanannotators as well as the corresponding images to learn novelconcepts. However, this labeling process is costly in terms ofman hours, and thus, it is impractical to scale the captioningmodel to a wide range of unseen concepts.

Several works have proposed ways to alleviate the difficulty of learning novel concepts. Reference [16] shared the parameters in the softmax layer with word embedding vectors, which allows the neural model to incrementally update the parameters corresponding to novel words. This method decreased the amount of training data required, but zero-shot learning is not possible. Reference [17] proposed a compositional captioner that learns novel concepts from an unpaired image base and text corpora. However, the additional pretraining stage on unpaired data decreases the extendibility of the model, and the learning of novel concepts is implemented by heuristically transferring softmax parameters, which may not be optimal.


Reference [18] further combined the parameter-sharing technique of [16] and the framework of [17]. In their model, all the modules were jointly trained under a mixed objective loss, which included a captioning loss, a classification loss, and a language likelihood.

An issue of the frameworks used by [17] and [18] is that the language distribution in common text corpora is rather different from that in image caption corpora. Namely, a domain gap exists between novel concepts and original concepts. In [18], removing the in-domain text (i.e., sentences from caption data sets) from the unpaired text corpora leads to a 27.7% decrease in performance. However, the existence of in-domain text is a very strong assumption that does not hold in many practical cases. In this paper, we propose to learn novel concepts from synthesized pseudodata. Different from [17] and [18], we use a knowledge base (instead of text corpora) that acts as the direct knowledge source of novel concepts. The syntax in the existing data set is reused, and thus, the issue of language bias is eliminated. Generating pseudodata also provides an important perspective for alleviating the lack of strongly supervised data. It is beneficial, especially for neural networks, which typically require large data sets. For example, [19] generated and exploited large-scale pseudotraining data to improve the performance of zero pronoun resolution.

II. OVERVIEW

The motivation of this paper is to leverage the semantic structure of image and text to efficiently learn novel concepts. We noticed that both images and sentences consist of several semantically meaningful components that can be shared across image-sentence pairs. For example, "a zebra/giraffe in a green grassy field" shares the context "in a green grassy field." Combining either zebra or giraffe with the context is logically correct. Such a semantic structure enables a more efficient way to learn novel concepts. Suppose the system has learned the concept of giraffe but has never seen a zebra; it can learn to describe a zebra in a field just by recognizing zebra and knowing the fact that a zebra can be in a grassy field like a giraffe. The image and sentence are thus decoupled—the required data sources for novel concepts consist of an independent image base providing visual information and an independent knowledge base providing logic information. We term the concepts appearing in existing image-sentence pairs seed concepts. The core task of this paper is to efficiently learn novel concepts from (1) unpaired data sources + (2) seed concepts' pairs (see Fig. 2 for a schematic of this framework).

The first challenge is to understand how to jointly learn from paired data and unpaired data. The forms of unpaired data are rather different from those of paired data: the image base consists of images grouped by categories, whereas the knowledge base is organized as (S, R, O) entries, where S, R, and O stand for subject, relation, and object, respectively. We propose to generate pseudopairs for novel concepts, namely, we complement seed pairs with information from unpaired data to synthesize new image-sentence pairs. The words and visual regions of seed concepts are replaced by those of novel concepts.

Fig. 2. Schematic of our framework. The bottom model failed to recognize zebra, which is absent in the seed pairs. Our approach takes advantage of the unpaired knowledge base and image base. Data from these two resource-cheap sources allow the captioning system to efficiently scale to novel concepts. Without any human-labeled image-sentence pair for zebra, the top model correctly describes zebra.

In this way, the unpaired data are converted to the same form as that of the original paired data—the captioning model can establish uniform learning of seed concepts and novel concepts. We refer to this replacement-based method of generating pseudopairs as image-text surgery (ITS). Section III introduces ITS in detail.

The second challenge comes with the replacement of concepts in an image. Unlike replacing words, which can be done with high precision, replacing regions in an image is difficult due to the unsatisfactory precision of recognition and segmentation. That is, we cannot exactly remove the visual components of seed concepts and fill in those of novel concepts at the pixel level. We propose the adaptive visual replacement (AVR) method to operate on high-level semantic features instead of pixels. For pseudodata, AVR concatenates the visual representations of seed images and novel images and then uses an attention mechanism to adaptively filter the unnecessary information (i.e., the visual components of seed concepts). The advantages of AVR include: 1) it avoids the need for extra recognition and segmentation modules and 2) it learns the alignments between image regions and concepts from the data, thus making the replacement data driven. Details are introduced in Section IV.

We base the experiments on MSCOCO [20], a popular data set for image captioning. Eight concepts are held out from MSCOCO to be used as the novel concepts. Their image-sentence pairs are not accessible during training and validation and are used to evaluate performance in the test stage. The F1 score is used to measure how well the captioning model recognizes novel concepts, whereas the sentence quality is evaluated by the automatic metrics that are widely used in machine translation and image captioning. Our approach provides significant performance improvements over state-of-the-art methods in terms of the F1 score and sentence quality. An ablation study was conducted to measure the contributions of key techniques in our approach. We present example captions of novel concepts and also visualize the attention transition. Such qualitative results further demonstrate the effectiveness of our approach.


The experiments are described in Section V.

The main contributions of this paper are summarized as follows.

1) We propose a method (ITS) to synthesize pseudoimage-sentence pairs from an unpaired image base and knowledge base. To the best of our knowledge, this is the first work to leverage pseudodata in image captioning tasks.

2) The pseudodata benefit from unifying the form of the original data and unpaired data and from eliminating the issue of the language domain gap.

3) We introduce an adaptive method, AVR, to replace the visual components of concepts. With an attention mechanism, it selects relevant information in a dynamic way.

4) We validate the effectiveness of the proposed approach on a held-out subset of the MSCOCO data set. Our approach outperforms state-of-the-art methods, with significant improvements on F1 scores and sentence quality.

III. GENERATE PSEUDOPAIRS

We refer to the concepts in the original pairs as seed concepts and the concepts to be learned from unpaired data as novel concepts. The language style in image captioning is plain, and the seed sentences are enough to cover the grammar. When learning novel concepts, most of the effort is on learning to recognize them, not learning new grammar or syntax.

Thus, we propose to synthesize pseudopairs by reusing the syntax of seed pairs and filling in information from novel concepts. The basic idea is to replace words and image regions of seed concepts with those of novel concepts. The rationality of replacements is evaluated by an external knowledge base. The visual information of novel concepts comes from an external image base with only category labels. The generation process consists of four steps: 1) construct a knowledge base; 2) calculate concept similarity; 3) synthesize a sentence for the pseudopair; and 4) generate visual representations for the pseudopair. We introduce the details in the following sections.

A. Construct Knowledge Base

A knowledge base (KB) is usually in the form of (S, R, O) triples, where S, R, and O, respectively, stand for subject, relation, and object. Each triple is a support that the relationship (S, R, O) is logically correct. For example, (cat, on, sofa) tells us that a cat can reasonably appear on a sofa. We construct our KB from a text corpus. Compared to image-sentence pairs laboriously labeled by humans, a text corpus is easier to obtain and covers a wider range of concepts.

For a sentence in the text corpus, we do part-of-speech (POS) tagging and only keep the N (noun), V (verb), and P (preposition) components. We treat each V/VP/P phrase as a relation R. The nouns that are adjacent to R are regarded as the correlated subject S and object O, respectively. We lemmatize the nouns and verbs to eliminate the influence of plurals and tenses. Some concepts may share words in their names, e.g., dog and hot dog. To avoid ambiguity, we take the principle of maximum matching length—choosing the longest noun phrase matched in WordNet [21], a large lexical database of English.

An example: for the sentence "People stand in a small farmer's market with vegetables," we get (people/N, stand in/VP, farmer's market/N, with/P, vegetable/N) after POS tagging and lemmatization and then extract two triples: (people, stand in, farmer's market) and (farmer's market, with, vegetable). Note that farmer's market, not market, is extracted according to the principle of maximum matching length.
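As an illustration of this extraction step, the following is a minimal sketch using NLTK's tokenizer, POS tagger, and WordNet lemmatizer (the paper uses the Stanford NLP tools and NLTK; see Section V-D). The chunking rules are simplified and the maximum-matching-length lookup against WordNet is omitted, so multiword concepts such as farmer's market are not recovered here.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def extract_triples(sentence):
    """Simplified (S, R, O) extraction: keep nouns, verbs, and prepositions,
    merge consecutive verb/preposition words into one relation phrase, and
    pair each relation with the adjacent nouns."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))

    kept = []  # (lemma, coarse_tag) with coarse_tag in {"N", "R"}
    for word, tag in tagged:
        if tag.startswith("NN"):
            kept.append((lemmatizer.lemmatize(word, "n"), "N"))
        elif tag.startswith("VB") or tag == "IN":
            kept.append((lemmatizer.lemmatize(word, "v"), "R"))

    # Merge consecutive relation words into a single relation phrase (e.g., "stand in").
    merged = []
    for lemma, tag in kept:
        if tag == "R" and merged and merged[-1][1] == "R":
            merged[-1] = (merged[-1][0] + " " + lemma, "R")
        else:
            merged.append((lemma, tag))

    # Emit (subject, relation, object) for every noun-relation-noun pattern.
    triples = []
    for i in range(len(merged) - 2):
        (s, ts), (r, tr), (o, to) = merged[i], merged[i + 1], merged[i + 2]
        if ts == "N" and tr == "R" and to == "N":
            triples.append((s, r, o))
    return triples

print(extract_triples("People stand in a small farmer's market with vegetables"))
```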

The language distribution has a heavy impact on the effectiveness of the knowledge base. For example, we can hardly extract useful relationships for the image captioning task from a collection of poems. When no text corpus with a suitable distribution is available, a knowledge base built from human-labeled relationships (denoted as KB-RELA) can be an alternative choice. We denote the knowledge base built from text descriptions as KB-DESC. Relationships in KB-DESC are automatically extracted from text with very little (if any) manpower, whereas relationships in KB-RELA are directly written by human annotators—absent any image. The latter makes a tradeoff by eliminating the dependence on the language distribution of the text corpus, but increasing the manpower cost.

B. Measure Concept Similarity

To ensure the quality of pseudopairs, the replacement of concepts should be restricted to a reasonable range. We propose a method of measuring concept similarity based on the knowledge base. The large number of possible replacements is pruned using the similarities.

For a concept c and a triple (c, R, O), we define c's background as (R, O). For a triple (S, R, c), we similarly define (S, R) as c's background. We convert the KB triples into (c, b) pairs, where b is the background, and denote the set of all backgrounds in the KB as B. If the triple corresponding to (c, b), b ∈ B, exists in the KB, c and b are regarded as a "match." Given that similar concepts are more likely to share a background, we measure the similarity based on the background overlap between two concepts.

We denote x_c as the feature of c. The |B|-dimensional binary vector x_c has a one in the i-th position if c and b_i ∈ B match and a zero otherwise. This is similar to the bag-of-words feature, but the word counts are ignored. By doing this, we assume equal importance over the backgrounds; however, this is usually not the case. Common backgrounds, such as (in, room), can be matched to a lot of concepts and are thus less informative than (in, kitchen). We therefore multiply each background by a weight similar to term frequency-inverse document frequency (TF-IDF)

x_c^{(i)} = \mathbb{1}_{c \leftrightarrow b_i} \times \log \frac{|C|}{|\{\tilde{c} \in C : \tilde{c} \leftrightarrow b_i\}|}    (2)

where C is the concept set, x_c^{(i)} is the i-th dimension of x_c, and c \leftrightarrow b_i means that c and b_i match. The importance of backgrounds is decreased according to their popularity.
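A small sketch of how the weighted features in (2) could be built from a triple store; the data structures and names are illustrative, and subject and object occurrences are pooled here for brevity (the paper treats them separately, as described next).

```python
import math
from collections import defaultdict

def build_background_features(kb_triples, concepts):
    """kb_triples: iterable of (subject, relation, object) strings.
    concepts: the concept set C. Returns {concept: {background: idf_weight}}
    as sparse feature vectors."""
    concept_set = set(concepts)

    # A concept's background is (R, O) when it is the subject, or (S, R) when
    # it is the object; "_" marks the slot occupied by the concept itself.
    matches = defaultdict(set)  # background -> set of matching concepts
    for s, r, o in kb_triples:
        matches[("_", r, o)].add(s)
        matches[(s, r, "_")].add(o)

    features = {c: {} for c in concept_set}
    for background, matched in matches.items():
        matched_concepts = matched & concept_set
        if not matched_concepts:
            continue
        # 1_{c<->b} * log(|C| / |{c~ in C : c~ <-> b}|): rarer backgrounds weigh more.
        idf = math.log(len(concept_set) / len(matched_concepts))
        for c in matched_concepts:
            features[c][background] = idf
    return features
```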

Since a concept may behave very differently when used as a subject and as an object, we model these two cases separately. Namely, we treat c appearing as the subject and c appearing as the object as two independent concepts when computing similarities.


Algorithm 1 Context-Specific Replacement Strategy
Input: c_n: novel concept
       C: set of seed concepts
       KB: knowledge base
       S_s: set of sentences in human-labeled data
       N_max: maximum number of replacements
1  Find the top 5 similar concepts to c_n in C, and denote them as C_s
2  Initialize S_n ← ∅
3  for S in S_s do
4      if S contains any concept c_s in C_s then
5          do POS tagging for S, obtaining (c_s, b)
6          if (c_n, b) ∈ KB then
7              replace words about c_s with words about c_n in S, obtaining the synthetic sentence S'
8              S_n ← S_n ∪ {S'}
9              if |S_n| > N_max then
10                 end loop
11 return S_n: the set of pseudosentences of c_n

The similarity between c and \tilde{c} is measured by the cosine similarity between the feature vectors x_c and x_{\tilde{c}}

S(c, \tilde{c}) = \frac{\langle x_c, x_{\tilde{c}} \rangle}{\|x_c\| \cdot \|x_{\tilde{c}}\|}, \quad x_c, x_{\tilde{c}} \in \mathbb{R}^{|B|}.    (3)

If c and \tilde{c} share no backgrounds, the similarity S(c, \tilde{c}) = 0. For a certain novel concept, only its top five similar concepts in seed pairs are considered for replacement.
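Continuing the sketch above, the similarity of (3) and the top-five pruning can be computed directly on the sparse feature dictionaries; `build_background_features` from the previous sketch is assumed.

```python
import math

def cosine_similarity(fa, fb):
    """fa, fb: sparse feature dicts {background: weight}."""
    shared = set(fa) & set(fb)
    dot = sum(fa[b] * fb[b] for b in shared)
    na = math.sqrt(sum(v * v for v in fa.values()))
    nb = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def top_similar_seed_concepts(novel, seed_concepts, features, k=5):
    """Rank seed concepts by similarity to the novel concept, as in (3)."""
    scored = [(cosine_similarity(features[novel], features[s]), s)
              for s in seed_concepts]
    scored.sort(reverse=True)
    return [s for _, s in scored[:k]]
```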

C. Generate Pseudosentences

Let c_n be a novel concept and C_s be its top five similar concepts in seed pairs. The to-be-replaced sentence candidates form a set S_s, in which each sentence comes from seed pairs and corresponds to one concept of C_s. A naive way is to replace the words of c_s by the words of c_n in every sentence of S_s. However, such a global strategy fails to consider the specific context of the sentence; thus, it may lead to unreasonable replacements, such as "a brown dog barks in the yard" → "a cat barks in the yard."

We propose a context-specific replacement strategy. For a sentence in S_s, we first do POS tagging and lemmatization to obtain the background b of c_s and then look up (c_n, b) in the KB to check whether the combination of c_n and b is logically correct or not. If (c_n, b) is not supported by the knowledge base, the sentence after replacement is nonsense, and such a replacement should be avoided. Returning to the dog–cat example, the replacement "a brown dog barks in the yard" → "a cat barks in the yard" is skipped, since (bark, in yard) is not a valid background of cat in the KB. With the consideration of specific contexts in replacement candidates, we perform precise surgery to generate more reasonable pseudosentences. Algorithm 1 describes the context-specific replacement procedure.
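A compact sketch of Algorithm 1. Here `get_background` stands in for the POS tagging and lemmatization step that recovers (c_s, b) from a sentence, and `top_similar_seed_concepts` is the helper from the similarity sketch above; both are assumptions for illustration, not the paper's exact implementation.

```python
def generate_pseudosentences(novel, seed_sentences, kb, features,
                             seed_concepts, n_max=5000):
    """Context-specific replacement (Algorithm 1), simplified.

    seed_sentences: list of (sentence, seed_concept) pairs from the paired data.
    kb: set of (concept, background) pairs considered logically valid.
    """
    similar = set(top_similar_seed_concepts(novel, seed_concepts, features, k=5))
    pseudo = []
    for sentence, c_seed in seed_sentences:
        if c_seed not in similar:
            continue
        background = get_background(sentence, c_seed)   # hypothetical helper
        if (novel, background) not in kb:
            continue                                     # replacement would be nonsense
        pseudo.append(sentence.replace(c_seed, novel))   # word-level surgery
        if len(pseudo) >= n_max:
            break
    return pseudo
```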

Since this paper focuses on exploring how pseudodata can help a captioning system to recognize more concepts, we ignore many issues studied in natural language processing (NLP) and leave them as future work. For example, the attributes of novel concepts are dropped, and the singular/plural forms are not adaptively changed. Many techniques in NLP deal with such issues; however, they are not the main topics of this paper.

D. Encode Images Into Visual Feature Representations

Before introducing how to generate pseudoimages, we first describe our image encoding method. The encoding process converts an image into a set of visual feature representations.

To encode an image, two methods have been popular in previous work: 1) convert the whole image into a feature vector that contains the global visual information or 2) segment the image into regions and obtain a set of feature vectors that contain local visual information. Method 2) is a more natural choice for this paper—we need the visual elements to be decoupled. We draw inspiration from [3] and [5], both of which represent images with multiple regions. Reference [3] uses multiscale regions from object proposals, whereas [5] uses simple tiled regions. Though the multiscale regions in [3] produced better captioning performance, the region proposal method is complicated and nondifferentiable. To simplify the pipeline and focus on the novel concept learning task, we take the simpler tiled regions. Specifically, we resize the image to 448 × 448, slide a 224 × 224 window with stride 64, and finally get 7 × 7 regions.

We choose the VGG-16 network [22] to encode each of the 49 regions into semantic representations. The VGG-16 network receives a 224 × 224 input region, alternates convolutional layers with nonlinear layers, and ends with two fully connected layers. The 4096-dimensional output of "fc7," namely, the last fully connected layer, is regarded as the semantic representation of the input region. We accelerate the computation by adapting the VGG-16 network to a fully convolutional network [23]. In this manner, the convolutional operations of all regions are shared. The parameters of VGG-16 are pretrained on the ImageNet data set [24] under the image classification task.

So far, an image is represented as a set of region features

R = \{r_1, r_2, \ldots, r_R\}, \quad R = 49, \quad r_i \in \mathbb{R}^{4096}.    (4)
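A minimal sketch of this encoding step. The stride is derived here from the target 7 × 7 grid rather than hard-coded, and `vgg16_fc7` is a placeholder for the pretrained, frozen VGG-16 "fc7" encoder (the paper adapts VGG-16 to a fully convolutional network in Theano).

```python
import numpy as np
from PIL import Image

def tile_regions(image_path, size=448, window=224, grid=7):
    """Resize the image and tile it into a grid x grid set of windows.
    Returns an array of shape (grid * grid, window, window, 3)."""
    img = Image.open(image_path).convert("RGB").resize((size, size))
    arr = np.asarray(img)
    stride = (size - window) // (grid - 1)
    regions = []
    for gy in range(grid):
        for gx in range(grid):
            top, left = gy * stride, gx * stride
            regions.append(arr[top:top + window, left:left + window, :])
    return np.stack(regions)

def encode_image(image_path, vgg16_fc7):
    """Encode every tiled region with the (frozen) VGG-16 'fc7' feature extractor."""
    regions = tile_regions(image_path)
    return np.stack([vgg16_fc7(region) for region in regions])  # shape (49, 4096)
```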

E. Generate Pseudoimage Representations

Unlike replacing words in a sentence, replacing concepts in an image is more complicated. An intuitive way is image blending: digging out the pixels of the seed concept from the image and then filling in the novel concept. However, it has two drawbacks in practice. First, a detector should be trained in advance to separate the seed concepts from the background. This requires precise recognition and localization of the various objects in the image, which is still a challenging task in computer vision. Moreover, we need to retrain the detector every time the concept range is enlarged. Second, to fit the shape of the background, the patch of the novel concept usually has to be rescaled and distorted—this may lead to a wrong scale and shape relationship between the background and the novel concept.


Fig. 3. Pipeline of generating pseudopairs for novel concepts. We take zebra as an example. The process includes two steps: 1) query the knowledge base to target an image-sentence pair S_s–I_s in the human-labeled data—giraffe in this example—and then replace the words "tall giraffe" with "zebra" to generate a pseudosentence S_n and 2) query the image base to obtain a zebra image I_n and then encode I_n as well as I_s into visual representations R_s ∪ R_n. The process is repeated until enough pseudopairs have been generated.

To alleviate the distortion problem, replacing the visual feature representations is a possible solution, but it still needs an additional algorithm to match the concepts and the regions.

We propose the AVR method, which is robust to distortion and learns the matching between concepts and regions in an end-to-end manner. AVR concatenates the visual representations of the novel concept to those of the seed concept and filters the unnecessary visual information of the seed concept by an attention mechanism [5]. The attention mechanism as well as our captioning model are introduced in Section IV-A, and AVR is discussed in detail in Section IV-B. Here, we only describe how to deal with images of novel concepts.

We query the image base with a novel concept to fetch an image I_n, which is randomly selected among the images corresponding to the query concept. The image I_n as well as the image I_s in the original seed pair are converted to visual feature representations R_n and R_s, respectively. We denote the seed sentence as S_s and the pseudosentence as S_n. The new training data consist of two parts: S_s-R_s pairs for seed concepts and S_n-(R_s ∪ R_n) pseudopairs for novel concepts. Note that the unnecessary features in R_s ∪ R_n will be adaptively processed by the attention mechanism. The whole process of generating pseudopairs is shown in Fig. 3.
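A sketch of how one pseudo training example could be assembled; `image_base` (a concept-to-image-paths mapping, e.g., built from ImageNet synsets) and the reuse of `encode_image` from the earlier sketch are assumptions for illustration. The original S_s-R_s seed pairs are kept unchanged alongside these examples.

```python
import random
import numpy as np

def make_pseudopair(pseudo_sentence, seed_image_features, novel_concept,
                    image_base, encode_image, vgg16_fc7):
    """Assemble one pseudo training example S_n - (R_s ∪ R_n)."""
    novel_image = random.choice(image_base[novel_concept])   # fetch I_n
    novel_features = encode_image(novel_image, vgg16_fc7)    # R_n, shape (49, 4096)
    # Concatenate the seed and novel region features; the attention mechanism
    # is expected to filter the unnecessary regions during training.
    regions = np.concatenate([seed_image_features, novel_features], axis=0)
    return {"sentence": pseudo_sentence, "regions": regions}  # R_s ∪ R_n
```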

IV. ADAPTIVE VISUAL REPLACEMENT

In this section, we first describe the attention-based captioning model, then introduce how to use the attention mechanism to adaptively replace concepts in images, and finally give the learning details.

A. Captioning Model

The sentence generation can be decomposed into steps. At each step, the captioning model first reads visual context from the image and historical information from the previous state and then uses them to predict a new word. We hypothesize there is a hidden state process {h_t} governing the information flow among these steps. Roughly speaking, at step t, the model goes through three stages: 1) attending to a new visual context v_t; 2) updating the hidden state h_t with h_{t-1}, v_t, and the previous word w_{t-1}; and 3) generating the new word w_t given h_t and w_{t-1}.

1) Generate the Visual Context With Attention: We first map the region features to a 512-dimensional space by multiplying by a projection matrix P ∈ \mathbb{R}^{512 \times 4096}. With the dimension-reduced features, we then build the soft attention mechanism [5], in which the visual context v_t is a weighted sum of region representations

v_t = \sum_{i=1}^{R} \alpha_{it} P r_i    (5)

where \alpha_{it} is the attention weight of region i at step t and is computed as follows:

\alpha_{it} \propto \exp\{ f_v(P r_i, h_{t-1}) \} \quad \forall\, i = 1, 2, \ldots, R    (6)

where f_v(\cdot) is the mapping function before the softmax layer of a one-hidden-layer multilayer perceptron. Note that the hidden state h_{t-1} plays an important role in deciding \alpha_{it} and therefore gives the attention its time dynamics.
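A numpy sketch of the soft attention step in (5) and (6); `attn_mlp` stands in for the one-hidden-layer MLP f_v and is an assumption here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(regions, h_prev, P, attn_mlp):
    """Soft attention over region features, as in (5)-(6).

    regions: (R, 4096) region features; P: (512, 4096) projection matrix;
    attn_mlp: callable scoring a (projected_region, h_prev) pair.
    """
    projected = regions @ P.T                                  # (R, 512)
    scores = np.array([attn_mlp(p, h_prev) for p in projected])
    alpha = softmax(scores)                                    # attention weights
    context = alpha @ projected                                # v_t = sum_i alpha_it P r_i
    return context, alpha
```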

2) Update Hidden State Using LSTM: Considering the complicated dynamics of the hidden state {h_t}, we model it with an LSTM network [15], which uses gates and additive memory to alleviate the problems of gradient vanishing and gradient exploding.


Fig. 4. Information flow during the sentence generation process. We take a short sentence, "a zebra is walking," as an example. The process starts with a zero initialization h_0 and a special token BEGIN and then sequentially predicts words until an END token is generated. At each step, the model first attends to a new visual context, then updates the LSTM, and finally generates the new word's distribution.

The inputs of the LSTM at step t include the previous hidden state h_{t-1}, the newly generated visual context v_t, and the previous word w_{t-1}. The words are represented as one-hot vectors. The word vectors can be obtained by Q \cdot w_t, where Q ∈ \mathbb{R}^{512 \times |V|} is the word embedding matrix and |V| is the vocabulary size. We have the updating rule of the LSTM

x_t = [h_{t-1}; v_t; Q \cdot w_{t-1}]
i_t = \sigma(W_i x_t + b_i)
o_t = \sigma(W_o x_t + b_o)
f_t = \sigma(W_f x_t + b_f)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_g x_t + b_g)
h_t = o_t \odot \tanh(c_t)    (7)

where \odot is element-wise multiplication, \sigma(\cdot) and \tanh(\cdot) are element-wise sigmoid and hyperbolic tangent functions, i_t, o_t, and f_t stand for the input, output, and forget gates, respectively, and c_t is the memory cell. The sizes of the memory cell and the hidden state are both 512.
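A plain numpy version of one update step of (7), assuming a `params` dictionary holding the gate weights sized for the concatenated input; this is a sketch of the recurrence, not the paper's Theano implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, v_t, w_prev_embedding, params):
    """One LSTM update as in (7); params holds W_i, b_i, W_o, b_o, W_f, b_f, W_g, b_g."""
    x = np.concatenate([h_prev, v_t, w_prev_embedding])              # x_t = [h; v; Q w]
    i = sigmoid(params["W_i"] @ x + params["b_i"])                   # input gate
    o = sigmoid(params["W_o"] @ x + params["b_o"])                   # output gate
    f = sigmoid(params["W_f"] @ x + params["b_f"])                   # forget gate
    c = f * c_prev + i * np.tanh(params["W_g"] @ x + params["b_g"])  # memory cell
    h = o * np.tanh(c)                                               # hidden state
    return h, c
```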

3) Predict New Word: The next word's distribution is computed by a softmax regression on the hidden state h_t and the previous word w_{t-1}

p(w_t \mid w_{<t}, I) = \mathrm{softmax}(W_w [Q \cdot w_{t-1}; h_t] + b_w).    (8)

To select a word, either a stochastic strategy that samples a word from the distribution or a deterministic strategy that selects the word with the largest probability can be used. A deterministic strategy is often preferred, because the outputs of the stochastic strategy are unstable, namely, the generated sentence for an image is not fixed across different runs. Moreover, sampling is more likely to break the syntactic correctness of the generated sentence than deterministic selection, since improper words may be sampled. The beam search technique is usually used with the deterministic strategy to further improve the results [3], [8], [16]. In short, beam search keeps several best-so-far branches when generating words and finally selects the sentence with the largest probability (the most likely sentence). In this paper, we use a beam size of 3. Fig. 4 summarizes the information flow in the captioning model.
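A sketch of the beam search decoding described above; `step_fn` wraps one attention + LSTM + softmax step of the model and is an assumed interface. With beam_size=1 this reduces to greedy decoding; the paper reports beam size 3 as the best setting (Table IV).

```python
import numpy as np

def beam_search(step_fn, init_state, begin_id, end_id, beam_size=3, max_len=20):
    """Keep the `beam_size` best partial sentences while decoding.

    step_fn(state, word_id) -> (log_probs over the vocabulary, new_state)
    """
    beams = [(0.0, [begin_id], init_state, False)]  # (log_prob, words, state, done)
    for _ in range(max_len):
        candidates = []
        for logp, words, state, done in beams:
            if done:
                candidates.append((logp, words, state, True))
                continue
            log_probs, new_state = step_fn(state, words[-1])
            for w in np.argsort(log_probs)[-beam_size:]:       # top-k extensions
                candidates.append((logp + log_probs[w], words + [int(w)],
                                   new_state, int(w) == end_id))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
        if all(done for _, _, _, done in beams):
            break
    return beams[0][1]  # the most likely word sequence
```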

B. Adaptive Visual Replacement

Due to the difficulty of recognition and segmentation, the visual components of seed concepts cannot be exactly removed from the image at the pixel level. Thus, we replace the visual information of concepts on high-level semantic features, which are more robust. For a pseudosentence, we concatenate the features of the novel image R_n to the features of the seed image R_s to obtain R_n ∪ R_s. This new feature set includes visual noise, namely, the regions corresponding to seed concepts in R_s and the background regions irrelevant to novel concepts in R_n. Their information is unnecessary for pseudosentences.

The attention mechanism is able to adaptively block the visual noise by learning the alignments between image regions and words. The strong correlations between regions and words are gradually learned from the large number of weak correlations between images (region bags) and sentences (word bags). A similar idea is successfully applied in multiple-instance learning. Intuitively, the regions of seed concepts should be well attended for seed sentences (when they are regions of interest) and be ignored for pseudosentences (when they are visual noise). In this way, though we cannot explicitly remove the visual noise, we provide a data-driven mechanism for the captioning model to choose the visual information it really needs. The noise is blocked by small attention weights [see (5)] and thus has little effect on the sentence generation for novel concepts.

Leveraging the attention mechanism to adaptively replace visual components is advantageous in two aspects: 1) it avoids extra recognition and segmentation modules, making the pipeline simpler and 2) it learns alignments between image regions and concepts from data, thus making the replacement data driven.

To fit the purpose of adaptive replacement, we initialize the LSTM with zero states. In [3] and [5], the LSTM's states h and c are initialized using the mean of the visual representations.


The image-dependent initialization may help to quickly grasp global visual information, which is, however, not favored in our setting. Feeding regions to the initialization module without filtering is contradictory to the use of the attention mechanism, which selectively feeds visual information to the LSTM to adaptively filter out the visual components of no interest.

C. Learning Details

Our captioning model learns P(w_1, w_2, \ldots, w_{T+1} \mid w_0, I), the conditional probability of the word sequence of a given image. Note that w_0 and w_{T+1} are two special tokens (BEGIN and END) concatenated to each sentence to indicate the beginning and ending. We train the model by maximizing the likelihood of the training image-sentence pairs. Concretely, we minimize the negative log-likelihood loss

L = -\frac{1}{N} \sum_{n} \frac{1}{T_n} \sum_{t=1}^{T_n + 1} \log p\big(w_t^{(n)} \mid w_{<t}^{(n)}, I^{(n)}\big)    (9)

where N is the number of training sentences, T_n is the length of the nth sentence, w_{<t}^{(n)} stands for all the words before w_t^{(n)}, and I^{(n)} stands for the paired image. The components jointly trained under (9) include the LSTM, the visual attention module f_v(\cdot), the word predictor f_w(\cdot), the visual feature dimension-reducing matrix P, and the word embedding matrix Q. Note that VGG-16 is not trained but fixed as a feature extractor in our model.
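A small sketch of the loss in (9), assuming the per-step probabilities p(w_t | w_{<t}, I) have already been produced by the softmax in (8).

```python
import numpy as np

def caption_nll(per_step_probs_per_sentence):
    """Negative log-likelihood loss of (9).

    per_step_probs_per_sentence: list over sentences; each entry is the list of
    model probabilities p(w_t | w_<t, I) for t = 1 .. T_n + 1 (including END).
    """
    total = 0.0
    for probs in per_step_probs_per_sentence:
        t_n = max(len(probs) - 1, 1)               # sentence length (without END)
        total += sum(np.log(p) for p in probs) / t_n
    return -total / len(per_step_probs_per_sentence)
```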

The amount of pseudodata is an important hyperparameter. With very few pseudopairs, the captioning model has insufficient learning of novel concepts. With too many pseudopairs, novel concepts dominate the objective function and may badly undermine the performance on seed data. We control the pseudopairs of each novel concept to be around 5000 by sampling the replacement proposals. It is also a number that balances the seed and novel concepts: the common concepts in MSCOCO typically correspond to hundreds to thousands of images, so 1000 images (with 5000 sentences) is a moderate value. Since this choice is a little tricky, we experimentally analyze the effect in Section V-E4, finding that the performance on novel concepts is robust to this hyperparameter within a wide range. The number 5000 is validated to be a reasonable choice for MSCOCO.

V. EXPERIMENTS

A. Data

We base our experiments on MSCOCO [20], currently the most popular data set in the image captioning community. It has released two image sets—"train2014" containing 82 783 images and "val2014" containing 40 504 images—and each image is paired with five human-written captions. We use "train2014" as our training set and randomly split "val2014" into 20 504 and 20 000 subsets as our validation set and test set.

We hold out eight concepts from MSCOCO as the novel concepts. Following the protocol in [17], the eight concepts are bottle, bus, couch, microwave, pizza, racket, suitcase, and zebra. The data can thus be split into two parts, respectively denoted as SEED and NOVEL. An image belongs to the NOVEL part if any of its five captions mentions any NOVEL concept, and belongs to the SEED part otherwise. In the training stage, the model has no access to any images or sentences corresponding to NOVEL concepts.

To learn novel concepts without labeling image-sentence pairs for them, our model needs two unpaired data sources. One source is an image base that provides the visual information of the novel concepts. We use ImageNet [24], which contains more than 14 million images organized according to a hypernym and hyponym hierarchy. By querying ImageNet, we can easily obtain a wide range of images whose concepts are absent in MSCOCO. The other source is a knowledge base to validate the logical correctness of concept replacements. Using the method described in Section III-A, we construct a knowledge base, KB-DESC, from the text descriptions of the Visual Genome (VG) data set [25]. Note that for constructing KB-DESC, we merely treat VG as a text corpus, though it provides relation labels. As a comparison, we construct another knowledge base, KB-RELA, using VG's ground-truth relations.

B. Evaluation Metrics

The performance of learning novel concepts is evaluated from two aspects: how well the model can recognize a novel concept and how good the quality of a generated sentence is.

For the evaluation of recognition performance, we borrow the idea from [17] of using the F1 score. For a machine-generated sentence S_m and a ground-truth sentence S_g, we compute the F1 score for concept c using the frequencies of the following cases.

1) True Positive: c in both S_g and S_m.
2) True Negative: c in neither S_g nor S_m.
3) False Positive: c in S_m but not in S_g.
4) False Negative: c in S_g but not in S_m.

A high F1 score for a novel concept indicates that the model successfully learns the novel concept with little decrease in the performance of recognizing other concepts in the seed data.
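A sketch of how the per-concept F1 score could be computed from generated and reference captions; the exact matching rules (word boundaries, plural handling) in [17] may differ, so simple lowercase substring matching is used here as an assumption.

```python
def concept_f1(concept, generated, references):
    """F1 score for one concept over a test set.

    generated: list of machine-generated captions (one per image);
    references: list of lists of ground-truth captions for the same images.
    """
    tp = fp = fn = 0
    for gen, refs in zip(generated, references):
        in_gen = concept in gen.lower()
        in_ref = any(concept in r.lower() for r in refs)
        if in_gen and in_ref:
            tp += 1
        elif in_gen and not in_ref:
            fp += 1
        elif in_ref and not in_gen:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```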

For the evaluation of sentence quality, the F1 score is not suitable. We leverage several automatic metrics, including BLEU-1, BLEU-4 [26], METEOR [27], ROUGE-L [28], and CIDEr-D [29]. In essence, these metrics evaluate the agreement between the generated sentences and the ground truths. We use the MSCOCO Python application programming interface [20] to compute these metrics.

C. Competing Methods

We compare our approach with the state-of-the-art methods deep compositional captioning (DCC) [17] and novel object captioner (NOC) [18]. Both of them learn novel concepts from an unpaired text corpus and image base. For further evaluation, we additionally train two models as references. The first one is ALL, which is trained with both SEED and NOVEL pairs to show the performance when human-labeled pairs are available. The other one is NONE, which is trained with only SEED pairs, to show the poor performance when no effort is made to learn novel concepts. Note that model NONE cannot generate any word about a NOVEL concept, because these words have been removed from the training data and thus are not included in the vocabulary.


TABLE I

F1 SCORES ON TEST SET (SHOWN IN %)

Recall that the F1 score is computed by examining whether the generated sentences contain words about NOVEL concepts; model NONE should therefore have F1 scores of exactly 0% for NOVEL concepts.

D. Implementation Details

The Stanford NLP tools [30] and the Natural Language Toolkit [31] were used in our experiments to do POS tagging of sentences. Our code is written in Python, and the neural networks are implemented with Theano [32]. The model is optimized by Adam [33], a stochastic optimization algorithm, with a mini-batch size of 64. We used the default hyperparameters suggested in Adam [33], except that we decreased the learning rate to 0.0002. A universal out-of-vocabulary token is used to replace words with fewer than five occurrences in the caption corpus—in this way, we filter out the noisy words. Training is early stopped by monitoring the BLEU-4 score on the validation set.

E. Quantitative Results

We have two variants of our model: ITS-DESC and ITS-RELA. The former used KB-DESC to generate pseudopairs, whereas the latter used KB-RELA.

1) F1 Score: Table I lists the F1 scores of our models as well as the competing methods. Not surprisingly, the model NONE shows very poor performance in recognizing novel concepts, since it cannot generate words that are absent in the training data. With pseudopairs generated from the cheap source data, our models ITS-DESC and ITS-RELA successfully learned to recognize the novel concepts. Both of our models show high F1 scores on concepts in the held-out subset. On certain concepts, such as bottle, bus, and zebra, our models even reach the level of the ALL model, which is trained with thousands of human-labeled pairs for novel concepts.

DCC [17] and NOC [18] are the two state-of-the-art methods; however, they use stronger data access than ours. The MSCOCO images and sentences, including the NOVEL part, are allowed to train their image module and language module (without pair information). The authors refer to MSCOCO as in-domain data. However, assumptions of such ideal in-domain data hardly hold in many practical cases—most external text corpora are not written for image captioning. In our model, we assume the strict nonexistence of NOVEL data in the training stage, which is also the case in a real-world scenario. Though we used weaker data access, our models still achieved higher F1 scores than DCC and NOC. When using the same data access (versus DCC-out-domain and NOC-out-domain), our models outperform them by large margins. DCC and NOC suffer from the language bias between external text corpora and MSCOCO (in-domain versus out-domain for DCC and NOC in Table I). In principle, ITS-DESC still suffers from this issue, because it needs to construct a knowledge base from texts. Though achieving performance close to ITS-RELA in Table I, it benefits from the similar language distribution between VG and MSCOCO. ITS-RELA eliminates the issue of language bias at the price of a bit more manpower for labeling relationships.

The generated pseudopairs are not fixed at each run for the following reasons: 1) the image of a pseudopair is randomly selected from the image base according to the target concept and 2) the proposed replacements are sampled to control the number of pseudopairs. To evaluate our model's robustness, we independently ran five experiments for both ITS-DESC and ITS-RELA, with the mean F1 scores and standard variances shown in Table I. The low variances of the F1 scores demonstrate that our model is robust to the image selection and replacement sampling.

2) Sentence Quality: We computed BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr-D scores to measure the sentence quality. These metrics are a necessary complement to the F1 score. For an image containing zebra, an evaluation only using the F1 score will prefer the ill-formed sentence "zebra and zebra in a zebra" to the sentence "a giraffe standing in the field eating grass," because the former sentence mentions zebra. Table II shows the results of evaluating sentence quality. The metrics are, respectively, abbreviated as B-1, B-4, MT, RG, and CD and are multiplied by 100 for a better display. Both ITS-DESC and ITS-RELA get a 13.3% improvement on METEOR over the state-of-the-art DCC method. For metrics that are not reported in [17] and [18], we present the performance of our models for future comparison.

In Table III, we take a closer look at the sentence quality for novel concepts. The metrics are separately computed on the NOVEL and SEED subsets. On the NOVEL subset, training with pseudopairs significantly improves the sentence quality, whereas it is much lower for the model without pseudopairs (NONE). On the SEED subset, an interesting result is that the performances of the four models are very close. It indicates that learning novel concepts by pseudopairs did little harm to the original concepts in the seed data.


TABLE II

SENTENCE QUALITY METRICS ON THE WHOLE TEST SET

TABLE III

SENTENCE QUALITY METRICS ON DIFFERENT PARTS OF THE TEST SET

TABLE IV

COMPARISON OF SENTENCE GENERATION STRATEGIES (ON VAL. SET)

3) Sentence Generation Strategy: The captioning performance can be greatly influenced by the sentence generation strategy. We compared the stochastic strategy and beam search with various sizes. Table IV shows the performance of the ITS-RELA model with different strategies on our validation set. The results indicate that beam search with size 3 achieved the best performance.

4) Number of Pseudopairs: The number of pseudopairs is a key hyperparameter. Excessive pseudodata may overwhelm the original paired data, leading to a tendency of the model to explain all images with novel concepts. However, insufficient pseudopairs result in underlearning of novel concepts. We denote the pseudopair number for each concept as γ. To quantitatively explore how γ influences the performance, we vary it with exponential steps. In Fig. 5, we plot the average F1 score, precision, and recall rate of the eight novel concepts on the validation set. We observed an F1 score plateau—increasing the pseudopairs beyond γ ≈ 5000 yields few gains in performance. In Fig. 6, we plot the F1 scores of three seed concepts, truck, bed, and cake, to evaluate the effect of γ on the performance of the original concepts. The results show that the performance on seed concepts drops rapidly after γ ≈ 5000. Recall that our model uses γ = 5000, which makes a good tradeoff between learning novel concepts and maintaining original concepts.

We also observed that the performances on seed concepts are not monotone in γ. In Fig. 6, the F1 scores of truck, bed, and cake show small increases before decreasing.

Fig. 5. Average F1 score, precision, and recall rate (ITS-RELA) on novel concepts. The pseudopair number per concept varies from 100 to 51 200, shown on a logarithmic axis. After 5000 (black dot), the average F1 score of novel concepts stops increasing.

Fig. 6. F1 scores (ITS-RELA) on three seed concepts: truck, bed, and cake. The pseudopair number for each concept varies from 10 to 51 200, shown on a logarithmic axis. After 5000 (black dots), the F1 scores of the seed concepts drop rapidly.

If no pseudopair is provided, novel concepts cannot be recognized by the captioning model and thus may be confused with seed concepts. When the model enhances its ability to recognize novel concepts with a small portion of pseudopairs, it recognizes seed concepts better as well.

5) On More Concepts: Aside from following the protocol of [17], we also evaluate our model on more concepts of MSCOCO by randomly generating a new NOVEL subset. The new NOVEL concepts include van, dress, spoon, coffee, chair, carriage, beer, and duck. In Fig. 7, our models show good generalization ability to these concepts. Our models achieve performance close to the ALL model that uses human labels. For infrequent concepts, such as van and dress, pseudodata even show superiority over human-labeled data, since more pairs can be generated.

6) Ablations: Table V compares how different techniques contribute to the overall performance. We evaluate four important components via ablation experiments. When dropping "0-init," we use a multilayer perceptron to initialize the LSTM's c and h as in [5] and [3].


Fig. 7. F1 scores on a new NOVEL concept set. Our models achieve comparable performance to the ALL model trained with human labels. On infrequent concepts, such as van, dress, and beer, pseudodata even attain superiority since more corresponding data can be generated.

TABLE V

ABLATION RESULTS OF ITS-RELA

When dropping "TF-IDF," we remove the weighting term in (2) to equally consider each concept. When dropping "context," the replacements of concepts are not filtered by the sentence backgrounds. When dropping "attention," the visual context v_t is fixed to \frac{1}{R} \sum_{i=1}^{R} r_i, which removes the attention dynamics. We note that removing the zero initialization or the attention mechanism leads to a dramatic decrease in performance. This indicates that adaptive control of the fed-in visual information is crucial to our model. TF-IDF weighting and sentence context both bring slight improvements in performance.

We also compare different strategies for replacing concepts in an image. For the method "image blending," we follow the multiple-instance learning algorithm used in [11] to detect concepts. Since the detected regions in this algorithm are restricted to rectangles with fixed sizes, the flexibility of replacement is a bit undermined. We then simply resize the patch of the novel concept to replace the seed concept region. For the method "region replacement," we leverage the captioning model itself to match regions and words (see Section IV-C2 of [3]). We then replace the region representation of the seed concept with that of the novel concept. The results in Table VI validate the effectiveness of our replacement strategy AVR.

7) Manpower: We roughly compare the manpower cost of three methods: 1) labeling image and caption pairs by human annotators: writing a caption costs about 20 s, and each concept needs about 5000 sentences on average, so we need 20 × 5000 ÷ 3600 ≈ 27.8 man-hours to learn a novel concept; 2) generating pseudopairs using KB-RELA: take the VG data set in our experiment as an example.

TABLE VI

COMPARISON ON REPLACEMENT STRATEGIES

TABLE VII

TOP FIVE SIMILAR CONCEPTS MEASURED BY KB-DESC AND KB-RELA

After processing, it contains 490 000 relationships covering more than 7000 concepts. Suppose we hire workers to label these relationships; we need 10 × 490 000 ÷ 3600 ≈ 1361.1 man-hours (assuming 10 s per relationship). For each concept, it costs 1361.1 ÷ 7000 ≈ 0.194 man-hours, only 1/143 of the first method; and 3) generating pseudopairs using KB-DESC: since the relationships are automatically extracted from text by machines, this method costs very little (if any) manpower to generate pseudopairs.

F. Qualitative Results

We give qualitative results to better demonstrate how our model works. The results are composed of: 1) a display of the top five similar concepts to be replaced; 2) a visualization of the attention transition on the pseudotraining pairs; and 3) example captions generated by our model to describe novel concepts.

1) Similar Concepts: In Table VII, we display the top five similar concepts measured by KB-DESC and KB-RELA. To save space, only five novel concepts are displayed. Though KB-DESC is constructed from a text corpus, it obtains results close to KB-RELA in terms of measuring concept similarity.

2) Example Captions: Fig. 8 shows example captions for the eight novel concepts. For the model NONE trained only with paired data, the generated captions fail to describe the novel concepts. Instead, other objects in the image are described, or even wrong concepts are mentioned. As a comparison, our models, ITS-DESC and ITS-RELA, correctly describe the novel concepts. We note that most of the sentences generated by ITS-DESC and ITS-RELA are similar, indicating that the KB-DESC and KB-RELA knowledge bases led to similar learning results for novel concepts.

3) Visualization of Attention: The ablation experiment demonstrated the improvement brought by the attention mechanism. Here, we further show how it helps our model by visualizing the attention transitions on pseudodata.


Fig. 8. Example captions of NOVEL concepts, generated by the NONE, ITS-DESC, and ITS-RELA models, respectively. Since the NONE model has never seen NOVEL concepts in the training data, it fails to recognize any of them; instead, it describes other objects in the image or even confuses them with wrong concepts. By learning with pseudopairs, ITS-DESC and ITS-RELA correctly describe the novel concepts. Words marked in red/green indicate wrong/correct concepts.

Fig. 9. Attention transition on pseudopairs. The brighter regions are those with larger attention weights. For the two examples, the captioning model clearly attends to the regions of interest (bus in the top subfigure and zebra in the bottom subfigure) and ignores the noisy regions (motorcycle and elephant).

In Fig. 9, we plot the attention distribution for each word. The whiter parts of the image correspond to regions with larger attention weights. Take the zebra example in Fig. 9. The regions correlated with elephant become visual noise for the pseudosentence, since only zebra is described after replacement. The visualization clearly shows that the captioning model attends to the zebra regions in $R^n$ and the tree regions in $R^s$, which carry exactly the information needed by the pseudosentence. The noise from elephant is therefore softly removed by the attention mechanism. Our model also shows

good alignment between regions and words. When generating $w_2$ = "zebra," the visual context $v_2$ emphasizes the zebra regions. Though not all pseudoimages show such reasonable attention transitions, these promising examples validate the potential of the attention mechanism to learn from noisy pseudodata.
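
For reference, a heat map like those in Fig. 9 can be approximated by spreading each region's attention weight over its spatial support and normalizing; the sketch below assumes rectangular region boxes and per-step attention weights are already available, which is a simplification for illustration.

```python
import numpy as np

def attention_heatmap(image_hw, region_boxes, att_weights):
    """Spread region-level attention weights over the image plane.
    region_boxes: list of (x0, y0, x1, y1) pixel boxes, one per region.
    att_weights: attention weight of each region at one decoding step."""
    H, W = image_hw
    heat = np.zeros((H, W), dtype=np.float32)
    for (x0, y0, x1, y1), w in zip(region_boxes, att_weights):
        heat[y0:y1, x0:x1] += w          # accumulate weight over the region's area
    if heat.max() > 0:
        heat /= heat.max()               # brighter pixels = larger attention
    return heat

# Example: two regions, with attention focused on the first one
hm = attention_heatmap((224, 224), [(10, 10, 120, 200), (130, 20, 220, 210)], [0.8, 0.2])
```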

VI. CONCLUSION

In this paper, we introduced the ITS method to synthesize pseudodata for image captioning. It leverages information from unpaired data and reuses the syntax of the original paired data. With pseudodata, the captioning model can efficiently learn novel concepts without any human-labeled image-sentence pairs, dramatically decreasing the cost of extending the concept range. The unpaired data come from a knowledge base and a large-scale image base. We construct two knowledge bases: one by POS tagging on a text corpus and another from human-labeled relationships. The image base (i.e., ImageNet) is organized into categories, so a concept can be queried to fetch the corresponding images. We also introduced AVR, which uses an attention mechanism to filter visual noise. The experimental results on a held-out subset of MSCOCO showed that the proposed approach provides significant improvements over state-of-the-art methods in terms of F1 score and sentence quality. Our approach is robust to the amount of pseudodata, showing stable performance within a wide range of hyperparameters. The ablation experiments validated that both the zero initialization of the LSTM and the attention mechanism contribute significantly to the final performance.

ACKNOWLEDGMENT

The authors would like to thank J. Elliott for improving the language of this paper.

REFERENCES

[1] G. Kulkarni et al., "Baby talk: Understanding and generating simple image descriptions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2891–2903, Dec. 2013.

[2] P. Kuznetsova, V. Ordonez, T. L. Berg, and Y. Choi, "TreeTalk: Composition and compression of trees for image descriptions," Trans. Assoc. Comput. Linguistics, vol. 2, no. 10, pp. 351–362, 2014.

[3] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, "Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2321–2334, Dec. 2017.


[4] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain images with multimodal recurrent neural networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015.

[5] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proc. 32nd Int. Conf. Mach. Learn. (ICML), 2015, pp. 2048–2057.

[6] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 2625–2634.

[7] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3128–3137.

[8] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3156–3164.

[9] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 652–663, Apr. 2017.

[10] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars. (2015). "Guiding long-short term memory for image caption generation." [Online]. Available: https://arxiv.org/abs/1509.04942

[11] H. Fang et al., "From captions to visual concepts and back," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1473–1482.

[12] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, "What value do explicit high level concepts have in vision to language problems?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 203–212.

[13] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4651–4659.

[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.

[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[16] J. Mao, X. Wei, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille, "Learning like a child: Fast novel visual concept learning from sentence descriptions of images," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2533–2541.

[17] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, "Deep compositional captioning: Describing novel object categories without paired training data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1–10.

[18] S. Venugopalan, L. A. Hendricks, M. Rohrbach, R. Mooney, T. Darrell, and K. Saenko. (2016). "Captioning images with diverse objects." [Online]. Available: https://arxiv.org/abs/1606.07770

[19] T. Liu, Y. Cui, Q. Yin, S. Wang, W. Zhang, and G. Hu. (2016). "Generating and exploiting large-scale pseudo training data for zero pronoun resolution." [Online]. Available: https://arxiv.org/abs/1606.01603

[20] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. (2015). "Microsoft COCO captions: Data collection and evaluation server." [Online]. Available: https://arxiv.org/abs/1504.00325

[21] C. Leacock and M. Chodorow, "Combining local context and WordNet similarity for word sense identification," WordNet, Electron. Lexical Database, vol. 49, no. 2, pp. 265–283, 1998.

[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–14.

[23] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.

[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.

[25] R. Krishna et al., "Visual genome: Connecting language and vision using crowdsourced dense image annotations," Int. J. Comput. Vis., vol. 123, no. 1, pp. 32–73, May 2017.

[26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. Annu. Meeting Assoc. Comput. Linguistics (ACL), 2002, pp. 311–318.

[27] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. Annu. Meeting Assoc. Comput. Linguistics (ACL) Workshop, 2005, pp. 65–72.

[28] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. Annu. Meeting Assoc. Comput. Linguistics (ACL) Workshop, 2004, pp. 1–8.

[29] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 4566–4575.

[30] K. Toutanova and C. D. Manning, "Enriching the knowledge sources used in a maximum entropy part-of-speech tagger," in Proc. Annu. Meeting Assoc. Comput. Linguistics (ACL), 2000, pp. 63–70.

[31] S. Bird, "NLTK: The natural language toolkit," in Proc. COLING/ACL Interact. Present. Sessions, 2006, pp. 69–72.

[32] R. Al-Rfou et al. (2016). "Theano: A Python framework for fast computation of mathematical expressions." [Online]. Available: https://arxiv.org/abs/1605.02688

[33] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015.

Kun Fu received the B.S. and Ph.D. degrees from the Department of Automation, Tsinghua University, Beijing, China, in 2011 and 2017, respectively.

His current research interests include machine learning, deep learning, and computer vision.

Jin Li received the B.S. degree from the Department of Physics, Tsinghua University, Beijing, China, in 2014, and the M.S. degree from the Department of Automation, Tsinghua University, in 2017.

His current research interests include machine learning and deep learning.

Junqi Jin received the B.S. and Ph.D. degrees from the Department of Automation, Tsinghua University, Beijing, China, in 2011 and 2016, respectively.

His current research interests include machine learning, optimization, and deep learning.

Changshui Zhang (M'02–F'17) received the B.S. degree in mathematics from Peking University, Beijing, China, in 1986, and the M.S. and Ph.D. degrees in control science and engineering from Tsinghua University, Beijing, in 1989 and 1992, respectively.

In 1992, he joined the Department of Automation, Tsinghua University, where he is currently a Professor. His current research interests include pattern recognition and machine learning.