
Punny Captions: Witty Wordplay in Image Descriptions

Arjun Chandrasekaran¹  Devi Parikh¹,²  Mohit Bansal³
¹Georgia Institute of Technology  ²Facebook AI Research  ³UNC Chapel Hill

{carjun, parikh}@gatech.edu [email protected]

Abstract

Wit is a form of rich interaction that is often grounded in a specific situation (e.g., a comment in response to an event). In this work, we attempt to build computational models that can produce witty descriptions for a given image. Inspired by a cognitive account of humor appreciation, we employ linguistic wordplay, specifically puns, in image descriptions. We develop two approaches which involve retrieving witty descriptions for a given image from a large corpus of sentences, or generating them via an encoder-decoder neural network architecture. We compare our approach against meaningful baseline approaches via human studies and show substantial improvements. We find that when a human is subject to similar constraints as the model regarding word usage and style, people vote the image descriptions generated by our model to be slightly wittier than human-written witty descriptions. Unsurprisingly, humans are almost always wittier than the model when they are free to choose the vocabulary, style, etc.

1 Introduction

“Wit is the sudden marriage of ideas which before their union were not perceived to have any relation.”

– Mark Twain

Witty remarks are often contextual, i.e., grounded in a specific situation. Developing computational models that can emulate rich forms of interaction like contextual humor is a crucial step towards making human-AI interaction more natural and more engaging (Yu et al., 2016). E.g., witty chatbots could help relieve stress and increase user engagement by being more personable and human-like. Bots could automatically post witty comments (or suggest witty responses) on social media, chat, or messaging.

(a) Generated: a poll (pole) on a city street at night.
Retrieved: the light knight (night) chuckled.
Human: the knight (night) in shining armor drove away.

(b) Generated: a bare (bear) black bear walking through a forest.
Retrieved: another reporter is standing in a bare (bear) brown field.
Human: the bear killed the lion with its bare (bear) hands.

Figure 1: Sample images and witty descriptions from 2 models, and a human. The words inside ‘()’ (e.g., pole and bear) are the puns associated with the image, i.e., the source of the unexpected puns used in the caption (e.g., poll and bare).

The absence of large-scale corpora of witty captions and the prohibitive cost of collecting such a dataset (being witty is harder than just describing an image) make the problem of producing contextually witty image descriptions challenging.

In this work, we attempt to tackle the challenging task of producing witty (pun-based) remarks for a given (possibly boring) image. Our approach is inspired by a two-stage cognitive account of humor appreciation (Suls, 1972), which states that a perceiver experiences humor when a stimulus such as a joke, captioned cartoon, etc., causes an incongruity, which is shortly followed by resolution.

We introduce an incongruity in the perceiver’s mind while describing an image by using an unexpected word that is phonetically similar (a pun) to a concept related to the image. E.g., in Fig. 1b, the expectations of a perceiver regarding the image (bear, stones, etc.) are momentarily disconfirmed by the (phonetically similar) word ‘bare’.



This incongruity is resolved when the perceiver parses the entire image description. The incongruity followed by resolution can be perceived to be witty.1

We build two computational models based on this approach to produce witty descriptions for an image. First, a model that retrieves sentences containing a pun that are relevant to the image from a large corpus of stories (Zhu et al., 2015). Second, a model that generates witty descriptions for an image using a modified inference procedure during image captioning which includes the specified pun word in the description.

Our paper makes the following contributions: To the best of our knowledge, this is the first work that tackles the challenging problem of producing a witty natural language remark in an everyday (boring) context. We present two novel models to produce witty (pun-based) captions for a novel (likely boring) image. Our models rely on linguistic wordplay: they use an unexpected pun in an image description during inference/retrieval, and thus do not require training with witty captions. Humans vote the top-ranked generated captions ‘wittier’ than three baseline approaches. Moreover, in a Turing test-style evaluation, our model’s best image description is found to be wittier than a witty human-written caption2 55% of the time when the human is subject to the same constraints as the machine regarding word usage and style.

2 Related Work

Humor theory. The General Theory of Verbal Humor (Attardo and Raskin, 1991) characterizes linguistic stimuli that induce humor, but implementing computational models of it requires severely restricting its assumptions (Binsted, 1996).
Puns. Zwicky and Zwicky (1986) classify puns as perfect (pronounced exactly the same) or imperfect (pronounced differently). Similarly, Pepicello and Green (1984) categorize riddles based on the linguistic ambiguity that they exploit: phonological, morphological, or syntactic. Kao et al. (2013) formalize the notion of incongruity in puns and use a probabilistic model to evaluate the funniness of a sentence. Jaech et al. (2016) learn phone-edit distances to predict the counterpart, given a pun, by drawing from automatic speech recognition techniques.

1 Indeed, a perceiver may fail to appreciate wit if the process of ‘solving’ (resolution) is trivial (the joke is obvious) or too complex (they do not ‘get’ the joke).

2 This data is available on the author’s webpage.

In contrast, we augment a web-scraped list of puns using an existing model of pronunciation similarity.
Generating textual humor. JAPE (Binsted and Ritchie, 1997) also uses phonological ambiguity to generate pun-based riddles. While our task involves producing free-form responses to a novel stimulus, JAPE produces stand-alone “canned” jokes. HAHAcronym (Stock and Strapparava, 2005) generates a funny expansion of a given acronym. Unlike our work, HAHAcronym operates on text, and is limited to producing sets of words. Petrovic and Matthews (2013) develop an unsupervised model that produces jokes of the form, “I like my X like I like my Y, Z”.
Generating multi-modal humor. Wang and Wen (2015) predict a meme’s text based on a given funny image. Similarly, Shahaf et al. (2015) and Radev et al. (2015) learn to rank cartoon captions based on their funniness. Unlike the typical, boring images in our task, memes and cartoons are images that are already funny or atypical, e.g., “LOL-cats” (funny cat photos), “Bieber-memes” (modified pictures of Justin Bieber), cartoons with talking animals, etc. Chandrasekaran et al. (2016) alter an abstract scene to make it more funny. In comparison, our task is to generate witty natural language remarks for a novel image.
Poetry generation. Although our tasks are different, our generation approach is conceptually similar to that of Ghazvininejad et al. (2016), who produce poetry given a topic. While they also generate and score a set of candidates, their approach involves many more constraints and utilizes a finite state acceptor, unlike our approach, which enforces constraints during beam search of the RNN decoder.

3 Approach

Extracting tags. The first step in producing a contextually witty remark is to identify concepts that are relevant to the context (image). At times, these concepts are directly available, e.g., as tags posted on social media. We consider the general case where such tags are unavailable, and automatically extract tags associated with an image.

We extract the top-5 object categories predicted by a state-of-the-art Inception-ResNet-v2 model (Szegedy et al., 2017) trained for image classification on ImageNet (Deng et al., 2009). We also consider the words from a (boring) image description (generated from Vinyals et al. (2016)).


Figure 2: Our models for generating and retrieving image descriptions containing a pun (see Sec. 3).

We combine the classifier object labels and words from the caption (ignoring stopwords) to produce a set of tags associated with an image, as shown in Fig. 2. We then identify concepts from this collection that can potentially induce wit.
Identifying puns. We attempt to induce an incongruity by using a pun in the image description. We identify candidate words for linguistic wordplay by comparing image tags against a list of puns.
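For illustration, a minimal Python sketch of the tag-extraction step above. The functions classify_top5 and generate_caption are hypothetical placeholders standing in for the Inception-ResNet-v2 classifier and the captioning model; their outputs and the stopword list are illustrative, not the paper's actual interfaces.

```python
from typing import List, Set

# Hypothetical stand-ins for the two pretrained models described above.
def classify_top5(image) -> List[str]:
    return ["brown bear", "bear", "forest", "rock", "water"]   # placeholder output

def generate_caption(image) -> str:
    return "a black bear walking through a forest"             # placeholder output

STOPWORDS = {"a", "an", "the", "of", "in", "on", "through", "is", "at", "with"}

def extract_tags(image) -> Set[str]:
    """Union of top-5 classifier labels and non-stopword caption words."""
    labels = {w for label in classify_top5(image) for w in label.split()}
    caption_words = {w for w in generate_caption(image).lower().split()
                     if w not in STOPWORDS}
    return labels | caption_words

print(sorted(extract_tags(None)))
```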

We construct the list of puns by mining the web for differently spelled words that sound exactly the same (heterographic homophones). We increase coverage by also considering pairs of words with 0 edit-distance, according to a metric based on fine-grained articulatory representations (AR) of word pronunciations (Jyothi and Livescu, 2014). Our list of puns has a total of 1067 unique words (931 from the web and 136 from the AR-based model).
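The pun list is then matched against image tags to form the pun vocabulary described next. A minimal sketch, where the tiny PUN_LIST is illustrative (the actual list has 1067 words):

```python
# Illustrative subset: pun word -> phonologically identical counterparts.
PUN_LIST = {
    "bear": ["bare"], "bare": ["bear"],
    "night": ["knight"], "knight": ["night"],
    "cell": ["sell"], "sell": ["cell"],
}

def pun_vocabulary(tags):
    """Map each image tag that appears in the pun list to its counterparts."""
    return {tag: PUN_LIST[tag] for tag in tags if tag in PUN_LIST}

print(pun_vocabulary({"bear", "forest", "walking"}))   # {'bear': ['bare']}
```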

The pun list yields a set of puns that are associated with a given image and their phonologically identical counterparts, which together form the pun vocabulary for the image. We evaluate our approach on the subset of images that have non-empty pun vocabularies (about 2 in 5 images).
Generating punny image captions. We introduce an incongruity by forcing a vanilla image captioning model (Vinyals et al., 2016) to decode a phonological counterpart of a pun word associated with the image, at a specific time-step during inference (e.g., ‘sell’ or ‘sighed’, shown in orange in Fig. 2). We achieve this by limiting the vocabulary of the decoder at that time-step to only contain counterparts of image-puns. In following time-steps, the decoder generates new words conditioned on all previously decoded words. Thus, the decoder attempts to generate sentences that flow well based on previously uttered words.
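The vocabulary restriction can be sketched as a masking operation on the decoder's next-word distribution at the forced time-step. This is a minimal illustration under assumptions, not the paper's implementation: the logits are random, the vocabulary is toy-sized, and a real decoder would apply the same mask inside beam search.

```python
import numpy as np

VOCAB = ["a", "bare", "bear", "black", "forest", "walking", "water"]
COUNTERPART_IDS = [VOCAB.index("bare")]   # counterparts of this image's puns

def step_distribution(logits, t, pun_step):
    """Softmax over the vocabulary; at the forced step, mask everything
    except pun counterparts and renormalize."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if t == pun_step:
        mask = np.zeros_like(probs)
        mask[COUNTERPART_IDS] = 1.0
        probs *= mask
        probs /= probs.sum()
    return probs

rng = np.random.default_rng(0)
print(step_distribution(rng.normal(size=len(VOCAB)), t=0, pun_step=0))
# all probability mass falls on 'bare'
```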

We train two models that decode an image description in forward (start to end) and reverse (end to start) directions, depicted as ‘fRNN’ and ‘rRNN’ in Fig. 2 respectively. The fRNN can decode words after accounting for the incongruity that occurs early in the sentence, and the rRNN is able to decode the early words in the sentence after accounting for the incongruity that can occur later. The forward RNN and reverse RNN generate sentences in which the pun appears in each of the first T and last T positions, respectively.3
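Per footnote 3, each decoder is run once for every forced pun position, so the candidate pool enumerates 5 (positions) × 6 (beam size) × 2 (directions) = 60 captions. A schematic sketch, with beam_search as a hypothetical stand-in for the constrained decoding above:

```python
def beam_search(direction, pun_position, beam_size=6):
    # Placeholder: a real implementation would decode `beam_size` captions
    # with a pun counterpart forced at `pun_position`.
    return [f"{direction}-T{pun_position}-beam{b}" for b in range(beam_size)]

candidates = [
    caption
    for direction in ("forward", "reverse")
    for T in range(1, 6)              # pun forced at positions 1..5
    for caption in beam_search(direction, T)
]
print(len(candidates))                # 60
```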

Retrieving punny image captions. As an alternative to our approach of generating witty remarks for the given image, we also attempt to leverage natural, human-written sentences which are relevant (yet unexpected) in the given context. Concretely, we retrieve natural language sentences4 from a combination of the Book Corpus (Zhu et al., 2015) and corpora from the NLTK toolkit (Loper and Bird, 2002). Each retrieved sentence (a) contains an incongruity (pun) whose counterpart is associated with the image, and (b) has support in the image (contains an image tag). This yields a pool of candidate captions that are perfectly grammatical, a little unexpected, and somewhat relevant to the image (see Sec. D).
Ranking. We rank captions in the candidate pools from both the generation and retrieval models according to their log-probability score under the image captioning model. We observe that the higher-ranked descriptions are more relevant to the image and grammatically correct. We then perform non-maximal suppression, i.e., eliminate captions that are similar5 to a higher-ranked caption, to reduce the pool to a smaller, more diverse set.

3 For an image, we choose T = {1, 2, ..., 5} and beam size = 6 for each decoder. This generates a pool of 5 (T) × 6 (beam size) × 2 (forward + reverse decoder) = 60 candidates.

4 To prevent the context of the sentence from distracting the perceiver, we consider sentences with < 15 words. Overall, we are left with a corpus of about 13.5 million sentences.

5 Two sentences are similar if the cosine similarity between the average of the Word2Vec (Mikolov et al., 2013) representations of words in each sentence is ≥ 0.8.
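The retrieval step above reduces to a simple filter over the corpus: keep short sentences that contain a counterpart of an image pun (the incongruity) and at least one image tag (support in the image). A minimal sketch with a three-sentence corpus borrowed from Fig. 2:

```python
CORPUS = [
    "you sell phones ?",
    "i have decided to sell the group to you .",
    "the street signs read thirty-eighth and eighth avenue .",
]

def retrieve(corpus, counterparts, tags, max_len=15):
    """Sentences under max_len words containing a pun counterpart and a tag."""
    kept = []
    for sentence in corpus:
        words = sentence.split()
        if (len(words) < max_len
                and any(c in words for c in counterparts)
                and any(t in words for t in tags)):
            kept.append(sentence)
    return kept

print(retrieve(CORPUS, counterparts={"sell"}, tags={"phones", "people"}))
# ['you sell phones ?'] -- the other sentences lack a tag or a counterpart
```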


We report results on the 3 top-ranked captions. We describe the effect of design choices in the supplementary.
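For concreteness, a sketch of the ranking and suppression step. Scores here are given directly (the paper uses the captioning model's log-probability), and Word2Vec is replaced by random per-word vectors, so only the mechanics are faithful: sort by score, then drop any caption whose mean-word-embedding cosine similarity to an already-kept caption is ≥ 0.8 (footnote 5).

```python
import numpy as np

rng = np.random.default_rng(0)
WORD_VECS = {}   # stand-in for Word2Vec lookups

def embed(sentence):
    vecs = [WORD_VECS.setdefault(w, rng.normal(size=50)) for w in sentence.split()]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_and_suppress(scored_captions, threshold=0.8):
    """Keep captions in score order, skipping near-duplicates of kept ones."""
    kept = []
    for caption, _ in sorted(scored_captions, key=lambda x: -x[1]):
        if all(cosine(embed(caption), embed(k)) < threshold for k in kept):
            kept.append(caption)
    return kept

pool = [("a bare black bear walking", -4.1),
        ("a bare black bear walking slowly", -4.5),   # near-duplicate
        ("a bear stands in a forest", -5.0)]
print(rank_and_suppress(pool))   # drops the near-duplicate
```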

4 Results

Data. We evaluate witty captions from our approach via human studies. 100 random images (having associated puns) are sampled from the validation set of COCO (Lin et al., 2014).
Baselines. We compare the wittiness of descriptions generated by our model against 3 qualitatively different baselines and a human-written witty description of an image. Each of these evaluates a different component of our approach. Regular inference generates a fluent caption that is relevant to the image but does not attempt to be witty. Witty mismatch is a human-written witty caption, but for a different image from the one being evaluated. This baseline results in a caption that is intended to be witty, but does not attempt to be relevant to the image. Ambiguous is a ‘punny’ caption where a pun word in the boring (regular) caption is replaced by its counterpart. This caption is likely to contain content that is relevant to the image, and it contains a pun. However, the pun is not used in a fluent manner.

We evaluate the image-relevance of the top witty caption by comparing against a boring machine caption and a random caption (see supplementary).
Evaluating annotations. Our task is to generate captions that a layperson might find witty. To evaluate performance on this task, we ask people on Amazon Mechanical Turk (AMT) to vote for the wittier among a given pair of captions for an image. We collect annotations from 9 unique workers for each relative choice and take the majority vote as ground truth. For each image, we compare each of the 3 top-ranked and 1 low-ranked generated captions against 3 baseline captions and 1 human-written witty caption.6

Constrained human-written witty captions. We evaluate the ability of humans and automatic methods to use the given context and pun words to produce a caption that is perceived as witty.


6 This results in a total of 4 (captions) × 2 (generation + retrieval) × 4 (baselines + human caption) = 32 comparisons of our approach against baselines. We also compare the wittiness of the 4 generated captions against the 4 retrieved captions (see supplementary) for an image (16 comparisons). In total, we perform 48 comparisons per image, for 100 images.

Figure 3: Wittiness of top-3 generated captions vs. other approaches. The y-axis measures the % of images for which at least one of K captions from our approach is rated wittier than other approaches. Recall steadily increases with the number of generated captions (K).

We ask subjects on AMT to describe a given image in a witty manner. To prevent observable structural differences between machine and human-written captions, we ensure consistent pun vocabulary (utilization of pre-specified puns for a given image). We also ask people to avoid first-person accounts or quoting characters in the image.
Metric. In Fig. 3, we report performance of the generation approach using the Recall@K metric. For K = 1, 2, 3, we plot the percentage of images for which at least one of the K ‘best’ descriptions from our model outperformed another approach.
Generated captions vs. baselines. As we see in Fig. 3, the top generated image description (top-1G) is perceived as wittier compared to all baseline approaches more often than not (the vote is >50% at K = 1). We observe that as K increases, the recall steadily increases, i.e., when we consider the top-K generated captions, increasingly often, humans find at least one of them to be wittier than captions produced by baseline approaches. People find the top-1G for a given image to be wittier than mismatched human-written image captions about 95% of the time. The top-1G is also wittier than a naive approach that introduces ambiguity about 54.2% of the time. When compared to a typical, boring caption, the generated captions are wittier 68% of the time. Further, in a head-to-head comparison, the generated captions are wittier than the retrieved captions 67.7% of the time. We also validate our choice of ranking captions based on the image captioning model score: we observe that a ‘bad’ caption, i.e., one ranked lower by our model, is significantly less witty than the top 3 output captions.
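The Recall@K numbers in Fig. 3 (and Fig. 5) reduce to a simple computation over the pairwise votes. A sketch with an illustrative vote matrix, where votes[i][k] is True if the k-th ranked caption for image i was judged wittier than the competing caption:

```python
def recall_at_k(votes, k):
    """Fraction of images where at least one of the top-k captions won."""
    return sum(any(image_votes[:k]) for image_votes in votes) / len(votes)

votes = [
    [True,  False, False],   # image 1: the top caption already wins
    [False, True,  False],   # image 2: only the second caption wins
    [False, False, False],   # image 3: no caption wins
]
for k in (1, 2, 3):
    print(k, recall_at_k(votes, k))   # recall is non-decreasing in k
```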



(a) Generated: a bored (board) bench sits in front of a window.
Retrieved: Wedge sits on the bench opposite Berry, bored (board).
Human: could you please make your pleas (please)!

(b) Generated: a loop (loupe) of flowers in a glass vase.
Retrieved: the flour (flower) inside teemed with worms.
Human: piece required for peace (piece).

(c) Generated: a woman sell (cell) her cell phone in a city.
Retrieved: Wright (right) slammed down the phone.
Human: a woman sighed (side) as she regretted the sell.

(d) Generated: a bear that is bare (bear) in the water.
Retrieved: water glistened off her bare (bear) breast.
Human: you won’t hear a creak (creek) when the bear is feasting.

(e) Generated: a loop (loupe) of scissors and a pair of scissors.
Retrieved: i continued slicing my pear (pair) on the cutting board.
Human: the scissors were near, but not clothes (close).

(f) Generated: a female tennis player caught (court) in mid swing.
Retrieved: i caught (court) thieves on the roof top.
Human: the man made a loud bawl (ball) when she threw the ball.

(g) Generated: a bored (board) living room with a large window.
Retrieved: anya sat on the couch, feeling bored (board).
Human: the sealing (ceiling) on the envelope resembled that in the ceiling.

(h) Generated: a parking meter with rode (road) in the background.
Retrieved: smoke speaker sighed (side).
Human: a nitting of color didn’t make the poll (pole) less black.

Figure 4: The top row contains selected examples of human-written witty captions, and witty captions generated and retrieved from our models. The examples in the bottom row are randomly picked.

Surprisingly, when the human is constrained to use the same words and style as the model, the generated descriptions from the model are found to be wittier for 55% of the images. Note that in a Turing test, a machine would equal human performance at 50%.7 This led us to speculate whether the constraints placed on language and style might be restricting people’s ability to be witty. We confirmed this by evaluating free-form human captions.
Free-form Human-written Witty Captions. We ask people on AMT to describe an image (using any vocabulary) in a manner that would be perceived as funny. As expected, when compared against automatic captions from our approach, human evaluators find free-form human captions to be wittier about 90% of the time, compared to 45% in the case of constrained human witty captions. Clearly, human-level creative language with unconstrained sentence length, style, choice of puns, etc., makes a significant difference in the wittiness of a description.

7 Recall that this compares how a witty description is constructed, given the image and specific pun words. A Turing test-style evaluation that compares the overall wittiness of a machine and a human would refrain from constraining the human in any way.

In contrast, our automatic approach is constrained by caption-like language, length, and a word-based pun list. Training models to intelligently navigate this creative freedom is an exciting open challenge.
Qualitative analysis. The generated witty captions exhibit interesting features like alliteration (‘a bare black bear ...’) in Fig. 1b and 6c. At times, both the original pun (pole) and its counterpart (poll) make sense for the image (Fig. 1a). Occasionally, a pun is naively replaced by its counterpart (Fig. 6b) or rare puns are used (Fig. 6d). On the other hand, some descriptions (Fig. 4e and 4h) that are forced to utilize puns do not make sense. See the supplementary for an analysis of the retrieval model.

5 Conclusion

We presented novel computational models inspired by cognitive accounts to address the challenging task of producing contextually witty descriptions for a given image. We evaluate the models via human studies, in which they significantly outperform meaningful baseline approaches.



Acknowledgements

We thank Shubham Toshniwal for his advice regarding the automatic speech recognition model. This work was supported in part by: an NSF CAREER award, an ONR YIP award, ONR Grant N00014-14-1-2713, a PGA Family Foundation award, Google FRA, Amazon ARA, a DARPA XAI grant to DP and NVIDIA GPU donations, Google FRA, IBM Faculty Award, and a Bloomberg Data Science Research Grant to MB.

References

Salvatore Attardo and Victor Raskin. 1991. Script theory revis(it)ed: Joke similarity and joke representation model. Humor: International Journal of Humor Research 4(3-4):293–348.

Kim Binsted. 1996. Machine humour: An implemented model of puns. PhD Thesis, University of Edinburgh.

Kim Binsted and Graeme Ritchie. 1997. Computational rules for generating punning riddles. Humor: International Journal of Humor Research.

Arjun Chandrasekaran, Ashwin Kalyan, Stanislaw Antol, Mohit Bansal, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2016. We are humor beings: Understanding and predicting visual humor. In CVPR.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR 2009. IEEE, pages 248–255.

Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. 2016. Generating topical poetry. In EMNLP.

Aaron Jaech, Rik Koncel-Kedziorski, and Mari Ostendorf. 2016. Phonological pun-derstanding. In HLT-NAACL.

Preethi Jyothi and Karen Livescu. 2014. Revisiting word neighborhoods for speech recognition. ACL 2014, page 1.

Justine T Kao, Roger Levy, and Noah D Goodman. 2013. The funny thing about incongruity: A computational model of humor in puns. In Proceedings of the Annual Meeting of the Cognitive Science Society.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, pages 740–755.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, ETMTNLP ’02.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

William J Pepicello and Thomas A Green. 1984. Language of Riddles: New Perspectives. The Ohio State University Press.

Sasa Petrovic and David Matthews. 2013. Unsupervised joke generation from big data. In ACL.

Dragomir Radev, Amanda Stent, Joel Tetreault, Aasish Pappu, Aikaterini Iliakopoulou, Agustin Chanfreau, Paloma de Juan, Jordi Vallmitjana, Alejandro Jaimes, Rahul Jha, et al. 2015. Humor in collective discourse: Unsupervised funniness detection in the New Yorker cartoon caption contest. arXiv preprint arXiv:1506.08126.

Dafna Shahaf, Eric Horvitz, and Robert Mankoff. 2015. Inside jokes: Identifying humorous cartoon captions. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pages 1065–1074.

Oliviero Stock and Carlo Strapparava. 2005. HAHAcronym: A computational humor system. In ACL.

Jerry M Suls. 1972. A two-stage model for the appreciation of jokes and cartoons: An information-processing analysis. The Psychology of Humor: Theoretical Perspectives and Empirical Issues.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI. pages 4278–4284.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence.

William Yang Wang and Miaomiao Wen. 2015. I can has cheezburger? A nonparanormal approach to combining textual and visual information for predicting and generating popular meme descriptions. In NAACL.

Zhou Yu, Leah Nicolich-Henkin, Alan W Black, and Alex I Rudnicky. 2016. A wizard-of-oz study on a non-task-oriented dialog system that reacts to user engagement. In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. page 55.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision. pages 19–27.

Arnold Zwicky and Elizabeth Zwicky. 1986. Imperfect puns, markedness, and phonological similarity: With fronds like these, who needs anemones. Folia Linguistica 20(3-4):493–503.

Appendix

We first present details and results for the additional experiments that evaluate the relevance of the top generated caption from the model to the image, and compare the retrieved captions to baseline approaches. We then briefly discuss the rationales for the design choices that we made in our approach. Subsequently, we briefly describe some characteristics of pun words and follow it up with a brief qualitative analysis of the output from the retrieval model. Finally, we present the annotation and evaluation interfaces that the human subjects interacted with.

A Additional Experiments

A.1 Relevance of witty caption to image

We compared the relative relevance of the top witty caption from our generation approach against a machine-generated boring caption (either for the same image or for a different, randomly chosen image) in a pairwise comparison. We showed Turkers an image and a pair of captions, and asked them to choose the more relevant caption for the image. We see that on average, the generated witty caption is considered more relevant than a machine-generated boring caption for the same image 37.5% of the time. People found the generated witty caption to be more relevant than a random caption 97.2% of the time. This shows that in an effort to generate witty content, our approach produces descriptions that are a little less relevant compared to a boring description of the image. But our witty caption is clearly still relevant to the image (almost always more relevant than an unrelated caption).

A.2 Retrieved captions vs. baselines

Humans evaluate the wittiness of each of the 3 top-ranked retrieved captions against baseline approaches and a human witty caption.

Figure 5: Comparison of wittiness of the top 3 captions from our retrieval approach vs. other approaches. The y-axis measures the % of images for which at least one of K captions from our approach is rated wittier than other approaches. As we increase the number of retrieved captions (K), recall steadily increases.

As we see in Fig. 5, at K = 1, the only baseline that the top retrieved description beats is the human-written witty caption that is mismatched with the given image (witty mismatch), which it does 83.8% of the time. The top retrieved caption is found less witty than even a typical caption (regular inference) about 63.4% of the time. Similarly, the retrieved caption is also found to be less witty than a naive method that produces punny captions (ambiguous) about 62% of the time. We observe the trend that as K increases, recall also increases. On average, at least one of the top 3 retrieved captions is wittier than the (constrained) human witty caption about 61.6% of the time, compared to generated captions, which are wittier 84.0% of the time.

The poor performance of retrieved captions could be due to the fact that they are often not perfectly apt for the given image, since they are retrieved from story-based corpora. While these captions might evoke a sense of incongruity, it is likely hard for the viewer to resolve the alternate interpretation of the retrieved caption as being applicable to the image. Please see Sec. D for examples and a more detailed discussion. These issues do not extend to the generation approach, which exhibits strong performance against baseline approaches, human-written witty captions, and the retrieval approach.


B Design choices

In this section, we describe how our architecture design and parameter choices influence witty descriptions. During the design of our model, we made parameter choices based on observations from qualitative results. For instance, we experimented with different beam sizes to generate a set of high-precision captions with few false positives. We found that a beam size of 6 resulted in a sufficient number of candidate sentences which were reasonably accurate. We extract image tags from the top-K predictions of an image classifier. We experimented with different values of K, where K ∈ {1, 5, 10}. We also tried using a score threshold, where classes predicted with a score above the threshold were considered valid image tags. We found that K = 5 results in reasonable predictions. Determining a reasonable threshold, on the other hand, was difficult because for most images, class prediction scores are extremely peaky. We also experimented with the different positions at which a pun counterpart can be forced to appear. Based on qualitative examples, we found that the model generated witty descriptions that were somewhat sensible when a pun word appeared at any of the first or last 5 positions of a sentence. We also experimented with a number of different methods to re-rank the candidate pool of witty captions, e.g., language model score (Jozefowicz et al., 2016), image-sentence similarity score (Kiros et al., 2014), semantic similarity (using Word2Vec (Mikolov et al., 2013)) of the pun counterpart to the sentence, a priori probability of the pun counterpart in a large corpus of English sentences to avoid rare/unfamiliar words, and likelihood of the tag (under the image captioning model or the classifier, as applicable). We qualitatively found that re-ranking using the log-probability score of the image captioning model, while being the simplest, resulted in the best set of candidate witty captions.

C Pun List

Recall that we construct a list of puns by mining the web and based on automatic methods that measure the similarity of pronunciation of words. Upon inspecting our list of puns, we observe that it contains puns of many frequently used words and some pun words that are rarely used in everyday language, e.g., ‘wight’ (which is the counterpart of ‘white’).

Since a rare pun word can be distracting to a perceiver, the corresponding caption might be harder to resolve, making it less likely to be perceived as witty. Thus, we see limited benefit in increasing the size of our pun list further to include words that are used even less frequently.

D Qualitative analysis of retrieved descriptions

The witty descriptions are retrieved from story-based corpora and often describe a very specific situation or instance. Although these sentences are grounded in objects that are also present in the image, the sentence as a whole often contains a few words that are irrelevant to the given image, as we see in Fig. 6b, Fig. 6d and Fig. 6e. This is a likely reason why a retrieved sentence containing a pun is perceived as less witty than witty descriptions generated for the image.

E Interface for ‘Be Witty!’

We ask people on Amazon Mechanical Turk (AMT) to create witty descriptions for the given image. We also ask them to utilize one of the given pun words associated with the image. We show them a few good and bad examples to illustrate the task better. Fig. 7 shows the interface that we used to collect these human-written witty descriptions for an image.

F Interface for ‘Which is wittier?’

We showed people on AMT two descriptions for a given image and asked them to click on the description that was wittier for the image. The web interface that we used to collect this data is shown in Fig. 8.

References

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.


(a) Generated: a read (red) stop sign on a street corner.
Retrieved: the sign read (red) :
Human: i sighed (side) when we got off the main street.

(b) Generated: a bored (board) bench sits in front of a window.
Retrieved: Wedge sits on the bench opposite Berry, bored (board).
Human: could you please make your pleas (please)!

(c) Generated: a man wearing a loop (loupe) and a bow tie.
Retrieved: meanwhile the man’s head turned back to beau (bow) with his eyes wide.
Human: he could be my beau (bow) with that style.

(d) Generated: a loop (loupe) of flowers in a glass vase.
Retrieved: the flour (flower) inside teemed with worms.
Human: piece required for peace (piece).

(e) Generated: a female tennis player caught (court) in mid swing.
Retrieved: my shirt caught (court) fire.
Human: the woman’s hand caught (court) in the center.

(f) Generated: broccoli and meet (meat) on a plate with a fork.
Retrieved: “you mean white folk (fork)”.
Human: the folk (fork) enjoyed the food with a fork.

Figure 6: Sample images and witty descriptions from our generation model, retrieval model, and a human. The puns (counterparts) used in captions (a) to (f) are read/sighed, bored, loop/beau, loop/flour/peace, caught, and meet/folk respectively. The word in the parenthesis following each counterpart is the pun associated with the image. It is provided as a reference to the source of the unexpected pun used in the caption.


[Screenshot: the interface instructs workers to “Write a witty sentence about the image containing one of the puns listed beside the image” (e.g., waul (wall), wight (white), stile (style), poll (pole), sine (sign)). Good examples (e.g., “Emotional wedding where the cake is in tiers.”) and bad examples (e.g., first-person accounts, quotes from characters in the picture, or captions that only make sense for the original word rather than the pun) illustrate the task.]

Figure 7: AMT web interface for the ‘Be Witty!’ task.


[Screenshot: the interface shows an image and two captions, e.g., “caught (court) a tennis player hitting the ball.” and “the teddy bear were bare (bear)”, along with the pun the system was going for, and asks workers to pick the caption that is wittier for the image, even if both seem only slightly witty.]

Figure 8: AMT web interface for the ‘Which is wittier?’ task.