
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 273–279, Brussels, Belgium, October 31, 2018. ©2018 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17


EmojiGAN: learning emojis distributions with a generative model

Bogdan Mazoure∗
Department of Mathematics & Statistics
McGill University

Thang Doan∗
Desautels Faculty of Management
McGill University

Saibal Ray
Desautels Faculty of Management
McGill University

Abstract

Generative models have recently experienced a surge in popularity due to the development of more efficient training algorithms and increasing computational power. Models such as generative adversarial networks (GANs) have been successfully used in various areas such as computer vision, medical imaging, style transfer and natural language generation. Adversarial nets were recently shown to yield results in the image-to-text task, where, given a set of images, one has to provide their corresponding text descriptions. In this paper, we take a similar approach and propose an image-to-emoji architecture, which is trained on data from social networks and can be used to score a given picture using ideograms. We show empirical results of our algorithm on data obtained from the most influential Instagram accounts.

1 Introduction

The spike in the amount of user-generated visual and textual data shared on social platforms such as Facebook, Twitter, Instagram, Pinterest and many others luckily coincides with the development of efficient deep learning algorithms (Perozzi et al., 2014; Pennacchiotti and Popescu, 2011; Goyal et al., 2010). As humans, we can not only share our ideas and thoughts through any imaginable media, but also use social networks to analyze and understand complex interpersonal relations. Researchers have access to a rich set of metadata (Krizhevsky, 2012; Liu et al., 2015) on which various computer vision (CV) and natural language processing (NLP) algorithms can be trained. For instance, recent work in the area of image captioning aims to provide a short description (i.e., a caption) of a much larger document or image (Dai et al., 2017; You et al., 2016; Pu et al., 2016).

∗These authors contributed equally.

Such methods excel at conveying the dominant idea of the input. We, on the other hand, use ideograms, also known as emojis or pictographs, as a natural amalgam between the annotation and summarization tasks. Note that, in this work, we use the terms emoji, ideogram and pictograph interchangeably to represent the intersection of these three domains. Ideograms bridge the textual and visual spaces by representing groups of words with a concise illustration. They can be seen as surrogate functions which convey, up to a degree of accuracy, the reactions of social media users. Furthermore, because each emoji has a corresponding text description, there is a direct mapping from ideograms onto the word space.

In this paper, we model the distribution of emojis conditioned on an image with a deep generative model. We use generative adversarial networks (GANs) (Goodfellow et al., 2014), which are notoriously harder to train than other distributional models such as variational auto-encoders (VAEs) (Kingma and Welling, 2013) but tend to produce sharper results on computer vision tasks.

2 Related Work and Motivation

Since the release of word2vec by Mikolov and colleagues in 2013 (Mikolov et al., 2013), vector representations of language entities have become more popular than traditional encodings such as bag-of-words (BOW) or n-grams (NG). Because word2vec operations preserve the original semantic meaning of words, concepts like word similarity and synonymy are well-defined in the new space and correspond to the closest neighbors of a point according to some metric. The aforementioned word representation was followed by doc2vec (Le and Mikolov, 2014).


Originally, doc2vec was meant to efficiently encode collections of words as a whole. However, since empirical results suggest similar performance for both algorithms, researchers tend to opt for the simpler and more interpretable word2vec model. One of the most recent and most interesting vector embeddings is emoji2vec (Eisner et al., 2016). It consists of more than 1,600 symbol-vector pairs, each associating a Unicode character with a real 300-dimensional vector.

The abundance of pictographs such as emojis on social communication platforms suggests that word-only analyses are limited in their scope to capture the full scale of interactions between individuals. The emojis' biggest advantage is their universality: no information is lost due to faulty translations, mistyped characters or even slang words. In fact, emojis were designed to be more concise and expressive than words. They have, however, been shown to suffer from varying interpretations, which depend on factors such as whether the pictograph is viewed on an iPhone or a Google Pixel (Miller et al., 2016). This in turn implies that the subject of conversation highly impacts the choice of media (text or emoji) picked by the user (Kelly and Watts, 2015). Reducing a whole medium such as a public post or an advertisement image to a single emoji would almost certainly mean losing the richness of the information, which is why we suggest instead modeling visual media as a conditional distribution over the emojis that users employ to score the image.

Deep neural models have previously been used to analyse pictographic data: (Cappallo et al., 2015) used them to assign the most likely emoji to a picture, (Felbo et al., 2017) predicted the prevalent emotion of a sentence and (Zhao and Zeng, 2017) used recurrent neural networks (RNNs) to predict the emoji which best describes a given sentence. We build on top of this work to propose EmojiGAN, a model meant to generate realistic emojis based on an image. Since we are interested in modeling a distribution over image-emoji tuples, it is reasonable to represent it using generative adversarial networks, which have been shown to successfully model distributions over both text and images. For example, a GAN can be coupled with RNNs in order to generate realistic images based on an input sentence (Reed et al., 2016). We train our algorithm on emoji-picture pairs obtained from various advertisement posts on Instagram.

A practical application of our method is to analyze the effects of product advertisement on Instagram users. Previous works attempted to predict the popularity of Instagram posts by using surrogate signals such as the number of likes or followers (Almgren et al., 2016; De et al., 2017). Others used social media data in order to model the popularity of fashion industry icons (Park et al., 2016). A thorough inspection of clothing styles around the world has also been conducted (Matzen et al., 2017).

3 Proposed Approach

3.1 Generative Adversarial Networks

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have recently gained huge popularity as a black-box unsupervised method of learning some target distribution. Taking roots in game theory, their training process is framed as a two-player zero-sum game where a generator network G tries to fool a discriminator network D by producing samples closely mimicking the distribution of interest. In this work, we use Wasserstein-GAN (Arjovsky et al., 2017), a variant of the original GAN which uses the Wasserstein metric in order to avoid problems such as mode collapse. The generator and the discriminator are gradually improved through either alternating or simultaneous gradient descent minimization of the loss function defined as:

$$\min_G \max_D \; \mathbb{E}_{x \sim f_X(x)}[D(x)] + \mathbb{E}_{x \sim G(z)}[-D(x)] + p(\lambda), \qquad (1)$$

where $p(\lambda) = \lambda(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert - 1)^2$, $\hat{x} = \varepsilon x + (1-\varepsilon)G(Z)$, $\varepsilon \sim \mathrm{Uniform}(0,1)$, and $Z \sim f_Z(z)$. This gradient-penalized loss (Gulrajani et al., 2017) is now widely used to enforce the Lipschitz continuity constraint. Note that setting $\lambda = 0$ recovers the original WGAN objective.
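To make the penalty term concrete, the following is a minimal PyTorch-style sketch of p(λ), assuming a generic critic module and batched tensors; it illustrates the interpolation and gradient-norm computation of Eq. (1) rather than the exact implementation used in our experiments.

    import torch

    def gradient_penalty(critic, real, fake, lam=10.0):
        # epsilon ~ Uniform(0, 1), one draw per sample, broadcast over the remaining dims
        eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
        x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
        # gradient of the critic output with respect to the interpolated points
        grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
        grad_norm = grad.view(grad.size(0), -1).norm(2, dim=1)
        # p(lambda) = lambda * (||grad D(x_hat)|| - 1)^2, averaged over the batch
        return lam * ((grad_norm - 1.0) ** 2).mean()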

3.2 Choice of embedding

Multiple embeddings have been proposed to encode language entities such as words, ideograms, sentences and even documents. A more recent successor of word2vec, emoji2vec, aims to encode groups of words represented by visual symbols (i.e., ideograms or emojis).


This representation is a fine-tuned version of word2vec, which was trained on roughly 1,600 emojis to output a 300-dimensional real-valued vector. We experimented with both word2vec and emoji2vec by encoding each emoji as the sum of the word2vec representations of its textual description. We observed that both the word2vec and emoji2vec embeddings yielded only a mild amount of similarity for most emojis. Moreover, dealing with groups of words requires designing a recurrent layer in the architecture, which can be cumbersome and yield suboptimal results, as opposed to restricting the generator network to only Unicode characters. Bearing this in mind, we decided to use the emoji2vec embedding in all of our experiments.
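As an illustration of the word2vec baseline we compared against, the following sketch embeds an emoji as the sum of the word vectors of its textual description; the pretrained vector file and the example description are placeholders rather than artifacts of our pipeline.

    import numpy as np
    from gensim.models import KeyedVectors

    # Load pretrained 300-dimensional word vectors (the file name is illustrative).
    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def embed_emoji(description: str) -> np.ndarray:
        """Embed an emoji as the sum of the word2vec vectors of its description tokens."""
        tokens = [t for t in description.lower().split() if t in w2v]
        if not tokens:
            return np.zeros(w2v.vector_size)
        return np.sum([w2v[t] for t in tokens], axis=0)

    # e.g. the Unicode name of U+1F602
    vector = embed_emoji("face with tears of joy")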

3.3 Learning a skewed distribution

Just like in text analysis, some emojis (mostly emotions such as love, laughter, sadness) occur more frequently than domain-specific pictographs (for example, country flags). The distribution over emojis is hence highly skewed and multimodal. Since such imbalance can lead to a considerable reduction in variance, also known as mode collapse, we propose to re-weight each backward pass with coefficients obtained through either of the following schemes:

• term frequency-inverse document frequency (tf-idf) weights, a classical approach used in natural language processing (Salton and Buckley, 1988);

• exponentially-smoothed raw frequencies:

$$w_s(e) = \frac{\exp(-k \times \mathrm{freq}(e))}{\sum_{i=1}^{N} \exp(-k \times \mathrm{freq}(e_i))} \quad \forall e,\; k \geq 0 \qquad (2)$$

where $k$ is a smoothing constant, $\mathrm{freq}(e) = \mathrm{count}(e)/N$ is the frequency of emoji $e$, and $N$ is the total number of emojis.
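A small sketch of the second weighting scheme is given below, under the reading of Eq. (2) in which the normalization runs over the distinct symbols observed in the corpus; the function name and the value of k are illustrative.

    from collections import Counter
    import numpy as np

    def exp_smoothed_weights(emoji_corpus, k=5.0):
        # Exponentially-smoothed frequency weights of Eq. (2).
        counts = Counter(emoji_corpus)
        n = len(emoji_corpus)                          # total number of emoji occurrences N
        freq = {e: c / n for e, c in counts.items()}   # freq(e) = count(e) / N
        raw = {e: np.exp(-k * f) for e, f in freq.items()}
        z = sum(raw.values())                          # denominator of Eq. (2)
        return {e: v / z for e, v in raw.items()}

    # Frequent emojis receive smaller weights, rare ones larger weights.
    weights = exp_smoothed_weights(["❤", "❤", "❤", "😂", "🍍"])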

3.3.1 Algorithm

Our method relies on the conditional version of WGAN-GP, which accepts fixed-size (64×64×3) RGB image tensors. Our approach is presented in Algorithm 1, shown below:

Algorithm 1 Conditional Wasserstein GAN
Input: tuples of emojis and images (X, Y), the gradient penalty coefficient λ, the number of critic iterations per generator iteration n_critic, the number of generator iterations n_gen, the batch size m, the learning rate lr and the weight vector w.
Initialization: initialize generator parameters θ_G0 and critic parameters θ_D0.

for epoch = 1, ..., N do
  {update the discriminator}
  for t = 1, ..., n_critic do
    Sample {x_i}_{i=1}^m ∼ X, {y_i}_{i=1}^m ∼ Y, {z_i}_{i=1}^m ∼ N(0, 1), {ε_i}_{i=1}^m ∼ U[0, 1]
    x̂_i ← ε_i x_i + (1 − ε_i) G(z_i | y_i)
    L^(i) ← D(G(z_i | y_i)) − D(x_i | y_i) + λ(‖∇_{x̂_i} D(x̂_i | y_i)‖ − 1)²
    θ_D ← Adam(∇_{θ_D} Σ_{i=1}^m w_i L^(i), lr)
  end for
  {update the generator}
  for t = 1, ..., n_gen do
    Sample a batch {z_i}_{i=1}^m ∼ N(0, 1)
    θ_G ← Adam(−∇_{θ_G} Σ_{i=1}^m w_i L^(i), lr)
  end for
end for
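The sketch below shows one re-weighted critic pass and one generator pass of Algorithm 1 in PyTorch-style code, with the conditioning image y fed to both networks; module interfaces, the latent dimension and variable names are assumptions, and the gradient penalty is computed inline as in the earlier sketch.

    import torch

    def training_step(generator, critic, g_opt, d_opt, x, y, w, n_critic=5, lam=10.0, z_dim=100):
        # x: real emoji2vec vectors (m, 300); y: images (m, 3, 64, 64); w: per-sample weights (m,)
        for _ in range(n_critic):                                  # critic updates
            z = torch.randn(x.size(0), z_dim, device=x.device)
            fake = generator(z, y).detach()                        # G(z | y)
            eps = torch.rand(x.size(0), 1, device=x.device)
            x_hat = (eps * x + (1 - eps) * fake).requires_grad_(True)
            grad = torch.autograd.grad(critic(x_hat, y).sum(), x_hat, create_graph=True)[0]
            gp = lam * ((grad.norm(2, dim=1) - 1) ** 2)
            loss_i = critic(fake, y).squeeze(1) - critic(x, y).squeeze(1) + gp   # L^(i)
            d_opt.zero_grad()
            (w * loss_i).sum().backward()                          # re-weighted backward pass
            d_opt.step()
        z = torch.randn(x.size(0), z_dim, device=x.device)         # generator update
        g_loss_i = -critic(generator(z, y), y).squeeze(1)
        g_opt.zero_grad()
        (w * g_loss_i).sum().backward()
        g_opt.step()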

4 Experiments

4.1 Data collection

We used the (soon to be deprecated) Instagram API to collect posts from top influencers within the following categories: fashion, fitness, health and weight loss; we believe that user data across those domains share similar patterns. Here, influencers are defined as accounts with the highest combined count of followers, posts and user reactions; 166 influencers were selected from various ranking lists put together by Forbes and Iconosquare. The final dataset has 80,000 (image, pictograph) tuples and covers a total of 753 distinct symbols.

4.2 Architecture

Inspired by (Reed et al., 2016), we performed experiments using the following architecture: the generator has 4 convolutional layers with kernels of size 4 which output a 4×4 feature matrix, followed by a fully connected layer; the discriminator is identical to G but outputs a scalar score instead of a 300-dimensional vector. The structure of both D and G is shown in Fig. 1.
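As a rough illustration only, the following PyTorch sketch matches the description above: four convolutional layers with kernel size 4 reduce a 64×64×3 image to a 4×4 feature map followed by a fully connected layer; the generator maps the image encoding plus noise to a 300-dimensional emoji vector, while the discriminator consumes an (emoji vector, image) pair and returns a scalar score. Channel widths, the noise dimension and the way conditioning is injected are illustrative choices rather than the exact configuration used in our experiments.

    import torch
    import torch.nn as nn

    class ImageEncoder(nn.Module):
        # Four conv layers (kernel 4, stride 2): 64x64x3 -> 4x4 feature map -> fully connected.
        def __init__(self, out_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),   # 32 -> 16
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 16 -> 8
                nn.Conv2d(128, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2), # 8 -> 4
            )
            self.fc = nn.Linear(128 * 4 * 4, out_dim)

        def forward(self, img):
            return self.fc(self.conv(img).flatten(1))

    class Generator(nn.Module):
        def __init__(self, z_dim=100, emb_dim=300):
            super().__init__()
            self.enc = ImageEncoder()
            self.fc = nn.Linear(256 + z_dim, emb_dim)   # noise + image encoding -> emoji vector

        def forward(self, z, img):
            return self.fc(torch.cat([self.enc(img), z], dim=1))

    class Discriminator(nn.Module):
        def __init__(self, emb_dim=300):
            super().__init__()
            self.enc = ImageEncoder()
            self.fc = nn.Linear(256 + emb_dim, 1)       # scalar critic score

        def forward(self, emoji_vec, img):
            return self.fc(torch.cat([self.enc(img), emoji_vec], dim=1))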


Figure 1: Illustration of how EmojiGAN learns a distribution. The generator learns the conditional distribution of emojis given a set of pictures while the discriminator assigns a score to each generated emoji.

5 Results

A series of experiments were conducted on the data collected from Instagram. The best architecture was selected through cross-validation and a hyperparameter grid search, and has been discussed above. The training process used minibatch alternating gradient descent with the popular Adam optimizer (Kingma and Ba, 2014), with a learning rate lr = 0.0001 and β1 = 0.1, β2 = 0.9. We trained both G and D until convergence, after approximately 10 epochs. Empirically, we saw that the exponentially-smoothed raw frequency weights (Eq. 2) performed better than the tf-idf weights.

In order to assess how closely the generator network approximates the true data distribution, we first sampled 750 images and obtained their respective emoji distributions by performing 50 forward passes through G. The mode, that is, the most frequent observation in the sample, of the resulting distribution is considered the most representative pictograph for the given image. We used t-SNE on the image tensor in order to visualize both the image and the emoji spaces (see Fig. 2). The purpose of this experiment was to assess whether two entities close to each other in the image space also yield similar emojis.
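For reference, here is a short sketch of this evaluation step, assuming the generator maps (noise, image) pairs to 300-dimensional vectors and that each generated vector is snapped to its nearest emoji2vec symbol by cosine similarity; the nearest-neighbour step and the names are illustrative.

    from collections import Counter
    import torch
    import torch.nn.functional as F

    def most_frequent_emoji(generator, img, emoji_vocab, emoji_vectors, n_samples=50, z_dim=100):
        # img: (3, 64, 64) tensor; emoji_vectors: (V, 300); emoji_vocab: list of V symbols
        with torch.no_grad():
            z = torch.randn(n_samples, z_dim)
            samples = generator(z, img.expand(n_samples, *img.shape))            # (n_samples, 300)
            sims = F.normalize(samples, dim=1) @ F.normalize(emoji_vectors, dim=1).t()
            nearest = sims.argmax(dim=1)                                          # nearest symbol per sample
        # the mode of the sample is taken as the most representative pictograph
        return Counter(emoji_vocab[i] for i in nearest.tolist()).most_common(1)[0][0]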

Figure 2: Visualization of t-SNE-reduced images and their corresponding most frequent pictographs (emojis). The most popular emoji for each picture was obtained by sampling 50 observations from the generator and taking the mode of the sample. Note that even this technique has a stochastic outcome, meaning that if an image has a rather flat distribution, its mode will not be consistent across runs. The described behaviour can be observed in the upper right area of both space representations.

The top right corner of both clouds exposes a shortcoming of the algorithm: if the distribution is flat (i.e., multimodal), even large samples will yield different modes just by chance. This phenomenon is clearly present throughout the cloud of pictographs: four identical images yield three distinct emojis. On the other hand, the two remaining examples correctly capture the presence of two people in a single photo (middle section), as well as an expression of amazement (bottom section). The performance of generative models is difficult to assess numerically, especially when it comes to emojis. Indeed, the Fréchet Inception Distance (Heusel et al., 2017) is often used to score generated images but, to the best of our knowledge, no such measure exists for ideograms. As an alternative way to assess the performance of our method, we plotted the true and generated distributions over 30 randomly chosen emojis for 1000 random images (see Fig. 3). While our algorithm relied on raw (i.e., uncleaned and unprocessed) data, we still observe a reasonable match between both distributions.

Fig. 4 reports the fitted distributions of the top 10 most frequent observations for three randomly sampled images.


Figure 3: True and fitted distributions over 30 randomly sampled emojis for 500 randomly sampled images. Probabilities are normalized by the maximal element of the set.

The top image represents a fashion model in an outfit; our model correctly captures the concepts of woman, love, and overall positive emotion in the image. However, EmojiGAN can struggle with filtering out unrealistic emojis (in this case, pineapple and pig nose) for images with very few distinct ideograms. The bottom subfigure outlines another very common problem seen in GANs: mode collapse. While the generated emoji fits the context of the image, the variance in this case is nearly zero, resulting in G learning a Dirac distribution at the most frequent observation.

Figure 4: Emojis sampled for some Instagram posts: observe the mode collapse in the bottom subfigure as opposed to the more equally spread out distributions.

The middle image also suffers from the above problems (the sunset pictograph dominates the distribution). We note how algorithms based on unfiltered data from social networks are prone to ethical fallacies, as illustrated by the middle image. This situation is reminiscent of the infamous Microsoft chatbot Tay, which started to pick up racist and sexist language after being trained on uncensored tweets and had to be shut down (Neff and Nagy, 2016). We ourselves experienced similar behaviour when assessing the performance of EmojiGAN. One plausible explanation of this phenomenon is that while derogatory comments are quite rare, the introduction of exponential weights or similar scores in the hope of preventing mode collapse to the most popular emoji has the side effect of overfitting the least frequent pictographs.

6 Conclusion and Discussion

In this work, we proposed a new way of modeling social media posts through a generative adversarial network over pictographs. EmojiGAN managed to learn the emoji distribution for a set of given images and to generate realistic pictographic representations from a picture. While the issue of noisy predictions remains, our approach can be used as an alternative to classical image annotation methods. Using a modified attention mechanism (Xu et al., 2015) would be a stepping stone to correctly modeling the context-dependent connotations (Jibril and Abdullah, 2013) of emojis. However, the biggest concern is of an ethical nature: training any algorithm on raw data obtained from social networks, without filtering offensive and derogatory ideas, is itself a subject of debate (Islam et al., 2016; Davidson et al., 2017).

Future work on the topic should start with a thorough analysis of the algebraic properties of emoji2vec, similar to (Arora et al., 2016). For example, new Unicode formats support emoji composition, which is reminiscent of the behaviour of traditional word embeddings and could be explicitly incorporated into a learning algorithm. Finally, the ethical concerns behind deep learning without limits are not specific to our algorithm but are rather a community-wide discourse. It is thus important to work together with AI safety research groups in order to ensure that the novel methods developed by researchers learn our better side.


References

Khaled Almgren, Jeongkyu Lee, et al. 2016. Predicting the future popularity of images on social networks. In Proceedings of the 3rd Multidisciplinary International Social Networks Conference on SocialInformatics 2016, Data Science 2016, page 15. ACM.

Martín Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. CoRR, abs/1701.07875.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. Linear algebraic structure of word senses, with applications to polysemy. arXiv preprint arXiv:1601.03764.

Spencer Cappallo, Thomas Mensink, and Cees G.M. Snoek. 2015. Image2emoji: Zero-shot emoji prediction for visual media. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 1311–1314, New York, NY, USA. ACM.

Bo Dai, Dahua Lin, Raquel Urtasun, and Sanja Fidler. 2017. Towards diverse and natural image descriptions via a conditional GAN. CoRR, abs/1703.06029.

Thomas Davidson, Dana Warmsley, Michael W. Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. CoRR, abs/1703.04009.

Shaunak De, Abhishek Maity, Vritti Goel, Sanjay Shitole, and Avik Bhattacharya. 2017. Predicting the popularity of Instagram posts for a lifestyle magazine using deep learning.

Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. 2016. emoji2vec: Learning emoji representations from their description. CoRR, abs/1609.08359.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. CoRR, abs/1708.00524.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial networks. CoRR, abs/1406.2661.

Amit Goyal, Francesco Bonchi, and Laks V.S. Lakshmanan. 2010. Learning influence probabilities in social networks. In Proceedings of the third ACM international conference on Web search and data mining, pages 241–250. ACM.

Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. 2017. Improved training of Wasserstein GANs. CoRR, abs/1704.00028.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a Nash equilibrium. CoRR, abs/1706.08500.

Aylin Caliskan Islam, Joanna J. Bryson, and Arvind Narayanan. 2016. Semantics derived automatically from language corpora necessarily contain human biases. CoRR, abs/1608.07187.

Tanimu Ahmed Jibril and Mardziah Hayati Abdullah. 2013. Relevance of emoticons in computer-mediated communication contexts: An overview. Asian Social Science, 9(4):201.

Ryan Kelly and Leon Watts. 2015. Characterising the inventive appropriation of emoji as relationally meaningful in mediated close personal relationships. Experiences of Technology Appropriation: Unanticipated Users, Usage, Circumstances, and Design.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Alex Krizhevsky. 2012. Learning multiple layers of features from tiny images.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).

Kevin Matzen, Kavita Bala, and Noah Snavely. 2017. StreetStyle: Exploring world-wide clothing styles from millions of photos. CoRR, abs/1706.01869.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Hannah Miller, Jacob Thebault-Spieker, Shuo Chang, Isaac Johnson, Loren Terveen, and Brent Hecht. 2016. Blissfully happy or ready to fight: Varying interpretations of emoji. Proceedings of ICWSM, 2016.

Gina Neff and Peter Nagy. 2016. Automation, algorithms, and politics—talking to bots: Symbiotic agency and the case of Tay. International Journal of Communication, 10:17.


Jaehyuk Park, Giovanni Luca Ciampaglia, and Emilio Ferrara. 2016. Style in the age of Instagram: Predicting success within the fashion industry using social media. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 64–73. ACM.

Marco Pennacchiotti and Ana-Maria Popescu. 2011. A machine learning approach to Twitter user classification. ICWSM, 11(1):281–288.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM.

Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. 2016. Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pages 2352–2360.

Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. CoRR, abs/1605.05396.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651–4659.

Luda Zhao and Connie Zeng. 2017. Using neural networks to predict emoji usage from Twitter data.