Character-Centric Storytelling

Aditya Surikuchi
Aalto University

[email protected]

Jorma Laaksonen
Aalto University

[email protected]

Abstract

Sequential vision-to-language, or visual storytelling, has recently become an area of focus in the computer vision and language modeling communities. Although existing models generate narratives that read subjectively well, they can fail to produce stories that account for and address all of the prospective human and animal characters in an image sequence. To address this, we propose a model that implicitly learns relationships between the provided characters and thereby generates stories that keep those characters in scope. We use the VIST dataset for this purpose and report several statistics on the dataset. Finally, we describe the model, explain the experiment, and discuss our current status and future work.

1 Introduction

Visual storytelling and album summarization have recently become areas of focus in computer vision and natural language processing. With the advent of new architectures, solutions to problems such as image captioning and language modeling keep improving. It is therefore natural to work towards storytelling, where deeper visual context yields more expressive language, as this could benefit applications that involve visual descriptions and visual question answering (Wiriyathammabhum et al., 2016).

Since the release of the VIST visual storytelling dataset (Huang et al., 2016), numerous approaches have modeled the behavior of stories by leveraging and extending successful sequence-to-sequence image captioning architectures. Some primarily addressed how to incorporate image-sequence feature information into a narrative-generating network (Gonzalez-Rico and Pineda, 2018; Kim et al., 2018), while others focused on model learning patterns and behavioral orientations through changes in back-propagation methods (Wang et al., 2018; Huang et al., 2018). Motivated by these works, we want to understand the importance of characters and their relationships in visual storytelling.

Specifically, we extract characters from the VIST dataset, analyze their influence across the dataset, and exploit them to attend to relevant visual segments during story generation. We report our findings, discuss the directions of our ongoing work, and offer recommendations for using characters as semantics in visual storytelling.

2 Related work

Huang et al. (2016) published the VIST dataset along with a baseline sequence-to-sequence learning model that generates stories for the image sequences in the dataset. Subsequently, as a result of the 2018 storytelling challenge, other works have built on VIST. Most of them extended the encoder-decoder architecture introduced in the baseline publication by adding attention mechanisms (Kim et al., 2018), learning positionally dependent parameters (Gonzalez-Rico and Pineda, 2018), or using reinforcement-learning-based methods (Wang et al., 2018; Huang et al., 2018).

To the best of our knowledge, there are no prior works making use of characters for visual storytelling. The only work that uses any additional semantics for story generation is Huang et al. (2018). They propose a hierarchical model that first generates a “semantic topic” for each image in the sequence and then uses that information during the generation phase. The core module of their hierarchical model is a Semantic Compositional Network (SCN) (Gan et al., 2016), a recurrent neural network variant that generates text conditioned on provided semantic concepts.

Unlike traditional attention mechanisms, the SCN assembles the semantic information directly into the recurrent cell. It achieves this by extending the gate and state weight matrices to adhere to the additional semantic information provided for the language generation phase. Inspired by the results the SCN achieved for image and video captioning, we use it for storytelling. The semantic concepts we use are based on character frequencies and co-occurrence information extracted from the stories of the VIST dataset.

Our expectation is that the parameters of the language decoder network generating the story, being dependent on the character semantics, will learn to capture linguistic patterns while simultaneously learning mappings to the respective visual features of the image sequence.

3 Data

We used the Visual Storytelling (VIST) dataset, comprising image sequences obtained from Flickr albums and respective annotated descriptions collected through Amazon Mechanical Turk (Huang et al., 2016). Each sequence has 5 images with corresponding descriptions that together make up a story. Furthermore, for each Flickr album there are 5 permutations of a selected set of its images. In total, the available data contains 40,071 training, 4,988 validation and 5,050 usable testing stories.

3.1 Character extraction

We extracted characters from the VIST dataset. To this end, we considered a character to be either “a person” or “an animal”. We decided that the best way to do this would be to use the human-annotated text instead of the images, for the sake of diversity (e.g., detection on images would yield “person”, as opposed to “father”).

The extraction takes place as a two-step process:

Identification of nouns: We first used a pretrained part-of-speech tagger (Marcus et al., 1994) to identify all kinds of nouns in the annotations. Specifically, these noun categories are NN (noun, common, singular or mass), NNS (noun, common, plural), NNP (noun, proper, singular) and NNPS (noun, proper, plural).

Filtering for hypernyms: WordNet (Miller, 1995) is a lexical database of the English language containing various semantic relations and synonym sets. A hypernym is one such semantic relation, constituting a category into which words with more specific meanings fall. From among the extracted nouns, we kept those words whose lowest common hypernym is either “person” or “animal”.
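The following is a minimal sketch of this two-step extraction using NLTK's part-of-speech tagger and WordNet interface; the tokenization, the choice of the first noun synset per word, and the helper names are our assumptions rather than the exact pipeline used in the paper.

```python
import nltk
from nltk.corpus import wordnet as wn
# requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')

# Root synsets against which candidate nouns are compared.
PERSON = wn.synset("person.n.01")
ANIMAL = wn.synset("animal.n.01")
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def is_character(word):
    """Return True if the word's first noun sense falls under 'person' or 'animal'."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return False  # unknown to WordNet (misspelling, abbreviation, compound word, ...)
    synset = synsets[0]
    for root in (PERSON, ANIMAL):
        # If the lowest common hypernym of the word and the root is the root
        # itself, the word is a hyponym of that root.
        if root in synset.lowest_common_hypernyms(root):
            return True
    return False

def extract_characters(story_text):
    """Step 1: tag nouns; step 2: keep those that are persons or animals."""
    tokens = nltk.word_tokenize(story_text.lower())
    tagged = nltk.pos_tag(tokens)
    return [word for word, tag in tagged if tag in NOUN_TAGS and is_character(word)]

print(extract_characters("My father took the dog to the beach."))
# expected: ['father', 'dog']
```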

3.2 Character analysis

We analyzed the VIST dataset from the perspective of the extracted characters and observed that 20,405 training, 2,349 validation and 2,768 testing data samples have at least one character present in their stories. This is approximately 50% of the data samples in the entire dataset. To investigate the prominence of relationships between these characters, we analyzed the extractions for both individual and co-occurrence frequencies.
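As an illustration, both statistics can be accumulated over the per-story character lists with plain counters; the function and variable names below are ours, and whether duplicates within a story are counted once is an assumption.

```python
from collections import Counter
from itertools import combinations

def character_statistics(stories_characters):
    """Count individual characters and unordered character pairs per story.

    `stories_characters` holds one list of extracted characters per story,
    e.g. [['mom', 'dad'], ['friends', 'friend'], ...].
    """
    char_freq = Counter()
    pair_freq = Counter()
    for characters in stories_characters:
        unique = sorted(set(characters))
        char_freq.update(unique)
        # Every unordered pair of distinct characters in the same story
        # counts as one co-occurrence.
        pair_freq.update(combinations(unique, 2))
    return char_freq, pair_freq

char_freq, pair_freq = character_statistics([["mom", "dad"], ["friends", "friend"], ["mom"]])
print(char_freq.most_common(2))   # e.g. [('mom', 2), ('dad', 1)]
print(pair_freq.most_common(1))   # e.g. [(('dad', 'mom'), 1)]
```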

Figure 1: Character frequencies (training split)

We found a total of 1,470 distinct characters, with 1,333 in the training, 387 in the validation and 466 in the testing splits. This can be considered an indication of the limited size of the dataset, because the number of distinct characters within each split is strongly dependent on the size of that split.

Figure 1 plots the 30 most frequent characters in the training split of the dataset. Apart from the character “friends”, there is a gradual decrease in the occurrence frequencies of the other characters, from “mom” to “grandmother”. Similarly, in Figure 2, which plots the 30 most frequently co-occurring character pairs, the (“dad”, “mom”) and (“friend”, “friends”) pairs occur drastically more often than the other pairs in the stories. This can lead to an inclination bias of the story generator towards these characters, owing to the data size limitations discussed above.

Figure 2: Character co-occurrence frequencies (training split)

In the process of detecting characters, we also observed that ∼5,000 distinct words failed WordNet lookup due to misspellings (“webxites”), being proper nouns (“cathrine”), being abbreviations (“geez”), or simply being compound words (“sing-a-long”). Though most models ignore these words based on a vocabulary threshold value (typically 3), we note that creating a language model without accounting for these words could adversely affect the behavior of narrative generation.

4 Model

Our model, shown in Figure 3, follows the encoder-decoder structure. The encoder module incorporates the image-sequence features, obtained using a pretrained convolutional network, into a subject vector. The decoder module, a semantically compositional recurrent network (SCN) (Gan et al., 2016), uses the subject vector along with the character probabilities to generate a relevant story.

4.1 Character semantics

The characters relevant to each data sample are obtained in a preprocessing step. We denote the characters extracted from the human-annotated stories of the respective image sequences as active characters. We then use these active characters to obtain other characters that could potentially influence the narrative to be generated. We denote these as passive characters; they can be obtained using various methods, some of which we describe in Section 5. The individual frequencies of these relevant characters, active and passive, are then normalized by the vocabulary size and constitute the character probabilities.
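A minimal sketch of assembling the character probabilities for one data sample is given below; the fixed character vocabulary defining the vector layout, the helper names, and the use of training-split frequencies are our assumptions, while the normalization by vocabulary size follows the description above.

```python
import numpy as np

def character_probabilities(active, passive, char_freq, char_vocab, vocab_size):
    """Build the semantic vector passed to the decoder.

    active / passive : lists of character strings for one image sequence
    char_freq        : Counter of individual character frequencies (training split)
    char_vocab       : list of all distinct characters, fixing the vector layout
    vocab_size       : size of the corpus vocabulary used for normalization
    """
    probs = np.zeros(len(char_vocab), dtype=np.float32)
    index = {c: i for i, c in enumerate(char_vocab)}
    for character in set(active) | set(passive):
        if character in index:
            probs[index[character]] = char_freq[character] / vocab_size
    return probs
```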

Figure 3: The model follows the encoder-decoder structure. Additional character semantics passed to the decoder module regulate its state parameters.

4.2 Encoder

The images of a sequence are initially passed through a pretrained ResNet (He et al., 2015) to obtain their features. The extracted features are then provided to the encoder module, a simple recurrent neural network employed to learn parameters for incorporating the subjects of the individual feature sets into a subject vector.
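A sketch of such an encoder in PyTorch is shown below. The GRU cell, the hidden size, and keeping the ResNet frozen are our assumptions; the paper only specifies a pretrained ResNet followed by a simple recurrent network.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SequenceEncoder(nn.Module):
    """Encode the ResNet features of a 5-image sequence into one subject vector."""

    def __init__(self, feature_dim=2048, hidden_dim=512):
        super().__init__()
        resnet = models.resnet152(pretrained=True)                # pretrained CNN feature extractor
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # drop the classification head
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)

    def forward(self, images):
        # images: (batch, seq_len=5, 3, 224, 224)
        b, t = images.shape[:2]
        with torch.no_grad():                                     # CNN kept frozen in this sketch
            feats = self.cnn(images.flatten(0, 1)).flatten(1)     # (b*t, 2048)
        feats = feats.view(b, t, -1)
        _, subject = self.rnn(feats)                              # final hidden state = subject vector
        return subject.squeeze(0)                                 # (batch, hidden_dim)
```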

4.3 Decoder

We use the SCN-LSTM variant of the recurrent neural network for the decoder module, as shown in Figure 4. The network extends each weight matrix of the conventional LSTM to an ensemble of tag-dependent weight matrices, conditioned on the character probabilities. The subject vector from the encoder is fed into the LSTM to initialize the first step. The LSTM parameters used during decoding are weighted by the character probabilities to generate the respective story.

Gradients ∇, propagated back to the network, nudge the parameters W to learn while adhering to the respective character probabilities \vec{c}_p:

\Delta(W_{\text{gates, states}} \mid \vec{c}_p) = \alpha \cdot \nabla_{\text{gates, states}}    (1)

Consequently, the encoder parameters move towards incorporating the image-sequence features better.

Figure 4: (Gan et al., 2016), v and s denote the visual and semantic features, respectively. Each triangle symbol represents an ensemble of tag-dependent weight matrices.
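To make the weight-ensemble idea concrete, the sketch below shows one SCN-style composed weight in PyTorch, following the factorized form of Gan et al. (2016) in which each weight matrix is composed as W(s) = W_a · diag(W_b s) · W_c, with s the semantic (character-probability) vector. The dimensions, initialization, and the single-weight simplification are our assumptions rather than the exact decoder implementation.

```python
import torch
import torch.nn as nn

class SCNWeight(nn.Module):
    """One semantically composed transform: its effective weight depends on the tag vector s."""

    def __init__(self, input_dim, semantic_dim, hidden_dim, factor_dim=256):
        super().__init__()
        # Factorized ensemble of tag-dependent weight matrices: W(s) = W_a @ diag(W_b s) @ W_c
        self.W_a = nn.Parameter(torch.randn(hidden_dim, factor_dim) * 0.01)
        self.W_b = nn.Parameter(torch.randn(factor_dim, semantic_dim) * 0.01)
        self.W_c = nn.Parameter(torch.randn(factor_dim, input_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x, s):
        # x: (batch, input_dim) token or visual input, s: (batch, semantic_dim) character probabilities
        xc = x @ self.W_c.t()                     # (batch, factor_dim), i.e. W_c x
        sb = s @ self.W_b.t()                     # (batch, factor_dim), i.e. W_b s
        mixed = xc * sb                           # elementwise product = diag(W_b s) applied to W_c x
        return mixed @ self.W_a.t() + self.bias   # (batch, hidden_dim), i.e. W(s) x
```

In the full SCN-LSTM, each gate uses such compositions for both the input-to-hidden and hidden-to-hidden weights, so the character probabilities modulate every state transition of the decoder.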

5 Experiments

We report the current status of our work and the intended directions of progress we wish to make using the designed model. All experiments were performed on the VIST dataset.

As mentioned in Section 4.1, passive characters can be selected by conditioning their relationships on several factors. We explain two such methods:

5.1 Method 1

In the first method, we naïvely select all the characters co-occurring with the respective active characters. The probabilities for these passive characters are then their co-occurrence counts normalized by the corpus vocabulary size. This method enables the model to learn parameters over the distribution of character relationships.
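A minimal sketch of this naïve selection, assuming the pair-frequency counter from Section 3.2 and a known corpus vocabulary size; the function signature is ours.

```python
def passive_characters_naive(active, pair_freq, vocab_size):
    """Select every character that co-occurs with any active character.

    Returns {passive_character: probability}, where the probability is the
    accumulated co-occurrence count normalized by the corpus vocabulary size.
    """
    passives = {}
    for (a, b), count in pair_freq.items():
        for active_char, other in ((a, b), (b, a)):
            if active_char in active and other not in active:
                passives[other] = passives.get(other, 0.0) + count / vocab_size
    return passives
```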

5.2 Method 2

In the second approach, we conditionally select a limited number of K characters that collectively co-occur most with the respective active characters. This is visualized in Figure 5: the selected passive characters “girlfriend”, “father” and “son” appear most often among the characters co-occurring with the active characters. K is a tunable hyperparameter.

Figure 5: Conditional on collective co-occurrences
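A sketch of this top-K selection under the same assumptions as in Method 1; scoring passives by their summed co-occurrence counts with the active set is our reading of “collectively co-occur most”.

```python
from collections import Counter

def passive_characters_topk(active, pair_freq, K=3):
    """Select the K characters that collectively co-occur most with the active characters."""
    scores = Counter()
    for (a, b), count in pair_freq.items():
        if a in active and b not in active:
            scores[b] += count
        elif b in active and a not in active:
            scores[a] += count
    return [character for character, _ in scores.most_common(K)]
```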

6 Discussion

Both methods we are experimenting with exhibit different initial traits. We are currently working towards analyzing the character relationships learned by the models and understanding the abstract concepts that emerge as a result of such learning. We do not yet report generated stories or evaluations, as we consider that premature without proper examination. However, we find the training-process metrics encouraging, and they give us enough intuition to pursue the proposed approach to its fullest scope.

7 Conclusion

We have extracted, analyzed and exploited characters in the realm of storytelling using the VIST dataset. We have presented a model that can make use of the extracted characters to learn their relationships and thereby generate grounded and subjective narratives for the respective image sequences. For future work, we would like to make the encoder semantically compositional by extracting visual tags, and to explore ways to improve the learning of character relationships while avoiding overfitting.

References

Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2016. Semantic compositional networks for visual captioning. CoRR, abs/1611.08002.

Diana Gonzalez-Rico and Gibran Fuentes Pineda. 2018. Contextualize, show and tell: A neural visual storyteller. CoRR, abs/1806.00738.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.

Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Oliver Wu, Jianfeng Wang, and Xiaodong He. 2018. Hierarchically structured reinforcement learning for topically coherent visual story generation. CoRR, abs/1805.08191.

Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross B. Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. CoRR, abs/1604.03968.

Taehyeong Kim, Min-Oh Heo, Seonil Son, Kyoung-Wha Park, and Byoung-Tak Zhang. 2018. GLAC Net: GLocal attention cascading networks for multi-image cued story generation. CoRR, abs/1805.10973.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT '94, pages 114–119, Stroudsburg, PA, USA. Association for Computational Linguistics.

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.

Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. CoRR, abs/1804.09160.

Peratham Wiriyathammabhum, Douglas Summers-Stay, Cornelia Fermuller, and Yiannis Aloimonos. 2016. Computer vision and natural language processing: Recent approaches in multimedia and robotics. ACM Comput. Surv., 49(4):71:1–71:44.