Towards Automatic Animated Storyboarding
Patrick Ye and Timothy Baldwin
Computer Science and Software Engineering, University of Melbourne, Australia
NICTA Victoria Laboratories, University of Melbourne, Australia
Abstract

In this paper, we propose a machine learning-based NLP system for automatically creating animated storyboards using the action descriptions of movie scripts. We focus particularly on the importance of verb semantics when generating graphics commands, and find that semantic role labelling boosts performance and is relatively robust to the effects of unseen verbs.
Introduction

Animated storyboards are computer animations created based on movie scripts, and used as crude previews by directors and actors during the pre-production of movies. Creating non-trivial animated storyboards is a time- and labour-intensive process, and requires a level of technical expertise that most people do not have. In this research, we propose to automate the process of animated storyboarding using a variety of NLP technologies, potentially saving time and money, and also providing a dynamic, visual environment for script creation and fine-tuning.
The creation of an animated storyboard can be described as a two-step process. The first step is to construct a static virtual stage with virtual actors and props to approximate the scene to be shot. The second step is to create the interactions between the virtual actors and props, to visualise the events depicted by the action descriptions of the movie scripts. This research is focused on the second step, as it is more labour-intensive and technically challenging than the first.
There are three major differences between existing NLP-aided animation systems and our system. Firstly, most existing systems use handcrafted rules to map the results of language analysis onto graphics commands, whereas our system uses machine learning to perform this task automatically. Secondly, existing systems were designed for domain-specific tasks with a controlled vocabulary and syntax, whereas our system is open-domain, with no restrictions on the language used other than that the input text is in the style of 3rd person narration. Thirdly, existing systems are all coupled with customised graphics systems designed specifically for their respective tasks, whereas our system is designed to interface with any graphics system that offers access through a programming language style interface.1
Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 We used a free animation system called Blender3D (www.blender.org) in our experiments.
Since the main purpose of animated storyboards is to visualise the actions described in the scripts, and the actions are mostly expressed in verb phrases, the linguistic analysis in our system is focused on verb semantics. Several NLP techniques are used to perform this analysis: POS tagging is used to identify the verbs, and semantic role labelling (SRL) (Gildea and Jurafsky 2002) is used to identify the semantic roles of each verb. Furthermore, our linguistic analysis also makes use of lexical resources such as WordNet (Fellbaum 1998).
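As a concrete illustration of the verb-centred analysis described above, the sketch below shows the kind of PropBank-style frame an SRL component might attach to a sentence from the corpus. The frame structure and role labels are our own simplified rendering for exposition, not the output of any particular tool:

```python
# Minimal sketch of the analysis pipeline's output: POS tagging picks
# out the verb, and SRL attaches labelled argument spans to it
# (PropBank-style ARG0/ARG1 labels).

def srl_frame(verb, **roles):
    """Bundle a verb with its labelled semantic roles."""
    return {"verb": verb, "roles": roles}

# Hand-built frame for "Mr. Potato Head kisses the coins."
frame = srl_frame("kiss", ARG0="Mr. Potato Head", ARG1="the coins")

print(frame["verb"])           # the action to visualise
print(frame["roles"]["ARG0"])  # the agent -> grounded to a virtual actor
print(frame["roles"]["ARG1"])  # the patient -> grounded to a virtual prop
```

Each such frame is the unit that downstream processing maps onto graphics commands.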
The main contribution of this research is the proposal of a novel multimodal domain of application for NLP, and the integration of NLP techniques with extra-linguistic and multimodal information defined by the virtual stage. We are especially interested in the usefulness of SRL in our task, since it is an intuitively useful NLP technique that has rarely found practical application. We are also interested in how well our system can handle verbs not seen in the training data, since rule-based systems often cannot handle such verbs.
Related Work

Most existing literature on applying NLP techniques to computer animation is limited to domain-specific systems where a customised computer graphics module is driven by a simplistic handcrafted NLP module which extracts features from input text. The pioneering system to support natural language commands to a virtual world was SHRDLU (Winograd 1972), which used handcrafted rules to interpret natural language commands for manipulating simple geometric objects in a customised block world. The K3 project (Funakoshi, Tokunaga, and Tanaka 2006) uses handcrafted rules to interpret Japanese voice commands to control virtual avatars in a customised 3D environment. Carsim (Johansson et al. 2004; 2005) uses a customised semantic role labeller to extract car crash information from Swedish newspaper text and visually recreate the crash using a customised graphics system.
While existing systems have been successful in their own respective domains, this general approach is inappropriate for the task of animated storyboarding because the input language and the domain (i.e. all movie genres) are too broad. Instead, we: (a) use a general-purpose graphics system with a general set of graphics commands not tuned to any one domain; and (b) use automated NLP and machine learning techniques to perform language analysis on the input scripts and control the graphics system.
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)
Our work is also similar to machine translation (MT) in the sense that we aim to translate a source language into a target language. The major difference is that for MT the target language is a human language, whereas for our work the target language is a programming language. Programming languages are much more structured than natural languages, meaning that there is zero tolerance for disfluencies in the target language, as they result in the graphics system being unable to render the scene. Also, the lack of any means of underspecification or ambiguity in the graphics commands means that additional analysis must be performed on the source language to fully disambiguate it.
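The "zero tolerance" property of the target language can be made concrete with a small validator: before a command sequence reaches the renderer, every command must match a known signature exactly. The command names and arities below are invented for illustration; they are not Blender3D's actual API:

```python
# Hypothetical command inventory: name -> required argument count.
# A real system would also type-check arguments (actor vs. position).
COMMANDS = {"move_to": 2, "grasp": 2, "release": 1, "turn_to": 2}

def validate(program):
    """Reject any command sequence the graphics system could not render."""
    for name, args in program:
        if name not in COMMANDS or len(args) != COMMANDS[name]:
            return False
    return True

ok = validate([("move_to", ("woody", "chair")), ("grasp", ("woody", "hat"))])
bad = validate([("grasp", ("woody",))])  # missing argument: unrenderable
print(ok, bad)
```

Unlike a human reader of a slightly disfluent translation, the renderer has no capacity to recover from a malformed program, which is why validation must be all-or-nothing.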
Overview

In this section, we describe our movie script corpus, then provide a detailed discussion of the virtual stages and how we use them in our experiments, and finally outline the main computational challenges in automatic animated storyboarding.
Movie Script Corpus

Movie scripts generally contain three types of text: the scene description, the dialogue, and the action description. In this research, the only section of the script used for language analysis is the acting instructions of the action descriptions, as in (1).2 That is, we only visualise the physical movements of the virtual actors/props.
(1) Andy's hand lowers a ceramic piggy bank in front of Mr. Potato Head and shakes out a pile of coins to the floor. Mr. Potato Head kisses the coins.
In real movies, dialogues are of course accompanied by body and facial movements. Such movements are almost never specified explicitly in the movie scripts, and require artistic interpretation on the part of the actual actors. They are therefore outside the scope of this research.
We animated around 95% of the Toy Story script using 21 base graphics commands. The script was split up according to its original scene structure. For each annotated scene, we first constructed a virtual stage that resembles the corresponding scene in the original movie. We then performed semantic role labelling of that scene's acting instructions, determined the correct order for visualisation, annotated every visualisable verb with a set of graphics commands, and finally annotated the correct grounding of each argument. In total, 1264 instances of 307 verbs were annotated, with an average of 2.84 base graphics commands per verb token.
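The per-verb annotations can be pictured as records like the following. The field names and command names are our own sketch; the figures quoted in the text (1264 instances of 307 verbs, 2.84 base commands per token) are averages over the full set of such records:

```python
# Sketch of two annotated verb instances from a scene: the verb token,
# the base graphics commands it was annotated with, and the grounding
# of each semantic role to an object on the virtual stage.
annotations = [
    {"verb": "lower",
     "commands": ["move_to", "turn_to"],
     "grounding": {"ARG0": "andy_hand", "ARG1": "piggy_bank"}},
    {"verb": "kiss",
     "commands": ["move_to", "turn_to", "touch"],
     "grounding": {"ARG0": "mr_potato_head", "ARG1": "coins"}},
]

# Average base commands per verb token over this toy sample.
avg = sum(len(a["commands"]) for a in annotations) / len(annotations)
print(avg)
```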
Virtual Stage

A novel and important aspect of this research is that we use the physical properties and constraints of the virtual stage to improve the language analysis of the acting instructions. Below, we outline what sort of information the virtual stage provides.
The building blocks of our virtual stages are individual 3D graphical models of real-world objects. Each 3D model is hand-assigned a WordNet synset in a database. The virtual stages are assembled through a drag-and-drop interface between the database and the graphics system.

2 All the movie script examples in this paper are taken from a modified version of the Toy Story script.
Each 3D model in the virtual stage can be annotated with additional WordNet synsets, which is useful for identifying entities that have multiple roles in a movie. For example, the character Woody in Toy Story is both a toy, i.e. an artifact designed to be played with, and a person. Furthermore, each 3D model in the virtual stage can be given one or more names, which is useful for identifying entities with more than one title. For example, in the script of Toy Story, the character of Mrs. Davis is often referred to as either Mrs. Davis or Andy's mum.
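One way to picture the resulting database is as a registry mapping each 3D model to a set of synsets and a set of names, so that a lookup by any alias, or by either of an entity's roles, succeeds. The schema and identifiers below are our own sketch:

```python
# Hypothetical model registry: each entry carries WordNet-style synset
# identifiers and the surface names the script may use for the model.
REGISTRY = {
    "woody":     {"synsets": {"toy.n.01", "person.n.01"},
                  "names": {"Woody"}},
    "mrs_davis": {"synsets": {"person.n.01"},
                  "names": {"Mrs. Davis", "Andy's mum"}},
}

def find_by_name(name):
    """Resolve a script mention to a model id via its aliases."""
    for model_id, entry in REGISTRY.items():
        if name in entry["names"]:
            return model_id
    return None

def find_by_synset(synset):
    """All models annotated with a given synset (e.g. every 'person')."""
    return {m for m, e in REGISTRY.items() if synset in e["synsets"]}

print(find_by_name("Andy's mum"))     # both aliases reach mrs_davis
print(find_by_synset("person.n.01"))  # Woody qualifies via his second role
```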
WordNet 2.1 synsets were used to annotate the 3D models. Furthermore, since WordNet provides meronyms for its noun synsets, it will be possible to further annotate the components of each 3D model with the corresponding WordNet synsets, facilitating the incorporation of more real-world knowledge into our system. For example, consider the sentence Andy sits in the chair. Since a chair (typically) consists of a seat, legs, and a back, it would be useful if the system could differentiate the individual components of the 3D model and avoid an unnatural visualisation of the sentence, such as Andy sitting on the back of the chair instead of the seat.
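Continuing the chair example, component-level annotation might look like the following, with a hand-built part table standing in for WordNet's meronym relations (the synset-style keys and mesh names are invented for illustration):

```python
# Assumed component annotation for one chair model: each part of the
# mesh is keyed by a WordNet-style synset for that part.
CHAIR_PARTS = {
    "seat.n.03": "chair_seat_mesh",
    "back.n.08": "chair_back_mesh",
    "leg.n.03":  "chair_leg_meshes",
}

def sit_target(parts):
    """Pick the component a 'sit' action should be grounded to: the seat."""
    return parts.get("seat.n.03")

print(sit_target(CHAIR_PARTS))  # Andy sits on the seat, not the back
```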
Finally, an important feature of the virtual stages is that the status (position, orientation, etc.) of any virtual object can be queried at any time. This feature makes it possible to extract extra-linguistic information about the virtual objects during the animation process.
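A minimal sketch of such a query interface, with invented object names and coordinates, shows the kind of extra-linguistic information that becomes available, such as the distance between two objects at the current frame:

```python
import math

# Toy stand-in for the graphics system's state: every object on the
# virtual stage can report its position (and, in the real system, its
# orientation and other status) at any point during animation.
STAGE = {"woody": (0.0, 0.0, 0.0), "piggy_bank": (3.0, 4.0, 0.0)}

def position(obj):
    """Query the current position of a virtual object."""
    return STAGE[obj]

def distance(a, b):
    """Extra-linguistic information: how far apart two objects are."""
    return math.dist(position(a), position(b))

print(distance("woody", "piggy_bank"))  # 5.0 on this toy stage
```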
Main Computational Tasks

As stated above, this research centres around extracting verb semantic information from the acting