
Towards Automatic Animated Storyboarding

Patrick Ye♠ and Timothy Baldwin♠♥

♠ Computer Science and Software Engineering, University of Melbourne, Australia

♥ NICTA Victoria Laboratories, University of Melbourne, Australia

{jingy,tim}@csse.unimelb.edu.au

Abstract

In this paper, we propose a machine learning-based NLP system for automatically creating animated storyboards using the action descriptions of movie scripts. We focus particularly on the importance of verb semantics when generating graphics commands, and find that semantic role labelling boosts performance and is relatively robust to the effects of unseen verbs.

Introduction

Animated storyboards are computer animations created based on movie scripts, and used as crude previews by directors and actors during the pre-production of movies. Creating non-trivial animated storyboards is a time- and labour-intensive process, and requires a level of technical expertise that most people do not have. In this research, we propose to automate the process of animated storyboarding using a variety of NLP technologies, potentially saving time and money and also providing a dynamic, visual environment for script creation and fine-tuning.

The creation of an animated storyboard can be described as a two-step process. The first step is to construct a static virtual stage with virtual actors and props to approximate the scene to be shot. The second step is to create the interactions between the virtual actors and props, to visualise the events depicted by the action descriptions of the movie scripts. This research is focused on the second step as it is more labour-intensive and technically challenging than the first step.

There are three major differences between existing NLP-aided animation systems and our system. Firstly, most existing systems use handcrafted rules to map the results of language analysis onto graphics commands, whereas our system uses a machine learning system to perform this task automatically. Secondly, existing systems were designed for domain-specific tasks with a controlled vocabulary and syntax, whereas our system is open-domain with no restrictions on the language used other than that the input text is in the style of 3rd person narration. Thirdly, existing systems are all coupled with customised graphics systems designed specifically for their respective tasks, whereas our system is designed to interface with any graphics system that offers access through a programming language style interface.1

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 We used a free animation system called Blender3D (www.blender.org) in our experiments.

Since the main purpose of animated storyboards is to visualise the actions described in the scripts, and the actions are mostly expressed in verb phrases, the linguistic analysis in our system is focused on verb semantics. Several NLP techniques are used to perform this analysis: POS tagging is used to identify the verbs, and semantic role labelling (SRL) (Gildea and Jurafsky 2002) is used to identify the semantic roles of each verb. Furthermore, our linguistic analysis also makes use of lexical resources such as WordNet (Fellbaum 1998).

The main contribution of this research is the proposal of a novel multimodal domain of application for NLP, and the integration of NLP techniques with extra-linguistic and multimodal information defined by the virtual stage. We are especially interested in the usefulness of SRL in our task since it is an intuitively useful NLP technique but has rarely found any practical applications. We are also interested in how well our system could handle verbs not seen in the training data, since rule-based systems often cannot handle such verbs.

Related Work

Most existing literature on applying NLP techniques to computer animation is limited to domain-specific systems where a customised computer graphics module is driven by a simplistic handcrafted NLP module which extracts features from input text. The pioneering system to support natural language commands to a virtual world was SHRDLU (Winograd 1972), which uses handcrafted rules to interpret natural language commands for manipulating simple geometric objects in a customised block world. The K3 project (Funakoshi, Tokunaga, and Tanaka 2006) uses handcrafted rules to interpret Japanese voice commands to control virtual avatars in a customised 3D environment. Carsim (Johansson et al. 2004; 2005) uses a customised semantic role labeller to extract car crash information from Swedish newspaper text and visually recreate the crash using a customised graphics system.

While existing systems have been successful in their own respective domains, this general approach is inappropriate for the task of animated storyboarding because the input language and the domain (i.e. all movie genres) are too broad. Instead, we: (a) use a general-purpose graphics system with a general set of graphics commands not tuned to any one domain, and (b) use automated NLP and machine learning techniques to perform language analysis on the input scripts and control the graphics system.




Our work is also similar to machine translation (MT) in the sense that we aim to translate a source language into a target language. The major difference is that for MT, the target language is a human language, and for our work, the target language is a programming language. Programming languages are much more structured than natural languages, meaning that there is zero tolerance for disfluencies in the target language, as they result in the graphics system being unable to render the scene. Also, the lack of any means of underspecification or ambiguity in the graphics commands means that additional analysis must be performed on the source language to fully disambiguate the language.

Overview

In this section, we describe our movie script corpus, then provide detailed discussion of the virtual stages and how we use them in our experiments, and finally outline the main computational challenges in automatic animated storyboarding.

Movie Script Corpus

Movie scripts generally contain three types of text: the scene description, the dialogue, and the action description. In this research, the only section of the script used for language analysis is the acting instructions of the action descriptions, as in (1).2 That is, we only visualise the physical movements of the virtual actors/props.

(1) Andy’s hand lowers a ceramic piggy bank in front of Mr. Potato Head and shakes out a pile of coins to the floor. Mr. Potato Head kisses the coins.

In real movies, dialogues are of course accompanied by body and facial movements. Such movements are almost never specified explicitly in the movie scripts, and require artistic interpretation on the part of the actual actors. They are therefore outside the scope of this research.

We animated around 95% of the Toy Story script using 21 base graphics commands. The script was split up according to its original scene structure. For each annotated scene, we first constructed a virtual stage that resembles the corresponding scene in the original movie. We then performed semantic role labelling of that scene’s acting instructions, determined the correct order for visualisation, annotated every visualisable verb with a set of graphics commands and finally, annotated the correct grounding of each argument. In total, 1264 instances of 307 verbs were annotated, with an average of 2.84 base graphics commands per verb token.

Virtual Stage

A novel and important aspect of this research is that we use the physical properties and constraints of the virtual stage to improve the language analysis of the acting instructions. Below, we outline what sort of information the virtual stage provides.

2 All the movie script examples in this paper are taken from a modified version of the Toy Story script.

The building blocks of our virtual stages are individual 3D graphical models of real world objects. Each 3D model is hand-assigned a WordNet synset in a database. The virtual stages are assembled through a drag-and-drop interface between the database and the graphics system.

Each 3D model in the virtual stage can be annotated with additional WordNet synsets, which is useful for finding entities that have multiple roles in a movie. For example, the character Woody in Toy Story is both a toy, i.e. “an artifact designed to be played with”, and a person. Furthermore, each 3D model in the virtual stage can be given one or more names, which is useful for finding entities with more than one title. For example, in the script of Toy Story, the character of Mrs. Davis is often referred to as either Mrs. Davis or Andy’s mum.
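To make the annotation scheme concrete, here is a minimal sketch of how a database entry for an annotated 3D model might look; the class name VirtualObject, its fields, and the NLTK-style synset identifiers (e.g. toy.n.01) are our own illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualObject:
    """A 3D model in the virtual stage, annotated with WordNet synsets and names."""
    model_id: str
    synsets: set = field(default_factory=set)   # e.g. {"toy.n.01", "person.n.01"}
    names: set = field(default_factory=set)     # e.g. {"Mrs. Davis", "Andy's mum"}

    def matches_name(self, text: str) -> bool:
        return text.lower() in {n.lower() for n in self.names}

    def matches_synset(self, synset: str) -> bool:
        return synset in self.synsets

# Example: Woody is annotated as both a toy and a person,
# and Mrs. Davis carries two names used interchangeably in the script.
woody = VirtualObject("woody_01", synsets={"toy.n.01", "person.n.01"}, names={"Woody"})
mrs_davis = VirtualObject("mrs_davis_01", synsets={"person.n.01"},
                          names={"Mrs. Davis", "Andy's mum"})
assert mrs_davis.matches_name("Andy's mum")
```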

WordNet 2.1 synsets were used to annotate the 3D models. Furthermore, since WordNet provides meronyms for its noun synsets, it will be possible to further annotate the components of each 3D model with the corresponding WordNet synsets, facilitating the incorporation of more real world knowledge into our system. For example, consider the sentence Andy sits in the chair. Since a chair (typically) consists of a seat, legs, and a back, it would be useful if the system could differentiate the individual components of the 3D model and avoid unnatural visualisation of the sentence, such as Andy sitting on the back of the chair instead of the seat.

Finally, an important feature of the virtual stages is that the status (position, orientation, etc.) of any virtual object can be queried at any time. This feature makes it possible to extract extra-linguistic information about the virtual objects during the animation process.

Main Computational Tasks

As stated above, this research centres around extracting verb semantic information from the acting instructions and the virtual stage, and then using this information to create an appropriate animation.

We identified two main computational tasks for our research: extracting the timing relationships between the verbs in the input script, and constructing graphics commands using the verb semantic and virtual stage information.

Since a correct storyboard requires the actions described by the movie scripts to be animated in the correct chronological order, the extraction of the timing relationships between them is the first task of our system. Generally, verbs should be visualised in the order they appear in the text. For instance, in (1), lower should be visualised first, followed by shake, and finally kiss. However, there are cases where the order of appearance of the verbs does not correspond to the order of visualisation, as in (2), where pummel and bark should be visualised at the same time rather than sequentially in the order they appear.

(2) Sid pummels the figure with rocks while Scud is barking wildly at it.

The second task of constructing graphics commands is considerably more complicated than the first task, and is the main subject of this paper. This task consists of two subtasks: selecting the graphics instructions for the target verbs, and constructing the arguments for the chosen graphics instructions. The graphics instructions used in this research are similar to procedures in procedural programming languages, and have the general form command(arg0, arg1, ..., argN). Therefore, the first subtask is to decide the value of command, and the second subtask is to decide on the values of each argument to the chosen command.
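To make the command(arg0, ..., argN) form concrete, the following sketch shows how such instructions might be represented; the command names (move_to, translate) and their argument structures are hypothetical, since the paper does not list its 21 base commands.

```python
from typing import NamedTuple

class GraphicsCommand(NamedTuple):
    """A base graphics instruction of the general form command(arg0, ..., argN)."""
    command: str   # which base command to run (the first subtask)
    args: tuple    # grounded arguments to the command (the second subtask)

# Hypothetical partial visualisation of "Andy's hand lowers a ceramic piggy bank ...":
# the verb "lower" might expand into a short sequence of base commands.
sequence = [
    GraphicsCommand("move_to", ("andy_hand", "piggy_bank")),
    GraphicsCommand("translate", ("piggy_bank", (0.0, 0.0, -0.3))),
]
for cmd in sequence:
    print(f"{cmd.command}({', '.join(map(str, cmd.args))})")
```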

The linguistic analysis involved in the task of constructing graphics commands is mostly focused on the extraction of verb semantics. The main NLP technique used here is semantic role labelling. However, as semantic role labellers only identify the surface string associated with each verb argument, an additional step is required to ground the surface string associated with each argument to one or more specific virtual stage objects. This grounding task is relatively less studied, and is far from a trivial problem. For instance, consider The toys on the floor in (3). This does not simply correspond to all virtual objects that are annotated with the synset toy1 in the virtual stage, but has the extra constraints of the objects needing to be on the floor and salient to the current area of focus.

(3) The toys on the floor all stop and run towards the monitor.

Classifier Setup

In this section, we provide details of how we apply machine learning to the task of choosing the graphics commands.

Since we use a generic graphics system, it is uncommon for a verb to be fully visualised with a single base graphics command, and very often, it is necessary to use a sequence of combinations of base graphics commands to visualise a verb. Overall, the average number of base graphics commands used to visualise a single verb was 2.84.

In the context of the classifier, we will refer to these combinations of base graphics commands as “actions”. Base graphics commands within a single action are executed from the same time stamp, sequentially at the beginning of each animation cycle.

We treat the subtask of assigning sequences of actions to verbs as a Markov process, and the overall task of action generation as a classification problem. The feature vector of the classifier denotes the latest semantic and virtual stage information, and the class labels denote the next action to be performed. For example, suppose the sentence to be visualised is The man grabs the mug on the table, and the virtual world is configured as in Figure 1a. The classifier should select a set of base commands (i.e. an action) which make the virtual man stand up (Figure 1b), move close to the virtual mug (Figure 1c), and extend his hand to the mug (Figure 1d). Finally, the classifier should select an end-of-sequence action signalling the completion of the visualisation. Each of these movements needs to be executed and the virtual stage updated, and this new information should be used to condition the selection/animation of the next movement.
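Read as pseudocode, this Markov-process treatment amounts to the following control loop, sketched here with hypothetical callables (extract_features, predict_action, execute_action) standing in for components the paper does not spell out.

```python
END_OF_SEQUENCE = "end-of-sequence"

def visualise_verb(extract_features, predict_action, execute_action, stage, max_steps=20):
    """Select and execute actions for one verb until end-of-sequence is predicted.

    extract_features(stage) -> feature vector built from the latest linguistic
        analysis and the current state of the virtual stage;
    predict_action(features) -> the next action (a combination of base commands);
    execute_action(action, stage) -> runs the action and returns the updated stage.
    """
    actions = []
    for _ in range(max_steps):
        features = extract_features(stage)      # condition on the latest stage state
        action = predict_action(features)       # classifier picks the next action
        if action == END_OF_SEQUENCE:
            break
        stage = execute_action(action, stage)   # run base commands, update the stage
        actions.append(action)
    return actions

# Toy usage: a fake "classifier" that makes the man stand, walk, reach, then stops.
script = iter(["stand_up", "move_to(mug)", "extend_hand(mug)", END_OF_SEQUENCE])
print(visualise_verb(lambda s: s, lambda f: next(script), lambda a, s: s, stage={}))
```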

In the remainder of this section, we present the features that we used in our experiments.

Linguistic Features

The linguistic features used in our system can be divided into the following categories:

Verb Features: These features include the original form of the target verb, its lemmas and WordNet synsets, and the hypernyms of these synsets. The inclusion of hypernyms is intended to provide the classifiers with the means to generalise when dealing with previously unseen verbs. The verbs are not disambiguated; instead all WordNet senses of the target verb are included.

Collocation Features: These are the lemmas of all the open class words (determined according to the POS tagger output) that occur in the same sentence as the target verb.

Semantic Role Features: These features include: the types of roles that the target verb takes (ARG0, ARGM-LOC, etc.); the WordNet synsets of the head words of the grounded constituents combined with the semantic roles of the constituents; and a series of syntactic-collocational features of the constituents of the target verb. Note that we have hand-resolved all anaphora in the data (around 35% of the sentences in the training data contained one or more anaphora), and used this oracle-style anaphora resolution for all methods presented in this paper. A sketch of how these feature groups might be assembled is given below.
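The following is a rough sketch of how the three feature groups might be combined, using NLTK's WordNet interface; the feature string formats and the input representation (the target verb, the sentence lemmas, and a mapping from role labels to grounded synsets) are assumptions made for illustration, not the authors' code.

```python
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

def linguistic_features(verb, sentence_lemmas, roles):
    """Build a bag of linguistic features for one target verb.

    verb: surface form of the target verb, e.g. "lowers"
    sentence_lemmas: lemmas of the open-class words in the same sentence
    roles: dict mapping role labels (e.g. "ARG0") to the WordNet synset names
           of the head words of their grounded constituents
    """
    lemma = WordNetLemmatizer().lemmatize(verb, pos="v")
    features = {f"verb={verb}", f"lemma={lemma}"}
    # Verb features: all senses of the verb plus their hypernyms (no disambiguation).
    for synset in wn.synsets(lemma, pos=wn.VERB):
        features.add(f"synset={synset.name()}")
        for hyper in synset.hypernyms():
            features.add(f"hypernym={hyper.name()}")
    # Collocation features: open-class lemmas in the same sentence.
    features.update(f"colloc={w}" for w in sentence_lemmas)
    # Semantic role features: role types, and role paired with the grounded synset.
    for role, synset_name in roles.items():
        features.add(f"role={role}")
        features.add(f"role_head={role}:{synset_name}")
    return features

print(sorted(linguistic_features("lowers", ["hand", "piggy", "bank"],
                                 {"ARG0": "person.n.01", "ARG1": "bank.n.02"})))
```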

The syntactic-collocational features of the constituents were designed to capture the details missed by the grounded semantic roles. The purpose of grounding semantic roles is to find the most relevant virtual object with respect to given textual forms of constituents, which is not sufficient if the most relevant virtual object does not exactly correspond to the semantics of the constituent. For example, consider ARG0 and ARGM-LOC of the SRLed sentence in (4). The head of ARG0 is hand, which corresponds to a part of the body of Andy. However, since grounding is performed only at the level of virtual objects, ARG0 is grounded to the virtual Andy, and the head of the constituent is lost in the grounding process. Similarly for ARGM-LOC, the constituent is a PP, and the most relevant virtual object is the virtual Mr. Potato Head. This is hence what this phrase will be grounded to, resulting in the additional locational information of in front of being lost. The syntactic-collocational features reinstate the linguistic content of each constituent to the feature space, providing the classifier with the means to fine-tune the grounding information and more faithfully render the virtual scene.

(4) [ARG0 Andy’s hand] [TARGET lowers] [ARG1 a ceramic piggy bank] [ARGM-LOC in front of Mr. Potato Head]

The syntactic-collocational features are collected recursively from the root of the smallest parse subtree covering all the tokens in the corresponding constituent. Only semantic roles which are prepositional phrases (PP), noun phrases (NP), adjective phrases (ADJP) and adverb phrases (ADVP) are used to generate these features.

Figure 1: Examples of the sequence of virtual stages (1a–1d) corresponding to The man grabs the mug on the table

If the semantic role is a PP, it will generate a feature that includes the head preposition and the recursively retrieved features of its argument. If the constituent is an NP, it will generate a feature that includes the WordNet synset of the head noun and the recursively retrieved features of its possessor and modifiers. If the semantic role is an ADJP or ADVP, it will generate a feature that includes its head, and all the recursively retrieved features of its modifiers. Take the constituent in front of Mr. Potato Head, for example. Mr. Potato Head would generate the feature of (NP, toy1), and front of Mr. Potato Head would generate the feature of (NP, front2, (PPMOD, PP, of, (ARG, (NP, toy1)))).
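The recursive encoding just described might be implemented along the following lines; the dictionary-based constituent representation and the exact tuple layout are our simplification of the paper's notation, not the authors' implementation.

```python
def constituent_feature(node):
    """Recursively encode a constituent as a nested feature tuple.

    node is a dict with: "label" (PP/NP/ADJP/ADVP), "head" (the head preposition,
    the WordNet synset of the head noun, or the head word), and an optional
    "children" mapping of relations (e.g. "PPMOD", "ARG", "MOD") to further nodes.
    """
    children = node.get("children", {})
    encoded = tuple((rel, constituent_feature(child)) for rel, child in children.items())
    # PP: head preposition; NP: synset of the head noun; ADJP/ADVP: the head word.
    return (node["label"], node["head"]) + encoded

# "front of Mr. Potato Head": an NP headed by front2 with a PP modifier "of ...".
phrase = {"label": "NP", "head": "front2",
          "children": {"PPMOD": {"label": "PP", "head": "of",
                                 "children": {"ARG": {"label": "NP", "head": "toy1"}}}}}
print(constituent_feature(phrase))
# -> ('NP', 'front2', ('PPMOD', ('PP', 'of', ('ARG', ('NP', 'toy1')))))
```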

The ASSERT system (Pradhan et al. 2004) was used to perform the semantic role labelling, the Charniak reranking parser (Charniak and Johnson 2005) was used to perform the parsing, and the SVMTool (Giménez and Màrquez 2004) POS tagger (version 1.2.2) was used to POS-tag the input text.

Virtual Stage Features

The virtual stage features are binary features designed to capture the spatial relationships between the grounded constituents. The general spatial relationships are denoted as the results of different proximity and orientation tests between the virtual objects. For virtual objects that contain bones, the spatial relationships also include the results of the proximity tests between their bones and other virtual objects.

In addition, the virtual stage features include a description of each ongoing base graphics command that each grounded virtual object is involved in.
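A sketch of how such binary spatial features could be computed from stage queries is given below; the distance threshold, the dot-product orientation test and the feature names are illustrative assumptions, not values from the paper.

```python
import math

def spatial_features(objects, near_threshold=1.0):
    """Binary proximity/orientation features over pairs of grounded virtual objects.

    objects: dict mapping object ids to dicts with "pos" (an x, y, z position) and
    "facing" (a direction vector), as would be queried from the virtual stage.
    """
    feats = {}
    ids = sorted(objects)
    for a in ids:
        for b in ids:
            if a == b:
                continue
            dist = math.dist(objects[a]["pos"], objects[b]["pos"])
            feats[f"near({a},{b})"] = dist < near_threshold
            # Orientation test: is object a roughly facing object b?
            dx, dy, dz = (objects[b]["pos"][i] - objects[a]["pos"][i] for i in range(3))
            fx, fy, fz = objects[a]["facing"]
            feats[f"facing({a},{b})"] = (dx * fx + dy * fy + dz * fz) > 0
    return feats

stage = {"man": {"pos": (0, 0, 0), "facing": (1, 0, 0)},
         "mug": {"pos": (2, 0, 0), "facing": (-1, 0, 0)}}
print(spatial_features(stage))
```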

Experiments

The main questions we address in our experiments are: (1) What is the effect of unseen verbs on overall performance? (2) Does semantic role labelling positively contribute to classification accuracy? and (3) How reliant is the classification on accurate grounding?

In all the experiments, we divided our original datasets into training, development and test sets. The training set contains roughly the same number of examples as the combined development and test sets, and the development and test sets are of roughly the same size. We used a maximum entropy (Ratnaparkhi 1996) classifier to generate models from the training data.3 The parameters of these models are then adjusted using the development data, and finally tested on the test data. The details of the parameter tuning can be found in (Ye and Baldwin 2006).

To explore the first question of the impact of unseen verbs on overall performance, we performed two sets of experiments.

3 The learner is downloadable from: homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html

Figure 2: Example of substitute semantic roles. [Parse tree of the sentence Andy puts the piggy bank back, with the top-level phrases labelled -1, TARGET, +1 and +2 according to their positions relative to the target verb puts.]

The first set of experiments was designed to test how well the classifiers work when the verbs in the test data are included in the training data whenever possible, i.e. when verb instances are evenly distributed. The second set of experiments was designed to test how well the classifiers work when all the verbs in the test data are randomly chosen to not appear in the training data.

To explore the second question of the impact of SRL, we experimented with three different SRL methods: using gold standard semantic roles, using an automatic labeller (ASSERT), and using simple parse features as a replacement for semantic roles.

The gold standard semantic role data was created by hand-correcting the outputs of ASSERT, and the automatically labelled semantic roles were then evaluated using the evaluation script for the CoNLL 2004 SRL shared task (Carreras and Marquez 2004). Overall, ASSERT had a precision of .744, a recall of .564 and an F-score of .642. For comparison, over the CoNLL 2004 data, ASSERT had a precision of .724, a recall of .668 and an F-score of .695. It can be observed that the recall over the movie script corpus is significantly lower than over the data it was trained on. We also observed that many of the missing semantic roles were caused by the POS tagger, which mis-classified several high frequency verbs (e.g. grab, jump and climb) as nouns.
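As a quick arithmetic check, both reported F-scores follow from the standard balanced F-measure of precision and recall:

```latex
F_1 = \frac{2PR}{P+R}, \qquad
\frac{2 \times 0.744 \times 0.564}{0.744 + 0.564} \approx 0.642, \qquad
\frac{2 \times 0.724 \times 0.668}{0.724 + 0.668} \approx 0.695
```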

When no semantic roles are used, we used a parse tree-based method to extract top-level NPs, PPs, ADJPs and ADVPs from the output of the Charniak parser, and used the relative positions of these phrases with respect to the target verb as substitute semantic roles. Figure 2 shows the substitute semantic roles for the sentence Andy puts the piggy bank back.
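A rough approximation of this parser-emulated labelling, using NLTK's Tree over a bracketed Charniak-style parse, is sketched below; the exact traversal and labelling conventions used by the authors are not specified, so the details here are assumptions.

```python
from nltk import Tree

PHRASE_LABELS = {"NP", "PP", "ADJP", "ADVP"}

def substitute_roles(tree, target_verb):
    """Assign positional labels (-1, +1, +2, ...) to the top-level phrases
    around the target verb, as a stand-in for real semantic roles."""
    spans = []          # (leaf offset, phrase label, phrase text)
    offset = 0
    def walk(node):
        nonlocal offset
        if isinstance(node, str):
            offset += 1
            return
        if node.label() in PHRASE_LABELS:
            spans.append((offset, node.label(), " ".join(node.leaves())))
            offset += len(node.leaves())    # do not recurse into top-level phrases
            return
        for child in node:
            walk(child)
    walk(tree)
    verb_offset = tree.leaves().index(target_verb)
    before = [s for s, _, _ in spans if s < verb_offset]
    after = [s for s, _, _ in spans if s > verb_offset]
    roles = {}
    for start, label, text in spans:
        if start < verb_offset:
            pos = -(len(before) - before.index(start))   # count back from the verb
        else:
            pos = after.index(start) + 1                  # count forward from the verb
        roles[pos] = (label, text)
    return roles

sentence = Tree.fromstring(
    "(S (NP (NNP Andy)) (VP (VBZ puts) (NP (DT the) (NN piggy) (NN bank)) (ADVP (RB back))) (. .))")
print(substitute_roles(sentence, "puts"))
# -> {-1: ('NP', 'Andy'), 1: ('NP', 'the piggy bank'), 2: ('ADVP', 'back')}
```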



SRL    Precision   Recall   F-score
Gold   .508        .701     .589
Auto   .470        .490     .480

Table 1: Grounding results

The final question of grounding performance was explored via two sources of grounding information: (1) gold standard grounding annotations based on the gold standard semantic roles; and (2) automatic grounding predictions based on either the gold standard semantic roles, the outputs of ASSERT or the parse tree-based phrase extractor.

The automatic grounding predictions were obtained using a string matching method. Given the text descriptor of an object in the text, this method first tries to match it against the user-specified names of the virtual objects, and if no virtual objects match, it then matches the string against the WordNet synsets of the virtual objects. For example, if the virtual stage contains a virtual actor named Woody, and the string Woody or Woody’s hand is to be matched, the automatic grounding module will correctly return the virtual Woody. On the other hand, if the virtual stage contains multiple virtual toys, and the string toy is to be matched, the automatic grounding module would incorrectly return all the virtual toys. Table 1 shows the results of the automatic grounding method on the gold standard semantic roles and automatically-labelled semantic roles.
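The name-first, synset-second back-off described above can be sketched as follows; the dictionary-based stage representation, the case-folded substring matching and the example characters beyond Woody are our assumptions for illustration.

```python
def ground(descriptor, stage_objects, descriptor_synsets):
    """Return the ids of the virtual objects a text descriptor is grounded to.

    descriptor: surface string from the script, e.g. "Woody's hand" or "toy"
    stage_objects: dict mapping object id -> {"names": set of name strings,
                                              "synsets": set of WordNet synsets}
    descriptor_synsets: WordNet synsets assigned to the descriptor's head word
    """
    descriptor = descriptor.lower()
    # First pass: match against the user-specified object names.
    by_name = [oid for oid, o in stage_objects.items()
               if any(name.lower() in descriptor for name in o["names"])]
    if by_name:
        return by_name
    # Second pass: fall back to the WordNet synsets of the objects.
    return [oid for oid, o in stage_objects.items()
            if o["synsets"] & set(descriptor_synsets)]

stage = {"woody": {"names": {"Woody"}, "synsets": {"toy.n.01", "person.n.01"}},
         "rex":   {"names": {"Rex"},   "synsets": {"toy.n.01"}},
         "hamm":  {"names": {"Hamm"},  "synsets": {"toy.n.01"}}}
print(ground("Woody's hand", stage, []))       # ['woody']
print(ground("toy", stage, ["toy.n.01"]))      # overgenerates: ['woody', 'rex', 'hamm']
```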

We evaluate our models in terms of classification accuracy with respect to the gold-standard action annotations for each verb, on a per-action basis. The majority class baseline for our classifiers is the overall most frequent action (i.e. end-of-sequence).

Results and Analysis

Experiment Set 1: Balanced Distribution

In this first set of experiments, we partitioned the training, development and test datasets such that the annotated verb instances were stratified across the three datasets. The stratification ensures the distributions of verbs in the training, development and test sets are as similar as possible. Our “Classifier” performance is based on a classifier which has been trained over all the verbs in the training set. Table 2 shows the classification accuracies under different combinations of gold-standard and automatic grounding, and gold-standard, automatic and parser-emulated SRL. The baseline of these classifiers is .386.

All our classifiers outperformed the majority class baseline. Unsurprisingly, classifiers trained on both the gold-standard SRL and the gold-standard grounding data achieved the highest accuracy, although the relative difference in including gold-standard SRL was considerably less than was the case with the grounding.

                                   Source of SRL
             Source of Grounding   Gold    Auto    Parser
Classifier   Gold                  .568    N/A     N/A
             Auto                  .531    .526    .475

Table 2: Accuracy for experiment set 1 (no unseen verbs)

                                   Source of SRL
             Source of Grounding   Gold    Auto    Parser
Classifier   Gold                  .536    N/A     N/A
             Auto                  .479    .452    .463

Table 3: Accuracy for experiment set 2 (all unseen verbs)

These experiments show that SRL can indeed be useful in the domain of automatic animated storyboarding.

It is perhaps slightly surprising that the classifier based on automatically grounded gold-standard semantic roles only slightly outperformed the one based on automatic SRL. A closer look at the results showed that the gold-standard grounding annotation only identified 1266 grounded semantic roles, but the automatic grounding method came up with 1748 grounded semantic roles. This overgeneration of grounded semantic roles introduced a massive amount of noise into the data, thereby lowering the performance of the corresponding classifier.

On the other hand, it was observed that ASSERT tended to make the same mistakes for the same verbs. Hence, the same SRL mistakes were consistently present in both the training data and the test data, thereby not causing significant negative effects.

These experiments show that there is significant room for improvement in the grounding of the semantic roles. The biggest improvement among the SRL based classifiers came when gold-standard grounding annotation was used. This indicates that the string-matching grounding system we are currently using is inadequate, and deeper linguistic analysis is needed.

Experiment Set 2: All Unseen Verbs

Recall from above that all the hypernyms of all the senses of the target verbs were included in the verb features, in the hope that they could provide some generalisation power for unseen verbs. The purpose of this second set of experiments is to test how well our classifiers can generalise verb semantics.

We first randomly divided our datasets into 4 portions, each containing roughly the same number of verbs, with all instances of a given verb occurring in a single portion. We then performed 4-fold cross-validation, using two portions as training, one as development and one as test data. In our results, we report the average classification accuracy.
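The verb-level split can be emulated as below; the round-robin assignment of shuffled verbs to folds is an assumption, since the paper only states that all instances of a given verb fall in the same portion.

```python
import random
from collections import defaultdict

def verb_level_folds(instances, n_folds=4, seed=0):
    """Split annotated verb instances into folds such that all instances of a
    given verb land in the same fold (so test verbs are unseen in training)."""
    by_verb = defaultdict(list)
    for inst in instances:
        by_verb[inst["verb"]].append(inst)
    verbs = sorted(by_verb)
    random.Random(seed).shuffle(verbs)
    folds = [[] for _ in range(n_folds)]
    for i, verb in enumerate(verbs):
        folds[i % n_folds].extend(by_verb[verb])   # round-robin assignment by verb
    return folds

data = [{"verb": v, "id": i} for i, v in enumerate(["grab", "jump", "climb", "grab", "kiss"])]
for k, fold in enumerate(verb_level_folds(data)):
    print(k, [(d["verb"], d["id"]) for d in fold])
```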

Table 3 shows the classification accuracies under the same combinations of SRL and grounding data as for experiment set 1. The majority class baseline for these classifiers is .393.

All the classifiers in this set of experiments performed worse than in the first set of experiments. This is not surprising given that we do not have direct access to the semantics of a given verb, as it never occurs directly in the training data. It is also encouraging to see that despite the performance drop, all the classifiers still outperformed the baseline.

The classifiers based on automatically-grounded semantic roles suffered the greatest drop in performance: the automatic SRL-based classifier even performed below the parser-emulated SRL-based classifier. This is not surprising because most of the verb semantic information used by these classifiers is provided by the grounded semantic roles. In the first set of experiments, even though the semantic role labeller didn’t perform well, at least the errors were consistent between the training and test sets, and this consistency to some degree offset the poor semantic role labelling. However, in this set of experiments, the errors in automatically obtained semantic roles are no longer consistent between the training and test sets, and only the correctly-labelled semantic roles generalise well to unseen verbs. This is why the classifier based on the gold-standard grounding annotations suffered the least performance drop among all the semantic role-based classifiers, whereas the classifier based on automatic SRL suffered the most.

On the other hand, the parser-emulated semantic roles are not greatly affected by the unseen verbs, because these semantic roles only depend on the relative positions of the relevant phrases to the target verb. These relative positions tend to be much more verb-independent than the real semantic roles, and are therefore less affected by the variation of the target verbs.

Discussion and Conclusion

Three observations can be made from the experiments. Firstly, unseen verbs have a noticeable negative impact on the overall performance of the classifier, especially when the semantic roles are not of high standard. However, unseen verbs did not cause the classifiers to completely fail, which is encouraging as it shows that our method is relatively robust over unseen verbs, unlike rule-based systems which rely on explicit information for each verb.

Secondly, semantic role labelling contributes positively to our task. However, it needs to achieve higher performance in order to be consistently useful when unseen verbs are involved.

Thirdly, the performance of the classification relies heavily on the grounding of the semantic roles, and the string matching-based grounding method tends to overgenerate groundings, which in turn introduce noise into our data and reduce the effectiveness of the resultant classifier.

In conclusion, we have presented a novel machine learning method for automatically generating animated storyboards. Our method uses several NLP techniques and resources, including parsing, SRL, and WordNet. The experiments presented in this paper show that the features we used are effective and can generalise over previously unseen verbs. Furthermore, we believe that the inclusion of a virtual stage (i.e. encoding of real world interaction) provides a novel perspective to the application and evaluation of NLP techniques, as demonstrated by the use of SRL and parsing in our experiments.

Our short term future work will be focused on the grounding process of the semantic roles. We will build a more complete grounding module which is capable of resolving complex NPs and PPs with the aid of the virtual stage, and we will investigate how current techniques in word sense disambiguation, anaphora resolution and co-reference resolution can be incorporated with the grounding process to provide a more integrated solution.

References

Carreras, X., and Marquez, L. 2004. Introduction to the CoNLL-2004 shared task: Semantic role labeling. In Proc. of the 8th Conference on Natural Language Learning (CoNLL-2004), 89–97.
Charniak, E., and Johnson, M. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 173–180.
Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. Cambridge, USA: MIT Press.
Funakoshi, K.; Tokunaga, T.; and Tanaka, H. 2006. Conversational animated agent system K3. In Proceedings of the 2006 International Conference on Intelligent User Interfaces (IUI2006), Demo Session.
Gildea, D., and Jurafsky, D. 2002. Automatic labeling of semantic roles. Computational Linguistics 28(3):245–288.
Giménez, J., and Màrquez, L. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation, 43–46.
Johansson, R.; Williams, D.; Berglund, A.; and Nugues, P. 2004. Carsim: A system to visualize written road accident reports as animated 3D scenes. In ACL 2004: Second Workshop on Text Meaning and Interpretation, 57–64.
Johansson, R.; Berglund, A.; Danielsson, M.; and Nugues, P. 2005. Automatic text-to-scene conversion in the traffic accident domain. In IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, 1073–1078.
Pradhan, S.; Hacioglu, K.; Krugler, V.; Ward, W.; Martin, J. H.; and Jurafsky, D. 2004. Support vector learning for semantic argument classification. Machine Learning 60(1–3):11–39.
Ratnaparkhi, A. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 133–142.
Winograd, T. 1972. Understanding Natural Language. Academic Press.
Ye, P., and Baldwin, T. 2006. Semantic role labeling of prepositional phrases. ACM Transactions on Asian Language Information Processing (TALIP) 5(3):228–244.


