
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2647–2656, Brussels, Belgium, October 31 – November 4, 2018. © 2018 Association for Computational Linguistics


Grounding language acquisition by training semantic parsers using captioned videos

Candace Ross, CSAIL, MIT ([email protected])
Andrei Barbu, CSAIL, MIT ([email protected])
Yevgeni Berzak, BCS, MIT ([email protected])
Battushig Myanganbayar, CSAIL, MIT ([email protected])
Boris Katz, CSAIL, MIT ([email protected])

Abstract

We develop a semantic parser that is trained in a grounded setting using pairs of videos captioned with sentences. This setting is both data-efficient, requiring little annotation, and similar to the experience of children, who observe their environment and listen to speakers. The semantic parser recovers the meaning of English sentences despite not having access to any annotated sentences. It does so despite the ambiguity inherent in vision, where a sentence may refer to any combination of objects, object properties, relations or actions taken by any agent in a video. For this task, we collected a new dataset for grounded language acquisition. Learning a grounded semantic parser — turning sentences into logical forms using captioned videos — can significantly expand the range of data that parsers can be trained on, lower the effort of training a semantic parser, and ultimately lead to a better understanding of child language acquisition.

1 Introduction

Children learn language from observations that are very different in nature from what parsers are trained on today. Most of the time, rather than receiving direct feedback such as annotated sentences or answers to direct questions, children observe and occasionally interact with their environment. They must use these observations to learn the structure of the speaker's language despite never seeing that structure overtly. This weak and indirect supervision, where most of the information is obtained through passive observation, poses a difficult disambiguation problem for learners: how do you know what the speaker is referring to in the environment, i.e., what does the speaker mean? Speakers can refer to actions, objects, the properties of actions and objects, relations between those actions and objects, as well as other features in the environment, and generally do so by combining multiple features into complex sentences. Moreover, speakers need not refer to the most visually salient parts of a visual scene. Here, we induce a semantic parser by simultaneously resolving visual ambiguities and grounding the semantics of language using a corpus of sentences paired with videos without other annotations.

Example caption: "The woman walks by the table with a yellow cup."
Parse: λxyz. woman x, walk x, near x y, table y, hold x z, yellow z, cup z

Figure 1: We develop a semantic parser trained on video-sentence pairs, without parses. At inference time a sentence, without a video, is presented and a logical form is produced.

The goal of semantic parsing is to convert a natural-language sentence into a representation that encodes its meaning. The parser takes sentences as input and produces these representations – a lambda-calculus expression in our case – that can be used for a variety of tasks such as querying databases, understanding references in images and videos, and answering questions. To train the parser presented here we collected a video dataset, balanced such that the raw statistics of the co-occurrences of objects and events are not informative, and asked annotators on Mechanical Turk to produce sentences that are true of those videos. The parser is presented with pairs of short clips and sentences. It hypothesizes potential meanings for those sentences as lambda-calculus expressions. Each hypothesized expression serves as input for a modular vision system that constructs a specific detector for that lambda-calculus expression and determines the likelihood of the parse being true of the video. The likelihood of the parse with respect to the video is used as supervision for the parser. To test the parser, we annotated each sentence with its ground-truth semantic parse, but this information is not available at training time.

This process introduces ambiguity. For example, Figure 1 shows a frame from a video annotated with the sentence "The woman walks by the table with a yellow cup.", yet the parse λx. object(x), corresponding to a sentence like "There exists an object.", is also true of that video. For a single video there exists an infinite number of true parses that have high likelihood with respect to the vision system because they are indeed indicative of something that is occurring in the video. We demonstrate how to construct a semantic parser that resolves this ambiguity and acquires language from captioned videos by learning to tune the amount of polysemy in the induced lexicon.

This work makes several contributions: We show how to construct a semantic parser that learns language in a setting closer to that of children. We demonstrate how to jointly resolve linguistic and visual ambiguities at training time in a way that can be adapted to other semantic parsing approaches. We demonstrate how such an approach can be used to augment data where a small number of directly annotated sentences can be combined with a large number of videos paired with sentences in order to improve performance. We release a dataset systematically constructed and annotated on Mechanical Turk for joint visual and linguistic learning tasks.

2 Prior work

Learning to understand language in a multimodal environment is a well-developed task. For example, visual question answering (VQA) datasets have led to a number of systems capable of answering complex questions about scenes (Antol et al., 2015). The goal of our work is not to produce answers for any one set of questions, although it is possible to do so from our results; it is instead to learn to predict the structure of the sentences and their meaning. This is a more general and difficult problem, in particular because at test time we do not receive any visual input, only the sentence. The resulting approach is reusable, generic and more similar to the kind of general-purpose linguistic knowledge that humans have. For example, one could use it to guide robotic actions. Al-Omari et al. (2017) acquire a grammar for a fragment of English and Arabic from videos paired with sentences. They learn a small number of grammar rules for a language restricted to robotic commands. Learning occurs mostly in simulation and with little visual ambiguity, and the resulting model is not a parser but a means of associating n-grams with visual concepts.

Siddharth et al. (2014) and Yu et al. (2015) acquire the meaning of a lexicon from videos paired with sentences but assume a fully-trained parser. Matuszek et al. (2012) similarly present a model to learn the meanings and referents of words restricted to attributes and static scenes. Hermann et al. (2017) extend these notions to train agents that learn to carry out instructions in simulated environments without the need for a parser, but do so using simple adjective-noun-relation utterances. Kollar et al. (2013) learn to parse similar utterances in an interactive setting. Wang et al. (2016) create a language game to learn a parser but do not incorporate visual ambiguity or fallible perception.

Berant et al. (2013) describe semantic parsing with execution by annotating answers to database queries. This learning mechanism provides the same results as the one described here: a parser produces the meanings of sentences at inference time without requiring the database, or in our case a video. Databases have far less ambiguity than videos; there is not a temporal aspect to their contents and there is not a notion of unreliable perception. Berant and Liang (2014) learn to parse sentences from paraphrases; one might consider the work here as concerned with visual and not just linguistic paraphrases. Artzi and Zettlemoyer (2013) consider a setting where a validation function involves the dynamic actions of a simulated robot while sentences describe its actions.

3 Task

Given a dataset of captioned videos, D, we train the parameters and lexicon, θ and Λ, of a semantic parser. At training time, we perform gradient descent over the parameters θ and employ GENLEX (Zettlemoyer and Collins, 2005) to augment the lexicon Λ. The objective function of the semantic parser is written in terms of a visual-linguistic compatibility between a hypothesized parse p and video v. This compatibility computes the likelihood of the parse being true of the video, P(v|p). At test time, we take as input a sentence without an associated video and produce a semantic parse. We could in principle also take as input the video and produce a targeted parse for that visual scenario. This is a problem similar to that considered by Berzak et al. (2015), but we do not do so here.
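A minimal sketch of this training regime is shown below, purely to make the flow of supervision concrete. The helper names (`beam_parse`, `visual_likelihood`, `update_theta`, `genlex`) are hypothetical stand-ins for the CCG beam search, the Sentence Tracker's compatibility P(v|p), the parameter update of Equation (2) in Section 4.1, and lexicon induction; the actual implementation is not reproduced here.

```python
# Sketch of weakly supervised training from captioned videos (assumed helpers):
#   beam_parse(sentence, lexicon, theta) -> candidate parses from the CCG parser
#   visual_likelihood(video, parse)      -> compatibility P(video | parse)
#   update_theta(theta, pos, neg)        -> parameter update (cf. Equation 2)
#   genlex(sentence, lexicon, validate)  -> new candidate lexical entries

def train(dataset, lexicon, theta, epochs=10, threshold=0.5):
    for _ in range(epochs):
        for sentence, video in dataset:
            candidates = beam_parse(sentence, lexicon, theta)
            # Vision, not annotated logical forms, decides which parses count as correct.
            positive = [p for p in candidates if visual_likelihood(video, p) > threshold]
            negative = [p for p in candidates if visual_likelihood(video, p) <= threshold]
            theta = update_theta(theta, positive, negative)
        # After each sweep, GENLEX proposes new lexical entries, validated visually.
        for sentence, video in dataset:
            validate = lambda parse, v=video: visual_likelihood(v, parse) > threshold
            lexicon |= set(genlex(sentence, lexicon, validate))
    return lexicon, theta
```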

We create a CCG-based (Combinatory Categorial Grammar; Steedman (1996)) semantic parser capable of being trained in this setting. To do so, we adapt the objective function, training procedure, and feature set to this new scenario. The visual-linguistic compatibility function is similar to the Sentence Tracker developed in Siddharth et al. (2014) and Yu et al. (2015). Given a parse, the Sentence Tracker produces a targeted detector that determines if the parse is true of a video, which provides a weak supervision signal for the parser.

Parses are represented as lambda-calculus expressions consisting of a set of binders and a conjunction of literal expressions referring to those binders. The domain of the variables is the set of potential object locations, or object tracks, in the videos. For example, in the parse presented in Figure 1, three potential object track slots are available, represented by the binders x, y, and z. Because of perceptual ambiguities and the large number of possible referents in any one video, we do not explicitly enumerate the space of object tracks. Instead, we rely on a joint-inference process between the parser and the Sentence Tracker. Intuitively, each literal expression of the parse asserts a constraint; for example, if an expression conveys that one object is approaching another, the Sentence Tracker will search the space of object tracks and attempt to satisfy these constraints. In Figure 1, for instance, there is a constraint that for whichever objects are bound to x, y, and z, x must be near y, x must be walking, x must be a person, etc.
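For illustration, such a target representation can be thought of simply as a tuple of binders plus a conjunction of literals. The sketch below encodes the parse from Figure 1 using hypothetical class names; it is not the parser's internal term representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    predicate: str   # e.g. "woman", "walk", "near"
    args: tuple      # the binder variables this literal constrains

@dataclass(frozen=True)
class LogicalForm:
    binders: tuple   # object-track variables, e.g. ("x", "y", "z")
    literals: tuple  # conjunction of constraints over the binders

# The parse of "The woman walks by the table with a yellow cup." (Figure 1).
figure1 = LogicalForm(
    binders=("x", "y", "z"),
    literals=(
        Literal("woman", ("x",)),
        Literal("walk", ("x",)),
        Literal("near", ("x", "y")),
        Literal("table", ("y",)),
        Literal("hold", ("x", "z")),
        Literal("yellow", ("z",)),
        Literal("cup", ("z",)),
    ),
)
```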

4 Model

We develop an approach that combines a semantic parser with a vision system at training time, but does not require the vision system at test time.

4.1 Semantic Parsing

We adopt a semantic parsing framework similar to that of Artzi and Zettlemoyer (2013), although the general approach of using vision as weak supervision for semantic parsing generalizes to other parsers. CCG-based parsing employs a small number of fixed unary and binary derivation rules (Steedman, 2000) while learning a lexicon. In CCG-based parsing, a parser takes as input a sequence of tokens and a lexicon that maps tokens to potential syntactic types and derives parse trees by creating and ranking multiple hypotheses that combine those types together. The syntactic types are richer than in other approaches and include forward and backward function application (the forward and backward slash) in addition to the standard syntactic categories. Each derivation has a current syntactic type that is the result of the application of a sequence of rules. To create a derivation, at each step the parser applies each rule to either an individual subderivation or to a pair of subderivations. This process produces multiple hypotheses. Parsing rules are generic, polymorphic, and language-neutral and include concepts like function application and type raising (Carpenter, 1997). The parser accepts a derivation when the tree reaches a single node. We refer to the single node of the parse tree as the logical form. Figure 2 shows a parse starting with tokens and their syntactic types along with each rule being applied.

Figure 2 derivation for "She takes the cup":

  She    : NP          : λx. person x
  takes  : (S\NP)/NP   : λfgxy. f x, take x y, g y
  the    : NP/N        : λfx. f x
  cup    : N           : λx. cup x

  the cup            (>)  NP    : λx. cup x
  takes the cup      (>)  S\NP  : λfxy. f x, take x y, cup y
  She takes the cup  (<)  S     : λxy. person x, take x y, cup y

Figure 2: A simple sentence parsed into a lambda-calculus expression using a CCG-based grammar. The parse is determined by the lexicon that associates tokens with syntactic and semantic types as well as the order of function applications. Here, we acquire this lexicon and a means to score derivations.

Semantic parsing with CCGs extends this framework to simultaneously derive a logical form while performing syntactic parsing. Each syntactic rule includes a simple semantic component that manipulates the logical form of its arguments. For example, the forward application rule reduces the syntactic type by applying the syntactic type of the right argument to that of the left, while at the same time performing a lambda-calculus reduction of the semantic types of those same arguments. Concretely, consider a case from Figure 2 where a determiner is attached to a noun, the cup. The tokens the and cup are hypothesized to have syntactic types NP/N and N (a function returning NP given an argument on the right side, and a noun) and semantic types λfx. f x and λx. cup(x) (the identity function and a function that adds a cup constraint). These two derivations can be reduced by forward application, denoted by >. Both the syntactic and semantic types are applied and reduced, which means the semantics helps guide the syntax. Derivations that produce illegal operations, such as applying an argument to a constant, are forbidden.
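As a toy illustration of how the semantic side of forward application is just beta-reduction, the snippet below mirrors the "the cup" step of Figure 2 using ordinary Python functions in place of lambda-calculus terms; this is only an analogy, not the parser's term machinery.

```python
# Toy semantics for the forward-application (>) step on "the cup".
# Syntactic side: NP/N applied to N yields NP.
# Semantic side:  (λf x. f x) applied to (λx. cup x) reduces to (λx. cup x).

the = lambda f: (lambda x: f(x))     # NP/N : identity on its noun argument
cup = lambda x: ("cup", x)           # N    : adds a cup constraint on x

the_cup = the(cup)                   # NP   : λx. cup x, after reduction

assert the_cup("y") == ("cup", "y")  # the resulting NP still constrains its variable
```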

Following Zettlemoyer and Collins (2005) and Curran et al. (2007), we adopt a weighted linear semantic parser. For each sentence paired with its hypothesized derivation, this approach computes a feature vector φ and a parameter vector θ. Given a sentence s, a parse p, a lexicon Λ, the set of all possible parses for that sentence with that lexicon, P(s, Λ), and an n-dimensional feature vector computed for that sentence and parse, φ(s, p), the parser optimizes

    argmax_{p ∈ P(s,Λ)} θ · φ(s, p)    (1)

to find the best parse p*. Using a fixed-width beam search, the parser enumerates derivations by choosing a potential syntactic and semantic type for each token from the lexicon and choosing a set of derivation rules to apply. For the i-th training sample d_i, consisting of a sentence d_i^s and a video d_i^v in dataset D and the feature function, the parser finds margin-violating positive, E^+, and negative, E^−, parses, and then uses

    θ + (1 / |E_i^+|) Σ_{e ∈ E_i^+} φ_i(e, d_i^v) − (1 / |E_i^−|) Σ_{e ∈ E_i^−} φ_i(e, d_i^v)    (2)

to update the parameter θ. After each sweep through the dataset, the lexicon Λ is augmented using the modified GENLEX from Artzi and Zettlemoyer (2013), which does not require the ground-truth logical form. At no point is the logical form needed for updating the lexicon or parameters; we rely instead on a visual validation function to compute the margin-violating examples.
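Equations (1) and (2) translate almost directly into code. The sketch below assumes each candidate parse exposes a precomputed NumPy feature vector `phi` (a hypothetical attribute); `update_theta` could play the role of the helper of the same name in the earlier training-loop sketch.

```python
import numpy as np

def best_parse(candidates, theta):
    """Equation (1): argmax over the beam of theta . phi(s, p)."""
    return max(candidates, key=lambda p: float(np.dot(theta, p.phi)))

def update_theta(theta, positives, negatives):
    """Equation (2): step toward the mean features of margin-violating positive
    parses and away from the mean features of margin-violating negative parses."""
    if positives:
        theta = theta + sum(p.phi for p in positives) / len(positives)
    if negatives:
        theta = theta - sum(p.phi for p in negatives) / len(negatives)
    return theta
```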

Rather than attempting to learn a fixed lexicon that directly maps tokens to semantic and syntactic parses, we use a factored lexicon like that of Kwiatkowski et al. (2011). This represents tokens and any associated constants separately from potential syntactic and semantic types. For example, the token chair is associated with a single constant chair; chair ⊢ [chair]. In addition to the token-constant pairs, there exists a list of pairs of syntactic and semantic types along with placeholders for constants; in the case of chair, a useful type might be λv.[N : λx. placeholder(x)]. When parsing, each token is applied to a potential syntactic and semantic type and the derivation proceeds from there. The factored lexical entries allow for far greater reuse; the model learns a small number of constants that a word can imply separately from a small number of syntactic and semantic types for any word. The weighted linear CCG-based parser searches over potential lexical entries, applying the token to different syntactic and semantic types, and over multiple hypotheses for which rule should be applied. At training time, in order to learn a reasonable lexicon and set of parameters, a supervision signal is required to validate candidates. We provide that supervision using the vision system described below.
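The factoring can be pictured with a small sketch, using hypothetical structures: token-to-constant pairs stored separately from a shared list of templates that pair a syntactic type with a semantic form containing a placeholder to be filled by a constant.

```python
# Factored lexicon sketch (hypothetical representation, not the SPF data structures).
token_constants = {
    "chair": ["chair"],
    "cup": ["cup"],
    "takes": ["take"],
}

# Templates pair a syntactic type with a semantic form containing a placeholder.
templates = [
    ("N", "lambda x. {placeholder}(x)"),                                 # nouns
    ("(S\\NP)/NP", "lambda f g x y. f(x), {placeholder}(x, y), g(y)"),   # transitive verbs
]

def lexical_entries(token):
    """Cross the token's constants with every template to get candidate entries."""
    for constant in token_constants.get(token, []):
        for syntax, semantics in templates:
            yield token, syntax, semantics.format(placeholder=constant)

# e.g. list(lexical_entries("chair")) includes ("chair", "N", "lambda x. chair(x)")
```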

4.2 Sentence Tracking

To score a video-parse pair, we employ a framework similar to that of Yu et al. (2015). This approach constructs a parse-specific model by extracting the number of participants in the scene described by a caption as well as the relationships and properties of those participants. It builds a graphical model where each participant is localized by an object tracker and each relationship is encoded by temporal models that express the properties of the trackers that those models refer to. The parser's output representation is chosen to make building the vision system possible. Each target logical form is a lambda expression with a set of binders, whose domain is the set of objects, and a conjunction of constraints that refer to those binders. In essence, this notes which objects should be present in a scene and what static and changing properties and relationships those objects should have with respect to one another.

The Sentence Tracker creates one Viterbi-based tracker for each participant and, given a mapping from constraints to Hidden Markov models (HMMs), connects each tracker and each constraint together. Given a video v and a parse p, first a large number of object detections are computed for the video by using a low confidence threshold of an object detector. Trackers weave these bounding-box detections into high-scoring object tracks and use constraints to verify if the tracks have the desired properties and relations. Inference proceeds jointly between vision and the parse to allow the parse to focus the vision component on events and properties that might otherwise be missed.

Understanding the relationship between a sentence and a video requires finding the objects that the sentence refers to and determining if those objects follow the behavior implied by the sentence. We carry out a joint optimization that finds objects whose behavior follows certain rules. For clarity, the two steps are presented separately, while we find the global optimum for a linear combination of Equation (3) and Equation (4). Object trackers are a maximum-entropy Markov model with a per-frame score f, the likelihood that any one object detection is true, as well as a motion-coherence score g, the likelihood that the bounding boxes selected between frames refer to the same object instance. Given a parse p with L participants and a video v of length T, Equation (3) shows the optimization where J is a set of L candidate tracks ranging over every hypothesis from the object detector and b is a candidate object detection.

    max_J Σ_{l=1}^{L} ( Σ_{t=1}^{T} f(b^t_{j^t_l}) + Σ_{t=1}^{T} g(b^{t−1}_{j^{t−1}_l}, b^t_{j^t_l}) )    (3)
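For a single participant, the optimization in Equation (3) is a standard Viterbi pass over candidate detections. The sketch below assumes hypothetical inputs: `frame_scores[t][j]` gives the per-frame score f of detection j at frame t, and `motion_score(t, j_prev, j)` gives the coherence score g between consecutive selections; it is illustrative only and ignores the joint term with the constraint HMMs.

```python
import numpy as np

def best_track(frame_scores, motion_score):
    """Viterbi over candidate detections for one participant (cf. Equation 3).
    Returns the highest-scoring sequence of detection indices, one per frame."""
    T = len(frame_scores)
    score = np.array(frame_scores[0], dtype=float)
    back = []
    for t in range(1, T):
        n_prev, n_cur = len(frame_scores[t - 1]), len(frame_scores[t])
        trans = np.array([[motion_score(t, jp, j) for j in range(n_cur)]
                          for jp in range(n_prev)])
        total = score[:, None] + trans              # score of reaching j via predecessor jp
        back.append(np.argmax(total, axis=0))       # best predecessor for each detection j
        score = total.max(axis=0) + np.array(frame_scores[t], dtype=float)
    # Backtrace the best track from the final frame.
    j = int(np.argmax(score))
    track = [j]
    for b in reversed(back):
        j = int(b[j])
        track.append(j)
    return list(reversed(track))
```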

Determining if an object track follows a set of behaviors implied by a sentence is done using a collection of HMMs. Each has a per-frame score h that observes one or more object tracks, depending on the number of participants in the behavior being modeled, and a transition function a that determines the temporal sequence of the behavior. Given a parse p with C behaviors, also termed constraints, along with a video v of length T, Equation (4) shows the optimization where K is a set of states, one for each constraint, and γ is a linking function.

    max_{J,K} Σ_{c=1}^{C} ( Σ_{t=1}^{T} h_c(b^{t−1}_{j^{t−1}_{γ^1_c}}, b^t_{j^t_{γ^2_c}}, k^t_c) + Σ_{t=1}^{T} a_c(k^{t−1}_c, k^t_c) )    (4)

The linking function is an indicator variable that encodes the structure of the logical form, thereby filling in the correct trackers as arguments for the corresponding constraints. The exposition above presents a variant using binary constraints that is trivially generalized to n-ary constraints by extending γ and adding arguments to the appropriate constraint observation functions h_c. The domain of the optimization problem is the combination of all objects at all timesteps that the logical form can refer to as well as every state of each constraint. The Viterbi algorithm carries out this optimization in time linear in the length of the video and quadratic in the number of detections per frame. The result is a likelihood of the parse being true of a video. This is used to create the joint model that supervises the parser with vision. The tracker can also produce a time series of bounding boxes that make explicit the groundings of the sentences, though we do not use these directly here.
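Concretely, γ only records which participant track fills each argument slot of each constraint. A minimal illustration, built from the Figure 1 logical form and using plain Python containers rather than the model's actual indicator variables:

```python
# gamma maps each constraint (literal) to the indices of the tracks it observes.
# Binders ("x", "y", "z") correspond to participant tracks 0, 1, 2.
binders = ("x", "y", "z")
literals = [("woman", ("x",)), ("near", ("x", "y")), ("hold", ("x", "z"))]

gamma = {
    (pred, args): tuple(binders.index(a) for a in args)
    for pred, args in literals
}
# gamma[("near", ("x", "y"))] == (0, 1): the "near" HMM observes tracks 0 and 1.
```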

4.3 Joint Model

At training time, we jointly learn using both the semantic parser and the language-vision component. At test time, only the parser is used. Two parameters are learned, a set of weights θ and the lexicon Λ. For both the parser and the associated language-vision component, Λ is used to structure inference. To induce new lexical entries, we employ a variant of GENLEX (Artzi and Zettlemoyer, 2013) that takes as input a validation function — the compatibility between a parse and the video. This GENLEX uses an ontology of predicates, a validation function, and templates from the current lexicon to construct new syntactic and semantic forms. A ground-truth logical form is not required or used.

The joint model must learn these parameters despite three sources of noise. First, the vision-language component may simply fail to produce the correct likelihood because machine vision is far from perfect. Overcoming this requires large beam widths to avoid falling into local minima due to these errors.

Second, an infinite number of possibly-erroneous parses are true of a video. When children learn language, they face this same challenge as they do not have access to bounding boxes or to logical forms. The parse λx. person(x), as well as many other seemingly reasonable parses, are true and cannot be distinguished from the ground-truth parse — which is not available — by the vision component. This is a far less constrained environment than other approaches to semantic parsing. It is easy to be misguided by a loss function that is often true when it should not be and thus create many special-purpose definitions of words that happen to fit the peculiarities of any video. This results in two different problems: assigning empty semantics to many words, since the likelihood of a subset of a parse is always the same or higher than the whole parse, and excessive polysemy, where the meaning of a word is highly specific to some irrelevant feature in a video. We introduce two features to the parser that bias it against empty semantics and against excessive polysemy. Models of communication such as the Rational Speech Acts model (Frank and Goodman, 2012) predict that speakers will avoid inserting meaningless words. One feature counts the number of predicates mapped onto semantic forms which are empty that occur in each parse. The other feature attempts to prevent excessive polysemy by counting how many new semantic forms are introduced for existing tokens by the generated entries from each parse. As the parser becomes more capable of handling sentences in the training set, these features begin to bias it against adding empty semantics and new semantic forms.
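The two bias features amount to simple counts over a hypothesized parse and the lexical entries it triggered. The sketch below uses a hypothetical dictionary representation of lexical entries, chosen only to show what is being counted.

```python
def empty_semantics_feature(parse_entries):
    """Feature 1: number of tokens in a hypothesized parse whose lexical entry
    carries empty semantics (e.g., an identity function adding no predicate)."""
    return sum(1 for entry in parse_entries if entry["semantics_is_empty"])

def polysemy_feature(generated_entries, lexicon):
    """Feature 2: number of newly generated semantic forms for tokens that
    already have entries in the lexicon, i.e., how much polysemy this parse adds."""
    existing_tokens = {e["token"] for e in lexicon}
    existing_pairs = {(e["token"], e["semantics"]) for e in lexicon}
    return sum(1 for e in generated_entries
               if e["token"] in existing_tokens
               and (e["token"], e["semantics"]) not in existing_pairs)
```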

Third, models in computer vision are computationally expensive while many evaluations of parse-video pairs are required to train a parser. To overcome this, we construct a provably-correct cache that keeps track of failing subexpressions. This is possible because of a feature of this particular vision-language scoring function: the score decreases monotonically with the number of constraints. With these improvements, the modified semantic parser employing vision-language-based validation learns to map sentences into semantic parses despite facing a challenging setting with few examples and much ambiguity.
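Because the score only decreases as constraints are added, any parse whose constraints are a superset of an already-failed constraint set must also fail, so the expensive vision call can be skipped. A small cache exploiting this property is sketched below, keyed by frozensets of constraint literals; this is an illustration of the idea, not the paper's implementation.

```python
class FailureCache:
    """Caches constraint sets the vision system scored below threshold. Sound
    because the score decreases monotonically as constraints are added, so any
    superset of a failing set must also fail."""

    def __init__(self):
        self.failed = []                      # minimal failing frozensets of literals

    def known_to_fail(self, constraints):
        c = frozenset(constraints)
        return any(f <= c for f in self.failed)

    def record_failure(self, constraints):
        c = frozenset(constraints)
        # Drop any recorded set that the new, smaller failing set subsumes.
        self.failed = [f for f in self.failed if not c <= f]
        self.failed.append(c)

def score_with_cache(video, constraints, cache, score_fn, threshold=0.5):
    """Skip the vision call when a cached failing subexpression already rules it out."""
    if cache.known_to_fail(constraints):
        return 0.0
    s = score_fn(video, constraints)
    if s <= threshold:
        cache.record_failure(constraints)
    return s
```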

5 Dataset

We collected and annotated a dataset of captioned videos with fully annotated semantic parses of the captions. The videos contain people carrying out one of 15 actions, such as picking things up and putting things down, with one of 20 objects spanning 10 different colors. We control for 11 spatial relations between objects and actors. Many videos depict multiple agents performing actions, leading to additional ambiguity. Videos were filmed in multiple locations with multiple agents, but care was taken to ensure that the background and agents are not informative of the events depicted.

On Mechanical Turk we asked participants to provide sentences that describe something about the video. We did not specify what participants should describe to avoid biasing them and to add richness to the dataset. This sometimes led to sentences that referred to properties of the video that are well beyond the capacities of the vision system, e.g., descriptions of an agent being lazy or references to the camera's movement. We removed such sentences. At training time, the parser receives captioned videos but no annotations about which objects those captions refer to. Each sentence was annotated with a ground-truth semantic form by two trained annotators using a set of 34 predicates. Each sentence was then reviewed and corrected by one other annotator.

To detect the objects in the videos, we used two off-the-shelf detectors, OpenPose (Cao et al., 2017) for person detection and YOLO version 3 (Redmon and Farhadi, 2018) for the remaining objects. In each case we significantly lowered the confidence threshold to avoid false negatives. Many objects in this dataset are small and are handled by humans, which leads to regular object detector failures that are only partially compensated for by lowering the detection threshold at the cost of a large number of false positives. We rely on the inference mechanism of the grounded parser to automatically eliminate these numerous false positives as candidates when grounding sentences due to their low likelihoods. False negatives are much more misleading and difficult to overcome than false positives. It is harder to read in where an unseen object might be than to eliminate a low-confidence detection.

In total, the dataset contains 1200 captions from 401 videos, which were selected out of a larger body of sentences collected and pruned as described above. This is comparable to the size of other datasets used for semantic parsing, such as two datasets from Tang and Mooney (2001) with 880 and 640 examples respectively and the navigation instruction dataset (Chen and Mooney, 2011) with 706 examples (containing 3236 single sentences). The sentences comprising our dataset contain 169 unique tokens, with an average of 7.93 tokens per caption. There is an average of 2.31 objects per caption.

6 Evaluation

6.1 Experimental Setup

We adapted the Cornell SPF (Semantic Parsing Framework) developed by Artzi (2016) to jointly reason about sentences and videos. We selected 720 examples for training and used 120 examples for the validation set to fine-tune the model parameters. We used the remaining 360 examples for the test set. This split was fixed and used in all experiments below. No sentences or videos occurred in both the training and test sets. During training, each hypothesized parse for each sentence is marked as either correct or incorrect, using either direct supervision with the target parse or compatibility with the video, depending on the experiment.

We use beams of 80 for the CKY parser and GENLEX. CCG-based semantic parsers are seeded with a small number of generic combinations of syntactic and semantic types. For example, Artzi (2016) seeds with 141 lexical entries; we provide 98. GENLEX uses these entries along with an ontology to form new syntactic and semantic types.


                                        Precision        Recall           F1
Direct supervision                      0.851 / 0.946    0.84 / 0.933     0.846 / 0.939
Noisy supervision (60%)                 0.235 / 0.423    0.201 / 0.362    0.217 / 0.390
Shuffled labels (direct supervision)    0.147 / 0.384    0.122 / 0.321    0.136 / 0.349
Shuffled videos (weak supervision)      0.000 / 0.106    0.000 / 0.103    0.000 / 0.104
Object-only vision                      0.051 / 0.387    0.042 / 0.349    0.046 / 0.367
Vision-language                         0.223 / 0.663    0.183 / 0.553    0.201 / 0.591

Figure 3: Pairs of results for each condition. In each cell, the left value is the exact match result and the right value (shown in italics in the original) is the near miss result. In the case of direct supervision, we train with the target parses. In the case of noisy supervision, a percentage of the time (60% here) the parser randomly accepts or rejects a parse. In the case of shuffled labels, the target logical forms are assigned to random sentences. For shuffled videos the sentences are assigned to random videos. The likelihood of any sentence being true of a random video is low. In the case of object-only vision, the vision system consists solely of an object detector, discarding any other predicates. The full vision-language approach learns to parse a significant fraction of sentences, far outperforming the object-only approach, and usually being within one predicate of the correct answer.

6.2 Results

Figures 3 and 4 summarize the experiments and ablation studies performed. The metrics we use when reporting results are exact matches, where the predicted parses must perfectly match the target parses, and near misses, where a single predicate in the semantic parse is allowed to differ from the target. Experiments were averaged across 5 runs.
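The two metrics can be pinned down concretely by treating a parse as a multiset of literals, as in the sketch below; this is a simplification of full logical-form matching, shown only to make "near miss" precise under that assumption.

```python
from collections import Counter

def exact_match(predicted, target):
    """True when the predicted and target parses contain exactly the same literals."""
    return Counter(predicted) == Counter(target)

def near_miss(predicted, target):
    """True when at most one literal differs: one extra, one missing, or one substituted."""
    p, t = Counter(predicted), Counter(target)
    return sum((p - t).values()) <= 1 and sum((t - p).values()) <= 1

# Example from Figure 5 (iii): an exact match.
gold = [("person", "x"), ("walk", "x"), ("toward", "x", "y"), ("chair", "y")]
pred = [("person", "x"), ("walk", "x"), ("toward", "x", "y"), ("chair", "y")]
assert exact_match(pred, gold) and near_miss(pred, gold)
```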

To establish chance-level performance, we trained the directly supervised approach on shuffled labels, assigning random correct parses to random sentences. This is more powerful than a simple chance-level performance calculation as the parser can still take advantage of any dataset biases. Even with the ability to exploit potential biases, performance is very low, with F1 scores of 0.136 and 0.349 for the exact and near miss metrics. Both metrics pose a challenging learning problem.

As a baseline, we directly supervised the parser with the target logical forms. When doing so, it achieved high performance with F1 scores of 0.841 and 0.911 for the exact match and near miss cases. Figure 4 shows performance of direct supervision as a function of training set size.

We then added noise to the directly supervised parser. Doing so simulates the unreliable nature of vision and, to an extent, the ambiguities inherent in vision. Noise was introduced by modifying the compatibility function which determines if a parse is correct. A certain percentage of the time, that function returned true or false randomly when given a hypothesized logical form. With around 60% noise, performance was 0.22 and 0.39 F1 for the exact and near miss cases. Figure 4 shows performance of the noisy baseline as a function of how much noise was introduced.

The fully grounded parser produced 0.2 and 0.6 F1 scores for the exact and near miss metrics. This is far beyond chance performance and corresponds to direct supervision with around 55% noise. There are a number of reasons why performance is not perfect. First, the evaluation metrics cannot consider equivalences in meaning, just form. A hypothesized parse may carry the same meaning as the target logical form yet it will be considered incorrect. This is less of a problem with direct supervision, where the preferences that annotators have for a particular way of encoding the meaning of a sentence can be learned. In the grounded case, this cannot be learned; visually equivalent parses are equally likely. Second, computer vision is unreliable, i.e., object detectors fail. We find that in many of our videos, while person detection is fairly reliable, object detection is unreliable. Third, vision in the real world is very ambiguous. Predicates like hold are true in almost every interaction. This makes learning the meanings of words much more difficult, resulting in the grounded parser often adding useless entries into the predicted logical forms or substituting one predicate for a similar one. The near miss metric shows that overall the parser learned reasonable logical forms. Figure 5 shows six examples from our dataset along with expected and predicted parses, both correct and incorrect.

To understand how much of the performance of the grounded parser comes from visual correlations, like the presence or absence of particular objects, as opposed to more complex and cognitively relevant spatio-temporal relations like actions, we ablated the parser. We removed all features other than objects. The resulting grounded parser accepts any hypothesized parse as long as the objects mentioned in that parse are present in the video. This led to a significant performance drop: near-chance level performance on the exact metric, F1 0.05, and nearly half the F1 score on the near miss metric, 0.37. Having a sophisticated vision system that reasons about agents and interactions is crucial for learning.


[Figure 4: two panels, (a) Exact match and (b) Near miss, plotting F1 against % training data (solid) or % noise (dashed).]

Figure 4: Results from training the grounded semantic parser. In blue, direct supervision as a function of the amount of training data. In dashed blue, noisy supervision uses the whole training set but accepts and rejects parses at random for a given fraction of the time. The red cross is the full vision system while the green o is the object detector ablation. The orange triangle represents shuffled videos and shows chance performance. While direct supervision outperforms vision-only supervision, the grounded parser closes the gap and operates like noisy direct supervision with roughly 55% noise.

7 Discussion

We present a semantic parser that learns the structure of language using weak supervision from vision. At test time, the model parses sentences without the need for visual input. Learning by passive observation in this way extends the capabilities of semantic parsers and points the way to a more cognitively plausible model of language acquisition. Several limits remain. Evaluating parses as correct or incorrect depending on a match to a human-annotated logical form is an overly strict criterion and is a problem that also plagues fully-supervised syntactic parsing (Berzak et al., 2016). Since two logical forms may express the same meaning, it is not yet clear what an effective evaluation metric is for these grounded scenarios. In addition, learning in such a passive scenario is hard, as correlations between events, e.g., every pick up event involves a touch event, are very difficult to disentangle.

An interesting source of error in the experimental results comes from visual ambiguities. At the level of relative motions of labeled bounding boxes, the analysis performed by the language-vision system we employed here has difficulty distinguishing certain parts of actions. For example, carrying a shirt and wearing a shirt appear very similar to one another, as they are actions that mostly involve moving alongside a person detection. Moreover, since every agent is wearing a shirt, it becomes more difficult to learn to distinguish the two actions using positive evidence alone, i.e., a maximum likelihood approach. A more robust vision system, perhaps including object segmentations, person pose, and weak negative evidence for the occurrence of actions, would likely significantly improve the results presented.

In the future, we intend to add a generative model along with a physical simulation allowing the learner to imagine scenarios where a predicate might not hold. This would help mitigate systematic correlations between sentences and videos. The sentences selected here were all chosen such that they are true of the video being shown, yet much of what people discuss is ungrounded, or at least not grounded in the current visual scene. We intend to combine the weakly supervised parser with an unsupervised parser and learn to determine whether a sentence should be grounded visually during training. We hope this work will find applications in robotics, where learning to adapt to the specific language of a user while engaging with them is of utmost importance when deploying robots in users' homes.

Acknowledgments

This work was supported, in part, by the Center for Brains, Minds and Machines, NSF STC award 1231216, NSF Graduate Research Fellowship, Ford Foundation Graduate Research Fellowship, Toyota Research Institute, AFRL contract No. FA8750-15-C-0010, and the MIT-IBM Brain-Inspired Multimedia Comprehension project.


Annotated sentence (i): The woman is picking up an apple.
  Ground-truth parse: λxy. woman x, pick_up x y, apple y
  Predicted parse:    λxy. woman x, pick_up x y, apple y

Annotated sentence (ii): A man walks across the hall holding a chair.
  Ground-truth parse: λxyz. person x, walk x, across x y, hallway y, hold x z, chair z
  Predicted parse:    λxyz. person x, from x y, person y, hold x z, chair z

Annotated sentence (iii): A man is walking toward a chair.
  Ground-truth parse: λxy. person x, walk x, toward x y, chair y
  Predicted parse:    λxy. person x, walk x, toward x y, chair y

Annotated sentence (iv): A man is lifting the chair.
  Ground-truth parse: λxy. person x, pick_up x y, chair y
  Predicted parse:    λxy. person x, pick_up x y, chair y

Annotated sentence (v): She places the toy car down on the table.
  Ground-truth parse: λxyz. person x, put_down x y, toy y, car y, on y z, table z
  Predicted parse:    λxyz. person x, in x y, toy y, car y, on y z, table z

Annotated sentence (vi): A woman reaches for a book on the table.
  Ground-truth parse: λxyz. person x, pick_up x y, book y, on y z, table z
  Predicted parse:    λxyz. person x, stand x, in x y, book y, on y z, table z

Figure 5: Six examples of frames from videos in the dataset along with target and predicted logical forms, showing both successes and failures. Failures are highlighted in red in the original figure. Note how incorrect parses are usually similar to the correct semantic forms. The intended meaning is often preserved even in these cases.


References

Muhannad Al-Omari, Paul Duckworth, David C. Hogg, and Anthony G. Cohn. 2017. Natural language acquisition and grounding for embodied robotic systems. In AAAI, pages 4349–4356.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Yoav Artzi. 2016. Cornell SPF: Cornell semantic parsing framework.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, pages 49–62.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Annual Meeting of the Association for Computational Linguistics.

Yevgeni Berzak, Andrei Barbu, Daniel Harari, and Boris Katz. 2015. Do you see what I mean? Visual resolution of linguistic ambiguities. In Conference on Empirical Methods in Natural Language Processing, pages 1477–1487.

Yevgeni Berzak, Yan Huang, Andrei Barbu, Anna Korhonen, and Boris Katz. 2016. Anchoring and agreement in syntactic annotations. In Conference on Empirical Methods in Natural Language Processing.

Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR.

Bob Carpenter. 1997. Type-Logical Semantics. MIT Press.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.

James R. Curran, Stephen Clark, and Johan Bos. 2007. Linguistically motivated large-scale NLP with C&C and Boxer. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 33–36. Association for Computational Linguistics.

Michael C. Frank and Noah D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.

Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojtek Czarnecki, Max Jaderberg, Denis Teplyashin, et al. 2017. Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551.

Thomas Kollar, Jayant Krishnamurthy, and Grant P. Strimel. 2013. Toward interactive grounded language acquisition. In Robotics: Science and Systems, volume 1, pages 721–732.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Conference on Empirical Methods in Natural Language Processing, pages 1512–1523.

Cynthia Matuszek, Nicholas Fitzgerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning, pages 1671–1678. ACM.

Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv.

Narayanaswamy Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2014. Seeing what you're told: Sentence-guided activity recognition in video. In The IEEE Conference on Computer Vision and Pattern Recognition.

Mark Steedman. 1996. Surface Structure and Interpretation. The MIT Press.

Mark Steedman. 2000. The Syntactic Process. The MIT Press.

Lappoon R. Tang and Raymond J. Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In European Conference on Machine Learning, pages 466–477.

Sida I. Wang, Percy Liang, and Christopher D. Manning. 2016. Learning language games through interaction. In Annual Meeting of the Association for Computational Linguistics.

Haonan Yu, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2015. A compositional framework for grounding language inference, generation, and acquisition in video. Journal of Artificial Intelligence Research.

Luke Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Annual Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 658–666. AUAI Press, Arlington, Virginia.