
Do You See What I Mean? Visual Resolution of Linguistic Ambiguities

Yevgeni Berzak, CSAIL MIT
[email protected]

Andrei Barbu, CSAIL MIT
[email protected]

Daniel Harari, CSAIL MIT
[email protected]

Boris Katz, CSAIL MIT
[email protected]

Shimon Ullman, Weizmann Institute of Science
[email protected]

Abstract

Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by extending a vision model which determines if a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing us to disambiguate sentences in a unified fashion across the different ambiguity types.

1 Introduction

Ambiguity is one of the defining characteristics of human languages, and language understanding crucially relies on the ability to obtain unambiguous representations of linguistic content. While some ambiguities can be resolved using intra-linguistic contextual cues, the disambiguation of many linguistic constructions requires integration of world knowledge and perceptual information obtained from other modalities.

In this work, we focus on the problem of grounding language in the visual modality, and introduce a novel task for language understanding which requires resolving linguistic ambiguities by utilizing the visual context in which the linguistic content is expressed. This type of inference is frequently called for in human communication that occurs in a visual environment, and is crucial for language acquisition, when much of the linguistic content refers to the visual surroundings of the child (Snow, 1972).

Our task is also fundamental to the problem of grounding vision in language, by focusing on phenomena of linguistic ambiguity, which are prevalent in language, but typically overlooked when using language as a medium for expressing understanding of visual content. Due to such ambiguities, a superficially appropriate description of a visual scene may in fact not be sufficient for demonstrating a correct understanding of the relevant visual content. Our task addresses this issue by introducing a deep validation protocol for visual understanding, requiring not only a surface description of a visual activity but also a demonstration of structural understanding at the levels of syntax, semantics and discourse.

To enable the systematic study of visually grounded processing of ambiguous language, we create a new corpus, LAVA (Language and Vision Ambiguities). This corpus contains sentences with linguistic ambiguities that can only be resolved using external information. The sentences are paired with short videos that visualize different interpretations of each sentence. Our sentences encompass a wide range of syntactic, semantic and


discourse ambiguities, including ambiguous prepositional and verb phrase attachments, conjunctions, logical forms, anaphora and ellipsis. Overall, the corpus contains 237 sentences, with 2 to 3 interpretations per sentence, and an average of 3.37 videos that depict visual variations of each sentence interpretation, corresponding to a total of 1679 videos.

Using this corpus, we address the problem of selecting the interpretation of an ambiguous sentence that matches the content of a given video. Our approach for tackling this task extends the sentence tracker introduced in Siddharth et al. (2014). The sentence tracker produces a score which determines if a sentence is depicted by a video. This earlier work had no concept of ambiguities; it assumed that every sentence had a single interpretation. We extend this approach to represent multiple interpretations of a sentence, enabling us to pick the interpretation that is most compatible with the video.

To summarize, the contributions of this paper are threefold. First, we introduce a new task for visually grounded language understanding, in which an ambiguous sentence has to be disambiguated using a visual depiction of the sentence's content. Second, we release a multimodal corpus of sentences coupled with videos which covers a wide range of linguistic ambiguities, and enables a systematic study of linguistic ambiguities in visual contexts. Finally, we present a computational model which disambiguates the sentences in our corpus with an accuracy of 75.36%.

2 Related Work

Previous language and vision studies focused on the development of multimodal word and sentence representations (Bruni et al., 2012; Socher et al., 2013; Silberer and Lapata, 2014; Gong et al., 2014; Lazaridou et al., 2015), as well as methods for describing images and videos in natural language (Farhadi et al., 2010; Kulkarni et al., 2011; Mitchell et al., 2012; Socher et al., 2014; Thomason et al., 2014; Karpathy and Fei-Fei, 2014; Siddharth et al., 2014; Venugopalan et al., 2015; Vinyals et al., 2015). While these studies handle important challenges in multimodal processing of language and vision, they do not provide explicit modeling of linguistic ambiguities.

Previous work relating ambiguity in language to the visual modality addressed the problem of word sense disambiguation (Barnard et al., 2003). However, this work is limited to context independent interpretation of individual words, and does not consider structure-related ambiguities. Discourse ambiguities were previously studied in work on multimodal coreference resolution (Ramanathan et al., 2014; Kong et al., 2014). Our work expands this line of research, and addresses further discourse ambiguities in the interpretation of ellipsis. More importantly, to the best of our knowledge our study is the first to present a systematic treatment of syntactic and semantic sentence level ambiguities in the context of language and vision.

The interactions between linguistic and visual information in human sentence processing have been extensively studied in psycholinguistics and cognitive psychology (Tanenhaus et al., 1995). A considerable fraction of this work focused on the processing of ambiguous language (Spivey et al., 2002; Coco and Keller, 2015), providing evidence for the importance of visual information for linguistic ambiguity resolution by humans. Such information is also vital during language acquisition, when much of the linguistic content perceived by the child refers to their immediate visual environment (Snow, 1972). Over time, children develop mechanisms for grounded disambiguation of language, manifested, among other ways, in the use of iconic gestures when communicating ambiguous linguistic content (Kidd and Holler, 2009). Our study leverages such insights to develop a complementary framework that enables addressing the challenge of visually grounded disambiguation of language in the realm of artificial intelligence.

3 Task

In this work we provide a concrete framework for the study of language understanding with visual context by introducing the task of grounded language disambiguation. This task requires choosing the correct linguistic representation of a sentence given a visual context depicted in a video. Specifically, provided with a sentence, n candidate interpretations of that sentence and a video that depicts the content of the sentence, one needs to choose the interpretation that corresponds to the content of the video.

To illustrate this task, consider the example in figure 1, where we are given the sentence "Sam approached the chair with a bag" along with two different linguistic interpretations.


[Figure 1 here: (a) parse tree of the first interpretation, (b) parse tree of the second interpretation, (c) a frame showing the visual context.]

Figure 1: An example of the visually grounded language disambiguation task. Given the sentence "Sam approached the chair with a bag", two potential parses, (a) and (b), correspond to two different semantic interpretations. In the first interpretation Sam has the bag, while in the second reading the bag is on the chair. The task is to select the correct interpretation given the visual context (c).

In the first interpretation, which corresponds to parse 1(a), Sam has the bag. In the second interpretation, associated with parse 1(b), the bag is on the chair rather than with Sam. Given the visual context from figure 1(c), the task is to choose which interpretation is most appropriate for the sentence.

4 Approach Overview

To address the grounded language disambiguation task, we use a compositional approach for determining if a specific interpretation of a sentence is depicted by a video. In this framework, described in detail in section 6, a sentence and an accompanying interpretation, encoded in first order logic, give rise to a grounded model that matches a video against the provided sentence interpretation.

The model is comprised of Hidden Markov Models (HMMs) which encode the semantics of words, and trackers which locate objects in video frames. To represent an interpretation of a sentence, word models are combined with trackers through a cross-product which respects the semantic representation of the sentence to create a single model which recognizes that interpretation.

Given a sentence, we construct an HMM based representation for each interpretation of that sentence. We then detect candidate locations for objects in every frame of the video.

[Figure 2 here: (a) parse tree of the first interpretation, (b) parse tree of the second interpretation, (c) a frame from the first visual context, (d) a frame from the second visual context.]

Figure 2: Linguistic and visual interpretations of the sentence "Bill held the green chair and bag". In the first interpretation (a, c) both the chair and bag are green, while in the second interpretation (b, d) only the chair is green and the bag has a different color.

Together, the representation for the sentence and the candidate object locations are combined to form a model which can determine if a given interpretation is depicted by the video. We test each interpretation and report the interpretation with the highest likelihood.
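Concretely, the selection step reduces to scoring each candidate interpretation against the video and taking the best one. The Python sketch below illustrates this loop under our own naming; detect_objects, build_interpretation_model and score_model are hypothetical stand-ins for the object detector, the model construction of section 6 and its MAP scoring, not functions from any released code.

```python
def disambiguate(sentence, interpretations, video_frames, detect_objects,
                 build_interpretation_model, score_model):
    """Pick the interpretation of `sentence` best supported by the video.

    `interpretations` are the first order logic encodings of the candidate
    readings; the three callables are hypothetical helpers standing in for
    the detector, the HMM/tracker cross-product construction and the MAP
    scoring described in section 6.
    """
    # Candidate object detections are computed once per video frame.
    detections = [detect_objects(frame) for frame in video_frames]

    best_interpretation, best_score = None, float("-inf")
    for interpretation in interpretations:
        model = build_interpretation_model(sentence, interpretation)
        score = score_model(model, detections)  # MAP score of video vs. reading
        if score > best_score:
            best_interpretation, best_score = interpretation, score
    return best_interpretation
```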

5 Corpus

To enable a systematic study of linguistic ambiguities that are grounded in vision, we compiled a corpus with ambiguous sentences describing visual actions. The sentences are formulated such that the correct linguistic interpretation of each sentence can only be determined using external, non-linguistic, information about the depicted activity. For example, in the sentence "Bill held the green chair and bag", the correct scope of "green" can only be determined by integrating additional information about the color of the bag. This information is provided in the accompanying videos, which visualize the possible interpretations of each sentence. Figure 2 presents the syntactic parses for this example along with frames from the respective videos. Although our videos contain visual uncertainty, they are not ambiguous with respect to the linguistic interpretation they are presenting, and hence a video always corresponds to a single candidate representation of a sentence.

The corpus covers a wide range of well-known syntactic, semantic and discourse ambiguity classes.


While the ambiguities are associated with various types, different sentence interpretations always represent distinct sentence meanings, and are hence encoded semantically using first order logic. For syntactic and discourse ambiguities we also provide an additional, ambiguity-type-specific encoding, as described below.

• Syntax: Syntactic ambiguities include Prepositional Phrase (PP) attachments, Verb Phrase (VP) attachments, and ambiguities in the interpretation of conjunctions. In addition to logical forms, sentences with syntactic ambiguities are also accompanied by Context Free Grammar (CFG) parses of the candidate interpretations, generated from a deterministic CFG parser.

• Semantics: The corpus addresses several classes of semantic quantification ambiguities, in which a syntactically unambiguous sentence may correspond to different logical forms. For each such sentence we provide the respective logical forms.

• Discourse: The corpus contains two types of discourse ambiguities, Pronoun Anaphora and Ellipsis, offering examples comprising two sentences. In anaphora ambiguity cases, an ambiguous pronoun in the second sentence is paired with its candidate antecedents in the first sentence, as well as with a corresponding logical form for the meaning of the second sentence. In ellipsis cases, a part of the second sentence, which can constitute either the subject and the verb, or the verb and the object, is omitted. We provide both interpretations of the omission in the form of a single unambiguous sentence and its logical form, which combines the meanings of the first and the second sentences.

Table 2 lists examples of the different ambiguity classes, along with the candidate interpretations of each example.
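To make the annotation layout concrete, one can think of each corpus item as the record sketched below. The field names and types are our own illustration of the encodings described above (logical forms for every interpretation, CFG parses for syntactic ambiguities, and the videos depicting each reading); they do not reflect the actual file format of the release.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interpretation:
    logical_form: str                 # e.g. "chair(x), move(Claire, x), move(Bill, x)"
    cfg_parse: Optional[str] = None   # provided for syntactic ambiguities only
    video_files: List[str] = field(default_factory=list)  # videos depicting this reading

@dataclass
class LavaItem:
    sentence: str                     # possibly two sentences for discourse ambiguities
    ambiguity_type: str               # "PP", "VP", "Conjunction", "Logical Form", "Anaphora", "Ellipsis"
    interpretations: List[Interpretation]  # 2 or 3 candidate readings
```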

The corpus is generated using Part of Speech (POS) tag sequence templates. For each template, the POS tags are replaced with lexical items from the corpus lexicon, described in table 3, using all the visually applicable assignments. This generation process yields a total of 237 sentences,

Syntax
  PP:            NNP V DT [JJ] NN1 IN DT [JJ] NN2.  (48)
  VP:            NNP1 V [IN] NNP2 V [JJ] NN.  (60)
  Conjunction:   NNP1 [and NNP2] V DT JJ NN1 and NN2. / NNP V DT NN1 or DT NN2 and DT NN3.  (40)

Semantics
  Logical Form:  NNP1 and NNP2 V a NN. / Someone V the NNS.  (35)

Discourse
  Anaphora:      NNP V DT NN1 and DT NN2. It is JJ.  (36)
  Ellipsis:      NNP1 V NNP2. Also NNP3.  (18)

Table 1: POS templates for generating the sentences in our corpus. The number in parentheses following each set of templates is the number of sentences in that category. The sentences are produced by replacing the POS tags with all the visually applicable assignments of lexical items from the corpus lexicon shown in table 3.

of which 213 sentences have 2 candidate interpretations, and 24 sentences have 3 interpretations. Table 1 presents the corpus templates for each ambiguity class, along with the number of sentences generated from each template.
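The generation procedure can be sketched as a template substitution over the lexicon of table 3. The snippet below is a simplified illustration: optional bracketed tags, distinctness between indexed slots (e.g. NN1 vs. NN2), and the filtering of combinations that cannot be filmed are all abstracted into a visually_applicable callback, which is our own placeholder rather than part of the corpus tooling.

```python
import re
from itertools import product

# A fragment of the lexicon in table 3 (proper names stand in for NNP).
LEXICON = {
    "NNP": ["Claire", "Bill", "Sam"],
    "V":   ["picked up", "put down", "held", "approached", "left"],
    "DT":  ["the", "a"],
    "JJ":  ["yellow", "green"],
    "NN":  ["chair", "bag", "telescope"],
}

def instantiate(template, visually_applicable=lambda sentence: True):
    """Expand a POS template such as 'NNP V DT NN1 and DT NN2.' into sentences."""
    tokens = template.rstrip(".").split()
    # Map each token to its candidate fillers; literal words (e.g. "and") fill themselves.
    options = [LEXICON.get(re.sub(r"\d+$", "", tok), [tok]) for tok in tokens]
    sentences = []
    for choice in product(*options):
        sentence = " ".join(choice) + "."
        if visually_applicable(sentence):
            sentences.append(sentence)
    return sentences

# The anaphora template from table 1, without its second sentence ("It is JJ.").
print(len(instantiate("NNP V DT NN1 and DT NN2.")))
```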

The corpus videos are filmed in an indoor environment containing background objects and pedestrians. To account for the manner of performing actions, videos are shot twice with different actors. Whenever applicable, we also filmed the actions from two different directions (e.g. approach from the left, and approach from the right). Finally, all videos were shot with two cameras from two different view points. Taking these variations into account, the resulting video corpus contains 7.1 videos per sentence and 3.37 videos per sentence interpretation, corresponding to a total of 1679 videos. The average video length is 3.02 seconds (90.78 frames), with an overall of 1.4 hours of footage (152434 frames).

A custom corpus is required for this task because no existing corpus, containing either videos or images, systematically covers multimodal ambiguities. Datasets such as UCF Sports (Rodriguez et al., 2008), YouTube (Liu et al., 2009), and HMDB (Kuehne et al., 2011) which come out of the activity recognition community are accompanied by action labels, not sentences, and do not control for the content of the videos aside from the principal action being performed. Datasets for image and video captioning, such as MSCOCO (Lin et al., 2014) and TACOS (Regneri et al., 2013),


Ambiguity / Example / Linguistic interpretations (visual setups in parentheses)

Syntax
  PP. Example: Claire left the green chair with a yellow bag.
    Claire [left the green chair] [with a yellow bag].  (The bag is with Claire.)
    Claire left [the green chair with a yellow bag].  (The bag is on the chair.)
  VP. Example: Claire looked at Bill picking up a chair.
    Claire looked at [Bill [picking up a chair]].  (Bill picks up the chair.)
    Claire [looked at Bill] [picking up a chair].  (Claire picks up the chair.)
  Conjunction. Example: Claire held a green bag and chair.
    Claire held a [green [bag and chair]].  (The chair is green.)
    Claire held a [[green bag] and [chair]].  (The chair is not green.)
  Example: Claire held the chair or the bag and the telescope.
    Claire held [[the chair] or [the bag and the telescope]].  (Claire holds the chair.)
    Claire held [[the chair or the bag] and [the telescope]].  (Claire holds the chair and the telescope.)

Semantics
  Logical Form. Example: Claire and Bill moved a chair.
    chair(x), move(Claire, x), move(Bill, x)  (Claire and Bill move the same chair.)
    chair(x), chair(y), move(Claire, x), move(Bill, y), x ≠ y  (Claire and Bill move different chairs.)
  Example: Someone moved the two chairs.
    chair(x), chair(y), x ≠ y, person(u), move(u, x), move(u, y)  (One person moves both chairs.)
    chair(x), chair(y), x ≠ y, person(u), person(v), u ≠ v, move(u, x), move(v, y)  (Each chair is moved by a different person.)

Discourse
  Anaphora. Example: Claire held the bag and the chair. It is yellow.
    It = bag  (The bag is yellow.)
    It = chair  (The chair is yellow.)
  Ellipsis. Example: Claire looked at Bill. Also Sam.
    Claire looked at Bill and Sam.  (Claire looks at Bill and Sam.)
    Claire and Sam looked at Bill.  (Claire and Sam look at Bill.)

Table 2: An overview of the different ambiguity types, along with examples of ambiguous sentences with their linguistic and visual interpretations. Note that, similarly to semantic ambiguities, syntactic and discourse ambiguities are also provided with first order logic formulas for the resulting sentence interpretations. Table 4 shows additional examples for each ambiguity type, with frames from sample videos corresponding to the different interpretations of each sentence.

Syntactic Category (Visual Category): Words
Nouns (Objects, People): chair, bag, telescope, someone, proper names
Verbs (Actions): pick up, put down, hold, move (transitive), look at, approach, leave
Prepositions (Spatial Relations): with, left of, right of, on
Adjectives (Visual Properties): yellow, green

Table 3: The lexicon used to instantiate the templates in table 1 in order to generate the corpus.

aim to control for more aspects of the videos than just the main action being performed, but they do not provide the range of ambiguities discussed here. The closest dataset is that of Siddharth et al. (2014), as it controls for object appearance, color, action, and direction of motion, making it more likely to be suitable for evaluating disambiguation tasks. Unfortunately, that dataset was designed to avoid ambiguities, and is therefore not suitable for evaluating the work described here.

6 Model

To perform the disambiguation task, we extend the sentence recognition model of Siddharth et al. (2014), which represents sentences as compositions of words. Given a sentence, its first order logic interpretation and a video, our model produces a score which determines if the sentence is depicted by the video. It simultaneously tracks the participants in the events described by the sentence while recognizing the events themselves.

This allows it to be flexible in the presence of noise by integrating top-down information from the sentence with bottom-up information from object and property detectors. Each word in the query sentence is represented by an HMM (Baum et al., 1970), which recognizes tracks (i.e. paths of detections in a video for a specific object) that satisfy the semantics of the given word. In essence, this model can be described as having two layers: one in which object tracking occurs, and one in which words observe tracks and filter out tracks that do not satisfy the word constraints.

Given a sentence interpretation, we construct a sentence-specific model which recognizes if a video depicts the sentence as follows. Each predicate in the first order logic formula has a corresponding HMM, which can recognize if that predicate is true of a video given its arguments. Each variable has a corresponding tracker which attempts to physically locate the bounding box corresponding to that variable in each frame of a video.


PP Attachment: Sam looked at Bill with a telescope.
VP Attachment: Bill approached the person holding a green chair.
Conjunction: Sam and Bill picked up the yellow bag and chair.
Logical Form: Someone put down the bags.
Anaphora: Sam picked up the bag and the chair. It is yellow.
Ellipsis: Sam left Bill. Also Clark.

Table 4: Examples of the six ambiguity classes described in table 2. The example sentences have at least two interpretations, which are depicted by different videos. Three frames from each such video are shown on the left and on the right below each sentence.


[Figure 3 here: (left) a schematic of tracker lattices (track 1 ... track L) crossed with predicate HMMs (predicate 1 ... predicate W) over frames t = 1 ... T; (right) the predicate-variable graphs for the two interpretations of "Claire and Bill moved a chair", one with a shared patient track and one with two patient tracks constrained to be unequal.]

Figure 3: (left) Tracker lattices for every sentence participant are combined with predicate HMMs. The MAP estimate in the resulting cross-product lattice simultaneously finds the best tracks and the best state sequences for every predicate. (right) Two interpretations of the sentence "Claire and Bill moved a chair" having different first order logic formulas. The top interpretation corresponds to Bill and Claire moving the same chair, while the bottom one describes them moving different chairs. Predicates are highlighted in blue at the top and variables are highlighted in red at the bottom. Each predicate has a corresponding HMM which recognizes its presence in a video. Each variable has a corresponding tracker which locates it in a video. Lines connect predicates and the variables which fill their argument slots. Some predicates, such as move and ≠, take multiple arguments. Some predicates, such as move, are applied multiple times between different pairs of variables.

This creates a bipartite graph: HMMs that represent predicates are connected to trackers that represent variables. The trackers themselves are similar to the HMMs, in that they comprise a lattice of potential bounding boxes in every frame. To construct a joint model for a sentence interpretation, we take the cross product of HMMs and trackers, taking only those cross products dictated by the structure of the formula corresponding to the desired interpretation. Given a video, we employ an object detector to generate candidate detections in each frame, construct trackers which select one of these detections in each frame, and finally construct the overall model from HMMs and trackers.
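As a data-structure sketch of this bipartite construction (our own simplification, not the authors' code): each variable in the formula gets a tracker over the per-frame candidate detections, each predicate gets an HMM factor, and the argument mapping determines which trackers each factor is connected to, i.e. which cross products are taken.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]           # (x1, y1, x2, y2)

@dataclass
class Tracker:
    variable: str                         # e.g. "x" in chair(x)
    candidates: List[List[Box]]           # candidates[t]: detections in frame t

@dataclass
class PredicateFactor:
    name: str                             # e.g. "move", "person", "not_equal"
    args: Tuple[str, ...]                 # variables filling its argument slots

def build_bipartite_model(formula, candidates_per_frame):
    """formula is a list of (predicate_name, argument_variables) pairs.

    Returns one tracker per variable and one factor per predicate instance,
    wired according to the argument structure (the mapping called theta in
    the scoring equation below)."""
    trackers: Dict[str, Tracker] = {}
    factors: List[PredicateFactor] = []
    for name, args in formula:            # e.g. ("move", ("Claire", "x"))
        for variable in args:
            trackers.setdefault(variable, Tracker(variable, candidates_per_frame))
        factors.append(PredicateFactor(name, tuple(args)))
    return trackers, factors
```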

Provided an interpretation and its corresponding formula composed of $P$ predicates and $V$ variables, along with a collection of object detections $b^t_j$ (detection $j$ in frame $t$) for each frame of a video of length $T$, the model computes the score of the video-sentence pair by finding the optimal detection for each participant in every frame. This is in essence the Viterbi algorithm (Viterbi, 1971), the MAP algorithm for HMMs, applied to finding the optimal object detection $j^t_v$ for each participant $v$ and the optimal state $k^t_p$ for each predicate HMM $p$ in every frame. Each detection is scored by its confidence from the object detector, $f$, and each object track is scored by a motion coherence metric $g$ which determines if the motion of the track agrees with the underlying optical flow. Each predicate $p$ is scored by the probability $h_p$ of observing a particular detection in a given state, and by the probability $a_p$ of transitioning between states. The structure of the formula and the fact that multiple predicates often refer to the same variables is recorded by $\theta$, a mapping between predicates and their arguments. The model computes the MAP estimate as:

\[
\max_{\substack{j^1_1,\ldots,j^T_1 \\ \cdots \\ j^1_V,\ldots,j^T_V}} \;
\max_{\substack{k^1_1,\ldots,k^T_1 \\ \cdots \\ k^1_P,\ldots,k^T_P}} \;
\sum_{v=1}^{V}\left[\,\sum_{t=1}^{T} f\big(b^t_{j^t_v}\big)
+ \sum_{t=2}^{T} g\big(b^{t-1}_{j^{t-1}_v},\, b^t_{j^t_v}\big)\right]
+ \sum_{p=1}^{P}\left[\,\sum_{t=1}^{T} h_p\big(k^t_p,\, b^t_{j^t_{\theta^1_p}},\, b^t_{j^t_{\theta^2_p}}\big)
+ \sum_{t=2}^{T} a_p\big(k^{t-1}_p,\, k^t_p\big)\right]
\]

for sentences whose words refer to at most two tracks (i.e. transitive verbs or binary predicates); the formulation is trivially extended to arbitrary arities. Figure 3 provides a visual overview of the model as a cross-product of tracker models and word models.
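For intuition, the following sketch runs the corresponding dynamic program for the smallest possible case: one participant and one unary predicate, so that a Viterbi state is just a (detection, HMM state) pair. The scoring functions f, g, h and a are passed in, since in the real system they come from the object detector, optical flow and the trained word HMMs; this is an illustration of the recursion, not the authors' implementation.

```python
def map_score(detections, num_states, f, g, h, a):
    """Joint Viterbi over (detection index j, HMM state k) for one track
    and one unary predicate.

    detections[t] is a non-empty list of candidate boxes in frame t; f(b)
    scores a detection, g(b_prev, b) scores motion coherence, h(k, b) the
    HMM observation and a(k_prev, k) the HMM transition. Returns the best
    total score, i.e. a one-variable, one-predicate special case of the
    MAP estimate above.
    """
    T = len(detections)
    # score[(j, k)] = best score of any assignment ending in detection j, state k
    score = {(j, k): f(b) + h(k, b)
             for j, b in enumerate(detections[0])
             for k in range(num_states)}
    for t in range(1, T):
        new_score = {}
        for j, b in enumerate(detections[t]):
            for k in range(num_states):
                best_prev = max(
                    score[(jp, kp)] + g(detections[t - 1][jp], b) + a(kp, k)
                    for jp in range(len(detections[t - 1]))
                    for kp in range(num_states))
                new_score[(j, k)] = best_prev + f(b) + h(k, b)
        score = new_score
    return max(score.values())
```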

Our model extends the approach of Siddharth et al. (2014) in several ways. First, we depart from the dependency based representation used in that work, and recast the model to encode first order logic formulas. Note that some complex first order logic formulas cannot be directly encoded in the model and require additional inference steps. This extension enables us to represent ambiguities in which a given sentence has multiple logical interpretations for the same syntactic parse.


Second, we introduce several model components which are not specific to disambiguation, but are required to encode linguistic constructions that are present in our corpus and could not be handled by the model of Siddharth et al. (2014). These new components are the predicate "not equal", disjunction, and conjunction. The key addition among these components is support for the new predicate "not equal", which enforces that two tracks, i.e. objects, are distinct from each other. For example, in the sentence "Claire and Bill moved a chair" one would want to ensure that the two movers are distinct entities. In earlier work, this was not required because the sentences tested in that work were designed to distinguish objects based on constraints rather than identity. In other words, there might have been two different people, but they were distinguished in the sentence by their actions or appearance. To faithfully recognize that two actors are moving the chair in the earlier example, we must ensure that they are disjoint from each other. To do this, we create a new HMM for this predicate, which assigns low probability to tracks that heavily overlap, forcing the model to fit two different actors in the previous example. By combining this more expressive model with the new first order logic based semantic representation, used in lieu of a syntactic representation, we can encode the sentence interpretations required to perform the disambiguation task.
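One simple way to realize the "not equal" predicate, sketched below under our own assumptions, is to penalize frames in which the two argument tracks occupy essentially the same bounding box. Here the overlap test is intersection-over-union with an arbitrary threshold and penalty, standing in for whatever overlap measure and observation probabilities the actual HMM uses.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def not_equal_score(box_a, box_b, overlap_threshold=0.5):
    """Per-frame log-score contribution of the 'not equal' predicate:
    tracks that heavily overlap are almost certainly the same object,
    so such frames receive a large penalty (illustrative values)."""
    return -100.0 if iou(box_a, box_b) > overlap_threshold else 0.0
```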

Figure 3 (right) shows an example of two different interpretations of the above discussed sentence "Claire and Bill moved a chair". Object trackers, which correspond to variables in the first order logic representation of the sentence interpretation, are shown in red. Predicates which constrain the possible bindings of the trackers, corresponding to predicates in the representation of the sentence, are shown in blue. Links represent the argument structure of the first order logic formula, and determine the cross products that are taken between the predicate HMMs and tracker lattices in order to form the joint model which recognizes the entire interpretation in a video.

The resulting model provides a single unified formalism for representing all the ambiguities in table 2. Moreover, this approach can be tuned to different levels of specificity. We can create models that are specific to one interpretation of a sentence, or models that are generic and accept multiple interpretations by eliding constraints that are not

common between the different interpretations. This allows the model, like humans, to defer deciding on a particular interpretation, or to infer that multiple interpretations of the sentence are plausible.
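A minimal sketch of this "generic model" idea, treating each interpretation as a set of predicate instances and keeping only the shared ones; this is a naive intersection that ignores variable renaming, so it only illustrates the eliding of non-shared constraints, not the actual procedure.

```python
def shared_constraints(interpretations):
    """Keep only predicate instances common to every candidate reading.

    Each interpretation is a set of (predicate, args) tuples; the returned
    set defines a 'generic' model compatible with all readings."""
    shared = set(interpretations[0])
    for reading in interpretations[1:]:
        shared &= set(reading)
    return shared

# Example: the same-chair vs. different-chairs readings of
# "Claire and Bill moved a chair".
same = {("chair", ("x",)), ("move", ("Claire", "x")), ("move", ("Bill", "x"))}
diff = {("chair", ("x",)), ("chair", ("y",)), ("move", ("Claire", "x")),
        ("move", ("Bill", "y")), ("not_equal", ("x", "y"))}
print(shared_constraints([same, diff]))  # constraints shared by both readings
```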

7 Experimental Results

We tested the performance of the model described in the previous section on the LAVA dataset presented in section 5. Each video in the dataset was pre-processed with object detectors for humans, bags, chairs, and telescopes. We employed a mixture of CNN (Krizhevsky et al., 2012) and DPM (Felzenszwalb et al., 2010) detectors, trained on held-out sections of our corpus. For each object class we generated proposals from both the CNN and the DPM detectors, and trained a scoring function to map both results into the same space. The scoring function consisted of a sigmoid over the confidence of the detectors, trained on the same held-out portion of the training set. As none of the disambiguation examples discussed here rely on the specific identity of the actors, we did not detect their identity. Instead, any sentence which contains names was automatically converted to one which contains arbitrary "person" labels.
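The detector fusion described here amounts to calibrating each detector's raw confidence with a learned sigmoid so that CNN and DPM proposals live in a comparable score space. A sketch under these assumptions (Platt-style scaling with hypothetical per-detector parameters, not the parameters actually used):

```python
import math

def calibrate(confidence, weight, bias):
    """Map a raw detector confidence into (0, 1) with a learned sigmoid."""
    return 1.0 / (1.0 + math.exp(-(weight * confidence + bias)))

def fuse_proposals(cnn_proposals, dpm_proposals, cnn_params, dpm_params):
    """Put CNN and DPM proposals into the same score space.

    Each proposal is (box, raw_confidence); *_params are (weight, bias)
    pairs fit on held-out data. Names and layout are illustrative."""
    pooled = [(box, calibrate(c, *cnn_params)) for box, c in cnn_proposals]
    pooled += [(box, calibrate(c, *dpm_params)) for box, c in dpm_proposals]
    return sorted(pooled, key=lambda proposal: proposal[1], reverse=True)
```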

The sentences in our corpus have either two or three interpretations. Each interpretation has one or more associated videos in which the scene was shot from a different angle, carried out either by different actors, with different objects, or in different directions of motion. For each sentence-video pair, we performed a 1-out-of-2 or 1-out-of-3 classification task to determine which of the interpretations of the corresponding sentence best fits that video. Overall chance performance on our dataset is 49.04%, slightly lower than 50% due to the 1-out-of-3 classification examples.

The model presented here achieved an accuracy of 75.36% over the entire corpus, averaged across all error categories. This demonstrates that the model is largely capable of capturing the underlying task, and that similar compositional cross-modal models may do the same. Across the three major ambiguity classes, the accuracy was 84.26% for syntactic ambiguities, 72.28% for semantic ambiguities, and 64.44% for discourse ambiguities.

The most significant source of model failures is poor object detections. Objects are often rotated and presented at angles that are difficult to recognize. Certain object classes, like the telescope,


are much more difficult to recognize due to their small size and the fact that hands tend to largely occlude them. This accounts for the degraded performance on the semantic ambiguities relative to the syntactic ambiguities, as many more semantic ambiguities involved the telescope. Object detector performance is similarly responsible for the lower performance on the discourse ambiguities, which relied much more on the accuracy of the person detector, as many sentences involve only people interacting with each other without any additional objects. This degrades performance by removing a helpful constraint for inference, according to which people tend to be close to the objects they are manipulating. In addition, these sentences introduced more visual uncertainty as they often involved three actors.

The remaining errors are due to the event models. HMMs can fixate on short sequences of events which seem as if they are part of an action, but in fact are just noise or the prefix of another action. Ideally, one would want an event model which has a global view of the action: if an object went up from the beginning to the end of the video while a person was holding it, it is likely that the object was being picked up. The event models used here cannot enforce this constraint; they merely assert that the object was moving up for some number of frames, an event which can happen due to noise in the object detectors. Enforcing such local constraints instead of the global constraint on the motion of the object over the video makes joint tracking and event recognition tractable in the framework presented here, but can lead to errors. Finding models which strike a better balance between local information and global constraints while maintaining tractable inference remains an area of future work.

8 Conclusion

We present a novel framework for studying ambiguous utterances expressed in a visual context. In particular, we formulate a new task for resolving structural ambiguities using the visual signal. This is a fundamental task for humans, involving complex cognitive processing, and is a key challenge for language acquisition during childhood. We release a multimodal corpus that enables addressing this task, as well as supports further investigation of ambiguity related phenomena in visually grounded language processing.

Finally, we present a unified approach for resolving ambiguous descriptions of videos, achieving good performance on our corpus.

While our current investigation focuses on structural inference, we intend to extend this line of work to learning scenarios, in which the agent has to deduce the meaning of words and sentences from structurally ambiguous input. Furthermore, our framework can be beneficial for image and video retrieval applications in which the query is expressed in natural language. Given an ambiguous query, our approach will enable matching and clustering the retrieved results according to the different query interpretations.

Acknowledgments

This material is based upon work supported by the Center for Brains, Minds, and Machines (CBMM), funded by NSF STC award CCF-1231216. SU was also supported by ERC Advanced Grant 269627 Digital Baby.


References

[Barnard et al.2003] Kobus Barnard, Matthew Johnson, and David Forsyth. 2003. Word sense disambiguation with pictures. In Proceedings of the HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-Linguistic Data - Volume 6, pages 1–5. Association for Computational Linguistics.

[Baum et al.1970] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171.

[Bruni et al.2012] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 136–145. Association for Computational Linguistics.

[Coco and Keller2015] Moreno I. Coco and Frank Keller. 2015. The interaction of visual and linguistic saliency during syntactic ambiguity resolution. The Quarterly Journal of Experimental Psychology, 68(1):46–74.

[Farhadi et al.2010] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision - ECCV 2010, pages 15–29. Springer.

[Felzenszwalb et al.2010] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. 2010. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645.

[Gong et al.2014] Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. 2014. Improving image-sentence embeddings using large weakly annotated photo collections. In Computer Vision - ECCV 2014, pages 529–545. Springer.

[Karpathy and Fei-Fei2014] Andrej Karpathy and Li Fei-Fei. 2014. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306.

[Kidd and Holler2009] Evan Kidd and Judith Holler. 2009. Children's use of gesture to resolve lexical ambiguity. Developmental Science, 12(6):903–913.

[Kong et al.2014] Chen Kong, Dahua Lin, Mayank Bansal, Raquel Urtasun, and Sanja Fidler. 2014. What are you talking about? Text-to-image coreference. In Computer Vision and Pattern Recognition (CVPR), pages 3558–3565. IEEE.

[Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

[Kuehne et al.2011] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2556–2563. IEEE.

[Kulkarni et al.2011] G. Kulkarni, V. Premraj, S. Dhar, Siming Li, Yejin Choi, A. C. Berg, and T. L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1601–1608. IEEE Computer Society.

[Lazaridou et al.2015] Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. CoRR, abs/1501.02598.

[Lin et al.2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014, pages 740–755. Springer.

[Liu et al.2009] Jingen Liu, Jiebo Luo, and Mubarak Shah. 2009. Recognizing realistic actions from videos in the wild. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1996–2003. IEEE.

[Mitchell et al.2012] Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756. Association for Computational Linguistics.

[Ramanathan et al.2014] Vignesh Ramanathan, Armand Joulin, Percy Liang, and Li Fei-Fei. 2014. Linking people in videos with their names using coreference resolution. In Computer Vision - ECCV 2014, pages 95–110. Springer.

[Regneri et al.2013] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.

[Rodriguez et al.2008] Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. 2008. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In Computer Vision and Pattern Recognition, pages 1–8.

[Siddharth et al.2014] Narayanaswamy Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2014. Seeing what you're told: Sentence-guided activity recognition in video. In Computer Vision and Pattern Recognition (CVPR), pages 732–739. IEEE.

[Silberer and Lapata2014] Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proceedings of ACL, pages 721–732.

[Snow1972] Catherine E. Snow. 1972. Mothers' speech to children learning language. Child Development, pages 549–565.

[Socher et al.2013] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943.

[Socher et al.2014] Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

[Spivey et al.2002] Michael J. Spivey, Michael K. Tanenhaus, Kathleen M. Eberhard, and Julie C. Sedivy. 2002. Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45(4):447–481.

[Tanenhaus et al.1995] Michael K. Tanenhaus, Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217):1632–1634.

[Thomason et al.2014] Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney. 2014. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August.

[Venugopalan et al.2015] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado.

[Vinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR).

[Viterbi1971] A. J. Viterbi. 1971. Convolutional codes and their performance in communication systems. Communications of the IEEE, 19:751–772, October.