Unsupervised Visual Sense Disambiguation for Verbs using ... · Verb type Examples Verbs Images Senses Depct ITA Motion run, walk, jump, etc. 39 1812 10.76 5.79 0.680 Non-motion sit,

Proceedings of NAACL-HLT 2016, pages 182–192,San Diego, California, June 12-17, 2016. c©2016 Association for Computational Linguistics

Unsupervised Visual Sense Disambiguation for Verbs using MultimodalEmbeddings

Spandana Gella, Mirella Lapata and Frank KellerInstitute for Language, Cognition and Computation

School of Informatics, University of Edinburgh10 Crichton Street, Edinburgh EH8 9AB

[email protected], [email protected], [email protected]

Abstract

We introduce a new task, visual sense disam-biguation for verbs: given an image and a verb,assign the correct sense of the verb, i.e., theone that describes the action depicted in theimage. Just as textual word sense disambigua-tion is useful for a wide range of NLP tasks,visual sense disambiguation can be useful formultimodal tasks such as image retrieval, im-age description, and text illustration. We intro-duce VerSe, a new dataset that augments exist-ing multimodal datasets (COCO and TUHOI)with sense labels. We propose an unsupervisedalgorithm based on Lesk which performs vi-sual sense disambiguation using textual, vi-sual, or multimodal embeddings. We find thattextual embeddings perform well when gold-standard textual annotations (object labels andimage descriptions) are available, while mul-timodal embeddings perform well on unanno-tated images. We also verify our findings byusing the textual and multimodal embeddingsas features in a supervised setting and analysethe performance of visual sense disambigua-tion task. VerSe is made publicly availableand can be downloaded at: https://github.com/spandanagella/verse.

1 Introduction

Word sense disambiguation (WSD) is a widely stud-ied task in natural language processing: given a wordand its context, assign the correct sense of the wordbased on a pre-defined sense inventory (Kilgarrif,1998). WSD is useful for a range of NLP tasks,including information retrieval, information extrac-tion, machine translation, content analysis, and lex-icography (see Navigli (2009) for an overview).

Figure 1: Visual sense ambiguity: three of the sensesof the verb play.

Standard WSD disambiguates words based on theirtextual context; however, in a multimodal setting(e.g., newspaper articles with photographs), visualcontext is also available and can be used for disam-biguation. Based on this observation, we introducea new task, visual sense disambiguation (VSD) forverbs: given an image and a verb, assign the correctsense of the verb, i.e., the one depicted in the image.While VSD approaches for nouns exist, VSD forverbs is a novel, more challenging task, and relatedin interesting ways to action recognition in computervision. As an example consider the verb play, whichcan have the senses participate in sport, play on aninstrument, and be engaged in playful activity, de-pending on its visual context, see Figure 1.

We expect visual sense disambiguation to be use-ful for multimodal tasks such as image retrieval. Asan example consider the output of Google ImageSearch for the query sit: it recognizes that the verbhas multiple senses and tries to cluster relevant im-ages. However, the result does not capture the pol-ysemy of the verb well, and would clearly benefitfrom VSD (see Figure 2).

Visual sense disambiguation has previously beenattempted for nouns (e.g., apple can mean fruit orcomputer), which is a substantially easier task thatcan be solved with the help of an object detector

182

Figure 2: Google Image Search trying to disam-biguate sit. All clusters pertain to the sit down sense,other senses (baby sit, convene) are not included.

(Barnard et al., 2003; Loeff et al., 2006; Saenko andDarrell, 2008; Chen et al., 2015). VSD for nouns ishelped by resources such as ImageNet (Deng et al.,2009), a large image database containing 1.4 millionimages for 21,841 noun synsets and organized ac-cording to the WordNet hierarchy. However, we arenot aware of any previous work on VSD for verbs,and no ImageNet for verbs exists. Not only imageretrieval would benefit from VSD for verbs, but alsoother multimodal tasks that have recently receiveda lot of interest, such as automatic image descrip-tion and visual question answering (Karpathy andLi, 2015; Fang et al., 2015; Antol et al., 2015).

In this work, we explore the new task of visualsense disambiguation for verbs: given an image anda verb, assign the correct sense of the verb, i.e., theone that describes the action depicted in the image.We present VerSe, a new dataset that augments exist-ing multimodal datasets (COCO and TUHOI) withsense labels. VerSe contains 3518 images, each an-notated with one of 90 verbs, and the OntoNotessense realized in the image. We propose an algo-rithm based on the Lesk WSD algorithm in order toperform unsupervised visual sense disambiguationon our dataset. We focus in particular on how to bestrepresent word senses for visual disambiguation,and explore the use of textual, visual, and multi-modal embeddings. Textual embeddings for a givenimage can be constructed over object labels or imagedescriptions, which are available as gold-standard inthe COCO and TUHOI datasets, or can be computedautomatically using object detectors and image de-scription models.

Our results show that textual embeddings per-form best when gold-standard textual annotationsare available, while multimodal embeddings per-form best when automatically generated object la-bels are used. Interestingly, we find that automati-cally generated image descriptions result in inferiorperformance.

Dataset Verbs Acts Images Sen DesPPMI (Yao and Fei-Fei, 2010) 2 24 4800 N NStanford 40 Actions (Yao et al., 2011) 33 40 9532 N NPASCAL 2012 (Everingham et al., 2015) 9 11 4588 N N89 Actions (Le et al., 2013) 36 89 2038 N NTUHOI (Le et al., 2014) – 2974 10805 N NCOCO-a (Ronchi and Perona, 2015) 140 162 10000 N YHICO (Chao et al., 2015) 111 600 47774 Y NVerSe (our dataset) 90 163 3518 Y Y

Table 1: Comparison of VerSe with existing actionrecognition datasets. Acts (actions) are verb-objectpairs; Sen indicates whether sense ambiguity is ex-plicitly handled; Des indicates whether image de-scriptions are included.

2 Related Work

There is an extensive literature on word sense disam-biguation for nouns, verbs, adjectives and adverbs.Most of these approaches rely on lexical databasesor sense inventories such as WordNet (Miller et al.,1990) or OntoNotes (Hovy et al., 2006). Unsuper-vised WSD approaches often rely on distributionalrepresentations, computed over the target word andits context (Lin, 1997; McCarthy et al., 2004; Brodyand Lapata, 2008). Most supervised approaches usesense annotated corpora to extract linguistic featuresof the target word (context words, POS tags, col-location features), which are then fed into a classi-fier to disambiguate test data (Zhong and Ng, 2010).Recently, features based on sense-specific seman-tic vectors learned using large corpora and a senseinventory such as WordNet have been shown toachieve state-of-the-art results for supervised WSD(Rothe and Schutze, 2015; Jauhar et al., 2015).

As mentioned in the introduction, all existingwork on visual sense disambiguation has usednouns, starting with Barnard et al. (2003). Sense dis-crimination for web images was introduced by Lo-eff et al. (2006), who used spectral clustering overmultimodal features from the images and web text.Saenko and Darrell (2008) used sense definitionsin a dictionary to learn a latent LDA space overssenses, which they then used to construct sense-specific classifiers by exploiting the text surroundingan image.

2.1 Related Datasets

Most of the datasets relevant for verb sense disam-biguation were created by the computer vision com-munity for the task of human action recognition (see

183

Table 1 for an overview). These datasets are anno-tated with a limited number of actions, where anaction is conceptualized as verb-object pair: ridehorse, ride bicycle, play tennis, play guitar, etc.Verb sense ambiguity is ignored in almost all actionrecognition datasets, which misses important gener-alizations: for instance, the actions ride horse andride bicycle represent the same sense of ride andthus share visual, textual, and conceptual features,while this is not the case for play tennis and playguitar. This is the issue we address by creating adataset with explicit sense labels.

VerSe is built on top of two existing datasets,TUHOI and COCO. The Trento Universal Human-Object Interaction (TUHOI) dataset contains 10,805images covering 2974 actions. Action (human-object interaction) categories were annotated usingcrowdsourcing: each image was labeled by multipleannotators with a description in the form of a verbor a verb-object pair. The main drawback of TUHOIis that 1576 out of 2974 action categories occur onlyonce, limiting its usefulness for VSD. The MicrosoftCommon Objects in Context (COCO) dataset is verypopular in the language/vision community, as it con-sists of over 120k images with extensive annotation,including labels for 91 object categories and five de-scriptions per image. COCO contains no explicit ac-tion annotation, but verbs and verb phrases can beextracted from the descriptions. (But note that notall the COCO images depict actions.)

The recently created Humans Interacting withCommon Objects (HICO) dataset is conceptuallysimilar to VerSe. It consists of 47774 images anno-tated with 111 verbs and 600 human-object interac-tion categories. Unlike other existing datasets, HICOuses sense-based distinctions: actions are denoted bysense-object pairs, rather than by verb-object pairs.HICO doesn’t aim for complete coverage, but re-stricts itself to the top three WordNet senses of averb. The dataset would be suitable for performingvisual sense disambiguation, but has so far not beenused in this way.

3 VerSe Dataset and Annotation

We want to build an unsupervised visual sense dis-ambiguation system, i.e., a system that takes an im-age and a verb and returns the correct sense ofthe verb. As discussed in Section 2.1, most exist-

Verb: touch

2 make physical contact with, possibly with the effect of physicallymanipulating. They touched their fingertips together and smiled

2 affect someone emotionally The president’s speech touched achord with voters.

2 be or come in contact without control They sat so close that theirarms touched.

2 make reference to, involve oneself with They had wide-rangingdiscussions that touched on the situation in the Balkans.

2 Achieve a value or quality Nothing can touch cotton for durabil-ity.

2 Tinge; repair or improve the appearance of He touched on thepaintings, trying to get the colors right.

Figure 3: Example item for depictability and senseannotation: synset definitions and examples (in blue)for the verb touch.

ing datasets are not suitable for this task, as they donot include word sense annotation. We therefore de-velop our own dataset with gold-standard sense an-notation. The Verb Sense (VerSe) dataset is based onCOCO and TUHOI and covers 90 verbs and around3500 images. VerSe serves two main purposes: (1) toshow the feasibility of annotating images with verbsenses (rather than verbs or actions); (2) to functionas test bed for evaluating automatic visual sense dis-ambiguation methods.

Verb Selection Action recognition datasets oftenuse a limited number of verbs (see Table 1). We ad-dressed this issue by using images that come withdescriptions, which in the case of action images typ-ically contain verbs. The COCO dataset includes im-ages in the form of sentences, the TUHOI dataset isannotated with verbs or prepositional verb phrasesfor a given object (e.g., sit on chair), which we use inlieu of descriptions. We extracted all verbs from allthe descriptions in the two datasets and then selectedthose verbs that have more than one sense in theOntoNotes dictionary, which resulted in 148 verbsin total (94 from COCO and 133 from TUHOI).

Depictability Annotation A verb can have mul-tiple senses, but not all of them may be depictable,e.g., senses describing cognitive and perception pro-cesses. Consider two senses of touch: make physicalcontact is depictable, whereas affect emotionally de-scribes a cognitive process and is not depictable. Wetherefore need to annotate the synsets of a verb asdepictable or non-depictable. Amazon MechanicalTurk (AMT) workers were presented with the def-initions of all the synsets of a verb, along with ex-

184

Verb type Examples Verbs Images Senses Depct ITAMotion run, walk, jump, etc. 39 1812 10.76 5.79 0.680Non-motion sit, stand, lay, etc. 51 1698 8.27 4.86 0.636

Table 2: Overview of VerSe dataset divided intomotion and non-motion verbs; Depct: depictablesenses; ITA: inter-annotator agreement.

amples, as given by OntoNotes. An example for thisannotation is shown in Figure 3. We used OntoNotesinstead of WordNet, as WordNet senses are veryfine-grained and potentially make depictability andsense annotation (see below) harder. Granularity is-sues with WordNet for text-based WSD are welldocumented (Navigli, 2009).

OntoNotes lists a total of 921 senses for our 148target verbs. For each synset, three AMT workersselected all depictable senses. The majority labelwas used as the gold standard for subsequent ex-periments. This resulted in a 504 depictable senses.Inter-annotator agreement (ITA) as measured byFleiss’ Kappa was 0.645.

Sense Annotation We then annotated a subset ofthe images in COCO and TUHOI with verb senses.For every image we assigned the verb that occursmost frequently in the descriptions for that image(for TUHOI, the descriptions are verb-object pairs,see above). However, many verbs are representedby only a few images, while a few verbs are rep-resented by a large number of images. The datasetstherefore show a Zipfian distribution of linguisticunits, which is expected and has been observed pre-viously for COCO (Ronchi and Perona, 2015). Forsense annotation, we selected only verbs for whicheither COCO or TUHOI contained five or more im-ages, resulting in a set of 90 verbs (out of the to-tal 148). All images for these verbs were included,giving us a dataset of 3518 images: 2340 images for82 verbs from COCO and 1188 images for 61 verbsfrom TUHOI (some verbs occur in both datasets).

These image-verb pairs formed the basis for senseannotation. AMT workers were presented with theimage and all the depictable OntoNotes senses ofthe associated verb. The workers had to chose thesense of the verb that was instantiated in the image(or “none of the above”, in the case of irrelevant im-ages). Annotators were given sense definitions andexamples, as for the depictability annotation (seeFigure 3). For every image-verb pair, five annotators

performed the sense annotation task. A total of 157annotators participated, reaching an inter-annotatoragreement of 0.659 (Fleiss’ Kappa). Out of 3528 im-ages, we discarded 18 images annotated with “noneof the above”, resulting in a set of 3510 images cov-ering 90 verbs and 163 senses. We present statis-tics of our dataset in Table 2; we group the verbsinto motion verbs and non-motion verb using Levin(1993) classes.

4 Visual Sense Disambiguation

For our disambiguation task, we assume we have aset of images I, and a set of polysemous verbs Vand each image i ∈ I is paired with a verb v ∈ V .For example, Figure 1 shows different images pairedwith the verb play. Every verb v ∈ V , has a set ofsenses S(v), described in a dictionary D . Now givenan image i paired with a verb v, our task is to pre-dict the correct sense s ∈ S(v), i.e., the sense that isdepicted by the associated image. Formulated as ascoring task, disambiguation consists of finding themaximum over a suitable scoring function Φ:

s = argmaxs∈S(v)

Φ(s, i,v,D) (1)

For example, in Figure 1, the correct sense for thefirst image is participate in sport, for the second oneit is play on an instrument, etc.

The Lesk (1986) algorithm is a well knownknowledge-based approach to WSD which relieson the calculation of the word overlap between thesense definition and the context in which a word oc-curs. It is therefore an unsupervised approach, i.e.,it does not require sense-annotated training data, butinstead exploits resources such as dictionaries or on-tologies to infer the sense of a word in context. Leskuses the following scoring function to disambiguatethe sense of a verb v:

Φ(s,v,D) = |context(v)∩definition(s,D)| (2)

Here, context(v) the set of words that occur closethe target word v and definition(s,D) is the set ofwords in the definition of sense s in the dictionary D .Lesk’s approach is very sensitive to the exact word-ing of definitions and results are known to changedramatically for different sets of definitions (Nav-igli, 2009). Also, sense definitions are often very

185

engage in competition or sport

perform or transmit music

engage in a playful activity

play Sense Inventory: D

O: person, tennis racket, sports ball

C: A woman is playing tennis.

Sense Representations

objects

captions

CNN-fc7

Lesk Algorithm

s 1s2 s3

s1

Φ

Image Representations

Figure 4: Schematic overview of the visual sensedisambiguation model.

short and do not provide sufficient vocabulary orcontext.

We propose a new variant of the Lesk algorithmto disambiguate the verb sense that is depicted inan image. In particular, we explore the effectivenessof textual, visual and multimodal representations inconjunction with Lesk. An overview of our method-ology is given in Figure 4. For a given image i la-beled with verb v (here play), we create a represen-tation (the vector i), which can be text-based (usingthe object labels and descriptions for i), visual, ormultimodal. Similarly, we create text-based, visual,and multimodal representations (the vector s) for ev-ery sense s of a verb. Based on the representations iand s (detailed below), we can then score senses as:1

Φ(s,v, i,D) = i · s (3)

Note that this approach is unsupervised: it requiresno sense annotated training data; we will use thesense annotations in our VerSe dataset only for eval-uation.

4.1 Sense RepresentationsFor each candidate verb sense, we create a text-based sense representation st and a visual sense rep-resentation sc.

Text-based Sense Representation We create avector st for every sense s ∈ S(v) of a verb v fromits definition and the example usages provided in

1Taking the dot product of two normalized vectors is equiv-alent to using cosine as similarity measure. We experimentedwith other similarity measures, but cosine performed best.

play

perform or transmit music

engage in competition or sport

engage in a playful activity

playing guitar

playing music

playing in a band

playing tennis

playing sport

playing game

playful activity

kids playing

children playing

#1

#3

#2

q 11q 12

q 13

q 22

q21

q23

q31

q32q

33

. . . . .

. . . . .

. . . . .

. . . . .

#4

Mean P

oolingM

ean Pooling

Mean P

ooling

CNN - fc7

CNN - fc7

CNN - fc7

CNN - fc7

CNN - fc7

CNN - fc7

play #1

play #2

play #3

Figure 5: Extracting visual sense representation forthe verb play.

the OntoNotes dictionary D . We apply word2vec(Mikolov et al., 2013), a widely used model of wordembeddings, to obtain a vector for every contentword in the definition and examples of the sense.We then take the average of these vectors to com-pute an overall representation of the verb sense. Forour experiments we used the pre-trained 300 dimen-sional vectors available with the word2vec package(trained on part of Google News dataset, about 100billion words).

Visual Sense Representation Sense dictionariestypically provide sense definitions and example sen-tences, but no visual examples or images. For nouns,this is remedied by ImageNet (Deng et al., 2009),which provides a large number of example imagesfor a subset of the senses in the WordNet noun hier-archy. However, no comparable resource is availablefor verbs (see Section 2.1).

In order to obtain visual sense representation sc,we therefore collected sense-specific images for theverbs in our dataset. For each verb sense s, threetrained annotators were presented with the definitionand examples from OntoNotes, and had to formulatea query Q (s) that would retrieve images depictingthe verb sense when submitted to a search engine.For every query q we retrieved images I (q) usingBing image search (for examples, see Figure 5). Weused the top 50 images returned by Bing for everyquery.

Once we have images for every sense, we canturn these images into feature representations us-

186

ing a convolutional neural network (CNN). Specifi-cally, we used the VGG 16-layer architecture (VG-GNet) trained on 1.2M images of the 1000 classILSVRC 2012 object classification dataset, a subsetof ImageNet (Simonyan and Zisserman, 2014). ThisCNN model has a top-5 classification error of 7.4%on ILSVRC 2012. We use the publicly available ref-erence model implemented using CAFFE (Jia et al.,2014) to extract the output of the fc7 layer, i.e., a4096 dimensional vector ci, for every image i. Weperform mean pooling over all the images extractedusing all the queries of a sense to generate a singlevisual sense representation sc (shown in Equation 4):

sc =1n ∑

q j∈Q (s)∑

i∈I (q j)ci (4)

where n is the total number of images retrieved persense s.

4.2 Image RepresentationsWe first explore the possibility of representing theimage indirectly, viz., through text associated with itin the form of object labels or image descriptions (asshown in Figure 4). We experiment with two differ-ent forms of textual annotation: GOLD annotation,where object labels and descriptions are providedby human annotators, and predicted (PRED) anno-tation, where state-of-the-art object recognition andimage description generation systems are applied tothe image.

Object Labels (O) GOLD object annotations areprovided with the two datasets we use. Each im-age sampled from COCO is annotated with oneor more of 91 object categories. Each image fromTUHOI is annotated with one more of 189 objectcategories. PRED object annotations were generatedusing the same VGG-16-layer CNN object recogni-tion model that was used to compute visual senserepresentations. Only object labels with object de-tection threshold of t > 0.2 were used.

Descriptions (C) To obtain GOLD image descrip-tions, we used the used human-generated descrip-tions that come with COCO. For TUHOI images,we generated descriptions of the form subject-verb-object, where the subject is always person, and theverb-object pairs are the action labels that come withTUHOI. To obtain PRED descriptions, we generated

three descriptions for every image using the state-of-the-art image description system of Vinyals et al.(2015).2

We can now create a textual representation it ofthe image i. Again, we used word2vec to obtainword embeddings, but applied these to the object la-bels and to the words in the image descriptions. Anoverall representation of the image is then computedby averaging these vectors over all labels, all contentwords in the description, or both.

Creating a visual representation ic of an image iis straightforward: we extract the fc7 layer of theVGG-16 network when applied to the image anduse the resulting vector as our image representation(same setup as in Section 4.1).

Apart from experimenting with separate textualand visual representations of images, it also makessense to combine the two modalities into a multi-modal representation. The simplest approach is aconcatenation model which appends textual and vi-sual features. More complex multimodal vectors canbe created using methods such as Canonical Corre-lation Analysis (CCA) and Deep Canonical Corre-lation Analysis (DCCA) (Hardoon et al., 2004; An-drew et al., 2013; Wang et al., 2015). CCA allows usto find a latent space in which the linear projectionsof text and image vectors are maximally correlated(Gong et al., 2014; Hodosh et al., 2015). DCCA canbe seen as non-linear version of CCA and has beensuccessfully applied to image description task (Yanand Mikolajczyk, 2015), outperforming previous ap-proaches, including kernel-based CCA.

We use both CCA and DCCA to map the vectorsit and ic (which have different dimensions) into ajoint latent space of n dimensions. We represent theprojected vectors of textual and visual features forimage i as it′ and ic′ and combine them to obtainmultimodal representation im as follows:

im = λt it′+λcic′ (5)

We experimented with a number of parameter set-tings for λt and λc for textual and visual models re-spectively. We use the same model to combine themultimodal representation for sense s as follows:

sm = λtst′+λcsc′ (6)

2We used Karpathy’s implementation, publicly available athttps://github.com/karpathy/neuraltalk.

187

We use these vectors (it, st), (ic, sc) and (im, sm)as described in Equation 3 to perform sense disam-biguation.

5 Experiments

5.1 Unsupervised Setup

To train the CCA and DCCA models, we use thetext representations learned from image descriptionsof COCO and Flickr30k dataset as one view andthe VGG-16 features from the respective images asthe second view. We divide the data into train, testand development samples (using a 80/10/10 split).We observed that the correlation scores for DCCAmodel were better than for the CCA model. We usethe trained models to generate the projected rep-resentations of text and visual features for the im-ages in VerSe. Once the textual and visual featuresare projected, we then merge them to get the multi-modal representation. We experimented with differ-ent ways of combining visual and textual featuresprojected using CCA or DCCA: (1) weighted in-terpolation of textual and visual features (see Equa-tions 5 and 6), and (2) concatenating the vectors oftextual and visual features.

To evaluate our proposed method, we compareagainst the first sense heuristic, which defaults tothe sense listed first in the dictionary (where sensesare typically ordered by frequency). This is a strongbaseline which is known to outperform more com-plex models in traditional text-based WSD. In VerSewe observe skewness in the distribution of the sensesand the first sense heuristic is as strong as over text.Also the most frequent sense heuristic, which as-signs the most frequently annotated sense for a givenverb in VerSe, shows very strong performance. It issupervised (as it requires sense annotated data to ob-tain the frequencies), so it should be regarded as anupper limit on the performance of the unsupervisedmethods we propose (also, in text-based WSD, themost frequent sense heuristic is considered an upperlimit, Navigli (2009)).

5.1.1 ResultsIn Table 3, we summarize the results of the gold-

standard (GOLD) and predicted (PRED) settingsfor motion and non-motion verbs across represen-tations. In the GOLD setting we find that for bothtypes of verbs, textual representations based on im-

age descriptions (C) outperform visual representa-tions (CNN features). The text-based results com-pare favorably to the original Lesk (as describedin Equation 2), which performs at 30.7 for motionverbs and 36.2 for non-motion verbs in the GOLDsetting. This improvement is clearly due to the useof word2vec embeddings.3 Note that CNN-basedvisual features alone performed better than gold-standard object labels alone in the case of motionverbs.

We also observed that adding visual featuresto textual features improves performance in somecases: multimodal features perform better than tex-tual features alone both for object labels (CNN+O)and for image descriptions (CNN+C). However,adding CNN features to textual features based onobject labels and descriptions together (CNN+O+C)resulted in a small decrease in performance. Further-more, we note that CCA models outperform simplevector concatenation in case of GOLD setting formotion verbs, and overall DCCA performed consid-erably worse than concatenation. Note that for CCAand DCCA we report the best performing scoresachieved using weighted interpolation of textual andvisual features with weights λt = 0.5 and λc = 0.5.

When comparing to our baseline and upper limit,we find that the all the GOLD models which usedescriptions-based representations (except DCCA)outperform to the first sense heuristic for motion-verbs (accuracy 70.8), whereas they performed be-low the first sense heuristic in case of non-motionverbs (accuracy 80.6). As expected, both motion andnon-motion verbs performed significantly below themost frequent sense heuristic (accuracy 86.2 and90.7 respectively), which we argued provides an up-per limit for unsupervised approaches.

We now turn the PRED configuration, i.e., to re-sults obtained using object labels and image descrip-tions predicted by state-of-the-art automatic sys-tems. This is arguably the more realistic scenario,as it only requires images as input, rather than as-suming human-generated object labels and imagedescriptions (though object detection and image de-scription systems are required instead). In the PREDsetting, we find that textual features based on ob-

3We also experimented with Glove vectors (Pennington etal., 2014) but observed that word2vec representations consis-tently achieved better results that Glove vectors.

188

(a) Motion verbs (39), FS: 70.8, MFS: 86.2Annotation Textual Vis Concat (CNN+) CCA (CNN+) DCCA (CNN+)

O C O+C CNN O C O+C O C O+C O C O+C

GOLD 54.6 73.3 75.6 58.3 66.6 74.7 73.8 50.5 75.4 74.0 52.4 66.3 68.3PRED 65.1 54.9 61.6 58.3 72.6 63.6 66.5 54.0 56.6 56.2 57.1 56.5 56.2

(b) Non-motion verbs (51), FS: 80.6, MFS: 90.7Annotation Textual Vis Concat (CNN+) CCA (CNN+) DCCA (CNN+)

O C O+C CNN O C O+C O C O+C O C O+C

GOLD 57.0 72.7 72.6 56.1 66.0 72.2 71.3 53.6 71.6 70.2 57.3 59.8 55.1PRED 59.0 64.3 64.0 56.1 63.8 66.3 66.1 50.7 55.3 54.8 49.5 50.0 50.0

Table 3: Accuracy scores for motion and non-motion verbs using for different types of sense and imagerepresentations (O: object labels, C: image descriptions, CNN: image features, FS: first sense heuristic,MFS: most frequent sense heuristic). Configurations that performed better than FS in bold.

Motion verbs (19), FS: 60.0, MFS: 76.1Features GOLD PRED

Sup Unsup Sup Unsup

O 82.3 35.3 80.0 43.8C 78.4 53.8 69.2 41.5O+C 80.0 55.3 70.7 45.3CNN 82.3 58.4 82.3 58.4CNN+O 83.0 48.4 83.0 60.0CNN+C 82.3 66.9 82.3 53.0CNN+O+C 83.0 58.4 83.0 55.3

Table 4: Accuracy scores for motion verbs for bothsupervised and unsupervised approaches using dif-ferent types of sense and image representation fea-tures.

ject labels (O) outperform both first sense heuristicand textual features based on image descriptions (C)in the case of motion verbs. Combining textual andvisual features via concatenation improves perfor-mance for both motion and non-motion verbs. Theoverall best performance of 72.6 for predicted fea-tures is obtained by combining CNN features andembeddings based on object labels and outperformsfirst sense heuristic in case of motion verbs (accu-racy 70.8). In the PRED setting for both classes ofverbs the simpler concatenation model performedbetter than the more complex CCA and DCCA mod-els. Note that for CCA and DCCA we report the bestperforming scores achieved using weighted interpo-lation of textual and visual features with weightsλt = 0.3 and λc = 0.7. Overall, our findings are con-sistent with the intuition that motion verbs are easierto disambiguate than non-motion verbs, as they are

Non-Motion verbs (19), FS: 71.3, MFS: 80.0Features GOLD PRED

Sup Unsup Sup Unsup

O 79.1 48.6 78.2 46.0C 79.1 53.9 77.3 61.7O+C 79.1 66.0 77.3 55.6CNN 80.0 55.6 80.0 55.6CNN+O 80.0 56.5 80.0 52.1CNN+C 80.0 56.5 80.3 60.0CNN+O+C 80.0 59.1 80.0 55.6

Table 5: Accuracy scores for non-motion verbs forboth supervised and unsupervised approaches usingdifferent types of sense and image representationfeatures.

more depictable and more likely to involve objects.Note that this is also reflected in the higher inter-annotator agreement for motion verbs (see Table 2).

5.2 Supervised Experiments and Results

Along with the unsupervised experiments we inves-tigated the performance of textual and visual repre-sentations of images in a simplest supervised setting.We trained logistic regression classifiers for senseprediction by dividing the images in VerSe datasetinto train and test splits. To train the classifiers weselected all the verbs which has atleast 20 images an-notated and has at least two senses in VerSe. This re-sulted in 19 motion verbs and 19 non-motion verbs.Similar to our unsupervised experiments we exploremultimodal features by using both textual and visualfeatures for classification (similar to concatenationin unsupervised experiments).

189

Verb Image Predicted Descriptions Pred. Obj.

play

A man holding a nintendo wii gamecontroller. A man and a woman play-ing a video game. A man and a womanare playing a video game.

person, bas-soon, violinfiddle, oboe,hautboy

swing

A woman standing next to a fire hy-drant. A woman walking down a streetholding an umbrella. A woman stand-ing on a sidewalk holding an umbrella.

person,horizontalbar, highbar, pole

feed

A couple of cows standing next to eachother. A cow that is standing in the dirt.A close up of a horse in a stable

arabiancamel,dromedary,person

Table 6: Images that were assigned an incorrectsense in the PRED setting.

In Table 4 we report accuracy scores for 19 mo-tion verbs using a supervised logistic regressionclassifier and for comparison we also report thescores of our proposed unsupervised algorithm forboth GOLD and PRED setting. Similarly in Table 5we report the accuracy scores for 19 non-motionverbs. We observe that all supervised classifiers forboth motion and non-motion verbs performing bet-ter than first sense baseline. Similar to our findingsusing an unsupervised approach we find that in mostcases multimodal features obtained using concate-nating textual and visual features has outperformedtextual or visual features alone especially in thePRED setting which is arguably the more realisticscenario. We observe that the features from PREDimage descriptions showed better results for non-motion verbs for both supervised and unsupervisedapproaches whereas PRED object features showedbetter results for motion verbs. We also observethat supervised classifiers outperform most frequentsense for motion verbs and for non-motion verbs ourscores match with most frequent sense heuristic.

5.3 Error Analysis

In order to understand the cases where the proposedunsupervised algorithm failed, we analyzed the im-ages that were disambiguated incorrectly. For thePRED setting, we observed that using predicted im-age descriptions yielded lower scores compared topredicted object labels. The main reason for this isthat the image description system often generates ir-relevant descriptions or descriptions not related tothe action depicted, whereas the object labels pre-dicted by the CNN model tend to be relevant. Thishighlights that current image description systems

still have clear limitations, despite the high evalu-ation scores reported in the literature (Vinyals et al.,2015; Fang et al., 2015). Examples are shown inTable 6: in all cases human generated descriptionsand object labels that are relevant for disambigua-tion, which explains the higher scores in the GOLDsetting.

6 Conclusion

We have introduced the new task of visual verb sensedisambiguation: given an image and a verb, identifythe verb sense depicted in the image. We developedthe new VerSe dataset for this task, based on theexisting COCO and TUHOI datasets. We proposedan unsupervised visual sense disambiguation modelbased on the Lesk algorithm and demonstrated thatboth textual and visual information associated withan image can contribute to sense disambiguation. Inan in-depth analysis of various image representa-tions we showed that object labels and visual fea-tures extracted using state-of-the-art convolutionalneural networks result in good disambiguation per-formance, while automatically generated image de-scriptions are less useful.

References

Galen Andrew, Raman Arora, Jeff A. Bilmes, and KarenLivescu. 2013. Deep canonical correlation analysis.In Proceedings of the 30th International Conferenceon Machine Learning, ICML 2013, Atlanta, GA, USA,16-21 June 2013, pages 1247–1255.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar-garet Mitchell, Dhruv Batra, C. Lawrence Zitnick, andDevi Parikh. 2015. VQA: visual question answering.In 2015 IEEE International Conference on ComputerVision, ICCV 2015, Santiago, Chile, December 7-13,2015, pages 2425–2433.

Kobus Barnard, Matthew Johnson, and David Forsyth.2003. Word sense disambiguation with pictures.In Proceedings of the HLT-NAACL 2003 workshopon Learning word meaning from non-linguistic data-Volume 6, pages 1–5. Association for ComputationalLinguistics.

Samuel Brody and Mirella Lapata. 2008. Good neigh-bors make good senses: Exploiting distributional sim-ilarity for unsupervised wsd. In Proceedings ofthe 22nd International Conference on ComputationalLinguistics-Volume 1, pages 65–72. Association forComputational Linguistics.

190

Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang,and Jia Deng. 2015. HICO: A benchmark for recog-nizing human-object interactions in images. In 2015IEEE International Conference on Computer Vision,ICCV 2015, Santiago, Chile, December 7-13, 2015,pages 1017–1025.

Xinlei Chen, Alan Ritter, Abhinav Gupta, and Tom M.Mitchell. 2015. Sense discovery via co-clustering onimages and text. In IEEE Conference on ComputerVision and Pattern Recognition, CVPR 2015, Boston,MA, USA, June 7-12, 2015, pages 5298–5306.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,and Fei-Fei Li. 2009. ImageNet: A large-scale hi-erarchical image database. In 2009 IEEE ComputerSociety Conference on Computer Vision and PatternRecognition (CVPR 2009), 20-25 June 2009, Miami,Florida, USA, pages 248–255.

Mark Everingham, S. M. Ali Eslami, Luc Van Gool,Christopher K. I. Williams, John M. Winn, and An-drew Zisserman. 2015. The Pascal visual objectclasses challenge: A retrospective. International Jour-nal of Computer Vision, 111(1):98–136.

Hao Fang, Saurabh Gupta, Forrest N. Iandola, Ru-pesh K. Srivastava, Li Deng, Piotr Dollar, JianfengGao, Xiaodong He, Margaret Mitchell, John C. Platt,C. Lawrence Zitnick, and Geoffrey Zweig. 2015.From captions to visual concepts and back. In IEEEConference on Computer Vision and Pattern Recogni-tion, CVPR 2015, Boston, MA, USA, June 7-12, 2015,pages 1473–1482.

Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hock-enmaier, and Svetlana Lazebnik. 2014. Improvingimage-sentence embeddings using large weakly anno-tated photo collections. In Computer Vision - ECCV2014 - 13th European Conference, Zurich, Switzer-land, September 6-12, 2014, Proceedings, Part IV,pages 529–545.

David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. 2004. Canonical correlation analysis: Anoverview with application to learning methods. Neu-ral Computation, 16(12):2639–2664.

Micah Hodosh, Peter Young, and Julia Hockenmaier.2015. Framing image description as a ranking task:Data, models and evaluation metrics (extended ab-stract). In Proceedings of the Twenty-Fourth Interna-tional Joint Conference on Artificial Intelligence, IJ-CAI 2015, Buenos Aires, Argentina, July 25-31, 2015,pages 4188–4192.

Eduard H. Hovy, Mitchell P. Marcus, Martha Palmer,Lance A. Ramshaw, and Ralph M. Weischedel. 2006.Ontonotes: The 90% solution. In Human LanguageTechnology Conference of the North American Chap-ter of the Association of Computational Linguistics,

Proceedings, June 4-9, 2006, New York, New York,USA, pages 57–60.

Sujay Kumar Jauhar, Chris Dyer, and Eduard H. Hovy.2015. Ontologically grounded multi-sense represen-tation learning for semantic vector space models. InNAACL HLT 2015, The 2015 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,Denver, Colorado, USA, May 31 - June 5, 2015, pages683–693.

Yangqing Jia, Evan Shelhamer, Jeff Donahue, SergeyKarayev, Jonathan Long, Ross B. Girshick, SergioGuadarrama, and Trevor Darrell. 2014. Caffe: Con-volutional architecture for fast feature embedding. InProceedings of the ACM International Conference onMultimedia, MM ’14, Orlando, FL, USA, November03 - 07, 2014, pages 675–678.

Andrej Karpathy and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descrip-tions. In IEEE Conference on Computer Vision andPattern Recognition, CVPR 2015, Boston, MA, USA,June 7-12, 2015, pages 3128–3137.

Adam Kilgarrif. 1998. Senseval: An exercise in evalu-ating word sense disambiguation programs. In Proc.of the first international conference on language re-sources and evaluation, pages 581–588.

Dieu Thu Le, Raffaella Bernardi, and Jasper Uijlings.2013. Exploiting language models to recognize un-seen actions. In Proceedings of the 3rd ACM con-ference on International conference on multimedia re-trieval, pages 231–238. ACM.

Dieu-Thu Le, Jasper Uijlings, and Raffaella Bernardi,2014. Proceedings of the Third Workshop on Visionand Language, chapter TUHOI: Trento Universal Hu-man Object Interaction Dataset, pages 17–24. DublinCity University and the Association for ComputationalLinguistics.

Michael Lesk. 1986. Automatic sense disambiguationusing machine readable dictionaries: how to tell a pinecone from an ice cream cone. In Proceedings of the 5thAnnual International Conference on Systems Docu-mentation, SIGDOC 1986, Toronto, Ontario, Canada,1986, pages 24–26.

Beth Levin. 1993. English verb classes and alternations:A preliminary investigation. University of ChicagoPress.

Dekang Lin. 1997. Using syntactic dependency as localcontext to resolve word sense ambiguity. In Proceed-ings of the 35th Annual Meeting of the Association forComputational Linguistics and Eighth Conference ofthe European Chapter of the Association for Computa-tional Linguistics, pages 64–71. Association for Com-putational Linguistics.

191

Nicolas Loeff, Cecilia Ovesdotter Alm, and David A.Forsyth. 2006. Discriminating image senses by clus-tering with multimodal features. In ACL 2006, 21stInternational Conference on Computational Linguis-tics and 44th Annual Meeting of the Association forComputational Linguistics, Proceedings of the Confer-ence, Sydney, Australia, 17-21 July 2006, pages 547–554. Association for Computational Linguistics.

Diana McCarthy, Rob Koeling, Julie Weeds, and JohnCarroll. 2004. Finding predominant word senses inuntagged text. In Proceedings of the 42nd AnnualMeeting on Association for Computational Linguis-tics, pages 279–286. Association for ComputationalLinguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and JeffreyDean. 2013. Efficient estimation of word represen-tations in vector space. CoRR, abs/1301.3781.

George A Miller, Richard Beckwith, Christiane Fell-baum, Derek Gross, and Katherine J Miller. 1990.Introduction to wordnet: An on-line lexical database.International Journal of Lexicography, 3(4):235–244.

Roberto Navigli. 2009. Word sense disambiguation: Asurvey. ACM Computing Surveys (CSUR), 41(2):10.

Jeffrey Pennington, Richard Socher, and Christopher D.Manning. 2014. Glove: Global vectors for word rep-resentation. In Proceedings of the 2014 Conferenceon Empirical Methods in Natural Language Process-ing, EMNLP 2014, October 25-29, 2014, Doha, Qatar,A meeting of SIGDAT, a Special Interest Group of theACL, pages 1532–1543.

Matteo Ruggero Ronchi and Pietro Perona. 2015. De-scribing common human visual actions in images. InProceedings of the British Machine Vision Confer-ence (BMVC 2015), pages 52.1–52.12. BMVA Press,September.

Sascha Rothe and Hinrich Schutze. 2015. Autoex-tend: Extending word embeddings to embeddings forsynsets and lexemes. In Proceedings of the 53rd An-nual Meeting of the Association for ComputationalLinguistics and the 7th International Joint Conferenceon Natural Language Processing of the Asian Federa-tion of Natural Language Processing, ACL 2015, July26-31, 2015, Beijing, China, Volume 1: Long Papers,pages 1793–1803.

Kate Saenko and Trevor Darrell. 2008. Unsupervisedlearning of visual sense models for polysemous words.In Advances in Neural Information Processing Sys-tems 21, Proceedings of the Twenty-Second AnnualConference on Neural Information Processing Sys-tems, Vancouver, British Columbia, Canada, Decem-ber 8-11, 2008, pages 1393–1400.

Karen Simonyan and Andrew Zisserman. 2014. Verydeep convolutional networks for large-scale imagerecognition. CoRR, abs/1409.1556.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Du-mitru Erhan. 2015. Show and tell: A neural imagecaption generator. In IEEE Conference on ComputerVision and Pattern Recognition, CVPR 2015, Boston,MA, USA, June 7-12, 2015, pages 3156–3164.

Weiran Wang, Raman Arora, Karen Livescu, and Jeff A.Bilmes. 2015. On deep multi-view representationlearning. In Proceedings of the 32nd InternationalConference on Machine Learning, ICML 2015, Lille,France, 6-11 July 2015, pages 1083–1092.

Fei Yan and Krystian Mikolajczyk. 2015. Deep corre-lation for matching images and text. In IEEE Con-ference on Computer Vision and Pattern Recognition,CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages3441–3450.

Bangpeng Yao and Li Fei-Fei. 2010. Grouplet: A struc-tured image representation for recognizing human andobject interactions. In Computer Vision and PatternRecognition (CVPR), 2010 IEEE Conference on, pages9–16. IEEE.

Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy LaiLin, Leonidas Guibas, and Li Fei-Fei. 2011. Hu-man action recognition by learning bases of action at-tributes and parts. In Computer Vision (ICCV), 2011IEEE International Conference on, pages 1331–1338.IEEE.

Zhi Zhong and Hwee Tou Ng. 2010. It makes sense:A wide-coverage word sense disambiguation systemfor free text. In ACL 2010, Proceedings of the 48thAnnual Meeting of the Association for ComputationalLinguistics, July 11-16, 2010, Uppsala, Sweden, Sys-tem Demonstrations, pages 78–83.

192

Documents

Unsupervised Visual Sense Disambiguation for Verbs using ... · Verb type Examples Verbs Images Senses Depct ITA Motion run, walk, jump, etc. 39 1812 10.76 5.79 0.680 Non-motion sit,