
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2853–2859, Copenhagen, Denmark, September 7–11, 2017. ©2017 Association for Computational Linguistics

Visual Denotations for Recognizing Textual Entailment

Dan Han¹  [email protected]
Pascual Martínez-Gómez¹  [email protected]
Koji Mineshima²  [email protected]

¹Artificial Intelligence Research Center, AIST
²Ochanomizu University
Tokyo, Japan

    Abstract

In the logic approach to Recognizing Textual Entailment, identifying phrase-to-phrase semantic relations is still an unsolved problem. Resources such as the Paraphrase Database offer limited coverage despite their large size, whereas unsupervised distributional models of meaning often fail to recognize phrasal entailments. We propose to map phrases to their visual denotations and compare their meaning in terms of their images. We show that our approach is effective in the task of Recognizing Textual Entailment when combined with specific linguistic and logic features.

    1 Introduction and Related Work

Recognizing Textual Entailment (RTE) is a challenging task that was described as the best way of testing an NLP system's semantic capacity (Cooper et al., 1994). In this task, given a text T and a hypothesis H, the objective is to recognize whether T implies H (yes), whether T contradicts H (no), or otherwise (unk). For example, given:

(T) Some men walk in the tall and green grass.
(H) Some people walk in the field.

the system needs to recognize that T implies H (yes). Although humans can easily solve these problems, machines face great difficulties (Dagan et al., 2013). RTE has been approached from different perspectives, ranging from purely statistical systems (Lai and Hockenmaier, 2014; Zhao et al., 2014) to purely logical (Bos et al., 2004; Abzianidze, 2015; Mineshima et al., 2015) and hybrid systems (Beltagy et al., 2013).

We evaluate our idea on top of a logic system, since such systems generally offer high precision and interpretability, which is useful for our purposes. In this approach, there are two main challenges. The first challenge is to model the logical semantic composition of sentences guided by the syntax and logical words (e.g. most, not, some, every). The second challenge is to introduce lexical knowledge that describes the relationship between words or phrases (e.g. men → people, tall and green grass → field).

Whereas the relationship men → people can be found in high-precision ontological resources such as WordNet (Miller, 1995), phrasal relations such as tall and green grass → field are not available in databases such as the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) despite their large size. Moreover, although unsupervised distributional similarity models have an infinite domain (given a compositional function on words), they often fail to identify entailments (e.g. guitar has a high similarity to piano but they do not entail each other). To address these issues, Roller et al. (2014) investigated supervised methods to identify word-to-word hypernym relations given word vectors, whereas Beltagy et al. (2016) proposed a mechanism to extract phrase pairs from T and H and train a classifier to identify paraphrases in unseen T-H problems. Our approach is largely inspired by their work, and our intention is to increase the performance of these phrase- and sentence-level entailment classifiers using multimodal features.

Our assumption is that the same concept expressed using different phrase forms is mapped to similar visual representations, since humans tend to ground the meaning of phrases in the same visual denotation. In a similar line, Kiela and Bottou (2014) proposed a simple yet effective concatenation of pre-trained distributed word representations and visual features, whereas Izadinia et al. (2015) suggest a tighter parametric integration using a set of hand-annotated phrasal entailment relations; however, their work was limited to recognizing word or phrase relations, ignoring the additional challenges that arise in RTE, which we show are critical. Young et al. (2014) and Lai and Hockenmaier (2014) did tackle sentence-level RTE using visual denotations. However, their approach is only applicable to those RTE problems whose words or phrases appear in the FLICKR30K corpus, which is a considerable limitation. Lai and Hockenmaier (2017) extended the approach to also recognize unseen phrasal semantic relations using a neural network augmented with conditional probabilities estimated from visual denotations. Instead, our approach is much simpler and similarly effective.

Our contribution is a method to judge phrase-to-phrase semantic relations using an asymmetric similarity scoring function between their sets of visual denotations. We identify the conditions in which this function contributes to sentence-level RTE and show empirically its benefit. Our approach is simpler than previous methods and does not require annotated phrase relations. Moreover, this approach is not limited to specific corpora or evaluation datasets and it is potentially language independent.

    2 Methodology

We formulate our framework in terms of a classifier g_θ : 𝒯 × ℋ → {yes, no, unk} that outputs an entailment judgment for any text T ∈ 𝒯 and hypothesis H ∈ ℋ. There are three key issues in designing an effective classifier that uses visual denotations: i) to discern when it is appropriate to use visual denotations to recognize phrasal entailments, ii) to extract candidate phrase pairs, and iii) to map those phrases into visual denotations¹ and measure their semantic similarity in terms of their associated images.

Textual and Logic Features  The first issue is to understand the linguistic and logic limitations of visual denotations in recognizing phrasal entailments. From our observations, the linguistic phenomena that make visual denotations ineffective are word-to-word verb relations (e.g. laughing and crying), since their associated images may depict different actions with similar entities (e.g. pictures of a baby crying are similar to those of a baby laughing); antonym relations between any word in a phrase pair (e.g. similar images for big car and small vehicle); and words that denote people of different gender (e.g. boy versus lady, man versus woman), as they often display high visual similarity compared to other entities. The logic phenomena we identified signal sentences with small differences in critical words, phrases or structures, as in the presence of negations (e.g. images of no cat still display cats), passive-active constructions and subject-object case mismatches (e.g. images of boy eats apple and apple eats boy are similar) between T and H.

¹ We approximate the visual denotations of a phrase by obtaining the images associated with that phrase.
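The paper does not give the exact form of these text and logic features. As an illustration only, a few of the linguistic cues above (antonymy between aligned words, negation mismatches, gendered person nouns) could be encoded as binary flags with WordNet; the cue lists and helper names below are hypothetical, not the authors' implementation:

```python
# Hypothetical sketch: binary textual cue flags for an aligned phrase pair (t, h).
from nltk.corpus import wordnet as wn

NEGATIONS = {"no", "not", "nobody", "nothing", "never"}    # assumed cue list
GENDERED = {"man", "woman", "boy", "girl", "lady", "guy"}  # assumed cue list

def are_antonyms(w1, w2):
    """True if WordNet lists w2 as an antonym of any sense of w1."""
    for synset in wn.synsets(w1):
        for lemma in synset.lemmas():
            if w2 in {a.name() for a in lemma.antonyms()}:
                return True
    return False

def text_flags(t_tokens, h_tokens):
    """Flags signalling cases where visual denotations tend to be unreliable."""
    antonym = any(are_antonyms(a, b) for a in t_tokens for b in h_tokens)
    # Negation word present on one side but not the other.
    negation_mismatch = bool(NEGATIONS & (set(t_tokens) ^ set(h_tokens)))
    # Both sides mention a gendered person noun (rough proxy for boy/lady cases).
    gendered_people = bool(GENDERED & set(t_tokens)) and bool(GENDERED & set(h_tokens))
    return {"antonym": antonym,
            "negation_mismatch": negation_mismatch,
            "gendered_people": gendered_people}
```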

These logic phenomena can be easily detected from logic formulas with the aid of variable unification during the theorem proving process. For instance, using event semantics (Davidson, 1967; Parsons, 1990), an active sentence a boy eats an apple and its corresponding passive sentence an apple is eaten by a boy can be compositionally mapped to the same logical formula, i.e., ∃e∃x∃y(boy(x) ∧ apple(y) ∧ eat(e) ∧ (subj(e) = x) ∧ (obj(e) = y)), while a boy eats an apple and an apple eats a boy are mapped to different formulas. When trying to prove the formula corresponding to H from the formula corresponding to T, one needs to unify the variables contained in these formulas, so that the non-logical predicates such as boy, apple and eat in T and H are aligned by taking into account logical signals.
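The contrast can be made explicit by writing out both formulas in the event-semantics notation used above; the first is the formula given in the text, the second is the same formula with the subject and object roles swapped:

```latex
% "a boy eats an apple" and "an apple is eaten by a boy" (same formula):
\exists e \exists x \exists y\, \big( boy(x) \land apple(y) \land eat(e)
    \land subj(e) = x \land obj(e) = y \big)
% "an apple eats a boy" (roles swapped, hence a different formula):
\exists e \exists x \exists y\, \big( boy(x) \land apple(y) \land eat(e)
    \land subj(e) = y \land obj(e) = x \big)
```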

Extract candidate phrase pairs between T and H  The second issue is to find candidate phrase pairs between T and H for which we compare their visual denotations. In our running example (see Figure 1), a desirable candidate phrase pair would be tall and green grass and field. We use a tree mapping algorithm (Martínez-Gómez and Miyao, 2016) that finds node correspondences between the syntactic trees of T and H. The search is carried out bottom-up, guided by an ensemble of cost functions. This ensemble rewards word or phrase correspondences that are equal or for which a linguistic relationship (i.e. synonymy, hypernymy, etc.) holds according to WordNet. This tree mapping implicitly defines hierarchical phrase pair correspondences between T and H. We only select those phrase pairs for which both phrases have fewer than 6 words. We believe that discerning the entailment relation between longer phrases should be left to the logic prover and the compositional mechanism of meaning.
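The tree mapping algorithm itself is not reproduced here. Assuming it yields aligned (t, h) phrase pairs as strings, the length filter described above amounts to a check like the following (the helper name and example pairs are hypothetical):

```python
# Hypothetical helper: keep only aligned phrase pairs where both sides are
# shorter than 6 words, leaving longer spans to the logic prover.
def select_candidate_pairs(aligned_pairs, max_words=6):
    """aligned_pairs: iterable of (t_phrase, h_phrase) strings from the tree mapping."""
    return [(t, h) for t, h in aligned_pairs
            if len(t.split()) < max_words and len(h.split()) < max_words]

# Example drawn from the running T-H problem:
pairs = [("tall and green grass", "field"),
         ("some men", "some people"),
         ("walk in the tall and green grass", "walk in the field")]
print(select_candidate_pairs(pairs))
# -> [('tall and green grass', 'field'), ('some men', 'some people')]
```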


Figure 1: Phrase-image mappings for the phrase pair tall and green grass and field in one RTE problem.

Visual Features  At this stage it remains to measure the semantic relation between the candidate phrase pairs (extracted with the tree mapping algorithm described above) using their visual denotations (see Figure 1 for a schematic diagram²). For this purpose, we select the phrase pairs (t, h) with highest and lowest similarity score. We define the similarity score as the average cosine similarity between the best image correspondences. That is:

\[
\mathrm{score}(t, h) = \frac{1}{|I_h|} \sum_{i^h_l \in I_h} \max_{i^t_k \in I_t} f(i^t_k, i^h_l) \tag{1}
\]

where I_t = {i^t_1, ..., i^t_n} are the n images associated with phrase t from T and I_h = {i^h_1, ..., i^h_n} are the n images for phrase h from H, for 1 ≤ l, k ≤ n. Note the asymmetry in Eq. 1, which captures semantic subsumptions (a picture of river is among the pictures of body of water). The function f returns the cosine similarity between two images:

\[
f(i^t_k, i^h_l) = \cos\!\big(\boldsymbol{v}(i^t_k), \boldsymbol{v}(i^h_l)\big)
= \frac{\boldsymbol{v}(i^t_k) \cdot \boldsymbol{v}(i^h_l)}{\lVert \boldsymbol{v}(i^t_k) \rVert\, \lVert \boldsymbol{v}(i^h_l) \rVert} \tag{2}
\]

where v(i) is the vector representation of an image i. We obtain these vector representations by concatenating the activations of the first 7 layers of GoogLeNet (Szegedy et al., 2015), as is common practice (Kiela and Bottou, 2014).
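As a minimal sketch of Eqs. 1 and 2 (not the authors' released code), assuming the image vectors for each phrase are stacked into numpy arrays with one row per retrieved image:

```python
# Minimal sketch of Eqs. 1 and 2: v_t and v_h are numpy arrays of shape (n, d),
# one GoogLeNet feature vector per retrieved image for phrases t and h.
import numpy as np

def cosine_matrix(v_t, v_h):
    """Pairwise cosine similarities between image vectors (Eq. 2); entry [k, l] = f(i^t_k, i^h_l)."""
    t = v_t / np.linalg.norm(v_t, axis=1, keepdims=True)
    h = v_h / np.linalg.norm(v_h, axis=1, keepdims=True)
    return t @ h.T

def score(v_t, v_h):
    """Asymmetric phrase similarity (Eq. 1): for each image of h, take its best
    match among the images of t, then average over the images of h."""
    sims = cosine_matrix(v_t, v_h)
    return sims.max(axis=0).mean()
```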

² Due to copyright, images in this paper are a subset of Google Image Search results for which we have a publishing license. Nevertheless, they are faithful representatives.

Given the phrases with the highest and lowest similarity score,³ we extract four features from each pair. The first feature is the similarity score itself. The other three features capture statistics of the relationship f(I_t × I_h) between the two sets of visual denotations I_t and I_h. This relationship f(I_t × I_h) is defined as the matrix of image cosine similarities:

\[
f(I_t \times I_h) =
\begin{pmatrix}
f(i^t_1, i^h_1) & f(i^t_1, i^h_2) & \cdots & f(i^t_1, i^h_n) \\
f(i^t_2, i^h_1) & f(i^t_2, i^h_2) & \cdots & f(i^t_2, i^h_n) \\
\vdots & \vdots & \ddots & \vdots \\
f(i^t_n, i^h_1) & f(i^t_n, i^h_2) & \cdots & f(i^t_n, i^h_n)
\end{pmatrix} \tag{3}
\]

Specifically, these three features are:

• max f(I_t × I_h) returns the cosine similarity between the two most similar images. This feature is robust against polysemic phrases (at least one image associated to pupil is similar to at least one image associated to student) and hypernymy.

• average f(I_t × I_h) returns the average similarity across all image pairs and aims to measure the visual denotation overlap between both phrases in the pair.

• min f(I_t × I_h) returns the similarity between the two most different images and gives a notion of how different the meanings of the two phrases can be.
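Together with the score from Eq. 1, these matrix statistics give the four visual features per selected phrase pair; continuing the hypothetical sketch above:

```python
# Sketch of the four visual features for one candidate phrase pair (t, h),
# reusing the hypothetical cosine_matrix helper defined earlier.
def visual_features(v_t, v_h):
    sims = cosine_matrix(v_t, v_h)         # Eq. 3: n x n cosine similarity matrix
    return {
        "score": sims.max(axis=0).mean(),  # Eq. 1 (asymmetric score)
        "max": sims.max(),                 # most similar image pair
        "avg": sims.mean(),                # overall visual-denotation overlap
        "min": sims.min(),                 # most dissimilar image pair
    }
```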

³ If there are no candidate phrase pairs, the T-H problem is ignored. If there is only one phrase pair, that pair is used both as the pair with the highest and as the pair with the lowest score.


All features above are concatenated into a feature vector, which is paired with the T-H entailment gold label to train the classifier.

    3 Experiments

Our system is independent from the logic backend, but we use ccg2lambda (Martínez-Gómez et al., 2016)⁴ for its high precision and its capabilities to solve word-to-word divergences using WordNet and VerbOcean (Chklovski and Pantel, 2004).

We evaluate our system on the SemEval-2014 version of the SICK dataset (Marelli et al., 2014) with train/trial/test splits of 4,500/500/4,927 T-H pairs and a yes/no/unk label distribution of .29/.15/.56. We chose SICK for its relatively limited vocabulary (2,409 words) and short sentences. The average T and H sentence length was 10.6 words, where 3.6 to 3.8 words appeared in T and not in H or vice versa. We used scipy's Random Forests (Breiman, 2001) as our entailment classifier with 500 trees and feature value standardization, trained and evaluated on those T-H pairs for which ccg2lambda outputs unknown (around 71% of the problems).
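A minimal sketch of this training setup, with 500 trees and feature standardization as above, assuming scikit-learn as the Random Forest implementation and random dummy data in place of the real feature vectors and gold labels (both assumptions, not details from the paper):

```python
# Sketch of the entailment classifier over the text/logic and visual features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Dummy stand-ins for the real feature vectors (text/logic flags plus the four
# visual features per selected phrase pair) and the gold entailment labels.
X_train = rng.normal(size=(4500, 11))
y_train = rng.choice(["yes", "no", "unk"], size=4500)

clf = make_pipeline(
    StandardScaler(),                          # feature value standardization
    RandomForestClassifier(n_estimators=500),  # 500 trees, as in the paper
)
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))
```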

Using the tree mapping algorithm,⁵ we obtained an average of 9.8 phrase pairs per T-H problem. We obtained n = 30 images for every phrase using the Google Image Search API, which we consider as our visual denotations. The images and their vector representations were obtained between Sept. 2016 and Feb. 2017 using the image miner and the feature extractor of Kiela (2016).⁶

Our main baseline is ccg2lambda when using only WordNet and VerbOcean to account for word-to-word lexical divergences. ccg2lambda is augmented with a classifier c that uses either text and logic features t or image features from 10, 20, or 30 images (10i, 20i or 30i). On the training data (Table 1), ccg2lambda obtains an accuracy of 82.89%. Using our classifier with all features, we carried out 10 runs of a 10-fold cross-validation on the training data and obtained an accuracy (standard deviation) of 84.14 (0.06), 84.30 (0.14) and 84.28 (0.11) when using 10, 20 and 30 images, respectively. Thus, no significant differences in accuracy were observed for different numbers of images. When using only text and logic features (c-t), the accuracy dropped to 76.60 (0.03); when using only image features (c-20i), the accuracy dropped to 82.85%. These results show that using visual denotations to recognize phrasal entailments contributes to improvements in accuracy and that the interaction with text and logic features produces further gains.

⁴ https://github.com/mynlp/ccg2lambda
⁵ https://github.com/pasmargo/t2t-qa
⁶ https://github.com/douwekiela/mmfeat

System                    Accuracy   Std.
ccg2lambda                82.89      −
ccg2lambda, c-t-10i       84.14      0.06
ccg2lambda, c-t-20i       84.30      0.14
ccg2lambda, c-t-30i       84.28      0.11
ccg2lambda, c-t           76.60      0.03
ccg2lambda, c-20i         82.85      0.08

Table 1: Results (accuracy and standard deviation) of the classifier c in a cross-validation on the training split of the SICK dataset using text and logic features t for 10i, 20i and 30i images.

System                    Prec.    Rec.     Acc.
ccg2lambda + images       90.24    71.08    84.29
ccg2lambda, only text     96.95    62.65    83.13
L&H, text + images        −        −        82.70
L&H, only text            −        −        81.50
Illinois-LH, 2014         81.56    81.87    84.57
Yin & Schütze, 2017       −        −        87.10
Baseline (majority)       −        −        56.69

Table 2: Results on the test split of the SICK dataset using precision, recall and accuracy. The system "ccg2lambda + images" uses text and logic features and 20 images per phrase: c-t-20i.

On the test data, we obtained 1.1% higher accuracy (84.29 versus 83.13) over the ccg2lambda baseline, with a standard deviation of 0.07% over 10 runs (see Table 2), when using the setting c-t-20i. As a comparison, Lai and Hockenmaier (2017) obtain a similar accuracy increase when using visual denotations (1.2%) with a substantially more complex approach that requires training on the SNLI dataset (Bowman et al., 2015), a much larger corpus.

The best SemEval-2014 system obtained an accuracy of 84.57 (Lai and Hockenmaier, 2014), and other heavily engineered, finely-tuned systems (Beltagy et al., 2016; Yin and Schütze, 2017) have reported up to 3 percentage points of accuracy improvement since then. Thus, our results are still below the state of the art.


Figure 2: True positive, ID: 4012; gold: yes.

    Figure 3: False positive, ID: 1215; gold: unk.

    Figure 4: False negative, ID: 1318; gold: yes.

    4 Error analysis

We had an average of 126 true positives (gold label yes, system label yes) and 81 false positives (gold label unk, system label yes) in our cross-validation over the training data. Figure 2 shows an example of a true positive where the tree mapping algorithm extracted the phrase pair kangaroo that is little and baby kangaroo. The image similarity features showed a high score, causing the classifier to correctly produce the judgment yes. Figure 3 shows a false positive where the extracted phrase pair was marsh and river and for which the image similarity is unfortunately high. These cases are common when comparing people (boy and man) or scenery (such as beach and desert).

Figure 4 shows a false negative (gold label yes, system label unk) where the candidate phrase pair was plastic sword and toy weapon. In this case, there was only one image with a plastic sword among the images associated with toy weapon, which may have caused the cosine similarities to be low.

    5 Discussion and Conclusion

In this paper we have evaluated our method on the SICK dataset, which was originally created from image captions. For that reason, the proportion of concepts with good visual denotations might be higher than in typically occurring RTE problems. Our future work is to assess the applicability of our approach to other RTE problems such as the RTE challenges, SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017) datasets, and to further investigate what syntactic or semantic units can be best represented using visual denotations.

Another issue is the use of a commercial image search API as a black box to retrieve images. These search engines may include heuristics that map similar phrases or keywords into the same canonical form and that are difficult to control experimentally. However, we believe that our approach is still valid for a variety of image search mechanisms and that it is generally useful to resolve lexical ambiguity at high coverage.

We identified the conditions in which visual denotations are effective for sentence-level RTE and devised a simple scoring function to assess phrasal semantic subsumption, which may serve as the basis for more elaborate strategies. Our system is independent of the semantic parser, but the entailment recognition mechanism requires a theorem prover that displays remaining sub-goals. The system and instructions are available at https://github.com/mynlp/ccg2lambda

    Acknowledgments

This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This project is also supported by JSPS KAKENHI Grant Number 17K12747, partially funded by Microsoft Research Asia and JST CREST Grant Number JPMJCR1301, Japan. We thank Ola Vikholt and the anonymous reviewers for their valuable comments.


References

Lasha Abzianidze. 2015. A tableau prover for natural logic and language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2492–2502, Lisbon, Portugal. Association for Computational Linguistics.

Islam Beltagy, Cuong Chau, Gemma Boleda, Dan Garrette, Katrin Erk, and Raymond Mooney. 2013. Montague meets Markov: Deep semantics with probabilistic logical form. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 11–21, Atlanta, Georgia, USA. Association for Computational Linguistics.

Islam Beltagy, Stephen Roller, Pengxiang Cheng, Katrin Erk, and Raymond J. Mooney. 2016. Representing meaning with a combination of logical and distributional models. Computational Linguistics, 42(4):763–808.

Johan Bos, Stephen Clark, Mark Steedman, James R. Curran, and Julia Hockenmaier. 2004. Wide-coverage semantic representations from a CCG parser. In Proceedings of the 20th International Conference on Computational Linguistics, pages 104–111. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of EMNLP 2004, pages 33–40, Barcelona, Spain. Association for Computational Linguistics.

Robin Cooper, Richard Crouch, Jan van Eijck, Chris Fox, Josef van Genabith, Jan Jaspers, Hans Kamp, Manfred Pinkal, Massimo Poesio, Stephen Pulman, et al. 1994. FraCaS: A framework for computational semantics. Deliverable, D6.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications, volume 6. Morgan & Claypool Publishers.

Donald Davidson. 1967. The logical form of action sentences. In Nicholas Rescher, editor, The Logic of Decision and Action. University of Pittsburgh Press.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, Atlanta, Georgia. Association for Computational Linguistics.

Hamid Izadinia, Fereshteh Sadeghi, Santosh K. Divvala, Hannaneh Hajishirzi, Yejin Choi, and Ali Farhadi. 2015. Segment-phrase table for semantic segmentation, visual entailment and paraphrasing. In The IEEE International Conference on Computer Vision (ICCV).

Douwe Kiela. 2016. MMFeat: A toolkit for extracting multi-modal features. In Proceedings of ACL-2016 System Demonstrations, pages 55–60, Berlin, Germany. Association for Computational Linguistics.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45, Doha, Qatar. Association for Computational Linguistics.

Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 329–334, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

Alice Lai and Julia Hockenmaier. 2017. Learning to predict denotational probabilities for modeling entailment. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 721–730, Valencia, Spain. Association for Computational Linguistics.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC 2014, pages 216–223.

Pascual Martínez-Gómez, Koji Mineshima, Yusuke Miyao, and Daisuke Bekki. 2016. ccg2lambda: A compositional semantics system. In Proceedings of ACL-2016 System Demonstrations, pages 85–90, Berlin, Germany. Association for Computational Linguistics.

Pascual Martínez-Gómez and Yusuke Miyao. 2016. Rule extraction for tree-to-tree transducers by cost minimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 12–22, Austin, Texas. Association for Computational Linguistics.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Koji Mineshima, Pascual Martínez-Gómez, Yusuke Miyao, and Daisuke Bekki. 2015. Higher-order logical inference with compositional semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2055–2061, Lisbon, Portugal. Association for Computational Linguistics.

Terence Parsons. 1990. Events in the Semantics of English: A Study in Subatomic Semantics. The MIT Press.

Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1025–1036, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. CoRR, abs/1704.05426.

Wenpeng Yin and Hinrich Schütze. 2017. Task-specific attentive pooling of phrase alignments contributes to sentence matching. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 699–709, Valencia, Spain. Association for Computational Linguistics.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Jiang Zhao, Man Lan, and Tiantian Zhu. 2014. ECNU: Expression- and message-level sentiment orientation classification in Twitter using multiple effective features. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 259–264, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.