Learning Visually-Grounded Semantics from Contrastive Adversarial Samples
Haoyue Shi*1, Jiayuan Mao*2, Tete Xiao*1, Yuning Jiang3 and Jian Sun3
1: Peking University  2: Tsinghua University  3: Megvii, Inc.
{hyshi, jasonhsiao97}@pku.edu.cn, mjy14@mails.tsinghua.edu.cn, {jyn, sunjian}@megvii.com
INTRODUCTION
Visual-Semantic Embeddings (VSE)
• Use parallel image-caption pairs to embed texts and images into a joint space.
• Several datasets have been created for this purpose.
• However, even MS-COCO [1] is small compared with the compositional semantic space.
VSE with Contrastive Adversarial Samples (this work)
• Show the limitations of existing datasets and frameworks through adversarial attacks.
• Close the gap with semantics-aware text augmentation.
• Evaluate visual grounding on multiple tasks.
A SIMPLE YET EFFECTIVE APPROACH
Add the contrastive* adversarial samples to the training set.
*: Use the online hard example mining (OHEM) technique to select the "contrastive" ones.
VSE [2]:
\min \ell_{\mathrm{VSE}}(i, c) = \sum_{c'} [\alpha + s(i, c') - s(i, c)]_+ + \sum_{i'} [\alpha + s(i', c) - s(i, c)]_+

VSE++ [3]:
\min \ell_{\mathrm{VSE++}}(i, c) = \max_{c' \neq c} [\alpha + s(i, c') - s(i, c)]_+ + \max_{i' \neq i} [\alpha + s(i', c) - s(i, c)]_+

VSE-C (ours):
\min \ell_{\mathrm{VSE\text{-}C}}(i, c) = \ell_{\mathrm{VSE++}}(i, c) + \max_{c'' \in C'(c)} [\alpha + s(i, c'') - s(i, c)]_+

i: image, c: caption, C'(c): the set of contrastive adversarial samples of c.
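The three objectives above can be sketched in a few lines of NumPy. This is an illustrative mock-up (the in-batch similarity matrix `s` and the adversarial similarities `s_adv` are assumed inputs), not the released implementation:

```python
import numpy as np

def hinge(x):
    # [x]_+ = max(x, 0)
    return np.maximum(x, 0.0)

def vse_c_loss(s, s_adv, alpha=0.2):
    """Sketch of the VSE-C objective for one minibatch.

    s     : (B, B) similarity matrix, s[k, l] = s(i_k, c_l); diagonal = matched pairs
    s_adv : (B, M) similarities s(i_k, c'') to M contrastive adversarial captions of c_k
    """
    B = s.shape[0]
    pos = np.diag(s)                                 # s(i, c) for matched pairs
    off = ~np.eye(B, dtype=bool)
    # VSE++ terms: hardest in-batch negative caption / image
    neg_c = np.where(off, s, -np.inf).max(axis=1)    # max_{c' != c} s(i, c')
    neg_i = np.where(off, s, -np.inf).max(axis=0)    # max_{i' != i} s(i', c)
    l_vsepp = hinge(alpha + neg_c - pos) + hinge(alpha + neg_i - pos)
    # VSE-C extra term: hardest adversarial caption (OHEM over C'(c))
    l_adv = hinge(alpha + s_adv.max(axis=1) - pos)
    return (l_vsepp + l_adv).mean()
```

Taking the max over `C'(c)` (rather than summing) is what makes the extra term an online hard example mining step.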
BEGIN WITH ADVERSARIAL ATTACKS
[Figure: an image with its original caption and the derived contrastive adversarial samples.]
Original caption (relation: "graze from"): Three giraffes and a rhino graze from trees.
Contrastive adversarial samples:
• noun: Three cows and a rhino graze from trees.
• numeral / indefinite article: Three giraffes and three rhinos graze from trees.
• relation: Three giraffes and a rhino graze on trees. / Trees graze from three giraffes and a rhino.
Semantics-aware Text Augmentation (Adversarial Samples)
• Noun: use WordNet [4] to compare word similarity (e.g., synonyms, hypernyms).
• Numeral / indefinite article: singularize or pluralize the corresponding nouns when necessary.
• Relation: dependency-parsing-based subject and object detection.
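The numeral / indefinite-article rule can be illustrated with a toy, hand-written rule set. The `NUMERALS` map and the naive pluralizer below are hypothetical stand-ins for WordNet and a real morphology tool:

```python
# Hypothetical, minimal rule set; the paper's augmentation uses WordNet
# and a dependency parser, which are not reproduced here.
NUMERALS = {"a": "three", "an": "three", "one": "two",
            "two": "five", "three": "one"}

def pluralize(noun):
    # Naive pluralizer; real augmentation needs a morphology tool.
    return noun if noun.endswith("s") else noun + "s"

def singularize(noun):
    return noun[:-1] if noun.endswith("s") else noun

def swap_numeral(tokens, idx):
    """Replace the numeral/indefinite article at position idx and
    re-inflect the following noun so the caption stays grammatical
    (capitalization is ignored in this sketch)."""
    out = list(tokens)
    new = NUMERALS[tokens[idx].lower()]
    out[idx] = new
    if idx + 1 < len(out):
        out[idx + 1] = (singularize(out[idx + 1]) if new in ("a", "an", "one")
                        else pluralize(out[idx + 1]))
    return out

tokens = "Three giraffes and a rhino graze from trees".split()
print(" ".join(swap_numeral(tokens, 3)))
# → Three giraffes and three rhinos graze from trees
```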
Result

MS-COCO Test
Model           R@1   R@10  Med r.  Avg r.
VSE [2]         47.7  87.8  2.0     5.8
VSE++ [3]       55.7  92.4  1.0     4.3
VSE-C (+n.)     50.7  90.7  1.0     5.2
VSE-C (+num.)   53.3  90.2  1.0     5.8
VSE-C (+rel.)   52.4  89.0  1.0     5.7
VSE-C (+all)    50.2  89.8  1.0     5.2

MS-COCO Test w/ Adversarial Samples
Model           R@1   R@10  Med r.  Avg r.
VSE [2]         28.0  71.6  4.0     11.7
VSE++ [3]       35.6  72.5  3.0     11.8
VSE-C (+n.)     40.3  80.2  2.0     9.2
VSE-C (+num.)   46.9  86.3  2.0     6.9
VSE-C (+rel.)   42.3  82.5  2.0     7.2
VSE-C (+all)    47.4  88.8  2.0     5.5
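The retrieval metrics in the table (Recall@K, median and mean rank of the ground-truth caption) can be computed from an image-caption score matrix; a minimal sketch with illustrative names:

```python
import numpy as np

def retrieval_metrics(scores):
    """Caption-retrieval metrics from a (num_images, num_captions) score
    matrix where caption k is the ground truth for image k.

    Returns (R@1, R@10, median rank, mean rank); ranks are 1-based."""
    n = scores.shape[0]
    order = np.argsort(-scores, axis=1)  # best-scoring caption first
    # rank of the ground-truth caption for each image
    ranks = np.array([np.where(order[k] == k)[0][0] + 1 for k in range(n)])
    return (100.0 * (ranks <= 1).mean(),
            100.0 * (ranks <= 10).mean(),
            float(np.median(ranks)),
            float(ranks.mean()))
```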
GROUNDING TEST I: WORD-OBJECT CORRELATION
Task Description
Image captions:
• A table with a huge glass vase and fake flowers come out of it.
• A plant in a vase sits at the end of a table.
• A vase with flowers in it with long stems sitting on a table with candles.
• A large centerpiece that is sitting on the edge of a dining table.
• Flowers in a clear vase sitting on a table.
Positive objects: table, plant, vase.
Negative objects: screen, pickle, sandwich, toy, hill, coat, cat, etc.
Model & Result
[Figure: a ResNet-152 image encoder produces the image embedding f(i); a word embedding g(w) and f(i) are fed to an embedding-interaction module that outputs Pr[positive | i, w].]

Model                 mAP
GloVe [5]             58.7
VSE [2]               61.7
VSE++ [3]             61.1
VSE-C (ours, +all)    62.2
VSE-C (ours, +n.)     62.8
VSE-C (ours, +rel.)   62.3
VSE-C (ours, +num.)   62.0
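The poster diagram does not spell out the embedding-interaction module. A minimal sketch, assuming a bilinear interaction with a hypothetical learned matrix `W`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_object_score(f_i, g_w, W):
    """Sketch of the embedding-interaction head: a bilinear form between
    the image embedding f(i) and the word embedding g(w), squashed to a
    probability Pr[positive | i, w]. W is a hypothetical learned
    interaction matrix; the poster does not specify the exact form."""
    return sigmoid(f_i @ W @ g_w)
```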
SALIENCY VISUALIZATION
Which part of the image or caption, in particular, makes them semantically different? We compute the Jacobian (the textual saliency is normalized for visualization):

J = \nabla_i s(i, c') = \nabla_i \left( W_i^T f(i; \theta_i) \right) \cdot W_c^T g(c'; \theta_c)
Original caption: an elephant walking through the weeds in the forest
Adversarial caption (relation changed): an elephant walking against the weeds in the forest

Normalized textual saliency on the adversarial caption:
VSE++: an 0.039 | elephant 0.176 | walking 0.101 | against 0.087 | the 0.051 | weeds 0.248 | in 0.060 | the 0.057 | forest 0.181
VSE-C: an 0.030 | elephant 0.108 | walking 0.125 | against 0.258 | the 0.108 | weeds 0.176 | in 0.077 | the 0.027 | forest 0.090

Note that VSE-C places its largest textual saliency on the altered relation word "against".
[Figure: the original image and the corresponding image saliency map (VSE-C).]
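For the linear similarity head above, the Jacobian with respect to the image feature has a closed form. A minimal NumPy sketch that treats `f` as the raw image feature vector (in the full model the gradient would be backpropagated through the image encoder):

```python
import numpy as np

def saliency(f, g, W_i, W_c):
    """Gradient-based saliency for the linear similarity
    s = (W_i^T f) . (W_c^T g): the Jacobian with respect to the image
    feature f is J = W_i (W_c^T g), computed here in closed form."""
    return W_i @ (W_c.T @ g)
```

A quick finite-difference check confirms the closed form, since `s` is linear in `f`.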
PAPER & CODE
Paper is available at http://aclweb.org/anthology/C18-1315. Code is available at https://github.com/ExplorerFreda/VSE-C.
ACKNOWLEDGEMENTS
This work was done when HS, JM and TX were intern researchers at Megvii Inc. HS, JM and TX contributed equally to this paper.
GROUNDING TEST II: FILL-IN-THE-BLANK
Model & Result
Example: "A table with a huge glass _____ and fake flowers come out of it." → predicted word: vase.
[Figure: word embeddings are encoded by a GRU; the GRU state is fused with the image embedding f(i) by an MLP to predict the blanked word.]

Model           R@1   R@10
Noun Filling
GloVe [5]       23.2  58.8
VSE++ [3]       25.0  61.7
VSE-C (ours)    27.3  62.9
Prep. Filling
GloVe [5]       23.3  79.9
VSE++ [3]       34.9  84.9
VSE-C (ours)    35.2  85.2
All (Noun + Prep.)
GloVe [5]       23.3  66.6
VSE++ [3]       28.4  68.1
VSE-C (ours)    30.0  70.9
REFERENCES
[1] Lin et al. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[2] Kiros et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint arXiv:1411.2539, 2014.
[3] Faghri et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC, 2018.
[4] George A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 1995.
[5] Pennington et al. GloVe: Global Vectors for Word Representation. In EMNLP, 2014.