Learning Visually-Grounded Semantics from Contrastive Adversarial Samples
Haoyue Shi*1, Jiayuan Mao*2, Tete Xiao*1, Yuning Jiang3 and Jian Sun3
1: Peking University  2: Tsinghua University  3: Megvii, Inc.
{hyshi, jasonhsiao97}@pku.edu.cn, mjy14@mails.tsinghua.edu.cn, {jyn, sunjian}@megvii.com
INTRODUCTION
Visual-Semantic Embeddings (VSE)
• Use parallel image-caption pairs to embed texts and images into a joint space.
• Several datasets have been created for this purpose.
• However, even MS-COCO [1] is small compared with the compositional semantic space.
VSE with Contrastive Adversarial Samples (this work)
• Show the limitations of existing datasets and frameworks through adversarial attacks.
• Close the gap with semantics-aware text augmentation.
• Evaluate visual grounding on multiple tasks.
A SIMPLE YET EFFECTIVE APPROACH
Add the contrastive* adversarial samples to the training set.
*: Use the online hard example mining (OHEM) technique to select the "contrastive" ones.
VSE [2]:
\min \ell_{\mathrm{VSE}}(i, c) = \sum_{c'} [\alpha + s(i, c') - s(i, c)]_+ + \sum_{i'} [\alpha + s(i', c) - s(i, c)]_+

VSE++ [3]:
\min \ell_{\mathrm{VSE++}}(i, c) = \max_{c' \neq c} [\alpha + s(i, c') - s(i, c)]_+ + \max_{i' \neq i} [\alpha + s(i', c) - s(i, c)]_+

VSE-C (ours):
\min \ell_{\mathrm{VSE\text{-}C}}(i, c) = \ell_{\mathrm{VSE++}}(i, c) + \max_{c'' \in C'(c)} [\alpha + s(i, c'') - s(i, c)]_+

i: image, c: caption, C'(c): the set of contrastive adversarial samples of c.
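The three objectives above can be sketched in a few lines of NumPy. This is an illustrative mock-up (the in-batch similarity matrix `s` and the adversarial similarities `s_adv` are assumed inputs), not the released implementation:

```python
import numpy as np

def hinge(x):
    # [x]_+ = max(x, 0)
    return np.maximum(x, 0.0)

def vse_c_loss(s, s_adv, alpha=0.2):
    """Sketch of the VSE-C objective for one minibatch.

    s     : (B, B) similarity matrix, s[k, l] = s(i_k, c_l); diagonal = matched pairs
    s_adv : (B, M) similarities s(i_k, c'') to M contrastive adversarial captions of c_k
    """
    B = s.shape[0]
    pos = np.diag(s)                                 # s(i, c) for matched pairs
    off = ~np.eye(B, dtype=bool)
    # VSE++ terms: hardest in-batch negative caption / image
    neg_c = np.where(off, s, -np.inf).max(axis=1)    # max_{c' != c} s(i, c')
    neg_i = np.where(off, s, -np.inf).max(axis=0)    # max_{i' != i} s(i', c)
    l_vsepp = hinge(alpha + neg_c - pos) + hinge(alpha + neg_i - pos)
    # VSE-C extra term: hardest adversarial caption (OHEM over C'(c))
    l_adv = hinge(alpha + s_adv.max(axis=1) - pos)
    return (l_vsepp + l_adv).mean()
```

Taking the max over `C'(c)` (rather than summing) is what makes the extra term an online hard example mining step.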
BEGIN WITH ADVERSARIAL ATTACKS
[Figure: an image with its original caption and the derived contrastive adversarial samples.]
Original caption (relation: "graze from"): Three giraffes and a rhino graze from trees.
Contrastive adversarial samples:
• noun: Three cows and a rhino graze from trees.
• numeral / indefinite article: Three giraffes and three rhinos graze from trees.
• relation: Three giraffes and a rhino graze on trees. / Trees graze from three giraffes and a rhino.
Semantics-aware Text Augmentation (Adversarial Samples)
• Noun: use WordNet [4] to compare word similarity (e.g., synonyms, hypernyms).
• Numeral / indefinite article: singularize or pluralize the corresponding nouns when necessary.
• Relation: dependency-parsing-based subject and object detection.
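The numeral / indefinite-article rule can be illustrated with a toy, hand-written rule set. The `NUMERALS` map and the naive pluralizer below are hypothetical stand-ins for WordNet and a real morphology tool:

```python
# Hypothetical, minimal rule set; the paper's augmentation uses WordNet
# and a dependency parser, which are not reproduced here.
NUMERALS = {"a": "three", "an": "three", "one": "two",
            "two": "five", "three": "one"}

def pluralize(noun):
    # Naive pluralizer; real augmentation needs a morphology tool.
    return noun if noun.endswith("s") else noun + "s"

def singularize(noun):
    return noun[:-1] if noun.endswith("s") else noun

def swap_numeral(tokens, idx):
    """Replace the numeral/indefinite article at position idx and
    re-inflect the following noun so the caption stays grammatical
    (capitalization is ignored in this sketch)."""
    out = list(tokens)
    new = NUMERALS[tokens[idx].lower()]
    out[idx] = new
    if idx + 1 < len(out):
        out[idx + 1] = (singularize(out[idx + 1]) if new in ("a", "an", "one")
                        else pluralize(out[idx + 1]))
    return out

tokens = "Three giraffes and a rhino graze from trees".split()
print(" ".join(swap_numeral(tokens, 3)))
# → Three giraffes and three rhinos graze from trees
```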
Result

MS-COCO Test
Model           R@1   R@10  Med r.  Avg r.
VSE [2]         47.7  87.8  2.0     5.8
VSE++ [3]       55.7  92.4  1.0     4.3
VSE-C (+n.)     50.7  90.7  1.0     5.2
VSE-C (+num.)   53.3  90.2  1.0     5.8
VSE-C (+rel.)   52.4  89.0  1.0     5.7
VSE-C (+all)    50.2  89.8  1.0     5.2

MS-COCO Test w/ Adversarial Samples
Model           R@1   R@10  Med r.  Avg r.
VSE [2]         28.0  71.6  4.0     11.7
VSE++ [3]       35.6  72.5  3.0     11.8
VSE-C (+n.)     40.3  80.2  2.0     9.2
VSE-C (+num.)   46.9  86.3  2.0     6.9
VSE-C (+rel.)   42.3  82.5  2.0     7.2
VSE-C (+all)    47.4  88.8  2.0     5.5
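The retrieval metrics in the table (Recall@K, median and mean rank of the ground-truth caption) can be computed from an image-caption score matrix; a minimal sketch with illustrative names:

```python
import numpy as np

def retrieval_metrics(scores):
    """Caption-retrieval metrics from a (num_images, num_captions) score
    matrix where caption k is the ground truth for image k.

    Returns (R@1, R@10, median rank, mean rank); ranks are 1-based."""
    n = scores.shape[0]
    order = np.argsort(-scores, axis=1)  # best-scoring caption first
    # rank of the ground-truth caption for each image
    ranks = np.array([np.where(order[k] == k)[0][0] + 1 for k in range(n)])
    return (100.0 * (ranks <= 1).mean(),
            100.0 * (ranks <= 10).mean(),
            float(np.median(ranks)),
            float(ranks.mean()))
```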
GROUNDING TEST I: WORD-OBJECT CORRELATION
Task Description
Image captions:
• A table with a huge glass vase and fake flowers come out of it.
• A plant in a vase sits at the end of a table.
• A vase with flowers in it with long stems sitting on a table with candles.
• A large centerpiece that is sitting on the edge of a dining table.
• Flowers in a clear vase sitting on a table.
Positive objects: table, plant, vase.
Negative objects: screen, pickle, sandwich, toy, hill, coat, cat, etc.
Model & Result
[Figure: a ResNet-152 image encoder produces the image embedding f(i); a word embedding g(w) and f(i) are fed to an embedding-interaction module that outputs Pr[positive | i, w].]

Model                 mAP
GloVe [5]             58.7
VSE [2]               61.7
VSE++ [3]             61.1
VSE-C (ours, +all)    62.2
VSE-C (ours, +n.)     62.8
VSE-C (ours, +rel.)   62.3
VSE-C (ours, +num.)   62.0
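The poster diagram does not spell out the embedding-interaction module. A minimal sketch, assuming a bilinear interaction with a hypothetical learned matrix `W`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_object_score(f_i, g_w, W):
    """Sketch of the embedding-interaction head: a bilinear form between
    the image embedding f(i) and the word embedding g(w), squashed to a
    probability Pr[positive | i, w]. W is a hypothetical learned
    interaction matrix; the poster does not specify the exact form."""
    return sigmoid(f_i @ W @ g_w)
```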
SALIENCY VISUALIZATION
Which part of the image or caption, in particular, makes them semantically different? We compute the Jacobian (the textual saliency is normalized for visualization):

J = \nabla_i s(i, c') = \nabla_i \left( W_i^T f(i; \theta_i) \right) \cdot W_c^T g(c'; \theta_c)
Original caption: an elephant walking through the weeds in the forest
Adversarial caption (relation changed): an elephant walking against the weeds in the forest

Normalized textual saliency on the adversarial caption:
VSE++: an 0.039 | elephant 0.176 | walking 0.101 | against 0.087 | the 0.051 | weeds 0.248 | in 0.060 | the 0.057 | forest 0.181
VSE-C: an 0.030 | elephant 0.108 | walking 0.125 | against 0.258 | the 0.108 | weeds 0.176 | in 0.077 | the 0.027 | forest 0.090

Note that VSE-C places its largest textual saliency on the altered relation word "against".
[Figure: the original image and the corresponding image saliency map (VSE-C).]
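For the linear similarity head above, the Jacobian with respect to the image feature has a closed form. A minimal NumPy sketch that treats `f` as the raw image feature vector (in the full model the gradient would be backpropagated through the image encoder):

```python
import numpy as np

def saliency(f, g, W_i, W_c):
    """Gradient-based saliency for the linear similarity
    s = (W_i^T f) . (W_c^T g): the Jacobian with respect to the image
    feature f is J = W_i (W_c^T g), computed here in closed form."""
    return W_i @ (W_c.T @ g)
```

A quick finite-difference check confirms the closed form, since `s` is linear in `f`.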
PAPER & CODE
Paper is available at http://aclweb.org/anthology/C18-1315. Code is available at https://github.com/ExplorerFreda/VSE-C.
ACKNOWLEDGEMENTS
This work was done when HS, JM and TX were intern researchers at Megvii Inc. HS, JM and TX contributed equally to this paper.
GROUNDING TEST II: FILL-IN-THE-BLANK
Model & Result
Example: "A table with a huge glass _____ and fake flowers come out of it." → predicted word: vase.
[Figure: word embeddings are encoded by a GRU; the GRU state is fused with the image embedding f(i) by an MLP to predict the blanked word.]

Model           R@1   R@10
Noun Filling
GloVe [5]       23.2  58.8
VSE++ [3]       25.0  61.7
VSE-C (ours)    27.3  62.9
Prep. Filling
GloVe [5]       23.3  79.9
VSE++ [3]       34.9  84.9
VSE-C (ours)    35.2  85.2
All (Noun + Prep.)
GloVe [5]       23.3  66.6
VSE++ [3]       28.4  68.1
VSE-C (ours)    30.0  70.9
REFERENCES
[1] Lin et al. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[2] Kiros et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint arXiv:1411.2539, 2014.
[3] Faghri et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC, 2018.
[4] George A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 1995.
[5] Pennington et al. GloVe: Global Vectors for Word Representation. In EMNLP, 2014.