
Page 1

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Weijie Su*, Xizhou Zhu*, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai

University of Science and Technology of China; Microsoft Research Asia

Page 2

An Example of Visual-Linguistic Tasks

Question: Why is [person1] pointing a gun at [person2]?

Answer: [person1] and [person3] are robbing the bank, and [person2] is the bank manager.

Page 3

Previous Paradigm

[Figure: the previous paradigm. The question "Why is [person1] pointing a gun at [person2]?" goes through a Text Encoder and the image through an Image Encoder; the resulting Text Feature and Image Feature are fused by a Combiner to produce the Prediction.]

Page 4

Problem (I) High Design Cost

[Figure: the same pipeline, with the Combiner highlighted: it is different for each task and hard to design.]

Page 5

Problem (II) Overfitting

[Figure: the same pipeline. The Image Encoder and Text Encoder are pre-trained separately, without joint visual-linguistic pre-training, and the Combiner is trained from scratch.]

Page 6

Inspiration

• The Transformer is a unified and powerful architecture in NLP

• MLM-based pre-training in BERT enhances its capability

• It can aggregate and align embedded word features

Page 7

Solution

[Figure: the task-specific Image/Text Encoders and Combiner are replaced by a single task-agnostic VL-BERT with visual-linguistic pre-training.]

Page 8

Comparison between our VL-BERT and other concurrent works on pre-training generic visual-linguistic representations

Lots of concurrent works in just 3 weeks!

Page 9

Model Architecture

[Figure: VL-BERT architecture. The input sequence "[CLS] kitten drink from [MASK] [SEP] [IMG] [IMG] [END]" covers both the caption and the image regions; each element is the sum of a Token Embedding, a Visual Feature Embedding, a Segment Embedding (A for the caption, C for the image), and a Sequence Position Embedding (the [IMG] elements share one position index). The Visual Feature Embedding is built from the Fast(er) R-CNN Appearance Feature of each Region of Interest and a Geometry Embedding, passed through a Fully Connected layer. Two pre-training targets are shown: Masked Language Modeling with Visual Clues (predicting "bottle" at [MASK]) and Masked RoI Classification with Linguistic Clues (predicting [Cat] for the region whose pixels are zeroed out).]
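To make the summation concrete, below is a minimal PyTorch-style sketch, with assumed names and dimensions rather than the released implementation, of how each input element's representation is formed by summing the four embeddings:

    import torch
    import torch.nn as nn

    class VLBertEmbedding(nn.Module):
        """Each input element = token + visual feature + segment + position embedding."""
        def __init__(self, vocab_size, hidden_size=768, num_segments=3,
                     max_positions=512, visual_dim=2048):
            super().__init__()
            self.token_embed = nn.Embedding(vocab_size, hidden_size)      # words, [CLS], [SEP], [IMG], [END]
            self.segment_embed = nn.Embedding(num_segments, hidden_size)  # e.g. A = caption, C = image
            self.position_embed = nn.Embedding(max_positions, hidden_size)
            self.visual_proj = nn.Linear(visual_dim, hidden_size)         # projects the visual feature embedding

        def forward(self, token_ids, segment_ids, position_ids, visual_feats):
            # token_ids / segment_ids / position_ids: (batch, seq_len)
            # visual_feats: (batch, seq_len, visual_dim) -- the RoI feature for [IMG]
            # elements, a whole-image feature for the text elements
            return (self.token_embed(token_ids)
                    + self.visual_proj(visual_feats)
                    + self.segment_embed(segment_ids)
                    + self.position_embed(position_ids))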

Page 10

Modification (I) Add Image Regions to the Input Sequence

[The architecture figure again, highlighting that the Regions of Interest detected by Fast(er) R-CNN are appended to the caption as [IMG] elements between [SEP] and [END].]
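As a purely illustrative helper (the function name, segment letters, and exact position indexing are assumptions of mine, guided by the figure), this shows how the unified sequence is laid out once the detected regions are appended:

    def build_input_sequence(caption_tokens, num_rois):
        """Lay out [CLS] caption [SEP] [IMG]*num_rois [END] with segment and position ids."""
        tokens = ["[CLS]"] + caption_tokens + ["[SEP]"] + ["[IMG]"] * num_rois + ["[END]"]
        segments = ["A"] * (len(caption_tokens) + 2) + ["C"] * (num_rois + 1)
        # Text elements get consecutive positions; all [IMG] elements share one
        # position index (the regions have no natural order); [END] takes the next.
        text_positions = list(range(1, len(caption_tokens) + 3))
        img_position = len(caption_tokens) + 3
        positions = text_positions + [img_position] * num_rois + [img_position + 1]
        return tokens, segments, positions

    tokens, segments, positions = build_input_sequence(
        ["kitten", "drink", "from", "[MASK]"], num_rois=2)
    # tokens:    [CLS] kitten drink from [MASK] [SEP] [IMG] [IMG] [END]
    # segments:  A     A      A     A    A      A     C     C     C
    # positions: 1     2      3     4    5      6     7     7     8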

Page 11

Modification (II) Add Visual Feature Embedding

[The architecture figure again, highlighting the Visual Feature Embedding row: every element receives one, built from the Fast(er) R-CNN Appearance Feature and a Geometry Embedding through a Fully Connected layer (the RoI feature for [IMG] elements, a whole-image feature for the text elements).]
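A minimal sketch of this embedding under assumptions of my own: a plain linear layer stands in for the geometry embedding of the normalized box corners, and the appearance and geometry parts are concatenated before the fully connected layer; the released code may differ:

    import torch
    import torch.nn as nn

    class VisualFeatureEmbedding(nn.Module):
        """Visual feature embedding = FC(appearance feature, geometry embedding)."""
        def __init__(self, appearance_dim=2048, geometry_dim=256, hidden_size=768):
            super().__init__()
            # Stand-in for the geometry embedding of (x_lt/W, y_lt/H, x_rb/W, y_rb/H).
            self.geometry_embed = nn.Linear(4, geometry_dim)
            self.fc = nn.Linear(appearance_dim + geometry_dim, hidden_size)

        def forward(self, appearance_feat, normalized_boxes):
            # appearance_feat:  (num_rois, appearance_dim) RoI features from Fast(er) R-CNN
            # normalized_boxes: (num_rois, 4) box corners normalized by image width/height
            geometry = self.geometry_embed(normalized_boxes)
            return self.fc(torch.cat([appearance_feat, geometry], dim=-1))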

Page 12

Pre-training Datasets

• Visual-Linguistic Corpus: Conceptual Captions

  • Harvested from the Internet

  • ~3M image-text pairs

• Text-only Corpus: English Wikipedia & BooksCorpus

  • Improves generalization to long and complex sentences

Page 13

Pre-training Task #1

[Figure: Masked Language Modeling with Visual Clues. Given the caption "kitten drink from [MASK]" and the image regions, VL-BERT predicts the masked word "bottle" with the help of the visual clues.]

P.S. For samples from the text-only corpus, this degenerates to the original MLM task in BERT.
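An illustrative sketch of the corresponding loss, with hypothetical tensor names; only the predict-the-masked-word logic is shown:

    import torch.nn.functional as F

    def mlm_with_visual_clues_loss(hidden_states, token_ids, masked_positions, word_classifier):
        """Predict the original word (e.g. "bottle") at each [MASK]ed caption position."""
        # hidden_states:    (batch, seq_len, hidden) VL-BERT outputs over caption + image regions
        # token_ids:        (batch, seq_len) original token ids before masking
        # masked_positions: (batch, seq_len) bool, True where a caption token was replaced by [MASK]
        # word_classifier:  linear layer mapping hidden states to vocabulary logits
        logits = word_classifier(hidden_states[masked_positions])  # (num_masked, vocab_size)
        targets = token_ids[masked_positions]                      # (num_masked,)
        return F.cross_entropy(logits, targets)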

Page 14

Pre-training Task #2

[Figure: Masked RoI Classification with Linguistic Clues. The kitten region is zeroed out in the image; given the caption "kitten drink from bottle", VL-BERT predicts the category [Cat] for the masked region.]

P.S. This task is not used for samples from the text-only corpus.
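And a matching sketch for this task, again with hypothetical names: the VL-BERT output at each zeroed-out [IMG] position is classified against the object category assigned by the detector:

    import torch.nn.functional as F

    def masked_roi_classification_loss(img_hidden_states, roi_labels, masked_rois, roi_classifier):
        """Classify each zeroed-out RoI (e.g. the kitten -> "Cat") from its [IMG] output."""
        # img_hidden_states: (batch, num_rois, hidden) VL-BERT outputs at the [IMG] positions
        # roi_labels:        (batch, num_rois) category labels given by Fast(er) R-CNN
        # masked_rois:       (batch, num_rois) bool, True for RoIs whose pixels were zeroed out
        logits = roi_classifier(img_hidden_states[masked_rois])    # (num_masked, num_classes)
        return F.cross_entropy(logits, roi_labels[masked_rois])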

Page 15

Results on Downstream Tasks

[Bar charts (y-axis: Performance) on VCR, VQA, and RefCOCO+, comparing four models: Baseline, VL-BERT Base w/o pre-training, VL-BERT Base, and VL-BERT Large.]

Page 16

Results on Downstream Tasks

[The same bar charts as on the previous slide.]

Our generic representation surpasses the task-specific baselines by a large margin.

Page 17

Results on Downstream Tasks

[The same bar charts as on the previous slide.]

Pre-training further enhances the capability.

Page 18

Conclusion

• A new pre-trainable generic representation for visual-linguistic tasks

• The pre-training procedure can better align visual and linguistic clues

• Future work: seek better pre-training tasks and benefit more downstream tasks (e.g., image caption generation)
