
Page 1

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Weijie Su*, Xizhou Zhu*, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai

University of Science and Technology of China; Microsoft Research Asia

Page 2

An Example of Visual-Linguistic Tasks

Question: Why is [person1] pointing a gun at [person2]?

Answer: [person1] and [person3] are robbing the bank, and [person2] is the bank manager.

Page 3

Previous Paradigm

[Figure: the previous paradigm. The question "Why is [person1] pointing a gun at [person2]?" goes through a Text Encoder and the image through an Image Encoder; the resulting Text Feature and Image Feature are fused by a Combiner to produce the Prediction.]

Page 4

Problem (I) High Design Cost

[Figure: the same pipeline, with the Combiner highlighted: it is different for each task and hard to design.]

Page 5

Problem (II) Overfitting

[Figure: the same pipeline. The Image Encoder and Text Encoder are pre-trained separately, without joint visual-linguistic pre-training, and the Combiner is trained from scratch.]

Page 6

Inspiration

• The Transformer is a unified and powerful architecture in NLP

• MLM-based pre-training in BERT enhances its capability

• It can aggregate and align embedded word features

Page 7

Solution

[Figure: the task-specific Image/Text Encoders and Combiner are replaced by a single task-agnostic VL-BERT with visual-linguistic pre-training.]

Page 8

Comparison between our VL-BERT and other concurrent works on pre-training generic visual-linguistic representations

Lots of concurrent works in just 3 weeks!

Page 9

Model Architecture

[Figure: VL-BERT architecture. The input sequence "[CLS] kitten drink from [MASK] [SEP] [IMG] [IMG] [END]" covers both the caption and the image regions; each element is the sum of a Token Embedding, a Visual Feature Embedding, a Segment Embedding (A for the caption, C for the image), and a Sequence Position Embedding (the [IMG] elements share one position index). The Visual Feature Embedding is built from the Fast(er) R-CNN Appearance Feature of each Region of Interest and a Geometry Embedding, passed through a Fully Connected layer. Two pre-training targets are shown: Masked Language Modeling with Visual Clues (predicting "bottle" at [MASK]) and Masked RoI Classification with Linguistic Clues (predicting [Cat] for the region whose pixels are zeroed out).]
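To make the summation concrete, below is a minimal PyTorch-style sketch, with assumed names and dimensions rather than the released implementation, of how each input element's representation is formed by summing the four embeddings:

    import torch
    import torch.nn as nn

    class VLBertEmbedding(nn.Module):
        """Each input element = token + visual feature + segment + position embedding."""
        def __init__(self, vocab_size, hidden_size=768, num_segments=3,
                     max_positions=512, visual_dim=2048):
            super().__init__()
            self.token_embed = nn.Embedding(vocab_size, hidden_size)      # words, [CLS], [SEP], [IMG], [END]
            self.segment_embed = nn.Embedding(num_segments, hidden_size)  # e.g. A = caption, C = image
            self.position_embed = nn.Embedding(max_positions, hidden_size)
            self.visual_proj = nn.Linear(visual_dim, hidden_size)         # projects the visual feature embedding

        def forward(self, token_ids, segment_ids, position_ids, visual_feats):
            # token_ids / segment_ids / position_ids: (batch, seq_len)
            # visual_feats: (batch, seq_len, visual_dim) -- the RoI feature for [IMG]
            # elements, a whole-image feature for the text elements
            return (self.token_embed(token_ids)
                    + self.visual_proj(visual_feats)
                    + self.segment_embed(segment_ids)
                    + self.position_embed(position_ids))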

Page 10

Modification (I) Add Image Regions to the Input Sequence

[The architecture figure again, highlighting that the Regions of Interest detected by Fast(er) R-CNN are appended to the caption as [IMG] elements between [SEP] and [END].]
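As a purely illustrative helper (the function name, segment letters, and exact position indexing are assumptions of mine, guided by the figure), this shows how the unified sequence is laid out once the detected regions are appended:

    def build_input_sequence(caption_tokens, num_rois):
        """Lay out [CLS] caption [SEP] [IMG]*num_rois [END] with segment and position ids."""
        tokens = ["[CLS]"] + caption_tokens + ["[SEP]"] + ["[IMG]"] * num_rois + ["[END]"]
        segments = ["A"] * (len(caption_tokens) + 2) + ["C"] * (num_rois + 1)
        # Text elements get consecutive positions; all [IMG] elements share one
        # position index (the regions have no natural order); [END] takes the next.
        text_positions = list(range(1, len(caption_tokens) + 3))
        img_position = len(caption_tokens) + 3
        positions = text_positions + [img_position] * num_rois + [img_position + 1]
        return tokens, segments, positions

    tokens, segments, positions = build_input_sequence(
        ["kitten", "drink", "from", "[MASK]"], num_rois=2)
    # tokens:    [CLS] kitten drink from [MASK] [SEP] [IMG] [IMG] [END]
    # segments:  A     A      A     A    A      A     C     C     C
    # positions: 1     2      3     4    5      6     7     7     8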

Page 11

Modification (II) Add Visual Feature Embedding

[The architecture figure again, highlighting the Visual Feature Embedding row: every element receives one, built from the Fast(er) R-CNN Appearance Feature and a Geometry Embedding through a Fully Connected layer (the RoI feature for [IMG] elements, a whole-image feature for the text elements).]
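A minimal sketch of this embedding under assumptions of my own: a plain linear layer stands in for the geometry embedding of the normalized box corners, and the appearance and geometry parts are concatenated before the fully connected layer; the released code may differ:

    import torch
    import torch.nn as nn

    class VisualFeatureEmbedding(nn.Module):
        """Visual feature embedding = FC(appearance feature, geometry embedding)."""
        def __init__(self, appearance_dim=2048, geometry_dim=256, hidden_size=768):
            super().__init__()
            # Stand-in for the geometry embedding of (x_lt/W, y_lt/H, x_rb/W, y_rb/H).
            self.geometry_embed = nn.Linear(4, geometry_dim)
            self.fc = nn.Linear(appearance_dim + geometry_dim, hidden_size)

        def forward(self, appearance_feat, normalized_boxes):
            # appearance_feat:  (num_rois, appearance_dim) RoI features from Fast(er) R-CNN
            # normalized_boxes: (num_rois, 4) box corners normalized by image width/height
            geometry = self.geometry_embed(normalized_boxes)
            return self.fc(torch.cat([appearance_feat, geometry], dim=-1))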

Page 12

Pre-training Datasets

• Visual-Linguistic Corpus: Conceptual Captions

  • Harvested from the Internet

  • ~3M image-text pairs

• Text-only Corpus: English Wikipedia & BooksCorpus

  • Improves generalization to long and complex sentences

Page 13

Pre-training Task #1

[Figure: Masked Language Modeling with Visual Clues. Given the caption "kitten drink from [MASK]" and the image regions, VL-BERT predicts the masked word "bottle" with the help of the visual clues.]

P.S. For samples from the text-only corpus, this degenerates to the original MLM task in BERT.
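An illustrative sketch of the corresponding loss, with hypothetical tensor names; only the predict-the-masked-word logic is shown:

    import torch.nn.functional as F

    def mlm_with_visual_clues_loss(hidden_states, token_ids, masked_positions, word_classifier):
        """Predict the original word (e.g. "bottle") at each [MASK]ed caption position."""
        # hidden_states:    (batch, seq_len, hidden) VL-BERT outputs over caption + image regions
        # token_ids:        (batch, seq_len) original token ids before masking
        # masked_positions: (batch, seq_len) bool, True where a caption token was replaced by [MASK]
        # word_classifier:  linear layer mapping hidden states to vocabulary logits
        logits = word_classifier(hidden_states[masked_positions])  # (num_masked, vocab_size)
        targets = token_ids[masked_positions]                      # (num_masked,)
        return F.cross_entropy(logits, targets)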

Page 14

Pre-training Task #2

[Figure: Masked RoI Classification with Linguistic Clues. The kitten region is zeroed out in the image; given the caption "kitten drink from bottle", VL-BERT predicts the category [Cat] for the masked region.]

P.S. This task is not used for samples from the text-only corpus.
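And a matching sketch for this task, again with hypothetical names: the VL-BERT output at each zeroed-out [IMG] position is classified against the object category assigned by the detector:

    import torch.nn.functional as F

    def masked_roi_classification_loss(img_hidden_states, roi_labels, masked_rois, roi_classifier):
        """Classify each zeroed-out RoI (e.g. the kitten -> "Cat") from its [IMG] output."""
        # img_hidden_states: (batch, num_rois, hidden) VL-BERT outputs at the [IMG] positions
        # roi_labels:        (batch, num_rois) category labels given by Fast(er) R-CNN
        # masked_rois:       (batch, num_rois) bool, True for RoIs whose pixels were zeroed out
        logits = roi_classifier(img_hidden_states[masked_rois])    # (num_masked, num_classes)
        return F.cross_entropy(logits, roi_labels[masked_rois])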

Page 15

Results on Downstream Tasks

[Bar charts (y-axis: Performance) on VCR, VQA, and RefCOCO+, comparing four models: Baseline, VL-BERT Base w/o pre-training, VL-BERT Base, and VL-BERT Large.]

Page 16

Results on Downstream Tasks

[The same bar charts as on the previous slide.]

Our generic representation surpasses the task-specific baselines by a large margin.

Page 17

Results on Downstream Tasks

[The same bar charts as on the previous slide.]

Pre-training further enhances the capability.

Page 18

Conclusion

• A new pre-trainable generic representation for visual-linguistic tasks

• The pre-training procedure can better align visual and linguistic clues

• Future work: seek better pre-training tasks and benefit more downstream tasks (e.g., image caption generation)
