Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning

The University of Tokyo

Yoshitaka Ushiku (losnuevetoros)


Page 1: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning

Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning

The University of Tokyo

Yoshitaka Ushiku

losnuevetoros

Page 2

Documents = Vision + Language

Vision & Language: an emerging topic

• Integration of CV, NLP, and ML techniques

• Several backgrounds

– Impact of Deep Learning

• Image recognition (CV)

• Machine translation (NLP)

– Growth of user-generated content

– Exploratory research on Vision and Language

Page 3

2012: Impact of Deep Learning

Academic AI startup A famous company

Many slides refer to the first use of CNN (AlexNet) on ImageNet

Page 4

2012: Impact of Deep Learning

Academic AI startup A famous company

Large gap of error rates

on ImageNet

1st team: 15.3%

2nd team: 26.2%


Many slides refer to the first use of CNN (AlexNet) on ImageNet

Page 5

2012: Impact of Deep Learning

According to the official site…

1st team (w/ DL): error rate 15%

2nd team (w/o DL): error rate 26%

[http://image-net.org/challenges/LSVRC/2012/results.html]

It’s me!!

Page 6

2014: Another impact of Deep Learning

• Deep learning appears in machine translation [Sutskever+, NIPS 2014]

– LSTM [Hochreiter+Schmidhuber, 1997] mitigates the vanishing-gradient problem in RNNs

→ can model relations between distant words in a sentence

– A four-layer LSTM trained in an end-to-end manner

→ comparable to the state of the art (English to French)

• Emergence of common techniques such as CNNs/RNNs

→ Lower barriers to entry into CV+NLP
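The LSTM's resistance to vanishing gradients comes from its additive cell-state update. A minimal single-step sketch in NumPy (the gate layout and weight packing here are one common convention, not the exact model of [Sutskever+, NIPS 2014]):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the four stacked gate
    pre-activations (input, forget, output, candidate)."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    # Additive update: gradients flow through c almost unchanged when f ~ 1,
    # which is what lets the model relate distant words in a sentence.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```

Stacking several such layers and feeding the encoder's final (h, c) into a decoder LSTM gives the sequence-to-sequence setup used for translation.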

Page 7

Growth of user generated contents

Especially in content posting/sharing services

• Facebook: 300 million photos per day

• YouTube: 400 hours of video per minute

“Pōhutukawa blooms this time of the year in New Zealand. As the flowers fall, the ground underneath the trees look spectacular.”

Pairs of a sentence + a video/photo → collectable in large quantities

Page 8

Exploratory research on Vision and Language

Captioning an image associated with its article [Feng+Lapata, ACL 2010]

• Input: article + image → Output: caption for the image

• Dataset: sets of article + image + caption, × 3361

King Tupou IV died at the age of 88 last week.

Page 9

Exploratory research on Vision and Language

As a result of these backgrounds:

Various research topics such as …

Page 10

Image Captioning

Group of people sitting at a table with a dinner.

Tourists are standing on the middle of a flat desert.

[Ushiku+, ICCV 2015]

Page 11

Video Captioning

A man is holding a box of doughnuts.

Then he and a woman are standing next each other.

Then she is holding a plate of food.

[Shin+, ICIP 2016]

Page 12

Multilingual + Image Caption Translation

Ein Masten mit zwei Ampeln für Autofahrer. (German)

A pole with two lights for drivers. (English)

[Hitschler+, ACL 2016]

Page 13

Visual Question Answering [Fukui+, EMNLP 2016]

Page 14

Image Generation from Captions

This bird is blue with white and has a very short beak.

This flower is white and yellow in color, with petals that are wavy and smooth.

[Zhang+, 2016]

Page 15

Goal of this keynote

Looking over research on vision & language

• Historical flow of each area

• Changes brought by Deep Learning

✗ Deep Learning enabled this research

✓ Deep Learning boosted this research

1. Image Captioning

2. Video Captioning

3. Multilingual + Image Caption Translation

4. Visual Question Answering

5. Image Generation from Captions

Page 16

Frontiers of Vision and Language 1

Image Captioning

Page 17

Every picture tells a story

Dataset: images + <object, action, scene> + captions

1. Predict <object, action, scene> for an input image using an MRF

2. Search for an existing caption associated with a similar <object, action, scene>

<Horse, Ride, Field>

[Farhadi+, ECCV 2010]

Page 18

Every picture tells a story

<pet, sleep, ground>

See something unexpected.

<transportation, move, track>

A man stands next to a train

on a cloudy day.

[Farhadi+, ECCV 2010]

Page 19

Retrieve? Generate?

• Retrieve

• Generate

– Template-based: e.g. generating a Subject+Verb sentence

– Template-free

A small gray dog

on a leash.

A black dog

standing in

grassy area.

A small white dog

wearing a flannel

warmer.

Input Dataset

Page 20

Retrieve? Generate?

• Retrieve

– A small gray dog on a leash.

• Generate

– Template-based: e.g. generating a Subject+Verb sentence

– Template-free

A small gray dog

on a leash.

A black dog

standing in

grassy area.

A small white dog

wearing a flannel

warmer.

Input Dataset

Page 21

Retrieve? Generate?

• Retrieve

– A small gray dog on a leash.

• Generate

– Template-based: dog+stand ⇒ A dog stands.

– Template-free

A small gray dog

on a leash.

A black dog

standing in

grassy area.

A small white dog

wearing a flannel

warmer.

Input Dataset

Page 22

Retrieve? Generate?

• Retrieve

– A small gray dog on a leash.

• Generate

– Template-based: dog+stand ⇒ A dog stands.

– Template-free

A small white dog standing on a leash.

A small gray dog

on a leash.

A black dog

standing in

grassy area.

A small white dog

wearing a flannel

warmer.

Input Dataset

Page 23

Captioning with multi-keyphrases [Ushiku+, ACM MM 2012]

Page 24

End of sentence

[Ushiku+, ACM MM 2012]

Page 25

Benefits of Deep Learning

• Refinement of image recognition [Krizhevsky+, NIPS 2012]

• Deep learning appears in machine translation [Sutskever+, NIPS 2014]

– LSTM [Hochreiter+Schmidhuber, 1997] mitigates the vanishing-gradient problem in RNNs

→ can model relations between distant words in a sentence

– A four-layer LSTM trained in an end-to-end manner

→ comparable to the state of the art (English to French)

Emergence of common techniques such as CNNs/RNNs

→ Lower barriers to entry into CV+NLP

Page 26

Google NIC

A combination of Google's methods:

• GoogLeNet [Szegedy+, CVPR 2015]

• MT with LSTM [Sutskever+, NIPS 2014]

Caption (word sequence) S_0 … S_N for image I

S_0: beginning-of-sentence token

S_1 = LSTM(CNN(I))

S_t = LSTM(S_{t-1}), t = 2 … N-1

S_N: end-of-sentence token

[Vinyals+, CVPR 2015]
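The decoding loop the slide's equations describe can be sketched as follows; the CNN and LSTM here are toy stand-ins with hypothetical weights, and greedy argmax replaces the beam search the real NIC uses:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 6, 8                      # toy vocabulary and feature sizes
E = rng.normal(size=(V, D))      # word embedding table (hypothetical)
W_out = rng.normal(size=(V, D))  # hidden-to-vocabulary projection
EOS = 1                          # end-of-sentence token id

def cnn(image):
    """Stand-in for GoogLeNet: map any image to a D-dim feature."""
    return np.tanh(np.asarray(image, dtype=float).ravel()[:D])

def lstm(x, h):
    """Drastically simplified recurrent update (no gates)."""
    return np.tanh(x + h)

def generate(image, max_len=10):
    h = lstm(cnn(image), np.zeros(D))    # S1 = LSTM(CNN(I))
    words = []
    for _ in range(max_len):
        w = int(np.argmax(W_out @ h))    # pick the most likely next word
        if w == EOS:
            break                        # SN: end of the sentence
        words.append(w)
        h = lstm(E[w], h)                # St = LSTM(S_{t-1})
    return words
```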

Page 27

Examples of generated captions

[https://github.com/tensorflow/models/tree/master/im2txt]

[Vinyals+, CVPR 2015]

Page 28

Comparison to [Ushiku+, ACM MM 2012]

Input image

[Ushiku+, ACM MM 2012]:

– Object recognition (conventional): Fisher Vector + linear classifier

– Machine translation (conventional): log-linear model + beam search

Neural image captioning:

– Object recognition: Convolutional Neural Network

– Machine translation: Recurrent Neural Network + beam search

Estimation of important words → connect the words with a grammar model

• Trained using only images and captions

• The two approaches are similar to each other

Page 29

Current development: Accuracy

• Attention-based captioning [Xu+, ICML 2015]

– Focus on some areas for predicting each word!

– Both the attention and caption models are trained using pairs of an image & caption
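The per-word attention of [Xu+, ICML 2015] can be sketched as a softmax over spatial region features, conditioned on the decoder state. A minimal sketch (the real model scores regions with a learned MLP; the bilinear scoring below is a simplification):

```python
import numpy as np

def soft_attention(regions, h, W):
    """regions: (K, D) CNN features of K image areas; h: decoder state.
    Returns softmax weights over the regions and the context vector fed
    to the LSTM when predicting the next word."""
    scores = regions @ (W @ h)          # relevance of each region to h
    e = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = e / e.sum()
    context = alpha @ regions           # attention-weighted average
    return alpha, context
```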

Page 30

Current development: Problem setting

Dense captioning

[Lin+, BMVC 2015] [Johnson+, CVPR 2016]

Page 31

Current development: Problem setting

Generating captions for a photo sequence [Park+Kim, NIPS 2015] [Huang+, NAACL 2016]

The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.

Page 32

Current development: Problem setting

Captioning using sentiment terms

[Mathews+, AAAI 2016] [Shin+, BMVC 2016]

Neutral caption

Positive caption

Page 33

Frontiers of Vision and Language 2

Video Captioning

Page 34

Before Deep Learning

• Grounding of language and objects in videos [Yu+Siskind, ACL 2013]

– Learning from only videos and their captions

– Experiments on a small, controlled dataset with few objects

• Deep Learning should suit this problem

– Image captioning: single image → word sequence

– Video captioning: image sequence → word sequence

Page 35

End-to-end learning by Deep Learning

• LRCN [Donahue+, CVPR 2015]

– CNN+RNN for

• Action recognition

• Image / video captioning

• Video to Text [Venugopalan+, ICCV 2015]

– CNNs to recognize

• Objects from RGB frames

• Actions from optical-flow images

– RNN for captioning
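A video extends image captioning by an encoding step over frames. A minimal sketch of the encoder side (the per-frame `cnn` is a hypothetical stand-in; mean pooling is the simplest baseline, whereas models in the style above feed the whole (T, D) sequence into the RNN, and [Venugopalan+, ICCV 2015] adds a second CNN over optical-flow images):

```python
import numpy as np

def cnn(frame):
    """Stand-in for a per-frame CNN (e.g. object features from RGB)."""
    return np.tanh(np.asarray(frame, dtype=float).ravel()[:8])

def encode_video(frames):
    """Stack per-frame features into a (T, D) sequence and mean-pool it
    into a single video feature for the caption decoder."""
    feats = np.stack([cnn(f) for f in frames])
    return feats.mean(axis=0)
```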

Page 36

Video Captioning

A man is holding a box of doughnuts.

Then he and a woman are standing next each other.

Then she is holding a plate of food.

[Shin+, ICIP 2016]

Page 37

Video Captioning

A boat is floating on the water near a mountain.

And a man riding a wave on top of a surfboard.

Then he on the surfboard in the water.

[Shin+, ICIP 2016]

Page 38

Video Retrieval from Caption

• Input: a caption

• Output: a video related to the caption

A 10-sec video clip retrieved from a 40-min database!

• Video captioning is also addressed

A woman in blue is playing ping pong in a room.

A guy is skiing with no shirt on and yellow snow pants.

A man is water skiing while attached to a long rope.

[Yamaguchi+, ICCV 2017]

Page 39

Frontiers of Vision and Language 3

Multilingual +

Image Caption Translation

Page 40

Towards multiple languages

Datasets with multilingual captions

• IAPR TC-12 [Grubinger+, 2006]: English + German

• Multi30K [Elliott+, 2016]: English + German

• STAIR Captions [Yoshikawa+, 2017]: English + Japanese

Development of cross-lingual tasks

• Non-English caption generation

• Image caption translation

Input: a pair of a caption in language A + an image, or a caption in language A alone

Output: caption in language B

Page 41

Non-English-caption generation

Page 42

Non-English-caption generation

Most research generates English captions; other languages:

• Japanese [Miyazaki+Shimizu, ACL 2016]: 柵の中にキリンが一頭立っています (A giraffe is standing inside a fence)

• Chinese [Li+, ICMR 2016]: 金色头发的小女孩 (A little girl with golden hair)

• Turkish [Unal+, SIU 2016]: Çimlerde koşan bir köpek (A dog running on the grass)

Page 43

Just collecting non-English captions?

Transfer learning among languages [Miyazaki+Shimizu, ACL 2016]

• The vision-language grounding W_im is transferred

• Efficient learning using a small amount of captions

an elephant is

an elephant

一匹の 象が 土の (an elephant ... the dirt)

一匹の 象が (an elephant)

Page 44

Image Caption Translation

Page 45

Machine translation via visual data

Images can boost MT [Calixto+, 2012]

• Example (English to Portuguese): does the word “seal” in English

– mean “seal” as in “stamp”?

– mean the sea animal?

• [Calixto+, 2012] argue that the mistranslation can be avoided using a related image (without experiments)

Mistranslation!

Page 46

Input: Caption in Language A + image

• Caption translation via an associated image [Elliott+, 2015] [Hitschler+, ACL 2016]

– Generate translation candidates

– Re-rank the candidates using captions of similar images in language B

Eine Person in einem Anzug und Krawatte und einem Rock. (German)

Translation without the related image: A person in a suit and tie and a rock.

Translation with the related image: A person in a suit and tie and a skirt.

Page 47

Input: Caption in Language A

• Cross-lingual document retrieval via images [Funaki+Nakayama, EMNLP 2015]

• Zero-shot machine translation [Nakayama+Nishida, 2017]

Page 48

Frontiers of Vision and Language 4

Visual Question Answering

Page 49

Visual Question Answering (VQA)

Proposed in Human-Computer Interaction research

• VizWiz [Bigham+, UIST 2010]: questions answered manually by workers on AMT

• Automated for the first time (w/o Deep Learning) [Malinowski+Fritz, NIPS 2014]

• Similar term: Visual Turing Test [Malinowski+Fritz, 2014]

Page 50

VQA: Visual Question Answering

• Established VQA as an AI problem

– Provided a benchmark dataset

– Experimental results with reasonable baselines

• A portal website is also maintained

– http://www.visualqa.org/

– Annual competition on VQA accuracy

[Antol+, ICCV 2015]

What color are her eyes? What is the mustache made of?

Page 51

VQA Dataset

Collected questions and answers on AMT

• Over 100K real images and 30K abstract images

• About 700K questions, with 10 answers for each question

Page 52

VQA = Multiclass Classification

The integrated feature z_{I+Q} is fed into a standard classifier.

Image I → image feature x_I

Question Q (“What objects are found on the bed?”) → question feature x_Q

(x_I, x_Q) → integrated feature z_{I+Q} → Answer A: “bed sheets, pillow”

Page 53

Development of VQA

How to compute the integrated feature z_{I+Q}?

• Concatenation: VQA [Antol+, ICCV 2015] just concatenates the two features

• Summation: e.g. summation of an attention-weighted image feature and a question feature [Xu+Saenko, ECCV 2016]

• Multiplication: e.g. bilinear multiplication using the DFT [Fukui+, EMNLP 2016]

• Hybrid of summation and multiplication: e.g. concatenation of the sum and the product [Saito+, ICME 2017]

z_{I+Q} = [x_I; x_Q] (concatenation)

z_{I+Q} = x_I + x_Q (summation)

z_{I+Q} = x_I ∘ x_Q (multiplication)

z_{I+Q} = [x_I + x_Q; x_I ∘ x_Q] (hybrid)
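The fusion strategies differ only in how x_I and x_Q are combined before classification. A sketch (the plain elementwise product stands in for the compact bilinear pooling of [Fukui+, EMNLP 2016], which approximates the full bilinear form):

```python
import numpy as np

def fuse(x_i, x_q, mode):
    """Combine an image feature and a question feature into z_{I+Q}."""
    if mode == "concat":                      # [Antol+, ICCV 2015]
        return np.concatenate([x_i, x_q])
    if mode == "sum":                         # [Xu+Saenko, ECCV 2016]
        return x_i + x_q
    if mode == "product":                     # cf. [Fukui+, EMNLP 2016]
        return x_i * x_q
    if mode == "hybrid":                      # [Saito+, ICME 2017]
        return np.concatenate([x_i + x_q, x_i * x_q])
    raise ValueError(f"unknown mode: {mode}")
```

The fused z is then fed to an ordinary multiclass classifier over candidate answers.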

Page 54

VQA Challenge

Examples from competition results

Q: What is the woman holding? GT A: laptop / Machine A: laptop

Q: Is it going to rain soon? GT A: yes / Machine A: yes

Page 55

VQA Challenge

Examples from competition results

Q: Why is there snow on one side of the stream and clear grass on the other? GT A: shade / Machine A: yes

Q: Is the hydrant painted a new color? GT A: yes / Machine A: no

Page 56

Frontiers of Vision and Language 5

Image Generation from Captions

Page 57

Image generation from input caption

Photo-realistic image generation itself is difficult

• [Mansimov+, ICLR 2016]: Incrementally draw using LSTM

• N.B. Photo synthesis is well studied [Hays+Efros, 2007]

Page 58

Generative Adversarial Networks (GAN) [Goodfellow+, NIPS 2014]

• Unconditional generative model

• Adversarial learning of a Generator and a Discriminator

• GAN using convolutions … DCGAN [Radford+, ICLR 2016]

Before conditional generative models:

Generator: random vector → image

Discriminator: discriminates real or fake. “This is a fake image from the Generator!”

Page 59: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning

Generative Adversarial Networks (GAN)[Goodfellow+, NIPS 2014]

• Unconditional generative model

• Adversarial learning of Generator and Discriminator

• GAN using convolution … DCGAN [Radford+, ICLR 2016]

Before Conditional Generative Models

Generator

Random vector → Image

Discriminator

Discriminates real or fake

is a fake

image from Generator!

Page 60: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning

Generative Adversarial Networks (GAN)[Goodfellow+, NIPS 2014]

• Unconditional generative model

• Adversarial learning of Generator and Discriminator

• GAN using convolution … DCGAN [Radford+, ICLR 2016]

Before Conditional Generative Models

Generator

Random vector → Image

Discriminator

Discriminates real or fake

is a fake

image from Generator!

Page 61: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning

Generative Adversarial Networks (GAN)[Goodfellow+, NIPS 2014]

• Unconditional generative model

• Adversarial learning of Generator and Discriminator

• GAN using convolution … DCGAN [Radford+, ICLR 2016]

Before Conditional Generative Models

Generator

Random vector → Image

Discriminator

Discriminates real or fake

is a fake

image from Generator!

Page 62

Generative Adversarial Networks (GAN) [Goodfellow+, NIPS 2014]

Generator: random vector → image

Discriminator: discriminates real or fake. “This is a … hmm”
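The adversarial game above reduces to two opposing losses. A toy sketch with linear models and hypothetical shapes (real GANs use deep networks and alternate gradient steps on these losses):

```python
import numpy as np

def discriminator(x, w):
    """Toy discriminator: probability that x is a real image."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))

def generator(z, w):
    """Toy generator: random vector -> "image" vector."""
    return np.tanh(w @ z)

def gan_losses(real, z, w_d, w_g):
    fake = generator(z, w_g)
    # Discriminator wants: real -> 1, fake -> 0.
    loss_d = -np.log(discriminator(real, w_d)) \
             - np.log(1.0 - discriminator(fake, w_d))
    # Generator wants the discriminator to output 1 on its fakes.
    loss_g = -np.log(discriminator(fake, w_d))
    return loss_d, loss_g
```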

Page 63

Add a caption to both the Generator and the Discriminator

Conditional generative models

• The Generator tries to generate an image that is photo-realistic and related to the caption

• The Discriminator tries to detect images that are fake or unrelated to the caption

[Reed+, ICML 2016]
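Conditioning amounts to giving a caption embedding t to both networks: the generator must match t, and the discriminator judges (image, caption) pairs, so real-but-mismatched pairs also count as negatives. A minimal sketch with hypothetical linear models (not the convolutional architecture of [Reed+, ICML 2016]):

```python
import numpy as np

def generator_cond(z, t, w):
    """Generate an image from noise z AND a caption embedding t."""
    return np.tanh(w @ np.concatenate([z, t]))

def discriminator_cond(x, t, w):
    """Score the (image, caption) pair: high only if the image is
    realistic AND related to the caption."""
    return 1.0 / (1.0 + np.exp(-(w @ np.concatenate([x, t]))))
```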

Page 64

Examples of generated images

• Birds (CUB) / Flowers (Oxford-102)

– About 10K images, with 5 captions per image

– 200 kinds of birds / 102 kinds of flowers

“A tiny bird, with a tiny beak, tarsus and feet, a blue crown, blue coverts, and black cheek patch”

“Bright droopy yellow petals with burgundy streaks, and a yellow stigma”

[Reed+, ICML 2016]

Page 65

Towards more realistic image generation

StackGAN [Zhang+, 2016]

Two-step GANs

• The first GAN generates a small, fuzzy image

• The second GAN enlarges and refines it

Page 66

Examples of generated images

This bird is blue with white and has a very short beak.

This flower is white and yellow in color, with petals that are wavy and smooth.

[Zhang+, 2016]

Page 67

Examples of generated images

N.B. These results use datasets specialized in birds / flowers

→ Further breakthroughs are needed to generate images of general scenes

Page 68

Take-home Messages

• Looked over research on vision and language

1. Image Captioning

2. Video Captioning

3. Multilingual + Image Caption Translation

4. Visual Question Answering

5. Image Generation from Captions

• Contributions of Deep Learning

– Most research themes existed before Deep Learning

– Commodity techniques for processing images, videos, and natural language

– Evolution of recognition and generation

Towards a new stage bridging vision and language!