
Page 1: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Deep Learning for Computer Vision
Spring 2018

http://vllab.ee.ntu.edu.tw/dlcv.html (primary)

https://ceiba.ntu.edu.tw/1062DLCV (grade, etc.)

FB: DLCV Spring 2018

Yu-Chiang Frank Wang 王鈺強, Associate Professor

Dept. Electrical Engineering, National Taiwan University

2018/05/09

Page 2: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

What’s to Be Covered Today…

• Visualization of NNs
  • t-SNE
  • Visualization of Feature Maps
  • Neural Style Transfer

• Understanding NNs
  • Image Translation
  • Feature Disentanglement

• Recurrent Neural Networks
  • LSTM & GRU

Many slides from Fei-Fei Li, Yaser Sheikh, Simon Lucey, Kaiming He, J.-B. Huang, Yuying Yeh, and Hsuan-I Ho

Page 3: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Representation Disentanglement

• Goal
  • Interpretable deep feature representation
  • Disentangle the attribute of interest c from the derived latent representation z

• Unsupervised: InfoGAN
• Supervised: AC-GAN

InfoGAN (Chen et al., NIPS ’16)    AC-GAN (Odena et al., ICML ’17)

Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS 2016.
Odena et al., Conditional image synthesis with auxiliary classifier GANs. ICML 2017.

3

Page 4: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

AC-GAN

Odena et al., Conditional image synthesis with auxiliary classifier GANs. ICML’17

Real data w.r.t. its domain label

• Supervised Disentanglement

• Learning
  • Overall objective function:
    $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D)$
  • Adversarial loss:
    $\mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(z, c)))] + \mathbb{E}[\log D(y)]$
  • Disentanglement loss:
    $\mathcal{L}_{cls}(G, D) = \mathbb{E}[-\log D_{cls}(c' \mid y)] + \mathbb{E}[-\log D_{cls}(c \mid G(z, c))]$

[Figure: generator G takes noise z and label c and outputs G(z, c); discriminator D receives G(z, c) and real data y, and is supervised to classify generated data w.r.t. the assigned label.]

4
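To make the AC-GAN objective above concrete, here is a minimal sketch (not the authors' implementation) of the discriminator-side losses in PyTorch; the assumption that D returns a real/fake logit plus class logits, and the function name itself, are illustrative.

```python
# Hedged sketch of the AC-GAN losses above (PyTorch); D is assumed to return
# (real/fake logit, class logits) and G to take (z, c). Not the authors' code.
import torch
import torch.nn.functional as F

def acgan_d_loss(D, G, x_real, c_real, z, c_fake):
    src_real, cls_real = D(x_real)                 # real data y with label c'
    x_fake = G(z, c_fake)                          # generated data G(z, c)
    src_fake, cls_fake = D(x_fake.detach())
    adv = F.binary_cross_entropy_with_logits(src_real, torch.ones_like(src_real)) + \
          F.binary_cross_entropy_with_logits(src_fake, torch.zeros_like(src_fake))
    # Disentanglement loss: classify real data w.r.t. c' and generated data w.r.t. the assigned c
    cls = F.cross_entropy(cls_real, c_real) + F.cross_entropy(cls_fake, c_fake)
    return adv + cls
```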

Page 5: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

AC-GAN

Odena et al., Conditional image synthesis with auxiliary classifier GANs. ICML’17

• Supervised Disentanglement

[Figure: same AC-GAN architecture as above; generated samples vary with different values of the supervised attribute code c.]

5

Page 6: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

InfoGAN

• Unsupervised Disentanglement

Generated data w.r.t. assigned label

• Learning
  • Overall objective function:
    $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D)$
  • Adversarial loss:
    $\mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(z, c)))] + \mathbb{E}[\log D(y)]$
  • Disentanglement loss (only the assigned code c is recovered; no real-data labels are used):
    $\mathcal{L}_{cls}(G, D) = \mathbb{E}[-\log D_{cls}(c \mid G(z, c))]$

Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS 2016.

6
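As a contrast with AC-GAN, below is a minimal sketch (an illustrative assumption, not the authors' code) of the unsupervised InfoGAN-style term: an auxiliary head Q, assumed to share features with D, recovers the assigned code c from G(z, c) without using any real-data labels.

```python
# Hedged sketch of the unsupervised InfoGAN-style disentanglement term above.
# Q is an assumed auxiliary head that predicts the categorical code c
# from a generated image only.
import torch.nn.functional as F

def infogan_code_loss(G, Q, z, c):
    x_fake = G(z, c)            # generate from noise z and assigned code c
    c_logits = Q(x_fake)        # recover the code from the generated image
    return F.cross_entropy(c_logits, c)
```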

Page 7: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

InfoGAN

• Unsupervised Disentanglement

Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets., NIPS 2016.

• No guarantee in disentangling particular semantics

[Figure: varying different latent codes c changes, e.g., the rotation angle and width of the generated digits; training process shown as loss over time.]

7

Page 8: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

• Representation Disentanglement
  • InfoGAN: Unsupervised representation disentanglement
  • AC-GAN: Supervised representation disentanglement

• Image Translation
  • Pix2pix (CVPR’17): Pairwise cross-domain training data
  • CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data
  • UNIT (NIPS’17): Learning cross-domain image representation (with unpaired training data)
  • DTN (ICLR’17): Learning cross-domain image representation (with unpaired training data)

• Joint Image Translation & Disentanglement
  • StarGAN (CVPR’18): Image translation via representation disentanglement
  • CDRD (CVPR’18): Cross-domain representation disentanglement and translation

From Image Understanding to Image Manipulation

8

Page 9: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Pix2pix

• Image-to-image translation with conditional adversarial networks (CVPR’17)
• Can be viewed as image style transfer

Sketch Photo

Isola et al. " Image-to-image translation with conditional adversarial networks." CVPR 2017. 9

Page 10: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Pix2pix

• Goal / Problem Setting
  • Image translation across two distinct domains (e.g., sketch vs. photo)
  • Pairwise training data

• Method: Conditional GAN (example: sketch to photo)
  • Generator (encoder-decoder): input is a sketch, output is a photo
  • Discriminator: input is the concatenation of the input (sketch) and the synthesized/real (photo) image; output is real or fake

[Figure: in the training phase, the input sketch is concatenated with the real or generated photo before being fed to the discriminator; in the testing phase, a sketch is fed to the generator to obtain the translated photo.]

Isola et al. " Image-to-image translation with conditional adversarial networks." CVPR 2017. 10

Page 11: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Pix2pix

• Learning the model


• Conditional GAN loss (the discriminator sees the concatenation of the input x with the fake G(x) or the real y):
  $\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x}[\log(1 - D(x, G(x)))] + \mathbb{E}_{x,y}[\log D(x, y)]$

• Reconstruction loss:
  $\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}[\,\|y - G(x)\|_1\,]$

• Overall objective function:
  $G^* = \arg\min_G \max_D \ \mathcal{L}_{cGAN}(G, D) + \mathcal{L}_{L1}(G)$

Isola et al. " Image-to-image translation with conditional adversarial networks." CVPR 2017. 11
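The objective above can be written compactly in code. The following is a hedged sketch (not the official pix2pix implementation): G maps a sketch x to a photo, D scores the channel-wise concatenation of (x, photo), and lambda_l1 is an assumed hyperparameter.

```python
# Hedged sketch of the pix2pix objective above (conditional GAN + L1), in PyTorch.
import torch
import torch.nn.functional as F

def pix2pix_losses(G, D, x, y, lambda_l1=100.0):     # lambda_l1 is an assumed weight
    fake = G(x)
    d_real = D(torch.cat([x, y], dim=1))             # concat input with real photo
    d_fake = D(torch.cat([x, fake.detach()], dim=1)) # concat input with generated photo
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    g_pred = D(torch.cat([x, fake], dim=1))
    g_loss = F.binary_cross_entropy_with_logits(g_pred, torch.ones_like(g_pred)) + \
             lambda_l1 * F.l1_loss(fake, y)
    return d_loss, g_loss
```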

Page 12: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Pix2pix

• Experiment results

Demo page: https://affinelayer.com/pixsrv/

Isola et al. " Image-to-image translation with conditional adversarial networks." CVPR 2017. 12

Page 13: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

• Representation Disentanglement
  • InfoGAN: Unsupervised representation disentanglement
  • AC-GAN: Supervised representation disentanglement

• Image Translation
  • Pix2pix (CVPR’17): Pairwise cross-domain training data
  • CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data
  • UNIT (NIPS’17): Learning cross-domain image representation (with unpaired training data)
  • DTN (ICLR’17): Learning cross-domain image representation (with unpaired training data)

• Joint Image Translation & Disentanglement
  • StarGAN (CVPR’18): Image translation via representation disentanglement
  • CDRD (CVPR’18): Cross-domain representation disentanglement and translation

From Image Understanding to Image Manipulation

Page 14: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CycleGAN/DiscoGAN/DualGAN

• CycleGAN (CVPR’17)
  • Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks

Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." CVPR 2017.

• Easier to collect training data

• More practical

[Figure: paired training data (1-to-1 correspondence) vs. unpaired training data (no correspondence).]

Page 15: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CycleGAN

Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." CVPR 2017.

• Goal / Problem Setting
  • Image translation across two distinct domains
  • Unpaired training data

• Idea
  • Autoencoding-like image translation
  • Cycle consistency between two domains

[Figure: unpaired photo and painting training data; cycle consistency: Photo → Painting → Photo and Painting → Photo → Painting.]

Page 16: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CycleGAN

Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." CVPR 2017.

• Method (Example: Photo & Painting)

• Based on 2 GANs
  • First GAN (G1, D1): Photo to Painting
  • Second GAN (G2, D2): Painting to Photo

• Cycle Consistency
  • Photo consistency
  • Painting consistency

[Figure: G1 maps an input photo to a generated painting, judged real/fake by D1 against real paintings; G2 maps an input painting to a generated photo, judged real/fake by D2 against real photos.]

Page 17: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CycleGAN

Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." CVPR 2017.

• Method (Example: Photo vs. Painting)

• Based on 2 GANs
  • First GAN (G1, D1): Photo to Painting
  • Second GAN (G2, D2): Painting to Photo

• Cycle Consistency
  • Photo consistency
  • Painting consistency

[Figure: photo consistency: Photo → (G1) → Painting → (G2) → Photo; painting consistency: Painting → (G2) → Photo → (G1) → Painting.]

Page 18: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CycleGAN

Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." CVPR 2017.

• Learning

• Adversarial Loss
  • First GAN (G1, D1):
    $\mathcal{L}_{GAN}(G_1, D_1) = \mathbb{E}[\log(1 - D_1(G_1(x)))] + \mathbb{E}[\log D_1(y)]$
  • Second GAN (G2, D2):
    $\mathcal{L}_{GAN}(G_2, D_2) = \mathbb{E}[\log(1 - D_2(G_2(y)))] + \mathbb{E}[\log D_2(x)]$

• Overall objective function:
  $G_1^*, G_2^* = \arg\min_{G_1, G_2} \max_{D_1, D_2} \ \mathcal{L}_{GAN}(G_1, D_1) + \mathcal{L}_{GAN}(G_2, D_2) + \mathcal{L}_{cyc}(G_1, G_2)$


Page 19: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CycleGAN

Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." CVPR 2017.

• Learning

• Consistency Loss (photo and painting consistency):
  $\mathcal{L}_{cyc}(G_1, G_2) = \mathbb{E}[\,\|G_2(G_1(x)) - x\|_1 + \|G_1(G_2(y)) - y\|_1\,]$

• Overall objective function:
  $G_1^*, G_2^* = \arg\min_{G_1, G_2} \max_{D_1, D_2} \ \mathcal{L}_{GAN}(G_1, D_1) + \mathcal{L}_{GAN}(G_2, D_2) + \mathcal{L}_{cyc}(G_1, G_2)$

[Figure: photo consistency x → G1(x) → G2(G1(x)) ≈ x; painting consistency y → G2(y) → G1(G2(y)) ≈ y.]
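A minimal sketch of the generator-side objective above, under explicit assumptions (G1: Photo → Painting, G2: Painting → Photo, discriminators returning logits, and an assumed cycle weight lambda_cyc); this is not the authors' implementation.

```python
# Hedged sketch of the CycleGAN generator objective: adversarial + cycle-consistency terms.
import torch
import torch.nn.functional as F

def cyclegan_g_loss(G1, G2, D1, D2, x, y, lambda_cyc=10.0):  # lambda_cyc is an assumed weight
    fake_y, fake_x = G1(x), G2(y)                  # photo -> painting, painting -> photo
    pred_y, pred_x = D1(fake_y), D2(fake_x)
    adv = F.binary_cross_entropy_with_logits(pred_y, torch.ones_like(pred_y)) + \
          F.binary_cross_entropy_with_logits(pred_x, torch.ones_like(pred_x))
    cyc = F.l1_loss(G2(fake_y), x) + F.l1_loss(G1(fake_x), y)  # x -> y' -> x and y -> x' -> y
    return adv + lambda_cyc * cyc
```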

Page 20: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CycleGAN

Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." CVPR 2017.

• Example results

Project Page: https://junyanz.github.io/CycleGAN/

Page 21: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Image Translation Using Unpaired Training Data

Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." CVPR 2017.
Kim et al. "Learning to Discover Cross-Domain Relations with Generative Adversarial Networks." ICML 2017.

Yi et al. "DualGAN: Unsupervised dual learning for image-to-image translation." ICCV 2017

• CycleGAN, DiscoGAN, and DualGAN

CycleGAN (ICCV’17)
DiscoGAN (ICML’17)
DualGAN (ICCV’17)

Page 22: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

• Representation Disentanglement
  • InfoGAN: Unsupervised representation disentanglement
  • AC-GAN: Supervised representation disentanglement

• Image Translation
  • Pix2pix (CVPR’17): Pairwise cross-domain training data
  • CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data
  • UNIT (NIPS’17): Learning cross-domain image representation (with unpaired training data)
  • DTN (ICLR’17): Learning cross-domain image representation (with unpaired training data)

• Joint Image Translation & Disentanglement
  • StarGAN (CVPR’18): Image translation via representation disentanglement
  • CDRD (CVPR’18): Cross-domain representation disentanglement and translation

From Image Understanding to Image Manipulation

Page 23: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Unsupervised Image-to-Image Translation Networks (NIPS’17)
  • Image translation via learning a cross-domain joint representation

Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017

[Figure: images x1 ∈ X1 and x2 ∈ X2 (e.g., day and night scenes) are encoded into a joint latent space Z (Stage 1: encode to the joint space); cross-domain images are then generated from the shared code z (Stage 2: generate cross-domain images).]

Page 24: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Goal / Problem Setting
  • Image translation across two distinct domains
  • Unpaired training image data

• Idea
  • Based on two parallel VAE-GAN models

Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017

[Figure: VAE-GAN building block.]

Page 25: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Goal / Problem Setting
  • Image translation across two distinct domains
  • Unpaired training image data

• Idea
  • Based on two parallel VAE-GAN models
  • Learning of a joint representation across image domains

Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017

Page 26: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Goal / Problem Setting
  • Image translation across two distinct domains
  • Unpaired training image data

• Idea
  • Based on two parallel VAE-GAN models
  • Learning of a joint representation across image domains
  • Generate cross-domain images from the joint representation

Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017

Page 27: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Learning
  • Overall objective function:
    $G^* = \arg\min_G \max_D \ \mathcal{L}_{VAE}(E_1, G_1, E_2, G_2) + \mathcal{L}_{GAN}(G_1, D_1, G_2, D_2)$
  • Variational autoencoder loss:
    $\mathcal{L}_{VAE}(E_1, G_1, E_2, G_2) = \mathbb{E}[\,\|G_1(E_1(x_1)) - x_1\|_2\,] + \mathbb{E}[\mathcal{KL}(q_1(z)\,\|\,p(z))] + \mathbb{E}[\,\|G_2(E_2(x_2)) - x_2\|_2\,] + \mathbb{E}[\mathcal{KL}(q_2(z)\,\|\,p(z))]$
  • Adversarial loss:
    $\mathcal{L}_{GAN}(G_1, D_1, G_2, D_2) = \mathbb{E}[\log(1 - D_1(G_1(z)))] + \mathbb{E}[\log D_1(y_1)] + \mathbb{E}[\log(1 - D_2(G_2(z)))] + \mathbb{E}[\log D_2(y_2)]$

[Figure: two parallel VAE-GAN branches (E1, G1, D1) and (E2, G2, D2) sharing the latent code z.]

Page 28: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Learning (continued)
  • Same overall objective and losses as above; this slide highlights the reconstruction terms $\|G_1(E_1(x_1)) - x_1\|_2$ and $\|G_2(E_2(x_2)) - x_2\|_2$ in the VAE loss.

Page 29: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Learning (continued)
  • Same objective as above; this slide highlights the prior terms $\mathcal{KL}(q_1(z)\,\|\,p(z))$ and $\mathcal{KL}(q_2(z)\,\|\,p(z))$ in the VAE loss.

Page 30: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Learning (continued)
  • Same objective as above; this slide highlights the generated samples $G_1(z)$ and $G_2(z)$ in the adversarial loss.

Page 31: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Learning (continued)
  • Same objective as above; this slide highlights the real samples $y_1$ and $y_2$ in the adversarial loss.
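A compact sketch of the VAE-GAN objective above under explicit assumptions: E1/E2 return (mu, logvar) of a Gaussian posterior, G1/G2 decode a shared code z, and the discriminators judge cross-domain generations. This is illustrative, not the authors' code.

```python
# Hedged sketch of a UNIT-style VAE + adversarial objective over two domains.
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def kl_to_prior(mu, logvar):
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def unit_losses(E1, G1, D1, E2, G2, D2, x1, x2):
    mu1, lv1 = E1(x1); z1 = reparameterize(mu1, lv1)
    mu2, lv2 = E2(x2); z2 = reparameterize(mu2, lv2)
    # VAE terms: within-domain reconstruction + KL prior on each posterior
    l_vae = F.mse_loss(G1(z1), x1) + kl_to_prior(mu1, lv1) + \
            F.mse_loss(G2(z2), x2) + kl_to_prior(mu2, lv2)
    # Adversarial terms: cross-domain generations from the shared code should look real
    x12, x21 = G2(z1), G1(z2)
    p12, p21 = D2(x12), D1(x21)
    l_gan = F.binary_cross_entropy_with_logits(p12, torch.ones_like(p12)) + \
            F.binary_cross_entropy_with_logits(p21, torch.ones_like(p21))
    return l_vae + l_gan
```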

Page 32: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

UNIT

• Example results

Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017

Sunny → Rainy

Rainy → Sunny

Real Street-view → Synthetic Street-view

Synthetic Street-view → Real Street-view

Github Page: https://github.com/mingyuliutw/UNIT

Page 33: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

• Representation Disentanglement
  • InfoGAN: Unsupervised representation disentanglement
  • AC-GAN: Supervised representation disentanglement

• Image Translation
  • Pix2pix (CVPR’17): Pairwise cross-domain training data
  • CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data
  • UNIT (NIPS’17): Learning cross-domain image representation (with unpaired training data)
  • DTN (ICLR’17): Learning cross-domain image representation (with unpaired training data)

• Joint Image Translation & Disentanglement
  • StarGAN (CVPR’18): Image translation via representation disentanglement
  • CDRD (CVPR’18): Cross-domain representation disentanglement and translation

From Image Understanding to Image Manipulation

Page 34: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Domain Transfer Networks

• Unsupervised Cross-Domain Image Generation (ICLR’17)

• Goal / Problem Setting
  • Image translation across two domains
  • One-way translation only
  • Unpaired training data

• Idea
  • Apply a unified model to learn a joint representation across domains

Taigman et al., "Unsupervised cross-domain image generation." ICLR 2017

Page 35: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Domain Transfer Networks

• Unsupervised Cross-Domain Image Generation (ICLR’17)

• Goal / Problem Setting
  • Image translation across two domains
  • One-way translation only
  • Unpaired training data

• Idea
  • Apply a unified model to learn a joint representation across domains
  • Consistency enforced in both the image and the feature space

Taigman et al., "Unsupervised cross-domain image generation." ICLR 2017

[Figure: image consistency and feature consistency in the DTN architecture.]

Page 36: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Domain Transfer Networks

• Learning
  • Unified model G = {f, g} (feature extractor f and generator g) to translate across domains
  • Consistency in image and feature space:
    $\mathcal{L}_{img}(G) = \mathbb{E}[\,\|g(f(y)) - y\|_2\,]$
    $\mathcal{L}_{feat}(G) = \mathbb{E}[\,\|f(g(f(x))) - f(x)\|_2\,]$
  • Adversarial loss:
    $\mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(x)))] + \mathbb{E}[\log(1 - D(G(y)))] + \mathbb{E}[\log D(y)]$
  • Overall objective function:
    $G^* = \arg\min_G \max_D \ \mathcal{L}_{img}(G) + \mathcal{L}_{feat}(G) + \mathcal{L}_{GAN}(G, D)$

Page 37: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Domain Transfer Networks

• Learning (continued)
  • Same objective as above, with G = {f, g}; the figure highlights the image-consistency term $\|g(f(y)) - y\|_2$ on target samples y and the feature-consistency term $\|f(g(f(x))) - f(x)\|_2$ on source samples x.

Page 38: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Domain Transfer Networks

• Learning (continued)
  • Same objective as above; in the adversarial loss, both translations G(x) and G(y) are treated as fake, while target-domain samples y are treated as real.
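A minimal sketch of the DTN objective above with G = g ∘ f (assumptions: f is a fixed feature extractor, g the generator, D a target-domain discriminator, and alpha/beta are illustrative weights; this is not the authors' code).

```python
# Hedged sketch of the DTN generator-side losses: feature consistency on source x,
# image consistency on target y, plus the adversarial term.
import torch
import torch.nn.functional as F

def dtn_generator_loss(f, g, D, x, y, alpha=1.0, beta=1.0):  # alpha, beta are assumed weights
    G_x, G_y = g(f(x)), g(f(y))
    l_feat = F.mse_loss(f(G_x), f(x))       # f(g(f(x))) should match f(x)
    l_img = F.mse_loss(G_y, y)              # g(f(y)) should reconstruct y
    p_x, p_y = D(G_x), D(G_y)
    l_adv = F.binary_cross_entropy_with_logits(p_x, torch.ones_like(p_x)) + \
            F.binary_cross_entropy_with_logits(p_y, torch.ones_like(p_y))
    return l_adv + alpha * l_feat + beta * l_img
```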

Page 39: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative


Page 40: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

DTN

• Example results

Taigman et al., "Unsupervised cross-domain image generation." ICLR 2017

SVHN → MNIST; Photo → Emoji

Page 41: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

• Representation Disentanglement
  • InfoGAN: Unsupervised representation disentanglement
  • AC-GAN: Supervised representation disentanglement

• Image Translation
  • Pix2pix (CVPR’17): Pairwise cross-domain training data
  • CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data
  • UNIT (NIPS’17): Learning cross-domain image representation (with unpaired training data)
  • DTN (ICLR’17): Learning cross-domain image representation (with unpaired training data)

• Joint Image Translation & Disentanglement
  • StarGAN (CVPR’18): Image translation via representation disentanglement
  • CDRD (CVPR’18): Cross-domain representation disentanglement and translation

From Image Understanding to Image Manipulation

Page 42: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Goal
  • A unified GAN for multi-domain image-to-image translation

Choi et al. "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." CVPR 2018

[Figure: traditional cross-domain models vs. the unified multi-domain model (StarGAN).]

Page 43: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Goal
  • A unified GAN for multi-domain image-to-image translation

Choi et al. "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." CVPR 2018

[Figure: traditional cross-domain models need a separate generator per domain pair (G12, G13, G14, G23, G24, G34), whereas StarGAN uses a single unified G.]

Page 44: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Goal / Problem Setting
  • A single image translation model across multiple domains
  • Unpaired training data

Page 45: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Goal / Problem Setting
  • A single image translation model across multiple domains
  • Unpaired training data

• Idea
  • Concatenate the image and the target domain label as input to the generator
  • Auxiliary domain classifier on the discriminator

[Figure: the target domain label is concatenated with the input image.]

Page 46: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Goal / Problem Setting
  • A single image translation model across multiple domains
  • Unpaired training data

• Idea
  • Concatenate the image and the target domain label as input to the generator
  • Auxiliary domain classifier on the discriminator

Page 47: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Goal / Problem Setting
  • A single image translation model across multiple domains
  • Unpaired training data

• Idea
  • Concatenate the image and the target domain label as input to the generator
  • Auxiliary domain classifier on the discriminator
  • Cycle consistency across domains

Page 48: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Goal / Problem Setting
  • A single image translation model across multiple domains
  • Unpaired training data

• Idea
  • Auxiliary domain classifier on the discriminator
  • Concatenate the image and the target domain label as input
  • Cycle consistency across domains

Page 49: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Learning
  • Overall objective function:
    $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D) + \mathcal{L}_{cyc}(G)$

Page 50: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Learning

• Adversarial Loss:
  $\mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(x, c)))] + \mathbb{E}[\log D(y)]$

• Overall objective function:
  $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D) + \mathcal{L}_{cyc}(G)$

[Figure: G translates input x conditioned on the target domain label c; D distinguishes the generated G(x, c) from real data y.]

Page 51: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Learning

• Adversarial Loss:
  $\mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(x, c)))] + \mathbb{E}[\log D(y)]$

• Domain Classification Loss (Disentanglement):
  $\mathcal{L}_{cls}(G, D) = \mathbb{E}[-\log D_{cls}(c' \mid y)] + \mathbb{E}[-\log D_{cls}(c \mid G(x, c))]$

• Overall objective function:
  $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D) + \mathcal{L}_{cyc}(G)$

Page 52: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Learning (continued)
  • Same losses as above; this slide highlights the domain classification term on real data, $\mathbb{E}[-\log D_{cls}(c' \mid y)]$, i.e., classifying real data w.r.t. its own domain label.

Page 53: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Learning (continued)
  • Same losses as above; this slide highlights the domain classification term on generated data, $\mathbb{E}[-\log D_{cls}(c \mid G(x, c))]$, i.e., classifying generated data w.r.t. the assigned target label.

Page 54: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Learning

• Adversarial Loss:
  $\mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(x, c)))] + \mathbb{E}[\log D(y)]$

• Domain Classification Loss (Disentanglement):
  $\mathcal{L}_{cls}(G, D) = \mathbb{E}[-\log D_{cls}(c' \mid y)] + \mathbb{E}[-\log D_{cls}(c \mid G(x, c))]$

• Cycle Consistency Loss (translate to the target label c, then back to the original label $c_x$):
  $\mathcal{L}_{cyc}(G) = \mathbb{E}[\,\|G(G(x, c), c_x) - x\|_1\,]$

• Overall objective function:
  $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D) + \mathcal{L}_{cyc}(G)$

Page 55: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Learning

• Adversarial Loss:
  $\mathcal{L}_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(x, c)))] + \mathbb{E}[\log D(y)]$

• Domain Classification Loss:
  $\mathcal{L}_{cls}(G, D) = \mathbb{E}[-\log D_{cls}(c' \mid y)] + \mathbb{E}[-\log D_{cls}(c \mid G(x, c))]$

• Cycle Consistency Loss:
  $\mathcal{L}_{cyc}(G) = \mathbb{E}[\,\|G(G(x, c), c_x) - x\|_1\,]$

• Overall objective function:
  $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D) + \mathcal{L}_{cyc}(G)$
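A hedged sketch of the generator-side objective above (assumptions: D returns a real/fake logit and domain logits, domain labels are categorical so cross-entropy is used, and lambda_cls/lambda_cyc are illustrative weights; this is not the official StarGAN code).

```python
# Hedged sketch of the StarGAN generator losses: adversarial + domain classification
# on generated data + cycle consistency back to the original domain label.
import torch
import torch.nn.functional as F

def stargan_g_loss(G, D, x, c_org, c_trg, lambda_cls=1.0, lambda_cyc=10.0):
    x_fake = G(x, c_trg)
    src, cls = D(x_fake)
    adv = F.binary_cross_entropy_with_logits(src, torch.ones_like(src))
    cls_loss = F.cross_entropy(cls, c_trg)       # classify generated data w.r.t. assigned label
    cyc = F.l1_loss(G(x_fake, c_org), x)         # translate back to the original domain
    return adv + lambda_cls * cls_loss + lambda_cyc * cyc
```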

Page 56: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

StarGAN

• Example results
• StarGAN can also be viewed as a representation disentanglement model rather than purely an image translation one.

Choi et al. "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." CVPR 2018

Multiple Domains Multiple Domains

Github Page: https://github.com/yunjey/StarGAN

Page 57: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

• Representation Disentanglement
  • InfoGAN: Unsupervised representation disentanglement
  • AC-GAN: Supervised representation disentanglement

• Image Translation
  • Pix2pix (CVPR’17): Pairwise cross-domain training data
  • CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data
  • UNIT (NIPS’17): Learning cross-domain image representation (with unpaired training data)
  • DTN (ICLR’17): Learning cross-domain image representation (with unpaired training data)

• Joint Image Translation & Disentanglement
  • StarGAN (CVPR’18): Image translation via representation disentanglement
  • CDRD (CVPR’18): Cross-domain representation disentanglement and translation

From Image Understanding to Image Manipulation

57

Page 58: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Cross-Domain Representation Disentanglement (CDRD)

Wang et al., "Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation." CVPR 2018

• Goal / Problem Setting
  • Learning a cross-domain joint disentangled representation
  • Single-domain supervision
  • Cross-domain image translation with the attribute of interest

• Idea
  • Bridge the domain gap across domains
  • Auxiliary attribute classifier on the discriminator
  • Semantics of the disentangled factor are learned from labels in the source domain

[Figure: a joint latent space shared across domains.]

58

Page 59: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Goal / Problem Setting and Idea as above

[Figure: the source domain is supervised (with/without eyeglasses); the target domain is unsupervised.]

59

Page 60: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Goal / Problem Setting and Idea as above

[Figure: cross-domain generation of faces and sketches with and without eyeglasses.]

60

Page 61: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Goal / Problem Setting as above

• Idea (this slide): based on a GAN

[Figure: generator G maps noise z to a synthesized source-domain image X̂_S; discriminator D judges real X_S vs. generated X̂_S (real or fake).]

61

Page 62: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Goal / Problem Setting as above

• Idea (this slide): auxiliary attribute classifier on the discriminator

[Figure: G takes noise z and attribute code l; D outputs real/fake and also classifies w.r.t. the source label l_S (real data) or the assigned code l (generated data), with supervision available only in the source domain.]

62

Page 63: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Goal / Problem Setting as above

• Idea (this slide): bridge the domain gap with a division into high- and low-level layers

63

Page 64: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Goal / Problem Setting as above

• Idea (this slide): semantics of the disentangled factor are learned from label information in the source domain

64

Page 65: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Learning
  • Overall objective function:
    $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D)$

65

Page 66: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Learning
  • Overall objective function:
    $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D)$
  • Adversarial loss:
    $\mathcal{L}_{GAN}^{S}(G, D) = \mathbb{E}[\log D_C(D_S(X_S))] + \mathbb{E}[\log(1 - D_C(D_S(\hat{X}_S)))]$
    $\mathcal{L}_{GAN}^{T}(G, D) = \mathbb{E}[\log D_C(D_T(X_T))] + \mathbb{E}[\log(1 - D_C(D_T(\hat{X}_T)))]$
    $\mathcal{L}_{GAN}(G, D) = \mathcal{L}_{GAN}^{S}(G, D) + \mathcal{L}_{GAN}^{T}(G, D)$

66

Page 67: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Learning (continued)
  • Same adversarial losses as above; this slide highlights the real samples $X_S$ and $X_T$.

67

Page 68: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Learning (continued)
  • Same adversarial losses as above; this slide highlights the generated samples $\hat{X}_S$ and $\hat{X}_T$.

68

Page 69: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Learning (continued)
  • Disentanglement loss (the supervised source branch is AC-GAN-like):
    $\mathcal{L}_{cls}^{S}(G, D) = \mathbb{E}[\log P(l \mid \hat{X}_S)] + \mathbb{E}[\log P(l_S \mid X_S)]$
    $\mathcal{L}_{cls}^{T}(G, D) = \mathbb{E}[\log P(l \mid \hat{X}_T)]$
    $\mathcal{L}_{cls}(G, D) = \mathcal{L}_{cls}^{S}(G, D) + \mathcal{L}_{cls}^{T}(G, D)$
  • Overall objective function:
    $G^* = \arg\min_G \max_D \ \mathcal{L}_{GAN}(G, D) + \mathcal{L}_{cls}(G, D)$

69

Page 70: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Learning (continued)
  • Same losses as above; the unsupervised target branch, which only recovers the assigned code l from $\hat{X}_T$, is InfoGAN-like, while the supervised source branch is AC-GAN-like.

70

Page 71: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Add an additional encoder
  • The input changes from Gaussian noise to an image
  • Enables image translation with the attribute of interest

71

Page 72: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Experiment results

[Figure: translation results from input images; the target domain has no label supervision.]

Wang et al., "Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation." CVPR 2018 72

Page 73: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CDRD

• Experiment results

[Figure: qualitative results on digits, faces, and scenes; cross-domain classification compared against prior methods; t-SNE visualization of digit features.]

Wang et al., "Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation." CVPR 2018 73

Page 74: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Comparisons

The first four columns concern cross-domain image translation; the last two concern representation disentanglement.

| Method      | Unpaired Training Data | Multi-domains | Bi-direction | Joint Representation | Unsupervised | Interpretability of Disentangled Factor |
| Pix2pix     | X | X | X | X | cannot disentangle representation | cannot disentangle representation |
| CycleGAN    | O | X | O | X | cannot disentangle representation | cannot disentangle representation |
| StarGAN     | O | O | O | X | cannot disentangle representation | cannot disentangle representation |
| UNIT        | O | X | O | O | cannot disentangle representation | cannot disentangle representation |
| DTN         | O | X | X | O | cannot disentangle representation | cannot disentangle representation |
| InfoGAN     | cannot translate images across domains | - | - | - | O | X |
| AC-GAN      | cannot translate images across domains | - | - | - | X | O |
| CDRD (Ours) | O | O | O | O | Partially | O |

74

Page 75: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

What’s to Be Covered Today…

• Visualization of NNs
  • t-SNE
  • Visualization of Feature Maps
  • Neural Style Transfer

• Understanding NNs
  • Image Translation
  • Feature Disentanglement

• Recurrent Neural Networks
  • LSTM & GRU

Many slides from Fei-Fei Li, Bhiksha Raj, Yuying Yeh, and Hsuan-I Ho 75

Page 76: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

What We Have Learned…

• CNN for Image Classification

[Figure: a CNN classifying images as DOG, CAT, or MONKEY.]

76

Page 77: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

What We Have Learned…

• Generative Models (e.g., GAN) for Image Synthesis

77

Page 78: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

But…

• Is it sufficiently effective and robust to perform visual analysis from a single image?

• E.g., Are the hands of this man tied?

78

Page 79: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

But…

• Is it sufficiently effective and robust to perform visual analysis from a single image?

• E.g., Are the hands of this man tied?

79

Page 80: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

So What Are The Limitations of CNN?

• Cannot easily model sequential data
  • Both the input and the output might be sequential data (scalars or vectors)

• Purely feed-forward processing
  • Cannot memorize; no long-term feedback

80

Page 81: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Example of (Visual) Sequential Data

https://quickdraw.withgoogle.com/#

81

Page 82: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

More Applications in Vision

Image Captioning

Figure from Vinyals et al., "Show and tell: A neural image caption generator", CVPR 2015

82

Page 83: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

More Applications in Vision

Visual Question Answering (VQA)

[Figure: VQA takes an image and a question as input and produces an answer as output.]

Figure from Zhu et al., "Visual 7W: Grounded Question Answering in Images", CVPR 2016

83

Page 84: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

How to Model Sequential Data?

• Deep learning for sequential data
  • Recurrent neural networks (RNNs)
  • 3-dimensional convolutional neural networks

RNN 3D convolution

84

Page 85: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Recurrent Neural Networks

• Parameter sharing + unrolling
  • Keeps the number of parameters fixed
  • Allows sequential data with varying lengths

• Memory ability
  • Captures and preserves information that has already been extracted

85

Page 86: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Recurrence Formula

• The same function and parameters are used at every time step:

  $h_t,\ y_t = f_W(h_{t-1}, x_t)$

  where $h_t$ is the new state at time t, $h_{t-1}$ the state at time t-1, $x_t$ the input vector at time t, $y_t$ the output vector at time t, and $f_W$ a function with parameters W.

86

Page 87: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Recurrence Formula

• The same function and parameters are used at every time step:

  $h_t,\ y_t = f_W(h_{t-1}, x_t)$

  For a vanilla RNN:
  $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
  $y_t = W_{hy} h_t$

87
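A minimal sketch of the vanilla recurrence above in PyTorch; the dimensions and random weights are illustrative assumptions.

```python
# Vanilla RNN step implementing h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t.
import torch

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t)   # new hidden state
    y_t = W_hy @ h_t                               # output at time t
    return h_t, y_t

# Unrolling over a sequence reuses the same weights at every time step.
D, H, O = 8, 16, 4                                 # input, hidden, output sizes (assumed)
W_xh, W_hh, W_hy = torch.randn(H, D), torch.randn(H, H), torch.randn(O, H)
h = torch.zeros(H)
for x_t in torch.randn(5, D):                      # a toy length-5 sequence
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)
```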

Page 88: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Multiple Recurrent Layers

88

Page 89: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Multiple Recurrent Layers

89

Page 90: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: unrolled RNN, first step: $h_1 = f_W(h_0, x_1)$.]

90

Page 91: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: unrolled RNN, second step: $h_2 = f_W(h_1, x_2)$.]

91

Page 92: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: RNN unrolled through time: $h_t = f_W(h_{t-1}, x_t)$ for t = 1, …, T.]

92

Page 93: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: the same weights $W_{xh}$ and $W_{hh}$ are shared across all time steps of the unrolled RNN.]

93

Page 94: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: outputs $y_1, \ldots, y_T$ are produced from each hidden state via the shared $W_{hy}$.]

94

Page 95: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: RNN input/output configurations with example tasks: image captioning, action recognition, video indexing, video prediction.]

95

Page 96: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Example: Image Captioning

Figure from Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015

96

Page 97: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

CNN

97

Page 98: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

98

Page 99: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

99

Page 100: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

100

Page 101: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

101

Page 102: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

102

Page 103: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

103

Page 104: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Example: Action Recognition

Figure from “Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features” 104

Page 105: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: unrolled RNN producing outputs $y_1, \ldots, y_T$ from inputs $x_1, \ldots, x_T$.]

105

Page 106: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: many-to-one action recognition: frame features $x_1, \ldots, x_T$ are fed to the RNN; a softmax over the final output predicts the action (e.g., golf, standing, exercise).]

106

Page 107: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

[Figure: training: the softmax output is compared with the action label via a cross-entropy loss; gradients are computed by backpropagation through time.]

107

Page 108: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Back Propagation Through Time (BPTT)

108

Page 109: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013

Back Propagation Through Time (BPTT)

109

Page 110: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Training RNNs via BPTT

110

Page 111: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Training RNNs: Forward Pass

• Forward pass for each training instance:
  • Pass the entire data sequence through the net
  • Generate the outputs

111

Page 112: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Training RNNs: Computing Gradients

• After the forward pass of each training instance:
  • Backward pass: compute gradients via backpropagation
  • More specifically, backpropagation through time (BPTT)

112

Page 113: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Training RNNs: BPTT

• Let’s focus on one training instance.
• The divergence is computed between the sequence of outputs produced by the network and the desired output sequence.
• In general, this is not just the sum of the divergences at individual time steps.

113

Page 114: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Gradient Vanishing & Exploding

● Computing the gradient involves many repeated factors of W
  ○ Exploding gradients: largest singular value > 1
  ○ Vanishing gradients: largest singular value < 1

114

Page 115: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Solutions…

● Gradient clipping: rescale the gradients if they become too large

● How about vanishing gradients?

[Figure: standard gradient descent trajectories vs. trajectories with gradient clipping fixing the problem.]

115
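A short sketch of gradient clipping in a typical PyTorch training step; max_norm is an assumed threshold.

```python
# Hedged sketch: clip the gradient norm before the optimizer step to handle exploding gradients.
import torch

def train_step(model, loss_fn, optimizer, x, y, max_norm=1.0):   # max_norm is an assumed value
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                                               # BPTT computes the gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # rescale if the norm is too large
    optimizer.step()
    return loss.item()
```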

Page 116: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Variants of RNN

• Long Short-Term Memory (LSTM) [Hochreiter & Schmidhuber, 1997]
  • Additional memory cell
  • Input / Forget / Output gates
  • Handles gradient vanishing
  • Learns long-term dependencies

• Gated Recurrent Unit (GRU) [Cho et al., EMNLP 2014]
  • Similar to LSTM
  • No additional memory cell
  • Reset / Update gates
  • Fewer parameters than LSTM
  • Comparable performance to LSTM [Chung et al., NIPS Workshop 2014]

116
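For reference, a brief sketch of using the built-in LSTM and GRU layers in PyTorch for a many-to-one classifier; the sizes are illustrative assumptions.

```python
# Hedged sketch: LSTM vs. GRU layers in PyTorch (sizes are illustrative).
import torch
import torch.nn as nn

seq_len, batch, in_dim, hidden, n_classes = 10, 4, 32, 64, 5
x = torch.randn(seq_len, batch, in_dim)

lstm = nn.LSTM(in_dim, hidden)        # gates + a separate cell state
gru = nn.GRU(in_dim, hidden)          # reset/update gates, no separate cell state
classifier = nn.Linear(hidden, n_classes)

out, (h_n, c_n) = lstm(x)             # h_n: final hidden state, c_n: final cell state
logits = classifier(h_n[-1])          # many-to-one: classify from the final hidden state
out_gru, h_gru = gru(x)               # the GRU returns only a hidden state
```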

Page 117: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Vanilla RNN vs. LSTM

[Figure: vanilla RNN vs. LSTM: the LSTM adds a memory cell plus forget, input, and output gates, an input activation, a cell state, and a hidden state; inputs and outputs are given per time step t.]

117

Page 118: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w9__1...Chen et al., InfoGAN: Interpretable representation learning by information maximizing generative

Long Short-Term Memory (LSTM)

Image Credit: Hung-Yi Lee

118

[Diagram: LSTM cell with a memory cell; forget, input, and output gates; input activation; cell state and hidden state; input and output at time t]

Page 119:

[Diagram: complete LSTM cell with all components labeled: forget gate, input gate, input activation (tanh), output gate, cell state, and hidden state]

Long Short-Term Memory (LSTM)

119

Page 120:

Calculate forget gate f

[Diagram: LSTM cell with the forget gate highlighted; cell state and hidden state shown]

Long Short-Term Memory (LSTM)

120
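One standard way to write this step (an assumption about notation rather than the slide's exact symbols: x_t is the current input, h_{t-1} the previous hidden state, [·,·] concatenation, and σ the logistic sigmoid):

f_t = \sigma\left( W_f [h_{t-1}, x_t] + b_f \right)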

Page 121:

Long Short-Term Memory (LSTM)

Calculate input gate i

[Diagram: LSTM cell with the input gate highlighted; cell state and hidden state shown]

121
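In the same assumed notation, the input gate is computed analogously:

i_t = \sigma\left( W_i [h_{t-1}, x_t] + b_i \right)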

Page 122:

Calculate input activation (tanh)

[Diagram: LSTM cell with the input activation highlighted; cell state and hidden state shown]

Long Short-Term Memory (LSTM)

122
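In the same assumed notation, writing the input activation as g_t (it is often also denoted \tilde{c}_t):

g_t = \tanh\left( W_g [h_{t-1}, x_t] + b_g \right)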

Page 123:

Calculate output gate

[Diagram: LSTM cell with the output gate highlighted; cell state and hidden state shown]

Long Short-Term Memory (LSTM)

123
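In the same assumed notation:

o_t = \sigma\left( W_o [h_{t-1}, x_t] + b_o \right)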

Page 124:

Update memory cell

[Diagram: LSTM cell with the cell-state update highlighted; cell state and hidden state shown]

Long Short-Term Memory (LSTM)

124
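In the same assumed notation, with \odot denoting element-wise multiplication, the forget gate scales the previous cell state and the input gate scales the new input activation:

c_t = f_t \odot c_{t-1} + i_t \odot g_t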

Page 125:

Calculate output h_t

[Diagram: LSTM cell with the hidden-state output highlighted; cell state (through tanh) and hidden state shown]

Long Short-Term Memory (LSTM)

125
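In the same assumed notation, the hidden state (the output at time t) is the gated, squashed cell state:

h_t = o_t \odot \tanh(c_t)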

Page 126:

Prevent gradient vanishing if the forget gate is open (> 0).

Remarks:
- The forget gate f provides a shortcut connection across time steps.
- c_t depends on c_{t-1} linearly (through element-wise gating), instead of the multiplicative relationship between h_t and h_{t-1} in a vanilla RNN.

[Diagram: LSTM cell with the additive cell-state path highlighted]

Long Short-Term Memory (LSTM)

126
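To make the remark concrete: along the cell-state path of the standard formulation above (treating the gates as constants, which is an approximation since they also depend on h_{t-1}), the Jacobian is

\frac{\partial c_t}{\partial c_{t-1}} = \mathrm{diag}(f_t)

so gradients flow across time steps largely unattenuated whenever the forget gate stays close to 1.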

Page 127:

Variants of RNN

• Long Short-Term Memory (LSTM) [Hochreiter & Schmidhuber, 1997]

• Additional memory cell
• Input / Forget / Output gates
• Handles gradient vanishing
• Learns long-term dependencies

• Gated Recurrent Unit (GRU) [Cho et al., EMNLP 2014]

• Similar to LSTM
• No additional memory cell
• Reset / Update gates
• Fewer parameters than LSTM
• Comparable performance to LSTM [Chung et al., NIPS Workshop 2014]

127

Page 128:

LSTM vs. GRU

LSTM: forget gate, input gate, input activation, output gate, cell state, hidden state

GRU: reset gate, update gate, input activation, hidden state

128

Page 129:

Gated Recurrent Unit (GRU)

[Diagram: GRU cell; hidden state updated through gating and a tanh input activation]

129

Page 130:

Calculate reset gate r

[Diagram: GRU cell with the reset gate highlighted; hidden state shown]

Gated Recurrent Unit (GRU)

130
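One standard way to write this step (assumed notation, mirroring the LSTM slides: x_t is the current input, h_{t-1} the previous hidden state, [·,·] concatenation, and σ the logistic sigmoid):

r_t = \sigma\left( W_r [h_{t-1}, x_t] + b_r \right)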

Page 131:

Calculate update gate z

[Diagram: GRU cell with the update gate highlighted; hidden state shown]

Gated Recurrent Unit (GRU)

131
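In the same assumed notation:

z_t = \sigma\left( W_z [h_{t-1}, x_t] + b_z \right)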

Page 132:

Calculate input activation (tanh)

[Diagram: GRU cell with the input activation highlighted; hidden state shown]

Gated Recurrent Unit (GRU)

132
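In the same assumed notation, the reset gate controls how much of the previous hidden state enters the candidate activation:

\tilde{h}_t = \tanh\left( W_h [r_t \odot h_{t-1}, x_t] + b_h \right)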

Page 133:

Calculate output

[Diagram: GRU cell with the hidden-state output highlighted]

Gated Recurrent Unit (GRU)

133
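In the same assumed notation, following the convention suggested by the next slide, where the update gate z gates the shortcut from h_{t-1} (some references swap the roles of z_t and 1 - z_t):

h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t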

Page 134:

Prevent gradient vanishing if the update gate is open!

Remarks:
- The update gate z provides a shortcut connection across time steps.
- h_t depends on h_{t-1} linearly (through element-wise gating), instead of the multiplicative relationship between h_t and h_{t-1} in a vanilla RNN.

[Diagram: GRU cell with the gated shortcut path highlighted]

Gated Recurrent Unit (GRU)

134

Page 135:

[Diagram: vanilla RNN, LSTM, and GRU cells side by side, each mapping the input x at time t to the output y at time t]

135

Vanilla RNN, LSTM, & GRU
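A minimal PyTorch sketch (not from the slides) comparing the three layers: they share the same input/output interface, the LSTM additionally returns a cell state, and the parameter counts reflect the number of gates (LSTM > GRU > vanilla RNN). All sizes are arbitrary example values.

import torch
import torch.nn as nn

d_in, d_hid, T, batch = 8, 16, 5, 2
x = torch.randn(T, batch, d_in)              # (seq_len, batch, input_size)

rnn  = nn.RNN(d_in, d_hid)                   # vanilla RNN (tanh)
lstm = nn.LSTM(d_in, d_hid)                  # LSTM: 3 gates + cell state
gru  = nn.GRU(d_in, d_hid)                   # GRU: 2 gates, no cell state

y_rnn,  h_rnn   = rnn(x)                     # hidden state only
y_lstm, (h, c)  = lstm(x)                    # hidden state and cell state
y_gru,  h_gru   = gru(x)                     # hidden state only

for name, m in [("RNN", rnn), ("LSTM", lstm), ("GRU", gru)]:
    print(name, sum(p.numel() for p in m.parameters()))   # LSTM > GRU > RNN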

Page 136:

LSTM vs. GRU

Vanilla RNN vs. LSTM vs. GRU:
- Cell state: none / yes / none
- Number of gates: 0 / 3 / 2
- Parameters: fewest / most / fewer than LSTM
- Gradient vanishing & exploding: prone to both / largely mitigated by gating / largely mitigated by gating

136

Page 137:

References

Book: http://www.deeplearningbook.org/contents/rnn.html

VT (ECE 6504):
https://computing.ece.vt.edu/~f15ece6504/slides/L14_RNNs.pptx.pdf
https://computing.ece.vt.edu/~f15ece6504/slides/L15_LSTMs.pptx.pdf

CS231n: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf

Jia-Bin Huang: https://filebox.ece.vt.edu/~jbhuang/teaching/ece5554-4554/fa17/lectures/Lecture_23_ActionRec.pdf

MLDS: https://www.csie.ntu.edu.tw/~yvchen/f106-adl/doc/171030+171102_Attention.pdf

137

Page 138:

Web Tutorial

● http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

● https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html

138

Page 139:

What We’ve Learned Today…

• Understanding NNs
  • Image Translation
  • Feature Disentanglement

• Recurrent Neural Networks
  • LSTM & GRU

• HW4 is due 5/18 Fri 23:59!

139