
Deep Learning

Alexandre Allauzen, Michèle Sebag
CNRS & Université Paris-Sud

Oct. 17th, 2018

Credit for slides: Sanjeev Arora; Yoshua Bengio; Yann LeCun; Nando de Freitas; Pascal Germain; Léon Gatys; Weidi Xie; Max Welling; Victor Berger; Kevin Frans; Lars Mescheder et al.; Mehdi Sajjadi et al.

1 / 72


Representation is everything

Auto-Encoders
  Denoising Auto-Encoders
  Exploiting latent representations

Siamese Networks

Variational Auto-Encoders

Generative Adversarial Networks

2 / 72


The Deep ML revolution: what is new ?

Former state of the art e.g. in computer vision

SIFT: scale-invariant feature transform
HOG: histogram of oriented gradients

Textons: “vector quantized responses of a linear filter bank”

3 / 72


What is new, 2

Traditional approach

→ Manually crafted features
→ Trainable classifier

Deep learning

→ Trainable feature extractor
→ Trainable classifier

4 / 72


A new representation is learned

Bengio et al. 2006

5 / 72


Good features ?

Le Cun, 2015: https://www.youtube.com/watch?v=Y-XTcTusUxQ

6 / 72


Losses and labels

Lesson learned, quasi-consensus @ ICML 2018

- Introduce many auxiliary tasks / losses
- Why? It smoothens the optimization landscape (conjecture)

7 / 72


Tutorial Sanjeev Arora @ ICML18

http://unsupervised.cs.princeton.edu/deeplearningtutorial.html

Key issues

- Optimization: highly non-convex
- Overparametrization / Generalization

  Min (w − 1)² + (w + 1)²    vs.    Min (w₁ − 1)² + (w₂ + 1)²

- Role of depth: more expressivity; filters noise
- Make it simpler? One game is to find a solution; another game is to simplify it. See also Max Welling's talk.

8 / 72


Representation is everything

Auto-Encoders
  Denoising Auto-Encoders
  Exploiting latent representations

Siamese Networks

Variational Auto-Encoders

Generative Adversarial Networks

9 / 72


Auto-encoders

E = {x_i ∈ R^D, i = 1 . . . n}          x −→ h_1 ∈ R^d −→ x̂

- An auto-encoder:

  Find W* = arg min_W  Σ_i ‖ (W' ∘ W)(x_i) − x_i ‖²

(*) Instead of the squared error, use the cross-entropy loss (with x̂_i = (W' ∘ W)(x_i) the reconstruction):

  Σ_j  x_{i,j} log x̂_{i,j} + (1 − x_{i,j}) log(1 − x̂_{i,j})

(**) Why W for encoding and W' for decoding?

10 / 72
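
A minimal sketch of the auto-encoder above, assuming PyTorch; the layer sizes, optimizer settings and random mini-batch are placeholders. The decoder uses a separate matrix W' (untied), matching the formulation; tying W' = Wᵀ is the usual alternative behind question (**).

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal auto-encoder x -> h1 -> x_hat (sketch; D and d are placeholders)."""
    def __init__(self, D=784, d=32):
        super().__init__()
        self.enc = nn.Linear(D, d)      # W
        self.dec = nn.Linear(d, D)      # W' (untied from W)

    def forward(self, x):
        h1 = torch.sigmoid(self.enc(x))
        return torch.sigmoid(self.dec(h1))     # reconstruction in [0, 1]

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                 # stand-in for a mini-batch of data
for _ in range(10):
    x_hat = model(x)
    # squared-error loss; for binary inputs use F.binary_cross_entropy instead (*)
    loss = ((x_hat - x) ** 2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```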


Auto-encoders and Principal Component Analysis

Assume

- A single layer
- Linear activation

Then

- An auto-encoder with k hidden neurons ≈ the first k eigenvectors of PCA

Why?
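
A quick numerical check of this claim, assuming numpy; the data, learning rate and step count are placeholders. A one-layer linear auto-encoder trained by gradient descent recovers the same k-dimensional subspace as PCA (the product W_e W_d approaches the PCA projector), even though the individual hidden units need not coincide with the eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) * np.array([5., 4., 3., 1., 1., 1., 1., 1., 1., 1.])
X -= X.mean(axis=0)                          # center the data, as PCA does
k = 3

# PCA: projector onto the top-k principal subspace
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_pca = Vt[:k].T @ Vt[:k]

# One-layer linear auto-encoder x -> x W_e W_d, trained by plain gradient descent
W_e = rng.normal(scale=0.1, size=(10, k))
W_d = rng.normal(scale=0.1, size=(k, 10))
lr = 1e-3
for _ in range(20000):
    R = X @ W_e @ W_d - X                    # reconstruction residual
    g_d = W_e.T @ (X.T @ R) / len(X)
    g_e = (X.T @ R) @ W_d.T / len(X)
    W_d -= lr * g_d
    W_e -= lr * g_e

# The learned map W_e W_d should approach the PCA projector (same subspace).
print("max deviation from PCA projector:", np.abs(W_e @ W_d - P_pca).max())
```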

11 / 72


Stacked auto-encoders were used to initialize deep networks

In the early Deep Learning era...                        Bengio, Lamblin, Popovici, Larochelle 06

First layer
  x −→ h1 −→ x

Second layer: same, replacing x with h1

  h1 −→ h2 −→ h1

12 / 72


Denoising Auto-Encoders

Vincent, Larochelle, Bengio, Manzagol, 08

Principle
- Add noise to x
  - Gaussian noise
  - or binary masking noise (akin to drop-out)
- Recover x.

  Find W* = arg min_W  Σ_i ‖ (W' ∘ W)(x_i + ε) − x_i ‖²
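
A standalone sketch, assuming PyTorch; sizes, noise levels and the random mini-batch are placeholders. The only change with respect to a plain auto-encoder is that the input is corrupted before encoding while the reconstruction target stays clean:

```python
import torch
import torch.nn as nn

# Denoising auto-encoder sketch: corrupt the input, reconstruct the clean x.
D, d = 784, 32
enc, dec = nn.Linear(D, d), nn.Linear(d, D)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(64, D)                                    # stand-in mini-batch
for _ in range(10):
    # Gaussian corruption ...
    x_noisy = x + 0.1 * torch.randn_like(x)
    # ... or binary masking noise (akin to drop-out):
    # x_noisy = x * (torch.rand_like(x) > 0.3).float()
    x_hat = torch.sigmoid(dec(torch.sigmoid(enc(x_noisy))))
    loss = ((x_hat - x) ** 2).sum(dim=1).mean()          # target is the CLEAN x
    opt.zero_grad(); loss.backward(); opt.step()
```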

13 / 72


Auto-encoders for domain adaptation

Glorot, Bordes, Bengio, 11

Stacked Denoising Auto-Encoders

- Train on source and target instances
- Use the latent representation to learn on the source domain

Why should it work?
"SDAs are able to disentangle hidden factors which explain the variations in the input data, and automatically group features in accordance with their relatedness to these factors. This helps transfer across domains as these generic concepts are invariant to domain-specific vocabularies."

CONS: Computationally expensive

14 / 72


Marginalized Denoising Auto-Encoders

Chen, Xu, Weinberger, Sha 12

Marginalizing a single linear layer

- X̄: m copies of X;  X̃: each coordinate value independently zeroed with probability p
- Find W = arg min ‖ X̄ − W X̃ ‖²
- Solution: W = P Q⁻¹ with P = X̄ X̃′ and Q = X̃ X̃′
- Consider W = E[P] E[Q]⁻¹ as m → ∞
- Define q = (1 − p, . . . , 1 − p, 1)   (the last entry is for the bias)

  E[Q]_{α,β} =  (X X′)_{α,β} q_α q_β    if α ≠ β
                (X X′)_{α,α} q_α        otherwise

Then

- Inject a non-linearity on top of W X: consider σ(W X)
- Use a linear classifier or SVM.
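
The closed form fits in a few lines. A sketch assuming numpy and the slide's conventions (X is d × n with a constant bias feature appended as the last row); the formula for E[P] used below is not on the slide and is taken from Chen et al. 12, and the small ridge term is only for numerical stability:

```python
import numpy as np

def marginalized_denoising_layer(X, p):
    """One marginalized denoising layer in closed form (sketch).
    X: (d, n) data matrix whose last row is the constant bias feature 1."""
    d = X.shape[0]
    q = np.full(d, 1.0 - p)
    q[-1] = 1.0                                    # the bias feature is never corrupted
    S = X @ X.T                                    # scatter matrix X X'
    EQ = S * np.outer(q, q)                        # E[Q]_{ab} = S_{ab} q_a q_b  (a != b)
    np.fill_diagonal(EQ, np.diag(S) * q)           # E[Q]_{aa} = S_{aa} q_a
    EP = S * q[np.newaxis, :]                      # E[P]_{ab} = S_{ab} q_b  (from the paper)
    W = EP @ np.linalg.inv(EQ + 1e-5 * np.eye(d))  # ridge term for numerical stability
    return np.tanh(W @ X)                          # inject the non-linearity on top of W X

# usage sketch: the resulting representation is fed to a linear classifier / SVM
X = np.vstack([np.random.randn(20, 100), np.ones((1, 100))])
H1 = marginalized_denoising_layer(X, p=0.5)
```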

15 / 72


Representation is everything

Auto-Encoders
  Denoising Auto-Encoders
  Exploiting latent representations

Siamese Networks

Variational Auto-Encoders

Generative Adversarial Networks

16 / 72


Visualization with (non-linear) Auto-encoders

For d = 2, dimensionality reduction on the latent representation:
- Multidimensional scaling
- t-Distributed Stochastic Neighbor Embedding (t-SNE)

van der Maaten, Hinton 08
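
For instance, a sketch assuming scikit-learn and matplotlib; the digits dataset and its raw pixels merely stand in for learned latent codes Enc(x):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)           # stand-in for latent codes Enc(x)
# t-SNE: results depend strongly on perplexity -- see the distill.pub article below
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.show()
```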

t-SNE Do and Don’t https://distill.pub/2016/misread-tsne/17 / 72


Morphing of representations                                              Gatys et al. 15, 16

- Style and content in a convolutional NN are separable
- Use a trained VGG-19 net:
  - applied on image 1 (content)
  - applied on image 2 (style)
  - find an input matching the hidden representation of image 1 (weight α) and the hidden representation of image 2 (weight β)

18 / 72


The style

Gatys et al. 15, 16

19 / 72


The content

Gatys et al. 15, 16

Finally
Use image x0 for content (AE φ), x1 for style (AE φ′)
Morphing: Find input image x minimizing

  α ‖φ(x) − φ(x0)‖ + β ⟨φ′(x), φ′(x1)⟩
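
A sketch of the two loss terms, assuming PyTorch; feat_x, feat_c and feat_s stand for VGG-19 activations of the optimized image, the content image x0 and the style image x1 at chosen layers (placeholders here). Note that Gatys et al. match Gram matrices (channel correlations) of the style features, which the slide abbreviates as an inner product:

```python
import torch

def gram(feat):
    """Gram matrix of a (C, H, W) feature map: correlations between channels."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.t() / (h * w)

def content_loss(feat_x, feat_c):
    return ((feat_x - feat_c) ** 2).mean()

def style_loss(feat_x, feat_s):
    return ((gram(feat_x) - gram(feat_s)) ** 2).mean()

# x is the image being optimized; phi / phi_prime stand for hidden layers of a trained VGG-19:
# total = alpha * content_loss(phi(x), phi(x0)) + beta * style_loss(phi_prime(x), phi_prime(x1))
```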

20 / 72


Morphing of representations, 2

Gatys et al. 15, 16

- Content (bottom): convolutions with decreasing precision
- Style (top): correlations between the convolutional features

21 / 72



Representation is everything

Auto-Encoders
  Denoising Auto-Encoders
  Exploiting latent representations

Siamese Networks

Variational Auto-Encoders

Generative Adversarial Networks

22 / 72


Siamese Networks

Classes or similarities ?

23 / 72


Siamese Networks

Bromley et al. 93

Principle

- Neural networks can be used to define a latent representation
- Siamese: optimize the related metric

Schema

24 / 72


Siamese Networks, 2

Data

E = {x_i ∈ R^d, i ∈ [[1, n]]};    S = {(x_{iℓ}, x_{jℓ}, c_ℓ) s.t. c_ℓ ∈ {−1, 1}, ℓ ∈ [[1, L]]}

Experimental setting

- Often: few similar pairs; by default, pairs are dissimilar
- Subsample the dissimilar pairs (optimal ratio between 2/1 and 10/1)
- Domain knowledge can be used to select the dissimilar pairs

25 / 72


Loss

Given similar and dissimilar pairs (E+ and E−)

  L = Σ_{(i,j) ∈ E+} L+(i, j)  +  Σ_{(k,ℓ) ∈ E−} L−(k, ℓ)

26 / 72


Applications

- Signature recognition
- Image recognition, search
- Article / title
- Filter out typos
- Recommendation systems, collaborative filtering

27 / 72


Siamese Networks for one-shot image recognition

Koch, Zemel, Salakhutdinov 15

Training similarity

One-shot setting

28 / 72


Ingredients

Koch et al. 15

Architecture

29 / 72


Ingredients, 2

Koch et al. 15

Architecture

Distance

  d(x, x′) = σ( Σ_k α_k |z_k(x) − z_k(x′)| )

Loss
Given a batch ((x_i, x′_i), y_i) with y_i = 1 iff x_i and x′_i are similar:

  L(w) = Σ_i  y_i log d(x_i, x′_i) + (1 − y_i) log(1 − d(x_i, x′_i))  + λ‖w‖²
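
Put together, a sketch assuming PyTorch; the shared "twin" network is a small MLP here (Koch et al. use a conv net), sizes and the random batch are placeholders, and the loss above (a log-likelihood to maximize) is implemented as its negative, the binary cross-entropy, for minimization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    def __init__(self, in_dim=784, z_dim=64):
        super().__init__()
        # shared "twin" embedding z(.)
        self.embed = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
        self.alpha = nn.Linear(z_dim, 1)     # learned weights alpha_k on |z_k(x) - z_k(x')|

    def forward(self, x, x_prime):
        z1, z2 = self.embed(x), self.embed(x_prime)
        return torch.sigmoid(self.alpha(torch.abs(z1 - z2))).squeeze(1)   # d(x, x') in (0, 1)

net = SiameseNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)      # weight_decay ~ lambda ||w||^2
x, x_p = torch.rand(32, 784), torch.rand(32, 784)
y = torch.randint(0, 2, (32,)).float()        # y_i = 1 iff x_i and x'_i are similar
d = net(x, x_p)
loss = F.binary_cross_entropy(d, y)           # = -(y log d + (1 - y) log(1 - d)), averaged
opt.zero_grad(); loss.backward(); opt.step()
```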

30 / 72


Results                                                       Koch et al. 15

Omniglot

Results

31 / 72


Siamese Networks

PROS

- Learn metrics, invariance operators
- Generalization beyond the training data

CONS

- More computationally intensive
- More hyperparameters and fine-tuning, more training

32 / 72


Representation is everything

Auto-Encoders
  Denoising Auto-Encoders
  Exploiting latent representations

Siamese Networks

Variational Auto-Encoders

Generative Adversarial Networks

33 / 72


Beyond AE

- A compressed (latent) representation

  x ∈ R^D ↦ z = Enc(x) ∈ R^d ↦ Dec(z) ∈ R^D ≈ x

- Distance in latent space is meaningful: d(Enc(x), Enc(x′)) reflects d(x, x′)
- But ∀z ∈ R^d: is Dec(z) ∈ R^D meaningful?

“What I cannot create I do not understand” Feynman 88

34 / 72



Variational Auto-Encoders                  Kingma, Welling 14; Salimans, Kingma, Welling 15

What we have:
- Enc, a memorization of the data, s.t. there exists Dec ≈ Enc⁻¹

What we want:
- z ∼ P s.t. Dec(z) ∼ D_data

35 / 72


Distribution estimation

Data
  E = {x_1, . . . , x_n},  x_i ∈ X

Goal

- Find a probability distribution that models the data

  p_θ : X → [0, 1]   s.t.   θ = arg max Π_i p_θ(x_i)

- ≡ maximize the log-likelihood of the data

  arg max Π_i p_θ(x_i) = arg max Σ_i log p_θ(x_i)

Gaussian case

  θ = (µ, σ)        p_θ(x) = (1 / (σ √(2π))) exp( − (x − µ)² / (2σ²) )

36 / 72


Akin Graphical models

Find hidden variables z s.t.

  z ↦ x  with a good p(x|z)

Bayes relation
  p(x, z) = p(z|x) · p(x) = p(x|z) · p(z)

Hence

  p(z|x) = p(x|z) · p(z) / ∫ p(x|z) · p(z) dz

Problem: the denominator is computationally intractable...

State of the art

- Monte-Carlo estimation
- Variational inference: choose z well-behaved, and make q(z) "close" to p(z|x).

37 / 72


Variational Inference

- Approximate p(z|x) by q(z)
- Minimize the distance between both, using the Kullback-Leibler divergence

Reminder

- information(x) = − log p(x)
- entropy(x_1, . . . , x_k) = − Σ_i p(x_i) log p(x_i)
- Kullback-Leibler divergence between distributions q and p:

  KL(q || p) = Σ_x q(x) log( q(x) / p(x) )

  Beware: not symmetrical, hence not a distance; plus numerical issues when supports are different

Variational inference

  Minimize  KL(q(z) || p(z|x)) = ∫ q(z) log( q(z) / p(z|x) ) dz

38 / 72


Evidence Lower Bound (ELBO)

  KL(q(z) || p(z|x)) = ∫ q(z) log( q(z) / p(z|x) ) dz

use p(z|x) = p(z, x) / p(x):

  KL(q(z) || p(z|x)) = ∫ q(z) log( q(z) p(x) / p(z, x) ) dz

                     = ∫ q(z) log( q(z) / p(z, x) ) dz + ∫ q(z) log p(x) dz

as ∫ q(z) dz = 1:

  KL(q(z) || p(z|x)) = ∫ q(z) log( q(z) / p(z, x) ) dz + log p(x)

(one recovers KL(q(z) || p(z, x)) in the first term)

  KL(q(z) || p(z|x)) = − ∫ q(z) log( p(z, x) / q(z) ) dz + log p(x)

39 / 72



Evidence Lower Bound, 2

Define

  L(q(z)) = ∫ q(z) log( p(z, x) / q(z) ) dz

Last slide:

  KL(q(z) || p(z|x)) = log p(x) − L(q(z))

Hence

  Minimizing the Kullback-Leibler divergence ≡ Maximizing L(q(z))

40 / 72


Evidence Lower Bound, 3

More formula massaging

  L(q(z)) = ∫ q(z) log( p(z, x) / q(z) ) dz

use p(z, x) = p(z|x) p(x):

  L(q(z)) = ∫ q(z) log( p(z|x) p(x) / q(z) ) dz

          = ∫ q(z) log( p(z|x) / q(z) ) dz + ∫ q(z) log p(x) dz

          = −KL(q(z) || p(z|x)) + E_q[log p(x)]

Finally
  Maximize  E_q[log p(x)] − KL(q(z) || p(z|x))

  make p(x) great under q                                    akin to data fitting
  while minimizing the KL divergence between the two         akin to regularization

41 / 72



Where neural nets come in

Searching p and q

- We want p(x|z); we search for p(z|x)
- Let p(z|x) be defined as a neural net (the encoder)
- We want it to be close to a well-behaved (Gaussian) distribution q(z)

  Minimize KL(q(z) || p(z|x))

- From z we generate a distribution p(x|z) (defined as a neural net, the "decoder")
- such that p(x|z) gives a high probability mass to our data (next slide)

  Maximize E_q[log p(x)]

Good news
All these criteria are differentiable! They can be used to train the neural net.

42 / 72


The loss of the variational decoder

Continuous case

- x ↦ z; Gaussian case, z ∼ p(z|x)
- Now z is given as input to the decoder, which generates x̂ (deterministic)
- p(x̂|x) = F( exp{ −‖x̂ − x‖² } )
- ... back to the L2 loss

Binary case

- Exercise: back to the cross-entropy loss

43 / 72


Variational auto-encoders

Kingma et al. 13

Position

- Like an auto-encoder (data-fitting term) with a regularizer: the KL divergence between the distribution of the hidden variables z and the target distribution.
- Say the hidden variable follows a Gaussian distribution: z ∼ N(µ, Σ)
- Therefore, the encoder must compute the parameters µ and Σ

44 / 72


Variational auto-encoders, 2

Kingma et al. 13

45 / 72


The reparameterization trick

Principle

- Hidden layer: parameters of a distribution N(µ, σ²)
- The distribution is used to generate values z = µ + σ × N(0, 1)
- Enables backprop; reduces the variance of the gradient
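
A sketch of the trick inside a minimal VAE, assuming PyTorch; the sizes are placeholders, and the KL term uses the standard closed form for a diagonal Gaussian posterior against the N(0, I) prior:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, D=784, d=20, h=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(D, h), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h, d), nn.Linear(h, d)
        self.dec = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, D))

    def forward(self, x):
        hidden = self.enc(x)
        mu, logvar = self.mu(hidden), self.logvar(hidden)
        # reparameterization trick: z = mu + sigma * N(0, 1), so gradients flow through mu, sigma
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return torch.sigmoid(self.dec(z)), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")        # data-fitting term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())     # KL(q(z|x) || N(0, I))
    return recon + kl

model = VAE()
x = torch.rand(64, 784)                      # stand-in mini-batch
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
loss.backward()
```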

46 / 72


Examples

47 / 72


Examples

Also: https://www.youtube.com/watch?v=XNZIN7Jh3Sg

48 / 72


Discussion

PROS

- A trainable generative model

CONS

- The generative model has a Gaussian distribution at its core: blurry outputs

49 / 72



Representation is everything

Auto-Encoders
  Denoising Auto-Encoders
  Exploiting latent representations

Siamese Networks

Variational Auto-Encoders

Generative Adversarial Networks

50 / 72


Generative Adversarial Networks

Goodfellow et al., 14

Goal: Find a generative model

- Classical: learn a distribution                                        hard
- Idea: replace the distribution evaluation by a two-sample test

Principle

- Find a good generative model, s.t. generated samples cannot be discriminated from real samples

  (not easy)

51 / 72


Principle

Goodfellow, 2017

Elements

- True samples x (real)
- A generator G (a variational auto-encoder): generates from x (reconstructed) or from scratch (fake)
- A discriminator D: discriminates fake from the others (real and reconstructed)

52 / 72


Principle, 2

Goodfellow, 2017

Mechanism

I Alternate minimization

I Optimize D to tell fake from rest

I Optimize G to deceive D Turing test

MinG MaxDIEx∈data[log(D(x))] + IEz∼px (z)[log(1− D(z))]

Caveat

I The above loss has a vanishing gradient problem because of the terms inlog(1− D(z)).

I We can replace it with − log((1− D(z)/D(z)), which has the same fixedpoint (the true distribution) but doesn’t saturate.
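
A sketch of the alternating updates, assuming PyTorch; the small MLPs and the toy data distribution are placeholders, and the generator uses the common non-saturating variant (maximize log D(G(z))) in the spirit of the caveat above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim, x_dim = 16, 2
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1))   # outputs logits
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def sample_real(n):                          # stand-in for the data distribution
    return torch.randn(n, x_dim) + torch.tensor([2.0, 0.0])

for step in range(1000):
    # --- discriminator step: maximize log D(x) + log(1 - D(G(z)))
    x_real = sample_real(64)
    x_fake = G(torch.randn(64, z_dim)).detach()
    d_loss = F.binary_cross_entropy_with_logits(D(x_real), torch.ones(64, 1)) + \
             F.binary_cross_entropy_with_logits(D(x_fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator step: non-saturating loss, maximize log D(G(z))
    x_fake = G(torch.randn(64, z_dim))
    g_loss = F.binary_cross_entropy_with_logits(D(x_fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```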

53 / 72


GAN vs VAE

54 / 72


Generative adversarial networks

Goodfellow, 2017

55 / 72


Generative adversarial networks, 2

Goodfellow, 2017

56 / 72


Limitations

Unstable training: co-evolution of Generator / Discriminator

Mode collapse

57 / 72


Limitations, 2

Generating monsters

58 / 72


Towards Principled Methods for Training Generative Adversarial Networks

Arjovsky, Bottou 17

Why minimizing KL fails

  Minimizing KL(P_real || P_gen) = ∫ P_real log( P_real / P_gen )

- For P_real high and P_gen low (mode dropping): high cost
- For P_real low and P_gen high (generating monsters): no cost

The GAN solution: minimizing

  E_{x∼P_r}[log D(x)] + E_{x∼P_g}[log(1 − D(x))]

with

  D*(x) = P_r(x) / (P_r(x) + P_g(x))

i.e., up to a constant, GAN minimizes

  JS(P_real, P_gen) = ½ ( KL(P_real || M) + KL(P_gen || M) )

with M = ½ (P_real + P_gen)

59 / 72


Towards Principled Methods for Training Generative Adversarial Networks, 2

Arjovsky, Bottou 17

Unfortunately
If P_r and P_g lie on non-aligned manifolds, there exists a perfect discriminator; this is the end of optimization!

Proposed alternative: use the Wasserstein distance

  min_G max_{D 1-Lipschitz}   E_{x∼P_r}[D(x)] − E_{x∼P_g}[D(x)]   =   min_G W(P_r, P_g)
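
A sketch of the corresponding training step, assuming PyTorch; the networks and toy data are placeholders. The 1-Lipschitz constraint is enforced here by weight clipping, as in the original WGAN (Arjovsky, Chintala, Bottou 17), which is precisely the tuning issue raised on the next slide; gradient penalty (Gulrajani et al. 17) is the usual refinement:

```python
import torch
import torch.nn as nn

z_dim, x_dim, clip = 16, 2, 0.01
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
C = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1))   # critic D, no sigmoid
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_c = torch.optim.RMSprop(C.parameters(), lr=5e-5)

def sample_real(n):
    return torch.randn(n, x_dim) + torch.tensor([2.0, 0.0])            # stand-in data

for step in range(1000):
    for _ in range(5):                                    # several critic steps per G step
        x_real, x_fake = sample_real(64), G(torch.randn(64, z_dim)).detach()
        c_loss = -(C(x_real).mean() - C(x_fake).mean())   # maximize E_r[D] - E_g[D]
        opt_c.zero_grad(); c_loss.backward(); opt_c.step()
        for p in C.parameters():                          # weight clipping ~ 1-Lipschitz
            p.data.clamp_(-clip, clip)
    g_loss = -C(G(torch.randn(64, z_dim))).mean()         # generator minimizes -E_g[D]
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```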

60 / 72


Does not solve all issues !

Problem of vanishing/exploding gradients in WGAN, addressed through weight clipping          careful tuning needed

New Regularizations

Improved Training of Wasserstein GANs
                                              Gulrajani, Ahmed, Arjovsky, Dumoulin, Courville 17

Stabilizing Training of Generative Adversarial Networks through Regularization

                                              Roth, Lucchi, Nowozin, Hofmann, 17

61 / 72


Which Training Methods for GANs do actually Converge?

Mescheder, Geiger and Nowozin, 18

"Simple experiments, simple theorems are the building blocks that help us understand more complicated systems."    Ali Rahimi, Test of Time Award speech, NIPS 2017

62 / 72


Which Training Methods for GANs do actually Converge? 2
Mescheder, Geiger and Nowozin, 18

Toy example
  P_r = δ_0       P_g = δ_θ

63 / 72


Which Training Methods for GANs do actually Converge? 2
Mescheder, Geiger and Nowozin, 18

Lesson learned: cyclic behavior for GAN and WGAN

64 / 72


State-of-the-art Generative Adversarial Networks

Mescheder, Geiger and Nowozin, 2018

65 / 72


Tempered Adversarial Networks

Sajjadi, Parascandolo, Mehrjou, Schölkopf 18

Principle: life is too easy for the discriminator!

66 / 72


Tempered Adversarial Networks

Sajjadi, Parascandolo, Mehrjou, Scholkopf 18

Principle: an adversary to the adversary

- ⇒ Provide L(X) instead, with the lens L aimed at: i) deceiving the discriminator; ii) staying close to the original images

  min_G max_D   E_{x∼P_r}[log D(x)] + E_{x∼P_g}[log(1 − D(x))]

with D trained from {(L(x), 1)} ∪ {(G(z), 0)}, and the lens L optimized as

  L* = arg min  −λ L(D) + Σ_i ‖L(x_i) − x_i‖²

and λ → 0.

67 / 72


Tempered Adversarial Networks, 2
Sajjadi, Parascandolo, Mehrjou, Schölkopf 18

68 / 72


Partial conclusions

- Deep revolution: learning representations
- Adversarial revolution: a Turing test for machines
- Where is the limitation?
  VAE: great but blurry
  GAN: great but mode dropping
  The loss function needs more work.

69 / 72


References (tbc)

see: github.com/artix41/awesome-transfer-learning

- Martin Arjovsky, Soumith Chintala, Léon Bottou: Wasserstein GAN, 2017
- Martin Arjovsky, Léon Bottou: Towards Principled Methods for Training Generative Adversarial Networks, 2017
- Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle: Greedy Layer-Wise Training of Deep Networks. NIPS 2006: 153-160
- Yoshua Bengio: Learning Deep Architectures for AI. Foundations and Trends in Machine Learning 2(1): 1-127 (2009)
- Diederik P. Kingma, Max Welling: Auto-Encoding Variational Bayes. ICLR, 2014
- Gregory Koch, Richard Zemel, Ruslan Salakhutdinov: Siamese Neural Networks for One-shot Image Recognition. ICML 2015
- Leon A. Gatys, Alexander S. Ecker, Matthias Bethge: A Neural Algorithm of Artistic Style. NIPS 2015
- Xavier Glorot, Antoine Bordes, Yoshua Bengio: Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach. ICML 2011

70 / 72


References (tbc)

- Ian J. Goodfellow, Yoshua Bengio, Aaron C. Courville: Deep Learning. Adaptive Computation and Machine Learning, MIT Press 2016, ISBN 978-0-262-03561-3, pp. 1-775
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio: Generative Adversarial Networks. NIPS 2014
- Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville: Improved Training of Wasserstein GANs, 2017
- Laurens van der Maaten, Geoffrey E. Hinton: Visualizing Data Using t-SNE. Journal of Machine Learning Research 9: 2579-2605, 2008
- Lars Mescheder, Andreas Geiger, Sebastian Nowozin: Which Training Methods for GANs do actually Converge? ICML 2018
- Javier Portilla, Eero P. Simoncelli: A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients. Int. J. Comput. Vis. 40, 49-70 (2000)
- Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, Thomas Hofmann: Stabilizing Training of Generative Adversarial Networks through Regularization. NIPS 2017

71 / 72


References (tbc)

- Mehdi S. M. Sajjadi, Giambattista Parascandolo, Arash Mehrjou, Bernhard Schölkopf: Tempered Adversarial Networks. ICML 2018
- Tim Salimans, Diederik Kingma, Max Welling: Markov Chain Monte-Carlo and Variational Inference: Bridging the Gap. ICML 2015
- Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol: Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008

72 / 72