
Variational Autoencoders - An Introduction

Devon Graham

University of British Columbia
drgraham@cs.ubc.ca

Oct 31st, 2017

Table of contents

Introduction

Deep Learning Perspective

Probabilistic Model Perspective

Applications

Conclusion

Introduction

- Auto-Encoding Variational Bayes, Diederik P. Kingma and Max Welling, ICLR 2014
- Generative model
- Running example: want to generate realistic-looking MNIST digits (or celebrity faces, video game plants, cat pictures, etc.)
- https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
- Deep Learning perspective and Probabilistic Model perspective


Introduction - Autoencoders

- Attempt to learn the identity function
- Constrained in some way (e.g., a small latent vector representation)
- Can generate new images by giving different latent vectors to the trained network
- Variational: use a probabilistic latent encoding


Deep Learning Perspective


- Goal: build a neural network that generates MNIST digits from random (Gaussian) noise
- Define two sub-networks: Encoder and Decoder
- Define a Loss Function


Encoder

- A neural network qθ(z|x)
- Input: datapoint x (e.g., a 28 × 28-pixel MNIST digit)
- Output: encoding z, drawn from a Gaussian density with parameters θ
- |z| ≪ |x|

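The slides describe the encoder only at this level of detail, so here is a minimal PyTorch sketch under assumed choices (one hidden layer; the 784-400-20 sizes are illustrative, not from the talk). Anticipating the reparameterization slide later on, it outputs the mean and log-variance of a diagonal Gaussian over z rather than a sample.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of q_theta(z|x): maps a flattened 28x28 image to Gaussian parameters over z."""
    def __init__(self, x_dim=784, hidden_dim=400, z_dim=20):  # sizes are illustrative
        super().__init__()
        self.fc = nn.Linear(x_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, z_dim)      # mean of q_theta(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, z_dim)  # log-variance of q_theta(z|x)

    def forward(self, x):
        h = torch.relu(self.fc(x))
        return self.fc_mu(h), self.fc_logvar(h)
```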

Decoder

- A neural network pφ(x|z), parameterized by φ
- Input: encoding z, output from the encoder
- Output: reconstruction x̃, drawn from the distribution of the data
- E.g., output parameters for 28 × 28 Bernoulli variables

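A matching minimal sketch of the decoder, continuing the PyTorch code above (sizes again illustrative): it maps a latent code z to the means of 28 × 28 = 784 independent Bernoulli variables, as in the last bullet.

```python
class Decoder(nn.Module):
    """Sketch of p_phi(x|z): maps a latent code z to Bernoulli parameters, one per pixel."""
    def __init__(self, z_dim=20, hidden_dim=400, x_dim=784):  # sizes are illustrative
        super().__init__()
        self.fc = nn.Linear(z_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, x_dim)

    def forward(self, z):
        h = torch.relu(self.fc(z))
        return torch.sigmoid(self.fc_out(h))  # per-pixel Bernoulli means in (0, 1)
```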

Loss Function

- x̃ is reconstructed from z, where |z| ≪ |x̃|
- How much information is lost when we go from x to z to x̃?
- Measure this with the reconstruction log-likelihood: log pφ(x|z)
- Measures how effectively the decoder has learned to reconstruct x given the latent representation z


Loss Function

- Loss function is negative reconstruction log-likelihood + regularizer
- Loss decomposes into a term for each datapoint:

  $$ L(\theta, \phi) = \sum_{i=1}^{N} l_i(\theta, \phi) $$

- Loss for datapoint x_i:

  $$ l_i(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] + \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) $$


Loss Function

- Negative reconstruction log-likelihood:

  $$ -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] $$

- Encourages the decoder to learn to reconstruct the data
- Expectation taken over the distribution of latent representations


Loss Function

- KL divergence as regularizer:

  $$ \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) = \mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log q_\theta(z \mid x_i) - \log p(z)\big] $$

- Measures information lost when using qθ to represent p
- We will use p(z) = N(0, I)
- Encourages the encoder to produce z's that are close to a standard normal distribution
- Encoder learns a meaningful representation of MNIST digits
- Representations for images of the same digit are close together in latent space
- Otherwise it could “memorize” the data and map each observed datapoint to a distinct region of space

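For a diagonal-Gaussian encoder and the prior p(z) = N(0, I), the KL term has the well-known closed form −½ Σ_j (1 + log σ_j² − μ_j² − σ_j²), and with Bernoulli pixels the reconstruction term is a binary cross-entropy. A sketch of the loss under these assumptions, continuing the PyTorch code above (the function name and the sum-over-batch reduction are my choices, not the slides'):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative reconstruction log-likelihood plus KL(q_theta(z|x) || N(0, I)), summed over a batch."""
    # Bernoulli reconstruction term: -log p_phi(x|z) is the binary cross-entropy
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```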

MNIST latent variable space

[Figure: MNIST latent variable space]
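The slide's figure is not reproduced here. A hypothetical way to produce such a picture, assuming a 2-dimensional latent space so the encoder means can be plotted directly (matplotlib, the helper name, and the colour map are my choices, not the talk's):

```python
import torch
import matplotlib.pyplot as plt

def plot_latent_space(encoder, images, labels):
    """Scatter the encoder means for a batch of digits, coloured by class (assumes z_dim=2)."""
    with torch.no_grad():
        mu, _ = encoder(images.view(images.size(0), -1))  # use the mean of q_theta(z|x)
    plt.scatter(mu[:, 0].numpy(), mu[:, 1].numpy(), c=labels.numpy(), cmap="tab10", s=5)
    plt.colorbar(label="digit class")
    plt.xlabel("z1")
    plt.ylabel("z2")
    plt.show()
```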

Reparameterization trick

- We want to use gradient descent to learn the model's parameters
- Given z drawn from qθ(z|x), how do we take derivatives of (a function of) z w.r.t. θ?
- We can reparameterize: z = μ + σ ⊙ ε
- ε ∼ N(0, I), and ⊙ is the element-wise product
- Can take derivatives of (functions of) z w.r.t. μ and σ
- Output of qθ(z|x) is a vector of μ's and a vector of σ's

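A minimal sketch of the trick, assuming (as in the encoder sketch above) that the network outputs the log-variance, so σ = exp(½ log σ²): the sample becomes a deterministic, differentiable function of μ and σ, with all randomness isolated in ε.

```python
import torch

def reparameterize(mu, logvar):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); gradients flow to mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # the only source of randomness
    return mu + std * eps
```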

Summary

- Deep Learning objective is to minimize the loss function:

  $$ L(\theta, \phi) = \sum_{i=1}^{N} \Big( -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] + \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) \Big) $$

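Putting the sketches together, a hypothetical training step that minimizes L(θ, φ) by stochastic gradient descent; the optimizer choice (Adam), learning rate, and batching are my assumptions, not the slides'.

```python
import torch

encoder, decoder = Encoder(), Decoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x):                      # x: (batch, 784) with pixel values in [0, 1]
    mu, logvar = encoder(x)             # q_theta(z|x)
    z = reparameterize(mu, logvar)      # differentiable sample of z
    x_recon = decoder(z)                # p_phi(x|z)
    loss = vae_loss(x, x_recon, mu, logvar)
    opt.zero_grad()
    loss.backward()                     # gradients w.r.t. both theta and phi
    opt.step()
    return loss.item()
```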

Probabilistic Model Perspective


- Data x and latent variables z
- Joint pdf of the model: p(x, z) = p(x|z)p(z)
- Decomposes into likelihood p(x|z) and prior p(z)
- Generative process:
  - Draw latent variables z_i ∼ p(z)
  - Draw datapoint x_i ∼ p(x|z)
- Graphical model: (figure omitted)

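Once the model is trained, this generative process is just ancestral sampling. A sketch that reuses the decoder sketched earlier and assumes the prior p(z) = N(0, I) introduced later in the talk; for visualization one often shows the Bernoulli means rather than sampled binary pixels.

```python
import torch

def generate(decoder, n=16, z_dim=20):
    """Ancestral sampling: z ~ p(z) = N(0, I), then x ~ p_phi(x|z)."""
    with torch.no_grad():
        z = torch.randn(n, z_dim)      # draw latent variables from the prior
        probs = decoder(z)             # per-pixel Bernoulli means
        return torch.bernoulli(probs)  # draw one binary image per latent sample
```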

Probabilistic Model Perspective

- Suppose we want to do inference in this model
- We would like to infer good values of z, given observed data
- Then we could use them to generate real-looking MNIST digits
- We want to calculate the posterior:

  $$ p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} $$

- Need to calculate the evidence: p(x) = ∫ p(x|z) p(z) dz
- Integral over all configurations of latent variables
- Intractable


Probabilistic Model Perspective

- Variational inference to the rescue!
- Let's approximate the true posterior p(z|x) with the ‘best’ distribution from some family qλ(z|x)
- Which choice of λ gives the ‘best’ qλ(z|x)?
- KL divergence measures the information lost when using qλ to approximate p
- Choose λ to minimize KL(qλ(z|x) || p(z|x)), written KL(qλ || p) below


Probabilistic Model Perspective

$$ \begin{aligned} \mathrm{KL}\big(q_\lambda \,\|\, p\big) &:= \mathbb{E}_{z \sim q_\lambda}\big[\log q_\lambda(z \mid x) - \log p(z \mid x)\big] \\ &= \mathbb{E}_{z \sim q_\lambda}\big[\log q_\lambda(z \mid x)\big] - \mathbb{E}_{z \sim q_\lambda}\big[\log p(x, z)\big] + \log p(x) \end{aligned} $$

- Still contains the p(x) term, so we cannot compute it directly
- But p(x) does not depend on λ, so there is still hope


Probabilistic Model Perspective

- Define the Evidence Lower BOund:

  $$ \mathrm{ELBO}(\lambda) := \mathbb{E}_{z \sim q_\lambda}\big[\log p(x, z)\big] - \mathbb{E}_{z \sim q_\lambda}\big[\log q_\lambda(z \mid x)\big] $$

- Then

  $$ \mathrm{KL}\big(q_\lambda \,\|\, p\big) = \mathbb{E}_{z \sim q_\lambda}\big[\log q_\lambda(z \mid x)\big] - \mathbb{E}_{z \sim q_\lambda}\big[\log p(x, z)\big] + \log p(x) = -\mathrm{ELBO}(\lambda) + \log p(x) $$

- So minimizing KL(qλ || p) w.r.t. λ is equivalent to maximizing ELBO(λ)

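Rearranging the identity above also makes the name explicit (a step left implicit in the slides): since the KL divergence is non-negative,

$$ \log p(x) = \mathrm{ELBO}(\lambda) + \mathrm{KL}\big(q_\lambda \,\|\, p\big) \ge \mathrm{ELBO}(\lambda), $$

so the ELBO is a lower bound on the log evidence, and maximizing it over λ tightens the bound.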

Probabilistic Model Perspective

- Since no two datapoints share latent variables, we can write:

  $$ \mathrm{ELBO}(\lambda) = \sum_{i=1}^{N} \mathrm{ELBO}_i(\lambda) $$

- where

  $$ \mathrm{ELBO}_i(\lambda) = \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i, z)\big] - \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log q_\lambda(z \mid x_i)\big] $$


Probabilistic Model Perspective

- We can rewrite the term ELBO_i(λ):

  $$ \begin{aligned} \mathrm{ELBO}_i(\lambda) &= \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i, z)\big] - \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log q_\lambda(z \mid x_i)\big] \\ &= \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i \mid z) + \log p(z)\big] - \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log q_\lambda(z \mid x_i)\big] \\ &= \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i \mid z)\big] - \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log q_\lambda(z \mid x_i) - \log p(z)\big] \\ &= \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i \mid z)\big] - \mathrm{KL}\big(q_\lambda(z \mid x_i) \,\|\, p(z)\big) \end{aligned} $$


Probabilistic Model Perspective

- How do we relate λ to the φ and θ seen earlier?
- We can parameterize the approximate posterior qθ(z|x, λ) by a network that takes data x and outputs parameters λ
- Parameterize the likelihood p(x|z) with a network that takes latent variables and outputs the parameters of the data distribution pφ(x|z)
- So we can rewrite:

  $$ \mathrm{ELBO}_i(\theta, \phi) = \mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] - \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) $$


Probabilistic Model Objective

- Recall the Deep Learning objective derived earlier. We want to minimize:

  $$ L(\theta, \phi) = \sum_{i=1}^{N} \Big( -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] + \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) \Big) $$

- The objective just derived for the Probabilistic Model was to maximize:

  $$ \mathrm{ELBO}(\theta, \phi) = \sum_{i=1}^{N} \Big( \mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] - \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) \Big) $$

- They are equivalent: L(θ, φ) = −ELBO(θ, φ), so minimizing one maximizes the other


Applications - Image generation

A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.


Applications - Caption generation

Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.


Applications - Semi-/Un-supervised document classification

Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, 2017.


Applications - Pixel art videogame characters

https://mlexplained.wordpress.com/category/generative-models/vae/.


Conclusion

- We derived the same objective from:
  1) a deep learning point of view, and
  2) a probabilistic models point of view
- Showed they are equivalent
- Saw some applications
- Thank you. Questions?

