Auto-encoding Variational Bayes
Vinit Ravishankar
Language Technology Group, IFI
September 11, 2019
Motivation
How do we perform approximate inference/learning when latent variables and/or parameters have intractable posteriors?
Goals
1 Efficient parameter estimation for θ
2 Approximate posterior inference of z, given x and parameters θ (e.g. encoding a sentence)
3 Approximate marginal inference over x (e.g. denoising an image)
On intractability
Cannot differentiate the marginal: pθ(x) = ∫ pθ(z) pθ(x|z) dz is intractable (very common if, e.g., the likelihood function is a neural network with a non-linearity)

Cannot perform EM: the posterior pθ(z|x) = pθ(x|z) pθ(z) / pθ(x) is also intractable

Mean-field – too many assumptions re. factorisability; also intractable
Approximation
Introduce an approximation to the true posterior, parameterised by φ: qφ(z|x).

θ and φ are jointly optimised.

Minimise some distance metric between qφ(z|x) and pθ(z|x) – typically the Kullback-Leibler divergence DKL.
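As a concrete illustration of the objective, the KL divergence between two univariate Gaussians has a closed form; a minimal NumPy sketch (the function name and toy parameters are ours, not from the paper):

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    # D_KL( N(mu_q, var_q) || N(mu_p, var_p) ), closed form
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0 for identical distributions
print(kl_gauss(1.0, 0.5, 0.0, 1.0))  # strictly positive otherwise
```

Note the asymmetry: kl_gauss in one direction generally differs from the other, which is why the direction DKL(qφ(z|x) || pθ(z|x)) matters.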
ELBo
For an individual datapoint i:

D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z|x^{(i)})) = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log \frac{q_\phi(z|x^{(i)})}{p_\theta(z|x^{(i)})}\right]
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\log q_\phi(z|x^{(i)}) - \log p_\theta(z|x^{(i)})]
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\log q_\phi(z|x^{(i)}) - \log p_\theta(x^{(i)}|z) - \log p_\theta(z)] + \log p_\theta(x^{(i)})
= -\mathcal{L}(\theta, \phi; x^{(i)}) + \log p_\theta(x^{(i)})

Minimising this directly is annoying, because of the intractable pθ(x(i)).

Remember that KL divergences are always non-negative:

\log p_\theta(x^{(i)}) = \mathcal{L}(\theta, \phi; x^{(i)}) + D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z|x^{(i)})) \geq \mathcal{L}(\theta, \phi; x^{(i)})

so, dropping the datapoint index:

\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z) + \log p_\theta(z) - \log q_\phi(z|x)]
\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x, z) - \log q_\phi(z|x)]
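Since the decomposition above is an identity, it can be checked numerically on a toy model where everything is tractable; a small NumPy sketch with made-up numbers (a binary latent z and one fixed observation x):

```python
import numpy as np

# Toy model: z in {0, 1}, one observed x.
p_z = np.array([0.4, 0.6])           # prior p(z)
lik = np.array([0.7, 0.2])           # likelihood p(x|z) at the observed x
q = np.array([0.5, 0.5])             # an arbitrary approximate posterior

p_x = (p_z * lik).sum()                              # marginal p(x)
elbo = (q * (np.log(p_z * lik) - np.log(q))).sum()   # L(theta, phi; x)
post = p_z * lik / p_x                               # true posterior p(z|x)
kl = (q * (np.log(q) - np.log(post))).sum()

assert np.isclose(np.log(p_x), elbo + kl)            # log p(x) = L + D_KL >= L
```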
ELBo
D_{KL} = \log p_\theta(x^{(i)}) - \mathcal{L}(\theta, \phi; x^{(i)})

Minimising the KL divergence is equivalent to maximising \mathcal{L}(\theta, \phi; x^{(i)}). A better way to look at it:

\log p_\theta(x^{(i)}) = D_{KL} + \mathcal{L}(\theta, \phi; x^{(i)})
\geq \mathcal{L}(\theta, \phi; x^{(i)})
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z) + \log p_\theta(z) - \log q_\phi(z|x^{(i)})]
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] - D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z))
= \text{reconstruction term} - \text{regularisation term}
Reparameterisation
The main technical innovation in this paper is providing a practical gradient estimator for the lower bound.

Naive MC estimators (e.g. the score-function estimator) have high variance and don't work well with too much data (see the sketch below)

Relevant to deep neural networks: random samples from distributions cannot be backpropagated through
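To make the variance point concrete, here is a toy NumPy comparison (our example, not the paper's) of single-sample estimators of ∇µ E[z²] for z ∼ N(µ, σ²), whose true value is 2µ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 100_000
z = mu + sigma * rng.standard_normal(n)

# Score-function (REINFORCE) estimator: f(z) * d/dmu log q(z)
score_est = z**2 * (z - mu) / sigma**2
# Pathwise (reparameterised) estimator: d/dmu f(mu + sigma*eps) = 2z
reparam_est = 2 * z

print(score_est.mean(), score_est.var())      # approx. 2.0, variance approx. 30
print(reparam_est.mean(), reparam_est.var())  # approx. 2.0, variance approx. 4
```

Both estimators are unbiased, but the reparameterised one has markedly lower variance, and the gap typically widens in higher dimensions.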
Reparameterisation
Consider the regularisation loss:

\mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(z) - \log q_\phi(z|x^{(i)})] = \mathbb{E}_{q_\phi(z|x^{(i)})}[f_\phi(z)]

The gradient w.r.t. φ is given by:

\nabla_\phi \mathbb{E}_{q_\phi(z|x^{(i)})}[f_\phi(z)] = \nabla_\phi \int_z q_\phi(z|x^{(i)})\, f_\phi(z)\, dz
= \int_z \nabla_\phi \left[ q_\phi(z|x^{(i)})\, f_\phi(z) \right] dz
= \int_z q_\phi(z|x^{(i)})\, \nabla_\phi f_\phi(z)\, dz + \int_z f_\phi(z)\, \nabla_\phi q_\phi(z|x^{(i)})\, dz
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\nabla_\phi f_\phi(z)] + \int_z f_\phi(z)\, \nabla_\phi q_\phi(z|x^{(i)})\, dz

The second term is solvable when qφ(z|x(i)) is analytical, but not in the general case.
Reparameterisation
Sample noise ε ∼ p(ε), for example from N(0, I)

Instead of sampling z ∼ qφ(z|x), let z̃ = gφ(ε, x)

Gradients can now be Monte Carlo estimated:

\nabla_\phi \mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] = \nabla_\phi \mathbb{E}_{p(\epsilon)}[f(g_\phi(\epsilon, x^{(i)}))] \simeq \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi f(g_\phi(\epsilon^{(l)}, x^{(i)}))
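A minimal PyTorch sketch of this estimator; the integrand f and the parameter values are arbitrary illustrations of ours:

```python
import torch

# q_phi(z|x) = N(mu, sigma^2); estimate grad_phi E_q[f(z)].
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)

def f(z):
    return torch.sin(z) + z ** 2   # any differentiable integrand

L = 1000
eps = torch.randn(L)                     # eps ~ p(eps) = N(0, I)
z = mu + torch.exp(log_sigma) * eps      # z_tilde = g_phi(eps, x)
estimate = f(z).mean()                   # (1/L) sum_l f(g_phi(eps_l, x))
estimate.backward()                      # gradients flow through the samples
print(mu.grad, log_sigma.grad)
```

Because z is now a deterministic, differentiable function of (µ, σ) with the randomness pushed into ε, backpropagating through the samples is well-defined.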
SGVB - L̃A
SGVB = stochastic gradient variational Bayes = an ELBo estimator. Remember the different ways the ELBo can be expressed:

\mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}, z) - \log q_\phi(z|x^{(i)})]

which gives us an estimator:

\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) = \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)}) \right], \quad z^{(i,l)} = g_\phi(\epsilon^{(l)}, x^{(i)})
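As a sketch, the estimator is a one-liner once log-densities are available; log_p_joint and log_q below are hypothetical callables standing in for the generative model and the approximate posterior:

```python
def elbo_estimate_a(log_p_joint, log_q, z_samples, x):
    # L~^A(theta, phi; x) = (1/L) * sum_l [log p(x, z_l) - log q(z_l|x)],
    # with z_l = g_phi(eps_l, x) drawn via the reparameterisation trick.
    vals = [log_p_joint(x, z) - log_q(z, x) for z in z_samples]
    return sum(vals) / len(vals)
```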
SGVB - L̃B
\mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] - D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z))

Often, the regularisation loss D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)) can be integrated analytically, giving a better estimator:

\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)})
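A sketch of this estimator, assuming (as in the VAE case below) a N(0, I) prior and a diagonal-Gaussian qφ so the KL term is analytic; log_lik is a hypothetical callable computing log pθ(x|z):

```python
import numpy as np

def elbo_estimate_b(log_lik, mu, log_var, z_samples):
    # Analytic -D_KL( N(mu, diag(exp(log_var))) || N(0, I) )
    neg_kl = 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    # Monte Carlo reconstruction term: (1/L) * sum_l log p(x|z_l)
    rec = np.mean([log_lik(z) for z in z_samples])
    return neg_kl + rec
```

Only the reconstruction term is now estimated by sampling, which is where the variance reduction comes from.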
AEVB
The name of the paper is also the name of the principal algorithm the authors propose. In the minibatch version, M denotes the batch size and L the number of latent samples drawn per datapoint.
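A minimal sketch of that training loop (Algorithm 1 in the paper), assuming a PyTorch model holding both θ and φ and a hypothetical estimate_elbo implementing one of the estimators above:

```python
import torch

def aevb(model, data, estimate_elbo, M=100, L=1, steps=10_000):
    # data: a tensor of datapoints; model: encoder + decoder parameters.
    opt = torch.optim.Adagrad(model.parameters(), lr=0.01)
    for _ in range(steps):
        idx = torch.randint(len(data), (M,))  # X^M: random minibatch of M points
        x = data[idx]
        # estimate_elbo draws eps ~ p(eps) internally and reparameterises,
        # so the loss is differentiable w.r.t. both theta and phi.
        loss = -estimate_elbo(model, x, L)
        opt.zero_grad()
        loss.backward()                       # g: gradient of the estimator
        opt.step()                            # update theta, phi using g
    return model
```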
VAE
Variational autoencoders are a specific example where the approximation to the posterior is modelled using a feed-forward neural network.

The prior pθ(z) is a centred multivariate Gaussian, N(z; 0, I)

The likelihood pθ(x|z) is a multivariate Gaussian, with parameters computed from MLP(z) (implying an intractable posterior)

Assuming the true posterior is approximately Gaussian with approximately diagonal covariance, we can model log qφ(z|x(i)) = log N(z; µ(i), σ2(i) I), where we obtain µ(i) and σ(i) from an encoder MLP.
VAE
Reparameterisation step: do not directly sample z(i,l) ∼ qφ(z|x(i)). Instead, sample ε(l) ∼ N(0, I) and let z(i,l) = gφ(x(i), ε(l)) = µ(i) + σ(i) ⊙ ε(l), where ⊙ is element-wise multiplication

With pθ(z) and qφ(z|x) both being Gaussian, we can use the 'easier' estimator L̃B(θ, φ; x(i)): our KL divergence has a closed-form solution

\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log((\sigma_j^{(i)})^2) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2 \right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)})
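A compact PyTorch sketch of this model; the layer sizes, tanh activations, and the Bernoulli likelihood (suitable for binarised images such as MNIST) are illustrative choices of ours, not an exact reproduction of the paper's setup:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)          # encoder head: mu(x)
        self.log_var = nn.Linear(h_dim, z_dim)     # encoder head: log sigma^2(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                 # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * eps    # z = mu + sigma (.) eps
        return self.dec(z), mu, log_var

def neg_elbo(x, x_hat, mu, log_var):
    # Reconstruction term: Bernoulli log-likelihood of x under the decoder
    rec = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
    # Closed-form KL between N(mu, sigma^2 I) and the prior N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return rec + kl
```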
On neural networks
A brief glossary:
Feed-forward NNs: function approximators – stacked transformations of the form f(Wx + b), where f is a non-linearity

Encoder: some system that produces representations for some input

Backpropagation: computes the gradients for gradient-descent-based optimisation

Batch: gradient descent is run on minibatches of data rather than single datapoints or the full dataset (see the sketch below)
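For readers new to neural networks, a two-layer feed-forward network fits in a few lines; a minimal NumPy illustration of ours:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)   # f(Wx + b), with tanh as the non-linearity
    return W2 @ h + b2         # stack a second affine transformation on top
```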
Experiments
Generative models were trained on two datasets – MNIST (digits) and Frey Faces (faces). Parameters were sampled from N(0, 0.01); gradient descent was optimised with Adagrad, with minibatches of size 100 and 1 latent sample per datapoint.
Generation

[Figure: samples generated by the trained models]