Auto-encoding Variational Bayes
Vinit Ravishankar
Language Technology Group, IFI
September 11, 2019
Motivation
How do we perform approximate inference/learning when latent variables and/or parameters have intractable posteriors?
Goals
1 Efficient parameter estimation for θ
2 Approximate posterior inference of z, given x and parameters θ (e.g. encoding a sentence)
3 Approximate marginal inference over x (e.g. denoising an image)
On intractability
Cannot differentiate the marginal: pθ(x) = ∫ pθ(z) pθ(x|z) dz is intractable (very common if, e.g., the likelihood function is a neural network with a non-linearity)

Cannot perform EM: the posterior pθ(z|x) = pθ(x|z) pθ(z) / pθ(x) is also intractable

Mean-field – too many assumptions re. factorisability; also intractable
Approximation
Introduce an approximation to the true posterior, parameterised by φ: qφ(z|x).

θ and φ are jointly optimised.

Minimise some distance metric between qφ(z|x) and pθ(z|x) – typically the Kullback-Leibler divergence DKL.
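As a concrete illustration of the objective, the KL divergence between two univariate Gaussians has a closed form; a minimal NumPy sketch (the function name and toy parameters are ours, not from the paper):

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    # D_KL( N(mu_q, var_q) || N(mu_p, var_p) ), closed form
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0 for identical distributions
print(kl_gauss(1.0, 0.5, 0.0, 1.0))  # strictly positive otherwise
```

Note the asymmetry: kl_gauss in one direction generally differs from the other, which is why the direction DKL(qφ(z|x) || pθ(z|x)) matters.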
ELBo
For an individual datapoint i:

D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z|x^{(i)})) = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log \frac{q_\phi(z|x^{(i)})}{p_\theta(z|x^{(i)})}\right]
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\log q_\phi(z|x^{(i)}) - \log p_\theta(z|x^{(i)})]
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\log q_\phi(z|x^{(i)}) - \log p_\theta(x^{(i)}|z) - \log p_\theta(z)] + \log p_\theta(x^{(i)})
= -\mathcal{L}(\theta, \phi; x^{(i)}) + \log p_\theta(x^{(i)})

Minimising this directly is annoying, because of the intractable pθ(x(i)).

Remember that KL divergences are always non-negative:

\log p_\theta(x^{(i)}) = \mathcal{L}(\theta, \phi; x^{(i)}) + D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z|x^{(i)})) \geq \mathcal{L}(\theta, \phi; x^{(i)})

so, dropping the datapoint index:

\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z) + \log p_\theta(z) - \log q_\phi(z|x)]
\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x, z) - \log q_\phi(z|x)]
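Since the decomposition above is an identity, it can be checked numerically on a toy model where everything is tractable; a small NumPy sketch with made-up numbers (a binary latent z and one fixed observation x):

```python
import numpy as np

# Toy model: z in {0, 1}, one observed x.
p_z = np.array([0.4, 0.6])           # prior p(z)
lik = np.array([0.7, 0.2])           # likelihood p(x|z) at the observed x
q = np.array([0.5, 0.5])             # an arbitrary approximate posterior

p_x = (p_z * lik).sum()                              # marginal p(x)
elbo = (q * (np.log(p_z * lik) - np.log(q))).sum()   # L(theta, phi; x)
post = p_z * lik / p_x                               # true posterior p(z|x)
kl = (q * (np.log(q) - np.log(post))).sum()

assert np.isclose(np.log(p_x), elbo + kl)            # log p(x) = L + D_KL >= L
```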
ELBo
D_{KL} = \log p_\theta(x^{(i)}) - \mathcal{L}(\theta, \phi; x^{(i)})

Minimising the KL divergence is equivalent to maximising \mathcal{L}(\theta, \phi; x^{(i)}). A better way to look at it:

\log p_\theta(x^{(i)}) = D_{KL} + \mathcal{L}(\theta, \phi; x^{(i)})
\geq \mathcal{L}(\theta, \phi; x^{(i)})
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z) + \log p_\theta(z) - \log q_\phi(z|x^{(i)})]
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] - D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z))
= \text{reconstruction term} - \text{regularisation term}
Reparameterisation
The main technical innovation in this paper is providing a practical gradient estimator for the lower bound.

Naive MC estimators (e.g. the score-function estimator) have high variance and don't work well with too much data (see the sketch below)

Relevant to deep neural networks: random samples from distributions cannot be backpropagated through
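To make the variance point concrete, here is a toy NumPy comparison (our example, not the paper's) of single-sample estimators of ∇µ E[z²] for z ∼ N(µ, σ²), whose true value is 2µ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 100_000
z = mu + sigma * rng.standard_normal(n)

# Score-function (REINFORCE) estimator: f(z) * d/dmu log q(z)
score_est = z**2 * (z - mu) / sigma**2
# Pathwise (reparameterised) estimator: d/dmu f(mu + sigma*eps) = 2z
reparam_est = 2 * z

print(score_est.mean(), score_est.var())      # approx. 2.0, variance approx. 30
print(reparam_est.mean(), reparam_est.var())  # approx. 2.0, variance approx. 4
```

Both estimators are unbiased, but the reparameterised one has markedly lower variance, and the gap typically widens in higher dimensions.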
Reparameterisation
Consider the regularisation loss:

\mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(z) - \log q_\phi(z|x^{(i)})] = \mathbb{E}_{q_\phi(z|x^{(i)})}[f_\phi(z)]

The gradient w.r.t. φ is given by:

\nabla_\phi \mathbb{E}_{q_\phi(z|x^{(i)})}[f_\phi(z)] = \nabla_\phi \int_z q_\phi(z|x^{(i)})\, f_\phi(z)\, dz
= \int_z \nabla_\phi \left[ q_\phi(z|x^{(i)})\, f_\phi(z) \right] dz
= \int_z q_\phi(z|x^{(i)})\, \nabla_\phi f_\phi(z)\, dz + \int_z f_\phi(z)\, \nabla_\phi q_\phi(z|x^{(i)})\, dz
= \mathbb{E}_{q_\phi(z|x^{(i)})}[\nabla_\phi f_\phi(z)] + \int_z f_\phi(z)\, \nabla_\phi q_\phi(z|x^{(i)})\, dz

The second term is solvable when qφ(z|x(i)) is analytical, but not in the general case.
Reparameterisation
Sample noise ε ∼ p(ε), for example from N(0, I)

Instead of sampling z ∼ qφ(z|x), let z̃ = gφ(ε, x)

Gradients can now be Monte Carlo estimated:

\nabla_\phi \mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] = \nabla_\phi \mathbb{E}_{p(\epsilon)}[f(g_\phi(\epsilon, x^{(i)}))] \simeq \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi f(g_\phi(\epsilon^{(l)}, x^{(i)}))
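A minimal PyTorch sketch of this estimator; the integrand f and the parameter values are arbitrary illustrations of ours:

```python
import torch

# q_phi(z|x) = N(mu, sigma^2); estimate grad_phi E_q[f(z)].
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)

def f(z):
    return torch.sin(z) + z ** 2   # any differentiable integrand

L = 1000
eps = torch.randn(L)                     # eps ~ p(eps) = N(0, I)
z = mu + torch.exp(log_sigma) * eps      # z_tilde = g_phi(eps, x)
estimate = f(z).mean()                   # (1/L) sum_l f(g_phi(eps_l, x))
estimate.backward()                      # gradients flow through the samples
print(mu.grad, log_sigma.grad)
```

Because z is now a deterministic, differentiable function of (µ, σ) with the randomness pushed into ε, backpropagating through the samples is well-defined.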
SGVB - L̃A
SGVB = stochastic gradient variational Bayes = an ELBo estimator. Remember the different ways the ELBo can be expressed:

\mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}, z) - \log q_\phi(z|x^{(i)})]

which gives us an estimator:

\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) = \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)}) \right], \quad z^{(i,l)} = g_\phi(\epsilon^{(l)}, x^{(i)})
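As a sketch, the estimator is a one-liner once log-densities are available; log_p_joint and log_q below are hypothetical callables standing in for the generative model and the approximate posterior:

```python
def elbo_estimate_a(log_p_joint, log_q, z_samples, x):
    # L~^A(theta, phi; x) = (1/L) * sum_l [log p(x, z_l) - log q(z_l|x)],
    # with z_l = g_phi(eps_l, x) drawn via the reparameterisation trick.
    vals = [log_p_joint(x, z) - log_q(z, x) for z in z_samples]
    return sum(vals) / len(vals)
```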
SGVB - L̃B
\mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] - D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z))

Often, the regularisation loss D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)) can be integrated analytically, giving a better estimator:

\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)})
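A sketch of this estimator, assuming (as in the VAE case below) a N(0, I) prior and a diagonal-Gaussian qφ so the KL term is analytic; log_lik is a hypothetical callable computing log pθ(x|z):

```python
import numpy as np

def elbo_estimate_b(log_lik, mu, log_var, z_samples):
    # Analytic -D_KL( N(mu, diag(exp(log_var))) || N(0, I) )
    neg_kl = 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    # Monte Carlo reconstruction term: (1/L) * sum_l log p(x|z_l)
    rec = np.mean([log_lik(z) for z in z_samples])
    return neg_kl + rec
```

Only the reconstruction term is now estimated by sampling, which is where the variance reduction comes from.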
AEVB
The name of the paper is also the name of the principal algorithm the authors propose. In the minibatch version, M denotes the batch size and L the number of latent samples drawn per datapoint.
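A minimal sketch of that training loop (Algorithm 1 in the paper), assuming a PyTorch model holding both θ and φ and a hypothetical estimate_elbo implementing one of the estimators above:

```python
import torch

def aevb(model, data, estimate_elbo, M=100, L=1, steps=10_000):
    # data: a tensor of datapoints; model: encoder + decoder parameters.
    opt = torch.optim.Adagrad(model.parameters(), lr=0.01)
    for _ in range(steps):
        idx = torch.randint(len(data), (M,))  # X^M: random minibatch of M points
        x = data[idx]
        # estimate_elbo draws eps ~ p(eps) internally and reparameterises,
        # so the loss is differentiable w.r.t. both theta and phi.
        loss = -estimate_elbo(model, x, L)
        opt.zero_grad()
        loss.backward()                       # g: gradient of the estimator
        opt.step()                            # update theta, phi using g
    return model
```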
VAE
Variational autoencoders are a specific example where the approximation to the posterior is modelled using a feed-forward neural network.

The prior pθ(z) is a centred multivariate Gaussian, N(z; 0, I)

The likelihood pθ(x|z) is a multivariate Gaussian, with parameters computed from MLP(z) (implying an intractable posterior)

Assuming the true posterior is approximately Gaussian with approximately diagonal covariance, we can model log qφ(z|x(i)) = log N(z; µ(i), σ2(i) I), where we obtain µ(i) and σ(i) from an encoder MLP.
VAE
Reparameterisation step: do not directly sample z(i,l) ∼ qφ(z|x(i)). Instead, sample ε(l) ∼ N(0, I) and let z(i,l) = gφ(x(i), ε(l)) = µ(i) + σ(i) ⊙ ε(l), where ⊙ is element-wise multiplication

With pθ(z) and qφ(z|x) both being Gaussian, we can use the 'easier' estimator L̃B(θ, φ; x(i)): our KL divergence has a closed-form solution

\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log((\sigma_j^{(i)})^2) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2 \right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)})
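A compact PyTorch sketch of this model; the layer sizes, tanh activations, and the Bernoulli likelihood (suitable for binarised images such as MNIST) are illustrative choices of ours, not an exact reproduction of the paper's setup:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)          # encoder head: mu(x)
        self.log_var = nn.Linear(h_dim, z_dim)     # encoder head: log sigma^2(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                 # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * eps    # z = mu + sigma (.) eps
        return self.dec(z), mu, log_var

def neg_elbo(x, x_hat, mu, log_var):
    # Reconstruction term: Bernoulli log-likelihood of x under the decoder
    rec = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
    # Closed-form KL between N(mu, sigma^2 I) and the prior N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return rec + kl
```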
On neural networks
A brief glossary:
Feed-forward NNs: function approximators – stacked transformations of the form f(Wx + b), where f is a non-linearity

Encoder: some system that produces representations for some input

Backpropagation: computes the gradients for gradient-descent-based optimisation

Batch: gradient descent is run on minibatches of data rather than single datapoints or the full dataset (see the sketch below)
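For readers new to neural networks, a two-layer feed-forward network fits in a few lines; a minimal NumPy illustration of ours:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)   # f(Wx + b), with tanh as the non-linearity
    return W2 @ h + b2         # stack a second affine transformation on top
```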
Experiments
Generative models were trained on two datasets – MNIST (digits) and Frey Faces (faces). Parameters were sampled from N(0, 0.01); gradient descent was optimised with Adagrad, with minibatches of size 100 and 1 latent sample per datapoint.
Generation

[Figure: samples generated by the trained models]