


The Information Autoencoding Family: A Lagrangian Perspective on Latent Variable Generative Models

Shengjia Zhao, Computer Science Department, Stanford University ([email protected])

Jiaming Song, Computer Science Department, Stanford University ([email protected])

Stefano Ermon, Computer Science Department, Stanford University ([email protected])

Abstract

A large number of objectives have been proposed to train latent variable generative models. We show that many of them are Lagrangian dual functions of the same primal optimization problem. The primal problem optimizes the mutual information between latent and visible variables, subject to the constraints of accurately modeling the data distribution and performing correct amortized inference. By choosing to maximize or minimize mutual information, and choosing different Lagrange multipliers, we obtain different objectives including InfoGAN, ALI/BiGAN, ALICE, CycleGAN, beta-VAE, adversarial autoencoders, AVB, AS-VAE and InfoVAE. Based on this observation, we provide an exhaustive characterization of the statistical and computational trade-offs made by all the training objectives in this class of Lagrangian duals. Next, we propose a dual optimization method where we optimize model parameters as well as the Lagrange multipliers. This method achieves Pareto optimal solutions in terms of optimizing information and satisfying the constraints.

1 INTRODUCTION

Latent variable generative models are designed to accomplish a wide variety of tasks in computer vision (Radford et al., 2015; Kuleshov & Ermon, 2017), natural language processing (Yang et al., 2017), reinforcement learning (Li et al., 2017b), compressed sensing (Dhar et al., 2018), etc. Prominent examples include Variational Autoencoders (VAE, Kingma & Welling (2013); Rezende et al. (2014)), with extensions such as β-VAE (Higgins et al., 2016), Adversarial Autoencoders (Makhzani et al., 2015), and InfoVAE (Zhao et al., 2017); Generative Adversarial Networks (Goodfellow et al., 2014), with extensions such as ALI/BiGAN (Dumoulin et al., 2016a; Donahue et al., 2016), InfoGAN (Chen et al., 2016a) and ALICE (Li et al., 2017a); and hybrid objectives such as CycleGAN (Zhu et al., 2017), DiscoGAN (Kim et al., 2017), AVB (Mescheder et al., 2017) and AS-VAE (Pu et al., 2017).

All these models attempt to fit an empirical data distribution, but differ in multiple ways: how they measure the similarity between distributions; whether or not they allow for efficient (amortized) inference; whether the latent variables should retain or discard information about the data; and how the model is optimized, which can be likelihood-based or likelihood-free (Mohamed & Lakshminarayanan, 2016; Grover et al., 2018).

In this paper, we generalize existing training objectives for latent variable generative models. We show that all the above training objectives can be viewed as Lagrangian dual functions of a constrained optimization problem (the primal problem). The primal problem optimizes over the parameters of a generative model and an (amortized) inference distribution. The optimization objective is to maximize or minimize the mutual information between latent and observed variables; the constraints (which we term "consistency constraints") are to accurately model the data distribution and to perform correct amortized inference. By considering the Lagrangian dual function and different settings of the Lagrange multipliers, we can obtain all the aforementioned generative modeling training objectives. Surprisingly, under mild assumptions, the aforementioned objectives can be linearly combined to produce every possible primal objective/multipliers in this model family.

In Lagrangian dual optimization, the dual function is maximized with respect to the Lagrange multipliers, and minimized with respect to the primal parameters. Under strong duality, the optimal parameters found by this procedure also solve the original primal problem. However, the aforementioned objectives use fixed (rather than maximized) multipliers. As a consequence, strong duality does not generally hold.

To overcome this problem, we propose a new learning approach where the Lagrange multipliers are also optimized. We show that strong duality holds in distribution space, so this optimization procedure is guaranteed to optimize the primal objective while satisfying the consistency constraints. As an application of this approach, we propose Lagrangian VAE, a Lagrangian optimization algorithm for the InfoVAE (Zhao et al., 2017) objective. Lagrangian VAE can explicitly trade off optimization of the primal objective against consistency constraint satisfaction. In addition, both theoretical properties (of Lagrangian optimization) and empirical experiments show that solutions obtained by Lagrangian VAE Pareto dominate solutions obtained with InfoVAE: Lagrangian VAE obtains either better mutual information or better constraint satisfaction, regardless of the hyper-parameters used by either method.

2 BACKGROUND

We consider two groups of variables: observed variables x ∈ X and latent variables z ∈ Z. Our algorithm receives input distributions q(x), p(z) over x and z respectively. Each distribution is either specified explicitly through a tractable analytical expression such as N(0, I), or implicitly through a set of samples. For example, in latent variable generative modeling of images (Kingma & Welling, 2013; Goodfellow et al., 2014), X is the space of images and Z is the space of latent features; q(x) is a dataset of sample images, and p(z) is a simple "prior" distribution, e.g., a Gaussian. In unsupervised image translation (Zhu et al., 2017), X and Z are both image spaces, and q(x), p(z) are sample images from two different domains (e.g., pictures of horses and zebras).

The underlying joint distribution on (x, z) is not known, and we are not given any sample from it. Our goal is to nonetheless learn some model of the joint distribution r_model(x, z) with the following desiderata:

Desideratum 1 (Matching Marginals). The marginals of r_model(x, z) over x and z match the provided distributions q(x) and p(z), respectively.

Desideratum 2 (Meaningful Relationship). r_model(x, z) captures a meaningful relationship between x and z. For example, in latent variable modeling of images, the latent variables z should correspond to semantically meaningful features describing the image x. In unsupervised image translation, r_model(x, z) should capture the "correct" pairing between x and z.

We address desideratum 1 in this section, and desideratum 2 in Section 3. The joint distribution r_model(x, z) can be represented in factorized form by the chain rule. To do so, we define conditional distribution families {pθp(x|z), θp ∈ Θp} and {qθq(z|x), θq ∈ Θq}. We require that for any z we can both efficiently sample from pθp(x|z) and compute log pθp(x|z), and similarly for qθq(z|x). For compactness we use θ = (θp, θq) to denote the parameters of both distributions pθ and qθ. We define the joint distribution r_model(x, z) in two ways:

r_model(x, z) := pθ(x, z) := p(z) pθ(x|z)    (1)

and symmetrically

r_model(x, z) := qθ(x, z) := q(x) qθ(z|x)    (2)

Defining the model in two (redundant) ways seems unusual, but has significant computational advantages: given x we can tractably sample z, and vice versa. For example, in latent variable models, given observed data x we can sample latent features z ∼ qθ(z|x) (amortized inference), and given a latent feature z we can generate novel samples x ∼ pθ(x|z) (ancestral sampling).
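To make the two factorizations concrete, the following is a minimal PyTorch sketch (an illustration, not the paper's architecture): a diagonal-Gaussian qθ(z|x) and a factorized-Bernoulli pθ(x|z), with all dimensions and layer sizes hypothetical. It shows both sampling directions described above.

    import torch
    import torch.nn as nn

    X_DIM, Z_DIM, H = 784, 10, 400   # hypothetical sizes (e.g., flattened MNIST)

    class Encoder(nn.Module):        # q_theta(z|x), diagonal Gaussian
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(X_DIM, H), nn.ReLU())
            self.mu = nn.Linear(H, Z_DIM)
            self.logvar = nn.Linear(H, Z_DIM)

        def forward(self, x):
            h = self.body(x)
            return self.mu(h), self.logvar(h)

    class Decoder(nn.Module):        # p_theta(x|z), factorized Bernoulli
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(Z_DIM, H), nn.ReLU(),
                                      nn.Linear(H, X_DIM))

        def forward(self, z):
            return self.body(z)      # logits of p_theta(x|z)

    enc, dec = Encoder(), Decoder()

    # Amortized inference: given x, sample z ~ q_theta(z|x)
    # (reparameterization trick).
    x = torch.rand(32, X_DIM)
    mu, logvar = enc(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    # Ancestral sampling: sample z ~ p(z), then x ~ p_theta(x|z).
    z0 = torch.randn(32, Z_DIM)
    x0 = torch.bernoulli(torch.sigmoid(dec(z0)))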

If the two definitions (1) and (2) are consistent, which we define as pθ(x, z) = qθ(x, z), we automatically satisfy desideratum 1:

r_model(x) = ∫z r_model(x, z) dz = ∫z qθ(x, z) dz = q(x)

r_model(z) = ∫x r_model(x, z) dx = ∫x pθ(x, z) dx = p(z)

Based on this observation, we can design objectives that encourage consistency. Many latent variable generative models fit into this framework. For example, variational autoencoders (VAE, Kingma & Welling (2013)) enforce consistency by minimizing the KL divergence:

min_θ DKL(qθ(x, z) ‖ pθ(x, z))

This minimization is equivalent to maximizing the evidence lower bound L_ELBO (Kingma & Welling, 2013):

max_θ −DKL(qθ(x, z) ‖ pθ(x, z))    (3)
  = −E_{qθ(x,z)}[log(qθ(z|x) q(x)) − log(pθ(x|z) p(z))]
  = E_{qθ(x,z)}[log pθ(x|z)] + Hq(x) − E_{q(x)}[DKL(qθ(z|x) ‖ p(z))]
  ≡ E_{qθ(x,z)}[log pθ(x|z)] − E_{q(x)}[DKL(qθ(z|x) ‖ p(z))] =: L_ELBO    (4)

where Hq(x) is the entropy of q(x), a constant that can be ignored for the purposes of optimization over the model parameters θ (denoted ≡).
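For reference, L_ELBO in Eq. (4) can be estimated on a mini-batch. The sketch below assumes the Gaussian encoder and Bernoulli decoder of the previous sketch, with the KL term in its standard closed form for a diagonal-Gaussian posterior and an N(0, I) prior.

    import torch
    import torch.nn.functional as F

    def elbo(x, enc, dec):
        # L_ELBO = E_q[log p_theta(x|z)] - E_q(x)[ D_KL(q_theta(z|x) || p(z)) ]
        mu, logvar = enc(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        log_px_given_z = -F.binary_cross_entropy_with_logits(
            dec(z), x, reduction='none').sum(dim=1)
        # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I).
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1)
        return (log_px_given_z - kl).mean()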

As another example, BiGAN/ALI (Donahue et al., 2016; Dumoulin et al., 2016b) use an adversarial discriminator to approximately minimize the Jensen-Shannon divergence

min_θ DJS(qθ(x, z) ‖ pθ(x, z))

Many other ways of enforcing consistency are possible. Most generally, we can enforce consistency with a vector of divergences D = [D1, . . . , Dm], where each Di takes two probability measures as input and outputs a non-negative value which is zero if and only if the two input measures are the same. Examples of possible divergences include Maximum Mean Discrepancy (MMD, Gretton et al. (2007)), denoted DMMD; Wasserstein distance (Arjovsky et al., 2017), denoted DW; f-divergences (Nowozin et al., 2016), denoted Df; and Jensen-Shannon divergence (Goodfellow et al., 2014), denoted DJS.

Each Di can be any divergence applied to a pair of probability measures. The pair of probability measures can be defined over both variables (x, z), a single variable x or z, or a conditional x|z or z|x. If the probability measure is defined over a conditional x|z or z|x, we also take the expectation over the conditioning variable with respect to pθ or qθ. Some examples of Di are:

E_{qθ(z)}[DKL(qθ(x|z) ‖ pθ(x|z))]
DMMD(qθ(z) ‖ p(z))
DW(pθ(x, z) ‖ qθ(x, z))
E_{q(x)}[Df(qθ(z|x) ‖ pθ(z|x))]
DJS(q(x) ‖ pθ(x))

We only require that

Di = 0, ∀i ∈ {1, . . . , m}  ⇐⇒  pθ(x, z) = qθ(x, z)

so D = 0 implies consistency. Note that each Di implicitly depends on the parameters θ through pθ and qθ, but we suppress this dependence in the notation for simplicity.
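For instance, DMMD(qθ(z) ‖ p(z)) only requires samples from the two distributions. A common sample-based estimator, sketched here with an RBF kernel (the kernel choice and bandwidth are assumptions, not prescribed by this framework), is:

    import torch

    def mmd2_rbf(z_q, z_p, bandwidth=1.0):
        # Biased (V-statistic) estimate of MMD^2 between samples
        # z_q ~ q_theta(z) and z_p ~ p(z); the RBF bandwidth is a
        # tunable assumption.
        def k(a, b):
            d2 = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)
            return torch.exp(-d2 / (2 * bandwidth ** 2))
        return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2 * k(z_q, z_p).mean()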

Enforcing consistency pθ(x, z) = qθ(x, z) via D = 0 satisfies desideratum 1 (matching marginals), but does not directly address desideratum 2 (meaningful relationship). A large number of joint distributions can have the same marginal distributions p(z) and q(x) (including ones where z and x are independent), and only a small fraction of them encode meaningful models.

3 GENERATIVE MODELING AS CONSTRAINT OPTIMIZATION

To address desideratum 2, we modify the training objective and specify additional preferences among consistent pθ(x, z) and qθ(x, z). Formally, we solve the following primal optimization problem:

min_θ f(θ)  subject to  D = 0    (5)

where f(θ) encodes our preferences over consistent distributions and depends on θ through pθ(x|z) and qθ(z|x).

An important preference is the mutual information between x and z. Depending on the downstream application, we may maximize mutual information (Chen et al., 2016b; Zhao et al., 2017; Li et al., 2017a; Chen et al., 2016a) so that the features (latent variables) z capture as much information as possible about x, or minimize mutual information (Zhao et al., 2017; Higgins et al., 2016; Tishby & Zaslavsky, 2015; Shamir et al., 2010) to achieve compression. To implement a mutual information preference we consider the following objective:

f_I(θ; α1, α2) = α1 Iqθ(x; z) + α2 Ipθ(x; z)    (6)

where Ipθ(x; z) = E_{pθ(x,z)}[log pθ(x, z) − log pθ(x)p(z)] is the mutual information under pθ(x, z), and Iqθ(x; z) is the mutual information under qθ(x, z).

The optimization problem in Eq. (5) with the mutual information objective f(θ) in Eq. (6) has the following Lagrangian dual function:

α1 Iqθ(x; z) + α2 Ipθ(x; z) + λ⊤D    (7)

where λ = [λ1, . . . , λm] is a vector of Lagrange multipliers, one for each of the m consistency constraints in D = [D1, . . . , Dm].

In the next section, we will show that many existing training objectives for generative models minimize the Lagrangian dual in Equation 7 for some fixed α1, α2, D and λ. However, dual optimization requires maximization over the dual parameters λ, which should not be kept fixed. We discuss dual optimization in Section 5.

4 GENERALIZING OBJECTIVES WITH FIXED MULTIPLIERS

Several existing objectives for latent variable generative models can be rewritten in the dual form of Equation 7 with fixed Lagrange multipliers. We provide several examples here and more in Appendix A.

VAE (Kingma & Welling, 2013). Per our discussion in Section 2, the VAE training objective, commonly written as ELBO maximization in Eq. (4), is actually equivalent to Equation 3. This is a dual form where we set D = [DKL(qθ(x, z) ‖ pθ(x, z))], α1 = α2 = 0 and λ = 1. Because α1 = α2 = 0, this objective has no information preference, confirming previous observations that the learned distribution can have high, low or zero mutual information between x and z (Chen et al., 2016b; Zhao et al., 2017).

β-VAE (Higgins et al., 2016). The following objective L_β-VAE is proposed to learn disentangled features z:

−E_{qθ(x,z)}[log pθ(x|z)] + β E_{q(x)}[DKL(qθ(z|x) ‖ p(z))]

This is equivalent to the following dual form:

L_β-VAE ≡ E_{qθ(x,z)}[ −log (qθ(x|z)/q(x)) + log (qθ(x|z)/pθ(x|z)) + β log (qθ(z|x)/qθ(z)) + β log (qθ(z)/p(z)) ]
  ≡ (β − 1) Iqθ(x; z)    (primal)
    + β DKL(qθ(z) ‖ p(z))    (consistency)
    + E_{qθ(z)}[DKL(qθ(x|z) ‖ pθ(x|z))]    (consistency)

where we use ≡ to denote "equal up to a value that does not depend on θ". In this case,

α1 = β − 1,  α2 = 0,  λ = [β, 1]
D = [DKL(qθ(z) ‖ p(z)), E_{qθ(z)}[DKL(qθ(x|z) ‖ pθ(x|z))]]

When α1 > 0, or equivalently β > 1, there is an incentive to minimize the mutual information between x and z.

f(p, q)          Likelihood Based                Unary Likelihood Free              Binary Likelihood Free
0                VAE (Kingma & Welling, 2013)    VAE-GAN (Makhzani et al., 2015)    ALI (Dumoulin et al., 2016b)
α1 Iq            β-VAE (Higgins et al., 2016)    InfoVAE (Zhao et al., 2017)        ALICE (Li et al., 2017a)
α2 Ip            VMI (Barber & Agakov, 2003)     InfoGAN (Chen et al., 2016a)       -
α1 Iq + α2 Ip    -                               CycleGAN (Zhu et al., 2017)        AS-VAE (Pu et al., 2017)

Table 1: For each choice of α and computability class (Definition 2) we list the corresponding existing model. Several other objectives are also Lagrangian duals, but they are not listed because they are similar to models in the table. These objectives include DiscoGAN (Kim et al., 2017), BiGAN (Donahue et al., 2016), AAE (Makhzani et al., 2015) and WAE (Tolstikhin et al., 2017).
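As a sanity check, the β-VAE decomposition above can be verified exactly on small discrete distributions, where every term is computable in closed form. The following sketch (not from the paper) draws random toy distributions and confirms that the two sides differ exactly by the constant Hq(x):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, beta = 4, 3, 4.0                       # |X|, |Z|, and some beta > 1
    q_x = rng.dirichlet(np.ones(m))              # data marginal q(x)
    p_z = rng.dirichlet(np.ones(n))              # prior p(z)
    q_z_given_x = rng.dirichlet(np.ones(n), m)   # q_theta(z|x), shape (m, n)
    p_x_given_z = rng.dirichlet(np.ones(m), n)   # p_theta(x|z), shape (n, m)

    q_xz = q_x[:, None] * q_z_given_x            # q_theta(x, z)
    q_z = q_xz.sum(axis=0)                       # q_theta(z)
    q_x_given_z = q_xz / q_z                     # q_theta(x|z)

    # Left-hand side: the beta-VAE objective.
    lhs = (-(q_xz * np.log(p_x_given_z.T)).sum()
           + beta * (q_xz * np.log(q_z_given_x / p_z)).sum())

    # Right-hand side: (beta - 1) I_q + beta KL(q(z)||p(z))
    #                  + E_q(z)[KL(q(x|z)||p(x|z))] + H_q(x).
    I_q = (q_xz * np.log(q_xz / (q_x[:, None] * q_z))).sum()
    kl_z = (q_z * np.log(q_z / p_z)).sum()
    kl_xz = (q_xz * np.log(q_x_given_z / p_x_given_z.T)).sum()
    H_qx = -(q_x * np.log(q_x)).sum()
    rhs = (beta - 1) * I_q + beta * kl_z + kl_xz + H_qx

    assert np.isclose(lhs, rhs)   # equal once the constant H_q(x) is included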

InfoGAN (Chen et al., 2016a). As another example, the InfoGAN objective[1]

L_InfoGAN = DJS(q(x) ‖ pθ(x)) − E_{pθ(x,z)}[log qθ(z|x)]

is equivalent to the following dual form:

L_InfoGAN ≡ E_{pθ(x,z)}[−log pθ(z|x) + log p(z) + log pθ(z|x) − log qθ(z|x)] + DJS(q(x) ‖ pθ(x))
  ≡ −Ipθ(x; z)    (primal)
    + E_{pθ(x)}[DKL(pθ(z|x) ‖ qθ(z|x))]    (consistency)
    + DJS(q(x) ‖ pθ(x))    (consistency)

In this case α1 = 0 and α2 = −1 < 0: the model maximizes the mutual information between x and z.

In fact, all objectives in Table 1 belong to this class[2]. Derivations for additional models can be found in Appendix A.

4.1 ENUMERATION OF ALL OBJECTIVES

The Lagrangian dual form of an objective reveals its mutual information preference (α1, α2), the type of consistency constraints (D), and the weighting of the constraints (λ). This suggests that the Lagrangian dual perspective may unify many existing training objectives. We wish to identify and categorize all objectives that have the Lagrangian dual form of Eq. (7). However, this poses two technical difficulties, which we proceed to resolve.

[1] For conciseness we use z to denote structured latent variables, which are represented as c in (Chen et al., 2016a).

[2] Variational Mutual Information Maximization (VMI) is not truly a Lagrangian dual because it does not enforce consistency constraints (λ = 0).

1. Equivalence: Many objectives appear different but are actually identical for the purposes of optimization (as we have shown). To handle this we characterize "equivalent objectives" with a set of pre-specified transformations.

Definition 1. Equivalence (Informal): An objective L is equivalent to L′ when there exists a constant C such that for all parameters θ, L(θ) = L′(θ) + C. We denote this as L ≡ L′. L and L′ are elementary equivalent if L′ can be obtained from L by applying the chain rule or Bayes rule to probabilities in L, and by addition/subtraction of the constants E_{q(x)}[log q(x)] and E_{p(z)}[log p(z)].

A more formal but verbose definition is in Appendix B,Definition 1.

Elementary equivalences define simple yet flexible transformations for deriving equivalent objectives. For example, all the transformations in Section 4 (VAE, β-VAE and InfoGAN) and Appendix A are elementary. This implies that all objectives in Table 1 are elementary equivalent to a Lagrangian dual function of the form Eq. (7). However, these transformations are not exhaustive. For example, transforming E_{pθ}[g(x)] into E_{qθ}[g(x) pθ(x)/qθ(x)] via importance sampling is not accounted for, hence the two objectives are not considered to be elementary equivalent.

2. Optimization Difficulty: Some objectives are easier to evaluate/optimize than others. For example, variational autoencoder training is robust and stable, adversarial training is less stable and requires careful hyper-parameter selection (Kodali et al., 2018), and direct optimization of the log-likelihood log pθ(x) is very difficult for latent variable models and almost never used (Grover et al., 2018).

To assign a "hardness score" to each objective, we first group the "terms" (an objective is a sum of terms) from easy to hard to optimize. An objective belongs to a "hardness class" if it cannot be transformed into an objective with easier terms. This is formalized below:

Definition 2. Effective Optimization: We define

1. Likelihood-based terms as the set

T1 = { E_{pθ(x,z)}[log pθ(x|z)], E_{pθ(x,z)}[log pθ(x, z)], E_{pθ(z)}[log p(z)], E_{pθ(x,z)}[log qθ(z|x)],
       E_{qθ(x,z)}[log pθ(x|z)], E_{qθ(x,z)}[log pθ(x, z)], E_{qθ(z)}[log p(z)], E_{qθ(x,z)}[log qθ(z|x)] }

2. Unary likelihood-free terms as the set

T2 = { D(q(x) ‖ pθ(x)), D(qθ(z) ‖ p(z)) }

3. Binary likelihood-free terms as the set

T3 = { D(qθ(x, z) ‖ pθ(x, z)) }

where each D can be an f-divergence, Jensen-Shannon divergence, Wasserstein distance, or Maximum Mean Discrepancy. An objective L is likelihood-based computable if L is elementary equivalent to some L′ that is a linear combination of elements in T1; unary likelihood-free computable if L′ is a linear combination of elements in T1 ∪ T2; and binary likelihood-free computable if L′ is a linear combination of elements in T1 ∪ T2 ∪ T3.

The rationale for this categorization is that elements in T1 can be estimated by Monte Carlo estimators and optimized effectively by stochastic gradient descent in practice (with low bias and variance) (Kingma & Welling, 2013; Rezende et al., 2014). In contrast, elements in T2 are optimized by likelihood-free approaches such as adversarial training (Goodfellow et al., 2014) or kernelized methods such as MMD (Gretton et al., 2007) or Stein variational gradient (Liu & Wang, 2016). These optimization procedures are known to suffer from stability problems (Arjovsky et al., 2017) or cannot handle complex distributions in high dimensions (Ramdas et al., 2015). Finally, elements in T3 are over both x and z, and they have empirically been shown to be even more difficult to optimize (Li et al., 2017a). We do not include terms such as E_{q(x)}[log pθ(x)] because they are seldom feasible to compute or optimize for latent variable generative models.

Now we are able to fully characterize all Lagrangian dual objectives of the form Eq. (7) that are likelihood-based / unary likelihood-free / binary likelihood-free computable in Table 1.

In addition, Table 1 contains essentially all possible models for each optimization difficulty class in Definition 2. This is shown in the following theorem (informal; formal version and proof in Appendix B, Theorems 3, 4, 5).

Theorem 1. Closure theorem (Informal): Call a Lagrangian objective of the form of Equation 7 in which all divergences are DKL a KL Lagrangian objective. Under the elementary equivalence defined in Definition 1,

1) any KL Lagrangian objective that is elementary equivalent to a likelihood-based computable objective is equivalent to a linear combination of VMI and β-VAE;

2) any KL Lagrangian objective that is elementary equivalent to a unary likelihood-free computable objective is equivalent to a linear combination of InfoVAE and InfoGAN;

3) any KL Lagrangian objective that is elementary equivalent to a binary likelihood-free computable objective is equivalent to a linear combination of ALICE, InfoVAE and InfoGAN.

We also argue in the Appendix (without formal proof) that this theorem holds for other divergences including DMMD, DW, Df and DJS.

Intuitively, this suggests a rather negative result: if a new latent variable model training objective contains a mutual information preference and consistency constraints (defined through DKL, DMMD, DW, Df or DJS), and this objective can be effectively optimized as in Definition 1 and Definition 2, then this objective is a linear combination of existing objectives. Our limitation is that we are restricted to elementary transformations and the set of terms defined in Definition 2. To derive new training objectives, we should consider new transformations, non-linear combinations and/or new terms.

5 DUAL OPTIMIZATION FOR LATENT VARIABLE GENERATIVE MODELS

While existing objectives for latent variable generative models have the dual form in Equation 7, they are not solving the dual problem exactly, because the Lagrange multipliers λ are predetermined instead of optimized. In particular, if we can show strong duality, the optimal solution to the dual is also an optimal solution to the primal (Boyd & Vandenberghe, 2004). However, if the Lagrange multipliers are fixed, this property is lost, and the parameters θ obtained via dual optimization may be suboptimal for min_θ f(θ), or violate the consistency conditions D = 0.

5.1 RELAXATION OF CONSISTENCY CONSTRAINTS

This observation motivates us to directly solve the dual optimization problem where we also optimize the Lagrange multipliers:

max_{λ≥0} min_θ f(θ) + λ⊤D

Unfortunately, this is usually impractical because the consistency constraints are difficult to satisfy exactly when the model has finite capacity, so in practice the primal optimization problem is infeasible and λ will be driven to +∞.

One approach to this problem is to use relaxed consistency constraints, where compared to Eq. (5) we only require consistency up to some error ε > 0:

min_θ f(θ)  subject to  D ≤ ε    (8)

For a sufficiently large ε, the problem is feasible. This has the corresponding dual problem:

max_{λ≥0} min_θ f(θ) + λ⊤(D − ε)    (9)

When λ is constant (instead of maximized), Equation 9 still reduces to existing latent variable generative modeling objectives: since λ⊤ε is a constant, the objective simply becomes

min_θ f(θ) + λ⊤D + constant

In contrast, we propose to find λ*, θ* that optimize the Lagrangian dual in Eq. (9). If we additionally have strong duality, θ* is also the optimal solution to the primal problem in Eq. (8).

5.2 STRONG DUALITY WITH MUTUAL INFORMATION OBJECTIVES

This section aims to show that strong duality for Eq. (8) holds in distribution space if we replace the mutual information terms in f with upper and lower bounds. We prove this via Slater's condition (Boyd & Vandenberghe, 2004), which has three requirements: 1. every D ∈ D is convex in θ; 2. f(θ) is convex for θ ∈ Θ; 3. the problem is strictly feasible: ∃θ s.t. D < ε. We propose weak conditions that satisfy all three in distribution space, so strong duality is guaranteed.

For simplicity we focus on discrete X and Z. We parameterize qθ(z|x) with a parameter matrix θq ∈ R^{|X||Z|} (we add the superscript q to distinguish the parameters of qθ from those of pθ), where

qθ(z = j | x = i) = θq_{ij},  ∀i ∈ X, j ∈ Z    (10)

The only restriction is that θq must correspond to valid conditional distributions. More formally, we require that θq ∈ Θq, where

Θq = { θq ∈ R^{|X||Z|} : 0 ≤ θq_{ij} ≤ 1, Σ_j θq_{ij} = 1 }    (11)

Similarly we define θp ∈ Θp for pθ. We still use

θ = [θq, θp],  Θ = Θq × Θp    (12)

to denote both sets of parameters.

1) Constraints D ∈ D are convex: We show that several divergences used in existing models are convex in distribution space.

Lemma 1 (Convex Constraints (Informal)). DKL, DMMD, and Df over any marginal distributions on x or z, or joint distributions on (x, z), are convex with respect to θ ∈ Θ as defined in Eq. (12).

Therefore, if one only uses these convex divergences, the first requirement of Slater's condition is satisfied.

2) Convex bounds for f(θ): f(θ) = α1 Iqθ(x; z) + α2 Ipθ(x; z) is not itself guaranteed to be convex in general. However, mutual information admits a convex upper bound and a concave lower bound, which we denote Īqθ and I̲qθ respectively:

Iqθ(x; z) = E_{q(x)}[DKL(qθ(z|x) ‖ p(z))] − DKL(qθ(z) ‖ p(z))    (13)
          = E_{qθ(x,z)}[log pθ(x|z)] + Hq(x) + E_{qθ(z)}[DKL(qθ(x|z) ‖ pθ(x|z))]

where Īqθ := E_{q(x)}[DKL(qθ(z|x) ‖ p(z))] is the convex upper bound, with bound gap Īqθ − Iqθ = DKL(qθ(z) ‖ p(z)); and I̲qθ := E_{qθ(x,z)}[log pθ(x|z)] + Hq(x) is the concave lower bound, with bound gap Iqθ − I̲qθ = E_{qθ(z)}[DKL(qθ(x|z) ‖ pθ(x|z))].

The convexity/concavity of these bounds is shown by the following lemma, which we prove in the appendix.

Lemma 2 (Convex/Concave Bounds). Īqθ is convex with respect to θ ∈ Θ as defined in Eq. (12), and I̲qθ is concave with respect to θ ∈ Θ.

A desirable property of these bounds is that their bound gaps (the difference between the bound and the true value) in Eq. (13) are 0 if the consistency constraint is satisfied (i.e., pθ(x, z) = qθ(x, z)), and the bounds are tight (the bound gaps are small) when the consistency constraints are approximately satisfied (i.e., pθ(x, z) ≈ qθ(x, z)). We denote the analogous bounds for Ipθ by Īpθ and I̲pθ. Similar bounds for mutual information have been discussed in (Alemi et al., 2017).

3) Strict feasibility: The optimization problem has a non-empty feasible set, which we show in the following lemma.

Lemma 3 (Strict Feasibility). For discrete X and Z and any ε > 0, ∃θ ∈ Θ such that D < ε.

We have therefore shown that, with the convex upper and concave lower bounds in place of f, all three of Slater's conditions are satisfied, so strong duality holds. We summarize this in the following theorem.

Theorem 2 (Strong Duality). If D contains only divergences in Lemma 1, then for all ε > 0:

if α1, α2 ≥ 0, strong duality holds for the problem

min_{θ∈Θ} α1 Īqθ + α2 Īpθ  subject to  D ≤ ε    (14)

and if α1, α2 ≤ 0, strong duality holds for the problem

min_{θ∈Θ} α1 I̲qθ + α2 I̲pθ  subject to  D ≤ ε    (15)


Algorithm 1 Dual Optimization for Latent Variable Generative Models

Input: analytical form for p(z) and samples from q(x); constraints D; α1, α2 specifying maximization/minimization of mutual information; ε > 0 specifying the strength of the constraints; step sizes ρθ, ρλ for θ and λ.
Output: θ (parameters for pθ(x|z) and qθ(z|x)).

Initialize θ randomly
Initialize the Lagrange multipliers λ := 1
if α1, α2 > 0 then
    f(θ) ← α1 Īqθ + α2 Īpθ
else
    f(θ) ← α1 I̲qθ + α2 I̲pθ
end if
for t = 0, 1, 2, . . . do
    θ ← θ − ρθ (∇θ f(θ) + λ⊤∇θ D)
    λ ← λ + ρλ (D − ε)
end for
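Algorithm 1 translates directly into a stochastic training loop. Below is a minimal PyTorch sketch of the θ-descent / λ-ascent updates; estimate_f and estimate_D are hypothetical placeholders for mini-batch estimators of the information bound and the constraint divergences (e.g., −L_ELBO and MMD for LagVAE in Section 6). We additionally project λ back onto λ ≥ 0 after each ascent step, a standard projected-ascent choice that maintains dual feasibility.

    import torch

    def dual_optimize(theta_params, estimate_f, estimate_D, eps,
                      steps=10_000, rho_theta=1e-3, rho_lam=1e-2):
        # theta_params: parameters of p_theta(x|z) and q_theta(z|x).
        # estimate_f(): scalar mini-batch estimate of the information bound.
        # estimate_D(): tensor of per-constraint divergence estimates, shape (m,).
        # eps: tensor of constraint levels epsilon, shape (m,).
        lam = torch.ones_like(eps)                  # lambda := 1
        opt = torch.optim.SGD(theta_params, lr=rho_theta)
        for _ in range(steps):
            D = estimate_D()
            loss = estimate_f() + (lam * D).sum()   # descend in theta
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():                   # ascend in lambda,
                # projected back to lambda >= 0 (dual feasibility)
                lam = (lam + rho_lam * (D.detach() - eps)).clamp(min=0.0)
        return lam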

5.3 DUAL OPTIMIZATION

Because the problem is convex in distribution space and satisfies Slater's condition, the θ* that achieves the saddle point

λ*, θ* = arg max_{λ≥0} arg min_θ f(θ) + λ⊤(D − ε)    (16)

is also a solution to the original optimization problem Eq. (8) (Boyd & Vandenberghe, 2004, Chapter 5.4). In addition, the max-min problem in Eq. (16) is convex with respect to θ and concave (linear) with respect to λ, so one can apply iterative gradient descent/ascent over θ (minimize) and λ (maximize) and achieve stable convergence to the saddle point (Holding & Lestas, 2014). We describe the iterative procedure in Algorithm 1.

In practice we do not optimize over distribution space, and {pθ(x|z)}, {qθ(z|x)} are highly complex, non-convex families of functions. We show in the experimental section that this scheme is stable and effective despite the non-convexity.

6 LAGRANGIAN VAE

In this section we consider a particular instantiation of the general dual problem proposed in the previous section. Consider the following primal problem, with α1 ∈ R:

min_θ α1 Iqθ(x; z)    (17)
subject to  DKL(qθ(x, z) ‖ pθ(x, z)) ≤ ε1
            DMMD(qθ(z) ‖ p(z)) ≤ ε2

For mutual information minimization/maximization, we replace the (possibly non-convex) mutual information by the upper bound Īqθ if α1 ≥ 0 and by the lower bound I̲qθ if α1 < 0. The corresponding dual optimization problem can be written as:

max_{λ≥0} min_θ  α1 Īqθ + λ⊤(D_InfoVAE − ε)   if α1 ≥ 0
max_{λ≥0} min_θ  α1 I̲qθ + λ⊤(D_InfoVAE − ε)   if α1 < 0    (18)

where ε = [ε1, ε2], λ = [λ1, λ2] and

D_InfoVAE = [DKL(qθ(x, z) ‖ pθ(x, z)), DMMD(qθ(z) ‖ p(z))]

We call the objective in Eq. (18) Lagrangian (Info)VAE (LagVAE). Note that setting a constant λ in the dual function recovers the InfoVAE objective (Zhao et al., 2017). By Theorem 2, strong duality holds for this problem, and finding the max-min saddle point of LagVAE in Eq. (18) is identical to finding the optimal solution to the original problem in Eq. (17).

The final issue is choosing the ε hyper-parameters so that the constraints are feasible. This is non-trivial, since selecting ε that describes feasible constraints depends on the task and model structure. We introduce a general strategy that is effective in all of our experiments. First, we learn a parameter θ* that satisfies the consistency constraints "as well as possible", without considering mutual information maximization/minimization. Formally, this is achieved by the following optimization (for any choice of λ > 0):

θ* = arg min_θ λ⊤ D_InfoVAE    (19)

This is the original InfoVAE training objective with α1 = 0, and it can be optimized via

min_θ λ⊤ D_InfoVAE
  = λ1 DKL(qθ(x, z) ‖ pθ(x, z)) + λ2 DMMD(qθ(z) ‖ p(z))
  ≡ −λ1 L_ELBO(θ) + λ2 DMMD(qθ(z) ‖ p(z))    (20)

where L_ELBO(θ) is the evidence lower bound defined in Eq. (4). Because we only need a rough estimate of how well the consistency constraints can be satisfied, the choice of the weights λ1 and λ2 is unimportant; the recommendation in (Zhao et al., 2017) (λ1 = 1, λ2 = 100) works well in all our experiments.

Now we introduce a "slack" to specify how much consistency error we are willing to tolerate in order to achieve higher/lower mutual information. Formally, we define ε̂ as the vector of divergences D_InfoVAE evaluated at the above θ*. Under this choice, the constraint

D_InfoVAE ≤ ε̂

must be feasible (because θ* is a solution). We can therefore safely set ε = ε̂ + γ, where γ > 0, and the constraint

D_InfoVAE ≤ ε

must still be feasible (and strictly feasible). γ has a very nice intuitive interpretation: it is the "slack" that we are willing to accept. Compared to tuning α1 and λ for InfoVAE, tuning γ is much more interpretable: we can anticipate the final consistency error before training.
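The ε selection procedure of this section can be sketched in a few lines. Everything named below (pretrain_infovae, neg_elbo, mmd) is a hypothetical placeholder for the corresponding training routine and divergence estimators:

    import torch

    def select_eps(pretrain_infovae, neg_elbo, mmd, gamma):
        # Step 1: fit theta* by minimizing lambda^T D_InfoVAE (Eqs. (19)-(20)),
        # with the recommended weights lambda_1 = 1, lambda_2 = 100.
        theta_star = pretrain_infovae(lam1=1.0, lam2=100.0)
        # Step 2: evaluate the divergences at theta* to obtain eps_hat, then
        # add the interpretable slack gamma to get a strictly feasible eps.
        with torch.no_grad():
            eps_hat = torch.tensor([neg_elbo(theta_star), mmd(theta_star)])
        return eps_hat + torch.as_tensor(gamma)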

Another practical consideration is that one of the constraints, DKL(qθ(x, z) ‖ pθ(x, z)), is difficult to estimate. However, we have

DKL(qθ(x, z) ‖ pθ(x, z)) = −L_ELBO − Hq(x)

where L_ELBO is again the evidence lower bound in Eq. (4) of Section 2, and Hq(x) is the entropy of the true distribution q(x). L_ELBO is empirically easy to estimate, and Hq(x) is a constant irrelevant to the optimization problem. The optimization problem is therefore identical if we replace the more difficult constraint DKL(qθ(x, z) ‖ pθ(x, z)) ≤ ε1 with the easier-to-optimize/estimate constraint −L_ELBO ≤ ε′1 (where ε′1 = ε1 + Hq(x)). In addition, ε′1 can be selected by the technique in the previous paragraph.

7 EXPERIMENTS

We compare the performance of LagVAE, where λ is learned automatically, and InfoVAE, where λ is set in advance (as a hyper-parameter). Our primal problem is to find solutions that maximize/minimize mutual information under the consistency constraints. We therefore consider two performance metrics:

• Iq(x; z), the mutual information between x and z, which we estimate via the identity

Iq(x; z) = E_{qθ(x,z)}[log qθ(z|x) − log qθ(z)]    (21)

where we approximate qθ(z) with a kernel density estimator (see the sketch after this list).

• The consistency divergences DKL(qθ(x, z) ‖ pθ(x, z)) and DMMD(qθ(z) ‖ p(z)). As stated in Section 6, we replace DKL(qθ(x, z) ‖ pθ(x, z)) with the evidence lower bound L_ELBO.
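A sketch of the estimator in Eq. (21), assuming a Gaussian qθ(z|x) as in the earlier sketches. Approximating qθ(z) by the mixture (1/N) Σ_j qθ(z|x_j) over the mini-batch is one common kernel-density-style implementation choice, stated here as an assumption:

    import math
    import torch

    def estimate_mi(x, enc):
        # I_q(x; z) ~ E[ log q_theta(z|x) - log q_theta(z) ]   (Eq. (21)),
        # with q_theta(z) approximated by (1/N) sum_j q_theta(z|x_j).
        mu, logvar = enc(x)                              # both (N, Z)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        var = logvar.exp()
        # log N(z_i; mu_j, diag(var_j)) for all pairs (i, j).
        log_pairwise = -0.5 * (
            ((z.unsqueeze(1) - mu.unsqueeze(0)) ** 2 / var.unsqueeze(0)).sum(-1)
            + logvar.sum(-1).unsqueeze(0)
            + z.shape[1] * math.log(2 * math.pi)
        )                                                # shape (N, N)
        log_posterior = log_pairwise.diagonal()          # log q_theta(z_i|x_i)
        log_marginal = (torch.logsumexp(log_pairwise, dim=1)
                        - math.log(x.shape[0]))          # log q_theta(z_i)
        return (log_posterior - log_marginal).mean()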

In the remainder of this section we demonstrate the following empirical observations:

• LagVAE reliably maximizes/minimizes mutual information without violating the consistency constraints. InfoVAE, on the other hand, makes unpredictable and task-specific trade-offs between mutual information and consistency.

• LagVAE is Pareto optimal: no InfoVAE hyper-parameter choice is able to achieve both better mutual information and better consistency (measured by DMMD and L_ELBO) than LagVAE.

7.1 VERIFICATION OF DUAL OPTIMIZATION

We first verify that LagVAE reliably maximizes/minimizes mutual information subject to the consistency constraints. We train LagVAE on MNIST according to Algorithm 1. ε is selected according to Section 6: we first compute ε̂ = (ε̂1, ε̂2) without information maximization/minimization by Eq. (20), then choose slack variables γ = (γ1, γ2) and set ε = ε̂ + γ. For γ1 we explore values from 0.1 to 4.0, and for γ2 we use the fixed value 0.5 ε̂2.

The results are shown in Figure 1, where mutual information is estimated according to Eq. (21). For any given slack γ, setting α1 to positive or negative values respectively minimizes or maximizes the mutual information within the feasible set D ≤ ε. In particular, the absolute value of α1 does not affect the outcome; only its sign matters. This is consistent with the expected behavior (Figure 1, left), where the model finds the maximum/minimum mutual information solution within the feasible set.

7.2 VERIFICATION OF PARETO IMPROVEMENTS

In this section we verify the Pareto optimality of LagVAE. We evaluate LagVAE and InfoVAE on the MNIST dataset with a wide variety of hyper-parameters. For LagVAE, we set ε1 for L_ELBO to be {83, 84, . . . , 95} and ε2 for DMMD to be 0.0005. For InfoVAE, we set α ∈ {1, −1}, λ1 ∈ {1, 2, 5, 10} and λ2 ∈ {1000, 2000, 5000, 10000}[3].

Figure 2 plots the mutual information and L_ELBO achieved by both methods. Each point is the outcome of one hyper-parameter choice for LagVAE / InfoVAE. Regardless of the hyper-parameter choices, no InfoVAE configuration leads to better performance on both mutual information and L_ELBO on the training set. This is expected, because LagVAE always finds the maximum/minimum mutual information solution out of all solutions with a given consistency value. The same trend holds even on the test set, indicating that it is not an outcome of over-fitting.

8 CONCLUSION

Many existing objectives for latent variable generative modeling are Lagrangian dual functions of the same type of constrained optimization problem with fixed Lagrange multipliers. This allows us to explore their statistical and computational trade-offs, and to characterize all models in this class. Moreover, we propose a practical dual optimization method that optimizes both the Lagrange multipliers and the model parameters, allowing us to specify interpretable constraints and achieve Pareto optimality empirically.

[3] Code for this set of experiments is available at https://github.com/ermongroup/lagvae


Figure 1: Left: Effect of α1 and γ1 on the primal objective (mutual information). When α1 is positive we minimize mutual information within the feasible set, and when α1 is negative we maximize mutual information. When α1 is zero the preference is undetermined, and mutual information varies depending on initialization. Note that mutual information does not depend on the absolute value of α1, only on its sign. Right: An illustration of this effect. Lagrangian dual optimization finds the maximum/minimum mutual information solution in the feasible set D ≤ ε.

Figure 2: LagVAE Pareto dominates InfoVAE with respect to mutual information and consistency (L_ELBO values) on the train (top) and test (bottom) sets. Each point is the outcome of one hyper-parameter choice for LagVAE / InfoVAE. When we maximize mutual information (α1 < 0), for any given L_ELBO value, LagVAE always achieves similar or larger mutual information; when we minimize mutual information (α1 > 0), for any given ELBO value, LagVAE always achieves similar or smaller mutual information.

In this work, we only considered Lagrangian (Info)VAE, but the method is generally applicable to other Lagrangian dual objectives. In addition, we only considered mutual information preferences. Exploring different preferences is a promising future direction.

Acknowledgements

This research was supported by Intel Corporation, TRI, NSF (#1651565, #1522054, #1733686) and FLI (#2017-158687).

References

Alemi, Alexander A., Poole, Ben, Fischer, Ian, Dillon, Joshua V., Saurous, Rif A., and Murphy, Kevin. An information-theoretic analysis of deep latent-variable models. CoRR, abs/1711.00464, 2017. URL http://arxiv.org/abs/1711.00464.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. ArXiv e-prints, January 2017.

Barber, David and Agakov, Felix. The IM algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 201-208. MIT Press, 2003.

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657, 2016a.

Chen, Xi, Kingma, Diederik P., Salimans, Tim, Duan, Yan, Dhariwal, Prafulla, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016b.

Dhar, Manik, Grover, Aditya, and Ermon, Stefano. Sparsegen: Modeling sparse deviations for compressed sensing using generative models. In International Conference on Machine Learning, 2018.

Donahue, Jeff, Krahenbuhl, Philipp, and Darrell, Trevor. Adversarial feature learning. CoRR, abs/1605.09782, 2016. URL http://arxiv.org/abs/1605.09782.

Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Lamb, Alex, Arjovsky, Martin, Mastropietro, Olivier, and Courville, Aaron. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016a.

Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Lamb, Alex, Arjovsky, Martin, Mastropietro, Olivier, and Courville, Aaron. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016b.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Gretton, Arthur, Borgwardt, Karsten M., Rasch, Malte, Scholkopf, Bernhard, and Smola, Alex J. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pp. 513-520, 2007.

Grover, Aditya, Dhar, Manik, and Ermon, Stefano. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In AAAI Conference on Artificial Intelligence, 2018.

Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, and Lerchner, Alexander. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

Holding, Thomas and Lestas, Ioannis. On the convergence to saddle points of concave-convex functions, the gradient method and emergence of oscillations. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 1143-1148. IEEE, 2014.

Kim, Taeksoo, Cha, Moonsu, Kim, Hyunsoo, Lee, Jung Kwon, and Kim, Jiwon. Learning to discover cross-domain relations with generative adversarial networks. CoRR, abs/1703.05192, 2017. URL http://arxiv.org/abs/1703.05192.

Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.

Kodali, Naveen, Hays, James, Abernethy, Jacob, and Kira, Zsolt. On convergence and stability of GANs. 2018.

Kuleshov, Volodymyr and Ermon, Stefano. Deep hybrid models: Bridging discriminative and generative approaches. In Proceedings of the Conference on Uncertainty in AI (UAI), 2017.

Li, Chunyuan, Liu, Hao, Chen, Changyou, Pu, Yunchen, Chen, Liqun, Henao, Ricardo, and Carin, Lawrence. Towards understanding adversarial learning for joint distribution matching. 2017a.

Li, Yunzhu, Song, Jiaming, and Ermon, Stefano. Inferring the latent structure of human decision-making from raw visual inputs. 2017b.

Liu, Qiang and Wang, Dilin. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv preprint arXiv:1608.04471, 2016.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Mescheder, Lars, Nowozin, Sebastian, and Geiger, Andreas. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.

Mohamed, Shakir and Lakshminarayanan, Balaji. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

Nowozin, Sebastian, Cseke, Botond, and Tomioka, Ryota. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271-279, 2016.

Pu, Yunchen, Wang, Weiyao, Henao, Ricardo, Chen, Liqun, Gan, Zhe, Li, Chunyuan, and Carin, Lawrence. Adversarial symmetric variational autoencoder. In Advances in Neural Information Processing Systems, pp. 4333-4342, 2017.

Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Ramdas, Aaditya, Reddi, Sashank Jakkam, Poczos, Barnabas, Singh, Aarti, and Wasserman, Larry A. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, pp. 3571-3577, 2015.

Rezende, D., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ArXiv e-prints, January 2014.

Shamir, Ohad, Sabato, Sivan, and Tishby, Naftali. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696-2711, 2010.

Tishby, Naftali and Zaslavsky, Noga. Deep learning and the information bottleneck principle. CoRR, abs/1503.02406, 2015. URL http://arxiv.org/abs/1503.02406.

Tolstikhin, Ilya, Bousquet, Olivier, Gelly, Sylvain, and Schoelkopf, Bernhard. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

Yang, Zichao, Hu, Zhiting, Salakhutdinov, Ruslan, and Berg-Kirkpatrick, Taylor. Improved variational autoencoders for text modeling using dilated convolutions. CoRR, abs/1702.08139, 2017. URL http://arxiv.org/abs/1702.08139.

Zhao, Shengjia, Song, Jiaming, and Ermon, Stefano. InfoVAE: Information maximizing variational autoencoders. CoRR, abs/1706.02262, 2017. URL http://arxiv.org/abs/1706.02262.

Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, and Efros, Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017. URL http://arxiv.org/abs/1703.10593.