
Multi-Agent Diverse Generative Adversarial Networks

Arnab Ghosh∗
University of Oxford, UK
[email protected]

Viveka Kulharia∗
University of Oxford, UK
[email protected]

Vinay Namboodiri
IIT Kanpur, India
[email protected]

Philip H.S. Torr
University of Oxford, UK
[email protected]

Puneet K. Dokania
University of Oxford, UK
[email protected]

Abstract

We propose MAD-GAN, an intuitive generalization of Generative Adversarial Networks (GANs) and their conditional variants to address the well-known problem of mode collapse. First, MAD-GAN is a multi-agent GAN architecture incorporating multiple generators and one discriminator. Second, to enforce that different generators capture diverse high probability modes, the discriminator of MAD-GAN is designed such that along with finding the real and fake samples, it is also required to identify the generator that generated the given fake sample. Intuitively, to succeed in this task, the discriminator must learn to push different generators towards different identifiable modes. We perform extensive experiments on synthetic and real datasets and compare MAD-GAN with different variants of GAN. We show high quality diverse sample generations for challenging tasks such as image-to-image translation and face generation. In addition, we also show that MAD-GAN is able to disentangle different modalities when trained using a highly challenging diverse-class dataset (e.g. a dataset with images of forests, icebergs, and bedrooms). In the end, we show its efficacy on the unsupervised feature representation task. In the Appendix, we introduce a similarity based competing objective (MAD-GAN-Sim) which encourages different generators to generate diverse samples based on a user-defined similarity metric. We show its performance on image-to-image translation, and also show its effectiveness on the unsupervised feature representation task.

1. Introduction

∗Joint first author. This is an updated version of our CVPR'18 paper with the same title. In this version, we also introduce MAD-GAN-Sim in Appendix B.

Figure 1: Diverse-class data generation using MAD-GAN. The diverse-class dataset contains images from different classes/modalities (in this case, forests, icebergs, and bedrooms). Each row represents generations by a particular generator and each column represents generations for a given random noise input z. As shown, once trained using this dataset, the generators of MAD-GAN are able to disentangle the different modalities; hence, each generator is able to generate images from a particular modality.

Generative models have attracted considerable attention recently. The underlying idea behind such models is to attempt to capture the distribution of high-dimensional data such as images and texts. Though these models are highly useful in various applications, it is computationally expensive to train them as they require intractable integration in a very high-dimensional space. This drastically limits their applicability. However, recently there has been considerable progress in deep generative models – a conglomerate of deep neural networks and generative models – as they do not explicitly require the intractable integration, and can be efficiently trained using the back-propagation algorithm. Two such famous examples are Generative Adversarial Networks (GANs) [13] and Variational Autoencoders [17].

In this paper we focus on GANs as they are known to produce sharp and plausible images. Briefly, GANs employ a generator and a discriminator where both are involved in a minimax game. The task of the discriminator is to learn the difference between real samples (from the true data distribution pd) and fake samples (from the generator distribution pg). The task of the generator, on the other hand, is to maximize the mistakes of the discriminator. At convergence, the generator learns to produce real looking images.


A few successful applications of GANs are video generation [30], image inpainting [25], image manipulation [33], 3D object generation [31], interactive image generation using a few brush strokes [33], image super-resolution [20], diagrammatic abstract reasoning [18] and conditional GANs [23, 27].

Despite the remarkable success of GANs, they suffer from the major problem of mode collapse [2, 7, 8, 22, 28]. Theoretically, convergence guarantees that the generator learns the true data distribution. In practice, however, reaching the true equilibrium is difficult and not guaranteed, which potentially leads to the aforementioned problem of mode collapse. Broadly speaking, there are two schools of thought to address the issue: (1) improving the learning of GANs to reach better optima [2, 22, 28]; and (2) explicitly enforcing GANs to capture diverse modes [7, 8, 21]. Here we focus on the latter.

Borrowing from the multi-agent algorithm [1] and coupled GAN [21], we propose to use multiple generators with one discriminator. We call this framework the Multi-Agent GAN architecture, as shown in Fig. 2. In detail, similar to the standard GAN, the objective of each generator here is to maximize the mistakes of the common discriminator. Depending on the task, it might be useful for different generators to share information. This is done using the initial layer parameters of the generators. Another reason behind sharing these parameters is the fact that initial layers capture low-frequency structures which are almost the same for a particular type of dataset (for example, faces); therefore, sharing them reduces redundant computations. However, when the dataset contains images from completely different modalities, one can avoid sharing these parameters. Naively using multiple generators may lead to the trivial solution where all the generators learn to generate similar samples. To resolve this issue and generate different visually plausible samples capturing diverse high probability modes, we propose to modify the objective function of the discriminator. In the modified objective, along with finding the real and the fake samples, the discriminator also has to correctly identify the generator that generated the given fake sample. Intuitively, in order to succeed in this task, the discriminator must learn to push generations corresponding to different generators towards different identifiable modes. Combining the Multi-Agent GAN architecture with the diversity enforcing term allows us to generate diverse plausible samples, thus the name Multi-Agent Diverse GAN (MAD-GAN).

As an example, an intuitive setting where mode collapse occurs is when a GAN is trained on a dataset containing images from different modalities/classes, for example, a diverse-class dataset containing images such as forests, icebergs, and bedrooms. This is of particular interest as it not only requires the model to disentangle intra-class variations, it also requires inter-class disentanglement. Fig. 1 demonstrates the surprising effectiveness of MAD-GAN in this challenging setting. Generators among themselves are able to disentangle inter-class variations, and each generator is also able to capture intra-class variations.

In addition, we analyze MAD-GAN through extensive experiments and compare it with several variants of GAN. First, for the proof of concept, we perform experiments in controlled settings using a synthetic dataset (mixture of Gaussians), and complicated Stacked/Compositional MNIST datasets with hand engineered modes. In these settings, we empirically show that our approach outperforms all other GAN variants we compare with, and is able to generate high quality samples while capturing a large number of modes. In a more realistic setting, we show high quality diverse sample generations for the challenging tasks of image-to-image translation [14] (conditional GAN) and face generation [8, 26]. Using the SVHN dataset [24], we also show the efficacy of our framework for learning the feature representation in an unsupervised setting.

We also provide a theoretical analysis of this approach and show that the proposed modification in the objective of the discriminator allows the generators to learn together as a mixture model where each generator represents a mixture component. We show that at convergence, the global optimum value of −(k + 1) log(k + 1) + k log k is achieved, where k is the number of generators.

Figure 2: Multi-Agent Diverse GAN (MAD-GAN). The discriminator outputs k + 1 softmax scores signifying the probability of its input sample being from either one of the k generators or the real distribution.

2. Related Work

The recent work called InfoGAN [8] proposed an information-theoretic extension to GANs in order to address the problem of mode collapse. Briefly, InfoGAN disentangles the latent representation by assuming a factored representation of the latent variables. In order to enforce that the generator learns factor specific generations, InfoGAN maximizes the mutual information between the factored latents and the generator distribution. Che et al. [7] proposed a mode regularized GAN (ModeGAN) which uses an encoder-decoder paradigm. The basic idea behind ModeGAN is that if a sample from the true data distribution pd belongs to a particular mode, then the sample generated by the generator (fake sample) when the true sample is passed through the encoder-decoder is likely to belong to the same mode. ModeGAN assumes that there exist enough true samples from a mode for the generator to be able to capture it. Another work by Metz et al. [22] proposed a surrogate objective for the update of the generator with respect to the unrolled optimization of the discriminator (UnrolledGAN) to address the issue of convergence of the training process of GANs. This improves the training process of the generator, which in turn allows it to better cover the true data distribution.

Liu et al. [21] presented Coupled GAN, a method for training two generators with shared parameters to learn the joint distribution of the data. The shared parameters guide both the generators towards similar subspaces, but since they are trained independently on two domains, they promote diverse generations. Durugkar et al. [10] proposed a model with multiple discriminators, whereby an ensemble of multiple discriminators has been shown to stabilize the training of the generator by guiding it to produce better samples.

W-GAN [3] is a recent technique which employs integral probability metrics based on the earth mover distance rather than the JS-divergence that the original GAN uses. BEGAN [5] builds upon W-GAN using an autoencoder based equilibrium enforcing technique alongside the Wasserstein distance. DCGAN [26] was a seminal technique which used a fully convolutional generator and discriminator for the first time, along with the introduction of batch normalization, thus stabilizing the training procedure, and was able to produce compelling generations. GoGAN [16] introduced a maximum margin formulation for training the discriminator alongside the earth mover distance based on the Wasserstein-1 metric. [4] introduced a technique and theoretical formulation stating the importance of multiple generators and discriminators in order to completely model the data distribution. In terms of employing multiple generators, our work is closest to [4, 21, 11]. However, while using multiple generators, our method explicitly enforces them to capture diverse modes.

3. Preliminaries

Here we present a brief review of GANs [13]. Given a set of samples D = (x_i)_{i=1}^n from the true data distribution pd, the GAN learning problem is to obtain the optimal parameters θg of a generator G(z; θg) that can sample from an approximate data distribution pg, where z ∼ pz is the prior input noise (e.g. samples from a normal distribution). In order to learn the optimal θg, the GAN objective (Eq. (1)) employs a discriminator D(x; θd) that learns to differentiate between a real (from pd) and a fake (from pg) sample x. The overall GAN objective is:

min_{θg} max_{θd} V(θd, θg) := E_{x∼pd} log D(x; θd) + E_{z∼pz} log(1 − D(G(z; θg); θd))    (1)

The above objective is optimized in a block-wise manner where θd and θg are optimized one at a time while fixing the other. For a given sample x (either from pd or pg) and the parameters θd, the function D(x; θd) ∈ [0, 1] produces a score that represents the probability of x belonging to the true data distribution pd (or the probability of it being real). The objective of the discriminator is to learn parameters θd that maximize this score for the true samples (from pd) while minimizing it for the fake ones x = G(z; θg) (from pg). In the case of the generator, the objective is to minimize E_{z∼pz} log(1 − D(G(z; θg); θd)), or equivalently maximize E_{z∼pz} log D(G(z; θg); θd). Thus, the generator learns to maximize the scores for the fake samples (from pg), which is exactly the opposite of what the discriminator is trying to achieve. In this manner, the generator and the discriminator are involved in a minimax game where the task of the generator is to maximize the mistakes of the discriminator. Theoretically, at equilibrium, the generator learns to generate real samples, which means pg = pd.
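For illustration, a minimal PyTorch sketch of one block-wise update of Eq. (1) is given below (our addition, not the authors' code; the networks G and D, their optimizers, and the real batch are assumed to be defined elsewhere, and D is assumed to output a probability in [0, 1]).

```python
import torch

def gan_step(G, D, opt_d, opt_g, real_batch, z_dim=64):
    """One block-wise update of Eq. (1): first theta_d, then theta_g."""
    batch_size = real_batch.size(0)

    # --- Discriminator update: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch_size, z_dim)
    fake_batch = G(z).detach()                      # fix theta_g while updating theta_d
    d_real = D(real_batch)                          # D(x; theta_d) in [0, 1]
    d_fake = D(fake_batch)                          # D(G(z; theta_g); theta_d)
    d_loss = -(torch.log(d_real + 1e-8).mean()
               + torch.log(1.0 - d_fake + 1e-8).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator update: maximize log D(G(z)) (the non-saturating form) ---
    z = torch.randn(batch_size, z_dim)
    g_loss = -torch.log(D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```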

4. Multi-Agent Diverse GAN

In the GAN objective, one can argue that the task of a generator is much harder than that of the discriminator, as it has to produce real looking images to maximize the mistakes of the discriminator. This, along with the minimax nature of the objective, raises several challenges for GANs [2, 7, 8, 22, 28]: (1) mode collapse; (2) difficult optimization; and (3) trivial solution. In this work we propose a new framework to address the first challenge of mode collapse by increasing the capacity of the generator while using well known tricks to partially avoid the other challenges [2].

Briefly, we propose a Multi-Agent GAN architecture that employs multiple generators and one discriminator in order to generate different samples from high probability regions of the true data distribution. In addition, theoretically, we show that our formulation allows the generators to act as a mixture model with each generator capturing one component.

4.1. Multi-Agent GAN Architecture

Here we describe our proposed architecture (Fig. 2). It involves k generators and one discriminator. In the case of homogeneous data (all the images belong to the same class, e.g. faces or birds), we allow all the generators to share information by tying most of the initial layer parameters. This is essential to avoid redundant computations, as the initial layers of a generator capture low-frequency structures which are almost the same for a particular type of dataset. This also allows different generators to converge faster. However, in the case of diverse-class data (e.g. a dataset with a mixture of different classes such as forests, icebergs etc.), it is necessary to avoid sharing these parameters to allow each generator to capture content specific structures. Thus, the extent to which one should share these parameters depends on the task at hand.

Figure 3: Visualization of different generators getting pushed towards different modes. Here, M1 and M2 could each be a cluster of modes, where each cluster itself contains many modes. The arrows abstractly represent generator specific gradients for the purpose of building intuition.

More specifically, given z ∼ pz for the i-th generator, similar to the standard GAN, the first step involves generating a sample (for example, an image) xi. Since each generator receives the same latent input sampled from the same distribution, naively using this simple approach may lead to the trivial solution where all the generators learn to generate similar samples. In what follows, we propose an intuitive solution to avoid this issue and allow the generators to capture diverse modes.
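As a rough sketch of this sharing scheme (our illustration; layer sizes and the 1D output are placeholders, not the architecture used in the paper), the k generators below share a common trunk of initial layers and keep only their last layers separate; for diverse-class data one would instead instantiate a separate trunk per generator.

```python
import torch.nn as nn

class MultiAgentGenerators(nn.Module):
    """k generators that share their initial layers and differ in the last layer."""
    def __init__(self, z_dim=64, hidden=128, out_dim=1, k=4, share_trunk=True):
        super().__init__()
        def make_trunk():
            return nn.Sequential(nn.Linear(z_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, hidden), nn.ELU())
        # Homogeneous data: one shared trunk; diverse-class data: one trunk per generator.
        self.trunks = nn.ModuleList([make_trunk()]) if share_trunk \
            else nn.ModuleList([make_trunk() for _ in range(k)])
        self.heads = nn.ModuleList([nn.Linear(hidden, out_dim) for _ in range(k)])
        self.share_trunk = share_trunk

    def forward(self, z):
        # Returns a list of k samples, one per generator, for the same input z.
        outs = []
        for i, head in enumerate(self.heads):
            trunk = self.trunks[0] if self.share_trunk else self.trunks[i]
            outs.append(head(trunk(z)))
        return outs
```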

4.2. Enforcing Diverse Modes

Inspired by the discriminator formulation for semi-supervised learning [28], we use a generator identification based objective function that, along with minimizing the score D(x; θd), requires the discriminator to identify the generator that generated the given fake sample x. In order to do so, as opposed to the standard GAN objective function where the discriminator outputs a scalar value, we modify it to output k + 1 soft-max scores. In more detail, given the set of k generators, the discriminator produces a soft-max probability distribution over k + 1 classes. The score at the (k + 1)-th index (Dk+1(.)) represents the probability that the sample belongs to the true data distribution, and the score at the j ∈ {1, . . . , k}-th index represents the probability of it being generated by the j-th generator. Under this setting, while learning θd, we optimize the cross-entropy between the soft-max output of the discriminator and the Dirac delta distribution δ ∈ {0, 1}^{k+1}, where for j ∈ {1, . . . , k}, δ(j) = 1 if the sample belongs to the j-th generator, otherwise δ(k + 1) = 1. Thus, the objective of the discriminator, which is optimizing θd while keeping θg constant (refer to Eq. (1)), is modified to:

max_{θd} E_{x∼p} H(δ, D(x; θd))

where Supp(p) = ∪_{i=1}^k Supp(pgi) ∪ Supp(pd) and H(., .) is the negative of the cross entropy function. Intuitively, in order to correctly identify the generator that produced a given fake sample, the discriminator must learn to push different generators towards different identifiable modes. However, the objective of each generator remains the same as in the standard GAN. Thus, for the i-th generator, the objective is to minimize the following:

E_{x∼pd} log Dk+1(x; θd) + E_{z∼pz} log(1 − Dk+1(Gi(z; θig); θd))

To update the parameters, the gradient for each generator is simply computed as ∇_{θig} log(1 − Dk+1(Gi(z; θig); θd)). Notice that all the generators in this case can be updated in parallel. For the discriminator, given x ∼ p (which can be real or fake) and the corresponding δ, the gradient is ∇_{θd} log Dj(x; θd), where Dj(x; θd) is the j-th index of D(x; θd) for which δ(j) = 1. Therefore, using this approach requires very minor modifications to the standard GAN optimization algorithm and can be easily used with different variants of GAN. An intuitive visualization is shown in Fig. 3.
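A minimal sketch of this modified objective follows (our illustration, not the authors' implementation). It assumes a discriminator D that outputs k + 1 unnormalized scores with the last index standing for 'real', a list of k generators, and their optimizers; the discriminator is trained with a cross-entropy over the generator index / real class, while each generator keeps the standard GAN objective on the 'real' probability Dk+1.

```python
import torch
import torch.nn.functional as F

def madgan_step(generators, D, opt_d, opt_gs, real_batch, z_dim=64):
    """One MAD-GAN update; D(x) returns k+1 logits, last index = 'real'."""
    k = len(generators)
    batch_size = real_batch.size(0)

    # --- Discriminator: cross-entropy against the generator index / real class ---
    d_loss = F.cross_entropy(D(real_batch),
                             torch.full((batch_size,), k, dtype=torch.long))
    for i, G in enumerate(generators):
        fake = G(torch.randn(batch_size, z_dim)).detach()
        d_loss = d_loss + F.cross_entropy(D(fake),
                                          torch.full((batch_size,), i, dtype=torch.long))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generators: standard GAN objective on the 'real' probability D_{k+1} ---
    for G, opt_g in zip(generators, opt_gs):
        p_real = torch.softmax(D(G(torch.randn(batch_size, z_dim))), dim=1)[:, k]
        g_loss = torch.log(1.0 - p_real + 1e-8).mean()   # minimize log(1 - D_{k+1})
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```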

Theorem 1 shows that the above objective function actually allows the generators to form a mixture model where each generator represents a mixture component, and the global optimum of −(k + 1) log(k + 1) + k log k is achieved when pd = (1/k) Σ_{i=1}^k pgi. Notice that, at k = 1, which is the case with one generator, we obtain exactly the same Jensen-Shannon divergence based objective function as shown in [13], with the optimal value of − log 4.

Theorem 1. Given the optimal discriminator, the objective for training the generators boils down to minimizing

KL(pd(x) || pavg(x)) + k KL((1/k) Σ_{i=1}^k pgi(x) || pavg(x)) − (k + 1) log(k + 1) + k log k    (2)

where pavg(x) = (pd(x) + Σ_{i=1}^k pgi(x)) / (k + 1). The above objective function obtains its global minimum if pd = (1/k) Σ_{i=1}^k pgi, with the objective value of −(k + 1) log(k + 1) + k log k.
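As a quick sanity check of the stated optimum (our addition, not part of the original paper), substituting k = 1 into Eq. (2) recovers the standard GAN result of [13]:

```latex
% k = 1: p_avg = (p_d + p_{g_1})/2, so Eq. (2) reduces to
\mathrm{KL}\!\left(p_d \,\|\, \tfrac{p_d + p_{g_1}}{2}\right)
+ \mathrm{KL}\!\left(p_{g_1} \,\|\, \tfrac{p_d + p_{g_1}}{2}\right) - 2\log 2
= 2\,\mathrm{JSD}(p_d \,\|\, p_{g_1}) - \log 4,
% which attains its minimum -log 4 exactly when p_d = p_{g_1}.
```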

Proof. The joint objective of all the generators is to minimize the following:

E_{x∼pd} log Dk+1(x) + Σ_{i=1}^k E_{x∼pgi} log(1 − Dk+1(x))

Using Corollary 1, we substitute the optimal discriminator in the above equation and obtain:

E_{x∼pd} log [ pd(x) / (pd(x) + Σ_{i=1}^k pgi(x)) ] + Σ_{i=1}^k E_{x∼pgi} log [ Σ_{i=1}^k pgi(x) / (pd(x) + Σ_{i=1}^k pgi(x)) ]
= E_{x∼pd} log [ pd(x) / pavg(x) ] + k E_{x∼pg} log [ pg(x) / pavg(x) ] − (k + 1) log(k + 1) + k log k    (3)

where pg = (Σ_{i=1}^k pgi) / k and pavg(x) = (pd(x) + Σ_{i=1}^k pgi(x)) / (k + 1). Note that Eq. (3) is exactly the same as Eq. (2). When pd = (Σ_{i=1}^k pgi) / k, both the KL terms become zero and the global minimum is achieved.

Corollary 1. For fixed generators, the optimal distribution learned by the discriminator D has the following form:

Dk+1(x) = pd(x) / (pd(x) + Σ_{i=1}^k pgi(x)),
Di(x) = pgi(x) / (pd(x) + Σ_{i=1}^k pgi(x)),   ∀i ∈ {1, · · · , k},

where Di(x) represents the i-th index of D(x; θd), pd the true data distribution, and pgi the distribution learned by the i-th generator.

Proof. For fixed generators, the objective function of the discriminator is to maximize

E_{x∼pd} log Dk+1(x) + Σ_{i=1}^k E_{xi∼pgi} log Di(xi)

where Σ_{i=1}^{k+1} Di(x) = 1 and Di(x) ∈ [0, 1], ∀i. The above equation can be written as:

∫_x pd(x) log Dk+1(x) dx + Σ_{i=1}^k ∫_x pgi(x) log Di(x) dx = ∫_{x∈Supp(p)} Σ_{i=1}^{k+1} pi(x) log Di(x) dx    (4)

where pk+1(x) := pd(x), pi(x) := pgi(x), ∀i ∈ {1, · · · , k}, and Supp(p) = ∪_{i=1}^k Supp(pgi) ∪ Supp(pd). Therefore, for a given x, the optimum of the objective function defined in Eq. (4) with the constraints defined above can be obtained using Proposition 1.

Proposition 1. Given y = (y1, · · · , yn), yi ≥ 0, and ai ∈ R, the optimal solution of the objective function defined below is achieved at y*_i = ai / Σ_{i=1}^n ai, ∀i:

max_y Σ_{i=1}^n ai log yi,   s.t. Σ_{i=1}^n yi = 1

Proof. The Lagrangian of the above problem is:

L(y, λ) = Σ_{i=1}^n ai log yi + λ (Σ_{i=1}^n yi − 1)

Differentiating w.r.t. yi and λ, and equating to zero:

ai / yi + λ = 0,   Σ_{i=1}^n yi − 1 = 0

Solving the above two equations, we obtain y*_i = ai / Σ_{i=1}^n ai.

5. Experiments

We present an extensive quantitative and qualitative analysis of MAD-GAN on various synthetic and real-world datasets. First, we use a simple 1D mixture of Gaussians and also the Stacked/Compositional MNIST datasets (1000 modes) to compare MAD-GAN with several known variants of GANs, such as DCGAN [26], WGAN [3], BEGAN [5], GoGAN [16], Unrolled GAN [22], Mode-Reg GAN [7] and InfoGAN [8]. Furthermore, we created another baseline, called MA-GAN (Multi-Agent GAN), which is a trivial extension of GAN with multiple generators and one discriminator. As opposed to MAD-GAN, MA-GAN has a simple Multi-Agent architecture without modifications to the objective of the discriminator. This comparison allows us to understand the effect of explicitly enforcing diversity in the objective of MAD-GAN. We use the KL-divergence [19] and the number of modes recovered [7] as the criteria for comparison and show superior results compared to all the other methods. Additionally, we show diverse generations for the challenging tasks of image-to-image translation [14], diverse-class data generation, and face generation. It is non-trivial to devise a metric to evaluate diversity on these high quality generation tasks, so we perform a qualitative assessment. Note that the image-to-image translation objective is known to learn the delta distribution, thus it is agnostic to the input noise vector. However, we show that MAD-GAN is able to produce highly plausible diverse generations for this task. In the end, we show the efficacy of MAD-GAN on the unsupervised feature representation learning task. We provide a detailed overview of the architectures, datasets, and parameters used in our experiments in Appendix C.


Figure 4: A toy example to understand the behaviour of different GAN variants in order to compare with MAD-GAN (each method was trained for 198,000 iterations). Panels: (a) DCGAN, (b) WGAN, (c) BEGAN, (d) GoGAN, (e) Unrolled GAN, (f) Mode-Reg DCGAN, (g) InfoGAN, (h) MA-GAN, (i) MAD-GAN (Our). The orange bars show the density estimate of the training data and the blue ones that of the generated data points. After careful cross-validation, we chose a bin size of 0.1.

Figure 5: A toy example to understand the behavior of MAD-GAN with different numbers of generators, from (a) 1 generator to (h) 8 generators (each model was trained for 198,000 iterations). The orange bars show the density estimate of the training data and the blue ones that of the generated data points. After careful cross-validation, we chose a bin size of 0.1.

In the case of InfoGAN [8], we varied the dimension of the categorical variable, depicting the number of modes, to obtain the best cross-validated results.

5.1. Non-Parametric Density Estimation

In order to understand the behavior of MAD-GAN and different state-of-the-art GAN models, we first perform a very simple synthetic experiment, much easier than generating high-dimensional complex images. We consider a distribution of a 1D GMM [6] having five mixture components with modes at 10, 20, 60, 80 and 110, and standard deviations of 3, 3, 2, 2 and 1, respectively. While the first two modes overlap significantly, the fifth mode stands isolated, as shown in Fig. 4. We train different GAN models using 200,000 samples from this distribution and generate 65,536 data points from each model. In order to compare the learned distribution with the ground truth distribution, we first estimate them using bins over the data points and create the histograms. These histograms are carefully created using different bin sizes and the best bin (found to be 0.1) is chosen. Then, we use the Chi-square distance and the KL-divergence to compute the distance between the two histograms. From Fig. 4 and Tab. 1 it is evident that MAD-GAN is able to capture all the clustered modes, which includes the significantly overlapped modes as well. MAD-GAN obtains the minimum value in terms of both the Chi-square distance and the KL-divergence.


GAN Variants           Chi-square (×10^5)   KL-Div
DCGAN [26]             0.90                 0.322
WGAN [3]               1.32                 0.614
BEGAN [5]              1.06                 0.944
GoGAN [16]             2.52                 0.652
Unrolled GAN [22]      3.98                 1.321
Mode-Reg DCGAN [7]     1.02                 0.927
InfoGAN [8]            0.83                 0.21
MA-GAN                 1.39                 0.526
MAD-GAN (Our)          0.24                 0.145

Table 1: Synthetic experiment on 1D GMM (Fig. 4).

# Generators   Chi-square (×10^7)   KL-Div
1              1.27                 0.57
2              1.38                 0.42
3              3.15                 0.71
4              0.39                 0.28
5              3.05                 0.88
6              0.54                 0.29
7              0.97                 0.78
8              4.83                 0.68

Table 2: Synthetic experiment with different numbers of MAD-GAN generators (same setup as in Fig. 4).

distance and the KL-divergence. In this experiment, bothMAD-GAN and MA-GAN used four generators. In the caseof InfoGAN, we used 5 dimensional categorical variable,which provides the best result.

To understand the effect of varying the number of generators in MAD-GAN, we use the same synthetic experiment setup, i.e. the real data distribution is the same GMM with 5 Gaussians. For better non-parametric estimation, we use 1 million sample points from the real distribution (instead of 65,536). We generate an equal number of points from each of the generators such that they sum up to 1 million. The results are shown in Fig. 5 and the corresponding Tab. 2. It is quite clear that as the number of generators is increased up to 4, the sampling keeps getting more realistic. In cases where multiple modes are significantly overlapped/clustered, a single generator can capture a cluster of modes. Therefore, for this real data distribution, 4 generators are enough to capture all the 5 modes. With 5 or more generators, all the modes were still captured, but the two overlapping modes have more than two generation peaks. This is mainly because multiple generators are capturing this region and all the generators (mixture components) were assigned equal weights during sampling.
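The evaluation described above can be sketched as follows (our illustration of the protocol, not the authors' evaluation script; the random seed, the generator stand-in, and the exact chi-square variant are assumptions):

```python
import numpy as np

def sample_gmm(n, means=(10, 20, 60, 80, 110), stds=(3, 3, 2, 2, 1), rng=None):
    """Draw n samples from the 1D mixture of 5 equally weighted Gaussians."""
    rng = rng or np.random.default_rng(0)
    comps = rng.integers(0, len(means), size=n)
    return rng.normal(np.take(means, comps), np.take(stds, comps))

def histogram_distances(real, fake, bin_size=0.1):
    """Chi-square and KL distances between binned density estimates."""
    lo = min(real.min(), fake.min())
    hi = max(real.max(), fake.max())
    bins = np.arange(lo, hi + bin_size, bin_size)
    p, _ = np.histogram(real, bins=bins, density=True)
    q, _ = np.histogram(fake, bins=bins, density=True)
    p, q = p + 1e-12, q + 1e-12                   # avoid division by / log of zero
    chi_sq = np.sum((p - q) ** 2 / (p + q))       # symmetric chi-square style distance
    kl = np.sum(p * bin_size * np.log(p / q))     # KL(real || generated)
    return chi_sq, kl

real = sample_gmm(200_000)
fake = sample_gmm(65_536)   # stand-in for samples drawn from a trained generator
print(histogram_distances(real, fake))
```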

Other works using more than one generator [21, 4] also use the number of generators as a hyper-parameter, as knowing a priori the number of modes in real-world data (e.g. images) is in itself an open problem.

GAN Variants           KL Div   # Modes Covered
DCGAN [26]             2.15     712
WGAN [3]               1.02     868
BEGAN [5]              1.89     819
GoGAN [16]             2.89     672
Unrolled GAN [22]      1.29     842
Mode-Reg DCGAN [7]     1.79     827
InfoGAN [8]            2.75     840
MA-GAN                 3.4      700
MAD-GAN (Our)          0.91     890

Table 3: Stacked-MNIST experiments and comparisons. Note that three generators are used for MAD-GAN.

GAN Variants           KL Div   # Modes Covered
DCGAN [26]             0.18     980
WGAN [3]               0.25     1000
BEGAN [5]              0.19     999
GoGAN [16]             0.87     972
Unrolled GAN [22]      0.091    1000
Mode-Reg DCGAN [7]     0.12     992
InfoGAN [8]            0.47     990
MA-GAN                 1.62     997
MAD-GAN (Our)          0.074    1000

Table 4: Compositional-MNIST experiments and comparisons. Note that three generators are used for MAD-GAN.

5.2. Stacked and Compositional MNIST

We now perform experiments in a more challenging setup, similar to [7, 22], in order to examine and compare MAD-GAN with other GAN variants. [22] created a Stacked-MNIST dataset with 25,600 samples where each sample has three channels stacked together, with a random MNIST digit in each of them. This creates 1000 distinct modes in the data distribution. [22] used a stripped down version of the generator and discriminator pair to reduce the modeling capacity. We do the same for fair comparisons and use the same architecture as mentioned in their paper. Similarly, [7] created Compositional-MNIST, whereby they took 3 random MNIST digits and placed them in 3 quadrants of a 64×64 dimensional image. This also results in a data distribution with 1000 hand-designed modes. The distribution of the resulting generated samples is estimated using a pretrained MNIST classifier to classify each of the digits, either in the channels or the quadrants, to decide the mode a sample belongs to.
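A sketch of this mode-coverage evaluation is given below (our illustration; classifier is a hypothetical pretrained MNIST digit classifier applied per channel or per quadrant, and the KL direction is an assumption). Each sample is mapped to one of the 1000 modes via its three predicted digits, and the resulting mode histogram is compared against the uniform ground truth.

```python
import numpy as np

def evaluate_modes(samples, classifier):
    """samples: array of shape (n, 3, 28, 28); classifier: channel -> digit in 0..9."""
    counts = np.zeros(1000)
    for s in samples:
        digits = [classifier(ch) for ch in s]        # one digit per channel/quadrant
        mode = digits[0] * 100 + digits[1] * 10 + digits[2]
        counts[mode] += 1
    modes_covered = int((counts > 0).sum())
    p = counts / counts.sum()                        # generated mode distribution
    q = np.full(1000, 1.0 / 1000)                    # true distribution is uniform
    mask = p > 0
    kl = float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return modes_covered, kl
```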

Tables 3 and 4 provide a comparison of our method with variants of GAN in terms of KL divergence and the number of modes recovered for the Stacked and Compositional MNIST datasets, respectively. In Stacked-MNIST, as evident from Tab. 3, MAD-GAN outperforms all other variants of GAN on both criteria. Interestingly, in the case of Compositional-MNIST, as shown in Tab. 4, MAD-GAN, WGAN and Unrolled GAN were able to recover all the 1000 modes. However, in terms of KL divergence, the distribution generated by MAD-GAN is the closest to the true data distribution.

Figure 7: Diverse generations for the edges-to-handbags generation task. In each sub-figure, the first column is the input, columns 2-4 are generations by MAD-GAN (using three generators), and columns 5-7 are generations by InfoGAN (using three categorical codes). Clearly, different generators of MAD-GAN are producing diverse results capturing different colors, textures, design patterns, etc. However, InfoGAN generations are visually almost the same, indicating mode collapse.

Figure 9: InfoGAN for the edges-to-handbags task, sharing the discriminator and Q Network. In each sub-figure, the first column is the input, columns 2-4 are generations when the input is the categorical code besides the conditioning image, and columns 5-7 are generations with noise as an additional input. The generations for both architectures are visually the same irrespective of the categorical code value, which clearly indicates that it is not able to capture diverse modes.

5.3. Diverse Samples for Image-to-Image Translation and Comparison to InfoGAN

Here we present experimental results on the challenging task of image-to-image translation [14], which uses the conditional variant of GANs [23]. The conditional GAN for this task is known to learn the delta distribution and thus generates the same image irrespective of the variations in the input noise vector. Generating diverse samples in this setting is in itself an open problem. We show that MAD-GAN is able to generate diverse samples in these experiments as well. We use three generators for the MAD-GAN experiments and show three diverse generations. Note that we do not claim to capture all the possible modes present in the data distribution, because firstly we cannot estimate the number of modes a priori, and secondly, even if we could, we do not know how diverse the generations would be after using a certain number of generators. We follow the same approach as [14] and employ a patch based conditional GAN.

Figure 11: Diverse generations for the night-to-day image generation task. The first column in each sub-figure represents the input. The remaining three columns show the diverse generations of three different generators of MAD-GAN (Our).

Figure 13: Face generations using MAD-GAN. Each sub-figure shows generations by a single generator. The first generator is generating faces with a very dark background. The second one is generating female faces with long hair in a light background, while the third one is generating faces with a colored background and casual look (based on facial direction and expression).

We compare MAD-GAN with InfoGAN [8] in these experiments as it is closest to our approach and can be used in the image-to-image translation task. Theoretically, the latent codes in InfoGAN should enable diverse generations. However, InfoGAN can only be used when the bias introduced by the categorical variables has a significant impact on the generator network. For image-to-image translation and high resolution generations, the categorical variable does not have sufficient impact on the generations. As will be seen shortly, we validate this hypothesis by comparing our method with InfoGAN for this task. For the InfoGAN generator, to capture three kinds of distinct modes, the categorical code is chosen to take three values. Since we are dealing with images, in this case the categorical code is a 2D matrix in which we set one third of the entries to 1 and the remaining to 0 for each category. The generator is fed the input image along with the categorical code appended channel-wise to the image. The architecture of the Q network is the same as that of the pix2pix discriminator [14], except that the output is a vector of size 3 for the prediction of the categorical codes. Note that we tried different variations of the categorical codes but did not observe any significant variation in the generations.
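For concreteness, a sketch of how such a spatial categorical code can be built and appended to the conditioning image is shown below (our illustration; the image size and the exact layout of the 'one third of entries' are assumptions, as the paper does not specify them).

```python
import torch

def categorical_code_map(category, height=256, width=256, num_categories=3):
    """2D code map: one third of the rows set to 1 for the chosen category."""
    code = torch.zeros(1, height, width)
    band = height // num_categories
    code[:, category * band:(category + 1) * band, :] = 1.0
    return code

def append_code(image, category):
    """image: (3, H, W) conditioning image; returns a (4, H, W) generator input."""
    c, h, w = image.shape
    return torch.cat([image, categorical_code_map(category, h, w)], dim=0)
```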

Fig. 7 shows generations by MAD-GAN and InfoGAN for the edges-to-handbags task, where given the edges of handbags, the objective is to generate real looking handbags. Clearly, each MAD-GAN generator is able to produce meaningful images but different from the remaining generators in terms of color, texture, and patterns. However, InfoGAN generations are almost the same for all three categorical codes. The results shown for InfoGAN are obtained by not sharing the discriminator and Q network parameters.

To make our baseline as strong as possible, we did some more experiments with InfoGAN for the edges-to-handbags task. For Fig. 9, we did two experiments in which all the initial layers of the discriminator and Q network are shared. In the first experiment, the input is the categorical code besides the conditioning image. In the second experiment, noise is also added as an input. The architecture details are given in Appendix C.3.2. In Fig. 9, we show the results of both these experiments side by side. There is no perceivable change as we vary the categorical code values. The generator simply learns to ignore the input noise, as was also pointed out by [14].

In addition, in Fig. 11, we show diverse generations for the night-to-day task, where given night images of places, the objective is to generate their corresponding day images. As can be seen, the generated day images in Fig. 11 differ in terms of lighting conditions, sky patterns, weather conditions, and many other minute yet useful cues.

5.4. Diverse-Class Data Generation

To further explore the mode capturing capacity of MAD-GAN, we experimented with a much more challenging task of diverse-class data generation. In detail, we trained MAD-GAN (three generators) on a combined dataset consisting of various highly diverse images such as islets, icebergs, broadleaf-forest, bamboo-forest, and bedroom, obtained from the Places dataset [32]. Images were randomly selected from each of them, creating a training dataset of 24,000 images. The generators have the same architecture as that of DCGAN. In this case, as the images in the dataset belong to different classes, we did not share the generator parameters. As shown in Fig. 1, to our surprise, we found that even in this highly challenging setting, the generations from different generators belong to different classes. This clearly indicates that the generators in MAD-GAN are able to disentangle inter-class variations. In addition, each generator for different noise inputs is able to generate diverse samples, indicating intra-class diversity.

Figure 14: Face generations using MAD-GAN. Each generator employed is a DCGAN. Each row represents a generator. Each column represents generations for a given random noise input z. Note that the first generator is generating faces pointing to the left. The second generator is generating female faces with long hair, while the third generator generates images with a light background.

5.5. Diverse Face Generation

Here we show diverse face generations (CelebA dataset) using MAD-GAN, where we use DCGAN [26] as each of our three generators. Again, we use the same settings as provided in DCGAN. The high quality face generations are shown in Fig. 14.

To get a better understanding of the possible diversity, we show additional generations in Fig. 13.

5.6. Unsupervised Representation Learning

Similar to DCGAN [26], we train our framework using the SVHN dataset [24]. The trained discriminator is used to extract features. Using these features, we train an SVM for the classification task. For MAD-GAN, with three generators, we obtained a misclassification error of 17.5%, which is almost 5% better than the result reported by DCGAN (22.48%). This clearly indicates that our framework is able to learn a better feature space in an unsupervised setting.
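A sketch of this evaluation protocol is given below (our illustration; scikit-learn and the linear SVM are assumed tools, and extract_features is a hypothetical wrapper around the trained discriminator's intermediate activations, not the authors' pipeline).

```python
from sklearn.svm import LinearSVC

def svhn_feature_eval(extract_features, train_images, train_labels,
                      test_images, test_labels):
    """extract_features: trained-discriminator feature extractor, images -> (n, d)."""
    x_train = extract_features(train_images)
    x_test = extract_features(test_images)
    clf = LinearSVC(C=1.0).fit(x_train, train_labels)
    accuracy = clf.score(x_test, test_labels)
    return 1.0 - accuracy        # misclassification error, as reported in the text
```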

6. Conclusion

We presented a very simple and effective framework, Multi-Agent Diverse GAN (MAD-GAN), for generating diverse and meaningful samples. We showed the efficacy of our approach by comparing it with various variants of GAN, demonstrating that it captures diverse modes while producing high quality samples. We presented a theoretical analysis of MAD-GAN with conditions for global optimality. Looking forward, an interesting future direction would be to estimate a priori the number of generators needed for a particular dataset. It is not clear how to do that given that we do not have access to the true data distribution. In addition, we would also like to theoretically understand the limiting cases that depend on the relationship between the number of generators and the complexity of the data distribution. Another interesting direction would be to exploit different generators such that their combinations can be used to capture diverse modes.

7. Acknowledgements

This work was supported by the EPSRC, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1.

References

[1] M. Abadi and D. Andersen. Learning to protect communications with adversarial neural cryptography. arXiv preprint arXiv:1610.06918, 2016.
[2] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In International Conference on Machine Learning, 2017.
[4] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, 2017.
[5] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[6] C. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. 2006, corr. 2nd printing edn. Springer, New York, 2007.
[7] T. Che, Y. Li, A. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. In International Conference on Learning Representations, 2017.
[8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
[10] I. Durugkar, I. Gemp, and S. Mahadevan. Generative multi-adversarial networks. In International Conference on Learning Representations, 2017.
[11] A. Ghosh, V. Kulharia, and V. Namboodiri. Message passing multi-agent GANs. arXiv preprint arXiv:1612.01294, 2016.
[12] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[14] P. Isola, J.-Y. Zhu, T. Zhou, and A. Efros. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition, 2017.
[15] T. Joachims, T. Finley, and C. Yu. Cutting-plane training of structural SVMs. Machine Learning, 2009.
[16] F. Juefei-Xu, V. N. Boddeti, and M. Savvides. Gang of GANs: Generative adversarial networks with maximum margin ranking. arXiv preprint arXiv:1704.04865, 2017.
[17] D. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[18] V. Kulharia, A. Ghosh, A. Mukerjee, V. Namboodiri, and M. Bansal. Contextual RNN-GANs for abstract reasoning diagram generation. In AAAI Conference on Artificial Intelligence, 2017.
[19] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 1951.
[20] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Computer Vision and Pattern Recognition, 2017.
[21] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, 2016.
[22] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In International Conference on Learning Representations, 2017.
[23] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[24] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[25] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition, 2016.
[26] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[27] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, 2016.
[28] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
[29] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning, 2004.
[30] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, 2016.
[31] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, 2016.
[32] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[33] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, 2016.


Appendix

Here, we first give better insights about Theorem 1 and discuss how and when MAD-GAN leads to diverse generations. In Appendix B, we introduce another way of getting different generators to generate diverse samples: an intuitive similarity based competing objective (MAD-GAN-Sim) which encourages different generators to generate diverse samples. Finally, in Appendix C, we provide architecture details and data preparation for all the experiments reported for MAD-GAN and MAD-GAN-Sim.

A. Insights for Diversity in MAD-GAN

One obvious question that could arise is: is it possible that all the generators learn to capture the same mode? The short answer is: theoretically yes, and in practice no. Let us begin with the discussion to understand this. Theoretically, according to Theorem 1, if pgi = pd for all i, then the minimum objective value can also be achieved. This implies that, in the worst case, MAD-GAN would perform the same as the standard GAN. However, as discussed below, this is possible only in the following highly unlikely situations:

  • All the generators always generate exactly similar samples so that the discriminator is not able to differentiate them. In this case, the discriminator will learn a uniform distribution over the generator indices, and thus the gradients passed through the discriminator will be exactly the same for all the generators. However, this situation in general is not possible as all the generators are initialized differently. Even a slight variation in the samples from the generators will be enough for the discriminator to identify them and pass different gradient information to each generator. In addition, the objective function of the generators is only to generate real samples, thus there is nothing that encourages them to generate exactly the same samples.

  • The discriminator does not have enough capacity to learn the optimal parameters. This is in contrast to the assumption made in Theorem 1, which is that the discriminator is optimal. Thus, it should have enough capacity to learn a feature representation such that it can correctly identify samples from different generators. In practice, this is a very easy task and we did not have to modify anything up to the feature representation stage of the architecture of the discriminator. We used the standard architectures (explained in Appendix C) for all the tasks.

Hence, with random initializations and a generator/discriminator of sufficient capacity, we can easily avoid the trivial solution in which all the generators focus on exactly the same region of the true data distribution. This has been very clearly supported by various experiments showing diverse generations by MAD-GAN.

B. Similarity based competing objective

We have discussed the MAD-GAN architecture using the generator identification based objective. In this section, we propose a different extension to the standard GAN: a similarity based competing objective, which we call MAD-GAN-Sim. Here, we augment the GAN objective function with a diversity enforcing term. It ensures that the generations from different generators are diverse, where the diversity depends on a user-defined task-specific function.

The architecture is the same as that of MAD-GAN discussed in Section 4.1 (refer to Fig. 15).

Figure 15: MAD-GAN-Sim compared with MAD-GAN. All the generators share the parameters of all the layers except the last one. The two proposed diversity enforcing objectives, 'competing' (MAD-GAN-Sim) and 'generator identification' (MAD-GAN), are shown at the end of the discriminator.

B.1. Approach

The approach presented here is motivated by the fact that samples from different modes must look different. For example, in the case of images, these samples should differ in terms of texture, color, shading, and various other cues. Thus, different generators must generate dissimilar samples, where the dissimilarity comes from a task-specific function. Before delving into the details, let us first define some notations in order to avoid clutter. We denote θig as the parameters of the i-th generator. The set of generators is denoted as K = {1, · · · , k}. Given random noise z to the i-th generator, the corresponding generated sample Gi(z; θig) is denoted as gi(z). Using these notations and following the above discussed intuitions, we impose the following constraints over the i-th generator while updating its parameters:

D(Gi(z; θig); θd) ≥ D(Gj(z; θjg); θd) + ∆(φ(gi(z)), φ(gj(z))),   ∀j ∈ K \ i    (5)

where φ(gi(z)) denotes the mapping of the image gi(z) generated by the i-th generator into a feature space, and ∆(., .) ∈ [0, 1] is the similarity function. The higher the value of ∆(., .), the more similar the arguments are. Intuitively, the above set of constraints ensures that the discriminator score for each generator should be higher than that of all other generators, with a margin proportional to the similarity score. If the samples are similar, the margin increases and the constraints become more active. We use an unsupervised learning based representation as our mapping function φ(.). Precisely, given a generated sample gi(z), φ(gi(z)) is the feature vector obtained using the discriminator of our framework. This is motivated by the feature matching based approach to improve the stability of the training of GANs [28]. The ∆(., .) function used in this work is the standard cosine similarity based function. The above mentioned constraints can be satisfied by maximizing an equivalent unconstrained objective function as defined below:

U(θig, θd) := f( D(Gi(z; θig); θd) − (1/(k − 1)) Σ_{j∈K\i} ( D(Gj(z; θjg); θd) + ∆(ψi, ψj) ) )

where f(a) = min(0, a), ψi = φ(gi(z)), and ψj = φ(gj(z)). Intuitively, if the argument of f(.) is positive, then the desirable constraint is satisfied and there is no need to do anything. Otherwise, maximize the argument with respect to θig. Note that instead of using all the constraints independently, we use the average of all of them. Another approach would be to use the constraint corresponding to the j-th generator that maximally violates the set of constraints shown in Eq. 5. Experimentally we found that the training process of the average constraint based objective is more stable than that of the maximum violated constraint based objective. The intuition behind using these constraints comes from the well known 1-slack formulation of the structured SVM framework [15, 29]. Thus, the overall objective for the i-th generator is:

min_{θig} V(θd, θig) − λ U(θig, θd)

where λ ≥ 0 is a hyperparameter. Algorithm 1 shows how to compute the gradients corresponding to the different generators for the above mentioned objective function. Notice that, once sampled, the same z is passed through all the generators in order to enforce the constraints over a particular generator (as shown in Eq. 5). However, in order for the constraints not to contradict each other while updating another generator, a different z is sampled again from pz. Algorithm 1 is shown for a batch size of one, which can be trivially generalized to any given batch size. In the case of the discriminator, the gradients have exactly the same form as in the standard GAN objective. The only difference is that in this case the fake samples are generated by k generators, instead of one.

Algorithm 1 Updating generators for MAD-GAN-Sim

input: θd; pz; θig, ∀i ∈ {1, · · · , k}; λ.
1: for each generator i ∈ {1, · · · , k} do
2:   Sample noise from the given noise prior z ∼ pz.
3:   Obtain the generated sample Gi(z; θig) and the corresponding feature vector ψi = φ(Gi(z; θig)).
4:   ν ← 0.
5:   for each generator j ∈ {1, · · · , k} \ i do
6:     Compute the feature vector ψj = φ(Gj(z; θjg)).
7:     ν ← ν + D(Gj(z; θjg); θd) + ∆(ψi, ψj).
8:   end for
9:   ν ← D(Gi(z; θig); θd) − ν / (k − 1).
10:  if ν ≥ 0 then
11:    ∇θig log(1 − D(Gi(z; θig); θd)).
12:  else
13:    ∇θig (log(1 − D(Gi(z; θig); θd)) − λ U(θig, θd)).
14:  end if
15: end for
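A minimal PyTorch sketch of the generator update in Algorithm 1 is given below (our illustration, not the authors' code). It assumes D returns a scalar score in [0, 1], phi returns the discriminator feature vector used for φ(.), and the batch size is one for clarity; cosine similarity plays the role of ∆(., .), and the competing term is added only when the averaged constraint of Eq. (5) is violated.

```python
import torch
import torch.nn.functional as F

def madgan_sim_generator_step(i, generators, D, phi, opt_g, z_dim=64, lam=1.0):
    """Update the i-th generator following Algorithm 1 (batch size 1 for clarity)."""
    k = len(generators)
    z = torch.randn(1, z_dim)                      # the same z is passed to all generators
    x_i = generators[i](z)
    psi_i = phi(x_i)

    # Average score + similarity over all other generators (1-slack style average).
    nu = 0.0
    for j in range(k):
        if j == i:
            continue
        x_j = generators[j](z).detach()
        nu = nu + D(x_j).mean() + F.cosine_similarity(psi_i, phi(x_j), dim=-1).mean()
    nu = D(x_i).mean() - nu / (k - 1)

    # Standard GAN term, plus the competing term only when the constraint is violated.
    loss = torch.log(1.0 - D(x_i).mean() + 1e-8)
    if nu < 0:
        loss = loss - lam * nu                      # -lambda * U, since U = min(0, nu) = nu here
    opt_g.zero_grad(); loss.backward(); opt_g.step()
```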

B.2. Experiments

We present the efficacy of MAD-GAN-Sim on real-world datasets.

B.3. Diverse Samples for Image-to-Image Translation

We show diverse and highly appealing results using the diversity promoting objective. We use cosine based similarity to enforce diverse generations, an important criterion for real images. As before, we show results for the following two situations where a diverse solution is useful: (1) given the edges of handbags, generate real looking handbags, as in Fig. 17; and (2) given night images of places, generate their equivalent day images, as in Fig. 19. We clearly notice that each generator is able to produce meaningful and diverse images.

B.4. Unsupervised Representation Learning

We do the same experiment using SVHN as in Section 5.6. For MAD-GAN-Sim we obtained a misclassification error of 18.3%, which is better than DCGAN (22.48%). This clearly indicates that MAD-GAN-Sim is able to learn a better feature representation in an unsupervised setting.

C. Network Architectures and Parameters

Here we provide all the details about the architectures and the parameters used in the various experiments. For the experiment concerning non-parametric density estimation, the MAD-GAN parameters are randomly initialized using Xavier initialization with normally distributed random sampling [12]. For all the other experiments, the initialization is the same as in the base architecture used to adapt MAD-GAN.

Figure 17: MAD-GAN-Sim: Diverse generations for the 'edges-to-handbags' image generation task. The first column in each sub-figure represents the input. The remaining three columns show the diverse outputs of different generators. It is evident that different generators are able to produce very diverse results capturing color (brown, pink, black), texture, design pattern, shininess, among others.

Figure 19: MAD-GAN-Sim: Diverse generations for the 'night to day' image generation task. The first column in each sub-figure represents the input. The remaining three columns show the diverse outputs of different generators. It is evident that different generators are able to produce very diverse results capturing different lighting conditions, different sky patterns (cloudy vs clear sky), different weather conditions (winter, summer, rain), different landscapes, among many other minute yet useful cues.

C.1. Non-Parametric Density Estimation

Architecture Details: The generator has two fully connected hidden layers with 128 neurons each (each followed by an exponential linear unit) and a fully connected output layer. In the case of MAD-GAN and MA-GAN, we used 4 generators with the parameters of the first two layers shared.

The generator produces 1D samples. The input to each generator is 64-dimensional uniform noise U(−1, 1). In the case of InfoGAN, a 5-dimensional categorical code is further concatenated with the uniform noise to form the input. The categorical code is randomly sampled from a multinomial distribution. The discriminator architectures for the respective networks are shown in Tab. 5. The Mode-Regularized GAN architecture has an encoder, BEGAN has an encoder and a decoder, and InfoGAN has a Q network; their details are also given in Tab. 5.

MAD-GAN uses a multi-label cross-entropy loss, whereas MA-GAN uses a binary cross-entropy loss. For training, we use the Adam optimizer with a batch size of 128 and a learning rate of 1e−4. In each mini-batch, MAD-GAN uses 128 samples from each of the generators as well as from the real distribution, while for MA-GAN 128 samples are drawn from the real distribution and from all the generators combined.
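As an illustration of the (k + 1)-way cross-entropy loss used by the MAD-GAN discriminator, the sketch below assigns class 0 to real samples and class i to samples from generator i; this class-index convention and the helper names are our assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def madgan_discriminator_loss(discriminator, real_batch, generators, z_batches):
    """(k+1)-way cross-entropy: class 0 = real, class i = fake from generator i."""
    k = len(generators)
    device = real_batch.device
    logits_real = discriminator(real_batch)                       # (B, k+1) unnormalized scores
    loss = F.cross_entropy(logits_real,
                           torch.zeros(real_batch.size(0), dtype=torch.long, device=device))
    for i, (g, z) in enumerate(zip(generators, z_batches), start=1):
        fake = g(z).detach()                                      # discriminator step: no grad to G
        target = torch.full((fake.size(0),), i, dtype=torch.long, device=device)
        loss = loss + F.cross_entropy(discriminator(fake), target)
    return loss / (k + 1)
```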

Dataset Generation: We generated synthetic 1D data using a GMM with 5 Gaussians whose means are set at 10, 20, 60, 80 and 110, and whose standard deviations are 3, 3, 2, 2 and 1, respectively. The first two modes overlap significantly, while the fifth one is peaky and stands isolated.
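A minimal NumPy sketch of this synthetic data generator is given below; the mixture weights are assumed to be uniform, as the paper does not state them.

```python
import numpy as np

def sample_gmm_1d(n, rng=np.random.default_rng(0)):
    """1D GMM with 5 components: means (10, 20, 60, 80, 110), stds (3, 3, 2, 2, 1)."""
    means = np.array([10.0, 20.0, 60.0, 80.0, 110.0])
    stds = np.array([3.0, 3.0, 2.0, 2.0, 1.0])
    comp = rng.integers(0, len(means), size=n)       # pick a component per sample (equal weights)
    return rng.normal(means[comp], stds[comp])       # 1D samples, shape (n,)

data = sample_gmm_1d(128)   # one mini-batch of "real" samples
```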

C.2. Stacked and Compositional MNIST Experiments

Architecture details: The architecture for stacked-MNIST is similar to the one used in [22]. Please refer to Tab. 6 for the generator architecture and Tab. 7 for the discriminator architecture and the Q network architecture of InfoGAN. The architecture for the compositional-MNIST experiment is the same as DCGAN [26]. Please refer to Tab. 8 for the discriminator architecture and the Q network architecture of InfoGAN. In both experiments, the Q network of InfoGAN shares all layers except the last one with the discriminator.

Dataset preparation: The MNIST database of handwritten digits is used for both tasks.

C.3. Image-to-Image Translation

C.3.1 MAD-GAN/MAD-GAN-Sim

Architecture details: The network architecture is adapted from [14], and the experiments were conducted with the U-Net architecture and a patch-based discriminator.

In more detail, let Ck denote a Convolution-BatchNorm-ReLU layer with k filters, and CDk a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%.


Columns, left to right: DCGAN / UnrolledGAN / InfoGAN / MA-GAN Disc | Mode-Reg DCGAN Disc | Mode-Reg DCGAN Enc | WGAN / GoGAN Disc | BEGAN Enc | BEGAN Dec | MAD-GAN Disc | InfoGAN QNet

Input dimension:        1 | 1 | 1 | 1 | 1 | 32 | 1 | 1
fc: 128, leaky relu     (all networks)
fc: 128, leaky relu     (all networks)
fc (output dimension):  1 | 1 | 64 | 1 | 32 | 1 | nGen+1 | 5
Output non-linearity:   sigmoid / identity / softmax (softmax for the MAD-GAN Disc and the InfoGAN QNet)

Table 5: Non-parametric density estimation architectures for the discriminators (Disc), encoders (Enc), decoders (Dec), and Q network (QNet). nGen is the number of generators; fc is a fully connected layer.

Layer | number of outputs | stride
Input: z ∼ N(0, I_256) | |
Fully connected | 4 * 4 * 64 |
Reshape to image | 4, 4, 64 |
Transposed Convolution | 32 | 2
Transposed Convolution | 16 | 2
Transposed Convolution | 8 | 2
Convolution | 3 | 1

Table 6: Generator architecture for the 1000-class stacked-MNIST experiment. For MAD-GAN, all layers except those in the last two rows are shared.

Layer | number of outputs | stride
Input: 32x32 Color Image | |
Convolution | 4 | 2
Convolution | 8 | 2
Convolution | 16 | 2
Flatten | |
Fully Connected | 1 |

Table 7: Discriminator architecture for the 1000-class stacked-MNIST experiment. For MAD-GAN with k generators, it is adapted to have a (k + 1)-dimensional last-layer output. For InfoGAN, with 156-dimensional salient variables and 100-dimensional incompressible noise, the Q network is adapted to have a 156-dimensional output.

All convolutions are 4 × 4 spatial filters with a stride of 2. Convolutions in the encoder and in the discriminator downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2.
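A possible PyTorch rendering of the Ck and CDk building blocks is sketched below. The padding of 1 (so that a 4 × 4, stride-2 convolution exactly halves or doubles the spatial resolution) and the down/up switch are our assumptions; the paper only gives the layer naming.

```python
import torch.nn as nn

def C(in_ch, k, down=True, leaky=True):
    """Ck block: Convolution-BatchNorm-ReLU with k filters (4x4 kernel, stride 2)."""
    conv_cls = nn.Conv2d if down else nn.ConvTranspose2d   # downsample (encoder/disc) or upsample (decoder)
    return nn.Sequential(
        conv_cls(in_ch, k, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(k),
        nn.LeakyReLU(0.2) if leaky else nn.ReLU(),
    )

def CD(in_ch, k):
    """CDk block: Convolution-BatchNorm-Dropout-ReLU with 50% dropout (decoder side)."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, k, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(k),
        nn.Dropout(0.5),
        nn.ReLU(),
    )
```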

Generator Architectures: We used the U-Net based generator architecture from [14], as follows:

• U-Net Encoder: C64-C128-C256-C512-C512-C512-C512-C512

• U-Net Decoder: CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128. Note that, in the case of MAD-GAN, the last layer does not share parameters with the other generators.

Layer | number of outputs | stride
Input: Color Image (64x64) | |
Convolution | 64 | 2
Convolution | 128 | 2
Convolution | 256 | 2
Convolution | 512 | 2
Flatten | |
Fully Connected | 1 |

Table 8: Discriminator architecture for the 1000-class compositional-MNIST experiment. For MAD-GAN with k generators, it is adapted to have a (k + 1)-dimensional last-layer output. For InfoGAN, with 156-dimensional salient variables and 100-dimensional incompressible noise, the Q network is adapted to have a 156-dimensional output.

After the last layer in the decoder, a convolution is applied to map to 3 output channels, followed by a tanh function. BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with a slope of 0.2, while the ReLUs in the decoder are not leaky. The U-Net architecture has skip connections between each layer i in the encoder and layer n − i in the decoder, where n is the total number of layers. The skip connections concatenate activations from layer i to layer n − i, which changes the number of channels in the decoder.

Discriminator Architectures: The patch-based 70 × 70 discriminator architecture was used in this case: C64-C128-C256-C512.


Diversity term

• MAD-GAN: After the last layer, a convolution is applied to map the output to k + 1 dimensions (where k is the number of generators in MAD-GAN), followed by a softmax layer for normalization.

• MAD-GAN-Sim: After the last layer, a convolution is applied to map to a 1-dimensional output, followed by a Sigmoid function. For the unsupervised feature representation φ(.), the feature activations of the penultimate layer C256 of the discriminator were used to compute the cosine similarity (a sketch of this similarity term is given below).
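The cosine-similarity term could be computed as in the following sketch, where `psi_i` and `psi_j` are the C256 activations; flattening the feature maps and averaging over the batch are our assumptions, as the paper only states which activations are used.

```python
import torch
import torch.nn.functional as F

def delta_cosine(psi_i: torch.Tensor, psi_j: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two feature tensors (e.g. C256 maps), averaged over the batch."""
    return F.cosine_similarity(psi_i.flatten(1), psi_j.flatten(1), dim=1).mean()
```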

For training, we used the Adam optimizer with a learning rate of 2e−4 (for both the generators and the discriminator), λ_L1 = 10 (the hyperparameter of the L1 regularizer), λ = 1e−3 (for MAD-GAN-Sim), and a batch size of 1.

C.3.2 InfoGAN

The network architecture is adapted from [14], and the experiments were conducted with the U-Net architecture and a patch-based discriminator.

Generator Architectures: The U-Net generator is exactly the same as in [14], except that the number of input channels is increased from 3 to 4. For the experiment of Fig. 9, which also takes noise as input, the number of input channels is increased to 5 (one extra channel for noise).

Discriminator Architectures: The discriminator is exactly the same as in [14]: C64-C128-C256-C512.

Q network Architectures: The Q network architecture is C64-C128-C256-C512-Convolution3-Convolution3. Here the first Convolution3 gives an output of 30 × 30 patches with 3 channels, while the second Convolution3 gives a 3-dimensional output. All layers except the last two are shared with the discriminator in the experiments for Fig. 9.

Diversity term: To capture three kinds of distinct modes, the categorical code can take three values. Hence, in this case, the categorical code is a 2D matrix in which one third of the entries are set to 1 and the remaining ones to 0 for each category. The generator is fed the input image with the categorical code appended channel-wise. For the experiment of Fig. 9, which also takes noise as input, the generator input is further appended channel-wise with a 2D matrix of normal noise.
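A small sketch of this channel-wise appending is given below. The particular choice of which third of the entries is set to 1 (a horizontal band per category) and the helper name are assumptions, since the paper does not specify the spatial layout of the code.

```python
import torch

def append_categorical_code(image: torch.Tensor, category: int, add_noise: bool = False):
    """image: (B, 3, H, W) -> (B, 4, H, W), or (B, 5, H, W) if a noise channel is added."""
    b, _, h, w = image.shape
    code = torch.zeros(b, 1, h, w, device=image.device)
    band = h // 3
    code[:, :, category * band:(category + 1) * band, :] = 1.0   # one third of entries set to 1
    channels = [image, code]
    if add_noise:
        channels.append(torch.randn(b, 1, h, w, device=image.device))  # extra normal-noise channel
    return torch.cat(channels, dim=1)
```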

For training, we used the Adam optimizer with a learning rate of 2e−4 (for both the generator and the discriminator), λ_L1 = 10 (the hyperparameter of the L1 regularizer), and a batch size of 1.

Dataset Preparation:

• Edges-to-Handbags: We used 137,000 Amazon handbag images from [33]. The random split into train and test sets was kept the same as in [33].

• Night-to-Day: We used 17,823 training images extracted from 91 webcams. We thank Jun-Yan Zhu for providing the dataset.

C.4. Diverse-Class Data Generation

Architecture details: The network architecture is adapted from DCGAN [26]. Concretely, the discriminator architecture is described in Tab. 11 and the generator architecture in Tab. 10. We use three generators without sharing any parameters. The residual layers helped improve image quality, since the data manifold is much more complicated and the discriminator needs more capacity to accommodate it.

Diversity terms: For training, we used the Adam optimizer with a learning rate of 2e−4 (for both the generators and the discriminator) and a batch size of 64.

Dataset preparation: The training data is obtained by combining datasets of highly diverse images, such as islets, icebergs, broadleaf forest, bamboo forest, and bedrooms, taken from the Places dataset [32]. To create the training data, images were randomly selected from each of them, yielding a dataset of 24,000 images.

C.5. Diverse Face Generations with DCGAN

Architecture details: The network architecture is adapted from DCGAN [26]. Concretely, the discriminator architecture is described in Tab. 11 and the generator architecture in Tab. 10. In this case, all generator parameters except those of the last layer are shared. The residual layers helped improve image quality, since the data manifold and the manifolds of each of the generators are much more complicated and the discriminator needs more capacity to accommodate them.


Discriminator D
Input: 64x64 Color Image
4x4 conv. 64 leakyRELU. stride 2. batchnorm
4x4 conv. 128 leakyRELU. stride 2. batchnorm
4x4 conv. 256 leakyRELU. stride 2. batchnorm
4x4 conv. 512 leakyRELU. stride 2. batchnorm
4x4 conv. output leakyRELU. stride 1

Table 9: DCGAN Discriminator: It is adapted to have a (k + 1)-dimensional last-layer output for MAD-GAN with k generators (the normalizer is softmax).

Generator G
Input ∈ R^100
4x4 upconv. 512 RELU. batchnorm. shared
4x4 upconv. 256 RELU. stride 2. batchnorm. shared
4x4 upconv. 128 RELU. stride 2. batchnorm. shared
4x4 upconv. 64 RELU. stride 2. batchnorm. shared
4x4 upconv. 3 tanh. stride 2

Table 10: DCGAN Generator: All layers except the last one are shared among all three generators.
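The sharing scheme of Tab. 10 can be sketched in PyTorch as one shared trunk with an unshared output head per generator, as below. The stride and padding of the first up-convolution (mapping a 100-d input to a 4 × 4 map) and the paddings of the remaining layers are assumed so that the output is 64 × 64.

```python
import torch
import torch.nn as nn

class SharedDCGANGenerators(nn.Module):
    """Tab. 10 generator: shared trunk, one unshared last layer per generator."""
    def __init__(self, k=3, z_dim=100):
        super().__init__()
        def up(cin, cout, stride, pad):
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride, pad),
                                 nn.BatchNorm2d(cout), nn.ReLU())
        # Shared trunk (all layers of Tab. 10 except the last one).
        self.trunk = nn.Sequential(up(z_dim, 512, 1, 0), up(512, 256, 2, 1),
                                   up(256, 128, 2, 1), up(128, 64, 2, 1))
        # One unshared 4x4 stride-2 output layer per generator, followed by tanh.
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh()) for _ in range(k)])

    def forward(self, z, i):
        """Generate a 64x64 image with generator i from noise z of shape (B, z_dim, 1, 1)."""
        return self.heads[i](self.trunk(z))

gens = SharedDCGANGenerators(k=3)                 # e.g. the 3 generators used for face generation
img = gens(torch.randn(2, 100, 1, 1), i=0)        # -> (2, 3, 64, 64)
```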

Residual Discriminator D
Input: 64x64 Color Image
7x7 conv. 64 leakyRELU. stride 2. pad 1. batchnorm
3x3 conv. 64 leakyRELU. stride 2. pad 1. batchnorm
3x3 conv. 128 leakyRELU. stride 2. pad 1. batchnorm
3x3 conv. 256 leakyRELU. stride 2. pad 1. batchnorm
3x3 conv. 512 leakyRELU. stride 2. pad 1. batchnorm
3x3 conv. 512 leakyRELU. stride 2. pad 1. batchnorm
3x3 conv. 512 leakyRELU. stride 2. pad 1. batchnorm
RESIDUAL-(N512, K3, S1, P1)
RESIDUAL-(N512, K3, S1, P1)
RESIDUAL-(N512, K3, S1, P1)

Table 11: Discriminator architecture for diverse-class data generation and diverse face generation: the last-layer output is (k + 1)-dimensional for MAD-GAN with k generators (the normalizer is softmax). The 'RESIDUAL' layer is elaborated in Tab. 12.

RESIDUAL - Residual Layer
Input: previous-layer-output
c1: CONV-(N512, K3, S1, P1), BN, ReLU
c2: CONV-(N512, K3, S2, P1), BN
SUM(c2, previous-layer-output)

Table 12: Residual layer description for Tab. 11.
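A PyTorch sketch of this residual layer follows. Note that Tab. 12 lists stride 2 for c2, which would make the element-wise SUM shape-inconsistent, so the sketch uses stride 1 for both convolutions; this is an assumption on our part.

```python
import torch.nn as nn

class ResidualLayer(nn.Module):
    """RESIDUAL-(N512, K3, S1, P1) of Tab. 12, with stride 1 in both convolutions."""
    def __init__(self, channels=512):
        super().__init__()
        self.c1 = nn.Sequential(nn.Conv2d(channels, channels, 3, 1, 1),
                                nn.BatchNorm2d(channels), nn.ReLU())
        self.c2 = nn.Sequential(nn.Conv2d(channels, channels, 3, 1, 1),
                                nn.BatchNorm2d(channels))

    def forward(self, x):
        return self.c2(self.c1(x)) + x   # SUM(c2, previous-layer-output)
```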

Diversity terms: For training, we used the Adam optimizer with a learning rate of 2e−4 (for both the generators and the discriminator) and a batch size of 64.

Technique | 2 Gen | 3 Gen | 4 Gen
MAD-GAN | 20.5% | 18.2% | 17.5%
MAD-GAN-Sim | 20.2% | 19.6% | 18.3%

Table 13: The misclassification error of MAD-GAN and MAD-GAN-Sim on SVHN while varying the number of generators.

Dataset preparation: We used the CelebA dataset, as mentioned, for the face generation experiments. For image generation, all 14,197,122 images from the ImageNet-1k dataset [9] were used to train the DCGAN with 3 generators under the MAD-GAN objective. The images from both CelebA and ImageNet-1k were resized to 64 × 64.

C.6. Unsupervised Representation Learning

Architecture details: Our architecture is the one proposed in DCGAN [26]. As in the DCGAN experiment on the SVHN dataset (32 × 32 × 3) [24], we removed the penultimate layer of the generator (the second-to-last row in Tab. 10) and the first layer of the discriminator (the first convolution layer in Tab. 9).

Classification task: We trained our model on the available SVHN dataset [24]. For feature extraction with the discriminator, we followed the same method as in the DCGAN paper [26]. The features were then used to train a regularized linear L2-SVM. The ablation study is presented in Tab. 13.
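The classification protocol can be sketched with scikit-learn as follows; the feature extractor `extract_features` is a placeholder for the DCGAN-style discriminator features, and the exact pooling and concatenation of activations from the DCGAN paper is not reproduced here.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_svm(extract_features, train_images, train_labels, C=1.0):
    """Fit an L2-regularized linear SVM on features from the frozen discriminator."""
    feats = np.stack([extract_features(x) for x in train_images])   # (N, d) feature matrix
    clf = LinearSVC(C=C, penalty="l2")
    clf.fit(feats, train_labels)
    return clf

def misclassification_error(clf, extract_features, images, labels):
    feats = np.stack([extract_features(x) for x in images])
    return 1.0 - clf.score(feats, labels)                           # e.g. 0.183 for 18.3% error
```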

Dataset preparation: We used the SVHN dataset [24], which consists of 73,257 digits for training, 26,032 digits for testing, and 531,131 extra training samples. As in DCGAN [26], we used 1000 uniformly class-distributed random samples for training, 10,000 samples from the non-extra set for validation, and 1000 samples for testing.

For training, we used the Adam optimizer with a learning rate of 2e−4 (for both the generators and the discriminator), λ = 1e−4 (for the competing objective), and a batch size of 64.
