ARXIV SUBMISSION VERSION 1 Differentially Private ...ARXIV SUBMISSION VERSION 1 Differentially Private Synthetic Medical Data Generation using Convolutional GANs Amirsina Torﬁ, Member,

ARXIV SUBMISSION VERSION 1

Differentially Private Synthetic Medical DataGeneration using Convolutional GANs

Amirsina Torfi, Member, IEEE, Edward A. Fox, Fellow, IEEE, and Chandan K. Reddy, Senior Member, IEEE

Abstract—Deep learning models have demonstrated superiorperformance in several application problems, such as imageclassification and speech processing. However, creating a deeplearning model using health record data requires addressing cer-tain privacy challenges that bring unique concerns to researchersworking in this domain. One effective way to handle suchprivate data issues is to generate realistic synthetic data that canprovide practically acceptable data quality and correspondinglythe model performance. To tackle this challenge, we develop adifferentially private framework for synthetic data generationusing Rényi differential privacy. Our approach builds on convo-lutional autoencoders and convolutional generative adversarialnetworks to preserve some of the critical characteristics forthe generated synthetic data. In addition, our model can alsocapture the temporal information and feature correlations thatmight be present in the original data. We demonstrate that ourmodel outperforms existing state-of-the-art models under thesame privacy budget using several publicly available benchmarkmedical datasets in both supervised and unsupervised settings.

Index Terms—Deep learning, differential privacy, syntheticdata generation, generative adversarial networks.

I. INTRODUCTIONDeep learning has been successful in a wide range of

application domains such as computer vision, informationretrieval, and natural language processing due to its superiorperformance and promising capabilities. However, its successis heavily dependent on the availability of a massive amountof training data. Hence, the progress in deploying such deeplearning models can be crippled in certain critical domains suchas healthcare, where data privacy issues are more stringent,and large amounts of sensitive data are involved.

Hence, to effectively utilize specific promising data-hungrymethods, there is a need to tackle the privacy issues involved inthe medical domains. To handle the privacy concerns of dealingwith sensitive information, a common method that is oftenused in practice is the anonymization of personally identifiabledata. But, such approaches are susceptible to de-anonymizationattacks [1]. That made researchers explore alternative methods.To make the system immune to such attacks, privacy-preservingmachine learning approaches have been developed [2], [3]. Thisparticular privacy challenge is usually compounded by someauxiliary ones due to the presence of complex and often noisydata that, in practice, might typically consist of a combinationof multiple types of data such as discrete, continuous, andcategorical.

A. Torfi and E.A. Fox are with the Department of Computer Science, VirginiaTech, Blacksburg, VA 24060. C.K. Reddy is with the Department of ComputerScience, Virginia Tech, Arlington, VA 22203.

E-mail: [email protected], [email protected], [email protected] received September x, 2020; revised x.

A. Synthetic Data Generation

One of the most promising privacy-preserving approaches isSynthetic Data Generation (SDG). Synthetically generated datacan be shared publicly without privacy concerns and providesmany collaborative research opportunities, including for taskssuch as building prediction models and finding patterns. AsSDG inherently involves a generative process, GenerativeAdversarial Networks (GANs) [4] attracted much attention inthis research area, due to their recent success in other domains.

GANs are not reversible, i.e., one may not use a deterministicfunction to go from the generated samples to the real samples,as the generated samples are created using an implicit distri-bution of the real data. However, a naive usage of GANs forSDG does not guarantee the system being privacy-preservingby just relying on the fact that GANs are not reversible, asGANs are already proven to be vulnerable [5].

This problem becomes much more severe when such privacyviolations can have serious consequences, such as when dealingwith patient sensitive data in the medical domain. The primaryambiguity here is to understand how a system can claim topreserve the original data’s privacy. More precisely, how privateis a system? How much information is leaked during thetraining process? Thus, there is a need to measure the privacyof a system – to be able to judge if a system is privacy-preserving or not.

B. Differential Privacy

Differential Privacy (DP) [6] provides a mechanism to ensureand quantify the privacy of a system using a solid mathematicalformulation. Differential privacy recently became the de factostandard for statistical exploration of databases that containsensitive private data. The power of differential privacy is inits accurate mathematical representation, that ensures privacy,without restricting the model for accurate statistical reasoning.Furthermore, utilizing differential privacy, one can measure theprivacy level of a system.

Differential privacy, as a strong notion of privacy, becomescrucial in machine learning and deep learning as the system byitself often employs sensitive information to augment the modelperformance for prediction. Although many different attacktypes can jeopardize a machine learning model’s privacy, it isthe safest to assume the presence of a powerful adversary withcomplete knowledge of the pipeline, training, and model [7].Hence protecting the system against such an adversary, orat the very least, measuring the upper bound of the privacyleakage in this scenario, gives us a full view of the system’sprivacy. A fully differentially private system guarantees that

arX

iv:2

012.

1177

4v1

[cs

.LG

] 2

2 D

ec 2

020


the algorithm’s training is not dependent on an individual’ssensitive data.

One of the most adopted approaches to ensure differentialprivacy, under a modest privacy budget, and model accuracy, isDifferentially Private Stochastic Gradient Descent (DP-SGD),proposed in [8], which is the basis for many different researchworks [9], [10]. At its core, DP-SGD (1) bounds the sensitivityof the algorithm on individuals by clipping the gradients, (2)adds Gaussian noise, and (3) performs the gradient descentoptimization step. As of now, SGD under a DP regime is oneof the most promising privacy-preserving approaches.

Due to the rapid rise in the utilization of differential privacy,practical issues related to tracking the privacy budget became asubject of many discussions. Despite its great advantages, theproposed notion of differential privacy has some disadvantagesdue to what is called the composition theorem [6], [11].The composition theorem indicates that, for some composedmechanisms, the privacy cost simply adds up, leading to thediscussion of privacy budget restriction. This is problematic,especially in deep learning, as the model training is an iterativeprocess, and each iteration adds to the privacy loss. To tacklethe shortcomings of the differential privacy definition, RényiDifferential Privacy (RDP) has been proposed as a naturalrelaxation of differential privacy [12]. As opposed to differentialprivacy, RDP is a more robust notion of privacy that may leadto a more accurate and numerically stable computation ofprivacy loss.

C. Challenges with Medical DataThere are four primary challenges in the synthetic data

generation research in the medical domain: (1) Preservingthe privacy. The majority of the existing works do not trainthe model in a privacy-preserving manner. At best, theyonly try to address privacy with some statistical or machinelearning-based measurements. (2) Handling discrete data.When discrete data is involved, many of the methods usingGANs face some difficulty since GAN models are inherentlydesigned for generating continuous values. In particular, theyespecially fail when there is a mixture of continuous anddiscrete data. (3) Evaluation of synthetic data quality. Oneof the challenging issues regarding the evaluation of GANsand the quality of synthetic data in a real-world domain is:How can we measure the quality of a synthetically generateddata? However, this becomes particularly more critical in themedical domain since the use of low-quality synthetic data forbuilding prediction models can have dire consequences andmay even jeopardize human lives. (4) Temporal informationand correlated features. The temporal and local correlationbetween the features is often ignored. Incorporating suchinformation is important in the medical domain since thedisease incidents, or individual patients’ records, often havemeaningful temporal/correlation patterns. The quality of thegenerated data can be significantly improved by incorporatingsuch dependencies in the data.

D. Our ContributionsTo address the problems described before, here we propose a

privacy-preserving framework that employs Rényi Differential

Privacy and Convolutional Generative Adversarial Networks(RDP-CGAN). Our work makes the following contributions:• We track and compute privacy loss with stable numerical

computations and have tighter bounds on the privacy lossfunction, which leads to better performance under thesample privacy budget. We achieve this by employing Rényidifferential privacy in our model.

• Our model can effectively handle discrete data and cap-ture the information from a mixture of discrete-continuousdata by creating a compact, unified representation throughConvolutional Autoencoders (CEs) for unsupervised featurelearning.

• In addition to measuring the quality of synthetic data withstatistical measurements, we also employ a labeled syntheticdata generation mechanism for assessing the quality of thesynthetic data in a supervised manner.

• To incorporate temporal and correlation dependencies inthe data, we utilize one-dimensional convolutional neuralnetworks.We demonstrate that our model generates higher quality

synthetic data under the same privacy budget. Likewise, itcan provide a higher level of privacy under the same level ofsynthetic data quality. We also empirically show that, regardlessof privacy considerations, our proposed architecture generateshigher quality synthetic data quality as it captures temporaland correlated information.

The rest of this part is organized as follows: Section IIinvestigate some prior research efforts related to syntheticmedical data generation and differentiates our work from otherexisting works. Section III describes preliminary conceptsrequired to comprehend our proposed work. Section IV providesthe structural and implementation details of the proposedsystem. Section III demonstrates the comparison results withstate-of-the-arts by privacy analysis and the quality of generatedsynthetic data. Finally, Section VI concludes our findings inthis paper.

II. RELATED WORKS

There are a wide variety of works utilizing DifferentialPrivacy for generating synthetic data. A majority of themuse the mechanism proposed in [8] to train a neural networkwith differential privacy based ipon gradient clipping forbounding the gradient norms and adding noise. It followsthe general mechanism proposed in [6]. One of the mostcritical contributions in [8] is introducing the privacy accountantthat tracks the privacy loss. Motivated by the success of thismechanism, in our approach, we extend our privacy-preservingframework using Rényi Differential Privacy (RDP) [12] as anew notion of DP to calculate the privacy loss.

The problem of synthetic data generation in the healthcaredomain has seen a recent surge in the research efforts [17]–[19].Perhaps one of the earlier works in generating synthetic medicaldata is MedGAN [13] in which there is no privacy-preservingmechanism being enforced. It only uses GAN models to createsynthetic data, and it is well known that GANs are prone to beattacked. Hence, despite its good performance in generatingdata, it does not provide any kind of privacy guarantee. One of


TABLE I: Comparison of methods based on different criteria.

Method MedGAN TableGAN CorGAN DPGAN PATE-GAN RDP-CGAN[13] [14] [15] [9] [16] (Proposed)

Privacy-Preserving × × × X X XDoes Not Require Labels (Annotations) X × X X × X

Can Handle Mixed Data Types X X X × × XCapture Correlated and Temporal Information × X X × × X

the primary contributions of MedGAN is to tackle the discretedata generation through denoising autoencoders [13]. One theother hand, utilization of synthetic patient population simulatorssuch as Synthea [20] is minimal as it only relies on standardguidelines and does not model factors that may contribute toa predictive analysis [21]. Another work is CorGAN [15] thattries to generate longitudinal event sequences by capturingfeature correlations and temporal information. The TableGANapproach [14] also uses convolutional GANs to synthesize databy leveraging auxiliary classifier GANs. Another work in thisarea is the CTGAN model presented in [22] that addressestabular data which has a mixture of discrete and continuousfeatures. However, none of these approaches guarantees anysort of privacy during the data generation. Hence, there isa good chance that many of these models can compromisethe privacy of the original medical data, making these modelsvulnerable in real practice. Accordingly, we will now discussvarious privacy-preserving strategies.

One of the earliest works that addressed the privacy-preserving deep learning in the healthcare domain is [23],that uses Auxiliary Classifier GANs [24] to generate syntheticdata. It assumes having access to labeled data. However, weaim to develop a model able to create synthetic data from bothlabeled and unlabeled real data. PATE-GAN [16] is one of thesuccessful SDG approaches that use the Private Aggregationof Teacher Ensembles (PATE) [25], [26] mechanism to ensureDifferential Privacy. Another framework related to this topic isDPGAN [9], which implements a mechanism similar to the onedeveloped in [8], with the main difference of clipping weightsinstead of gradients. We compare our work with PATE-GANand DPGAN due to their success and relevance to the medicaldomain. One of the research effort associated with utilizingRDP is [27] in which the authors training a privacy-preservingmodel on MNIST dataset build on conditional GANs. Thework proposed in [28] uses RDP to enforce differential privacy.However, they do not consider temporal information. Further,we claim that their system is not fully-differentially privateas they only partially apply differential privacy. A generalcomparison for related methods is depicted in Table I. Therows in Table I describe important characteristics. First, wedesire to investigate if a model preserves the privacy or not.Second, it is crucial to assess the model’s utility by checkingif a model requires labels. If a model does not need labels, itcan be used for unsupervised synthetic data generation. Third,we target the approaches that can handle a mixture of datatypes. At last, we are also interested in the models that cancapture correlated and temporal information.

To assess the privacy-preserving characteristics of our model,we compare our method to two of the prominent research efforts

that guarantee differential privacy in the context of syntheticdata generation, namely, DPGAN and PATE-GAN. PATE-GANuses a modified version of a PATE mechanism. One issuewith the original PATE is that its privacy cost depends on theamount data it requires for a labeling process incorporated inthe mechanism. Noted that PATE mechanism uses some publicdataset (similar to real private data) for its labeling mechanism.Not only may this increase the privacy loss significantly, butthe availability of such a labeled public dataset is a limitingassumption in a practical, real-world scenario. Although PATE-GAN changed the PATE training paradigm so that it doesnot require public data, its basis is to train a student modelby employing the generator output that may aggregate theerror, and it still needs labeled data. However, the advantageof PATE is that it provides a tighter privacy upper bound asopposed to the mechanism used in DPGAN. Regarding theprivacy, the advantage of our method compared to both thesemethods is that we use the RDP mechanism which providesa tighter privacy upper bound. Furthermore, our approachuses convolutional architectures which typically perform betterthan standard multilayer perception models by capturing thecorrelated features and temporal information. Furthermore,we use convolutional autoencoders that, in an unsupervisedmanner, create a feature space that can effectively incorporateboth discrete and continuous values.

III. PRELIMINARIES

A. Autoencoders

Autoencoder are a kind of neural network architectures thatconsists of two encode and decode functions. Enc(.) : Rn →Rd and Dec(.) : Rd → Rn aim to transpose the input x ∈ Rnto the latent space I ∈ Rd and then reconstruct x̂ ∈ Rn. In anideal scenario, a perfect reconstruction of the original input datacan be achieved, i.e., x = x̂. The autoencoders usually employBinary Cross Entropy (BCE) and Mean Square Error (MSE)losses for binary and continuous inputs, respectively.

BCE(x, x̂) = − 1m

m∑i

(xilog(x̂i) + (1− xi)log(1x̂i)) (1)

MSE(x, x̂) =1

m

m∑i=1

|xi − x̂i|2 (2)

We utilize autoencoders in our model for capturing low-dimensional representations for both discrete and continuousvariables.


B. Differential Privacy

Differential Privacy establishes a guarantee of individuals’privacy via measuring the privacy loss (associated with anyinformation release extracted from a database) by a mathemat-ical definition [6], [11]. The most widely adopted definition isthe (�, δ)-differential privacy.

Definition 1 ((�, δ)-DP): The randomized algorithm A :X → Q, is (�, δ)-differentially private for all query outcomesQ and all neighbor datesets D and D′ if:

Pr[A(D) ∈ Q] ≤ e�Pr[A(D′) ∈ Q] + δ

Two datasets D, D′ that only differ by one record (e.g., apatients’ record) are called neighbor datasets. The notionof neighboring datasets emphasizes the sensitivity of anyindividual private data. The parameters (�, δ) denote the privacybudget, i.e., being differentially private does not mean we haveabsolute privacy. It only indicates our confidence level ofprivacy, given the (�, δ) parameters. The smaller the (�, δ)parameters are, the more confident we become about ouralgorithm’s privacy as (�, δ) indicates the privacy loss bydefinition.

The (�, δ)-DP with δ = 0 is called �-DP which is the initiallyproposed definition [11]. It provides a more substantial promiseof privacy as even a tiny amount of δ might be a dangerousprivacy violation due to the distribution shift [11]. One ofthe most important reasons in the usage of (�, δ)-DP is theapplicability of advanced composition theorems [6], [29].

Theorem 1 (Strong Composition [6]): Assume we executea k-fold adaptive composition of a mechanism with (�, δ)-DP.Then, the composite mechanism is (�′, kδ′ + δ)-DP for δ′, inwhich �′ =

√2kln(1/δ′)�+ k�(e� − 1) [6]. N

C. Rényi Differential Privacy

The strong composition (see Definition 1) enables tighterupper bound for compositions of (�, δ)-DP steps compared tothe basic composition theorem [6]. One particular problem withthe advanced composition theorem is that iterating this processleads to the rapid growth of parameters as each employment ofthe advanced composition theorem (each step of the iterativeprocess) lead to a various selection of possible (�(δ), δ) values.To address some of the shortcomings of the (�, δ)-DP definition,the Rényi Differential Privacy (RDP) has been proposed in[12] based on Rényi divergence (Eq. 2).

Definition 2 (Rényi divergence of order α [30]):The Rényi divergence of order α of a distribution P from

the distribution P ′ is defined as:

Dα(P ||P ′) =1

α− 1log

(EP ′(x)

(P (x)

P ′(x)

)α)The Rényi divergence, as a generalization of the Kull-

back–Leibler divergence, is precisely equal to the Kull-back–Leibler divergence for the order α = 1. The specialcase of α =∞ equals to:

D∞(P ||P ′) = log(supP ′(x)

P (x)

P ′(x)

)

which is the log of the maximum ratio of the probabilitiesover P ′(x). The order of α = ∞ establishes the connectionbetween the Rényi divergence and the �-DP. A randomizedmechanism A is �-differentially private if for two neighbordatasets D and D′, we have:

D∞(A(D)||A(D′)) ≤ �

Now, based on these definitions, let us define Rényi dif-ferential privacy (RDP) [12] as a new notion of differentialprivacy.

Definition 3 (Rényi Differential Privacy (RDP) [12]):The randomized algorithm A : D → U is (α, �)-RDP if, for

all neighbor datasets D and D′, we have:

Dα(A(D)||A(D′)) ≤ �

There are two essential properties of the RDP definition(Definition 3) that are required here.

Proposition 1 (Composition of RDP [12]):If A : X → U1 is (α, �1)-RDP and B : U1 × X → U2 is

(α, �2)-RDP, then the mechanism (M1,M2), where M1 ∼A(X ) and M2 ∼ B(M1,X ), satisfies the (α, �1 + �2)-RDPconditions.N

Proposition 2: If the mechanism A : X → U is (α, �)-RDPthen it also obeys (�+ log(1/δ)(α−1) , δ)-DP for any 0 < δ < 1 [12].N

The above two propositions form the basis for preservingprivacy in our approach. Proposition 1 determines the calcu-lation of the privacy cost in our framework as a compositionof two structures (autoencoder and convolutional GAN) andProposition 2 is applicable when one wants to calculate theextent to which our system is differentially private based onthe traditional (�, δ) notion (see Definition 1).

IV. PRIVACY-PRESERVING FRAMEWORKIn our work, we build a privacy-preserving GAN model using

Rényi differential privacy. However, since it is well known thatthe GAN models typically have a poor performance in gener-ating non-continuous data [31], we utilize autoencoders [32]to aid the GAN model by creating a continuous feature spaceto represent the input. At the same time, the GAN aims togenerate high-fidelity synthetic data in a privacy-preservingmanner. The autoencoder will transform the input space into acontinuous space, whether we have a discrete, continuous, or amixture of both types as the input. Thus, the autoencoder actsas a bridge between the GAN model and the non-continuousdata present in the medical domain.

Furthermore, the majority of existing research efforts ingenerating synthetic data [13] ignore the features’ local correla-tions or temporal information by using multilayer perceptrons,which are inconsistent with real-world scenarios such as diseaseprogression. To remedy this drawback, for both autoencoderand GAN architecture (generator & discriminator), we useconvolutional neural networks. In our approach, we employOne Dimensional Convolutional Neural Networks (1D-CNNs)to capture correlated input features’ patterns as well as temporalinformation and incorporate them within an autoencoder basedframework.


Fig. 1: The proposed RDP-CGAN framework. In the Autoencoder, the encoders and decoders are formed with convolutionallayers. Discriminator and Generator have architectures similar to that of the encoder and decoder, respectively.

The proposed framework is depicted in Fig. 1. The inputsto the generator G : Rr → Ddg and the discriminator D :Rn → Dd are the random noise z ∈ Rr sampled from N (0, 1)and the real data x ∈ Rn. Ddg and Dd are the generator anddiscriminator domains which are usually {0, 1} and [−1, 1]d,respectively. In RDP-CGAN, as opposed to the regular GAN’straining procedure, the generated fake data is decoded beforebeing fed to the discriminator. The decoding is done by feedingthe fake data to a pre-trained autoencoder.

Algorithm 1 Private Convolutional Autoencoder Pre-Training

1: Inputs & Parameters: Real data X = {xi}Ni=1, learningrate η, network weights θ, number of epochs for autoen-coder training nae, the norm bound C, the additive noisestandard deviation σae.

2: for i = 1 . . . nae do3: Sample a mini-batch of n examples. X = {xi}ni=1.4: Partition X into X1, . . . ,Xr where r =

⌊nk

⌋.

5: for l = 1 . . . r do6: L = BCE(Xl, X̂l), X̂l = Dec(Enc(Xl))7: gθ,l ←5θL(θ,Xl)8: ˆgθ,l ← gθ,l/max

(1,‖gθ,l()‖2

C

)9: end for

10: ĝθ ← 1r∑rl=1

(ˆgθ,l +N (0, σ2aeC2I)

)11: Update: θ̂ ← θ − ηĝθ12: end for

A. Convolutional Autoencoder

We build a one-dimensional convolutional autoencoder (1D-CAE) based architecture. The main goal of such a 1D-CAE is

Algorithm 2 Private 1-D Convolutional GAN Training

1: Inputs & Parameters: Real data X = {xi}Ni=1, randomnoise z where zi ∼ N (0, 1)., learning rate η, discriminatorweights ω, generator weights ψ, number of epochs forGAN training ngan, the norm bound C, the additivenoise standard deviation σgan, the number of discriminatortraining steps per one step of generator training nd.

2: for j = 1 . . . ngan do3: for k = 1 . . . nd do4: Take a mini-batch from private data X = {xi}ni=15: Sample a mini-batch Z = {zi}ni=16: Partition real data mini-batches into X1, . . . ,Xr7: Partition noise data mini-batches into Z1, . . . ,Zr8: for l = 1 . . . r do9: xi ∈ Xl and zi ∈ Zl

10: L= 1k∑ki=1 (D(xi)−D(Dec(G(zi))))

11: gω,l ←5ωL(ω,Xl)12: ˆgω,l ← gω,l/max

(1,‖gω,l()‖2

C

)13: end for14: ĝω ← 1r

∑rl=1

(ĝω,l +N (0, σ2ganC2I)

)15: Update: ω̂ ← ω − ηĝω16: end for17: Sample {zi}ni=1 from noise prior18: L= − 1n

∑ni (D(Dec(G(zi))))

19: gψ ←5ψL(ψ,Z)20: Update: ψ̂ ← ψ − ηgψ21: end for


to: (1) capture the correlation between neighboring features,(2) represent a new compact feature space, (3) transform thepossible discrete records to a new continuous space, and (4)simultaneously model discrete and continuous phenomena,which is a challenging problem [22].

We enforce privacy by adding noise, and by clippinggradients of both the encoder and decoder (see Algorithm 1,Autoencoder Pretraining Step), as in some previous researchworks [10], [33]. One might claim adding noise to the decoderpart should suffice as the discriminator only has access to thedecoder part of the CAE. We assert that such an action mightjeopardize the privacy of the model since we cannot rely onthe autoencoder training to be trusted and secure as it hasaccess to the real data. The details of autoencoder pretraining,demonstrated in Algorithm 1 are as follows:• We pre-train the autoencoder for nae steps (which will be

determined based on the privacy budget �), and hence itvaries by changing the desired level of �.

• We divide a mini-batch into several micro-batches.• For each micro-batch of size 1, we calculate the loss (line

7), calculate the gradients (line 8), and clip the gradientsto bind the model’s sensitivity to the individuals (line 9).The operation ‖gi()‖2 indicates the `2 norm over the micro-batch gradients, and C is an arbitrary upper bound for thegradients’ norm.

• The Gaussian noise (N (0, σ2aeC2I))will then be indepen-dently added to the gradients of each micro-batch and willbe aggregated (line 10).

• The last step is where the optimizer performs the parameterupdate (line 11).

B. Convolutional GAN

The pseudocode for the Convolutional GAN is given inAlgorithm 2, under the procedure GAN Training Step. Ascan be observed in Algorithm 2, we only enforce differentialprivacy on the discriminator as only it has access to the realdata. To avoid the issue of mode collapse [34], we train theCGAN model, using Wasserstein GAN [35] since it is anefficient approximation of Earth Mover (EM) distance and isshown to be robust to the mode collapse problem during modeltraining [35].

The Earth-Mover (EM) distance or Wasserstein-1 distancerepresents the minimum price in transforming the generateddata distribution Pg to the real data distribution Pr:

W (Px,Pg) = infϑ∈Π(Px,Pg)

E(x,y)∼ϑ[ ‖x− y‖ ] , (3)

ϑ(x, y) refers to how much “mass” should be moved from xto y to transform Px to Pg .

However, as the infimum in (3) is intractable, based on thethe Kantorovich-Rubinstein duality [36], the WGAN proposethe Eq. 4 optimization:

W (Px,Pg) = sup‖f‖L≤1

Ex∼Px [f(x)]− Ex∼Pg [f(x)] (4)

For simplicity of the definition, infimum and supremum indicatethe greatest lower bound and the least upper bound, respectively.

Definition 4 (1-Lipschitz functions): Given two metric spaces(X, dX) and (Y, dY ), where d denotes the metric (e.g., distancemetric), the function f : X→ Y is called K-Lipschitz if:

∀(x, x′) ∈ X,∃K ∈ R : dY (f(x), f(x′)) ≤ KdX(x, x′) (5)

Using the distance metric and K = 1, Eq. (5) is equivalentto:

∀(x, x′) ∈ X : |f(x)− f(x′)| ≤ |x− x′| (6)

It is clear that for computing the Wasserstein distance, weshould find a 1-Lipschitz function (see Definition 4 and Eq. 6..The approach is to build a neural model to learn it. Theprocedure is to construct a discriminator D without the Sigmoidfunction, and output a scalar instead of the probability ofconfidence.

Regarding the privacy considerations, the generator has noaccess to real data directly (although it has access to thegradients from the discriminator); only the discriminator willhave access to it. We propose only to train the discriminatorunder differential privacy for this aim. More intuitively, thisapproach considers the post-processing theorem in differentialprivacy as follows.

Theorem 2 (Post-processing [6]): Assume D : N|X | →D is an algorithm which is (�, δ)-differentially private andG : D → O is some arbitrary function operates on . ThenG ◦D : N|X | → O is (�, δ)-differentially private as well [6].N

Following the argument above, we consider D and G asdiscriminator and generator functions, respectively. Since thegenerator is an arbitrarily randomized mapping on top ofthe discriminator, enforcing the differential privacy on thediscriminator suffices to guarantee that the overall systemensures differential privacy, and we do not need to train aprivate generator as well. The generator training procedure isgiven in Algorithm 2, lines 17-20. As can be observed, wedo not have a private generator and the loss function is theregular generator loss function of the WGAN method [35].

C. Architecture Details

For the GAN discriminator, we used five convolutionallayers similar to the autoencoder (encoder), except for the lastlayer that we have another dense layer, of output size 1, fordecision making. For all layers, we used PReLU activation [37],except for the last layer that does not use any activation. Forthe generator, we used transposed convolutions (also calledfractionally-strided convolutions) similar to the ones used in[38]. However, we use 1-D transposed convolutions. The GANgenerator has an input noise of size 100, and its output size isset to 128.

In the autoencoder and for the encoder, we used an archi-tecture similar to the GAN discriminator, except we discardthe last layer of the discriminator. For the decoder, as done inthe generator, we used transposed convolutional layers in thereverse order of the ones used for the encoder. For the encoder,we used PReLU activation layers except for the last layerof the encoder where Tanh was used to match the generatoroutput. For the decoder, we also used PReLU activation except


for the last layer where a Sigmoid activation function wasused to bound the range of output data to the [0, 1] to ideallyreconstruct the discrete range of {0, 1} aligned with the inputdata. The decoder input size (the encoder output size) is equalto the GAN’s generator output dimension.

Since our model inputs sizes may change for differentdatasets, we modify the input pipeline of our architectureby varying the dimensions of convolution kernels, stride forthe GAN discriminator, and the autoencoder, to match thenew dimensionality. The GAN generator does not require anychange since, for all experiments, we used the same noisedimension as mentioned above.

D. Privacy Loss

As tracking the RDP privacy accountant [8] is computation-ally more precise than the regular DP, we based our privacyloss calculation on RDP. Then, the RDP computations can betransformed to DP based on the proposition 2. The Gaussiannoise additive procedure is also called Sampled GaussianMechanism (SGM) [39]. For tracking privacy loss, we usethe following Theorem.

Theorem 3 (SGM privacy loss [39]): Assume D and D′ aretwo neighbor datasets and G is a SGM applied to a functionf with `2-sensitivity of one. If we set the following:

G(D) ∼ ϕ1∆= N (0, σ2)

G(D′) ∼ ϕ2∆= (1− q)N (0, σ2) + qN (1, σ2)

where ϕ1 and ϕ2 are PDFs, then, the mechanism G satisfies(α, �)-RDP for:

� ≤ 1α− 1

log(max {Aα, Bα})

where Aα∆= Ex∼ϕ1 [(ϕ2/ϕ1)

α] and Bα

∆=

Ex∼ϕ2 [(ϕ1/ϕ2)α]. N

It can be proven that Aα ≤ Bα and the RDP privacy com-putation can solely be focused on upper bounding Aα whichcan be calculated with a closed-form bound and numericalcalculations [39]. We use the same numerical calculationshere. However, that bounds � for each step. The overall boundof � for the entire training process can be calculated by�total = num steps×� for any given α (see proposition 1). Todetermine the tighter upper bound, we try multiple α valuesand use the minimum (�) and its associated α (obtained byRDP privacy accountant) to compute the (�, δ)-DP employingproposition 2.

System privacy budget: One important question is: howdo we calculate the (α, �)-RDP for the combination of anautoencoder and GAN training? Consider proposition 1, inwhich X , U1, and U2 are the input and output spaces of theautoencoder and the output space of the discriminator. Themechanisms A and B are the autoencoder and discriminator,respectively. Consider U1 is what the discriminator observesafter decoding of the fake samples, and X is the space of realinputs. Hence, the proposition 1 directly results in having thewhole system with (α, �ae + �gan)-RDP by fixing the α. Butwe cannot guarantee that we can have a fixed α, and the RDP

has a budget curve parameterized by it. The procedure of fixingα is as follows.• Assume we have two systems S1 and S2 such that one is

the autoencoder, and the other is the GAN. Hence we havetwo systems which (α1, �1)-RDP and (α2, �2)-RDP.

• Without loss of generality, we assume �1 ≤ �2. We pickαtotal = α2. Now we have system S2 which (αtotal, �2)-RDP.

• For S1 we pick αtotal = α1 and calculate �′ that � ≤ �′.• Now, the total system (S1, S2) satisfies (αtotal, �2+�′)-RDP.

V. EXPERIMENTS

In this section, we will first present the details of theexperimental setup and then report the results obtained invarious experiments and compare with different methodsavailable in the literature.

A. Experimental Setup

Here, we split the dataset to train Dtr and test setsDte. We utilize Dtr to train the models, and then generatesymthesized samples Dsyn using the trained model. We set|Dsyn| = |Dtr|. Recall from Section IV that although fordifferent datasets we alter the architectures associated withthe input size, the encoder output space (decoder input space)and the generator output space will always have the samedimensionality. Furthermore, the encoder input space, thedecoder output space, and the discriminator input space willalso have the same dimensionality. It is worth noting thatall of the reported � values are associated with the (�, δ)-DPdefinition with δ = 10−5 unless otherwise stated.

Both the convolutional autoencoder and convolutional GANwere trained with the Adam optimizer [40] with the learningrate of 0.005, with a mini-batch size of 64. For both generatorand discriminator we used Batch Normalization (BN) [41] toimprove the training. We used one GeForce RTX 2080 NVIDIAGPU for our experiments.

We compare our framework with various different methods.Depends on the experiments, some of those methods mayor may not be used for comparison depending on theircharacteristics. For further details, please see Table I.

B. Datasets

To evaluate this performance, we used the following datasetsfor our experiments:1) MIMIC-III [42]: This dataset consists of the medi-

cal records from around 46K patients. The MIMIC-IIIdataset [42] includes the medical records of 46K patients.

2) Kaggle Cervical Cancer [43]: This dataset covers patientsrecords and aims to classify the cervical cancer. Thereare multiple attributes in the dataset of discrete, andcontinuous types and one attribute associated with the classlabel (e.g., Biopsy as the target for classification).

3) UCI Epileptic Seizure Recognition [44]: In the UCIEpileptic Seizure Recognition dataset [44], the task is toclassify the seizure activity. The features are the values of


the Electroencephalogram (EEG) records at various timepoints.

4) PTB Diagnostic ECG [45]: The electrocardio-grams (ECGs) in this dataset [45] are used to classify theheart activity as normal or abnormal. This dataset contains14552 samples, out of which 4046 are classified as normal,and 10506 are abnormal activities.

5) Kaggle Cardiovascular Disease [46]: This Kaggledatasetis used to determine if a patient has cardiovasculardisease or not from a variety of features such as age, systolicblood pressure and diastolic blood pressure. This datasethas a mixture of discrete and continuous feature types aswell.

6) MIT-BIH Arrhythmia [47]: The MIT-BIH Arrhythmiadatabase [48] includes ambulatory ECG recordings, createdfrom 47 subjects. records obtained from a mixed group ofinpatients (requires overnight hospitalization) and outpa-tients (usually do not require overnight hospitalization) atBoston’s Beth Israel Hospital. Unlike other datasets in thisset of experiments, for the MIT-BIH Arrhythmia database,we are dealing with a multicategory classification task. TheECG signals correspond to normal (negative) and abnormalcases (positive). There are a total of five classes in thisdataset. Class zero is associated with normal cases. Therest of the classes are associated with different arrhythmiasand myocardial infarction which we call abnormal.

C. Baseline Comparison Methods

We compare our model with different benchmark meth-ods (see Table I). Depends on the nature of experiments, themodels are used as benchmarks. For example, if the experimentsare associated with different privacy settings, the models thatdo not preserve the privacy, will not be used for comparison.

1) MedGAN [13]: MedGAN consists of an architecture that isdesigned to generate discrete records. It has an autoencoderand a vanilla GAN. MedGAN does not preserve privacy.MedGAN can be used for unsupervised synthetic data gen-eration. We used MedGAN open-source implementation1.

2) TableGAN [14]: TableGAN consists of three components:A generator, a discriminator, and an auxiliary classifierthat aims to augment the semantic integrity of synthesizeddata. Similar to our method, TableGAN uses ConvolutionalNeural Networks. TableGAN requires labeled training dataand will be used in experiments with a supervised setting.We used TableGAN open-source implementation2.

3) DPGAN [9]: DPGAN enforces differential privacy by clip-ping the weights rather than the gradients. It uses WGANto stabilize the training and use multi-layer perceptions forits architecture. DPGAN does not require labeled data andhence will be used in both supervised and unsupervisedsettings. We used DPGAN open-source implementation 3.

4) PATE-GAN [16]: PATE-GAN proposes a differentiallyprivate model formed based on the modification of the

1https://github.com/mp2893/medgan2https://github.com/mahmoodm2/tableGAN3https://github.com/illidanlab/dpgan

PATE mechanism. We utilized our own implementation ofthe PATE-GAN as its code was not available publicly.

D. Unsupervised Synthetic Data Generation

We chose electronic health records after converting themto a high-dimensional binary discrete dataset to demonstratethe privacy-preserving potential of the proposed model. Here,we have Dtr ∈ {0, 1}R×|T |, Dte ∈ {0, 1}T×|T |, and Dsyn ∈{0, 1}S×|T |, where |T | represents the feature size. The goalis to capture the information from a real private dataset, anduse that to synthesize a dataset with similar distribution ina privacy-sensitive manner. In this section, we can compareour method with models that do not require labeled data (forunsupervised synthetic data generation) and are also privacy-reserving method (see Table I). In the experiments of thissection, to pretrain the autoencoder, we used Eq. 1 as the lossfunction.

Dataset Construction: We used MIMIC-III dataset [42]as an example for high-dimensional and discrete dataset.From MIMIC-III, We extracted and used 1071 unique ICD-9 codes. The dataset has different discrete variables (e.g.,diagnosis codes, procedure, etc.). Assuming there are |V |discrete variables, we can represent the patient’s record usinga binary vector v ∈ {0, 1}|V |. The ith variable is set to oneif it is available in the patient’s record. In our experiments,|V | = 1071.

Fig. 2: The comparison of generated and real data distributions.Lower MMD score indicates a higher distribution similarityand hence, a better model.

Evaluation Metrics: To assess the quality of syntheticallygenerated data, we used two evaluation metrics:1) Maximum Mean Discrepancy (MMD): This statistical

measure indicates the extent to which the model captures thestatistical distribution of the real data. In a recent study [49],MMD demonstrated most of the desired features of anevaluation metric, especially for GANs. This is used inan unsupervised setting since there are no labeled data forstatistical measurements. To report MMD, we comparedtwo samples of real and synthetic data, each of size 10000.

2) Dimension-wise prediction: This setting aims to illustratethe inter-connection between features, i.e., if we can predict

https://github.com/mp2893/medganhttps://github.com/mahmoodm2/tableGANhttps://github.com/illidanlab/dpgan


missing features using the features available in the dataset.We select top-10 and top-50 most frequent features (ICD-9 codes in the patients’ history) from training, test, andsynthetic sets. In each run, one testing dimension (k) fromDsyn and Dtr will be selected as Dsyn,k ∈ {0, 1}N×1 andDtr,k ∈ {0, 1}N×1. The remaining dimensions (Dsyn,\k ∈{0, 1}N×1 and Dtr,\k ∈ {0, 1}N×1) will be utilized totrain a classifier that aims to predict Ste,k ∈ {0, 1}N×1from the test set. We employed Random Forests [50], XG-Boost [51], and Decision Tree [52]. We report the averagedperformance over all predictive models (trained on syntheticdata) and for all features using the F1-score (Figure 3). Wecompare models by considering different privacy budgetsby fixing δ = 10−5 and varying �.

As can be observed in Figure 2 and Figure 3, withoutenforcing privacy (�), our model performs better compared toDPGAN. This is due to the fact that by using 1-D convolutionalarchitecture, our model captures the correlated informationamong adjacent ICD-9 codes in the MIMIC-III dataset. Top-features indicate the most frequent features in the database.

Fig. 3: The prediction accuracy of models trained on syntheticdata by varying privacy budget. The number of top featuresused for training are ‘10’ and ‘50’. The averaged F-1 scoreof all models to train on the real data is 0.44 and 0.35 forpicking top-10 and top-50 features respectively.

To further investigate the effect of CGANs and CAEs incapturing correlated features, we conduct some experimentswithout enforcing any privacy and compare our method withexisting methods. As can be observed in Table II our methodoutperforms other methods. Note that when the number oftop features are increased, our method shows even betterperformance compared to other methods since capturingcorrelation for higher number of features becomes more feasiblewith convolutional layers.

E. Supervised Synthetic Data Generation

In this part, we consider a supervised setting and generatelabeled synthetic data. For the experiments of this part, weneed labeled data. The supervised setting here includes variousclassification tasks on different datasets. In the experiments of

TABLE II: Comparison of different methods using F-1 scorefor the dimension-wise prediction setting. Except for the RealData column, the classifiers are trained on the synthetic data.The closer the results are to the Real Data experiments, thehigher the quality of synthetic data is and hence the model isbetter.

Top Features Real Data MedGAN DPGAN RDP-CGAN

10 0.51 ± 0.043 0.41 ± 0.013 0.36 ± 0.010 0.44 ± 0.01750 0.45 ± 0.037 0.24 ± 0.017 0.21 ± 0.017 0.35 ± 0.056100 0.36 ± 0.051 0.15 ± 0.016 0.12 ± 0.009 0.25 ± 0.096

this section, to pretrain the autoencoder, we used Eq. 2 as theloss function.

Data Processing: For all the datasets except for MIMIC,we follow the same data processing steps that are describedin Section V-B. For the MIMIC-III dataset, we used one ofthe top three codes across hospital admissions [42], whichis 414.01 (associated with coronary atherosclerosis) andextracted patients diagnosed with that specific medical code.The classification task will be mortality prediction. We used12,127 unique admissions and variables such as demographicinformation (Marital status, ethnicity, and insurance status),admission Information (e.g., days of admission), treatmentinformation (cardiac defibrillator implant with/without cardiaccatheterization), diagnostic information (respiratory disorder,cancer, etc.), and lab results (kidney function tests, creatinekinase, etc.) forming a total of 31 variables. Each admissionis considered as one data sample. Since multiple admissionsmight be associated with one patient we will have multiplesamples per patient. This comes from the fact that, in a 1-yearinterval, a patient may survive, but within the next few yearsand with new medical conditions, the same patient may notsurvive in the 1-year observation windows.

Evaluation: Let us consider the following two settings: (A)Train and test the model on the real data as the baseline (resultsshown in Table III). (B) Train on the generated data and teston the real data (this setting is used to quantify the quality ofsynthetic data). Note that the class distribution of the generatedsynthetic data must be identical to the real data. Setting (B)can demonstrate how well the model has performed the taskof synthetic data generation. The closer the performance ofsetting (B) is to setting (A), the better the model is in termsof the quality of the generated synthetic data. We conducteddifferent sets of experiments. We conduct ten runs (E=10), foreach experiment, and report the averaged AUROC (Area Underthe ROC curve) and AUPRC (Area Under the Precision-RecallCurve) for the models’ evaluations. The AUPRC metric isbeing utilized here since it operates better than AUROC foran imbalanced classification setting.

1) The effect of architecture: In order to investigate theeffect of convolutional GANs and convolutional autoencodersin our model, we perform some experiments with no privacyenforcement. The case � =∞ corresponds to the setup wherewe do not enforce any privacy. The AUROC and AUPRCresults are reported in Table IV and Table V, respectively. Inthe aforementioned tables, the best and second best resultsare indicated with bold and underline text. The classifiers are


TABLE III: Summary statistics of the datasets used in this paper along with the performance of the base model.

Dataset Samples Features Positive Labels AUROC AUPRC

MIMIC-III 12,127 31 3272 (27%) 0.81 ± 0.005 0.74 ± 0.007Kaggle Cervical Cancer 858 36 52 (6%) 0.94 ± 0.011 0.69 ± 0.003UCI Epileptic Seizure 11,500 178 2300 (20%) 0.97 ± 0.008 0.94 ± 0.014

PTB Diagnostic 14,552 118 10506 (62%) 0.97 ± 0.006 0.96 ± 0.003Kaggle Cardiovascular Disease 70,000 14 35000 (50%) 0.80 ± 0.017 0.78 ± 0.011

MIT-BIHArrhythmia 109,444 188 18606 (17%) 0.94 ± 0.006 0.89 ± 0.014

TABLE IV: Performance comparison of different methods using AUROC metric under no privacy constraints.

Dataset MedGAN TableGAN DPGAN PATE-GAN RDP-CGAN

MIMIC-III 0.71 ± 0.011 0.72 ± 0.017 0.71 ± 0.014 0.70 ± 0.007 0.74 ± 0.012Kaggle Cervical Cancer 0.89 ± 0.010 0.90 ± 0.012 0.90 ± 0.016 0.89 ± 0.009 0.92 ± 0.009UCI Epileptic Seizure 0.87 ± 0.013 0.91 ± 0.019 0.88 ± 0.024 0.85 ± 0.014 0.90 ± 0.011

PTB Diagnostic 0.86 ± 0.018 0.94 ± 0.009 0.88 ± 0.018 0.91 ± 0.006 0.93 ± 0.016Kaggle Cardiovascular Disease 0.71 ± 0.031 0.73 ± 0.016 0.71 ± 0.019 0.72 ± 0.008 0.76 ± 0.014

MIT-BIHArrhythmia 0.90 ± 0.021 0.89 ± 0.019 0.88 ± 0.012 0.86 ± 0.014 0.90 ± 0.009

TABLE V: Performance comparison of different methods using AUPRC metric under no privacy constraints.

Dataset MedGAN TableGAN DPGAN PATE-GAN RDP-CGAN

MIMIC-III 0.68 ± 0.011 0.71 ± 0.007 0.69 ± 0.006 0.70 ± 0.019 0.72 ± 0.009Kaggle Cervical Cancer 0.55 ± 0.012 0.60 ± 0.019 0.59 ± 0.008 0.58 ± 0.020 0.62 ± 0.017UCI Epileptic Seizure 0.79 ± 0.013 0.86 ± 0.016 0.82 ± 0.015 0.81 ± 0.012 0.84 ± 0.020

PTB Diagnostic 0.88 ± 0.006 0.93 ± 0.008 0.90 ± 0.007 0.88 ± 0.014 0.92 ± 0.012Kaggle Cardiovascular Disease 0.69 ± 0.021 0.73 ± 0.011 0.73 ± 0.021 0.72 ± 0.017 0.75 ± 0.009

MIT-BIHArrhythmia 0.77 ± 0.016 0.85 ± 0.013 0.82 ± 0.008 0.81 ± 0.005 0.86 ± 0.013

trained on the synthetic data. The closer the results are to theReal Data experiments, the higher the quality of synthetic datais and hence, we have a better model. As can be observedfrom those tables, for the challenging datasets where thereis a mixture of continuous and discrete variables, our modeloutperforms others since it not only captures feature correlationsbut also utilizes convolutional autoencoders which allow ourapproach learn robust representations for the mixture of datatypes. For other datasets, TableGAN closely competes withour model. This is because TableGAN also uses convolutionalGANs as well as auxiliary classifiers [24] for improving theperformance of the GANs. We could incorporate auxiliaryclassifiers to improve our model, however, it would narrowdown the scope of our work since using such auxiliary modelswould make it infeasible for unsupervised synthetic datageneration.

2) The effect of differential privacy: We will now investigatehow well different methods perform in the differential privacysetting. The base model is when we train using setting (A),which will have the highest accuracy as expected. So the ques-tion we investigate here is: how much will the accuracy dropfor different models at the same level of privacy budget (�, δ)?We compare our model with privacy-preserving models (seeTable I). Table VI shows these results for the supervisedsetting described above. In the majority of the experiments, thesynthetic data generated by our model, demonstrates higher

quality for classification tasks compared to other models underthe same privacy budget.

3) The effect of privacy budget: We also investigate howmodels will perform under different privacy budgets. Fig. 4demonstrates the trade-off between the privacy budget and thesynthetic data quality. RDP-CGAN consistently shows betterperformance compared to other benchmarks in all the datasets.Our approach demonstrates it is particularly effective in lowerprivacy budgets. As an example, for the Kaggle CervicalCancer dataset and for � = 100, 10, 1, 0.1, our model achievessignificantly higher AUPRC compared to the PATE-GAN.

4) Ablation study: We will now conduct an ablation studyto investigate the effect of each component of our model. Inparticular, we are interested in assessing the importance andimpact of utilizing autoencoders and convolutional architectures.The results of our ablation study are reported in Table VIIIand Table VII. The various scenarios reported are describedbelow:

• W/O CAE: without convolutional architecture in autoencoder.• W/O AE: without autoencoder.• W/O CG: without convolutional architecture in GAN’s

generator.• W/O CD: without convolutional architecture in GAN’s

discriminator.• W/O CDCG: a standard GAN architecture with MLP.


TABLE VI: The comparison of different models under the (1, 10−5)-DP setting. For � =∞, we used our RDP-CGAN modelwithout enforcing privacy. We used synthetic data for training all models. The best and second best results are indicated withbold and underline text.

AUROC AUPRC

Dataset � =∞ DPGAN PATE-GAN RDP-CGAN � =∞ DPGAN PATE-GAN RDP-CGAN

MIMIC-III 0.74 ± 0.012 0.59 ± 0.009 0.62 ± 0.011 0.67 ± 0.010 0.72 ± 0.009 0.55 ± 0.013 0.58 ± 0.017 0.62 ± 0.015Kaggle Cervical Cancer 0.92 ± 0.009 0.86 ± 0.009 0.91 ± 0.005 0.89 ± 0.005 0.62 ± 0.017 0.53 ± 0.008 0.54 ± 0.014 0.57 ± 0.007UCI Epileptic Seizure 0.90 ± 0.011 0.72 ± 0.009 0.74 ± 0.014 0.84 ± 0.008 0.84 ± 0.020 0.57 ± 0.003 0.63 ± 0.016 0.69 ± 0.019PTB Diagnostic ECG 0.93 ± 0.016 0.71 ± 0.012 0.75 ± 0.012 0.79 ± 0.009 0.92 ± 0.012 0.71 ± 0.018 0.76 ± 0.011 0.80 ± 0.008Kaggle Cardiovascular 0.76 ± 0.014 0.61 ± 0.019 0.66 ± 0.006 0.69 ± 0.013 0.75 ± 0.009 0.60 ± 0.001 0.63 ± 0.007 0.66 ± 0.016MIT-BIHArrhythmia 0.90 ± 0.009 0.69 ± 0.004 0.73 ± 0.006 0.77 ± 0.003 0.86 ± 0.013 0.68 ± 0.023 0.73 ± 0.016 0.78 ± 0.008

(a) MIMIC-III. (b) Kaggle Cervical Cancer. (c) UCI Epileptic Seizure.

(d) PTB Diagnostic. (e) Kaggle Cardiovascular. (f) MIT-BIH Arrhythmia.

Fig. 4: The effect of privacy budget on the synthetic data quality measured by AUPRC. The baseline is associated with � =∞for which we trained each model with no privacy constraint (see Table V). Higher AUPRC is associated with higher quality ofsynthetic data and hence represents a better model.

The results shown in Table VIII and Table VII demonstratethe importance of convolutional architectures. The interestingfinding here is that, for the datasets with only continuousvalues like UCI Epileptic Seizure, PTB Diagnostic ECG, andMIT-BIHArrhythmia) (as opposed to a mixture of continuous-discrete variables), the use of autoencoders actually downgradedthe system performance. Hence, we can conclude that autoen-coders are very useful in the presence of discrete data or amixture of both continuous and discrete data types.

VI. CONCLUSIONIn this work, we proposed and developed a differentially

private framework for synthetic data generation using RényiDifferential Privacy. The model aimed to capture the temporalinformation and feature correlation using convolutional neuralnetworks. We empirically demonstrate that by employingconvolutional autoencoders we can effectively handle variables

that are continuous, discrete or a mixture of both. We argue thatwe will need to secure both encoder and decoder parts of theautoencoder since there is no guarantee that the autoencoder’straining is being done by a trusted third-party. We show ourmodel outperforms other models under the same privacy budget.This phenomenon may come in part from reporting a tighterbound and, in part, from utilizing convolutional networks. Wereported the performance of different models by replacingreal data with synthetic data for training our machine learningmodels.

ACKNOWLEDGEMENTSThis work was supported in part by the US National Science

Foundation grants IIS-1619028, IIS-1707498, IIS-1838730 andNVIDIA Corp. We also acknowledge the support provided byNewWave Telecom Technologies during the early stages ofthis project through a fellowship award.


TABLE VII: Ablation study results. For the baseline � =∞, we used our RDP-CGAN model without enforcing privacy. Eachcolumn corresponds to a change in the core RDP-CGAN architecture. The reported results are AUROC.

Dataset � =∞ W/O CAE W/O AE W/O CG W/O CD W/O CDCG

MIMIC-III 0.74 ± 0.012 0.73 ± 0.017 0.69 ± 0.031 0.72 ± 0.023 0.71 ± 0.027 0.70 ± 0.016Kaggle Cervical Cancer 0.92 ± 0.009 0.90 ± 0.019 0.86 ± 0.027 0.92 ± 0.031 0.90 ± 0.014 0.88 ± 0.021UCI Epileptic Seizure 0.90 ± 0.011 0.89 ± 0.016 0.91 ± 0.021 0.88 ± 0.022 0.86 ± 0.008 0.85 ± 0.026PTB Diagnostic ECG 0.93 ± 0.016 0.91 ± 0.011 0.94 ± 0.015 0.92 ± 0.029 0.91 ± 0.027 0.89 ± 0.014Kaggle Cardiovascular 0.76 ± 0.014 0.74 ± 0.012 0.71 ± 0.041 0.43 ± 0.015 0.72 ± 0.021 0.71 ± 0.026MIT-BIHArrhythmia 0.90 ± 0.009 0.88 ± 0.013 0.90 ± 0.014 0.88 ± 0.012 0.86 ± 0.017 0.84 ± 0.041

TABLE VIII: Ablation study results. For the baseline � =∞, we used our RDP-CGAN model without enforcing privacy. Eachcolumn corresponds to a change in the core RDP-CGAN architecture. The reported results are AUPRC.

Dataset � =∞ W/O CAE W/O AE W/O CG W/O CD W/O CDCG

MIMIC-III 0.72 ± 0.009 0.70 ± 0.032 0.67 ± 0.054 0.70 ± 0.076 0.69 ± 0.023 0.66 ± 0.065Kaggle Cervical Cancer 0.62 ± 0.017 0.61 ± 0.031 0.58 ± 0.017 0.59 ± 0.036 0.59 ± 0.041 0.55 ± 0.023UCI Epileptic Seizure 0.84 ± 0.020 0.82 ± 0.034 0.86 ± 0.041 0.79 ± 0.024 0.81 ± 0.012 0.80 ± 0.018PTB Diagnostic ECG 0.92 ± 0.012 0.90 ± 0.028 0.94 ± 0.031 0.89 ± 0.022 0.88 ± 0.022 0.88 ± 0.029Kaggle Cardiovascular 0.75 ± 0.009 0.73 ± 0.052 0.71 ± 0.017 0.72 ± 0.033 0.71 ± 0.041 0.70 ± 0.014MIT-BIHArrhythmia 0.86 ± 0.013 0.83 ± 0.019 0.87 ± 0.026 0.85 ± 0.038 0.84 ± 0.023 0.82 ± 0.032

REFERENCES

[1] A. Narayanan and V. Shmatikov, “Robust de-anonymization of largesparse datasets,” in 2008 IEEE Symposium on Security and Privacy (sp2008), pp. 111–125, IEEE, 2008.

[2] M. Al-Rubaie and J. M. Chang, “Privacy-preserving machine learning:Threats and solutions,” IEEE Security & Privacy, vol. 17, no. 2, pp. 49–58,2019.

[3] Y. Li, C. Bai, and C. K. Reddy, “A distributed ensemble approach formining healthcare data under privacy constraints,” Information sciences,vol. 330, pp. 245–259, 2016.

[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,”in Advances in Neural Information Processing Systems, pp. 2672–2680,2014.

[5] J. Hayes, L. Melis, G. Danezis, and E. De Cristofaro, “LOGAN:Membership inference attacks against generative models,” Proceedingson Privacy Enhancing Technologies, vol. 2019, no. 1, pp. 133–152, 2019.

[6] C. Dwork, A. Roth, et al., “The algorithmic foundations of differentialprivacy,” Foundations and Trends® in Theoretical Computer Science,vol. 9, no. 3–4, pp. 211–407, 2014.

[7] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” inProceedings of the 22nd ACM SIGSAC Conference on Computer andCommunications Security, pp. 1310–1321, ACM, 2015.

[8] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar,and L. Zhang, “Deep learning with differential privacy,” in Proceedingsof the 2016 ACM SIGSAC Conference on Computer and CommunicationsSecurity, pp. 308–318, ACM, 2016.

[9] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou, “Differentially privategenerative adversarial network,” arXiv preprint arXiv:1802.06739, 2018.

[10] G. Acs, L. Melis, C. Castelluccia, and E. De Cristofaro, “Differentiallyprivate mixture of generative neural networks,” IEEE Transactions onKnowledge and Data Engineering, vol. 31, no. 6, pp. 1109–1121, 2018.

[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noiseto sensitivity in private data analysis,” in Theory of CryptographyConference, pp. 265–284, Springer, 2006.

[12] I. Mironov, “Rényi Differential Privacy,” in 2017 IEEE 30th ComputerSecurity Foundations Symposium (CSF), pp. 263–275, IEEE, 2017.

[13] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Gen-erating multi-label discrete patient records using generative adversarialnetworks,” arXiv preprint arXiv:1703.06490, 2017.

[14] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim,“Data synthesis based on generative adversarial networks,” Proceedingsof the VLDB Endowment, vol. 11, no. 10, pp. 1071–1083, 2018.

[15] A. Torfi and E. A. Fox, “CorGAN: Correlation-capturing convolutionalgenerative adversarial networks for generating synthetic healthcarerecords,” in The Thirty-Third International Flairs Conference, 2020.

[16] J. Jordon, J. Yoon, and M. van der Schaar, “PATE-GAN: Generatingsynthetic data with differential privacy guarantees,” in InternationalConference on Learning Representations, 2018.

[17] M. K. Baowaly, C.-C. Lin, C.-L. Liu, and K.-T. Chen, “Synthesizingelectronic health records using improved generative adversarial networks,”Journal of the American Medical Informatics Association, vol. 26, no. 3,pp. 228–241, 2018.

[18] A. H. Pollack, T. D. Simon, J. Snyder, and W. Pratt, “Creating syntheticpatient data to support the design and evaluation of novel healthinformation technology,” Journal of Biomedical Informatics, vol. 95,p. 103201, 2019.

[19] J. Guan, R. Li, S. Yu, and X. Zhang, “Generation of syntheticelectronic medical record text,” in 2018 IEEE International Conferenceon Bioinformatics and Biomedicine (BIBM), pp. 374–380, IEEE, 2018.

[20] J. Walonoski, M. Kramer, J. Nichols, A. Quina, C. Moesel, D. Hall,C. Duffett, K. Dube, T. Gallagher, and S. McLachlan, “Synthea: Anapproach, method, and software mechanism for generating syntheticpatients and the synthetic electronic health care record,” Journal of theAmerican Medical Informatics Association, vol. 25, no. 3, pp. 230–238,2017.

[21] J. Chen, D. Chun, M. Patel, E. Chiang, and J. James, “The validityof synthetic clinical data: a validation study of a leading syntheticdata generator (synthea) using clinical quality measures,” BMC MedicalInformatics and Decision Making, vol. 19, no. 1, p. 44, 2019.

[22] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni,“Modeling tabular data using conditional GAN,” in Advances in NeuralInformation Processing Systems, pp. 7333–7343, 2019.

[23] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, R. Lee, S. P. Bhavnani,J. B. Byrd, and C. S. Greene, “Privacy-preserving generative deep neuralnetworks support clinical data sharing,” Circulation: CardiovascularQuality and Outcomes, vol. 12, no. 7, p. e005122, 2019.

[24] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis withauxiliary classifier GANs,” in Proceedings of the 34th InternationalConference on Machine Learning-Volume 70, pp. 2642–2651, JMLR.org, 2017.

[25] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar,“Semi-supervised knowledge transfer for deep learning from privatetraining data,” arXiv preprint arXiv:1610.05755, 2016.

[26] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, andÚ. Erlingsson, “Scalable Private Learning with PATE,” arXiv preprintarXiv:1802.08908, 2018.

[27] R. Torkzadehmahani, P. Kairouz, and B. Paten, “Dp-cgan: Differentiallyprivate synthetic data and label generation,” in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition Workshops,pp. 0–0, 2019.

[28] U. Tantipongpipat, C. Waites, D. Boob, A. A. Siva, and R. Cummings,


“Differentially private mixed-type data generation for unsupervisedlearning,” arXiv preprint arXiv:1912.03250, 2019.

[29] C. Dwork, G. N. Rothblum, and S. Vadhan, “Boosting and differentialprivacy,” in 2010 IEEE 51st Annual Symposium on Foundations ofComputer Science, pp. 51–60, IEEE, 2010.

[30] A. Rényi et al., “On measures of entropy and information,” in Proceedingsof the Fourth Berkeley Symposium on Mathematical Statistics andProbability, Volume 1: Contributions to the Theory of Statistics, TheRegents of the University of California, 1961.

[31] R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio,“Boundary-seeking generative adversarial networks,” arXiv preprintarXiv:1702.08431, 2017.

[32] D. P. Kingma and M. Welling, “Auto-encoding Variational Bayes,” arXivpreprint arXiv:1312.6114, 2013.

[33] N. C. Abay, Y. Zhou, M. Kantarcioglu, B. Thuraisingham, andL. Sweeney, “Privacy preserving synthetic data release using deeplearning,” in Joint European Conference on Machine Learning andKnowledge Discovery in Databases, pp. 510–526, Springer, 2018.

[34] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,“Improved training of Wasserstein GANs,” in Advances in NeuralInformation Processing Systems, pp. 5767–5777, 2017.

[35] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXivpreprint arXiv:1701.07875, 2017.

[36] C. Villani, Optimal transport: old and new, vol. 338. Springer Science& Business Media, 2008.

[37] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:Surpassing human-level performance on imagenet classification,” inProceedings of the IEEE International Conference on Computer Vision,pp. 1026–1034, 2015.

[38] A. Radford, L. Metz, and S. Chintala, “Unsupervised representationlearning with deep convolutional generative adversarial networks,” arXivpreprint arXiv:1511.06434, 2015.

[39] I. Mironov, K. Talwar, and L. Zhang, “Rényi differential privacy of thesampled gaussian mechanism,” arXiv preprint arXiv:1908.10530, 2019.

[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014.

[41] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift,” arXiv preprintarXiv:1502.03167, 2015.

[42] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi,B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, “MIMIC-III, afreely accessible critical care database,” Scientific data, vol. 3, p. 160035,2016.

[43] K. Fernandes, J. S. Cardoso, and J. Fernandes, “Transfer learning withpartial observability applied to cervical cancer screening,” in IberianConference on Pattern Recognition and Image Analysis, pp. 243–250,Springer, 2017.

[44] R. G. Andrzejak, K. Lehnertz, F. Mormann, C. Rieke, P. David, and C. E.Elger, “Indications of nonlinear deterministic and finite-dimensionalstructures in time series of brain electrical activity: Dependence onrecording region and brain state,” Physical Review E, vol. 64, no. 6,p. 061907, 2001.

[45] R. Bousseljot, D. Kreiseler, and A. Schnabel, “Nutzung der ekg-signaldatenbank cardiodat der ptb über das internet,” BiomedizinischeTechnik/Biomedical Engineering, vol. 40, no. s1, pp. 317–318, 1995.

[46] S. Ulianova, “Cardiovascular disease dataset,” 2018. data re-trieved from the kaggle dataset, https://www.kaggle.com/sulianova/cardiovascular-disease-dataset.

[47] G. B. Moody and R. G. Mark, “The impact of the MIT-BIH arrhythmiadatabase,” IEEE Engineering in Medicine and Biology Magazine, vol. 20,no. 3, pp. 45–50, 2001.

[48] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C.Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E.Stanley, “PhysioBank, PhysioToolkit, and PhysioNet: components of anew research resource for complex physiologic signals,” Circulation,vol. 101, no. 23, pp. e215–e220, 2000.

[49] Q. Xu, G. Huang, Y. Yuan, C. Guo, Y. Sun, F. Wu, and K. Weinberger, “Anempirical study on evaluation metrics of generative adversarial networks,”arXiv preprint arXiv:1806.07755, 2018.

[50] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[51] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,”in Proceedings of the 22nd ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, pp. 785–794, 2016.

[52] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1,no. 1, pp. 81–106, 1986.

Amirsina Torfi holds a Ph.D. in Computer Sci-ence from Virginia Tech, and a B.S. and M.S.from Iran University of Science and Technology inElectrical Engineering and Information Technology,respectively. His research interests include MachineLearning and Deep Learning in various applicationssuch as NLP, healthcare, and computer vision. Hepublished several peer-reviewed. He is the founderof Instill AI, a start-up company aimed to enhancethe utilization of AI in real-world applications. Heis also a leading expert in open-source development

as some of his Deep Learning projects are well-known worldwide.

Edward A. Fox holds a Ph.D. and M.S. in ComputerScience from Cornell University, and a B.S. fromM.I.T. He is a Fellow of both ACM and IEEE,cited for leadership in digital libraries and informa-tion retrieval. Since 1983 he has been at VirginiaPolytechnic Institute and State University (VPI&SUor Virginia Tech), where he serves as Professor ofComputer Science, and by courtesy, of ECE. Hewas an elected member of the Board of Directorsof the Computing Research Association and wasChair (now a member) of the ACM/IEEE-CS Joint

Conference on Digital Libraries (JCDL) Steering Committee. He served onthe IEEE Thesaurus Editorial Board, and was Chairman of the IEEE-CSTechnical Committee on Digital Libraries. He works in the areas of digitallibraries, information storage and retrieval, machine learning/AI, computationallinguistics (NLP), hypertext/hypermedia/multimedia, computing education, CD-ROM and optical disc technology, electronic publishing, and expert systems.In these areas, he has participated in, organized, presented, and/or reviewed forhundreds of conferences and workshops. 131 grants have funded his research.For the Association for Computing Machinery he served 2018-2019 as amember of its Publications Board (and as co-chair of its digital librariescommittee). He was editor for information retrieval and digital libraries inthe ACM Books series. He was founder and co-editor-in-chief for the ACMJournal of Educational Resources in Computing, was a member of the editorialboard for ACM Transactions on Information Systems, and was General Chairfor JCDL 2001. He served as Program Chair for ACM DL’99, ACM DL’96,and ACM SIGIR’95 - and Program Co-chair for JCDL 2018, CIKM 2006,and ICADL 2005. He serves on the editorial boards of Journal of EducationalMultimedia and Hypermedia, Journal of Universal Computer Science, andPeerJ Computer Science.

Chandan K. Reddy is a Professor in the Departmentof Computer Science at Virginia Tech. He receivedhis Ph.D. from Cornell University and M.S. fromMichigan State University. His primary researchinterests are Data Mining and Machine Learningwith applications to various real-world domainsincluding healthcare, transportation, social networks,and e-commerce. He has published over 130 peer-reviewed articles in leading conferences and journals.He received several awards for his research workincluding the Best Application Paper Award at ACM

SIGKDD conference in 2010, Best Poster Award at IEEE VAST conference in2014, Best Student Paper Award at IEEE ICDM conference in 2016, and wasa finalist of the INFORMS Franz Edelman Award Competition in 2011. He isserving on the editorial boards of ACM TKDD, IEEE Big Data, and DMKDjournals. He is a senior member of the IEEE and distinguished member ofthe ACM.

https://www.kaggle.com/sulianova/cardiovascular-disease-datasethttps://www.kaggle.com/sulianova/cardiovascular-disease-dataset

I IntroductionI-A Synthetic Data GenerationI-B Differential PrivacyI-C Challenges with Medical DataI-D Our Contributions

II Related WorksIII PreliminariesIII-A AutoencodersIII-B Differential PrivacyIII-C Rényi Differential Privacy

IV Privacy-Preserving FrameworkIV-A Convolutional AutoencoderIV-B Convolutional GANIV-C Architecture DetailsIV-D Privacy Loss

V ExperimentsV-A Experimental SetupV-B DatasetsV-C Baseline Comparison MethodsV-D Unsupervised Synthetic Data GenerationV-E Supervised Synthetic Data GenerationV-E1 The effect of architectureV-E2 The effect of differential privacyV-E3 The effect of privacy budgetV-E4 Ablation study

VI ConclusionReferencesBiographiesAmirsina TorfiEdward A. FoxChandan K. Reddy

Documents

ARXIV SUBMISSION VERSION 1 Differentially Private ...ARXIV SUBMISSION VERSION 1 Differentially Private Synthetic Medical Data Generation using Convolutional GANs Amirsina Torﬁ, Member,