
scVAE: Variational auto-encoders for single-cell gene expression data

Christopher Heje Grønbech¹,²,³, Maximillian Fornitz Vording³, Pascal Timshel⁴,⁵, Casper Kaae Sønderby¹, Tune Hannes Pers⁴,⁵, and Ole Winther¹,²,³

¹ Bioinformatics Centre, Department of Biology, University of Copenhagen.
² Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital.
³ Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark.
⁴ The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen.
⁵ Department of Epidemiology Research, Statens Serum Institut.

Abstract

Motivation: Models for analysing and making relevant biological inferences from massive amounts of complex single-cell transcriptomic data typically require several individual data-processing steps, each with their own set of hyperparameter choices. With deep generative models one can work directly with count data, make likelihood-based model comparison, learn a latent representation of the cells and capture more of the variability in different cell populations.

Results: We propose a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data. It avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell. We tested several count likelihood functions and a variant of the VAE that has a priori clustering in the latent space. We show for several scRNA-seq data sets that our method outperforms recently proposed scRNA-seq methods in clustering cells and that the resulting clusters reflect cell types.

Availability and implementation: Our method, called scVAE, is implemented in Python using the TensorFlow machine-learning library, and it is freely available at https://github.com/scvae/scvae.


1 Introduction

Single-cell RNA sequencing (scRNA-seq) enables measurement of gene expression levels of individual cells and thus provides a new framework to understand dysregulation of disease at the cell-type level (Regev et al., 2017). To date, a number of methods¹ have been developed to process gene expression data to normalise the data (Haghverdi et al., 2018) and cluster cells into putative cell types (Satija et al., 2015; Kiselev et al., 2017). Seurat (Satija et al., 2015) is a popular method which is a multi-step process of normalisation, transformation, decomposition, embedding, and clustering of the gene expression levels. This can be cumbersome, and a more automated approach is desirable. Four recent methods, cellTree (duVerle et al., 2016), DIMM-SC (Sun et al., 2017), DCA (Eraslan et al., 2018), and scVI (Lopez et al., 2018), model the gene expression levels directly as counts using latent Dirichlet allocation, a Dirichlet-mixture generative model, an auto-encoder, and a variational auto-encoder, respectively. The last two methods are described below.

¹ A list of software packages for single-cell data analysis is available at https://github.com/seandavi/awesome-single-cell.

Here, we show that expressive deep generative models, leveraging the recent progress in deep neural networks, provide a powerful framework for modelling the data distributions of raw count data. We show that these models can learn biologically plausible representations of scRNA-seq experiments using only the highly sparse raw count data as input, entirely skipping the normalisation and transformation steps needed in previous methods. Our approach is based on the variational auto-encoder (VAE) framework presented in Kingma and Welling (2013) and Rezende et al. (2014). These models learn compressed latent representations of the input data without any supervisory signals, and they can denoise the input data using its encoder-decoder structure.


We extend these models with likelihood (link) functions suitable for modelling (sparse) count data and perform extensive experiments on several data sets to analyse the strengths and weaknesses of the models and the likelihood functions. Variational auto-encoders have been examined and extended to use multiple latent variables for, e.g., semi-supervised learning (Kingma et al., 2014), clustering (Dilokthanakul et al., 2016; Maaløe et al., 2017; Jiang et al., 2017), and structured representations (Sønderby et al., 2016; Johnson et al., 2016; Lin et al., 2018). They have also been used in a variety of cases to, e.g., generate sentences (Bowman et al., 2016), transfer artistic style of paintings (Gatys et al., 2015), and create music (Roberts et al., 2017). Recently, VAEs have also been used to study normalised bulk RNA-seq data (Way and Greene, 2017) as well as to visualise normalised scRNA-seq data (VASC, Wang and Gu, 2018; scvis, Ding et al., 2018).

Compared to most other auto-encoder methods, VAEs have two major advantages: (a) they are probabilistic, so that performance can be quantified and compared in terms of the likelihood, and (b) the variational objective creates a natural trade-off between data reconstruction and model capacity. Other types of auto-encoders also learn a latent representation from data that is used to reconstruct the same as well as unseen data. The latent representation is tuned to have low capacity, so the model is forced to only estimate the most important features of the data. Different auto-encoder models have previously been used to model normalised (or binarised) bulk gene expression levels: denoising auto-encoders (Tan et al., 2014; Gupta et al., 2015; Tan et al., 2016), sparse auto-encoders (Chen et al., 2016), and robust auto-encoders (Cui et al., 2017). A bottleneck auto-encoder (Eraslan et al., 2018) was also recently used to model single-cell transcript counts, and a generative adversarial network (Goodfellow et al., 2014), which is a related model, was recently used to model normalised single-cell gene expression levels (Ghahramani et al., 2018).

Our contributions are threefold: (a) We have developed new generative models based on the VAE framework for directly modelling raw counts from RNA-seq data; (b) we show that our models using either a Gaussian or a Gaussian-mixture latent variable prior learn biologically plausible groupings of scRNA-seq data, as validated on data sets with known cell types; and (c) we provide a publicly available framework for unsupervised modelling of count data from RNA-seq experiments.

2 Methods and materials

We have developed generative models for directly modelling the raw read counts from scRNA-seq data. In this section, we describe the models as well as the different data likelihood (link) functions for this task.

2.1 Latent-variable models

We take a generative approach to modelling raw count data vectors x, where x represents a single cell and its components x_n, which are also called features, correspond to the gene expression count for gene n. We assume that x is drawn from the distribution pθ(x) = p(x | θ) parameterised by θ. The number of features (genes), N, is very high, but we still want to model co-variation between the features. We achieve this by introducing a stochastic latent variable z, with fewer dimensions than x, and conditioning the data-generating process on this. The joint probability distribution of x and z is then

    pθ(x, z) = pθ(x | z) pθ(z), (1)

where pθ(x | z) is the likelihood function and pθ(z) is the prior probability distribution of z. Marginalising over z results in the marginal likelihood of θ for one data point:

pθ(x) = ∫ pθ(x | z) pθ(z) dz. (2)

The log-likelihood function for all data points (also called examples, which in this case are cells) will be the sum of the log-likelihoods for each data point: F(θ) = ∑_{m=1}^{M} log pθ(x_m), where M is the number of cells. We can then use maximum-likelihood estimation, arg max_θ F(θ), to infer the value of θ.

We use a deep neural network to map from z to the sufficient statistics of x. However, as the marginalisation over the latent variables in Equation (2) is intractable for all but the simplest linear models, we have to resort to approximate methods for inference in the models. Here we use variational Bayesian optimisation for inference as introduced in the variational auto-encoder framework (Kingma and Welling, 2013; Rezende et al., 2014). Details of this approach are covered in Section 2.4.

2.2 Deep generative models

The choice of likelihood function pθ(x | z) depends on the statistical properties of the data, e.g., continuous or discrete observations, sparsity, and computational tractability. Contrary to most other VAE-based articles considering either continuous or categorical input data, our goal is to model discrete count data directly, which is why Poisson or negative binomial likelihood functions are natural choices. This will be discussed in more detail below. The prior over the latent variables p(z) is usually chosen to be an isotropic standard multivariate Gaussian distribution to get the following generative process (Fig. 1A):

pθ(x | z) = f(x; λθ(z)), (3a)
pθ(z) = N(z; 0, I), (3b)

where f is a discrete distribution such as the Poisson (P) distribution, fP(x; λ) = e^(−λ) λ^x / x!, and f(x; λ) = ∏_k f(x_k; λ_k).



Fig. 1. Model graphs of (A) the generative process and (B) the inference process of the variational auto-encoder models. Dotted strokes designate nodes and edges for the Gaussian-mixture model only. Grey circles signify observable variables, white circles represent latent variables, and rhombi symbolise deterministic variables. The black squares denote the functions next to them with the variables connected by lines as input and the variable connected by an arrow as output.

The Poisson rate parameters are functions of z: λθ(z). We can for example parameterise it by a single-layer feed-forward neural network:

λθ(z) = h(Wz + b). (4)

Here, W and b are the weights and bias of a linear model, θ = (W, b), and h(·) is an appropriate element-wise non-linear transformation to make the rate non-negative. The Poisson likelihood function can be substituted by other probability distributions with parameters of the distribution being non-linear functions of z in the same fashion as above (see Section 2.3). In order to make the model more expressive, we can also replace the single-layer model with a deep model with one or more hidden layers. For L layers of adaptable weights we can write a_l = h_l(W_l a_{l−1} + b_l) for l = 1, …, L, with a_0 = z and λθ(z) = a_L, where h_l(·) denotes the activation function of the l-th layer. For the hidden layers the rectifier function, ReLU(x) = max(0, x), is often a good choice.
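To make Equation (4) and its deep extension concrete, below is a minimal sketch of such a decoder network in TensorFlow 2/Keras, the library scVAE itself is built on. The layer sizes, the batch of latent samples, and the exponential output non-linearity (mirroring the convention in Section 2.7 of modelling the logarithm of positive parameters) are illustrative assumptions, not the exact published configuration.

```python
import tensorflow as tf

latent_size = 25   # dimensionality of z (assumed for illustration)
n_genes = 32738    # dimensionality of x, e.g. the number of genes in the PBMC data set

# Deep version of Eq. (4): a_l = ReLU(W_l a_{l-1} + b_l), with lambda_theta(z) = h(a_L).
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_size,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    # h(.) must keep the Poisson rates non-negative; exp corresponds to modelling
    # log(lambda) with a linear output layer
    tf.keras.layers.Dense(n_genes, activation="exponential"),
])

z = tf.random.normal([64, latent_size])  # a mini-batch of latent vectors
rates = decoder(z)                       # lambda_theta(z): one rate per cell and gene
```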

2.2.1 Gaussian-mixture variational auto-encoder

Using a Gaussian distribution as the prior probability distribution of z only allows for one mode in the latent representation. If there is an inherent clustering in the data, like for scRNA-seq data where the cells represent different cell types, it is desirable to have multiple modes, e.g., one for every cluster or class. This can be implemented by using a Gaussian-mixture model in place of the Gaussian distribution. Following the specification of the M2 model of Kingma et al. (2014), with inspiration from Dilokthanakul et al. (2016) and modifications by Rui Shu², a categorical latent random variable y ∈ {1, …, K} is added to the VAE model:

    pθ(x, y, z) = pθ(x | z) pθ(z | y) pθ(y). (5)

where the likelihood term pθ(x | z) is the same as for the standard VAE (Eq. 3a) and

pθ(z | y) = N(z; µθ(y), σ²θ(y) I), (6a)
pθ(y) = Cat(y; π). (6b)

The generative process for the Gaussian-mixture variational auto-encoder (GMVAE) is illustrated in Figure 1A. Here, π is a K-dimensional probability vector, where K is the number of components in the Gaussian-mixture model. The component πk of π is the mixing coefficient of the kth Gaussian distribution, quantifying how much this distribution contributes to the overall probability distribution. Without any prior knowledge, the categorical distribution is fixed to be uniform.
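The following NumPy sketch walks through this generative process once, drawing a component, a latent vector, and a count vector. The number of components, the mixture parameters, and the placeholder decoder are all assumptions for illustration; in the trained model the decoder is a deep neural network with learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
K, latent_size, n_genes = 9, 25, 2000        # assumed sizes
pi = np.full(K, 1.0 / K)                     # uniform mixing coefficients
mu = rng.normal(size=(K, latent_size))       # mu_theta(y): one mean per component
sigma = np.ones((K, latent_size))            # sigma_theta(y): one std. dev. per component
W = rng.normal(scale=0.1, size=(latent_size, n_genes))   # placeholder decoder weights

def decode_rates(z):
    """Placeholder for lambda_theta(z); a deep neural network in the real model."""
    return np.exp(z @ W)

y = rng.choice(K, p=pi)                      # p_theta(y) = Cat(y; pi)
z = rng.normal(mu[y], sigma[y])              # p_theta(z | y) = N(z; mu_theta(y), sigma_theta(y)^2 I)
x = rng.poisson(decode_rates(z))             # p_theta(x | z), here with a Poisson likelihood
```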

2.3 Modelling gene expression count data

Instead of normalising and transforming the gene expression data, the original transcript counts are modelled directly to take into account the total amount of genes expressed in each cell, also called the sequencing depth. To model count data, the likelihood function pθ(x | z) will need to be discrete and only have non-negative support. As described earlier, parameters for these probability distributions are modelled using deep neural networks that take the latent vector z as input. We will consider a number of such distributions in the following and investigate which ones are best in terms of likelihood on held-out data.

Our first approach to model the likelihood function uses either the Poisson or negative binomial (NB) distributions. The Poisson distribution provides the simplest distribution naturally handling count data, whereas the negative binomial distribution corresponds well with the over-dispersed nature of gene expression data (Oshlack et al., 2010). To properly handle the sparsity of the scRNA-seq data (Vallejos et al., 2017), we test two approaches: a zero-inflated distribution and modelling of low counts using a discrete distribution. A zero-inflated distribution adds an additional parameter, which controls the amount of excessive zeros added to an existing probability distribution. For the Poisson distribution, fP(x; λ), the zero-inflated version (ZIP) is defined as

fZIP(x; ρ, λ) = { ρ + (1 − ρ) fP(x; λ),   x = 0,
                  (1 − ρ) fP(x; λ),       x > 0,    (7)

² Detailed in a blog post by Rui Shu: http://ruishu.io/2016/12/25/gmvae/.


where ρ is the probability of excessive zeros. The zero-inflated negative binomial (ZINB) distribution has an analogous expression.
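As a concrete illustration of Equation (7), here is a small NumPy/SciPy sketch of the zero-inflated Poisson probability for a vector of counts. The values of ρ and λ are arbitrary placeholders; in the model they are produced per gene by the decoder network.

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(x, rho, lam):
    """f_ZIP(x; rho, lam): a point mass at zero mixed with a Poisson distribution."""
    x = np.asarray(x)
    pmf = (1.0 - rho) * poisson.pmf(x, lam)
    return np.where(x == 0, rho + pmf, pmf)

counts = np.array([0, 0, 1, 3, 0, 7])
print(zip_pmf(counts, rho=0.6, lam=2.5))
```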

The so-called piece-wise categorical distribution, which has been used for demand forecasting (Seeger et al., 2016), uses a mixture of discrete probabilities for low counts and a shifted probability distribution for counts equal to or greater than kmax. The piece-wise categorical version of the Poisson distribution (PCP) is

fPCP(x; τ, λ) = { τ_k,                        x = k, 0 ≤ x < kmax,
                  τ_kmax fP(x − kmax; λ),     x ≥ kmax,    (8)

where τ_k, k = 0, …, kmax − 1, is the probability of a count being equal to k, and τ_kmax is the probability that the count is kmax or above.
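The sketch below evaluates Equation (8) for a few counts, with kmax = 4 as used later in Section 2.7. The probabilities τ and the rate λ are arbitrary placeholders; in the model they come from the decoder network for each gene.

```python
import numpy as np
from scipy.stats import poisson

def pcp_pmf(x, tau, lam, k_max=4):
    """tau has length k_max + 1: tau[k] for counts k < k_max, tau[k_max] for the tail."""
    x = np.asarray(x)
    low = np.where(x < k_max, tau[np.minimum(x, k_max - 1)], 0.0)   # categorical part
    tail = tau[k_max] * poisson.pmf(x - k_max, lam)                  # shifted Poisson part
    return np.where(x < k_max, low, tail)

tau = np.array([0.5, 0.2, 0.1, 0.05, 0.15])   # must sum to one
print(pcp_pmf(np.array([0, 2, 4, 9]), tau, lam=3.0))
```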

Lastly, we also investigate the so-called constrained Poisson (CP) distribution (Salakhutdinov and Hinton, 2009), which reparameterises the Poisson to depend explicitly on the sequencing depth Dm of cell m. For the constrained Poisson distribution, the rate parameter λmn for gene n in cell m is

λmn = Dm pmn, (9)

where pmn is constrained to be a probability distribution for each cell: pmn ≥ 0 and ∑_n pmn = 1. This construction ensures the rates are set such that the expected number of counts is equal to the sequencing depth, while we can still model each gene by a Poisson distribution.
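A minimal sketch of this parameterisation: per-gene outputs are passed through a softmax to obtain pmn and scaled by the sequencing depth of each cell. The count matrix and the logits are random placeholders standing in for real data and the decoder output.

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(1.0, size=(3, 2000))          # toy count matrix: 3 cells x 2000 genes
depth = counts.sum(axis=1, keepdims=True)          # D_m: sequencing depth of each cell

logits = rng.normal(size=counts.shape)             # would come from the decoder network
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)                  # p_mn >= 0 and sums to one per cell

rates = depth * p                                  # lambda_mn = D_m p_mn
assert np.allclose(rates.sum(axis=1), depth.ravel())   # expected total counts match D_m
```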

Since we do not know the true conditioning structure of the genes, we make the simplifying assumption that they are independent for computational reasons and therefore use feed-forward neural networks.

2.4 Variational auto-encoders

In order to train our deep generative models, we will use the variational auto-encoder framework, which amounts to replacing the intractable marginal likelihood with its variational lower bound and estimating the intractable integrals with low-variance Monte Carlo estimates. The lower bound is maximised for the training set and then evaluated on a test set in order to perform model comparison.

Since the likelihood function pθ(x | z) is modelled using non-linear transformations, the posterior probability distribution pθ(z | x) = pθ(x | z) pθ(z)/pθ(x) becomes intractable. Variational auto-encoders use a variational Bayesian approach where pθ(z | x) is replaced by an approximate probability distribution qφ(z | x) modelled using non-linear transformations parameterised by φ. Thus, the marginal log-likelihood can be written as

log pθ(x) = KL(qφ(z | x) ‖ pθ(z | x)) + L(θ, φ; x). (10)

Here, the first term is the Kullback–Leibler (KL) divergence between the true and the approximate posterior distributions. L(θ, φ; x) can be rewritten as

L(θ, φ; x) = E_{qφ(z|x)}[log pθ(x | z)] − KL(qφ(z | x) ‖ pθ(z)), (11)

where the first term is the expected negative reconstruction error quantifying how well the model can reconstruct x, and the second term is the relative KL divergence between the approximate posterior distribution qφ(z | x) and the prior distribution pθ(z) of z.

Evaluating Equation (10) is intractable due to the appearance of the intractable true posterior distribution. However, as the KL divergence is non-negative, Equation (11) provides a lower bound on the marginal log-likelihood, i.e., log pθ(x) ≥ L(θ, φ; x). The integrals over z in Equation (11) are still analytically intractable, but a low-variance estimator can be constructed using Monte Carlo integration as mentioned above.

For the standard VAE, the approximate posterior distribution is chosen to be a multivariate Gaussian distribution with a diagonal covariance matrix:

qφ(z | x) = N(z; µφ(x), σ²φ(x) I), (12)

where µφ(x) and σ²φ(x) are non-linear transformations of x parameterised by a neural network with parameters φ in a similar fashion as Equation (4) is used to parameterise the generative model. Equation (12) is called the inference model, and it is illustrated in Figure 1B together with the inference model for the GMVAE, which is described in Supplementary Section S1.

The parameters of the generative and inference models are optimised simultaneously using a stochastic gradient descent algorithm. The reparameterisation trick for sampling from qφ(z | x) allows back-propagation of the gradients end-to-end (Kingma and Welling, 2013). We use a warm-up scheme during optimisation as described in Supplementary Section S2, and we report the marginal log-likelihood lower bound averaged over all examples. The VAE reconstruction of x is defined as the mean of the expression values using a stochastic encoding and decoding step: x̃ = ∫ ∑_{x′} x′ p(x′ | z) q(z | x) dz. The variance of the reconstruction can also be computed in a similar fashion.
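To summarise how these pieces fit together, the sketch below computes a single-sample estimate of the lower bound in Equation (11) with the reparameterisation trick, a Poisson likelihood, a standard-normal prior, and a warm-up weight β on the KL term. The encoder outputs, decoder weights, and β are placeholders; in scVAE these come from the trained networks and the warm-up schedule.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
x = rng.poisson(1.0, size=200)                # one cell with 200 genes (toy data)

# Encoder outputs q_phi(z | x) = N(mu, diag(sigma^2)); placeholders here.
mu, log_var = rng.normal(size=10), rng.normal(size=10)
eps = rng.standard_normal(10)
z = mu + np.exp(0.5 * log_var) * eps          # reparameterised sample: z = mu + sigma * eps

W = rng.normal(scale=0.1, size=(10, 200))     # placeholder decoder weights
lam = np.exp(z @ W)                           # decoder rates lambda_theta(z)

log_px_z = poisson.logpmf(x, lam).sum()       # one-sample estimate of E_q[log p_theta(x | z)]
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)   # KL(q_phi(z|x) || N(0, I))

beta = 0.5                                    # warm-up weight on the KL term (annealed 0 -> 1)
elbo = log_px_z - beta * kl
print(elbo)
```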

2.5 Clustering and visualisation of the latent space

The aim of using a VAE is to capture the joint distribution of multivariate data such as gene expression profiles. Since our method is likelihood-based, model comparison can be performed by evaluating the marginal log-likelihood (lower bound) on a test set for different models. For model comparison with non-likelihood-based methods or methods that preprocess the data, we need other metrics such as clustering quality.


For latent-variable methods, clustering in the latent space is the most obvious approach. Therefore, we do this as well.

The categorical latent variable used in the GMVAE model gives a built-in clustering by assigning each cell to the mixture component with the highest responsibility qφ(y | x). The responsibilities are computed as part of the inference network as described in Supplementary Section S1. The standard VAE model does not have this feature built in, so instead k-means clustering (kM) is used to cluster the latent representations of the cells for this model. We also test k-means clustering for the GMVAE model.

We use two clustering metrics to measure the similarity between a clustering found by one of our models and the clustering given in the data set. These are the adjusted Rand index (Hubert and Arabie, 1985), Radj, and the adjusted mutual information (AMI; Vinh et al., 2009). For both metrics, two identical clusterings have a value of 1, while the expected value of the metrics is 0 (they are not bounded below).
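A short sketch of this evaluation with scikit-learn, combining the k-means clustering used for the standard VAE with the two metrics. The latent means and the reference labels are random placeholders standing in for a trained model's latent space and the annotated cell types.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

rng = np.random.default_rng(3)
latent_means = rng.normal(size=(500, 25))        # latent means for 500 cells, 25 dimensions
true_labels = rng.integers(0, 9, size=500)       # known cell types (placeholder)

predicted = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(latent_means)

print("R_adj =", adjusted_rand_score(true_labels, predicted))
print("AMI   =", adjusted_mutual_info_score(true_labels, predicted))
```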

To visualise the latent space in which the latent representations reside, the mean values of the approximate posterior distribution qφ(z | x) are plotted instead of the samples. This gives a better representation of the probability distribution over the latent representation of each cell. For the standard VAE, this is just z̄ = µφ(x), and for the GMVAE, z̄ = E_y[µφ(x, y)]. For latent spaces with only two dimensions, we plot these directly, but to visualise higher-dimensional latent spaces, we project them to two dimensions using t-distributed stochastic neighbour embedding (t-SNE; van der Maaten and Hinton, 2008).

2.6 RNA-seq data sets

We model four different RNA-seq data sets as summarised in Table 1. The first data set is of single-cell gene expression counts of peripheral blood mononuclear cells (PBMC) used for generation of reference transcriptome profiles (Zheng et al., 2017). It is combined from nine separate data sets of different purified cell types and published by 10x Genomics³, and we use the filtered gene–cell matrices for each data set. Another data set of purified monocytes is also available, but since another cell type was discovered in this data set (Zheng et al., 2017), and since no separation of the two is available in the data set, we only use the other nine data sets, which are listed in Table 1. This table also shows the high degree of sparsity of the data set.

The second data set is a bulk RNA-seq data set made publicly available by TCGA (Weinstein et al., 2013).⁴ This data set consists of RSEM (Li and Dewey, 2011) expected gene expression counts for samples of human cancer cells from 29 tissue sites.

³ Available online at https://support.10xgenomics.com/single-cell-gene-expression/datasets/ (under "Single Cell 3′ Paper: Zheng et al. 2017").

⁴ Available online at https://xenabrowser.net/datapages/?dataset=tcga_gene_expected_count&host=https://toil.xenahubs.net.

    Table 1. Overview of gene expression data sets.

Data set     Examples    Features   Classes   Sparsity
PBMC           92 043      32 738      9ᵃ       0.980
MBC         1 306 127      27 998      —        0.928
TCGA           10 830      58 581     29        0.525
GTEx           11 688      54 271     53        0.472

ᵃ CD19+ B cells, CD34+ cells, CD4+ helper T cells, CD4+/CD25+ regulatory T cells, CD4+/CD45RA+/CD25− naïve T cells, CD4+/CD45RO+ memory T cells, CD56+ natural killer cells, CD8+ cytotoxic T cells, CD8+/CD45RA+ naïve cytotoxic T cells.

Gene IDs are used in this data set, so the available mapping from TCGA is used to map the IDs to gene names. The difference in number of genes and sparsity from the original data set is not large. The two remaining data sets are described, modelled, and analysed in Supplementary Sections S4–S5.

2.7 Experiment setup

Each data set is divided into training, validation, and test sets using an 81 %–9 %–10 % split with uniformly random selection. The training sets are used to train the models, the validation sets are used for hyperparameter selection during training, and the test sets are used to evaluate the models after training.

For the deep neural networks, we examine different network architectures to find the optimal one for each data set. We test deep neural networks with both one and two hidden layers of 100, 250, 500, or 1000 units each. We also experiment with latent spaces of 10, 25, 50, and 100 dimensions. A standard VAE with the negative binomial distribution as the likelihood function (a VAE-NB model) is used for these experiments. Using the optimal architecture, we test the link functions introduced in Section 2.3 as likelihood functions for both the standard VAE as well as the GMVAE. We train and evaluate each model three times for each likelihood function.

The hidden units in the deep neural networks use the rectifier function as their non-linear transformation. For real, positive parameters (λ for the Poisson distribution, r for the negative binomial distribution, σ in the Gaussian distribution), we model the natural logarithm of the parameter. For the standard deviation σ in the Gaussian-mixture model, however, we use the softplus function, log(1 + e^x), to constrain the possible covariance matrices to be only positive-definite. The units modelling the probability parameters in the negative binomial distribution, p, and the zero-inflated distributions, πk, use the sigmoid function, while for the categorical distributions in the constrained Poisson distribution, the piece-wise categorical distributions, as well as for the Gaussian-mixture model, the probabilities are given as logits with linear functions, which can be evaluated as probabilities using the softmax normalisation.


Additionally, for the piece-wise categorical distributions, we choose the cut-off K to be 4, since this strikes a good balance between the number of low and high counts for all examined data sets (see Supplementary Fig. S1 for an example of this). For the k-means clustering and the GMVAE model, the number of clusters is chosen to be equal to the number of classes, if cell types were provided.
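The sketch below collects these link functions in NumPy, mapping unconstrained network outputs to valid parameter ranges. The raw outputs are random placeholders; in the model they are the final linear layer of the relevant network.

```python
import numpy as np

rng = np.random.default_rng(5)
raw = rng.normal(size=2000)                      # unconstrained outputs for 2000 genes

lam = np.exp(raw)                                # positive parameters (lambda, r): model the log
sigma = np.log1p(np.exp(raw))                    # softplus for sigma in the Gaussian mixture
p = 1.0 / (1.0 + np.exp(-raw))                   # sigmoid for probabilities (NB p, zero-inflation)

logits = rng.normal(size=(2000, 5))              # per-gene logits, e.g. for the piece-wise categorical
tau = np.exp(logits - logits.max(axis=1, keepdims=True))
tau /= tau.sum(axis=1, keepdims=True)            # softmax: categorical probabilities
```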

The models are trained using one Monte Carlo sample for each example and using the Adam optimisation algorithm (Kingma and Ba, 2014) with a mini-batch size of 100 and a learning rate of 10⁻⁴. Additionally, we use batch normalisation (Ioffe and Szegedy, 2015) to improve the convergence speed of the optimisation. We train all models for 500 epochs and use early stopping with the validation marginal log-likelihood lower bound to select parameters θ and φ. We also use the warm-up optimisation scheme for 200 epochs for VAE and GMVAE models exhibiting overregularisation. To test whether the non-linearities in the neural network contribute to a better performance, we also run our models without hidden layers in the generative process. This corresponds to factor analysis (FA) with a generalised linear model link function. The optimal network architecture for the inference process is found in the same way as for the VAE models, and we similarly cluster the latent space using k-means clustering. In addition, we investigate the performance of a baseline linear (link) factor analysis (LFA). LFA assumes continuous data, so we first normalise the read counts using the counts-per-million method (Law et al., 2014) and then apply a log transformation to one plus the read count. We use a 25-dimensional Gaussian latent variable vector and cluster the latent space using k-means clustering. Both LFA and k-means clustering are performed using scikit-learn (Pedregosa et al., 2011).
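A compact sketch of these optimisation settings is given below: Adam with the stated learning rate, the stated batch and epoch counts, and a warm-up of the KL weight over the first 200 epochs. The linear annealing form and the helper name are illustrative assumptions; the actual schedule is described in Supplementary Section S2, and the training-loop internals are omitted.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)   # Adam, learning rate 10^-4
batch_size, n_epochs, warmup_epochs = 100, 500, 200

def warmup_weight(epoch):
    """Assumed linear warm-up: the KL weight beta grows from 0 to 1 over the warm-up epochs."""
    return min(1.0, epoch / warmup_epochs)

for epoch in range(n_epochs):
    beta = warmup_weight(epoch)
    # For every mini-batch of `batch_size` cells: compute loss = -(log_px_z - beta * kl),
    # apply gradients with `optimizer`, and track the validation lower bound for
    # early stopping (loop body omitted in this sketch).
```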

We compare our best-performing models with Seurat, scvis, and scVI for the PBMC data set. Seurat has in turn been compared to several other single-cell clustering methods (Duò et al., 2018). Since Seurat uses the full data set, to make a fair comparison, we also train our best-performing models on the full data set. Because scvis is a visualisation method, we compare our models to it visually. In addition, we train our best-performing models using only two latent variables to visualise the latent space without using t-SNE. scVI is evaluated on the original counts similarly to our method, so we can compare the test marginal log-likelihood lower bounds directly. We also perform k-means clustering on the latent space of scvis and scVI and compute the adjusted Rand indices of these clusterings to compare them with our method.

2.8 Software implementation

The models described above have been implemented in Python using the machine-learning software library TensorFlow (Abadi et al., 2015) in an open-source and platform-independent software tool called scVAE (single-cell variational auto-encoders), and the source code is freely available online⁵ along with a user guide and examples. The RNA-seq data sets presented in Section 2.6 along with others are easily accessed using our tool. It also has extensive support for sparse data sets and can thus be used on the largest data sets currently available, as demonstrated in Supplementary Section S4.

3 Results

Both the standard VAE and the GMVAE have been used to model both single-cell and bulk RNA-seq data sets (see Supplementary Sections S4–S5 for more). The performance of different likelihood functions has been investigated, and the latent spaces of the models have been examined.

3.1 scRNA-seq data of purified immune cells

We first modelled the scRNA-seq data set of gene expression counts for peripheral blood mononuclear cells (PBMC; Zheng et al., 2017). To assess the optimal network architectures for these data sets, we first trained VAE-NB models with different network architectures, as mentioned in Section 2.7, on the data set with all genes included. This was carried out as a grid search over the number of hidden units and latent dimensions (Supplementary Fig. S2). We found that using two smaller hidden layers (of 100 units each) in the generative and inference processes yielded better results than only using one hidden layer, and a high-dimensional latent space of 100 dimensions resulted in the highest test marginal log-likelihood lower bound. A similar network-architecture grid search was carried out using FA-NB models, and the optimal architecture was two large hidden layers of 1000 units each with a latent space of only 25 dimensions.

With these network architectures, different discrete likelihood functions were examined for the VAE, GMVAE, and FA models. The mean marginal log-likelihood lower bound and the mean adjusted Rand index for each model and likelihood function evaluated on the test set are plotted against each other in Figure 2 for a subset of model/link-function combinations (see Supplementary Fig. S3 for the complete version). From this figure, it is clear that using a GMVAE model with the negative binomial distribution yielded the highest lower bound as well as the highest Rand index, with the zero-inflated Poisson distribution as the next best. For each likelihood function, the GMVAE model yielded the highest lower bound, and in most cases, the distribution-based clustering of the GMVAE model outperforms k-means clustering in the latent space of the same model or of the other models on average. All scVAE models outperform the LFA baseline model. This shows that complex non-linear models really make a difference for the marginal log-likelihood of the data.

⁵ https://github.com/scvae/scvae.


It should be noted, however, that there does not seem to be a strong positive correlation between Rand index and marginal likelihood lower bound. Similar conclusions are obtained using adjusted mutual information. The latent space of the GMVAE-NB model for the test set of the PBMC data set can be seen in Figure 3 (see Supplementary Fig. S4 for a larger version), which shows different clusters corresponding to distinct cell types, while more similar cell types are clustered close to each other or mixed together. In particular, cells cluster into well-known distinct immune cell populations: B cells, CD34+ cells, natural killer cells, and T cell sub-populations. Cytotoxic T cells, which express CD8, cluster together, and T cells expressing CD4 group together, while memory T cells, which can express both, are in a cluster between the two. Within the two CD4+ and CD8+ clusters, naïve and either regulatory or vanilla cells, respectively, form distinguishable groupings. Additionally, in the former cluster, helper T cells are mixed in with these two cell types, which makes sense, since their subtypes can express the same genes as either one.

Furthermore, cells for several cell types seem to be split into subtypes, which could be either a biological or a batch effect. There is more overlap of the cell types in the latent spaces for the VAE-NB, FA-NB, and LFA models (Supplementary Fig. S5–7). It is also possible to visualise what each latent dimension encodes about the cells by looking at how cell types are correlated with a given latent dimension (an example is shown in Supplementary Fig. S8). Finally, for the GMVAE model we investigated whether we could use the test marginal log-likelihood lower bound to identify the number of components. The result of scanning the number of components K from 5 to 13 is shown in Supplementary Fig. S9. The optimal value was found to be K = 9, which is the number of cell types in the data set.

The PBMC data set has also been analysed using Seurat at different resolutions (see Supplementary Table S1), and the highest adjusted Rand index obtained was Radj = 0.634 at a resolution of 0.2. This yields nine clusters, which is also the number of cell types in the data set. This performance of Seurat has been included in Figure 2, and even though Seurat is trained on the full data set, two of the GMVAE models (using withheld data) perform better on average. A GMVAE-NB model was also trained on the full data set. This resulted in a lower bound of L = −2046.0 ± 0.4 and a Rand index of Radj = 0.656 ± 0.039 (see Supplementary Fig. S10 for its latent space), which is better than the best Rand index obtained using Seurat. In the latent space, we again find that the cells are clustered and arranged according to which CD molecule they express, and it also looks like some cells are clustered together in subtypes.

We also trained a scvis model with standard parameters as well as a GMVAE-NB model with a 2-d latent variable on the PBMC data set (see Supplementary Fig. S11 for their latent spaces).

Fig. 2. Comparison of VAE and GMVAE models with different likelihood functions trained and evaluated on the PBMC data set. For each combination the mean and the standard deviation of the adjusted Rand index are plotted against the marginal log-likelihood lower bound. These are compared to scVI and, on the adjusted Rand index only, to LFA, Seurat (trained and evaluated on the full data set), and scvis.

Fig. 3. Latent space of the median-performing GMVAE-NB model trained and evaluated on the PBMC data set. The encodings of the cells in the latent space have been embedded in two dimensions using t-SNE and are colour-coded by cell type. Clear separation corresponding to different cell types can be seen, but some similar cell types are also clustered close together or mixed together.


Comparing the latent spaces of the two models, both succeed in separating out cells with dissimilar cell types, while struggling with cells of similar cell types, but the scvis model handles this better. This 2-d-latent-variable GMVAE-NB also has both a worse lower bound of L = −2103.7 ± 1.3 and a worse Rand index of Radj = 0.420 ± 0.014 compared to the GMVAE-NB using the optimal architecture, but it is comparable to scvis, which has a Rand index of Radj = 0.436 ± 0.009, as shown in Figure 2. The t-SNE of the latent space of this model also has a much better separation of the cells than the latent space of the scvis model.

Finally, we compared our method to scVI, and the resulting mean lower bound and Rand index are shown in Figure 2. As can be seen from this figure, scVI significantly outperforms our models on the lower bound and is comparable to the standard VAE models in Rand index, but both comparable GMVAE models (with negative binomial distributions) are significantly better in ARI. Looking at the latent space of the scVI model (Supplementary Fig. S12), it is clear that the GMVAE model achieves more separation of the cells, especially of the T cells.

3.2 Bulk RNA-seq data of human cancer cells

The bulk RNA-seq data set of RSEM expected gene expression levels for human cancer cell samples from TCGA (Weinstein et al., 2013) was also modelled. On a GPU with 12 GB of memory, we cannot model the data set using all genes with the optimal network architecture. Since the majority of the variance is captured in the 5000–10 000 most varying genes, we choose to limit the number of genes to 5000. As with the previous data set, different network architectures (Supplementary Fig. S13) and likelihood functions (Supplementary Fig. S14) were investigated for the VAE, GMVAE, and FA models. There is a clear difference in marginal log-likelihood lower bound between using either of the negative binomial distributions and any of the Poisson distributions, but for each likelihood function, the GMVAE model yielded the highest lower bound. k-means clustering on the VAE and GMVAE latent spaces for all likelihood functions resulted in the best adjusted Rand indices in general, with the best one being for the GMVAE-P-kM model. Contrary to the sparse single-cell data sets, for the TCGA data set only small performance gains were observed compared to the LFA baseline model. The latent space of the GMVAE-NB model for the test set of the TCGA data set (Supplementary Fig. S15) shows that samples belonging to different tissue sites are clearly separated. The latent spaces of the VAE-NB, FA-NB, and LFA models look quite similar despite the difference in Rand index (Supplementary Fig. S16–18).

4 Discussion

We have shown that variational auto-encoders (VAEs) can be used to model single-cell and bulk RNA-seq data. The Gaussian-mixture VAE (GMVAE) model achieves higher marginal log-likelihood lower bounds on all data sets as well as higher Rand indices for the PBMC data set compared to the corresponding standard VAE, factor analysis (FA), and LFA models. The latent spaces for the GMVAE also showed the clearest separation according to cell type. Its built-in clustering helps the GMVAE achieve this.

In general, we only find a weak correlation between the test marginal log-likelihood lower bound and the Rand indices. Since the likelihood is a measure of how well we are able to capture the multivariate transcript distribution and the Rand index is a measure of alignment with predefined cell types, the weak correlation simply implies that there are other important sources of variation in the data not directly related to cell type. Factor models (for count data and linear for transformed data) lead to significantly worse lower bounds, demonstrating that non-linear transformations can more easily express more subtle co-variation patterns in the data sets. We found that the negative binomial distribution modelled most data sets for most models the best, and that it was a close second to its zero-inflated version in the remaining cases. So explicit modelling of zero counts seems not to be necessary. For the much less sparse bulk RNA-seq data set (TCGA), the advantage of working with the more advanced models was less pronounced. This might be explained by the diversity within each data sample and a lesser need for accurately dealing with low and zero expression.

In modelling four different data sets, we have found that the following guidelines help in achieving a higher marginal log-likelihood lower bound score on a data set. The network architecture should adapt to the data set: Use a large network when there are more cells and/or more cell types in the data set and when the data is less sparse. The models for all data sets also benefited from deeper network architectures. If the optimal network architecture is too large for computation, one can limit the number of genes to the most varying genes.⁶ For the likelihood function, try using the negative binomial distribution (or its zero-inflated version) first. In addition, the warm-up optimisation scheme proved valuable in avoiding overregularisation when modelling the scRNA-seq data sets.

Compared to Seurat (Satija et al., 2015), the GMVAE-NB model achieved a higher adjusted Rand index on average for the PBMC data set. This was achieved working directly on the transcript counts with no preprocessing of the data. With variational auto-encoders being generative models, they can also model unseen data (a test set or new data), whereas Seurat cannot.

⁶ We did this for the bulk RNA-seq data set, and the optimal network architecture stayed the same.


By modelling the cells with the GMVAE model using the optimal configuration and then visualising the latent space using t-SNE, we also achieve a better visual representation of cells than the scvis model (Ding et al., 2018). Even though the scVI model (Lopez et al., 2018) achieves a better test marginal log-likelihood lower bound than any of our models on the PBMC data set, corresponding GMVAE models still obtain latent spaces with more separate groupings. In Supplementary Section S4, we also showed that our method can scale up to very large data sets like the data set of 1.3 million mouse brain cells from 10x Genomics.

5 Summary and conclusion

We show that two variations of the variational auto-encoder are able to model gene expression counts using appropriate discrete probability distributions as likelihood (link) functions, and we provide a software implementation. These models are probabilistic and put a lower bound on the marginal likelihood of the data sets, enabling us to perform likelihood-based model comparison. We have applied both models successfully to single-cell and – to a lesser degree – bulk RNA-seq data sets, and the GMVAE model achieves better clustering and representation of the cells than Seurat, scvis, scVI, and a baseline method using size-factor normalisation and log transformation, linear factor analysis, and k-means clustering. However, scVI achieves a better bound for the likelihood, which might be attributed to including the sequencing depth as a latent variable in the scVI model. Building clustering into the Gaussian-mixture variational auto-encoder, we have a model that can cluster cells into cell populations without the need of a pipeline with several parameter-selection steps. Having both an inference process and a generative process makes it possible to project new data onto an existing latent space, or even simulate new data from samples in the latent space. This means that new cells can be introduced to an already trained model, and it could enable combining the latent representations of two cells to generate a cell and the transitional states in-between. In the scVI model, a zero-inflated negative binomial distribution is used. We observe that using a negative binomial distribution is overall a better model for our method. As in Lopez et al. (2018), we also provide batch correction within our framework.

As for future extensions, we would also like to make the models more flexible by adding more latent variable layers (Sønderby et al., 2016) and make the models learn the number of clusters (Rasmussen, 2000) rather than setting it a priori. We could also use semi-supervised learning and active learning to better classify cells and identify cell populations and associated response variables. This would also help with transfer learning, enabling modelling of multiple data sets with the same model. As mentioned above, it should also be possible to combine encodings of the cells in the latent space and produce in-between cells like Lotfollahi et al. (2018).

We would also like to extend our investigation of what the dimensions of the latent variables encode (Kinalis et al., 2019). We note that it is possible to apply these models to data sets with multiple modalities such as RNA-seq and exome sequencing (Brouwer and Lió, 2017).

Acknowledgements

This work was supported by the Lundbeck Foundation to THP and the Novo Nordisk Foundation to THP. We would like to thank Lars Maaløe for discussions about multimodal VAEs and technical help for implementing these. We also want to thank Aki Vehtari for discussions about discrete probability distributions and model comparison. Lastly, we gratefully acknowledge the support of NVIDIA Corporation with the donation of Tesla Titan Xp GPUs used for this research.

References

Abadi, M. et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Bowman, S.R. et al. (2016). Generating sentences from a continuous space. arXiv preprint.

Brouwer, T. and Lió, P. (2017). Bayesian hybrid matrix factorisation for data integration. In: Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 557–566. PMLR.

Chen, L. et al. (2016). Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model. BMC Bioinformatics, 17(S1), S9.

Cui, H. et al. (2017). Boosting gene expression clustering with system-wide biological information: A robust autoencoder approach. bioRxiv.

Dilokthanakul, N. et al. (2016). Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint.

Ding, J. et al. (2018). Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun., 9(1), 2002.

duVerle, D.A. et al. (2016). CellTree: an R/Bioconductor package to infer the hierarchical structure of cell populations from single-cell RNA-seq data. BMC Bioinformatics, 17(1), 363.

Duò, A. et al. (2018). A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research, 7, 1141.


Eraslan, G. et al. (2018). Single cell RNA-seq denoising using a deep count autoencoder. bioRxiv.

Gatys, L.A. et al. (2015). A neural algorithm of artistic style. arXiv preprint.

Ghahramani, A. et al. (2018). Generative adversarial networks uncover epidermal regulators and predict single cell perturbations. bioRxiv.

Goodfellow, I.J. et al. (2014). Generative adversarial nets. arXiv preprint.

Gupta, A. et al. (2015). Learning structure in gene expression data using deep architectures, with an application to gene clustering. bioRxiv.

Haghverdi, L. et al. (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol., 36, 421.

Hubert, L. and Arabie, P. (1985). Comparing partitions. J. Classif., 2(1), 193–218.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint.

Jiang, Z. et al. (2017). Variational deep embedding: An unsupervised and generative approach to clustering. In: Sierra, C. (ed.), Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pp. 1965–1972. IJCAI Organization.

Johnson, M. et al. (2016). Composing graphical models with neural networks for structured representations and fast inference. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 2946–2954. NeurIPS.

Kinalis, S. et al. (2019). Deconvolution of autoencoders to learn biological regulatory modules from single cell mRNA sequencing data. BMC Bioinformatics, 20(1), 379.

Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint.

Kingma, D.P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint.

Kingma, D.P. et al. (2014). Semi-supervised learning with deep generative models. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3581–3589. NeurIPS.

Kiselev, V.Y. et al. (2017). SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods, 14, 483.

Law, C.W. et al. (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol., 15(2), R29.

Li, B. and Dewey, C.N. (2011). RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics, 12(1), 323.

Lin, W. et al. (2018). Variational message passing with structured inference networks. In: International Conference on Learning Representations.

Lopez, R. et al. (2018). Deep generative modeling for single-cell transcriptomics. Nat. Methods, 15(12), 1053–1058.

Lotfollahi, M. et al. (2018). Generative modeling and latent space arithmetics predict single-cell perturbation response across cell types, studies and species. bioRxiv.

Maaløe, L. et al. (2017). Semi-supervised generation with cluster-aware generative models. arXiv preprint.

Oshlack, A. et al. (2010). From RNA-seq reads to differential expression results. Genome Biol., 11(12), 220.

Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.

Rasmussen, C.E. (2000). The infinite Gaussian mixture model. In: Solla, S.A., Leen, T.K., and Müller, K.R. (eds.), Advances in Neural Information Processing Systems 12, pp. 554–560, Cambridge, MA, USA. MIT Press.

Regev, A. et al. (2017). The human cell atlas. eLife, 6.

Rezende, D.J. et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In: Xing, E.P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1278–1286. PMLR.

Roberts, A. et al. (2017). Hierarchical variational autoencoders for music.

Salakhutdinov, R. and Hinton, G. (2009). Semantic hashing. Int. J. Approx. Reason., 50(7), 969–978.

Satija, R. et al. (2015). Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol., 33(5), 495–502.

Seeger, M.W. et al. (2016). Bayesian intermittent demand forecasting for large inventories. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 4646–4654. NeurIPS.


Sun, Z. et al. (2017). DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data. Bioinformatics, 34(1), 139–146.

Sønderby, C.K. et al. (2016). Ladder variational autoencoders. arXiv preprint.

Tan, J. et al. (2014). Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. In: Biocomputing 2015, pp. 132–143, Singapore. World Scientific.

Tan, J. et al. (2016). ADAGE-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe–host interactions. mSystems, 1(1), e00025-15.

Vallejos, C.A. et al. (2017). Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods, 14(6), 565–571.

van der Maaten, L.J. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res., 9, 545.

Vinh, N.X. et al. (2009). Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 1073–1080, New York, NY, USA. ACM.

Wang, D. and Gu, J. (2018). VASC: Dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genomics, Proteomics & Bioinformatics, 16(5), 320–331.

Way, G.P. and Greene, C.S. (2017). Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. In: Biocomputing 2018, Singapore. World Scientific.

Weinstein, J.N. et al. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet., 45, 1113–1120.

Zheng, G.X.Y. et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat. Commun., 8, 14049.

