
ZhuSuan: A Library for Bayesian Deep Learning∗

Jiaxin Shi1 [email protected]

Jianfei Chen1 [email protected]

Jun Zhu1 [email protected]

Shengyang Sun2 [email protected]

Yucen Luo1 [email protected]

Yihong Gu1 [email protected]

Yuhao Zhou1 [email protected]

1Department of Computer Science & Technology, TNList Lab, CBICR Center
1State Key Lab of Intelligent Technology & Systems
2Department of Electronic Engineering

Tsinghua University

Beijing, 100084, China

Abstract

In this paper we introduce ZhuSuan, a Python probabilistic programming library for Bayesian deep learning, which conjoins the complementary advantages of Bayesian methods and deep learning. ZhuSuan is built upon Tensorflow. Unlike existing deep learning libraries, which are mainly designed for deterministic neural networks and supervised tasks, ZhuSuan is featured for its deep root in Bayesian inference, thus supporting various kinds of probabilistic models, including both traditional hierarchical Bayesian models and recent deep generative models. We use running examples to illustrate probabilistic programming on ZhuSuan, including Bayesian logistic regression, variational auto-encoders, deep sigmoid belief networks and Bayesian recurrent neural networks.

Keywords: Bayesian Inference, Deep Learning, Probabilistic Programming, Deep Generative Models

1. Introduction

Recent years have seen great advances in deep learning (LeCun et al., 2015) and its success in many applications, such as speech recognition (Hinton et al., 2012), computer vision (Krizhevsky et al., 2012), language processing (Sutskever et al., 2014), and computer games (Silver et al., 2016). One good lesson learned from this practice is that a deeply architected model can be fit well by leveraging advanced optimization algorithms (e.g., stochastic gradient descent), a large training set (e.g., ImageNet), and powerful computing devices (e.g., GPUs). Although the great expressiveness of deep neural networks (DNNs) has made them a first choice for many complex fitting problems, especially tasks involving a mapping from an input space to an output space (e.g., classification and regression), the results they give are usually point estimates. Such a deterministic approach does not account for uncertainty, which is essential in every part of a learning machine (e.g., data, estimation, inference, prediction and decision-making) due to the randomness of the

∗. J. Zhu is the corresponding author. S. Sun is now with the Department of Computer Science, University of Toronto.


physical world, incomplete information, and measurement noise. It has been demonstrated that deterministic neural networks can be vulnerable to adversarial attacks (Szegedy et al., 2013; Nguyen et al., 2015), partly because of the lack of uncertainty modeling, which may hinder their application in scenarios where representing uncertainty is of crucial importance, such as automated driving (Bojarski et al., 2016) and healthcare (Miotto et al., 2017).1 On the other hand, the probabilistic view of machine learning offers mathematically grounded tools for dealing with uncertainty, i.e., Bayesian methods (Ghahramani, 2015). Bayesian methods also provide a theoretically sound approach to incorporating structural bias (Lake et al., 2015) and domain knowledge as prior or posterior constraints (Mei et al., 2014) to achieve efficient learning from a small number of training samples. Thus it is beneficial to combine the complementary advantages of deep learning and Bayesian methods, which in fact has been drawing tremendous attention in recent years (Gal, 2016).

Moreover, as most of the success in deep learning comes from supervised tasks that require large amounts of labeled data, research in this area is paying more and more attention to unsupervised learning, a long-standing goal of artificial intelligence (AI). Probabilistic models and Bayesian methods are commonly treated as principled approaches to modeling unlabeled data for pure unsupervised learning or its hybrid with supervised learning (aka. semi-supervised learning). Recently, the popularity of deep generative models again demonstrates the promise of combining deep neural networks with probabilistic modeling, which has shown superior results in image generation (Kingma and Welling, 2013; Goodfellow et al., 2014; Radford et al., 2015), semi-supervised classification (Kingma et al., 2014; Salimans et al., 2016) and one-shot learning (Rezende et al., 2016).

We call this emerging direction, which conjoins the advantages of Bayesian methods and deep learning, Bayesian Deep Learning (BDL). The scope of BDL covers traditional Bayesian methods, the deep learning methods where probabilistic inference plays a key role, and their intersection. One unique feature of BDL is that the deterministic transformation between random variables can be automatically learned from data under an expressive parametric formulation, typically using deep neural networks (Johnson et al., 2016), while in traditional Bayesian models the transformation tends to have a simple analytical form (e.g., the exponential function or inner product). One key challenge for Bayesian deep learning is posterior inference, which is typically intractable for such models and needs sophisticated approximation techniques. Although much progress has been made on both variational inference and Monte Carlo methods (Zhu et al., 2017), these methods remain too difficult for practitioners to pick up. Moreover, although variational inference and Monte Carlo methods have their general recipes, it is still quite involved for an experienced researcher to derive the equations and implement every detail for a particular model. Such a procedure is error prone and takes a long time to debug.

In this paper, we present ZhuSuan, a probabilistic programming library for Bayesian Deep Learning.2 We build ZhuSuan upon Tensorflow (Abadi et al., 2016) to leverage its computation graphs for flexible modeling. With ZhuSuan, users can enjoy the powerful fitting

1. Some recent efforts have been made towards improving the robustness of deep neural networks against adversarial samples by adversarial training (Szegedy et al., 2013), reverse cross-entropy training (Pang et al., 2017) or Bayesian averaging (Li and Gal, 2017).

2. ZhuSuan is named after the Chinese name of the abacus, which is the oldest calculating machine and has been recognized as the fifth greatest innovation in China.


and multi-GPU training capabilities of deep learning, while at the same time they can use probabilistic models to model the complex world, exploit unlabeled data and deal with uncertainty by applying principled Bayesian inference. We provide running examples to illustrate how intuitive it is to program with ZhuSuan, including Bayesian logistic regression, variational auto-encoders, deep sigmoid belief networks and Bayesian recurrent networks. More examples can be found in our code repository https://github.com/thu-ml/zhusuan. The key difference of ZhuSuan from other probabilistic programming libraries lies in its flexibility, which benefits from the deep learning paradigm on which the library is built, and its treatment of model reuse. More detailed comparisons with two closely related works, Edward (Tran et al., 2016) and PyMC3 (Salvatier et al., 2016), are provided in this paper.

2. Bayesian Deep Learning

We first overview the basics of Bayesian methods and their intersection with deep learning, and define the notation.

2.1 Modeling

As a conjunction of graph theory and probability theory, probabilistic graphical models (PGMs) (Koller and Friedman, 2009) provide a powerful language for probabilistic machine learning. A PGM defines a joint distribution over a set of random variables using a graph, which gives an intuitive and compact way to read off the conditional independence structure. There are two major types of PGMs, namely, directed ones (aka. Bayesian networks) and undirected ones (aka. Markov random fields, MRFs). Though Bayesian networks and MRFs have different expressiveness, one can sometimes transform a Bayesian network into an MRF or vice versa (Koller and Friedman, 2009). In ZhuSuan, we concentrate on Bayesian networks, which provide an intuitive data generation process (Gelman et al., 2014). This choice is also reflected in the literature of Bayesian deep learning. For example, while modern deep generative models made their first remarkable step with undirected graphs (Hinton et al., 2006), the use of directed models (Kingma and Welling, 2013) has come to dominate the area due to cheap generation and fast inference.


Figure 1: Bayesian networks

Figure 1a shows a Bayesian network with a directed acyclic graph (DAG) structure. This represents a joint probability distribution over the nodes that factorizes according to


the graph:

p(Z,H,U,S,A,N) = p(Z)p(H)p(U|Z)p(S|Z,H)p(A|S)p(N|U,A). (1)

The joint probability is the product of two types of factors: the prior distributions of variables at the roots (e.g., p(Z)) and the conditional probability distributions of other variables given their parents (e.g., p(S|Z,H)). The conditional distributions are also known as conditional probability tables (CPTs) for discrete variables and conditional probability density (CPD) functions for continuous variables.

We consider the general formulation, where there are two kinds of nodes in a Bayesian network: deterministic nodes (Figure 1b, right, double-line notation) and stochastic nodes (Figure 1b, left). A deterministic node calculates its value by a deterministic function of its parents, while a stochastic node is represented by a standard probability distribution. Although deterministic nodes are often not represented explicitly in order to keep the graph structure concise, it is beneficial to do so, especially when we consider Bayesian deep learning models. This is because the transformation in a traditional model is typically in a simple analytical form, while a Bayesian deep learning model learns it from a very flexible family (e.g., deep neural networks). Hence, explicitly representing deterministic nodes helps with both model definition and inference, as shall be clear soon.

If all the variables are observed in the dataset, learning and inference are easy for a Bayesian network. However, in reality it is more common to have partially observed data due to physical randomness, missing information and measurement noise, so that some variables are hidden. Such models are known as latent variable models (LVMs), which provide a suite of useful tools to unveil the underlying factors (e.g., topics or latent feature representations). Figure 1c illustrates a generic latent variable model, where the gray nodes represent the observed variables (x1:n), the rest (h1:n, θ) are latent variables, and n is the number of observed data samples. This class of models includes both global and local latent variables. By local we mean that the latent variable only has impact on its paired observation (e.g., h is paired with x), while a global variable influences all observations (e.g., θ).

2.2 Inference

For a latent variable model, posterior inference is the key step to unveil the latent factors for a particular input. To keep the notation simple, in this subsection we consider a highly abstract model p(z,x) = p(z)p(x|z), where z represents all latent variables, including the global and local ones, i.e., z = (h, θ), and x denotes observations. The vanilla Bayes' rule provides a principle to derive the posterior distribution:

p(z|x) = p(z,x)/p(x) = p(z)p(x|z)/p(x). (2)

As analyzed in (Zhu et al., 2014), the vanilla Bayes' rule can be generalized to regularized Bayesian inference (RegBayes) by introducing posterior regularization under an information-theoretical formulation.

In general, posterior inference using Bayes' rule or the RegBayes formulation is intractable, except for a few simple examples. Therefore, we have to resort to approximate Bayesian inference methods. Many years of development in this area have led to many fast,


widely applicable approximate inference algorithms, mainly divided into two categories: variational inference and Monte Carlo methods (Zhu et al., 2017). Both are covered below.

2.2.1 Variational Inference

Variational inference (VI) is an optimization-based method for posterior approximation, in which a parametric distribution family qφ(z) is chosen to approximate the true posterior p(z|x) by minimizing the KL divergence between them, KL(qφ(z)‖p(z|x)).3 This KL-divergence minimization is equivalent to maximizing a lower bound of the marginal log likelihood log p(x):

L(x;φ) = log p(x) − KL(qφ(z)‖p(z|x))
       = E_{qφ(z)}[log p(x|z)] − KL(qφ(z)‖p(z)). (3)

In the VI literature, qφ(z) is called the variational posterior, and L(x;φ) is called the evidence lower bound (ELBO). Generally there are two steps in doing VI. The first is to choose the parametric family that will be used as the variational posterior. The second is to solve the optimization problem with respect to the variational parameters φ.

In recent years, benefiting from the joint effort of the Bayesian and deep learning communities, variational inference has been undergoing significant changes, both in the algorithms and in the choices of variational families. On the algorithm side, stochastic approximation with data sub-sampling has enabled VI to scale to large datasets (Hoffman et al., 2013). Meanwhile, direct optimization of variational lower bounds by gradient descent is replacing traditional analytic updates, which makes VI applicable to a broader range of models with non-conjugate dependencies (Graves, 2011; Titsias and Lazaro-Gredilla, 2014). The key challenge that draws most attention in this direction is gradient estimation. Many gradient estimators have been developed with different variance reduction techniques (Paisley et al., 2012; Mnih and Gregor, 2014; Mnih and Rezende, 2016). In particular, for continuous latent variable models, an algorithm named Stochastic Gradient Variational Bayes (SGVB) has been very successful (Kingma and Welling, 2013), due to a clever trick for passing gradients through a stochastic node, now well known as the reparameterization trick. Besides, lower bounds tighter than the ELBO have also been developed, which can likewise be optimized by gradient descent and have been applied to both discrete and continuous latent variable models (Burda et al., 2015; Mnih and Rezende, 2016).
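To make the reparameterization trick concrete, below is a minimal sketch of a single-sample, SGVB-style ELBO estimate with a diagonal Gaussian posterior. The latent dimension n_z and the placeholder log_joint are hypothetical; this illustrates the trick itself, not ZhuSuan's implementation.

import numpy as np
import tensorflow as tf

n_z = 10  # hypothetical latent dimension

# Placeholder log joint log p(x, z); in practice this comes from the model.
def log_joint(z):
    return -0.5 * tf.reduce_sum(tf.square(z))

# Variational parameters of a diagonal Gaussian q(z).
z_mean = tf.Variable(tf.zeros([n_z]))
z_logstd = tf.Variable(tf.zeros([n_z]))

# Reparameterization: z = mean + std * eps with eps ~ N(0, I), so the sample
# is a differentiable function of the variational parameters.
eps = tf.random_normal([n_z])
z = z_mean + tf.exp(z_logstd) * eps

# Log density of q at the sampled z (simplifies because z is built from eps).
log_qz = tf.reduce_sum(-0.5 * np.log(2 * np.pi) - z_logstd - 0.5 * tf.square(eps))

# Single-sample ELBO estimate; gradients flow through the sampled z.
elbo_estimate = log_joint(z) - log_qz
grads = tf.gradients(elbo_estimate, [z_mean, z_logstd])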

On the other side, considerable effort has also been put into the design of variational posteriors. Using neural networks to parameterize the variational posterior has been a common practice adopted by the aforementioned works (Mnih and Gregor, 2014; Kingma and Welling, 2013), serving as one of the key technologies for efficiently training deep generative models. The network is usually fed with observations and is expected to amortize the inference cost across data points by learning to do inference. This type of scheme is often referred to as amortized inference (Rezende and Mohamed, 2015). In summary, VI in the Bayesian deep learning context is more stochastic, differentiable, and amortized than before.

3. We do not explicitly condition z on x in q because it is not necessary to model this explicitly, though much of the modern literature does. In fact, even without explicit modeling, the optimization process will connect them.


2.2.2 Monte Carlo Methods

Unlike VI, which reformulates inference as an optimization problem, Monte Carlo methods (Robert and Casella, 2005) are more direct ways to simulate samples from, or estimate properties of, a particular distribution. Two main techniques extensively used for Bayesian inference are importance sampling and Markov chain Monte Carlo.

Importance Sampling The basic idea of importance sampling is as follows. To estimate µ = E_p[f(x)], where p is a probability density defined over R^d, a probability distribution q called the proposal is introduced. It is required that q(x) > 0 whenever p(x)f(x) > 0. Then

µ = ∫_{R^d} f(x)p(x) dx = ∫_{R^d} [f(x)p(x)/q(x)] q(x) dx = E_q[f(x)p(x)/q(x)]. (4)

The Monte Carlo estimate of µ given by importance sampling is

µ̂ = (1/N) ∑_{i=1}^N f(x_i)p(x_i)/q(x_i),   x_i ∼ q(x). (5)

In the Bayesian inference context, it is often the case that we only have access to an unnormalized version of p, which we denote as p̃, i.e., p = p̃/Z, where Z is the intractable normalizing constant. Self-normalized importance sampling is particularly useful in this situation, giving the estimate

µ̂ = ∑_{i=1}^N w̄_i f(x_i), where w̄_i = w_i / ∑_{j=1}^N w_j,   w_i = p̃(x_i)/q(x_i). (6)

It can be proved that µ̂ is an asymptotically unbiased estimator of µ as N → ∞. For a more detailed introduction, we refer the reader to (Owen, 2013).
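As a small illustration of eqs. (5) and (6), a self-normalized estimate of E_p[f(x)] can be computed from proposal samples as follows (a sketch; f, log_p_tilde, log_q and sample_q are assumed, vectorized user-supplied functions):

import numpy as np

def snis_estimate(f, log_p_tilde, log_q, sample_q, n_samples=1000):
    # Draw samples from the proposal q.
    xs = sample_q(n_samples)
    # Unnormalized log importance weights: log p~(x_i) - log q(x_i).
    log_w = log_p_tilde(xs) - log_q(xs)
    # Self-normalize the weights in a numerically stable way.
    log_w = log_w - np.max(log_w)
    w = np.exp(log_w)
    w_bar = w / np.sum(w)
    # Self-normalized importance sampling estimate of E_p[f(x)], as in eq. (6).
    return np.sum(w_bar * f(xs))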

As introduced above, importance sampling in the strict sense is not a sampling method, because it does not directly draw samples from the target distribution. Instead, it provides a method for estimating properties of a certain distribution (normalized or not). Thus importance sampling can be used anywhere an estimate of some integral over a probability measure is needed. Though the computation in eqs. (5) and (6) is rather direct, some of its application scenarios may not be obvious to users.

Here we focus on application scenarios in Bayesian deep learning, which center on model learning and evaluation. For model learning, it is shown in (Bornschein and Bengio, 2014) that self-normalized importance sampling can be used to estimate gradients of marginal log likelihoods with respect to model parameters, and this technique was proposed to train deep generative models. Even if not used for learning, importance sampling is still a simple and efficient method for evaluating marginal log likelihoods of latent variable models (Wallach et al., 2009). One of the drawbacks of importance sampling is that the estimate has large variance if the proposal is not good enough (Owen, 2013). Therefore people have been exploring the use of adaptive proposals (Cappe et al., 2008; Cornuet et al., 2012). This idea has recently been recast as a new technique called neural adaptive proposals, i.e., a neural network is used to parameterize the proposal distribution, which is adapted towards the optimal proposal by gradient descent (Bornschein and Bengio, 2014). This technique has proved successful in applications like dynamic systems (Gu


et al., 2015), inference acceleration (Du et al., 2017; Paige and Wood, 2016) and learning attention models (Ba et al., 2015).

Markov Chain Monte Carlo Markov chain Monte Carlo (MCMC) is a classic method for generating samples from the posterior distribution in Bayesian statistics. Unlike variational inference, MCMC is known to be asymptotically unbiased and allows the user to trade off computation for accuracy without limit. The basic idea of MCMC is to design a Markov chain whose stationary distribution is the target distribution; samples can then be generated by simulating the chain until convergence. Specifically, for a Markov chain specified by the transition kernel T(z′|z), the most common sufficient condition for p(z|x) being its stationary distribution is the detailed balance condition:

p(z|x)T(z′|z,x) = p(z′|x)T(z|z′,x), for all z, z′.

An arbitrarily chosen initial state z0 and the transition kernel define a joint distribution p(z0, . . . , zt, . . . |x) and marginal distributions p(zt|x). It can be shown that under mild conditions, p(zt|x) converges to p(z|x) as t → ∞ (Robert and Casella, 2005). In practice, it is common to throw away samples from the initial stage before the chain converges; this stage is often referred to as the burn-in (or warm-up) stage. Once we obtain samples from MCMC, there are many ways to use them. For parameter estimation in LVMs, the samples can be used in the Monte Carlo EM (MCEM) algorithm (Wei and Tanner, 1990).

Throughout the literature there are many ways to design the transition kernel. For example, the simplest form of the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) combines a Gaussian random walk proposal with an accept-reject test for correction, which scales poorly with increasing dimension and complexity of the target distribution. Gibbs sampling (Geman and Geman, 1984) utilizes the structure of the target distribution by taking its element-wise conditional distributions as the transition proposal. However, it requires the conditionals to be analytically computable, which limits its application scope, especially in Bayesian deep learning.
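For concreteness, the random-walk Metropolis-Hastings update mentioned above can be sketched in a few lines (log_p is an assumed, possibly unnormalized log-density over R^d; this is an illustration, not ZhuSuan's implementation):

import numpy as np

def random_walk_mh(log_p, z0, n_steps=1000, step_size=0.1):
    # Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.
    z = np.array(z0, dtype=float)
    log_p_z = log_p(z)
    samples = []
    for _ in range(n_steps):
        # Propose a new state by a Gaussian random walk.
        z_new = z + step_size * np.random.randn(*z.shape)
        log_p_new = log_p(z_new)
        # Accept with probability min(1, p(z_new) / p(z)); p may be unnormalized.
        if np.log(np.random.rand()) < log_p_new - log_p_z:
            z, log_p_z = z_new, log_p_new
        samples.append(z.copy())
    return np.array(samples)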

A powerful MCMC method that efficiently explores high-dimensional continuous distributions is Hamiltonian Monte Carlo (HMC) (Neal et al., 2011). To sample from p(z|x), HMC introduces an auxiliary momentum variable p with the same dimensionality as z, and effectively samples from the joint distribution p(z,p|x) = p(z|x) exp(−(1/2) p^T M^{−1} p), where M is called the mass matrix. Samples are generated by simulating the Hamiltonian dynamics, which governs the evolution of the (z,p) system along continuous time t as:

∂z/∂t = ∇_p H,   ∂p/∂t = −∇_z H. (7)

Here H(z,p) = − log p(z,p|x) is the Hamiltonian of the system. The Hamiltonian dynamics defines a continuous-time transition kernel, and the stationary distribution of the corresponding Markov chain is the desired p(z,p|x), due to the volume-preservation and Hamiltonian-conservation properties of the dynamics. In practice one has to simulate the dynamics in discrete time steps. The leapfrog integrator (Leimkuhler and Reich, 2004) is widely used, since it keeps two vital properties of the Hamiltonian dynamics: it is volume-preserving and approximately Hamiltonian-conserving. Combined with a Metropolis-Hastings step to correct the discretization error of the Hamiltonian, the discrete-time simulation achieves what we need: the Markov chain converges to the


desired distribution. The key advantage of HMC is that it exploits information about the geometry of the target distribution (Betancourt, 2017) while only needing the density and its gradient to proceed.
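A single leapfrog step of the discretized dynamics in eq. (7) can be sketched as follows, assuming an identity mass matrix M = I and an assumed gradient function grad_log_p of log p(z|x) (a simplified illustration, not ZhuSuan's HMC implementation):

import numpy as np

def leapfrog_step(z, p, grad_log_p, step_size):
    # One leapfrog step of the dynamics in eq. (7), with M = I.
    # Half step for the momentum: p <- p + (eps / 2) * grad_z log p(z|x).
    p = p + 0.5 * step_size * grad_log_p(z)
    # Full step for the position: z <- z + eps * p.
    z = z + step_size * p
    # Another half step for the momentum.
    p = p + 0.5 * step_size * grad_log_p(z)
    return z, p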

People have been exploring the use of HMC in Bayesian deep learning. The early work by Radford Neal (Neal, 1995) applied HMC to the inference of Bayesian neural networks, one of the representative models that apply Bayesian methods to capture uncertainty in deep learning. Recent works on deep generative models have applied HMC to improve the variational posterior (Salimans et al., 2015) as well as used HMC-based MCEM algorithms to directly learn the model parameters (Hoffman, 2017; Li et al., 2017; Titsias, 2017).

3. Design Overview

ZhuSuan has been designed around the basic needs of Bayesian deep learning, which, as stated in the last section, mainly include two parts: modeling and inference. For the modeling part, we follow the principle that the code should read like the model definition. This requires model primitives that support:

• Direct building of probabilistic models as computation graphs.

• Easy reuse by changing the states of stochastic nodes (observed or not).

• Arbitrary deterministic modeling with the full power of Tensorflow.

On the inference side, to leverage recent advances in Bayesian deep learning while staying applicable to a broad class of models, ZhuSuan makes efforts towards:

• Unifying traditional and advanced differentiable inference methods in a deep learning paradigm.

• Supporting inference for both continuous latent variables and discrete ones.

Also, from the users' perspective, the design of ZhuSuan has been governed by two principles4:

• Modularity: Make flexible abstractions that allow all parts to be used independently.

• Transparency: Do not hide the whole inference procedure behind abstractions, so as to enable heavy customization by users.

This makes ZhuSuan particularly different from existing probabilistic programming libraries, as will be discussed in a later comparison (Section 5).

4. Features

In this section we introduce the basic features of ZhuSuan, including model primitives and inference algorithms.

4. Both principles are learned from Lasagne (Dieleman et al., 2015).


4.1 Model Primitives

For the reasons explained in Section 2.1, ZhuSuan's model primitives are designed for Bayesian networks. Since the basic structure is a DAG, we introduce the node primitives first, and then see how ZhuSuan stores and fetches information in the graph.

Deterministic nodes In ZhuSuan, users can use any Tensorflow operation for deterministic transformations. This includes various arithmetic operators (e.g., tf.add, tf.tensordot, tf.matrix_inverse), neural network layers (e.g., tf.layers.fully_connected, tf.layers.conv2d) and even control flows (e.g., tf.while_loop, tf.cond). In a Tensorflow computation graph, the output of an operation is called a Tensor. ZhuSuan does not add any higher-level abstraction on Tensors but directly treats them as deterministic nodes in the Bayesian networks. We will see that the other primitives work well directly with Tensors.

Stochastic nodes For stochastic nodes, ZhuSuan provides an abstraction called StochasticTensor, which is named following its deterministic counterpart. Many commonly used probability distributions are implemented and wrapped in StochasticTensors (e.g., Normal, Bernoulli, Categorical), together with some recently developed variants for Bayesian deep learning, such as Gumbel-softmax or Concrete (Jang et al., 2016; Maddison et al., 2016). StochasticTensors inherit most behaviors of Tensors: they can be directly fed into Tensorflow operations and are automatically cast into Tensors when computing with them. When cast into Tensors, their values can be set to samples or observations, or can be determined automatically according to their states given in the context, which is introduced next.

The graph context We have seen above that a Bayesian network can be built transparently with a mix of Tensorflow operations and StochasticTensors. Because large and sophisticated models are common today, it is still painful to deal with all these nodes individually. To help users manage the graph conveniently, ZhuSuan provides a graph context called BayesianNet, which keeps track of all named StochasticTensors constructed in it.

To start a BayesianNet context, use the with statement in Python:

import zhusuan as zs

with zs.BayesianNet() as model:
    # Build the graph.
    pass

The BayesianNet context supports making queries about the stochastic nodes inside it. The query options include the current-state outputs and the local probabilities (the current-state values of the CPDs) at certain nodes.
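For instance, a stochastic node named 'w' built inside the context could be queried for its current-state output and its local log probability, mirroring the usage that appears later in Section 4.2.1 (a usage sketch; the node 'w' is hypothetical here):

# Query the output and the local log probability of a node named 'w'.
w_samples, log_pw = model.query('w', outputs=True, local_log_prob=True)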

We use several representative examples to illustrate the process of building models in ZhuSuan, ranging from simple Bayesian logistic regression to modern Bayesian deep models. In general, programming a probabilistic model in ZhuSuan is as intuitive as reading the corresponding graphical model, which is provided side by side.

Example 1 (Bayesian Logistic Regression, BLR) The generative process of Bayesian logistic regression can be written as:

w ∼ N(0, α^2 I),
y_i ∼ Bernoulli(σ(w^T x_i)), i = 1, . . . , n, (8)


where w, x_i ∈ R^D, y_i ∈ {0, 1}, and σ(·) is the sigmoid function. The graphical model and the corresponding code are shown in Figure 2. Note that data are modeled in a batch.


import tensorflow as tf

with zs.BayesianNet() as model:
    # Define inputs and parameters
    x = tf.placeholder(tf.float32, [None, D])
    log_alpha = tf.Variable(tf.zeros([D]))
    # w ~ N(0, alpha^2 * I)
    w = zs.Normal('w', mean=0., logstd=log_alpha, group_ndims=1)
    # y_logit = w^T x
    y_logit = tf.reduce_sum(tf.expand_dims(w, 0) * x, axis=1)
    # y ~ Bernoulli(sigmoid(y_logit))
    y = zs.Bernoulli('y', y_logit)


Figure 2: BLR: (a) graphical model; (b) programming on ZhuSuan.

In this example the deterministic transformation is the linear model (w^T x), which is implemented by the Tensorflow operations tf.expand_dims, tf.reduce_sum and tf.multiply (*) to enable batch processing of the inputs (x_i, i = 1, . . . , n). The two random variables y and w are created as zs.Bernoulli and zs.Normal respectively, which are the StochasticTensors for Bernoulli and normal distributions. The group_ndims argument passed to the StochasticTensor for w means that the last dimension of w is treated as a group, whose probabilities are computed together.

Example 2 (Variational Auto-Encoders, VAE) The variational auto-encoder (VAE) is one of the most popular products of Bayesian deep learning. The generative process of a VAE for modeling binarized MNIST data is

z ∼ N(0, I),
x_logits = f_NN(z),
x ∼ Bernoulli(σ(x_logits)), (9)

where z ∈ R^D, x ∈ R^784. This generative process is typical of deep generative models: it starts with a latent representation (z) sampled from a simple distribution (such as the standard normal); the samples are then forwarded through a deep neural network (f_NN) to capture the complex generative process of high-dimensional observations such as images (x); finally, some noise is added to the output to get a tractable likelihood for the model. For binarized MNIST, the observation noise is chosen to be Bernoulli, with its parameters output by the neural network. Because ZhuSuan is built upon Tensorflow, which has full support for deep neural networks, the implementation of a VAE is extremely easy and intuitive, as shown in Figure 3.



from tensorflow.contrib import layers

with zs.BayesianNet() as model:
    z_mean = tf.zeros([n, D])
    z = zs.Normal('z', z_mean, std=1., group_ndims=1)
    h = layers.fully_connected(z, 500)
    x_logits = layers.fully_connected(h, 784, activation_fn=None)
    x = zs.Bernoulli('x', x_logits, group_ndims=1)


Figure 3: VAE: (a) graphical model; (b) programming on ZhuSuan.

Example 3 (Deep Sigmoid Belief Networks, DSBN) The Sigmoid Belief Network (SBN) is a directed discrete latent variable model that has close connections with feed-forward neural networks and Boltzmann machines (Neal, 1992). In recent years, the return of neural networks has brought new life to this old model. In fact, the well-known Deep Belief Network (DBN), the earliest work on deep learning, is an infinite-layer tied-weight SBN with the bottom layers untied (Hinton et al., 2006). The generative process of a DSBN with L layers is

z^(L) ∼ Bernoulli(σ(p^(L))),
z^(l−1) ∼ Bernoulli(σ((w^(l))^T z^(l))), l = L, . . . , 1,
x = z^(0), (10)

where σ is the sigmoid function, p^(L) are the top-layer parameters, w are the hidden weights, z are the latent variables, and x are the observations. From the definition we can see that DSBN is a model with multiple layers of stochastic nodes. The implementation of a two-layer DSBN (L = 2) is shown in Figure 4.


with zs.BayesianNet() as model:
    z2_logits = tf.zeros([n, n_z])
    z2 = zs.Bernoulli('z2', z2_logits, dtype=tf.float32, group_ndims=1)
    z1_logits = layers.fully_connected(z2, n_z, activation_fn=None)
    z1 = zs.Bernoulli('z1', z1_logits, dtype=tf.float32, group_ndims=1)
    x_logits = layers.fully_connected(z1, n_x, activation_fn=None)
    x = zs.Bernoulli('x', x_logits, group_ndims=1)


Figure 4: DSBN: (a) graphical model; (b) programming on ZhuSuan.


To provide a more interesting example that utilizes the powerful control flow operations of Tensorflow, we describe a Bayesian recurrent neural network (Bayesian RNN) below, which can also be intuitively programmed in ZhuSuan.

Example 4 (Bayesian RNN) We have mentioned previously that deterministic neural networks lack the ability to account for the uncertainty of their own predictions. A solution given by Bayesian methods is a model named the Bayesian Neural Network (Bayesian NN), which treats the network weights as random variables and infers a posterior distribution over them given data. The generative process of a Bayesian NN for classification tasks is

W ∼ N(0, I),
π = f_NN(x; W),
y ∼ Cat(softmax(π)), (11)

where f_NN is a neural network with weights W, π are the predicted unnormalized log probabilities for all classes, Cat represents the categorical distribution, and y is the class label. When f_NN is chosen to be a recurrent network, the above process describes a Bayesian RNN. The graphical model for a Bayesian RNN is shown in Figure 5. Below we consider a model for a two-class sequence classification task. For the RNN part, it uses a Long Short-Term Memory (LSTM) network.

Figure 5: Bayesian RNN

class BayesianLSTMCell(object):

    def __init__(self, num_units, forget_bias=1.0):
        self._forget_bias = forget_bias
        w_mean = tf.zeros([2 * num_units + 1, 4 * num_units])
        self._w = zs.Normal('w', w_mean, std=1., group_ndims=2)

    def __call__(self, state, inputs):
        c, h = state
        batch_size = tf.shape(inputs)[0]
        linear_in = tf.concat([inputs, h, tf.ones([batch_size, 1])], axis=1)
        linear_out = tf.matmul(linear_in, self._w)


        # i = input_gate, j = new_input, f = forget_gate, o = output_gate
        i, j, f, o = tf.split(value=linear_out, num_or_size_splits=4, axis=1)
        new_c = (c * tf.sigmoid(f + self._forget_bias) +
                 tf.sigmoid(i) * tf.tanh(j))
        new_h = tf.tanh(new_c) * tf.sigmoid(o)
        return new_c, new_h

def bayesian_rnn(cell, inputs, seq_len):
    initializer = (tf.zeros([batch_size, 128]), tf.zeros([batch_size, 128]))
    c_list, h_list = tf.scan(cell, inputs, initializer=initializer)
    relevant_outputs = tf.gather_nd(
        h_list, tf.stack([seq_len - 1, tf.range(batch_size)], axis=1))
    logits = tf.squeeze(tf.layers.dense(relevant_outputs, 1), -1)
    return logits

with zs.BayesianNet() as model:
    cell = BayesianLSTMCell(128, forget_bias=0.)
    logits = bayesian_rnn(cell, self.x, self.seq_len)
    _ = zs.Bernoulli('y', logits, dtype=tf.float32)

The code splits into three parts. First we build a BayesianLSTMCell that applies transformations to the inputs and the hidden states at each time step. Its only difference from the tf.nn.rnn_cell.BasicLSTMCell in Tensorflow is that the weights are generated from a normal StochasticTensor. Then the tf.scan operation is used to apply the cell function to input sequences. This part utilizes the power of control flows to deal with variable-length sequences (tf.scan is internally implemented by the tf.while_loop operator). Finally, a Bernoulli StochasticTensor generates the class label given the outputs from the RNN.

Model reuse Unlike supervised neural networks, a key feature of probabilistic graphical models is polymorphism, i.e., a stochastic node can be in either of two states: observed or latent. This is a major difficulty when designing programming primitives. For example, consider the above VAE case. If z is a Tensor sampled from the Normal distribution, then when the model is used again in a case where z is observed, the common practice in Tensorflow programming is to write another piece of code that builds a graph with the observation of z as the input. When the set of stochastic nodes is large, this process is very painful.

The reusability problem is shared by probabilistic programming libraries that are based on computation graph toolkits, e.g., Theano (Al-Rfou et al., 2016) and Tensorflow (Abadi et al., 2016). Other libraries such as PyMC3 (Salvatier et al., 2016) and Edward (Tran et al., 2016) address this problem by directly manipulating the created computation graph. Specifically, PyMC3 (based on Theano) relies on the officially supported theano.clone() function to copy and recreate the subgraphs that are affected when the state of a node changes, while Edward (based on Tensorflow) has to implement its own copy() function by looking into nonpublic low-level APIs of the Tensorflow computation graph. As we will see later, these solutions of manipulating the created graphs induce problems and limitations for these libraries.


ZhuSuan has been carefully designed for reusability, but does not rely on altering the created graphs. Specifically, an observation can be passed to a StochasticTensor directly, which enables model reuse by repeated calls of a function with different arguments passed. Below is an illustrative example:

def build_model(x_obs=None):
    x = zs.Normal('x', observed=x_obs)
    return f(x)

# Samples from the Normal distribution will be used for computation in f.
out = build_model()
# x_obs will be used for computation in f.
out = build_model(x_obs=x_obs)

For models that have many stochastic nodes (e.g., x1, ..., xN), in principle this polymorphism can be achieved by

def build_model(x1_obs=None, ..., xN_obs=None):
    ...

However, in practice this is often found to be redundant and makes it hard for automatic inference algorithms to fit each model. This is where the BayesianNet context makes a big difference. BayesianNet accepts an argument named observed, which is a dictionary mapping from names of StochasticTensors to their observations. The context is then responsible for determining the states of the StochasticTensors. This makes it easier for users to handle state changes of large sets of stochastic nodes, and more importantly, it enables a unified form of model-building function that makes automatic inference possible. An example is shown below.

def build_model(observed=None):
    with zs.BayesianNet(observed=observed) as model:
        z = zs.Normal('z')
        x = zs.Normal('x', f(z))
        ...
    return model

# No observation
m = build_model()
# Observe z and x
m = build_model({'z': z_obs, 'x': x_obs})

4.2 Inference Algorithms

In Section 2.2 we covered the major inference methods used in Bayesian deep learning. Although most of them are gradient-based, stochastic, and black-box (Ranganath et al., 2014), which is suitable for building a general inference framework, there is currently no software that provides complete support for them. To bridge this gap, ZhuSuan leverages recent advances in differentiable inference for Bayesian deep learning, and provides wide and flexible support for both traditional and modern inference algorithms. These algorithms are provided in an API that fits well into deep learning paradigms, which makes inference


Objective | Gradient estimator | Supported latent-variable types | Implementations in zs.variational
ELBO | SGVB | continuous and reparameterizable; Concrete relaxation of discrete | elbo().sgvb
ELBO | REINFORCE | all types | elbo().reinforce
Importance weighted objective | SGVB (IWAE) | continuous and reparameterizable; Concrete relaxation of discrete | iw_objective().sgvb
Importance weighted objective | VIMCO | all types | iw_objective().vimco
KL(p‖q) | RWS | all types | klpq().rws

Table 1: Variational inference in ZhuSuan. Relevant references are SGVB (Kingma and Welling, 2013), REINFORCE (Williams, 1992; Mnih and Gregor, 2014), IWAE (Burda et al., 2015), VIMCO (Mnih and Rezende, 2016), and RWS (Bornschein and Bengio, 2014).

as easy as doing gradient descent in deterministic neural networks. Below, we use examples to explain both variational inference and Monte Carlo methods in ZhuSuan.

4.2.1 Variational Inference

As reviewed in Section 2.2.1, a variational inference (VI) algorithm consists of two parts: the construction of variational posteriors and the optimization of variational objectives.

Typical VI implementations in probabilistic programming languages have mostly restricted themselves to simple variational posteriors. For example, the ADVI algorithm (Kucukelbir et al., 2017), which serves as the main VI method in Stan (Carpenter et al., 2017), uses only Gaussian variational posteriors. In contrast, ZhuSuan supports building very flexible variational posteriors by leveraging Bayesian networks. This opens up the door to rich variational families with user-specified dependency structures.

On the optimization side, as mentioned in Section 2.2.1, many gradient-based variational methods have emerged from recent progress in Bayesian deep learning (Kingma and Welling, 2013; Mnih and Gregor, 2014; Burda et al., 2015; Mnih and Rezende, 2016). These methods differ in the variational objectives and the gradient estimators they use. To make them more automatic and easier to handle, ZhuSuan wraps each of them into a single function, which computes a surrogate cost on which users can directly take derivatives. This means that optimizing these surrogate costs is equivalent to optimizing the corresponding variational objectives with their well-developed gradient estimators. The currently supported


variational methods are summarized in Table 1. They are grouped by the objective they aim to optimize. Three kinds of objectives are currently supported: the ELBO (equivalent to minimizing KL(q‖p)), the importance weighted objective (Burda et al., 2015), and KL(p‖q), where p and q denote the true and the variational posterior, respectively.

It is easy to implement a variational inference algorithm in ZhuSuan by following these steps:

1) Build the variational posterior using ZhuSuan’s modeling primitives.

2) Get samples and their log probabilities from the variational posterior.

3) Provide the log joint function of the model and call variational objectives.

4) Choose the gradient estimator to use and get the surrogate cost to minimize.

5) Call Tensorflow optimizers to run gradient descent on the surrogate cost.

Below we use a simple example to guide the readers through all these steps.

Example 5 (BLR, continued) We consider applying variational inference to the BLR model in Example 1, following the above steps:

1) Build the variational posterior. The true posterior of w is intractable and should have correlations across dimensions. Here we follow the common practice of making the mean-field assumption that the variational posterior q(w) is factorized, i.e., q(w) = ∏_{d=1}^{D} q(w_d), where D is the dimension of the weights. The code for using a factorized normal distribution as the variational posterior is:

with zs.BayesianNet() as variational:
    w_mean = tf.Variable(tf.zeros([D]))
    w_logstd = tf.Variable(tf.zeros([D]))
    w = zs.Normal('w', w_mean, logstd=w_logstd, group_ndims=1)

2) Get samples and their log probabilities from the variational posterior. This is done by querying the stochastic node w for its outputs and local log probabilities.

qw_samples, log_qw = variational.query(
    'w', outputs=True, local_log_prob=True)

3) Provide the log joint function of the model and call variational objectives. The log joint value can be computed easily by summing the log probabilities at the individual nodes of the model:

def log_joint(observed):
    model = blr(observed, x)
    log_py_xw, log_pw = model.local_log_prob(['y', 'w'])
    return log_py_xw + log_pw

Then we call the ELBO objective, passing in the log joint function, the observed data and the outputs of the variational posterior.

elbo = zs.variational.elbo(
    log_joint, observed={'y': y},
    latent={'w': [qw_samples, log_qw]})


4) Choose the gradient estimator to use and get the surrogate cost to minimize. Because w is continuous and can be reparameterized as w = εσ_w + µ_w, where µ_w and σ_w are the mean and the standard deviation of the normal distribution, we choose the SGVB gradient estimator (see Table 1 for the scope of application of each estimator). The computed surrogate costs are for a batch of data, so they are averaged.

cost = tf.reduce_mean(elbo.sgvb())

5) Call Tensorflow optimizers to run gradient descent on the surrogate cost. As previously explained, this optimizes the ELBO objective with the SGVB gradient estimator. We can also fetch the ELBO value to verify it.

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
infer_op = optimizer.minimize(cost)

with tf.Session() as sess:
    for i in range(iters):
        _, elbo_value = sess.run([infer_op, elbo])

The above is a very simple example using a factorized variational posterior, while as mentioned previously, ZhuSuan supports building very flexible variational posteriors. In the following example, we will see how to do amortized inference (see Section 2.2.1) by leveraging neural networks in the variational posterior.

Example 6 (VAE, continued) Consider the VAE model in Example 2. The key difference compared to BLR (Example 1) is that the VAE's latent variables (z) are local instead of global. As the number of local latent variables scales linearly with the number of observations (x), it is hard to afford a separate variational posterior for each of them. This is where the amortized style of inference becomes useful. Specifically, we use a neural network with x as input to generate the parameters of the variational posterior for each corresponding z. This network is often referred to as the encoder or recognition model of the VAE. The code is shown in Figure 6.


with zs.BayesianNet() as variational:
    h = layers.fully_connected(x, n_h)
    z_mean = layers.fully_connected(h, n_z, activation_fn=None)
    z_logstd = layers.fully_connected(h, n_z, activation_fn=None)
    z = zs.Normal('z', z_mean, logstd=z_logstd, group_ndims=1)


Figure 6: VAE's variational posterior: (a) graphical illustration; (b) programming on ZhuSuan.

Having built the variational posterior, we can follow the above steps to finish the inference procedure. Note that because the variational distribution for z is again normal, we can use


the SGVB gradient estimator for optimizing the ELBO objective. We omit these steps because they are very similar to those illustrated in the previous example.

In deep, hierarchical models, there are often conditional dependencies between latent variables that need consideration during inference. We use the DSBN model from Example 3 to illustrate how to introduce structured dependencies when building variational posteriors for hierarchical models. This example also demonstrates applying VI to discrete latent variables in ZhuSuan.

Example 7 (DSBN, continued) A commonly used variational posterior in recent works for the DSBN model is

q(z^(1:L)) = ∏_{l=2}^{L} q(z^(l)|z^(l−1)) · q(z^(1)|x), (12)

which has the same conditional dependency structure as the original model when x is observed. The graphical model as well as the code for this structured posterior (with L = 2) are shown in Figure 7. It can be seen from the code that z^(2) directly depends on z^(1) through


with zs.BayesianNet() as variational:
    z1_logits = layers.fully_connected(x, n_z, activation_fn=None)
    z1 = zs.Bernoulli('z1', z1_logits, dtype=tf.float32)
    z2_logits = layers.fully_connected(z1, n_z, activation_fn=None)
    z2 = zs.Bernoulli('z2', z2_logits, dtype=tf.float32)


Figure 7: DSBN's variational posterior: (a) graphical illustration; (b) programming on ZhuSuan.

a fully connected neural network layer. Note that here z^(2) and z^(1) are discrete latent variables. In ZhuSuan there are generally two ways of doing variational inference with them.

One way is to use the Concrete relaxation, or Gumbel-softmax trick (Maddison et al., 2016; Jang et al., 2016); namely, we use Concrete random variables to replace the Bernoulli ones during training. Because the Concrete distribution is continuous and reparameterizable, we can keep using the SGVB estimator. At test time, we switch back to the Bernoulli random variables using the same input parameters. Both the model and the posterior are modified slightly in this approach:

# model with Concrete relaxation
with zs.BayesianNet() as model:
    z2_logits = tf.zeros([n, n_z])


    if is_training:
        z2 = zs.BinConcrete('z2', z2_logits, group_ndims=1)
    else:
        z2 = zs.Bernoulli('z2', z2_logits, group_ndims=1, dtype=tf.float32)
    z1_logits = layers.fully_connected(z2, n_z, activation_fn=None)
    if is_training:
        z1 = zs.BinConcrete('z1', z1_logits, group_ndims=1)
    else:
        z1 = zs.Bernoulli('z1', z1_logits, group_ndims=1, dtype=tf.float32)
    x_logits = layers.fully_connected(z1, n_x, activation_fn=None)
    x = zs.Bernoulli('x', x_logits, group_ndims=1)

# posterior with Concrete relaxation
with zs.BayesianNet() as variational:
    z1_logits = layers.fully_connected(x, n_z, activation_fn=None)
    if is_training:
        z1 = zs.BinConcrete('z1', z1_logits, group_ndims=1)
    else:
        z1 = zs.Bernoulli('z1', z1_logits, group_ndims=1, dtype=tf.float32)
    z2_logits = layers.fully_connected(z1, n_z, activation_fn=None)
    if is_training:
        z2 = zs.BinConcrete('z2', z2_logits, group_ndims=1)
    else:
        z2 = zs.Bernoulli('z2', z2_logits, group_ndims=1, dtype=tf.float32)

The other approach is to directly use a gradient estimator that applies to discrete latent variables; these include REINFORCE and VIMCO. Here we present the code for using VIMCO. Because VIMCO optimizes the importance weighted bound, which is a multi-sample bound, the model and the variational posterior need to be changed to multi-sample versions, as shown below.

# model (multi-sample version)
with zs.BayesianNet() as model:
    z2_logits = tf.zeros([n, n_z])
    z2 = zs.Bernoulli('z2', z2_logits, group_ndims=1, dtype=tf.float32,
                      n_samples=n_samples)
    z1_logits = layers.fully_connected(z2, n_z, activation_fn=None)
    z1 = zs.Bernoulli('z1', z1_logits, group_ndims=1, dtype=tf.float32)
    x_logits = layers.fully_connected(z1, n_x, activation_fn=None)
    x = zs.Bernoulli('x', x_logits, group_ndims=1)

# posterior (multi-sample version)
with zs.BayesianNet() as variational:
    z1_logits = layers.fully_connected(x, n_z, activation_fn=None)
    z1 = zs.Bernoulli('z1', z1_logits, group_ndims=1, dtype=tf.float32,
                      n_samples=n_samples)
    z2_logits = layers.fully_connected(z1, n_z, activation_fn=None)
    z2 = zs.Bernoulli('z2', z2_logits, dtype=tf.float32)

Note the n_samples argument added in the code. This time we call the VIMCO gradient estimator to get the surrogate cost:


lower_bound = zs.variational.iw_objective(
    log_joint, observed={'x': x}, latent=qz_outputs, axis=0)
cost = tf.reduce_mean(lower_bound.vimco())

All samples along axis 0 will be used to obtain variance-reduced gradient estimates.

4.2.2 Monte Carlo Methods

In addition to variational inference, efforts have also been made towards unifying Monte Carlo methods in the Bayesian deep learning context. We have covered the basics in Section 2.2.2. Below we walk through ZhuSuan's support for both importance sampling and a black-box MCMC method: Hamiltonian Monte Carlo.

Importance Sampling As reviewed in Section 2.2.2, the major application scenarios for importance sampling in Bayesian deep learning are model learning and evaluation. For model learning, we have introduced that self-normalized importance sampling can be used to estimate the gradients of marginal log likelihoods with respect to model parameters (Bornschein and Bengio, 2014). It turns out that the gradients estimated in this way are equivalent to the exact gradients of the importance weighted variational objective (mentioned in Section 4.2.1) with respect to model parameters. So model learning with importance sampling can be implemented by directly optimizing the importance weighted objective (zs.variational.iw_objective(), see Table 1) with respect to the model parameters:

# importance weighted objective used as the learning signal
lower_bound = zs.variational.iw_objective(
    log_joint, observed, latent, axis=axis)
# maximize the bound w.r.t. model parameters by minimizing its negation
cost = tf.reduce_mean(-lower_bound)
optimizer = tf.train.AdamOptimizer(learning_rate)
learning_op = optimizer.minimize(cost, var_list=model_params)

The only difference is that the variational posterior now plays the role of the proposal distribution.
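
Concretely, the self-normalized importance sampling gradient estimate referred to above has the standard form (our notation; z^{(1)}, ..., z^{(K)} are samples from the proposal q):

\nabla_\theta \log p_\theta(x) \approx \sum_{k=1}^{K} \tilde{w}_k \, \nabla_\theta \log p_\theta(x, z^{(k)}), \qquad w_k = \frac{p_\theta(x, z^{(k)})}{q(z^{(k)} \mid x)}, \quad \tilde{w}_k = \frac{w_k}{\sum_{j=1}^{K} w_j},

which, as stated above, coincides with the exact gradient of the importance weighted objective with respect to the model parameters \theta.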

Besides learning model parameters, we have mentioned that importance sampling is extensively used for model evaluation. To meet this need, ZhuSuan's evaluation module provides an is_loglikelihood() function for estimating the marginal log likelihood of any given observations using simple importance sampling:

ll = zs.is_loglikelihood(log_joint, observed, latent, axis=axis)
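
As a small usage sketch (the placeholder x_batch, the proposal samples behind latent, and the averaging step are our own, not part of the API), an average test log likelihood could be estimated as:

# Hypothetical evaluation snippet: per-datapoint marginal log likelihood
# estimates (importance samples indexed along axis 0) averaged over a
# mini-batch of test observations.
ll = zs.is_loglikelihood(log_joint, observed={'x': x_batch},
                         latent=latent, axis=0)
test_ll = tf.reduce_mean(ll)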

Until now we have not covered how to build a proposal distribution for importance sampling. In fact, this procedure is exactly the same as building a variational posterior. With ZhuSuan's modeling primitives, the neural adaptive proposals mentioned in Section 2.2.2 are easy to implement by leveraging neural networks in the proposal distribution. Adapting the proposal for the above two scenarios is also straightforward, and the true posterior distribution is often an ideal choice. The adaptation thus turns out to be a variational inference problem, which can be solved by choosing an appropriate method from Table 1. In particular, when the KL(p‖q) objective is chosen and the gradients are estimated by importance sampling, this recovers the method used in Reweighted Wake-Sleep (RWS) (Bornschein and Bengio, 2014), which is why this estimator is named rws in ZhuSuan. We illustrate the process of training DSBN by importance sampling in the following example.


Example 8 (DSBN, continued) In this example we reproduce the algorithm proposed in the Reweighted Wake-Sleep paper (Bornschein and Bengio, 2014), which learns DSBN by importance sampling with a neural adaptive proposal. The code snippet below follows from the multi-sample version of the model and the posterior (now used as the proposal) in Example 7; we omit the code that draws samples from the proposal distribution and computes their log probabilities.

# learning model parameters
lower_bound = tf.reduce_mean(
    zs.variational.importance_weighted_objective(
        log_joint, observed={'x': x_obs}, latent=latent, axis=0))
model_grads = optimizer.compute_gradients(-lower_bound, model_params)

# adapting the proposal
klpq_obj = zs.variational.klpq(
    log_joint, observed={'x': x_obs}, latent=latent, axis=0)
klpq_cost = tf.reduce_mean(klpq_obj.rws())
klpq_grads = optimizer.compute_gradients(klpq_cost, proposal_params)

infer_op = optimizer.apply_gradients(model_grads + klpq_grads)

We can see that the code has two parts. The first part learns the model parameters by optimizing the importance weighted objective with respect to them. The second part adapts the proposal by minimizing the inclusive KL divergence between the true posterior (the ideal proposal) and the current proposal. As training proceeds, the adaptation part helps maintain a good proposal, reducing the variance of the importance sampling estimate of the marginal log likelihood.

Hamiltonian Monte Carlo In Section 2.2.2 we briefly analyzed existing MCMC methods and identified HMC as a powerful tool for posterior inference in high-dimensional spaces and non-conjugate models, which makes it a perfect fit for Bayesian deep learning. However, in practice the algorithm involves many technical details and can be hard to implement in a fast and efficient way. Besides, although all tuning parameters in HMC have clear physical meanings, it is still hard for users to tune them by hand because the optimal choices depend on unknown statistics of the underlying distribution. For example, the mass matrix describes the variance of the underlying distribution, which is hard to know before we can draw samples from it.
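
For reference, these tuning parameters all appear in a single leapfrog update of HMC (standard notation, not specific to ZhuSuan; z is the position, r the momentum, \epsilon the step size, and M the mass matrix):

r_{t+1/2} = r_t + \frac{\epsilon}{2} \nabla_z \log p(z_t \mid x), \qquad
z_{t+1} = z_t + \epsilon M^{-1} r_{t+1/2}, \qquad
r_{t+1} = r_{t+1/2} + \frac{\epsilon}{2} \nabla_z \log p(z_{t+1} \mid x),

and the number of leapfrog steps determines how many such updates are chained before the Metropolis correction.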

In recent years there has been a rise of practical algorithms and high-performance software targeting these problems. The No-U-Turn Sampler, or NUTS (Hoffman and Gelman, 2014), automatically determines the number of leapfrog steps and comes with a dual averaging scheme for automatically tuning the step size. The HMC implementation in Stan (Carpenter et al., 2017) also includes a procedure that estimates the mass matrix from the samples drawn in the warm-up stage.

ZhuSuan has learned from these existing libraries and provides a fast, automatic, and deep-learning-style implementation of HMC. Specifically, ZhuSuan's HMC has the following features:

• Supports running multiple chains in parallel on CPUs or GPUs.


• Provides options for automatically tuning parameters, including the step size and the mass matrix. The NUTS algorithm for determining the number of leapfrog steps is not included because it is a recursive algorithm and each chain can take a different number of leapfrog steps, which makes a parallel implementation in static computation graphs difficult.

• The algorithm is provided through an API very similar to that of Tensorflow optimizers, as illustrated in Figure 8. We hope it is easy to start with for people who are familiar with deep learning libraries.

z = tf.Variable(0.)
hmc = zs.HMC(step_size=1e-3,
             n_leapfrogs=10)
sample_op, hmc_info = hmc.sample(
    log_joint, observed={'x': x},
    latent={'z': z})
with tf.Session() as sess:
    for i in range(iters):
        _ = sess.run(sample_op)

(a) Using HMC in ZhuSuan

z = tf.Variable(0.)
optimizer = tf.train.AdamOptimizer(
    learning_rate=1e-3)
optimize_op = optimizer.minimize(
    cost(z))
with tf.Session() as sess:
    for i in range(iters):
        _ = sess.run(optimize_op)

(b) Using Tensorflow optimizers

Figure 8: ZhuSuan’s HMC vs. Tensorflow optimizers
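
As a usage sketch building on panel (a) above (the burn-in bookkeeping, the explicit variable initialization, and the name burn_in are our own additions, not part of the API), posterior samples of z can be collected by fetching the latent variable after each transition:

# Hypothetical sampling loop: run the HMC transition repeatedly, discard
# an initial burn-in period, and keep the remaining states of z.
# (If x is a placeholder, a feed_dict would also be supplied to sess.run.)
burn_in = 100
samples = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(iters):
        _, z_sample = sess.run([sample_op, z])
        if i >= burn_in:
            samples.append(z_sample)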

5. Comparison

We compare ZhuSuan with representative python probabilistic programming libraries, namely PyMC3 (Salvatier et al., 2016) and Edward (Tran et al., 2016). A detailed feature comparison is shown in Table 2. As the table shows, all three libraries build upon modern computation graph libraries and can transparently run on GPUs. The remaining features fall into three main categories: modeling capabilities, inference support, and architecture and design.

Features                           | PyMC3                                       | Edward                                                           | ZhuSuan
-----------------------------------|---------------------------------------------|------------------------------------------------------------------|-------------------------------------
Based on                           | Theano                                      | Tensorflow                                                       | Tensorflow
GPU support                        | ✓                                           | ✓                                                                | ✓
Deterministic modeling             | Any Theano operation                        | Control flows (tf.while_loop, tf.cond) are not properly handled  | Any Tensorflow operation
Customizable variational posterior | Only for reparameterizable settings         | ✓                                                                | ✓
HMC: Parallel chains               | ✓                                           | ✗                                                                | ✓
Modularity                         | Modeling and inference are tightly coupled  | Modeling and inference are tightly coupled                       | All parts can be used independently
Transparency                       | ✗                                           | ✗                                                                | ✓

Table 2: Comparison between ZhuSuan and other python probabilistic programming libraries.

Modeling capabilities All three libraries use primitives from the computation graph toolkits they are built on and are designed for directed graphical models (Bayesian networks). PyMC3 solves the model reuse problem during inference by using theano.copy() to copy the related subgraphs. Edward does this as well, but without relying on an official Tensorflow API. ZhuSuan avoids altering created graphs and builds reuse purely on function reuse and context management. As a result, PyMC3 and ZhuSuan can correctly deal with control flow primitives like theano.scan() and tf.while_loop(), while Edward faces challenges when applying variational inference to such models, because it requires graph copying to replace latent variables with samples from the variational posterior, which is troublesome for control flow operations given the lack of official support.

Inference support As for the range of inference methods, the three libraries have different emphases. PyMC3 started from MCMC methods, as its name suggests. It has put much effort into sampling algorithms and has made them applicable to a broad class of models in Bayesian statistics. On the other hand, PyMC3's support for variational inference is limited and specific to several independent algorithms (ADVI, SVGD, etc.). Edward has a general inference abstraction and implements a large number of algorithms as its subclasses. However, as the form of the abstraction induces constraints, many of these implementations have limited behaviors and make overly strong assumptions about the constructed models (e.g., GANInference). ZhuSuan puts more emphasis on Bayesian deep learning scenarios, and therefore more effort into modern differentiable algorithms that can be unified into the deep learning paradigm. Both Edward and ZhuSuan support customizable variational posteriors for all VI methods, while PyMC3 only supports this in reparameterizable settings.

Architecture and design The design philosophy of ZhuSuan, introduced in Section 3, emphasizes two principles: modularity and transparency. These two principles differ markedly from traditional probabilistic programming designs, of which PyMC3 is a typical example. PyMC3 programs strictly follow the pattern of model definition, inference function call, and predictive checks; little can be customized beyond the model definition and the provided inference options. Edward roughly follows this framework as well, but makes it more flexible through general programmable variational posteriors. However, its inference procedure is still hidden, since it relies on many internal manipulations of the created computation graphs, which can be hard for ordinary users to understand. Both libraries build their modeling and inference functionalities in a tightly coupled way. This implies that if a model cannot be described using their modeling primitives, there is little chance of using their inference features. ZhuSuan draws on deep learning paradigms and treats inference as simply doing gradient descent on a cost term, and all parts of the library can be used independently. This proves to be much more transparent and compositional, especially in models with many stochastic activities, various observation behaviors, and deep hierarchies.

6. Conclusions

We have described ZhuSuan, a python probabilistic programming library for Bayesian deep learning, built upon Tensorflow. ZhuSuan bridges the gap between Bayesian methods and deep learning by providing deep learning style primitives and algorithms for building probabilistic models and applying Bayesian inference. Many examples are provided to illustrate the intuitiveness of programming with ZhuSuan. We have open-sourced ZhuSuan on GitHub (https://github.com/thu-ml/zhusuan), aiming to accelerate research and applications of Bayesian deep learning.

Acknowledgments

We thank Haowen Xu for helpful discussions on the API design, Chang Liu for advice on paper writing, and Shizhen Xu and Dong Yan for initial attempts to try ZhuSuan on multi-card and distributed systems. We would like to acknowledge support for this project from the National 973 Basic Research Program of China (No. 2013CB329403), the National NSF of China (Nos. 61620106010, 61322308, 61332007), the National Youth Top-notch Talents Support Program, the NVIDIA NVAIL program, and the Tsinghua Tiangong Institute for Intelligent Technology.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL https://arxiv.org/abs/1605.02688.

Jimmy Ba, Ruslan R Salakhutdinov, Roger B Grosse, and Brendan J Frey. Learning wake-sleep recurrent attention models. In Advances in Neural Information Processing Systems, pages 2593–2601, 2015.

Michael Betancourt. A conceptual introduction to hamiltonian monte carlo. arXiv preprint arXiv:1701.02434, 2017.


Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751, 2014.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P Robert. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447–459, 2008.

Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32, 2017. ISSN 1548-7660. doi: 10.18637/jss.v076.i01. URL https://www.jstatsoft.org/v076/i01.

Jean Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, 2012.

Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, et al. Lasagne: First release, August 2015. URL http://dx.doi.org/10.5281/zenodo.27878.

C. Du, J. Zhu, and B. Zhang. Learning deep generative models with doubly stochastic gradient mcmc. IEEE Transactions on Neural Networks and Learning Systems, PP(99):1–13, 2017. ISSN 2162-237X. doi: 10.1109/TNNLS.2017.2688499.

Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis, volume 2. Chapman & Hall/CRC, Boca Raton, FL, USA, 2014.

Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.


Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential monte carlo. In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.

W Keith Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Matthew D. Hoffman. Learning deep latent Gaussian models with Markov chain Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning, pages 1510–1519, 2017.

Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017. URL http://jmlr.org/papers/v18/16-107.html.


Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

Benedict Leimkuhler and Sebastian Reich. Simulating hamiltonian dynamics, volume 14. Cambridge University Press, 2004.

Yingzhen Li and Yarin Gal. Dropout inference in bayesian neural networks with alpha-divergences. arXiv preprint arXiv:1703.02914, 2017.

Yingzhen Li, Richard E Turner, and Qiang Liu. Approximate inference with amortised mcmc. arXiv preprint arXiv:1702.08343, 2017.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Shike Mei, Jun Zhu, and Jerry Zhu. Robust regbayes: Selectively incorporating first-order logic domain knowledge into bayesian models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 253–261, 2014.

Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics, page bbx044, 2017.

Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Andriy Mnih and Danilo J Rezende. Variational inference for monte carlo objectives. arXiv preprint arXiv:1602.06725, 2016.

Radford M Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.

Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.

Art B. Owen. Monte Carlo theory, methods and examples. 2013.


Brooks Paige and Frank Wood. Inference networks for sequential monte carlo in graphical models. In International Conference on Machine Learning, pages 3040–3049, 2016.

John Paisley, David Blei, and Michael Jordan. Variational bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.

Tianyu Pang, Chao Du, and Jun Zhu. Robust deep learning via reverse cross-entropy training and thresholding test. arXiv preprint arXiv:1706.00633, 2017.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822, 2014.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.

Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005. ISBN 0387212396.

Tim Salimans, Diederik Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1218–1226, 2015.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.

John Salvatier, Thomas V Wiecki, and Christopher Fonnesbeck. Probabilistic programming in python using pymc3. PeerJ Computer Science, 2:e55, 2016.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.


Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979, 2014.

Michalis K Titsias. Learning model reparametrizations: Implicit variational inference by fitting mcmc distributions. arXiv preprint arXiv:1708.01529, 2017.

Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.

Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning, pages 1105–1112. ACM, 2009.

Greg CG Wei and Martin A Tanner. A monte carlo implementation of the em algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704, 1990.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and applications to infinite latent svms. Journal of Machine Learning Research, 15:1799, 2014.

Jun Zhu, Jianfei Chen, Wenbo Hu, and Bo Zhang. Big learning with bayesian methods. National Science Review, page nwx044, 2017.
