
Bayesian Deep Learning

Seungjin Choi

Department of Computer Science and Engineering
Pohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]

http://mlg.postech.ac.kr/~seungjin

December 2, 2016


Deep Learning

- Function composition: $f \approx \sigma_L \circ \sigma_{L-1} \circ \cdots \circ \sigma_1$
- Fully-connected network (MLP): $h_t^{(l)} = \sigma\big(W^{(l)} h_t^{(l-1)} + b_t^{(l)}\big)$
- Convolutional neural network: $h_t^{(l)} = \sigma\big(w^{(l)} * h_t^{(l-1)} + b_t^{(l)}\big)$
- Recurrent neural network: $h_t^{(l)} = \sigma\big(W^{(l)} h_t^{(l-1)} + V^{(l)} h_{t-1}^{(l)} + b_t^{(l)}\big)$
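To make the three layer types concrete, here is a minimal NumPy sketch of a single forward step for each, assuming $\sigma = \tanh$ and a 1-D convolution; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def sigma(a):
    # Elementwise nonlinearity; tanh is one common choice.
    return np.tanh(a)

def mlp_layer(h_prev, W, b):
    # Fully-connected layer: h = sigma(W h_prev + b).
    return sigma(W @ h_prev + b)

def conv_layer(h_prev, w, b):
    # 1-D convolutional layer: h = sigma(w * h_prev + b),
    # where * denotes convolution ("same"-size output).
    return sigma(np.convolve(h_prev, w, mode="same") + b)

def rnn_layer(h_prev_layer, h_prev_time, W, V, b):
    # Recurrent layer: h_t = sigma(W h_t^{(l-1)} + V h_{t-1}^{(l)} + b).
    return sigma(W @ h_prev_layer + V @ h_prev_time + b)
```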


ImageNet Challenge


2015: A Milestone Year in Computer Science

[Figure: AlexNet (2012)]

- AlexNet (5 convolutional layers + 3 fully connected layers), 2012
- VGG (very deep CNN, 16-19 weight layers), 2015
- GoogLeNet (22 layers), 2015
- Deep Residual Net (100-1000 layers), 2015

https://blogs.nvidia.com/blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/


Bayes + Deep = ?

- Deep
  - Good approximation of complex nonlinear transforms
  - Deep hierarchy for representation learning
- Bayes
  - Model comparison
  - Predictive distribution (averaging the likelihood w.r.t. the posterior over parameters)
  - Uncertainty
- Bayesian deep learning = combine the best of the two approaches?


CNN + Bayesian Model Comparison


Why is CNN so successful?

- Similar to simple and complex cells in the V1 area of the visual cortex
- Deep architecture
- Supervised representation learning

[Figure: LeNet, 1989]

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.


Pre-Trained CNNs

- AlexNet (5 convolutional layers + 3 fully connected layers): A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 25, 2012.
- VGG (very deep CNN, 16-19 weight layers): K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- GoogLeNet (22 layers): C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- Deep Residual Net (100-1000 layers): K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.


Bayesian Model Comparison

Model selection: choose the single most probable model using the evidence

$$p(\mathcal{D}|\mathcal{M}_i) = \int p(\mathcal{D}|w, \mathcal{M}_i)\, p(w|\mathcal{M}_i)\, dw.$$

[Figure: evidence $p(\mathcal{D})$ as a function of the data set $\mathcal{D}$ for models $\mathcal{M}_1$, $\mathcal{M}_2$, $\mathcal{M}_3$, with the observed data set $\mathcal{D}_0$ marked.]

- $\mathcal{M}_1$ is the simplest and $\mathcal{M}_3$ is the most complex.
- For the particular observed data set $\mathcal{D}_0$, the model $\mathcal{M}_2$ with intermediate complexity has the largest evidence.


Bayesian Linear Regression

- Gaussian likelihood: $p(y|X, w) = \prod_{i=1}^{N} \mathcal{N}(y_i \,|\, x_i^\top w, \beta^{-1})$.
- Gaussian prior: $p(w) = \mathcal{N}(w \,|\, 0, \alpha^{-1} I)$.
- The posterior is Gaussian of the form $p(w|y, X) = \mathcal{N}(w \,|\, \mu_N, \Lambda_N^{-1})$, where
  $$\mu_N = \beta \Lambda_N^{-1} X y, \qquad \Lambda_N = \alpha I + \beta X X^\top.$$
- The marginal likelihood (evidence) is given by
  $$\mathcal{L}(\alpha, \beta) = \log p(y|X, \alpha, \beta) = \log \int p(y|X, w, \beta)\, p(w|\alpha)\, dw$$
  $$= \frac{D}{2}\log\alpha + \frac{N}{2}\log\beta - \frac{\beta}{2}\,\|y - X^\top \mu_N\|^2 - \frac{\alpha}{2}\,\mu_N^\top \mu_N - \frac{1}{2}\log|\Lambda_N| - \frac{N}{2}\log 2\pi.$$
- Fixed-point updates for the hyperparameters $\alpha$ and $\beta$:
  $$\alpha = \frac{\gamma}{\mu_N^\top \mu_N}, \qquad \beta = \frac{N - \gamma}{\|y - X^\top \mu_N\|^2},$$
  where $\gamma = \sum_{d=1}^{D} \frac{\beta s_d}{\alpha + \beta s_d}$ and the $s_d$ are eigenvalues of $X X^\top$.
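A small NumPy sketch of the evidence framework above, assuming the slide's convention that $X$ is $D \times N$ with data points as columns; the function name and defaults are hypothetical:

```python
import numpy as np

def evidence_maximization(X, y, alpha=1.0, beta=1.0, n_iter=50):
    # X: (D, N) with data points as columns; y: (N,) targets.
    D, N = X.shape
    G = X @ X.T                                  # D x D Gram matrix X X^T
    s = np.linalg.eigvalsh(G)                    # eigenvalues s_d of X X^T
    for _ in range(n_iter):
        Lam = alpha * np.eye(D) + beta * G       # Lambda_N
        mu = beta * np.linalg.solve(Lam, X @ y)  # mu_N
        gamma = np.sum(beta * s / (alpha + beta * s))
        alpha = gamma / (mu @ mu)                # fixed-point updates
        beta = (N - gamma) / np.sum((y - X.T @ mu) ** 2)
    # Log evidence L(alpha, beta) from the formula above.
    Lam = alpha * np.eye(D) + beta * G
    mu = beta * np.linalg.solve(Lam, X @ y)
    _, logdet = np.linalg.slogdet(Lam)
    log_ev = (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta)
              - 0.5 * beta * np.sum((y - X.T @ mu) ** 2)
              - 0.5 * alpha * (mu @ mu) - 0.5 * logdet
              - 0.5 * N * np.log(2 * np.pi))
    return alpha, beta, mu, log_ev
```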


Bayesian Approach: Evidence $p(\mathcal{D}|\mathcal{M}_i)$

- Select the model with maximum evidence
- Select a subset of pre-trained CNNs in a greedy manner (see the sketch below)
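A hedged sketch of what such a greedy search could look like, reusing the `evidence_maximization` routine from the Bayesian linear regression sketch; the `features` dictionary and the stopping rule are assumptions for illustration, not the exact procedure of the CVPR paper:

```python
import numpy as np

def greedy_select(features, y, max_models=3):
    # features: dict mapping CNN name -> (D_m, N) feature matrix.
    selected, best = [], -np.inf
    while len(selected) < max_models:
        gains = {}
        for name, F in features.items():
            if name in selected:
                continue
            # Stack the candidate's features onto the current selection.
            stacked = np.vstack([features[m] for m in selected] + [F])
            *_, log_ev = evidence_maximization(stacked, y)
            gains[name] = log_ev
        if not gains:
            break
        name = max(gains, key=gains.get)
        if gains[name] <= best:      # stop when evidence stops improving
            break
        selected.append(name)
        best = gains[name]
    return selected, best
```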


Fast Bayesian Learning

Yong-Deok Kim, Taewoong Jang, Bohyung Han, and Seungjin Choi (2016), "Learning to select pre-trained deep representations with Bayesian evidence framework," in CVPR-2016. (oral)


Linear Generative Models

$$x = A z + \varepsilon$$

- Factor analysis: spherical Gaussian prior
- Independent component analysis: independent non-Gaussian prior
- Nonnegative matrix factorization: nonnegative prior
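A quick sketch contrasting the three priors by sampling from $x = Az + \varepsilon$; the Laplace and exponential draws stand in for generic non-Gaussian and nonnegative prior choices, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 10, 3, 500
A = rng.normal(size=(D, K))              # mixing matrix (nonnegative for NMF)
eps = 0.1 * rng.normal(size=(D, N))      # observation noise

z_fa  = rng.normal(size=(K, N))          # FA: spherical Gaussian prior
z_ica = rng.laplace(size=(K, N))         # ICA: independent non-Gaussian prior
z_nmf = rng.exponential(size=(K, N))     # NMF: nonnegative prior

# x = A z + eps under each prior.
x_fa, x_ica, x_nmf = (A @ z + eps for z in (z_fa, z_ica, z_nmf))
```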


ICA

Assume that the $\{z_i\}$ are statistically independent.


NMF

Lee and Seung (1999), "Learning the parts of objects by non-negative matrix factorization," Nature.
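For reference, the standard Lee-Seung multiplicative updates for the Frobenius objective $\|X - WH\|_F^2$, as a minimal sketch (this is plain NMF, not the up-propagation variant on the next slide):

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-9):
    # Lee-Seung multiplicative updates minimizing ||X - W H||_F^2.
    # Nonnegativity is preserved because each update multiplies the
    # current factor by a nonnegative ratio.
    D, N = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((D, K))
    H = rng.random((K, N))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```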


Multiplicative Up-Prop

Ahn, Oh, and Choi, "A multiplicative up-propagation algorithm," ICML-2004.


Deep Generative Models

Generative Models + Deep Networks


Back-Prop (discriminative): learning with labeled data

Up-Prop (generative): learning with unlabeled data

Oh and Seung, "Learning Generative Models with the Up Propagation Algorithm," NIPS-1997.


Deep Directed Generative Models

- Probabilistic decoder: $p_\theta(x|z)$
- Inference: $p(z|x)$
- Density network: $p_\theta(x|z) = g(x|z, \theta)$ [MacKay and Gibbs]
- Variational autoencoder: introduce an inference network $q_\phi(z|x) = f(z|x, \phi)$ [Kingma and Welling]

https://www.openai.com/blog/generative-models/


Variational Lower-Bound

$$\log p(x) \;\ge\; \int q(z|x) \log \frac{p(x, z)}{q(z|x)}\, dz \;=\; \int q(z|x) \log \frac{p(x|z)\, p(z)}{q(z|x)}\, dz \;=\; \underbrace{\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]}_{\text{Reconstruction}} \;-\; \underbrace{\mathrm{KL}\big[q(z|x) \,\|\, p(z)\big]}_{\text{Penalty}}.$$

- Reconstruction cost: the expected log-likelihood measures how well samples from $q(z|x)$ are able to explain the data $x$.
- Penalty: ensures that the approximation $q(z|x)$ to the posterior does not deviate too far from your beliefs $p(z)$.


Stochastic Gradient Variational Bayes

$$\mathcal{F}(x, \phi) = \underbrace{\mathbb{E}_{q}\big[\log p(x|z)\big]}_{\text{SGVB}} \;-\; \underbrace{\mathrm{KL}\big[q(z|x) \,\|\, p(z)\big]}_{\text{analytically computed}},$$

where $\mathbb{E}_q[\cdot]$ denotes the expectation w.r.t. $q(z|x)$ and Monte Carlo estimates are formed with the reparameterization trick:

$$\mathbb{E}_q\big[\log p(x|z)\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p\big(x \,|\, z^{(l)}\big),$$

where $z^{(l)} = m + \sqrt{\lambda} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal{N}(0, I)$. A single sample is often sufficient to form this Monte Carlo estimate in practice.
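A single-sample SGVB estimate for a diagonal Gaussian $q(z|x) = \mathcal{N}(m, \mathrm{diag}(\lambda))$ and a standard normal prior, as a minimal sketch; `decode` is a hypothetical stand-in for the decoder's log-likelihood $\log p(x|z)$:

```python
import numpy as np

def elbo_single_sample(x, m, lam, decode, rng):
    # m, lam: mean and variance of q(z|x); decode(z) -> log p(x|z).
    eps = rng.normal(size=m.shape)
    z = m + np.sqrt(lam) * eps        # reparameterization trick
    recon = decode(z)                 # Monte Carlo estimate with L = 1
    # KL[ N(m, diag(lam)) || N(0, I) ], computed analytically.
    kl = 0.5 * np.sum(lam + m**2 - 1.0 - np.log(lam))
    return recon - kl
```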


Johnson et al. (2016), "Structured VAEs: Composing Probabilistic Graphical Models and Variational Autoencoders," Preprint arXiv:1603.06277.


Data Imputation

Taken from Shakir Mohamed’s slides


Image Generation

Taken from Shakir Mohamed’s slides


Semi-Supervised Learning

A small number of labeled examples together with plenty of unlabeled examples.

Taken from Wikipedia


VAE with Rank-One Covariance [Suh and Choi, 2016]

[Figure: latent points $z^{(k)}$ in $Z$ mapped to local means $\mu(z^{(k)})$ and local principal directions $a(z^{(k)})$ in data space $(X_1, X_2)$.]

- Find the local principal direction at a specific location $\mu(z)$:
  $$p(x|z) = \mathcal{N}\big(\mu, \,\omega I + a a^\top\big), \qquad p(z) = \mathcal{N}(0, I),$$
  where
  $$\mu = W_\mu h + b_\mu, \quad \log \omega = w_\omega^\top h + b_\omega, \quad a = W_a h + b_a, \quad h = \tanh(W_h z + b_h).$$
- Can be interpreted as an infinite mixture of PPCA with $p(s) = \mathcal{N}(0, 1)$:
  $$p(x|s, z) = \mathcal{N}(a s + \mu, \,\omega I).$$

Suwon Suh and Seungjin Choi (2016), "Gaussian copula variational autoencoders for mixed data," Preprint arXiv:1604.04960.
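The rank-one covariance makes the Gaussian log-density cheap to evaluate via the Sherman-Morrison identity and the matrix determinant lemma; a sketch of this standard linear algebra (not code from the paper):

```python
import numpy as np

def logpdf_rank_one(x, mu, omega, a):
    # log N(x; mu, omega*I + a a^T) without forming the D x D covariance:
    #   (wI + aa^T)^{-1} = (1/w)(I - aa^T / (w + a^T a)),
    #   det(wI + aa^T)   = w^{D-1} (w + a^T a).
    D = x.size
    d = x - mu
    ata = a @ a
    quad = (d @ d - (d @ a) ** 2 / (omega + ata)) / omega
    logdet = (D - 1) * np.log(omega) + np.log(omega + ata)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
```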


[Figure: (a) true images; (b) generated images.]


[Figure: (a) data; (b) VAE.]


[Figure: (a) VAE-ROC without regularization; (b) VAE-ROC with L2-norm regularization, $\lambda_{\mathrm{local}} = 5$.]


Copulas

- A $D$-dimensional copula $C$ is a distribution function on the unit cube $[0, 1]^D$ with each univariate marginal distribution being uniform on $[0, 1]$.
- Classical result of Sklar (1959), for continuous $x_i$:
  $$F(x_1, \ldots, x_D) = C\big(F_1(x_1), \ldots, F_D(x_D)\big),$$
  $$p(x_1, \ldots, x_D) = c\big(F_1(x_1), \ldots, F_D(x_D)\big) \prod_{i=1}^{D} p_i(x_i).$$
- Define $u_i = F_i(x_i) \in [0, 1]$ for $i = 1, \ldots, D$; then we have
  $$C(u_1, \ldots, u_D) = F\big(F_1^{-1}(u_1), \ldots, F_D^{-1}(u_D)\big),$$
  $$c(u_1, \ldots, u_D) = \frac{\partial^D C(u_1, \ldots, u_D)}{\partial u_1 \cdots \partial u_D} \quad \text{(copula density)}.$$


Gaussian Copula

- The Gaussian copula with covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$ is given by
  $$C_\Phi(u_1, \ldots, u_D) = \Phi_\Sigma\big(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_D) \,\big|\, \Sigma\big),$$
  where $\Phi_\Sigma(\cdot \,|\, \Sigma)$ is the $D$-dimensional Gaussian CDF with covariance matrix $\Sigma$ whose diagonal entries equal one, and $\Phi(\cdot)$ is the univariate standard Gaussian CDF.
- The Gaussian copula density is given by
  $$c_\Phi(u_1, \ldots, u_D) = \frac{\partial^D C_\Phi(u_1, \ldots, u_D)}{\partial u_1 \cdots \partial u_D} = |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\, q^\top (\Sigma^{-1} - I)\, q\right\},$$
  where $q = [q_1, \ldots, q_D]^\top$ with normal scores $q_i = \Phi^{-1}(u_i)$ for $i = 1, \ldots, D$.


Invoking Sklar's result with this Gaussian copula density, the joint density function is written as

$$p(x) = |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\, q^\top (\Sigma^{-1} - I)\, q\right\} \prod_{i=1}^{D} p_i(x_i).$$
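A sketch of evaluating this joint log-density for continuous marginals with SciPy; the marginal CDFs and log-PDFs are user-supplied callables, and all names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_logpdf(x, Sigma, marginal_cdfs, marginal_logpdfs):
    # x: (D,) observation; marginal_cdfs[i] = F_i, marginal_logpdfs[i] = log p_i.
    u = np.array([F(xi) for F, xi in zip(marginal_cdfs, x)])
    q = norm.ppf(u)                                 # normal scores q_i
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    # log copula density: -1/2 log|Sigma| - 1/2 q^T (Sigma^{-1} - I) q
    log_c = -0.5 * logdet - 0.5 * q @ (Sinv - np.eye(len(x))) @ q
    # plus the sum of marginal log-densities
    return log_c + sum(lp(xi) for lp, xi in zip(marginal_logpdfs, x))
```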


Continuous Extension: Discrete Variables

- When the $x_i$ are discrete, the copula $C_\Phi(u_1, \ldots, u_D)$ is uniquely determined only on the range of $F_1 \times \cdots \times F_D$.
- The joint probability mass function (PMF) of $x_1, \ldots, x_D$ is given by
  $$p(x_1, \ldots, x_D) = \sum_{j_1=1}^{2} \cdots \sum_{j_D=1}^{2} (-1)^{j_1 + \cdots + j_D}\, \Phi_\Sigma\big(\Phi^{-1}(u_{1, j_1}), \ldots, \Phi^{-1}(u_{D, j_D})\big),$$
  where $u_{i,1} = F_i(x_i^-)$, the limit of $F_i(\cdot)$ at $x_i$ from the left, and $u_{i,2} = F_i(x_i)$.
- The PMF requires the evaluation of $2^D$ terms, which is not manageable even for a moderate value of $D$ (for instance, $D \ge 5$).
- A continuous extension (CE) of the discrete random variables $x_i$ avoids the $D$-fold summation in the above equation, associating a continuous random variable $x_i^* = x_i - v_i$ with the integer-valued $x_i$, where $v_i$ is uniform on $[0, 1]$ and is independent of $x_i$ as well as of $v_j$ for $j \ne i$.


- Continuous random variables $x_i^*$ produced by jittering $x_i$ have the CDF and PDF
  $$F_i^*(\xi) = F_i([\xi]) + (\xi - [\xi])\, P(x_i = [\xi + 1]),$$
  $$p_i^*(\xi) = P(x_i = [\xi + 1]),$$
  where $[\xi]$ denotes the largest integer less than or equal to $\xi$.
- The joint PMF of $x_1, \ldots, x_D$ is given by
  $$p(x_1, \ldots, x_D) = \mathbb{E}_v\left[\,|\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\, q^{*\top} (\Sigma^{-1} - I)\, q^*\right\} \prod_{i=1}^{D} p_i^*(x_i - v_i)\right],$$
  $$q^* = \big[\Phi^{-1}(F_1^*(x_1 - v_1)), \ldots, \Phi^{-1}(F_D^*(x_D - v_D))\big]^\top.$$
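A small sketch of the jittered CDF and PDF for one discrete variable with integer support $0, \ldots, K-1$; the PMF array is a hypothetical marginal:

```python
import numpy as np

def jitter_cdf_pdf(pmf, xi):
    # pmf[k] = P(x_i = k) for k = 0, ..., K-1.
    # Returns F*_i(xi) and p*_i(xi) for the jittered x*_i = x_i - v_i.
    cdf = np.cumsum(pmf)
    k = int(np.floor(xi))                             # [xi]
    F_at_k = cdf[k] if k >= 0 else 0.0                # F_i([xi])
    p_next = pmf[k + 1] if 0 <= k + 1 < len(pmf) else 0.0  # P(x_i = [xi+1])
    F_star = F_at_k + (xi - k) * p_next
    p_star = p_next
    return F_star, p_star
```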


Gaussian Copula VAE

[Figure: graphical model of the Gaussian copula VAE, with a plate over the $N$ instances: latent $z^{(n)}$ with parameters $(\eta^{(n)}, \tau^{(n)})$, hidden units $h^{(n)}$, output $y^{(n)}$, parameters $(\mu_i^{(n)}, \sigma_i^{(n)})$ for the $d_c$ continuous attributes $x_i^{c,(n)}$ and $\beta_i^{(n)}$ for the $d_s$ discrete attributes $x_i^{s,(n)}$ with jitter $v_i^{s,(n)}$, normal scores $\tilde{q}_i^{c,(n)}, \tilde{q}_i^{s,(n)}$, and copula parameters $(a^{(n)}, \omega^{(n)})$.]


Attributes are mixed categorical (ordinal) and real-valued.

Table: Approximated test log-likelihood on the UCI Auto and SPECT datasets.

           UCI Auto (10K)       UCI SPECT (10K)
VAE        −200.289 ± 3.751     −144.195 ± 3.443
GCVAE      −189.344 ± 4.599     −134.579 ± 2.621


Anomaly Detection

[Figure: (a) toy data; (b) anomaly scores; (c) latent space; (d) our model, a graphical model over reservoir state $r^{(i-1)}$, observation $x^{(i)}$, and latent $z^{(i)}$, with a plate over $N$.]


Echo State Networks [Herbert Jaeger, 2002]

- An approach to recurrent neural network training.
- Consists of a large, fixed, recurrent "reservoir" network:
  $$r^{(i)} = \alpha r^{(i-1)} + (1 - \alpha)\, f\big(A r^{(i-1)} + B\big[1; \Lambda_x x^{(i)}\big]\big),$$
  where $A$ and $B$ are NOT trained but only properly initialized.
- The network output $y^{(i)}$ is computed by training suitable output connection weights $C$:
  $$y^{(i)} = C\big[1; x^{(i)}; r^{(i)}\big].$$
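A compact NumPy sketch of a reservoir run and a ridge-regression readout; rescaling $A$ to a given spectral radius and taking $\Lambda_x = I$ are common choices assumed here, not specified on the slide:

```python
import numpy as np

def run_reservoir(X, alpha=0.5, N_res=200, rho=0.9, seed=0):
    # X: (T, d_in) input sequence.  A and B are random and never trained.
    rng = np.random.default_rng(seed)
    T, d_in = X.shape
    A = rng.normal(size=(N_res, N_res))
    A *= rho / np.max(np.abs(np.linalg.eigvals(A)))   # spectral-radius scaling
    B = rng.normal(size=(N_res, 1 + d_in))
    r = np.zeros(N_res)
    R = np.zeros((T, N_res))
    for i in range(T):
        u = np.concatenate(([1.0], X[i]))             # [1; Lambda_x x^{(i)}]
        r = alpha * r + (1 - alpha) * np.tanh(A @ r + B @ u)
        R[i] = r
    return R

def train_readout(R, X, Y, lam=1e-2):
    # Train the output weights C by ridge regression on [1; x; r].
    Z = np.hstack([np.ones((len(X), 1)), X, R])
    C = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ Y)
    return C                                          # predictions: Z @ C
```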


Our Model: ES-CVAE [Suh and Choi, 2016]

[Figure: graphical model over $r^{(i-1)}$, $x^{(i)}$, and $z^{(i)}$, with a plate over $N$.]

The joint distribution over a sequence of $N$ instances, $p(x_1, x_2, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | x_{1:n-1})$, is modeled as

$$p(x_1, x_2, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | r_{n-1}),$$

where

$$p(x_n | r_{n-1}) = \int p(x_n | z_n, r_{n-1})\, p(z_n | r_{n-1})\, dz_n,$$

and the reservoir states $r_{n-1}$ are computed by

$$r_{n-1} = \alpha r_{n-2} + (1 - \alpha)\, f\big(A r_{n-2} + B\big[1; \Lambda_x x_{n-1}\big]\big).$$

Anomaly score of $x_n$: $-\log p(x_n | r_{n-1})$ (see the sketch below).
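A Monte Carlo sketch of the anomaly score, estimating $p(x_n | r_{n-1})$ by sampling from the prior $p(z_n | r_{n-1})$; both callables are hypothetical stand-ins for the trained networks (the paper may well use a tighter importance-sampling estimate via $q(z|x, r)$):

```python
import numpy as np

def anomaly_score(x, r_prev, prior_sample, decoder_logpdf, L=64):
    # prior_sample(r) draws z ~ p(z | r);
    # decoder_logpdf(x, z, r) returns log p(x | z, r).
    logp = np.array([decoder_logpdf(x, prior_sample(r_prev), r_prev)
                     for _ in range(L)])
    # log-mean-exp of the L samples, for numerical stability
    m = logp.max()
    log_px = m + np.log(np.mean(np.exp(logp - m)))
    return -log_px          # anomaly score: -log p(x_n | r_{n-1})
```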


Summary

- A quick overview of deep learning
- Pre-trained CNNs + Bayesian model comparison
- Deep directed generative models
- Gaussian copula variational autoencoders
- Echo-state conditional variational autoencoders
