
Bayesian Deep Learning

Seungjin Choi

Department of Computer Science and Engineering
Pohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]

http://mlg.postech.ac.kr/~seungjin

December 2, 2016


Deep Learning

- Function composition: $f \approx \sigma_L \circ \sigma_{L-1} \circ \cdots \circ \sigma_1$
- Fully-connected network (MLP): $h_t^{(l)} = \sigma\big(W^{(l)} h_t^{(l-1)} + b_t^{(l)}\big)$
- Convolutional neural network: $h_t^{(l)} = \sigma\big(w^{(l)} * h_t^{(l-1)} + b_t^{(l)}\big)$
- Recurrent neural network: $h_t^{(l)} = \sigma\big(W^{(l)} h_t^{(l-1)} + V^{(l)} h_{t-1}^{(l)} + b_t^{(l)}\big)$
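To make the three layer types concrete, here is a minimal NumPy sketch of a single forward step for each, assuming $\sigma = \tanh$ and a 1-D convolution; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def sigma(a):
    # Elementwise nonlinearity; tanh is one common choice.
    return np.tanh(a)

def mlp_layer(h_prev, W, b):
    # Fully-connected layer: h = sigma(W h_prev + b).
    return sigma(W @ h_prev + b)

def conv_layer(h_prev, w, b):
    # 1-D convolutional layer: h = sigma(w * h_prev + b),
    # where * denotes convolution ("same"-size output).
    return sigma(np.convolve(h_prev, w, mode="same") + b)

def rnn_layer(h_prev_layer, h_prev_time, W, V, b):
    # Recurrent layer: h_t = sigma(W h_t^{(l-1)} + V h_{t-1}^{(l)} + b).
    return sigma(W @ h_prev_layer + V @ h_prev_time + b)
```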


ImageNet Challenge


2015: A Milestone Year in Computer Science

[Figure: AlexNet (2012)]

- AlexNet (5 convolutional layers + 3 fully connected layers), 2012
- VGG (very deep CNN, 16-19 weight layers), 2015
- GoogLeNet (22 layers), 2015
- Deep Residual Net (100-1000 layers), 2015

https://blogs.nvidia.com/blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/


Bayes + Deep = ?

- Deep
  - Good approximation of complex nonlinear transforms
  - Deep hierarchy for representation learning
- Bayes
  - Model comparison
  - Predictive distribution (averaging the likelihood w.r.t. the posterior over parameters)
  - Uncertainty
- Bayesian deep learning = combine the best of the two approaches?


CNN + Bayesian Model Comparison


Why is CNN so successful?

- Similar to simple and complex cells in the V1 area of the visual cortex
- Deep architecture
- Supervised representation learning

[Figure: LeNet, 1989]

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.


Pre-Trained CNNs

- AlexNet (5 convolutional layers + 3 fully connected layers): A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 25, 2012.
- VGG (very deep CNN, 16-19 weight layers): K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- GoogLeNet (22 layers): C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- Deep Residual Net (100-1000 layers): K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.


Bayesian Model Comparison

Model selection: choose the single most probable model using the evidence

$$p(\mathcal{D}|\mathcal{M}_i) = \int p(\mathcal{D}|w, \mathcal{M}_i)\, p(w|\mathcal{M}_i)\, dw.$$

[Figure: evidence $p(\mathcal{D})$ as a function of the data set $\mathcal{D}$ for models $\mathcal{M}_1$, $\mathcal{M}_2$, $\mathcal{M}_3$, with the observed data set $\mathcal{D}_0$ marked.]

- $\mathcal{M}_1$ is the simplest and $\mathcal{M}_3$ is the most complex.
- For the particular observed data set $\mathcal{D}_0$, the model $\mathcal{M}_2$ with intermediate complexity has the largest evidence.


Bayesian Linear Regression

- Gaussian likelihood: $p(y|X, w) = \prod_{i=1}^{N} \mathcal{N}(y_i \,|\, x_i^\top w, \beta^{-1})$.
- Gaussian prior: $p(w) = \mathcal{N}(w \,|\, 0, \alpha^{-1} I)$.
- The posterior is Gaussian of the form $p(w|y, X) = \mathcal{N}(w \,|\, \mu_N, \Lambda_N^{-1})$, where
  $$\mu_N = \beta \Lambda_N^{-1} X y, \qquad \Lambda_N = \alpha I + \beta X X^\top.$$
- The marginal likelihood (evidence) is given by
  $$\mathcal{L}(\alpha, \beta) = \log p(y|X, \alpha, \beta) = \log \int p(y|X, w, \beta)\, p(w|\alpha)\, dw$$
  $$= \frac{D}{2}\log\alpha + \frac{N}{2}\log\beta - \frac{\beta}{2}\,\|y - X^\top \mu_N\|^2 - \frac{\alpha}{2}\,\mu_N^\top \mu_N - \frac{1}{2}\log|\Lambda_N| - \frac{N}{2}\log 2\pi.$$
- Fixed-point updates for the hyperparameters $\alpha$ and $\beta$:
  $$\alpha = \frac{\gamma}{\mu_N^\top \mu_N}, \qquad \beta = \frac{N - \gamma}{\|y - X^\top \mu_N\|^2},$$
  where $\gamma = \sum_{d=1}^{D} \frac{\beta s_d}{\alpha + \beta s_d}$ and the $s_d$ are eigenvalues of $X X^\top$.
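A small NumPy sketch of the evidence framework above, assuming the slide's convention that $X$ is $D \times N$ with data points as columns; the function name and defaults are hypothetical:

```python
import numpy as np

def evidence_maximization(X, y, alpha=1.0, beta=1.0, n_iter=50):
    # X: (D, N) with data points as columns; y: (N,) targets.
    D, N = X.shape
    G = X @ X.T                                  # D x D Gram matrix X X^T
    s = np.linalg.eigvalsh(G)                    # eigenvalues s_d of X X^T
    for _ in range(n_iter):
        Lam = alpha * np.eye(D) + beta * G       # Lambda_N
        mu = beta * np.linalg.solve(Lam, X @ y)  # mu_N
        gamma = np.sum(beta * s / (alpha + beta * s))
        alpha = gamma / (mu @ mu)                # fixed-point updates
        beta = (N - gamma) / np.sum((y - X.T @ mu) ** 2)
    # Log evidence L(alpha, beta) from the formula above.
    Lam = alpha * np.eye(D) + beta * G
    mu = beta * np.linalg.solve(Lam, X @ y)
    _, logdet = np.linalg.slogdet(Lam)
    log_ev = (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta)
              - 0.5 * beta * np.sum((y - X.T @ mu) ** 2)
              - 0.5 * alpha * (mu @ mu) - 0.5 * logdet
              - 0.5 * N * np.log(2 * np.pi))
    return alpha, beta, mu, log_ev
```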


Bayesian Approach: Evidence $p(\mathcal{D}|\mathcal{M}_i)$

- Select the model with maximum evidence
- Select a subset of pre-trained CNNs in a greedy manner (see the sketch below)
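A hedged sketch of what such a greedy search could look like, reusing the `evidence_maximization` routine from the Bayesian linear regression sketch; the `features` dictionary and the stopping rule are assumptions for illustration, not the exact procedure of the CVPR paper:

```python
import numpy as np

def greedy_select(features, y, max_models=3):
    # features: dict mapping CNN name -> (D_m, N) feature matrix.
    selected, best = [], -np.inf
    while len(selected) < max_models:
        gains = {}
        for name, F in features.items():
            if name in selected:
                continue
            # Stack the candidate's features onto the current selection.
            stacked = np.vstack([features[m] for m in selected] + [F])
            *_, log_ev = evidence_maximization(stacked, y)
            gains[name] = log_ev
        if not gains:
            break
        name = max(gains, key=gains.get)
        if gains[name] <= best:      # stop when evidence stops improving
            break
        selected.append(name)
        best = gains[name]
    return selected, best
```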


Fast Bayesian Learning

Yong-Deok Kim, Taewoong Jang, Bohyung Han, and Seungjin Choi (2016), "Learning to select pre-trained deep representations with Bayesian evidence framework," in CVPR-2016. (oral)


Linear Generative Models

$$x = A z + \varepsilon$$

- Factor analysis: spherical Gaussian prior
- Independent component analysis: independent non-Gaussian prior
- Nonnegative matrix factorization: nonnegative prior
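A quick sketch contrasting the three priors by sampling from $x = Az + \varepsilon$; the Laplace and exponential draws stand in for generic non-Gaussian and nonnegative prior choices, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 10, 3, 500
A = rng.normal(size=(D, K))              # mixing matrix (nonnegative for NMF)
eps = 0.1 * rng.normal(size=(D, N))      # observation noise

z_fa  = rng.normal(size=(K, N))          # FA: spherical Gaussian prior
z_ica = rng.laplace(size=(K, N))         # ICA: independent non-Gaussian prior
z_nmf = rng.exponential(size=(K, N))     # NMF: nonnegative prior

# x = A z + eps under each prior.
x_fa, x_ica, x_nmf = (A @ z + eps for z in (z_fa, z_ica, z_nmf))
```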


ICA

Assume that the $\{z_i\}$ are statistically independent.


NMF

Lee and Seung (1999), "Learning the parts of objects by non-negative matrix factorization," Nature.
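For reference, the standard Lee-Seung multiplicative updates for the Frobenius objective $\|X - WH\|_F^2$, as a minimal sketch (this is plain NMF, not the up-propagation variant on the next slide):

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-9):
    # Lee-Seung multiplicative updates minimizing ||X - W H||_F^2.
    # Nonnegativity is preserved because each update multiplies the
    # current factor by a nonnegative ratio.
    D, N = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((D, K))
    H = rng.random((K, N))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```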


Multiplicative Up-Prop

Ahn, Oh, and Choi, "A multiplicative up-propagation algorithm," ICML-2004.


Deep Generative Models

Generative Models + Deep Networks


Back-Prop (discriminative): learning with labeled data

Up-Prop (generative): learning with unlabeled data

Oh and Seung, "Learning Generative Models with the Up Propagation Algorithm," NIPS-1997.


Deep Directed Generative Models

- Probabilistic decoder: $p_\theta(x|z)$
- Inference: $p(z|x)$
- Density network: $p_\theta(x|z) = g(x|z, \theta)$ [MacKay and Gibbs]
- Variational autoencoder: introduce an inference network $q_\phi(z|x) = f(z|x, \phi)$ [Kingma and Welling]

https://www.openai.com/blog/generative-models/


Variational Lower-Bound

$$\log p(x) \;\ge\; \int q(z|x) \log \frac{p(x, z)}{q(z|x)}\, dz \;=\; \int q(z|x) \log \frac{p(x|z)\, p(z)}{q(z|x)}\, dz \;=\; \underbrace{\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]}_{\text{Reconstruction}} \;-\; \underbrace{\mathrm{KL}\big[q(z|x) \,\|\, p(z)\big]}_{\text{Penalty}}.$$

- Reconstruction cost: the expected log-likelihood measures how well samples from $q(z|x)$ are able to explain the data $x$.
- Penalty: ensures that the approximation $q(z|x)$ to the posterior does not deviate too far from your beliefs $p(z)$.


Stochastic Gradient Variational Bayes

$$\mathcal{F}(x, \phi) = \underbrace{\mathbb{E}_{q}\big[\log p(x|z)\big]}_{\text{SGVB}} \;-\; \underbrace{\mathrm{KL}\big[q(z|x) \,\|\, p(z)\big]}_{\text{analytically computed}},$$

where $\mathbb{E}_q[\cdot]$ denotes the expectation w.r.t. $q(z|x)$ and Monte Carlo estimates are formed with the reparameterization trick:

$$\mathbb{E}_q\big[\log p(x|z)\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p\big(x \,|\, z^{(l)}\big),$$

where $z^{(l)} = m + \sqrt{\lambda} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal{N}(0, I)$. A single sample is often sufficient to form this Monte Carlo estimate in practice.
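A single-sample SGVB estimate for a diagonal Gaussian $q(z|x) = \mathcal{N}(m, \mathrm{diag}(\lambda))$ and a standard normal prior, as a minimal sketch; `decode` is a hypothetical stand-in for the decoder's log-likelihood $\log p(x|z)$:

```python
import numpy as np

def elbo_single_sample(x, m, lam, decode, rng):
    # m, lam: mean and variance of q(z|x); decode(z) -> log p(x|z).
    eps = rng.normal(size=m.shape)
    z = m + np.sqrt(lam) * eps        # reparameterization trick
    recon = decode(z)                 # Monte Carlo estimate with L = 1
    # KL[ N(m, diag(lam)) || N(0, I) ], computed analytically.
    kl = 0.5 * np.sum(lam + m**2 - 1.0 - np.log(lam))
    return recon - kl
```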


Johnson et al. (2016), "Structured VAEs: Composing Probabilistic Graphical Models and Variational Autoencoders," Preprint arXiv:1603.06277.


Data Imputation

Taken from Shakir Mohamed’s slides


Image Generation

Taken from Shakir Mohamed’s slides


Semi-Supervised Learning

A small number of labeled examples together with plenty of unlabeled examples.

Taken from Wikipedia


VAE with Rank-One Covariance [Suh and Choi, 2016]

[Figure: latent points $z^{(k)}$ in $Z$ mapped to local means $\mu(z^{(k)})$ and local principal directions $a(z^{(k)})$ in data space $(X_1, X_2)$.]

- Find the local principal direction at a specific location $\mu(z)$:
  $$p(x|z) = \mathcal{N}\big(\mu, \,\omega I + a a^\top\big), \qquad p(z) = \mathcal{N}(0, I),$$
  where
  $$\mu = W_\mu h + b_\mu, \quad \log \omega = w_\omega^\top h + b_\omega, \quad a = W_a h + b_a, \quad h = \tanh(W_h z + b_h).$$
- Can be interpreted as an infinite mixture of PPCA with $p(s) = \mathcal{N}(0, 1)$:
  $$p(x|s, z) = \mathcal{N}(a s + \mu, \,\omega I).$$

Suwon Suh and Seungjin Choi (2016), "Gaussian copula variational autoencoders for mixed data," Preprint arXiv:1604.04960.
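The rank-one covariance makes the Gaussian log-density cheap to evaluate via the Sherman-Morrison identity and the matrix determinant lemma; a sketch of this standard linear algebra (not code from the paper):

```python
import numpy as np

def logpdf_rank_one(x, mu, omega, a):
    # log N(x; mu, omega*I + a a^T) without forming the D x D covariance:
    #   (wI + aa^T)^{-1} = (1/w)(I - aa^T / (w + a^T a)),
    #   det(wI + aa^T)   = w^{D-1} (w + a^T a).
    D = x.size
    d = x - mu
    ata = a @ a
    quad = (d @ d - (d @ a) ** 2 / (omega + ata)) / omega
    logdet = (D - 1) * np.log(omega) + np.log(omega + ata)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
```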


[Figure: (a) true images; (b) generated images.]


[Figure: (a) data; (b) VAE.]


[Figure: (a) VAE-ROC without regularization; (b) VAE-ROC with L2-norm regularization, $\lambda_{\mathrm{local}} = 5$.]


Copulas

- A $D$-dimensional copula $C$ is a distribution function on the unit cube $[0, 1]^D$ with each univariate marginal distribution being uniform on $[0, 1]$.
- Classical result of Sklar (1959), for continuous $x_i$:
  $$F(x_1, \ldots, x_D) = C\big(F_1(x_1), \ldots, F_D(x_D)\big),$$
  $$p(x_1, \ldots, x_D) = c\big(F_1(x_1), \ldots, F_D(x_D)\big) \prod_{i=1}^{D} p_i(x_i).$$
- Define $u_i = F_i(x_i) \in [0, 1]$ for $i = 1, \ldots, D$; then we have
  $$C(u_1, \ldots, u_D) = F\big(F_1^{-1}(u_1), \ldots, F_D^{-1}(u_D)\big),$$
  $$c(u_1, \ldots, u_D) = \frac{\partial^D C(u_1, \ldots, u_D)}{\partial u_1 \cdots \partial u_D} \quad \text{(copula density)}.$$


Gaussian Copula

- The Gaussian copula with covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$ is given by
  $$C_\Phi(u_1, \ldots, u_D) = \Phi_\Sigma\big(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_D) \,\big|\, \Sigma\big),$$
  where $\Phi_\Sigma(\cdot \,|\, \Sigma)$ is the $D$-dimensional Gaussian CDF with covariance matrix $\Sigma$ whose diagonal entries equal one, and $\Phi(\cdot)$ is the univariate standard Gaussian CDF.
- The Gaussian copula density is given by
  $$c_\Phi(u_1, \ldots, u_D) = \frac{\partial^D C_\Phi(u_1, \ldots, u_D)}{\partial u_1 \cdots \partial u_D} = |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\, q^\top (\Sigma^{-1} - I)\, q\right\},$$
  where $q = [q_1, \ldots, q_D]^\top$ with normal scores $q_i = \Phi^{-1}(u_i)$ for $i = 1, \ldots, D$.


Invoking Sklar's result with this Gaussian copula density, the joint density function is written as

$$p(x) = |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\, q^\top (\Sigma^{-1} - I)\, q\right\} \prod_{i=1}^{D} p_i(x_i).$$
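A sketch of evaluating this joint log-density for continuous marginals with SciPy; the marginal CDFs and log-PDFs are user-supplied callables, and all names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_logpdf(x, Sigma, marginal_cdfs, marginal_logpdfs):
    # x: (D,) observation; marginal_cdfs[i] = F_i, marginal_logpdfs[i] = log p_i.
    u = np.array([F(xi) for F, xi in zip(marginal_cdfs, x)])
    q = norm.ppf(u)                                 # normal scores q_i
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    # log copula density: -1/2 log|Sigma| - 1/2 q^T (Sigma^{-1} - I) q
    log_c = -0.5 * logdet - 0.5 * q @ (Sinv - np.eye(len(x))) @ q
    # plus the sum of marginal log-densities
    return log_c + sum(lp(xi) for lp, xi in zip(marginal_logpdfs, x))
```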


Continuous Extension: Discrete Variables

- When the $x_i$ are discrete, the copula $C_\Phi(u_1, \ldots, u_D)$ is uniquely determined only on the range of $F_1 \times \cdots \times F_D$.
- The joint probability mass function (PMF) of $x_1, \ldots, x_D$ is given by
  $$p(x_1, \ldots, x_D) = \sum_{j_1=1}^{2} \cdots \sum_{j_D=1}^{2} (-1)^{j_1 + \cdots + j_D}\, \Phi_\Sigma\big(\Phi^{-1}(u_{1, j_1}), \ldots, \Phi^{-1}(u_{D, j_D})\big),$$
  where $u_{i,1} = F_i(x_i^-)$, the limit of $F_i(\cdot)$ at $x_i$ from the left, and $u_{i,2} = F_i(x_i)$.
- The PMF requires the evaluation of $2^D$ terms, which is not manageable even for a moderate value of $D$ (for instance, $D \ge 5$).
- A continuous extension (CE) of the discrete random variables $x_i$ avoids the $D$-fold summation in the above equation, associating a continuous random variable $x_i^* = x_i - v_i$ with the integer-valued $x_i$, where $v_i$ is uniform on $[0, 1]$ and is independent of $x_i$ as well as of $v_j$ for $j \ne i$.


- Continuous random variables $x_i^*$ produced by jittering $x_i$ have the CDF and PDF
  $$F_i^*(\xi) = F_i([\xi]) + (\xi - [\xi])\, P(x_i = [\xi + 1]),$$
  $$p_i^*(\xi) = P(x_i = [\xi + 1]),$$
  where $[\xi]$ denotes the largest integer less than or equal to $\xi$.
- The joint PMF of $x_1, \ldots, x_D$ is given by
  $$p(x_1, \ldots, x_D) = \mathbb{E}_v\left[\,|\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\, q^{*\top} (\Sigma^{-1} - I)\, q^*\right\} \prod_{i=1}^{D} p_i^*(x_i - v_i)\right],$$
  $$q^* = \big[\Phi^{-1}(F_1^*(x_1 - v_1)), \ldots, \Phi^{-1}(F_D^*(x_D - v_D))\big]^\top.$$
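A small sketch of the jittered CDF and PDF for one discrete variable with integer support $0, \ldots, K-1$; the PMF array is a hypothetical marginal:

```python
import numpy as np

def jitter_cdf_pdf(pmf, xi):
    # pmf[k] = P(x_i = k) for k = 0, ..., K-1.
    # Returns F*_i(xi) and p*_i(xi) for the jittered x*_i = x_i - v_i.
    cdf = np.cumsum(pmf)
    k = int(np.floor(xi))                             # [xi]
    F_at_k = cdf[k] if k >= 0 else 0.0                # F_i([xi])
    p_next = pmf[k + 1] if 0 <= k + 1 < len(pmf) else 0.0  # P(x_i = [xi+1])
    F_star = F_at_k + (xi - k) * p_next
    p_star = p_next
    return F_star, p_star
```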


Gaussian Copula VAE

[Figure: graphical model of the Gaussian copula VAE, with a plate over the $N$ instances: latent $z^{(n)}$ with parameters $(\eta^{(n)}, \tau^{(n)})$, hidden units $h^{(n)}$, output $y^{(n)}$, parameters $(\mu_i^{(n)}, \sigma_i^{(n)})$ for the $d_c$ continuous attributes $x_i^{c,(n)}$ and $\beta_i^{(n)}$ for the $d_s$ discrete attributes $x_i^{s,(n)}$ with jitter $v_i^{s,(n)}$, normal scores $\tilde{q}_i^{c,(n)}, \tilde{q}_i^{s,(n)}$, and copula parameters $(a^{(n)}, \omega^{(n)})$.]


Attributes are mixed categorical (ordinal) and real-valued.

Table: Approximated test log-likelihood on the UCI Auto and SPECT datasets.

           UCI Auto (10K)       UCI SPECT (10K)
VAE        −200.289 ± 3.751     −144.195 ± 3.443
GCVAE      −189.344 ± 4.599     −134.579 ± 2.621


Anomaly Detection

[Figure: (a) toy data; (b) anomaly scores; (c) latent space; (d) our model, a graphical model over reservoir state $r^{(i-1)}$, observation $x^{(i)}$, and latent $z^{(i)}$, with a plate over $N$.]


Echo State Networks [Herbert Jaeger, 2002]

- An approach to recurrent neural network training.
- Consists of a large, fixed, recurrent "reservoir" network:
  $$r^{(i)} = \alpha r^{(i-1)} + (1 - \alpha)\, f\big(A r^{(i-1)} + B\big[1; \Lambda_x x^{(i)}\big]\big),$$
  where $A$ and $B$ are NOT trained but only properly initialized.
- The network output $y^{(i)}$ is computed by training suitable output connection weights $C$:
  $$y^{(i)} = C\big[1; x^{(i)}; r^{(i)}\big].$$
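A compact NumPy sketch of a reservoir run and a ridge-regression readout; rescaling $A$ to a given spectral radius and taking $\Lambda_x = I$ are common choices assumed here, not specified on the slide:

```python
import numpy as np

def run_reservoir(X, alpha=0.5, N_res=200, rho=0.9, seed=0):
    # X: (T, d_in) input sequence.  A and B are random and never trained.
    rng = np.random.default_rng(seed)
    T, d_in = X.shape
    A = rng.normal(size=(N_res, N_res))
    A *= rho / np.max(np.abs(np.linalg.eigvals(A)))   # spectral-radius scaling
    B = rng.normal(size=(N_res, 1 + d_in))
    r = np.zeros(N_res)
    R = np.zeros((T, N_res))
    for i in range(T):
        u = np.concatenate(([1.0], X[i]))             # [1; Lambda_x x^{(i)}]
        r = alpha * r + (1 - alpha) * np.tanh(A @ r + B @ u)
        R[i] = r
    return R

def train_readout(R, X, Y, lam=1e-2):
    # Train the output weights C by ridge regression on [1; x; r].
    Z = np.hstack([np.ones((len(X), 1)), X, R])
    C = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ Y)
    return C                                          # predictions: Z @ C
```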


Our Model: ES-CVAE [Suh and Choi, 2016]

[Figure: graphical model over $r^{(i-1)}$, $x^{(i)}$, and $z^{(i)}$, with a plate over $N$.]

The joint distribution over a sequence of $N$ instances, $p(x_1, x_2, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | x_{1:n-1})$, is modeled as

$$p(x_1, x_2, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | r_{n-1}),$$

where

$$p(x_n | r_{n-1}) = \int p(x_n | z_n, r_{n-1})\, p(z_n | r_{n-1})\, dz_n,$$

and the reservoir states $r_{n-1}$ are computed by

$$r_{n-1} = \alpha r_{n-2} + (1 - \alpha)\, f\big(A r_{n-2} + B\big[1; \Lambda_x x_{n-1}\big]\big).$$

Anomaly score of $x_n$: $-\log p(x_n | r_{n-1})$ (see the sketch below).
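A Monte Carlo sketch of the anomaly score, estimating $p(x_n | r_{n-1})$ by sampling from the prior $p(z_n | r_{n-1})$; both callables are hypothetical stand-ins for the trained networks (the paper may well use a tighter importance-sampling estimate via $q(z|x, r)$):

```python
import numpy as np

def anomaly_score(x, r_prev, prior_sample, decoder_logpdf, L=64):
    # prior_sample(r) draws z ~ p(z | r);
    # decoder_logpdf(x, z, r) returns log p(x | z, r).
    logp = np.array([decoder_logpdf(x, prior_sample(r_prev), r_prev)
                     for _ in range(L)])
    # log-mean-exp of the L samples, for numerical stability
    m = logp.max()
    log_px = m + np.log(np.mean(np.exp(logp - m)))
    return -log_px          # anomaly score: -log p(x_n | r_{n-1})
```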


Summary

- A quick overview of deep learning
- Pre-trained CNNs + Bayesian model comparison
- Deep directed generative models
- Gaussian copula variational autoencoders
- Echo-state conditional variational autoencoders
