
Machine Learning

Lecture 9

Deep Learning

Introduction & Stacked AE

Dr. Patrick Chan
[email protected]

South China University of Technology, China


Agenda

Introduction to Deep Learning

Autoencoder
  Autoencoder
  Sparse Autoencoder
  Contractive Autoencoder
  Denoising Autoencoder
  Stacked Autoencoder


What is Deep Learning?

Branch of Machine Learning

Commonly refers to a neural network with multiple layers (a deep architecture)


Why Deep Architecture?

Our brain is a very deep architecture

A deep architecture can represent more complicated functions than a shallow one

Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), 2009

(Figure: deeper networks represent more complicated functions; shallower networks represent less complicated functions.)


Deep Learning

What’s New?

Brief History of Machine Learning

(Timeline figure from http://www.erogol.com/brief-history-machine-learning/; deep learning emerges around 2006.)

Interest in deep architectures had declined due to the success of SVMs


Deep Learning

What’s New?

Multilayer neural networks have been studied for more than 30 years

What’s actually new?

Old wine in a new bottle??


Deep Learning

What’s New?

With traditional backpropagation, a DNN is often less accurate than a shallow network

Backpropagation loses its power in deep architectures

Vanishing gradient problem
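A minimal NumPy sketch (an illustration, not from the lecture) of the effect: with sigmoid units, each layer multiplies the backpropagated gradient by roughly $w \cdot a(1-a)$, and since $a(1-a) \le 0.25$ the product shrinks rapidly with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = 1.0                        # gradient arriving at the output
for layer in range(20):           # 20 hidden layers, one unit each for simplicity
    w = rng.normal()              # a typical random weight
    a = sigmoid(rng.normal())     # this layer's activation
    grad *= w * a * (1.0 - a)     # chain rule: sigmoid'(z) = a * (1 - a) <= 0.25
    print(f"layer {layer + 1:2d}: |grad| = {abs(grad):.2e}")
```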


Deep Learning

What’s New?

With traditional backpropagation, a DNN is often less accurate than a shallow network

Optimization is very complex

Too many parameters in deep architectures

E.g. weight matrices between successive layers: 8x9, 9x9, 9x9, 9x9, 9x9, 9x9, 9x9, 9x4
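Reading the layer widths off the slide's diagram, the weights alone (biases omitted) already give 594 parameters to optimise jointly; a quick check, assuming the widths are as drawn:

```python
# Layer widths as read from the slide's diagram (an assumption about the figure)
widths = [8, 9, 9, 9, 9, 9, 9, 9, 4]
# Number of weights between consecutive layers, biases omitted
n_weights = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
print(n_weights)  # 8*9 + 6*(9*9) + 9*4 = 594
```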


Deep Learning

What’s New?

In the past, there was no suitable training method for deep architectures

Breakthrough in 2006: three papers proposed new learning algorithms for deep structures

1. Hinton, G. E., Osindero, S. and Teh, Y., A fast learning algorithm for deep belief nets, Neural Computation 18:1527-1554, 2006

2. Yoshua Bengio, Pascal Lamblin, Dan Popovici and Hugo Larochelle, Greedy Layer-Wise Training of Deep Networks, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 153-160, MIT Press, 2007

3. Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra and Yann LeCun, Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems (NIPS 2006), MIT Press, 2007


Deep Learning

What’s New?

Vanishing gradient problem
  Divide and conquer: stacked training
  E.g. Stacked Autoencoder (SA) & RBM

Too many parameters
  Reduce parameters
  E.g. Convolutional Neural Network (CNN)


Deep Learning

What’s New?

Deep Learning focuses on feature learning rather than on the classifier

(Figure: traditional learning uses person-designed features followed by a complicated classifier; deep learning learns the features itself (feature learning) and needs only a simple classifier.)

Deep Learning

Unsupervised Learning

Stacked Denoising Autoencoder (SAE)

Supervised Learning

Convolutional Neural Network (CNN)

Recurrent Neural Network (RNN)

Generative Adversarial Network (GAN)


Autoencoder (AE)

An autoencoder aims to learn an identity function

It predicts the input itself at the output

It contains two parts:

Encoder: maps the original input to a latent representation

Decoder: reconstructs the input (the estimated input) from the latent representation

(Figure: original input → Encoder → latent representation → Decoder → estimated input.)
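A minimal PyTorch sketch of this structure (layer sizes, sigmoid activations, and the training loop are assumptions for illustration, not taken from the lecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=4, n_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_latent), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_latent, n_in), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)            # latent representation
        return self.decoder(h)         # estimated (reconstructed) input

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 4)                  # an unlabeled batch
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)   # predict the input itself
    loss.backward()
    opt.step()
```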


Autoencoder (AE)

An autoencoder may contain more than one hidden layer

(Figure: a multi-layer autoencoder with stacked encoder layers and stacked decoder layers.)

Autoencoder (AE)

Gradient descent can be used to train an autoencoder

Objective Function:

Cross entropy

Squared Error
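In symbols (standard forms; the slide does not spell them out), with input $x$ and reconstruction $\hat{x}$:

Squared error: $J(x,\hat{x}) = \lVert x - \hat{x} \rVert^2 = \sum_i (x_i - \hat{x}_i)^2$

Cross entropy (for inputs in $[0,1]$): $J(x,\hat{x}) = -\sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right]$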


Autoencoder (AE)

Learning the identity function seems trivial

By adding constraints on the network, useful information can be learned

Number of hidden neurons

Regularization


Autoencoder (AE): Type


Undercomplete AE: hidden layer is smaller than the input layer

Overcomplete AE: hidden layer is greater than the input layer


Autoencoder (AE)

Undercomplete

Hidden layer is smaller than the input layer

Compress the input

Hidden nodes will generally be good features


Autoencoder (AE)

Overcomplete

Hidden layer is greater than the input layer

No compression in hidden layer

A higher-dimensional code may help when the distribution is more complex, e.g. the XOR case

No guarantee that the hidden units will extract meaningful features

A hidden unit may simply copy an input component

Additional requirements are considered

e.g. sparsity


Autoencoder (AE)

Serves as feature extraction

Extracts a robust feature representation for a model

No label is needed

Independent of the model

(Figure: noisy input → robust encoder (maybe multi-layer) → latent space representation → model.)


Autoencoder (AE)

Example

(Figures: an example input is mapped by the encoder to a latent code and reconstructed by the decoder.)


Sparse Autoencoder

Sparsity is a common constraint

Meaningful features can be learnt

The model may be more explainable with sparse features

(Figure: an example input expressed as a weighted sum of a few basis features, with weights such as 1 and 0.8.)


Sparse Autoencoder

Let $a^{(k)}_j(x)$ be the activation of the $j$-th hidden unit in layer $k$ of the encoder

Sparsity is defined as the average activation over the $n$ training samples:

$\hat{\rho}_j = \frac{1}{n}\sum_{i=1}^{n} a^{(k)}_j(x^{(i)})$

Force the constraint $\hat{\rho}_j = \rho$

$\rho$ is a "sparsity parameter"

Typically small but not zero, e.g. 0.05

(Figure: an autoencoder with inputs $x_1,\dots,x_4$; the sparsity constraint is applied to hidden layer $k$.)


Sparse Autoencoder

Penalize $\hat{\rho}_j$ for deviating from $\rho$

Many measures exist for this penalty term

Example: Kullback-Leibler (KL) divergence

$\sum_{j=1}^{s} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s} \left[ \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \right]$

KL is a standard function for measuring how different two distributions are

$s$: the number of neurons in the layer


Sparse Autoencoder

If $\hat{\rho}_j = \rho$, then $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$; otherwise KL increases monotonically with the difference between $\hat{\rho}_j$ and $\rho$


Sparse Autoencoder

Cost function:

$J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$

$\beta$: tradeoff parameter

$\hat{\rho}_j = \frac{1}{n}\sum_{i=1}^{n} a^{(k)}_j(x^{(i)})$ depends on all samples and affects $J_{\text{sparse}}$

It should therefore be calculated over all $x$ in advance
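A small NumPy sketch of this penalty: average each hidden unit's activation over the batch to get $\hat{\rho}_j$, then add the weighted KL term (the batch size, layer width, $\rho$ and $\beta$ values here are illustrative assumptions):

```python
import numpy as np

def kl(rho, rho_hat):
    # KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), elementwise
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

A = np.random.rand(100, 50) * 0.2          # activations: n=100 samples x s=50 hidden units
rho, beta = 0.05, 3.0                      # sparsity target and tradeoff parameter
rho_hat = np.clip(A.mean(axis=0), 1e-6, 1 - 1e-6)   # average activation per hidden unit
penalty = beta * kl(rho, rho_hat).sum()
# J_sparse = J_reconstruction + penalty
```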


Contractive Autoencoder

The learned encodings are expected to be similar for similar inputs

The derivatives of the hidden-layer activations with respect to the input should be small

(Figure: the encoder output $f_{en}(x)$ plotted against $x$, with two nearby inputs $x_1$ and $x_2$.)


Contractive Autoencoder

Regularization

Penalizing instances where a small change in the input leads to a large change in the encoding space

Cost function:

$J_{\text{CAE}}(W,b) = J(W,b) + \lambda \, \lVert J_f(x) \rVert_F^2$

$\lambda$: tradeoff parameter

Jacobian matrix of the encoder:

$J_f(x) = \begin{bmatrix} \frac{\partial h_1(x)}{\partial x_1} & \cdots & \frac{\partial h_1(x)}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_s(x)}{\partial x_1} & \cdots & \frac{\partial h_s(x)}{\partial x_d} \end{bmatrix}$

Frobenius norm (L2): $\lVert A \rVert_F = \sqrt{\sum_i \sum_j a_{ij}^2}$, where $a_{ij}$ is an element of $A$
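A hedged PyTorch sketch of this penalty for a single input: build the encoder's Jacobian with autograd and take its squared Frobenius norm (the toy encoder and its sizes are assumptions, not the course's network):

```python
import torch

W = torch.randn(2, 4)                      # toy encoder: 4 inputs -> 2 hidden units
def encoder(x):
    return torch.sigmoid(x @ W.T)

x = torch.rand(4)
J = torch.autograd.functional.jacobian(encoder, x)   # shape (2, 4): dh_j / dx_i
penalty = (J ** 2).sum()                   # squared Frobenius norm ||J_f(x)||_F^2
# J_CAE = J_reconstruction + lambda * penalty
```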


Denoising Autoencoder

Recover the noise-contaminated sample

undo the corruption effect

Denoising Autoencoder (DAE) is robust to noise

(Figure: noisy input → Encoder → latent representation → Decoder → estimated input.)


Denoising Autoencoder

Recall that the original AE reconstructs the input $x$ from $x$ itself

The DAE instead reconstructs $x$ from a corrupted version $\tilde{x}$

Each input is contaminated by either:

Assigning 0 to a subset of the input components with some probability (blinding)

Adding Gaussian additive noise

Example (each row is a sample with three features):

Original $x$:
0.6  0.8  0.2
0.2  0.5  0.5
0.4  0.1  0.7
0.3  0.1  2

$\tilde{x}$ (Gaussian noise):
0.5  0.9  0.3
0.5  0.5  0.3
0.4  0.2  0.8
0.1  0.3  1.7

$\tilde{x}$ (Blinding):
0.6  0    0.2
0    0.5  0
0    0.1  0.7
0.3  0.1  2
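The two corruption schemes can be written in a couple of lines; a NumPy sketch, taking the "Original x" table above as the input (the masking probability and noise scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([[0.6, 0.8, 0.2],
              [0.2, 0.5, 0.5],
              [0.4, 0.1, 0.7],
              [0.3, 0.1, 2.0]])                       # original samples

x_blind = x * (rng.random(x.shape) > 0.25)            # blinding: zero components with prob 0.25
x_gauss = x + rng.normal(scale=0.1, size=x.shape)     # additive Gaussian noise
```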


Denoising Autoencoder

The DAE cannot fully trust each feature of the corrupted input $\tilde{x}$ independently, so it must learn the correlations among $x$'s features

(Figure: inputs $x_1,\dots,x_4$ are corrupted, encoded into $z_1, z_2$, and decoded back into estimates of $x_1,\dots,x_4$.)


CAE vs DAE

Denoising AE
  The whole AE (i.e. encoder + decoder) resists small but finite-sized perturbations of the input
  Defined by the noise injected during training
  Changes the training samples (adding noise or blinding)

Contractive AE
  The feature-extraction function (i.e. the encoder) resists infinitesimal perturbations of the input (slope)
  Changes the objective function
  Might be more stable than DAE, which uses a sampled gradient
  Complexity is high due to the calculation of the Jacobian matrix


Stacked Autoencoder

An AE with a deep structure faces the vanishing gradient problem

The stacking concept is used

Train one layer at a time rather than the whole model at the same time

(Figure: a stack of encoder layers connected to a model.)


Stacked Autoencoder

(Figure: 1st layer training (AE 1), 2nd layer training (AE 2), 3rd layer training (AE 3), then connect to a classifier.)


Stacked Autoencoder

Algorithm

Train AE to reconstruct the input x

Keep encoder only (remove decoder)

Repeat

Treat the output of the previous AE as input, and train a new AE

Keep encoder only (remove decoder)

Until enough layers are trained

Connect to a model (e.g. classifier)

Use backpropagation to fine-tune the entire network using supervised data if necessary
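A compact PyTorch sketch of this procedure: train each AE on the previous encoder's output, keep only the encoders, then attach a classifier for optional fine-tuning (layer widths, optimiser, and epoch counts are assumptions for illustration):

```python
import torch
import torch.nn as nn

def train_ae(data, n_in, n_hidden, epochs=50):
    enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(data)), data)  # reconstruct this layer's input
        loss.backward()
        opt.step()
    return enc                                   # keep encoder only (remove decoder)

x = torch.rand(256, 16)                          # unlabeled data
sizes = [16, 8, 4, 2]                            # layer widths (an assumption)
encoders, h = [], x
for n_in, n_hid in zip(sizes[:-1], sizes[1:]):
    enc = train_ae(h, n_in, n_hid)               # train a new AE on the previous AE's output
    encoders.append(enc)
    h = enc(h).detach()

stacked = nn.Sequential(*encoders, nn.Linear(sizes[-1], 3))  # connect to a classifier
# Fine-tune `stacked` end-to-end with labeled data and backpropagation if necessary.
```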


References

https://web.stanford.edu/class/cs294a/

https://www.jeremyjordan.me/autoencoders/
