
Machine Learning

Lecture 9

Deep Learning

Introduction & Stacked AE

Dr. Patrick Chan
[email protected]

South China University of Technology, China


Agenda

Introduction to Deep Learning

Autoencoder
  Autoencoder
  Sparse Autoencoder
  Contractive Autoencoder
  Denoising Autoencoder
  Stacked Autoencoder


What is Deep Learning?

Branch of Machine Learning

Commonly refers to a neural network with multiple layers (a deep architecture)


Why Deep Architecture?

Our brain is a very deep architecture

A deep architecture can represent more complicated functions than a shallow one

Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), 2009

(Figure: deeper networks represent more complicated functions; shallower networks represent less complicated functions.)


Deep Learning

What’s New?

Brief History of Machine Learning

(Timeline figure from http://www.erogol.com/brief-history-machine-learning/; deep learning emerges around 2006.)

Interest in deep architectures had declined due to the success of SVMs


Deep Learning

What’s New?

Multilayer neural networks have been studied for more than 30 years

What’s actually new?

Old wine in a new bottle??


Deep Learning

What’s New?

With traditional backpropagation, a DNN is often less accurate than a shallow network

Backpropagation loses its power in deep architectures

Vanishing gradient problem
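A minimal NumPy sketch (an illustration, not from the lecture) of the effect: with sigmoid units, each layer multiplies the backpropagated gradient by roughly $w \cdot a(1-a)$, and since $a(1-a) \le 0.25$ the product shrinks rapidly with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = 1.0                        # gradient arriving at the output
for layer in range(20):           # 20 hidden layers, one unit each for simplicity
    w = rng.normal()              # a typical random weight
    a = sigmoid(rng.normal())     # this layer's activation
    grad *= w * a * (1.0 - a)     # chain rule: sigmoid'(z) = a * (1 - a) <= 0.25
    print(f"layer {layer + 1:2d}: |grad| = {abs(grad):.2e}")
```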


Deep Learning

What’s New?

With traditional backpropagation, a DNN is often less accurate than a shallow network

Optimization is very complex

Too many parameters in deep architectures

E.g. weight matrices between successive layers: 8x9, 9x9, 9x9, 9x9, 9x9, 9x9, 9x9, 9x4
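Reading the layer widths off the slide's diagram, the weights alone (biases omitted) already give 594 parameters to optimise jointly; a quick check, assuming the widths are as drawn:

```python
# Layer widths as read from the slide's diagram (an assumption about the figure)
widths = [8, 9, 9, 9, 9, 9, 9, 9, 4]
# Number of weights between consecutive layers, biases omitted
n_weights = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
print(n_weights)  # 8*9 + 6*(9*9) + 9*4 = 594
```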


Deep Learning

What’s New?

In the past, there was no suitable training method for deep architectures

Breakthrough in 2006: three papers proposed new learning algorithms for deep structures

1. Hinton, G. E., Osindero, S. and Teh, Y., A fast learning algorithm for deep belief nets, Neural Computation 18:1527-1554, 2006

2. Yoshua Bengio, Pascal Lamblin, Dan Popovici and Hugo Larochelle, Greedy Layer-Wise Training of Deep Networks, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 153-160, MIT Press, 2007

3. Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra and Yann LeCun, Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems (NIPS 2006), MIT Press, 2007


Deep Learning

What’s New?

Vanishing gradient problem
  Divide and conquer: stacked training
  E.g. Stacked Autoencoder (SA) & RBM

Too many parameters
  Reduce parameters
  E.g. Convolutional Neural Network (CNN)


Deep Learning

What’s New?

Deep Learning focuses on feature learning rather than on the classifier

(Figure: traditional learning uses person-designed features followed by a complicated classifier; deep learning learns the features itself (feature learning) and needs only a simple classifier.)

Deep Learning

Unsupervised Learning

Stacked Denoising Autoencoder (SAE)

Supervised Learning

Convolutional Neural Network (CNN)

Recurrent Neural Network (RNN)

Generative Adversarial Network (GAN)


Autoencoder (AE)

An autoencoder aims to learn an identity function

It predicts the input itself at the output

It contains two parts:

Encoder: maps the original input to a latent representation

Decoder: reconstructs the input (the estimated input) from the latent representation

(Figure: original input → Encoder → latent representation → Decoder → estimated input.)
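A minimal PyTorch sketch of this structure (layer sizes, sigmoid activations, and the training loop are assumptions for illustration, not taken from the lecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=4, n_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_latent), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_latent, n_in), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)            # latent representation
        return self.decoder(h)         # estimated (reconstructed) input

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 4)                  # an unlabeled batch
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)   # predict the input itself
    loss.backward()
    opt.step()
```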


Autoencoder (AE)

An autoencoder may contain more than one hidden layer

(Figure: a multi-layer autoencoder with stacked encoder layers and stacked decoder layers.)

Autoencoder (AE)

Gradient descent can be used to train an autoencoder

Objective Function:

Cross entropy

Squared Error
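In symbols (standard forms; the slide does not spell them out), with input $x$ and reconstruction $\hat{x}$:

Squared error: $J(x,\hat{x}) = \lVert x - \hat{x} \rVert^2 = \sum_i (x_i - \hat{x}_i)^2$

Cross entropy (for inputs in $[0,1]$): $J(x,\hat{x}) = -\sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right]$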


Autoencoder (AE)

Learning the identity function seems trivial

By adding constraints on the network, useful information can be learned

Number of hidden neurons

Regularization


Autoencoder (AE): Type


Undercomplete AE: hidden layer is smaller than the input layer

Overcomplete AE: hidden layer is greater than the input layer


Autoencoder (AE)

Undercomplete

Hidden layer is smaller than the input layer

Compress the input

Hidden nodes will generally be good features


Autoencoder (AE)

Overcomplete

Hidden layer is greater than the input layer

No compression in hidden layer

A higher-dimensional code may help when the distribution is more complex, e.g. the XOR case

No guarantee that the hidden units will extract meaningful features

A hidden unit may simply copy an input component

Additional requirements are considered

e.g. sparsity


Autoencoder (AE)

Serves as feature extraction

Extracts a robust feature representation for a model

No label is needed

Independent of the model

(Figure: noisy input → robust encoder (maybe multi-layer) → latent space representation → model.)


Autoencoder (AE)

Example

(Figures: an example input is mapped by the encoder to a latent code and reconstructed by the decoder.)


Sparse Autoencoder

Sparsity is a common constraint

Meaningful features can be learnt

The model may be more explainable with sparse features

(Figure: an example input expressed as a weighted sum of a few basis features, with weights such as 1 and 0.8.)


Sparse Autoencoder

Let $a^{(k)}_j(x)$ be the activation of the $j$-th hidden unit in layer $k$ of the encoder

Sparsity is defined as the average activation over the $n$ training samples:

$\hat{\rho}_j = \frac{1}{n}\sum_{i=1}^{n} a^{(k)}_j(x^{(i)})$

Force the constraint $\hat{\rho}_j = \rho$

$\rho$ is a "sparsity parameter"

Typically small but not zero, e.g. 0.05

(Figure: an autoencoder with inputs $x_1,\dots,x_4$; the sparsity constraint is applied to hidden layer $k$.)


Sparse Autoencoder

Penalize $\hat{\rho}_j$ for deviating from $\rho$

Many measures exist for this penalty term

Example: Kullback-Leibler (KL) divergence

$\sum_{j=1}^{s} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s} \left[ \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \right]$

KL is a standard function for measuring how different two distributions are

$s$: the number of neurons in the layer


Sparse Autoencoder

If $\hat{\rho}_j = \rho$, then $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$; otherwise KL increases monotonically with the difference between $\hat{\rho}_j$ and $\rho$


Sparse Autoencoder

Cost function:

$J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$

$\beta$: tradeoff parameter

$\hat{\rho}_j = \frac{1}{n}\sum_{i=1}^{n} a^{(k)}_j(x^{(i)})$ depends on all samples and affects $J_{\text{sparse}}$

It should therefore be calculated over all $x$ in advance
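A small NumPy sketch of this penalty: average each hidden unit's activation over the batch to get $\hat{\rho}_j$, then add the weighted KL term (the batch size, layer width, $\rho$ and $\beta$ values here are illustrative assumptions):

```python
import numpy as np

def kl(rho, rho_hat):
    # KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), elementwise
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

A = np.random.rand(100, 50) * 0.2          # activations: n=100 samples x s=50 hidden units
rho, beta = 0.05, 3.0                      # sparsity target and tradeoff parameter
rho_hat = np.clip(A.mean(axis=0), 1e-6, 1 - 1e-6)   # average activation per hidden unit
penalty = beta * kl(rho, rho_hat).sum()
# J_sparse = J_reconstruction + penalty
```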


Contractive Autoencoder

The learned encodings are expected to be similar for similar inputs

The derivatives of the hidden-layer activations with respect to the input should be small

(Figure: the encoder output $f_{en}(x)$ plotted against $x$, with two nearby inputs $x_1$ and $x_2$.)


Contractive Autoencoder

Regularization

Penalizing instances where a small change in the input leads to a large change in the encoding space

Cost function:

$J_{\text{CAE}}(W,b) = J(W,b) + \lambda \, \lVert J_f(x) \rVert_F^2$

$\lambda$: tradeoff parameter

Jacobian matrix of the encoder:

$J_f(x) = \begin{bmatrix} \frac{\partial h_1(x)}{\partial x_1} & \cdots & \frac{\partial h_1(x)}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_s(x)}{\partial x_1} & \cdots & \frac{\partial h_s(x)}{\partial x_d} \end{bmatrix}$

Frobenius norm (L2): $\lVert A \rVert_F = \sqrt{\sum_i \sum_j a_{ij}^2}$, where $a_{ij}$ is an element of $A$
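A hedged PyTorch sketch of this penalty for a single input: build the encoder's Jacobian with autograd and take its squared Frobenius norm (the toy encoder and its sizes are assumptions, not the course's network):

```python
import torch

W = torch.randn(2, 4)                      # toy encoder: 4 inputs -> 2 hidden units
def encoder(x):
    return torch.sigmoid(x @ W.T)

x = torch.rand(4)
J = torch.autograd.functional.jacobian(encoder, x)   # shape (2, 4): dh_j / dx_i
penalty = (J ** 2).sum()                   # squared Frobenius norm ||J_f(x)||_F^2
# J_CAE = J_reconstruction + lambda * penalty
```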


Denoising Autoencoder

Recover the noise-contaminated sample

undo the corruption effect

Denoising Autoencoder (DAE) is robust to noise

(Figure: noisy input → Encoder → latent representation → Decoder → estimated input.)


Denoising Autoencoder

Recall that the original AE reconstructs the input $x$ from $x$ itself

The DAE instead reconstructs $x$ from a corrupted version $\tilde{x}$

Each input is contaminated by either:

Assigning 0 to a subset of the input components with some probability (blinding)

Adding Gaussian additive noise

Example (each row is a sample with three features):

Original $x$:
0.6  0.8  0.2
0.2  0.5  0.5
0.4  0.1  0.7
0.3  0.1  2

$\tilde{x}$ (Gaussian noise):
0.5  0.9  0.3
0.5  0.5  0.3
0.4  0.2  0.8
0.1  0.3  1.7

$\tilde{x}$ (Blinding):
0.6  0    0.2
0    0.5  0
0    0.1  0.7
0.3  0.1  2
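The two corruption schemes can be written in a couple of lines; a NumPy sketch, taking the "Original x" table above as the input (the masking probability and noise scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([[0.6, 0.8, 0.2],
              [0.2, 0.5, 0.5],
              [0.4, 0.1, 0.7],
              [0.3, 0.1, 2.0]])                       # original samples

x_blind = x * (rng.random(x.shape) > 0.25)            # blinding: zero components with prob 0.25
x_gauss = x + rng.normal(scale=0.1, size=x.shape)     # additive Gaussian noise
```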


Denoising Autoencoder

The DAE cannot fully trust each feature of the corrupted input $\tilde{x}$ independently, so it must learn the correlations among $x$'s features

(Figure: inputs $x_1,\dots,x_4$ are corrupted, encoded into $z_1, z_2$, and decoded back into estimates of $x_1,\dots,x_4$.)


CAE vs DAE

Denoising AE
  The whole AE (i.e. encoder + decoder) resists small but finite-sized perturbations of the input
  Defined by the noise injected during training
  Changes the training samples (adding noise or blinding)

Contractive AE
  The feature-extraction function (i.e. the encoder) resists infinitesimal perturbations of the input (slope)
  Changes the objective function
  Might be more stable than DAE, which uses a sampled gradient
  Complexity is high due to the calculation of the Jacobian matrix


Stacked Autoencoder

An AE with a deep structure faces the vanishing gradient problem

The stacking concept is used

Train one layer at a time rather than the whole model at the same time

(Figure: a stack of encoder layers connected to a model.)


Stacked Autoencoder

(Figure: 1st layer training (AE 1), 2nd layer training (AE 2), 3rd layer training (AE 3), then connect to a classifier.)


Stacked Autoencoder

Algorithm

Train AE to reconstruct the input x

Keep encoder only (remove decoder)

Repeat

Treat the output of the previous AE as input, and train a new AE

Keep encoder only (remove decoder)

Until enough layers are trained

Connect to a model (e.g. classifier)

Use backpropagation to fine-tune the entire network using supervised data if necessary
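A compact PyTorch sketch of this procedure: train each AE on the previous encoder's output, keep only the encoders, then attach a classifier for optional fine-tuning (layer widths, optimiser, and epoch counts are assumptions for illustration):

```python
import torch
import torch.nn as nn

def train_ae(data, n_in, n_hidden, epochs=50):
    enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(data)), data)  # reconstruct this layer's input
        loss.backward()
        opt.step()
    return enc                                   # keep encoder only (remove decoder)

x = torch.rand(256, 16)                          # unlabeled data
sizes = [16, 8, 4, 2]                            # layer widths (an assumption)
encoders, h = [], x
for n_in, n_hid in zip(sizes[:-1], sizes[1:]):
    enc = train_ae(h, n_in, n_hid)               # train a new AE on the previous AE's output
    encoders.append(enc)
    h = enc(h).detach()

stacked = nn.Sequential(*encoders, nn.Linear(sizes[-1], 3))  # connect to a classifier
# Fine-tune `stacked` end-to-end with labeled data and backpropagation if necessary.
```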


References

https://web.stanford.edu/class/cs294a/

https://www.jeremyjordan.me/autoencoders/
