Machine Learning
Lecture 4
Supervised Learning
Multi-Layer Perceptron
Dr. Patrick Chan ([email protected])
South China University of Technology, China
Agenda
Artificial Neural Network
Multi-Layer Perceptron
Structure
LMS VS Bayes Classifier
Backpropagation
Practical Techniques
Introduction
Recall, Linear Discriminant Functions:
Limited generalization capability
Cannot handle the non-linearly separable problem
(Figure: a linear discriminant network — inputs x1 … xd connected directly to summation units producing g1 … gc; input layer and output layer only)
Introduction
Solution 1: Mapping Function φ(x)
Pro: Simple structure (still using LDF)
Cons: Selection of φ(x) and its parameters
(Figure: the same structure with a mapping layer — inputs x are first transformed by the mapping (components z1(x), z2(x), …) and then fed to the linear output units g1 … gc)
Introduction
Solution 2: Multi-Layer Neural Network
Standard structure
Hidden layers serve as mapping
No prior knowledge is required (no need to choose φ(x))
(Figure: a multi-layer network — inputs x1, x2, x3, …, one or more hidden layers, and output units g1 … gc)
Artificial Neural Network
An ANN is biologically inspired by the human brain
(Figure: the human brain and an artificial NN, each mapping input to output)
Artificial Neural Network
A neuron contains
a cell body
Dendrites: a branching input structure
Axon: a branching output structure
Synapses: the connections between neurons
Artificial Neural Network
A neuron only fires if its input exceeds a threshold
Electro-chemical signals are propagated from the dendrites through the cell body and axon to other neurons
Synapses vary in strength
Good connections allow a large signal
Weak connections allow only a weak signal
Artificial Neural Network
• Our brain contains about ten billion (10^10) neurons
• On average, each neuron has several thousand connections
• Each performs hundreds of operations per second
• Neurons die off frequently (and are never replaced)
• The brain compensates for problems by massive parallelism
Artificial Neural Network
Each neuron
Inputs from other neurons
Weighted sum is calculated from inputs
If the activation level exceeds the threshold (t), the neuron fires
Output is connected to other neurons
(Figure: an artificial neuron with inputs I1 … Id, weights w1 … wd, activation level a, activation function, and output O)
a = Σ_{i=1}^{d} wi · Ii
O = f(a)
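To make the weighted-sum-and-threshold behaviour concrete, here is a minimal sketch of a single artificial neuron in Python (NumPy); the names and the threshold value are illustrative, not from the slides.

```python
import numpy as np

def neuron(inputs, weights, threshold=0.0):
    """Single artificial neuron: weighted sum followed by a step activation."""
    a = np.dot(weights, inputs)           # activation level a = sum_i w_i * I_i
    return 1.0 if a > threshold else 0.0  # fire only if a exceeds the threshold

# Example: the neuron fires because 0.5*1 + 0.9*1 = 1.4 > 1.0
print(neuron(np.array([1.0, 1.0]), np.array([0.5, 0.9]), threshold=1.0))
```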
Artificial Neural Network
Types of ANN:
Fully connected feedforward
Partially connected feedforward
Recurrent
(Partially connected feedforward and recurrent networks are discussed later)
(Figure: example network diagrams for the three types)
Multi-Layer Perceptron
Multi-Layer Perceptron (MLP)
Neurons are arranged in layers
Neurons are connected to all neurons in the next layer
Fully-connected
Feedforward
Neurons may have different activation functions or no activation function
Multi-Layer Perceptron (MLP)
XOR Example
z1 = sgn(x1 + x2 + 0.5)
z2 = sgn(x1 + x2 - 1.5)
y = sgn(0.7·z1 - 0.4·z2 - 1)
(Figure: a two-layer network computing XOR — inputs x1, x2 plus a bias input feed hidden units z1, z2, which feed the output y)
Multi-Layer Perceptron (MLP)
XOR Example
For x1 = 1, x2 = 1:
z1 = sgn(1 + 1 + 0.5) = sgn(2.5) = 1
z2 = sgn(1 + 1 - 1.5) = sgn(0.5) = 1
y = sgn(0.7 - 0.4 - 1) = sgn(-0.7) = -1

For x1 = -1, x2 = -1:
z1 = sgn(-1 - 1 + 0.5) = sgn(-1.5) = -1
z2 = sgn(-1 - 1 - 1.5) = sgn(-3.5) = -1
y = sgn(-0.7 + 0.4 - 1) = sgn(-1.3) = -1
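The XOR network above can be checked directly in code. A small sketch that hard-codes the weights from the slide and evaluates all four ±1 input patterns (sgn(0) is treated as +1 here, an assumption that does not matter for these inputs):

```python
def sgn(v):
    return 1 if v >= 0 else -1        # sign activation used in the slide

def xor_mlp(x1, x2):
    z1 = sgn(x1 + x2 + 0.5)           # hidden unit 1
    z2 = sgn(x1 + x2 - 1.5)           # hidden unit 2
    y = sgn(0.7 * z1 - 0.4 * z2 - 1)  # output unit
    return y

for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print((x1, x2), '->', xor_mlp(x1, x2))   # prints -1, 1, 1, -1 (XOR)
```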
Multi-Layer Perceptron (MLP)
Activation Function
A real neuron has non-linear properties
A non-linear activation function is a better choice
The binary step function is not a good choice:
Non-differentiable
Much information is ignored, e.g. "the straw that broke the camel's back": there is no difference between an input of 2 and an input of 3
Multi-Layer Perceptron (MLP)
Activation Function
The sigmoid function, sigm(x) = 1 / (1 + e^(-x)), is commonly used in MLP
Differentiable
Multi-Layer Perceptron (MLP)
Activation Function
Sigmoid
Nice interpretation as a firing rate
0 = not firing at all
1 = fully firing
Saturate: flat region (output is 0 or 1)
Gradient at these regions almost zero (NN will barely learn)
Almost no signal will flow to its weights
If initial weights are too large or too small, then most neurons would saturate
Multi-Layer Perceptron (MLP)
Activation Function
Tanh
Scaled sigmoid: Tanh(x)=2 sigm(2x) - 1
Like sigmoid, tanh neurons saturate
Unlike sigmoid, output is zero-centered
Multi-Layer Perceptron (MLP)
Activation Function
Rectified Linear Unit (ReLU)
Most deep networks use ReLU, ReLU(x) = max(0, x), nowadays
Trains much faster
Accelerates the convergence of stochastic gradient descent
Due to its linear, non-saturating form
Less expensive operations compared to sigmoid/tanh (no exponentials etc.)
More expressive
Prevents the vanishing gradient problem
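To make the comparison concrete, a short sketch of the three activations and their derivatives (the derivatives are what backpropagation will use later); the test values are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # ~0 in the saturated regions

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # tanh is zero-centered but still saturates

def relu(x):
    return np.maximum(0.0, x)       # linear, non-saturating for x > 0

def d_relu(x):
    return (x > 0).astype(float)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(d_sigmoid(x))   # near zero at +/-5: the vanishing-gradient effect
print(d_relu(x))      # stays 1 for all positive inputs
```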
Multi-Layer Perceptron (MLP)
Activation Function
Other activation functions:
https://en.wikipedia.org/wiki/Activation_function
Identity
Binary step
Sinc
Gaussian
Multi-Layer Perceptron (MLP)
Architecture
(Figure: a four-layer network with inputs x1, x2, x3 (Layer 1), two hidden layers (Layers 2 and 3), and outputs g1, g2 (Layer 4); neurons are numbered within each layer)
Notation:
wi,jk : weight between neuron j and neuron k in layer i
oi,j : output of neuron j in layer i
Each neuron applies an activation function f to the weighted sum of its inputs
Multi-Layer Perceptron (MLP)
Architecture
(Figure: the same four-layer network, annotated with the forward computation)
Forward pass, layer by layer:
neti+1,k = Σ_j wi,jk · oi,j
oi+1,k = f(neti+1,k)
The outputs of the first layer are the inputs themselves, and the outputs of the last layer are the discriminant values g1, …, gc
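A minimal forward-pass sketch for a fully connected feedforward network, using the layer-by-layer rule above; the layer sizes, the sigmoid activation, and the weight orientation (weights[i][j, k] playing the role of wi,jk) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """weights[i] has shape (n_i, n_{i+1}); entry [j, k] plays the role of w_{i,jk}."""
    o = x
    for W in weights:
        net = o @ W          # net_k = sum_j w_{i,jk} * o_{i,j}
        o = sigmoid(net)     # o_{i+1,k} = f(net_k)
    return o                 # outputs g_1 ... g_c

# Example: 3 inputs -> 4 hidden -> 2 outputs, random weights
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(3, 4)), rng.normal(scale=0.5, size=(4, 2))]
print(forward(np.array([1.0, -0.5, 2.0]), weights))
```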
Multi-Layer Perceptron (MLP)
Architecture
Example
Backpropagation
How to determine the weights?
The pseudoinverse cannot be used, as an ANN is not linear in general
Gradient Descent: w ← w - η · ∂J(w)/∂w
η: the learning rate
How to calculate ∂J(w)/∂w for each w?
(Figures: an LDF network and an MLP network, both with inputs x1, x2, x3 and outputs g1, g2)
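Before turning to backpropagation itself, here is a minimal illustration of the gradient-descent update w ← w - η·∂J/∂w on a toy one-dimensional error function; the function and the learning rate are illustrative, not from the slides.

```python
# Toy example: J(w) = (w - 3)^2, so dJ/dw = 2*(w - 3)
def dJ_dw(w):
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1           # initial weight and learning rate
for step in range(50):
    w = w - eta * dJ_dw(w)  # gradient-descent update
print(w)                    # approaches the minimum at w = 3
```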
Backpropagation
Backpropagation
Calculation of the derivative flows backwards through the network
Natural extension of LMS algorithm
(Figure: an MLP with inputs x1, x2, x3 and outputs g1, g2; forward direction — calculate the output using W; backward direction — update W using the output error)
Backpropagation
Recall the chain rule:
If J depends on w only through an intermediate variable u, i.e. J = J(u) and u = u(w), then
∂J/∂w = (∂J/∂u) · (∂u/∂w)
Backpropagation
Example 1
Which paths to the output are affected by w2,11 (the weight from hidden neuron 1 to output neuron 1)?
Only the path through output y1; still, the error on each output should be considered, since
J = J1 + J2, where Jk = ½(tk - yk)²
Backprop from J to w2,11:
∂J/∂w2,11 = ∂J1/∂w2,11 + ∂J2/∂w2,11 = ∂J1/∂w2,11   (J2 does not depend on w2,11)
The output is y1 = f(net2,1) with net2,1 = Σ_j w2,j1 · o1,j. Let f'(net) = ∂f(net)/∂net.
By the chain rule:
∂J1/∂w2,11 = (∂J1/∂y1)(∂y1/∂net2,1)(∂net2,1/∂w2,11) = -(t1 - y1) · f'(net2,1) · o1,1
Only the term w2,11·o1,1 in net2,1 depends on w2,11, so ∂net2,1/∂w2,11 = o1,1.
The weight update is therefore Δw2,11 = -η · ∂J/∂w2,11 = η · (t1 - y1) · f'(net2,1) · o1,1.
(Figure: network with inputs x1, x2, x3, outputs y1 (g1) and y2 (g2), and per-output errors J1, J2 summed into J)
Backpropagation
Example 2
Which paths to the output are affected by w1,12 (the weight from input x1 to hidden neuron 2)?
Only the path through hidden neuron 2 to the output y.
Backprop from J to w1,12: the further a weight is from J, the more complicated the update.
For this single-output network:
J = ½(t - y)², y = f(net2,1) with net2,1 = Σ_j w2,j1 · o1,j, and o1,2 = f(net1,2) with net1,2 = Σ_i w1,i2 · xi
By the chain rule:
∂J/∂w1,12 = (∂J/∂y)(∂y/∂net2,1)(∂net2,1/∂o1,2)(∂o1,2/∂net1,2)(∂net1,2/∂w1,12)
Only the term w2,21·o1,2 in net2,1 depends on o1,2, and only the term w1,12·x1 in net1,2 depends on w1,12, so
∂J/∂w1,12 = -(t - y) · f'(net2,1) · w2,21 · f'(net1,2) · x1
The error signal is propagated backwards: the output error -(t - y)·f'(net2,1) passes through the weight w2,21 and the hidden unit's derivative f'(net1,2) before reaching w1,12.
(Figure: network with inputs x1, x2, x3, one hidden layer, a single output y = g, and error J)
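The two chain-rule expansions above can be folded into one backward pass. The sketch below implements backpropagation for a small 3-input / 2-hidden / 1-output sigmoid network matching the shape of Example 2; the layer sizes, data, and random seed are illustrative assumptions, and the analytic gradient is checked against a numerical one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, w2):
    o1 = sigmoid(x @ W1)        # hidden outputs o_{1,j}
    y = sigmoid(o1 @ w2)        # single network output y
    return o1, y

def gradients(x, t, W1, w2):
    """Backprop for J = 0.5 * (t - y)^2 with sigmoid activations (f' = y(1-y))."""
    o1, y = forward(x, W1, w2)
    delta_out = -(t - y) * y * (1 - y)          # dJ/dnet at the output
    dw2 = delta_out * o1                        # dJ/dw_{2,j1} = delta_out * o_{1,j}
    delta_hid = delta_out * w2 * o1 * (1 - o1)  # propagate back through w2 and f'
    dW1 = np.outer(x, delta_hid)                # dJ/dw_{1,ij} = x_i * delta_hid_j
    return dW1, dw2

rng = np.random.default_rng(1)
x, t = np.array([1.0, -1.0, 0.5]), 1.0
W1, w2 = rng.normal(size=(3, 2)), rng.normal(size=2)
dW1, dw2 = gradients(x, t, W1, w2)

# Numerical check of dJ/dw_{1,12} (input x1 -> hidden neuron 2), as in Example 2
eps, i, j = 1e-6, 0, 1
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
Jp = 0.5 * (t - forward(x, Wp, w2)[1]) ** 2
Jm = 0.5 * (t - forward(x, Wm, w2)[1]) ** 2
print(dW1[i, j], (Jp - Jm) / (2 * eps))   # the two values should agree
```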
Training Algorithm
The order in which training samples are presented affects the learning of an ANN
Two update methods:
Batch training
One update uses all samples
The sample order does not affect the result
Stochastic training
One update uses one sample
Samples can be chosen randomly to avoid the influence of the sample order
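A sketch contrasting the two update schedules; `grad_J` (here the squared-error gradient of a linear unit) and the random data are placeholders, not from the slides.

```python
import numpy as np

def grad_J(w, X, T):
    """Placeholder: gradient of the error summed over the given samples."""
    return X.T @ (X @ w - T)

def batch_training(w, X, T, eta, epochs):
    for _ in range(epochs):
        w = w - eta * grad_J(w, X, T)      # one update uses ALL samples
    return w

def stochastic_training(w, X, T, eta, epochs, rng):
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # random order each epoch
            w = w - eta * grad_J(w, X[i:i+1], T[i:i+1])  # one update per sample
    return w

rng = np.random.default_rng(0)
X, T = rng.normal(size=(20, 3)), rng.normal(size=20)
w0 = np.zeros(3)
print(batch_training(w0, X, T, 0.01, 100))
print(stochastic_training(w0, X, T, 0.01, 100, rng))
```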
Training Algorithm
What is the objective of a classifier?
Classify training samples accurately?
Training Error (Empirical Error) (Remp)
Computable
Error in the training set
Training Objective
Classify unseen samples accurately?
Generalization Error (Rgen)
Non-computable
Estimate only
Ultimate Objective
Training and ultimate objectives are different
Training Algorithm
Is a classifier with a smaller training error always preferable?
The general answer is NO
It may conflict with the ultimate objective (accuracy over all unseen samples)
(Figure: error vs. learning iteration, showing the underfitting region, the best point, and the overfitting region)
Training Algorithm
Complexity of the model can also be considered in the objective function
The complexity is affected by
Network Architecture (Layer and Neuron #)
Values of Parameters
Although there are many parameters, the model can be simplified by setting them to zero
Training Algorithm
Regularization
Regularization term: measures the smoothness of the boundary and the complexity of a classifier
λ: tradeoff parameter
May sacrifice accuracy on the training set for the simplicity of the classifier
Minimize: Training Error + λ × Regularization Term
Training Algorithm
Regularization
Minimize: Training Error + λ × Regularization Term
λ = 0: similar to the traditional training objective function; the regularization term has no effect
λ > 0: an f with good generalization ability can be found if a suitable λ is chosen
λ → ∞: dominated by the regularization term; the smoothest classifier is found
Training Algorithm
Regularization
Many solutions may obtain the same empirical error
i.e. the solution (boundary) is not unique
Considering regularization reduces the number of solutions
The ill-posed problem may be solved
Training Algorithm: Regularization
Weight Decay
A well-known regularization
Measures the magnitude of the weights
Smaller weights → smaller output change → smoother boundary
The objective function becomes
Minimize: Training Error + λ × (sum of squared weights)
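A sketch of how a weight-decay term changes the objective and the gradient update; the squared-weight penalty shown is the usual form of weight decay, and the symbols and values are illustrative.

```python
import numpy as np

def regularized_error(train_error, weights, lam):
    """Training error plus the weight-decay (regularization) term."""
    penalty = sum(np.sum(W ** 2) for W in weights)   # sum of squared weights
    return train_error + lam * penalty

def decayed_update(W, dJ_dW, eta, lam):
    """Gradient step on the regularized objective: the extra term shrinks W."""
    return W - eta * (dJ_dW + 2 * lam * W)

# Example: larger lambda pulls the weights harder towards zero
W = np.array([[2.0, -1.5], [0.5, 3.0]])
print(decayed_update(W, np.zeros_like(W), eta=0.1, lam=0.0))
print(decayed_update(W, np.zeros_like(W), eta=0.1, lam=0.5))
```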
Practical Techniques
Some issues in designing an MLP NN:
Target Value
Scaling Input
Input Data Type
Architecture
Initializing Weights
Learning Rates
Momentum
Stopped Training
Practical Techniques
Target Value
Two-class problem
1 output
1 for class 1; -1 for class 2
Multi-class problem
c outputs
One-hot Encoding (a bit string containing exactly one 1)
yi = 1 if x belongs to class i; otherwise yi=0
Can we just use 1 output?
Label Encoding
Set y = i if x belongs to class i
One-hot encoding vs. label encoding (multi-class):

ID         y1  y2  y3  |  y
x(1)  CN    1   0   0  |  1
x(2)  CN    1   0   0  |  1
x(3)  UK    0   1   0  |  2
x(4)  US    0   0   1  |  3

Two-class problem (single output):

ID         y
x(1)  CN    1
x(2)  CN    1
x(3)  UK   -1
Practical Techniques
Target Value
Label encoding should not be used
It represents categorical data by numerical data
But class labels have no ordering
E.g. it implies UK (2) lies between CN (1) and US (3)
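A small sketch of the two encodings for the country labels in the table above; the helper names are illustrative.

```python
import numpy as np

labels = ['CN', 'CN', 'UK', 'US']
classes = ['CN', 'UK', 'US']

# Label encoding: y = i if x belongs to class i (not recommended for nominal data)
label_encoded = np.array([classes.index(c) + 1 for c in labels])
print(label_encoded)            # [1 1 2 3]

# One-hot encoding: y_i = 1 for the true class, 0 otherwise
one_hot = np.zeros((len(labels), len(classes)))
one_hot[np.arange(len(labels)), label_encoded - 1] = 1
print(one_hot)
# [[1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```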
Practical Techniques
Scaling Input
Features with different natures have different properties (range, mean, …)
E.g. a student's weight and height:
Weight: 40 – 100 (kg)
Height: 0.6 – 2.2 (m)
Learning proper ANN weights is difficult when feature properties differ
E.g. a change of 0.1 is huge for height but not for weight
Practical Techniques
Scaling Input
How to reduce this influence?
Normalization (Standardization)
Standardize the samples so that all features have
Same range (e.g. 0 to 1 or -1 to 1)
Same variance (e.g. 1)
Same average (e.g. 0)
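A sketch of the two common scalings (min-max to [0, 1] and standardization to zero mean / unit variance) applied per feature; the sample weight/height values are illustrative.

```python
import numpy as np

# Columns: weight (kg), height (m) -- very different ranges
X = np.array([[55.0, 1.60], [80.0, 1.85], [95.0, 1.72], [48.0, 1.55]])

# Min-max scaling to the range [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```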
Practical Techniques
Input Data Type
An ANN only supports numerical data; how to deal with categorical data?
Nominal Data: Blue, Red, Green, Purple
Ordinal Data: Excellent, Good, Fair, Poor
“Excellent” - “Good” ≠ “Good” - “Fair”
As with the class ID, one-hot encoding should be used
Drawback: generates huge and sparse features
1 value generates 1 feature
Many 0s
Practical Techniques
Architecture
Backpropagation has the vanishing gradient problem
That is why a deep architecture cannot be obtained by simply extending an ANN
An ANN with traditional learning cannot be too deep
It usually contains at most 3 to 4 hidden layers
Otherwise, the results are not good
Practical Techniques
Architecture
How to determine the number of layers and neurons?
General concept
The more complicated the problem, the more complicated the model
Empirical method (Ad-hoc)
Evaluate a setting by a trained classifier
Pruning Method
Train a complicated classifier and remove the unnecessary structures
Practical Techniques
Initializing Weights
If all weights are set to 0 initially, learning can never start
The input does not affect the output, since ∑wx = 0
Weights are initialized randomly
Data normalization is important
Practical Techniques
Initializing Weights
Initialization depends on the activation function
Sigmoid:
If the initial w is too small → net = ∑wz may be small → the function behaves almost linearly (we want a non-linear mapping)
If the initial w is too large → net may be large → the hidden unit will saturate (output always 0 or 1) → gradients are killed
(Figure: sigmoid a(net) over net = ∑wz, showing the saturate / linear / saturate regions)
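A sketch of random initialization with a scale chosen to keep net = ∑wz in the sigmoid's useful range (neither the linear-only nor the saturated region); the 1/√fan-in scale is one common heuristic and an assumption here, not necessarily the scheme used in the course.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    """Small random weights, scaled by the number of inputs to each neuron."""
    scale = 1.0 / np.sqrt(fan_in)   # keeps |net| moderate for roughly unit inputs
    return rng.uniform(-scale, scale, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W1 = init_weights(3, 4, rng)   # input layer -> hidden layer
W2 = init_weights(4, 2, rng)   # hidden layer -> output layer
print(W1)
# All-zero initialization would make every net = 0, so learning could never start.
```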
Practical Techniques
Learning Rates
Small learning rate
Ensures convergence
Low learning speed
Stuck in a local minimum
Large learning rate
High learning speed
May never converge
Unstable
Practical Techniques
Learning Rates
Let η_opt be the optimal learning rate, which leads to the local error minimum in one step
η smaller than η_opt: slower convergence
η = η_opt: converge in one step
η somewhat larger than η_opt: oscillate, but slowly converge
η too large: diverge
(Figure: error curves for the four cases)
Practical Techniques
Momentum
What is Momentum?
Moving objects tend to keep moving unless acted upon by outside forces
Consider some fraction of the previous weight update in BP
Practical Techniques
Momentum
What is Momentum?
Moving objects tend to keep moving unless acted upon by outside forces
In the BP algorithm, the approach is to alter the learning rule to include some fraction α of the previous weight update:
w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1)
(current weight update and previous weight update, traded off by the momentum parameter α)
Practical Techniques
Momentum
Faster acceleration
Without momentum: w(t+1) = w(t) + Δw(t)
With momentum: w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1)
(Figure: error surface J vs. w, comparing the update steps without and with momentum)
Practical Techniques
Momentum
Escape from a local minimum
Without momentum: w(t+1) = w(t) + Δw(t)
With momentum: w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1)
(Figure: error surface J vs. w with a local minimum, comparing the update steps without and with momentum)
Practical Techniques
Momentum
Faster convergence
Without momentum: w(t+1) = w(t) + Δw(t)
With momentum: w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1)
(Figure: error surface J vs. w, comparing the update steps without and with momentum)
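A sketch of the momentum rule from the slides, w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1), applied to the same toy error function used earlier; the values of α and η are illustrative.

```python
def dJ_dw(w):
    return 2.0 * (w - 3.0)       # toy error J(w) = (w - 3)^2

w, eta, alpha = 0.0, 0.1, 0.5
prev_delta = 0.0
for step in range(50):
    delta = -eta * dJ_dw(w)                              # current weight update
    w = w + (1 - alpha) * delta + alpha * prev_delta     # blend with previous update
    prev_delta = delta
print(w)    # approaches the minimum at w = 3
```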
Practical Techniques
Stopped Training
Stopping the training before gradient descent is complete may avoid overfitting
A far more effective method is to stop training when the error on a separate validation set reaches a minimum
(Figure: training error, validation error, and generalization error vs. training epoch)
Algorithm
1. Separate the original training set into two sets: a new training set and a validation set
2. Use the new training set to train the classifier
3. Evaluate the classifier using the validation set at the end of each epoch
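A sketch of the three-step stopped-training procedure; `train_one_epoch` and `error_on` are placeholders for the actual training and evaluation routines, the data split (step 1) is assumed to have been done beforehand, and the patience window is one common way to detect that the validation error has reached its minimum.

```python
import copy

def stopped_training(model, train_set, val_set, max_epochs, patience,
                     train_one_epoch, error_on):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_model, bad_epochs = float('inf'), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)      # step 2: train on the new training set
        val_err = error_on(model, val_set)     # step 3: evaluate on the validation set
        if val_err < best_err:
            best_err, best_model, bad_epochs = val_err, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:         # validation error stopped improving
                break
    return best_model, best_err

# Toy usage with stand-in routines (a real MLP trainer would replace these):
model = {'epochs_trained': 0}
train_one_epoch = lambda m, data: m.__setitem__('epochs_trained', m['epochs_trained'] + 1)
error_on = lambda m, data: abs(10 - m['epochs_trained'])   # pretend error is lowest at epoch 10
best, err = stopped_training(model, None, None, 50, patience=3,
                             train_one_epoch=train_one_epoch, error_on=error_on)
print(best['epochs_trained'], err)   # keeps the model from the best validation epoch
```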