Machine Learning
Lecture 4
Supervised Learning
Multi-Layer Perceptron
Dr. Patrick Chan ([email protected])
South China University of Technology, China
Agenda
Artificial Neural Network
Multi-Layer Perceptron
Structure
LMS VS Bayes Classifier
Backpropagation
Practical Techniques
Introduction
Recall, Linear Discriminant Functions:
Limited generalization capability
Cannot handle the non-linearly separable problem
(Figure: a linear discriminant network — inputs x1 … xd connected directly to summation units producing g1 … gc; input layer and output layer only)
Introduction
Solution 1: Mapping Function φ(x)
Pro: Simple structure (still using LDF)
Cons: Selection of φ(x) and its parameters
(Figure: the same structure with a mapping layer — inputs x are first transformed by the mapping (components z1(x), z2(x), …) and then fed to the linear output units g1 … gc)
Introduction
Solution 2: Multi-Layer Neural Network
Standard structure
Hidden layers serve as mapping
No prior knowledge is required (no need to choose φ(x))
(Figure: a multi-layer network — inputs x1, x2, x3, …, one or more hidden layers, and output units g1 … gc)
Artificial Neural Network
An ANN is biologically inspired by the human brain
(Figure: the human brain and an artificial NN, each mapping input to output)
Artificial Neural Network
A neuron contains
a cell body
Dendrites: a branching input structure
Axon: a branching output structure
Synapses: the connections between neurons
Artificial Neural Network
A neuron only fires if its input exceeds a threshold
Electro-chemical signals are propagated from the dendrites through the cell body and axon to other neurons
Synapses vary in strength
Good connections allow a large signal
Weak connections allow only a weak signal
Artificial Neural Network
• Our brain contains about ten billion (10^10) neurons
• On average, each neuron has several thousand connections
• Each performs hundreds of operations per second
• Neurons die off frequently (and are never replaced)
• The brain compensates for problems by massive parallelism
Artificial Neural Network
Each neuron
Inputs from other neurons
Weighted sum is calculated from inputs
If the activation level exceeds the threshold (t), the neuron fires
Output is connected to other neurons
(Figure: an artificial neuron with inputs I1 … Id, weights w1 … wd, activation level a, activation function, and output O)
a = Σ_{i=1}^{d} wi · Ii
O = f(a)
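To make the weighted-sum-and-threshold behaviour concrete, here is a minimal sketch of a single artificial neuron in Python (NumPy); the names and the threshold value are illustrative, not from the slides.

```python
import numpy as np

def neuron(inputs, weights, threshold=0.0):
    """Single artificial neuron: weighted sum followed by a step activation."""
    a = np.dot(weights, inputs)           # activation level a = sum_i w_i * I_i
    return 1.0 if a > threshold else 0.0  # fire only if a exceeds the threshold

# Example: the neuron fires because 0.5*1 + 0.9*1 = 1.4 > 1.0
print(neuron(np.array([1.0, 1.0]), np.array([0.5, 0.9]), threshold=1.0))
```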
Artificial Neural Network
Types of ANN:
Fully connected feedforward
Partially connected feedforward
Recurrent
(Partially connected feedforward and recurrent networks are discussed later)
(Figure: example network diagrams for the three types)
Multi-Layer Perceptron
Multi-Layer Perceptron (MLP)
Neurons are arranged in layers
Neurons are connected to all neurons in the next layer
Fully-connected
Feedforward
Neurons may have different activation functions or no activation function
Multi-Layer Perceptron (MLP)
XOR Example
z1 = sgn(x1 + x2 + 0.5)
z2 = sgn(x1 + x2 - 1.5)
y = sgn(0.7·z1 - 0.4·z2 - 1)
(Figure: a two-layer network computing XOR — inputs x1, x2 plus a bias input feed hidden units z1, z2, which feed the output y)
Multi-Layer Perceptron (MLP)
XOR Example
For x1 = 1, x2 = 1:
z1 = sgn(1 + 1 + 0.5) = sgn(2.5) = 1
z2 = sgn(1 + 1 - 1.5) = sgn(0.5) = 1
y = sgn(0.7 - 0.4 - 1) = sgn(-0.7) = -1

For x1 = -1, x2 = -1:
z1 = sgn(-1 - 1 + 0.5) = sgn(-1.5) = -1
z2 = sgn(-1 - 1 - 1.5) = sgn(-3.5) = -1
y = sgn(-0.7 + 0.4 - 1) = sgn(-1.3) = -1
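The XOR network above can be checked directly in code. A small sketch that hard-codes the weights from the slide and evaluates all four ±1 input patterns (sgn(0) is treated as +1 here, an assumption that does not matter for these inputs):

```python
def sgn(v):
    return 1 if v >= 0 else -1        # sign activation used in the slide

def xor_mlp(x1, x2):
    z1 = sgn(x1 + x2 + 0.5)           # hidden unit 1
    z2 = sgn(x1 + x2 - 1.5)           # hidden unit 2
    y = sgn(0.7 * z1 - 0.4 * z2 - 1)  # output unit
    return y

for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print((x1, x2), '->', xor_mlp(x1, x2))   # prints -1, 1, 1, -1 (XOR)
```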
Multi-Layer Perceptron (MLP)
Activation Function
A real neuron has non-linear properties
A non-linear activation function is a better choice
The binary step function is not a good choice:
Non-differentiable
Much information is ignored, e.g. "the straw that broke the camel's back": there is no difference between an input of 2 and an input of 3
Multi-Layer Perceptron (MLP)
Activation Function
The sigmoid function, sigm(x) = 1 / (1 + e^(-x)), is commonly used in MLP
Differentiable
Multi-Layer Perceptron (MLP)
Activation Function
Sigmoid
Nice interpretation as a firing rate
0 = not firing at all
1 = fully firing
Saturate: flat region (output is 0 or 1)
Gradient at these regions almost zero (NN will barely learn)
Almost no signal will flow to its weights
If initial weights are too large or too small, then most neurons would saturate
Multi-Layer Perceptron (MLP)
Activation Function
Tanh
Scaled sigmoid: Tanh(x)=2 sigm(2x) - 1
Like sigmoid, tanh neurons saturate
Unlike sigmoid, output is zero-centered
Multi-Layer Perceptron (MLP)
Activation Function
Rectified Linear Unit (ReLU)
Most deep networks use ReLU, ReLU(x) = max(0, x), nowadays
Trains much faster
Accelerates the convergence of stochastic gradient descent
Due to its linear, non-saturating form
Less expensive operations compared to sigmoid/tanh (no exponentials etc.)
More expressive
Prevents the vanishing gradient problem
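To make the comparison concrete, a short sketch of the three activations and their derivatives (the derivatives are what backpropagation will use later); the test values are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # ~0 in the saturated regions

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # tanh is zero-centered but still saturates

def relu(x):
    return np.maximum(0.0, x)       # linear, non-saturating for x > 0

def d_relu(x):
    return (x > 0).astype(float)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(d_sigmoid(x))   # near zero at +/-5: the vanishing-gradient effect
print(d_relu(x))      # stays 1 for all positive inputs
```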
Multi-Layer Perceptron (MLP)
Activation Function
Other activation functions:
https://en.wikipedia.org/wiki/Activation_function
Identity
Binary step
Sinc
Gaussian
Multi-Layer Perceptron (MLP)
Architecture
(Figure: a four-layer network with inputs x1, x2, x3 (Layer 1), two hidden layers (Layers 2 and 3), and outputs g1, g2 (Layer 4); neurons are numbered within each layer)
Notation:
wi,jk : weight between neuron j and neuron k in layer i
oi,j : output of neuron j in layer i
Each neuron applies an activation function f to the weighted sum of its inputs
Multi-Layer Perceptron (MLP)
Architecture
(Figure: the same four-layer network, annotated with the forward computation)
Forward pass, layer by layer:
neti+1,k = Σ_j wi,jk · oi,j
oi+1,k = f(neti+1,k)
The outputs of the first layer are the inputs themselves, and the outputs of the last layer are the discriminant values g1, …, gc
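A minimal forward-pass sketch for a fully connected feedforward network, using the layer-by-layer rule above; the layer sizes, the sigmoid activation, and the weight orientation (weights[i][j, k] playing the role of wi,jk) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """weights[i] has shape (n_i, n_{i+1}); entry [j, k] plays the role of w_{i,jk}."""
    o = x
    for W in weights:
        net = o @ W          # net_k = sum_j w_{i,jk} * o_{i,j}
        o = sigmoid(net)     # o_{i+1,k} = f(net_k)
    return o                 # outputs g_1 ... g_c

# Example: 3 inputs -> 4 hidden -> 2 outputs, random weights
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(3, 4)), rng.normal(scale=0.5, size=(4, 2))]
print(forward(np.array([1.0, -0.5, 2.0]), weights))
```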
Multi-Layer Perceptron (MLP)
Architecture
Example
Backpropagation
How to determine the weights?
The pseudoinverse cannot be used, as an ANN is not linear in general
Gradient Descent: w ← w - η · ∂J(w)/∂w
η: the learning rate
How to calculate ∂J(w)/∂w for each w?
(Figures: an LDF network and an MLP network, both with inputs x1, x2, x3 and outputs g1, g2)
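Before turning to backpropagation itself, here is a minimal illustration of the gradient-descent update w ← w - η·∂J/∂w on a toy one-dimensional error function; the function and the learning rate are illustrative, not from the slides.

```python
# Toy example: J(w) = (w - 3)^2, so dJ/dw = 2*(w - 3)
def dJ_dw(w):
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1           # initial weight and learning rate
for step in range(50):
    w = w - eta * dJ_dw(w)  # gradient-descent update
print(w)                    # approaches the minimum at w = 3
```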
Backpropagation
Backpropagation
Calculation of the derivative flows backwards through the network
Natural extension of LMS algorithm
(Figure: an MLP with inputs x1, x2, x3 and outputs g1, g2; forward direction — calculate the output using W; backward direction — update W using the output error)
Backpropagation
Recall the chain rule:
If J depends on w only through an intermediate variable u, i.e. J = J(u) and u = u(w), then
∂J/∂w = (∂J/∂u) · (∂u/∂w)
Backpropagation
Example 1
Which paths to the output are affected by w2,11 (the weight from hidden neuron 1 to output neuron 1)?
Only the path through output y1; still, the error on each output should be considered, since
J = J1 + J2, where Jk = ½(tk - yk)²
Backprop from J to w2,11:
∂J/∂w2,11 = ∂J1/∂w2,11 + ∂J2/∂w2,11 = ∂J1/∂w2,11   (J2 does not depend on w2,11)
The output is y1 = f(net2,1) with net2,1 = Σ_j w2,j1 · o1,j. Let f'(net) = ∂f(net)/∂net.
By the chain rule:
∂J1/∂w2,11 = (∂J1/∂y1)(∂y1/∂net2,1)(∂net2,1/∂w2,11) = -(t1 - y1) · f'(net2,1) · o1,1
Only the term w2,11·o1,1 in net2,1 depends on w2,11, so ∂net2,1/∂w2,11 = o1,1.
The weight update is therefore Δw2,11 = -η · ∂J/∂w2,11 = η · (t1 - y1) · f'(net2,1) · o1,1.
(Figure: network with inputs x1, x2, x3, outputs y1 (g1) and y2 (g2), and per-output errors J1, J2 summed into J)
Backpropagation
Example 2
Which paths to the output are affected by w1,12 (the weight from input x1 to hidden neuron 2)?
Only the path through hidden neuron 2 to the output y.
Backprop from J to w1,12: the further a weight is from J, the more complicated the update.
For this single-output network:
J = ½(t - y)², y = f(net2,1) with net2,1 = Σ_j w2,j1 · o1,j, and o1,2 = f(net1,2) with net1,2 = Σ_i w1,i2 · xi
By the chain rule:
∂J/∂w1,12 = (∂J/∂y)(∂y/∂net2,1)(∂net2,1/∂o1,2)(∂o1,2/∂net1,2)(∂net1,2/∂w1,12)
Only the term w2,21·o1,2 in net2,1 depends on o1,2, and only the term w1,12·x1 in net1,2 depends on w1,12, so
∂J/∂w1,12 = -(t - y) · f'(net2,1) · w2,21 · f'(net1,2) · x1
The error signal is propagated backwards: the output error -(t - y)·f'(net2,1) passes through the weight w2,21 and the hidden unit's derivative f'(net1,2) before reaching w1,12.
(Figure: network with inputs x1, x2, x3, one hidden layer, a single output y = g, and error J)
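The two chain-rule expansions above can be folded into one backward pass. The sketch below implements backpropagation for a small 3-input / 2-hidden / 1-output sigmoid network matching the shape of Example 2; the layer sizes, data, and random seed are illustrative assumptions, and the analytic gradient is checked against a numerical one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, w2):
    o1 = sigmoid(x @ W1)        # hidden outputs o_{1,j}
    y = sigmoid(o1 @ w2)        # single network output y
    return o1, y

def gradients(x, t, W1, w2):
    """Backprop for J = 0.5 * (t - y)^2 with sigmoid activations (f' = y(1-y))."""
    o1, y = forward(x, W1, w2)
    delta_out = -(t - y) * y * (1 - y)          # dJ/dnet at the output
    dw2 = delta_out * o1                        # dJ/dw_{2,j1} = delta_out * o_{1,j}
    delta_hid = delta_out * w2 * o1 * (1 - o1)  # propagate back through w2 and f'
    dW1 = np.outer(x, delta_hid)                # dJ/dw_{1,ij} = x_i * delta_hid_j
    return dW1, dw2

rng = np.random.default_rng(1)
x, t = np.array([1.0, -1.0, 0.5]), 1.0
W1, w2 = rng.normal(size=(3, 2)), rng.normal(size=2)
dW1, dw2 = gradients(x, t, W1, w2)

# Numerical check of dJ/dw_{1,12} (input x1 -> hidden neuron 2), as in Example 2
eps, i, j = 1e-6, 0, 1
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
Jp = 0.5 * (t - forward(x, Wp, w2)[1]) ** 2
Jm = 0.5 * (t - forward(x, Wm, w2)[1]) ** 2
print(dW1[i, j], (Jp - Jm) / (2 * eps))   # the two values should agree
```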
Training Algorithm
The order in which training samples are presented affects the learning of an ANN
Two update methods:
Batch training
One update uses all samples
The sample order does not affect the result
Stochastic training
One update uses one sample
Samples can be chosen randomly to avoid the influence of the sample order
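A sketch contrasting the two update schedules; `grad_J` (here the squared-error gradient of a linear unit) and the random data are placeholders, not from the slides.

```python
import numpy as np

def grad_J(w, X, T):
    """Placeholder: gradient of the error summed over the given samples."""
    return X.T @ (X @ w - T)

def batch_training(w, X, T, eta, epochs):
    for _ in range(epochs):
        w = w - eta * grad_J(w, X, T)      # one update uses ALL samples
    return w

def stochastic_training(w, X, T, eta, epochs, rng):
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # random order each epoch
            w = w - eta * grad_J(w, X[i:i+1], T[i:i+1])  # one update per sample
    return w

rng = np.random.default_rng(0)
X, T = rng.normal(size=(20, 3)), rng.normal(size=20)
w0 = np.zeros(3)
print(batch_training(w0, X, T, 0.01, 100))
print(stochastic_training(w0, X, T, 0.01, 100, rng))
```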
Training Algorithm
What is the objective of a classifier?
Classify training samples accurately?
Training Error (Empirical Error) (Remp)
Computable
Error in the training set
Training Objective
Classify unseen samples accurately?
Generalization Error (Rgen)
Non-computable
Estimate only
Ultimate Objective
Training and ultimate objectives are different
Training Algorithm
Is a classifier with a smaller training error always preferable?
The general answer is NO
It may conflict with the ultimate objective (accuracy over all unseen samples)
(Figure: error vs. learning iteration, showing the underfitting region, the best point, and the overfitting region)
Training Algorithm
Complexity of the model can also be considered in the objective function
The complexity is affected by
Network Architecture (Layer and Neuron #)
Values of Parameters
Although there are many parameters, the model can be simplified by setting them to zero
Training Algorithm
Regularization
Regularization term: measures the smoothness of the boundary and the complexity of a classifier
λ: tradeoff parameter
May sacrifice accuracy on the training set for the simplicity of the classifier
Minimize: Training Error + λ × Regularization Term
Training Algorithm
Regularization
Minimize: Training Error + λ × Regularization Term
λ = 0: similar to the traditional training objective function; the regularization term has no effect
λ > 0: an f with good generalization ability can be found if a suitable λ is chosen
λ → ∞: dominated by the regularization term; the smoothest classifier is found
Training Algorithm
Regularization
Many solutions may obtain the same empirical error
i.e. the solution (boundary) is not unique
Considering regularization reduces the number of solutions
The ill-posed problem may be solved
Training Algorithm: Regularization
Weight Decay
A well-known regularization
Measures the magnitude of the weights
Smaller weights → smaller output change → smoother boundary
The objective function becomes
Minimize: Training Error + λ × (sum of squared weights)
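A sketch of how a weight-decay term changes the objective and the gradient update; the squared-weight penalty shown is the usual form of weight decay, and the symbols and values are illustrative.

```python
import numpy as np

def regularized_error(train_error, weights, lam):
    """Training error plus the weight-decay (regularization) term."""
    penalty = sum(np.sum(W ** 2) for W in weights)   # sum of squared weights
    return train_error + lam * penalty

def decayed_update(W, dJ_dW, eta, lam):
    """Gradient step on the regularized objective: the extra term shrinks W."""
    return W - eta * (dJ_dW + 2 * lam * W)

# Example: larger lambda pulls the weights harder towards zero
W = np.array([[2.0, -1.5], [0.5, 3.0]])
print(decayed_update(W, np.zeros_like(W), eta=0.1, lam=0.0))
print(decayed_update(W, np.zeros_like(W), eta=0.1, lam=0.5))
```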
Practical Techniques
Some issues in designing an MLP NN:
Target Value
Scaling Input
Input Data Type
Architecture
Initializing Weights
Learning Rates
Momentum
Stopped Training
Practical Techniques
Target Value
Two-class problem
1 output
1 for class 1; -1 for class 2
Multi-class problem
c outputs
One-hot Encoding (a bit string containing exactly one 1)
yi = 1 if x belongs to class i; otherwise yi=0
Can we just use 1 output?
Label Encoding
Set y = i if x belongs to class i
One-hot encoding vs. label encoding (multi-class):

ID         y1  y2  y3  |  y
x(1)  CN    1   0   0  |  1
x(2)  CN    1   0   0  |  1
x(3)  UK    0   1   0  |  2
x(4)  US    0   0   1  |  3

Two-class problem (single output):

ID         y
x(1)  CN    1
x(2)  CN    1
x(3)  UK   -1
Practical Techniques
Target Value
Label encoding should not be used
It represents categorical data by numerical data
But class labels have no ordering
E.g. it implies UK (2) lies between CN (1) and US (3)
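A small sketch of the two encodings for the country labels in the table above; the helper names are illustrative.

```python
import numpy as np

labels = ['CN', 'CN', 'UK', 'US']
classes = ['CN', 'UK', 'US']

# Label encoding: y = i if x belongs to class i (not recommended for nominal data)
label_encoded = np.array([classes.index(c) + 1 for c in labels])
print(label_encoded)            # [1 1 2 3]

# One-hot encoding: y_i = 1 for the true class, 0 otherwise
one_hot = np.zeros((len(labels), len(classes)))
one_hot[np.arange(len(labels)), label_encoded - 1] = 1
print(one_hot)
# [[1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```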
Practical Techniques
Scaling Input
Features with different natures have different properties (range, mean, …)
E.g. a student's weight and height:
Weight: 40 – 100 (kg)
Height: 0.6 – 2.2 (m)
Learning proper ANN weights is difficult when feature properties differ
E.g. a change of 0.1 is huge for height but not for weight
Practical Techniques
Scaling Input
How to reduce this influence?
Normalization (Standardization)
Standardize the samples so that all features have
Same range (e.g. 0 to 1 or -1 to 1)
Same variance (e.g. 1)
Same average (e.g. 0)
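A sketch of the two common scalings (min-max to [0, 1] and standardization to zero mean / unit variance) applied per feature; the sample weight/height values are illustrative.

```python
import numpy as np

# Columns: weight (kg), height (m) -- very different ranges
X = np.array([[55.0, 1.60], [80.0, 1.85], [95.0, 1.72], [48.0, 1.55]])

# Min-max scaling to the range [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```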
Practical Techniques
Input Data Type
An ANN only supports numerical data; how to deal with categorical data?
Nominal Data: Blue, Red, Green, Purple
Ordinal Data: Excellent, Good, Fair, Poor
“Excellent” - “Good” ≠ “Good” - “Fair”
As with the class ID, one-hot encoding should be used
Drawback: generates huge and sparse features
1 value generates 1 feature
Many 0s
Practical Techniques
Architecture
Backpropagation has the vanishing gradient problem
That is why a deep architecture cannot be obtained by simply extending an ANN
An ANN with traditional learning cannot be too deep
It usually contains at most 3 to 4 hidden layers
Otherwise, the results are not good
Practical Techniques
Architecture
How to determine the number of layers and neurons?
General concept
The more complicated the problem, the more complicated the model
Empirical method (Ad-hoc)
Evaluate a setting by a trained classifier
Pruning Method
Train a complicated classifier and remove the unnecessary structures
Practical Techniques
Initializing Weights
If all weights are set to 0 initially, learning can never start
The input does not affect the output, since ∑wx = 0
Weights are initialized randomly
Data normalization is important
Practical Techniques
Initializing Weights
Initialization depends on the activation function
Sigmoid:
If the initial w is too small → net = ∑wz may be small → the function behaves almost linearly (we want a non-linear mapping)
If the initial w is too large → net may be large → the hidden unit will saturate (output always 0 or 1) → gradients are killed
(Figure: sigmoid a(net) over net = ∑wz, showing the saturate / linear / saturate regions)
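A sketch of random initialization with a scale chosen to keep net = ∑wz in the sigmoid's useful range (neither the linear-only nor the saturated region); the 1/√fan-in scale is one common heuristic and an assumption here, not necessarily the scheme used in the course.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    """Small random weights, scaled by the number of inputs to each neuron."""
    scale = 1.0 / np.sqrt(fan_in)   # keeps |net| moderate for roughly unit inputs
    return rng.uniform(-scale, scale, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W1 = init_weights(3, 4, rng)   # input layer -> hidden layer
W2 = init_weights(4, 2, rng)   # hidden layer -> output layer
print(W1)
# All-zero initialization would make every net = 0, so learning could never start.
```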
Practical Techniques
Learning Rates
Small learning rate
Ensures convergence
Low learning speed
Stuck in a local minimum
Large learning rate
High learning speed
May never converge
Unstable
Practical Techniques
Learning Rates
Let η_opt be the optimal learning rate, which leads to the local error minimum in one step
η smaller than η_opt: slower convergence
η = η_opt: converge in one step
η somewhat larger than η_opt: oscillate, but slowly converge
η too large: diverge
(Figure: error curves for the four cases)
Practical Techniques
Momentum
What is Momentum?
Moving objects tend to keep moving unless acted upon by outside forces
Consider some fraction of the previous weight update in BP
Practical Techniques
Momentum
What is Momentum?
Moving objects tend to keep moving unless acted upon by outside forces
In the BP algorithm, the approach is to alter the learning rule to include some fraction α of the previous weight update:
w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1)
(current weight update and previous weight update, traded off by the momentum parameter α)
Practical Techniques
Momentum
Faster acceleration
Without momentum: w(t+1) = w(t) + Δw(t)
With momentum: w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1)
(Figure: error surface J vs. w, comparing the update steps without and with momentum)
Practical Techniques
Momentum
Escape from a local minimum
Without momentum: w(t+1) = w(t) + Δw(t)
With momentum: w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1)
(Figure: error surface J vs. w with a local minimum, comparing the update steps without and with momentum)
Practical Techniques
Momentum
Faster convergence
Without momentum: w(t+1) = w(t) + Δw(t)
With momentum: w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1)
(Figure: error surface J vs. w, comparing the update steps without and with momentum)
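A sketch of the momentum rule from the slides, w(t+1) = w(t) + (1 - α)·Δw(t) + α·Δw(t-1), applied to the same toy error function used earlier; the values of α and η are illustrative.

```python
def dJ_dw(w):
    return 2.0 * (w - 3.0)       # toy error J(w) = (w - 3)^2

w, eta, alpha = 0.0, 0.1, 0.5
prev_delta = 0.0
for step in range(50):
    delta = -eta * dJ_dw(w)                              # current weight update
    w = w + (1 - alpha) * delta + alpha * prev_delta     # blend with previous update
    prev_delta = delta
print(w)    # approaches the minimum at w = 3
```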
Practical Techniques
Stopped Training
Stopping the training before gradient descent is complete may avoid overfitting
A far more effective method is to stop training when the error on a separate validation set reaches a minimum
(Figure: training error, validation error, and generalization error vs. training epoch)
Algorithm
1. Separate the original training set into two sets: a new training set and a validation set
2. Use the new training set to train the classifier
3. Evaluate the classifier using the validation set at the end of each epoch
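A sketch of the three-step stopped-training procedure; `train_one_epoch` and `error_on` are placeholders for the actual training and evaluation routines, the data split (step 1) is assumed to have been done beforehand, and the patience window is one common way to detect that the validation error has reached its minimum.

```python
import copy

def stopped_training(model, train_set, val_set, max_epochs, patience,
                     train_one_epoch, error_on):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_model, bad_epochs = float('inf'), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)      # step 2: train on the new training set
        val_err = error_on(model, val_set)     # step 3: evaluate on the validation set
        if val_err < best_err:
            best_err, best_model, bad_epochs = val_err, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:         # validation error stopped improving
                break
    return best_model, best_err

# Toy usage with stand-in routines (a real MLP trainer would replace these):
model = {'epochs_trained': 0}
train_one_epoch = lambda m, data: m.__setitem__('epochs_trained', m['epochs_trained'] + 1)
error_on = lambda m, data: abs(10 - m['epochs_trained'])   # pretend error is lowest at epoch 10
best, err = stopped_training(model, None, None, 50, patience=3,
                             train_one_epoch=train_one_epoch, error_on=error_on)
print(best['epochs_trained'], err)   # keeps the model from the best validation epoch
```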