
Machine Learning

Lecture 4

Supervised Learning

Multi-Layer Perceptron

Dr. Patrick Chan ([email protected])

South China University of Technology, China


Agenda

Artificial Neural Network

Multi-Layer Perceptron

Structure

LMS vs. Bayes Classifier

Backpropagation

Practical Techniques


Introduction

Recall linear discriminant functions:

Limited generalization capability

Cannot handle non-linearly separable problems

[Figure: single-layer network — inputs x1 … xd feed weighted sums Σ producing outputs g1 … gc]

Introduction

Solution 1: Mapping Function φ(x)

Pro: simple structure (still using an LDF)

Con: φ(x) and its parameters must be selected

[Figure: inputs pass through mapped features z1(x) … zd(x), then a linear output layer g1 … gc]

Introduction

Solution 2: Multi-Layer Neural Network

Standard structure

Hidden layers serve as the mapping

No prior knowledge is required (no need to choose φ(x))

[Figure: multi-layer network — input layer, hidden layers, output layer g1 … gc]

Artificial Neural Network

An ANN is biologically inspired by the human brain

[Figure: the human brain and an artificial NN both map inputs to outputs]

Artificial Neural Network

A neuron contains:

a cell body

Dendrites: a branching input structure

Axon: a branching output structure

A synapse is a connection between neurons


Artificial Neural Network

A neuron only fires if its input exceeds a threshold

Electro-chemical signals propagate from the dendrites through the cell body and along the axon to other neurons

Synapses vary in strength:

Strong connections allow a large signal

Weak connections allow only a weak signal


Artificial Neural Network

• Our brain contains about ten billion (10^10) neurons

• On average, each neuron has several thousand connections

• Hundreds of operations per second

• Neurons die off frequently (and are never replaced)

• The brain compensates for such problems by massive parallelism


Artificial Neural Network

Each neuron:

receives inputs from other neurons

computes a weighted sum of its inputs

fires if the activation level exceeds the threshold (t)

passes its output on to other neurons

[Figure: artificial neuron — inputs I1, I2, …, Id with weights w1, w2, …, wd, activation function, output O]

$a = \sum_{i=1}^{d} w_i I_i$

The neuron fires when the activation a exceeds the threshold t
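A minimal Python sketch of this threshold neuron (the weights, threshold, and inputs are illustrative values, not from the slides):

import numpy as np

def neuron(inputs, weights, threshold):
    # Weighted sum of the inputs: a = sum_i w_i * I_i
    a = np.dot(weights, inputs)
    # The neuron fires (outputs 1) only if the activation exceeds the threshold
    return 1 if a > threshold else 0

# Hypothetical example: three inputs, three weights, threshold t = 0.5
print(neuron(np.array([1.0, 0.0, 1.0]), np.array([0.4, 0.9, 0.3]), 0.5))  # fires: 0.7 > 0.5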


Artificial Neural Network

Types of ANN

Fully Connected Feedforward

Partially Connected Feedforward (discussed later)

Recurrent (discussed later)

[Figure: the three connection patterns on a small network with inputs x1 … x3 and outputs g1, g2]

Multi-Layer Perceptron

Multi-Layer Perceptron (MLP)

Neurons are arranged in layers

Each neuron is connected to all neurons in the next layer

Fully-connected

Feedforward

Neurons may have different activation functions, or no activation function


Multi-Layer Perceptron (MLP)

XOR Example

z1 = sgn(x1 + x2 + 0.5)

z2 = sgn(x1 + x2 − 1.5)

y = sgn(0.7 z1 − 0.4 z2 − 1)

[Figure: the XOR network — inputs x1, x2 (plus bias), hidden units z1, z2, output y; each hidden unit defines a linear boundary in the (x1, x2) plane]

Multi-Layer Perceptron (MLP)

XOR Example

x1 = 1, x2 = 1:

z1 = sgn(1 + 1 + 0.5) = sgn(2.5) = 1

z2 = sgn(1 + 1 − 1.5) = sgn(0.5) = 1

y = sgn(0.7 − 0.4 − 1) = sgn(−0.7) = −1

x1 = −1, x2 = −1:

z1 = sgn(−1 − 1 + 0.5) = sgn(−1.5) = −1

z2 = sgn(−1 − 1 − 1.5) = sgn(−3.5) = −1

y = sgn(−0.7 + 0.4 − 1) = sgn(−1.3) = −1
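The full XOR truth table can be verified with a short Python sketch of this exact network:

import numpy as np

def sgn(v):
    return 1 if v >= 0 else -1

def xor_net(x1, x2):
    # Hidden units: two linear boundaries in (x1, x2) space
    z1 = sgn(x1 + x2 + 0.5)
    z2 = sgn(x1 + x2 - 1.5)
    # Output unit combines the hidden outputs
    return sgn(0.7 * z1 - 0.4 * z2 - 1)

for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x1, x2, '->', xor_net(x1, x2))  # -1, 1, 1, -1: the XOR truth table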


Multi-Layer Perceptron (MLP)

Activation Function

Real neurons have non-linear properties

A non-linear activation function is a better choice

The binary step function is not a good choice:

Non-differentiable

Much information is ignored — like "the straw that broke the camel's back", inputs of 2 and 3 produce no different output


Multi-Layer Perceptron (MLP)

Activation Function

The sigmoid function, σ(x) = 1 / (1 + e^(−x)), is commonly used in MLPs

Differentiable


Multi-Layer Perceptron (MLP)

Activation Function

Sigmoid

Nice interpretation as a firing rate:

0 = not firing at all

1 = fully firing

Saturation: in the flat regions (output near 0 or 1), the gradient is almost zero, so the NN will barely learn — almost no signal flows to the weights

If the initial weights are too large or too small, most neurons will saturate


Multi-Layer Perceptron (MLP)

Activation Function

Tanh

A scaled sigmoid: tanh(x) = 2σ(2x) − 1

Like the sigmoid, tanh neurons saturate

Unlike the sigmoid, the output is zero-centered


Multi-Layer Perceptron (MLP)

Activation Function

Rectified Linear Unit (ReLU)

Most deep networks use ReLU nowadays

Trains much faster:

accelerates the convergence of stochastic gradient descent

due to its linear, non-saturating form

Less expensive operations than sigmoid/tanh (no exponentials, etc.)

More expressive

Prevents the vanishing gradient problem
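A minimal sketch of the three activations and their derivatives, written from the standard formulas:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1); saturates at both ends

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # near 0 in the saturated regions

def tanh(x):
    return np.tanh(x)                 # zero-centered; still saturates

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)         # non-saturating for x > 0; cheap to compute

def d_relu(x):
    return (x > 0).astype(float)      # gradient 1 for x > 0, 0 otherwise

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(d_sigmoid(x))  # tiny at |x| = 5: the "barely learns" saturation effect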


Multi-Layer Perceptron (MLP)

Activation Function

Other activation functions:

Identity

Binary step

Sinc

Gaussian

See https://en.wikipedia.org/wiki/Activation_function

Multi-Layer Perceptron (MLP)

Architecture

[Figure: four-layer network — inputs x1, x2, x3 (layer 1), two hidden layers (layers 2 and 3), outputs g1, g2 (layer 4)]

Notation:

$w_{i,jk}$: weight between neurons j and k in layer i

$y_{i,j}$: output of neuron j in layer i

With activation function f, each neuron computes

$y_{i,k} = f\Big( \sum_j w_{i,jk}\, y_{i-1,j} \Big)$


Multi-Layer Perceptron (MLP)

Architecture

[Figure: the same four-layer network]

Expanding layer by layer, each network output is a nested composition:

$g_k = f\Big( \sum_j w_{3,jk}\; f\Big( \sum_i w_{2,ij}\; f\Big( \sum_l w_{1,li}\, x_l \Big) \Big) \Big)$
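A minimal numpy sketch of this nested composition for the pictured 3-input, two-hidden-layer, 2-output network (the 3-4-4-2 layer sizes and random weights are assumptions; biases omitted for brevity):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# One weight matrix per layer-to-layer connection: W[i][j, k] links
# neuron j in one layer to neuron k in the next (hypothetical sizes 3-4-4-2)
W = [rng.normal(0, 0.5, (3, 4)),
     rng.normal(0, 0.5, (4, 4)),
     rng.normal(0, 0.5, (4, 2))]

def forward(x, W):
    y = x
    for Wi in W:
        y = sigmoid(y @ Wi)   # y_k = f(sum_j w_jk * y_j), layer by layer
    return y                  # network outputs g_1 ... g_c

print(forward(np.array([1.0, 0.5, -0.2]), W))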


Backpropagation

How are the weights determined?

The pseudoinverse cannot be used, as an ANN is not linear in general

Gradient descent:

$w \leftarrow w - \eta \frac{\partial J(w)}{\partial w}$, where η is the learning rate

How do we calculate ∂J(w)/∂w for each w?

[Figure: the same question applies to an LDF and, with many more weights, to an MLP]

Backpropagation

The calculation of the derivative flows backwards through the network

A natural extension of the LMS algorithm

[Figure: forward pass — calculate the output using W; backward pass — update W using the output error]


Backpropagation

Recall the chain rule:

$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$

Backpropagation

Example 1

Which paths to the output are affected by w_{2,11}?

The error on each output should be considered

Backprop from J to w_{2,11}

[Figure: network with inputs x1, x2, x3; outputs y1 = g1 and y2 = g2 with errors J1 and J2; total error J; w_{2,11} here connects hidden neuron 1 to output neuron 1]

Backpropagation

Example 1

The total error is the sum of the per-output errors:

$J = J_1 + J_2$, so $\frac{\partial J}{\partial w_{2,11}} = \frac{\partial J_1}{\partial w_{2,11}} + \frac{\partial J_2}{\partial w_{2,11}}$

Each output error is the squared error

$J_i = \frac{1}{2}(t_i - y_i)^2$, so $\frac{\partial J_i}{\partial y_i} = -(t_i - y_i)$

Each output is $y_i = f(net_i)$, where $net_i = \sum_j w_{2,ji}\, z_j$ and $z_j$ is the output of hidden neuron $j$. Let $f'(net) = \frac{\partial f(net)}{\partial net}$. Applying the chain rule,

$\frac{\partial J_i}{\partial w_{2,11}} = \frac{\partial J_i}{\partial y_i} \cdot f'(net_i) \cdot \frac{\partial net_i}{\partial w_{2,11}}$

$net_1$ depends on $w_{2,11}$ with $\frac{\partial net_1}{\partial w_{2,11}} = z_1$, while $net_2$ does not depend on $w_{2,11}$, so only the $J_1$ term survives:

$\frac{\partial J}{\partial w_{2,11}} = -(t_1 - y_1)\, f'(net_1)\, z_1$
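The final expression can be checked numerically. A sketch on a hypothetical 3-2-2 sigmoid network (all sizes and values are made up), comparing the chain-rule gradient for w_{2,11} with a finite difference:

import numpy as np

def f(x):               # sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def df(net):            # f'(net) for the sigmoid
    return f(net) * (1.0 - f(net))

rng = np.random.default_rng(1)
x = np.array([0.5, -0.3, 0.8])
t = np.array([1.0, 0.0])              # targets for the two outputs
W1 = rng.normal(0, 1, (3, 2))         # input -> hidden
W2 = rng.normal(0, 1, (2, 2))         # hidden -> output

def loss(W2):
    z = f(x @ W1)                     # hidden outputs z_j
    y = f(z @ W2)                     # network outputs y_i
    return 0.5 * np.sum((t - y) ** 2) # J = J1 + J2

# Chain rule for w_{2,11} (hidden 1 -> output 1): only J1 depends on it
z = f(x @ W1)
net = z @ W2
y = f(net)
grad = -(t[0] - y[0]) * df(net[0]) * z[0]

# Finite-difference check: the two printed numbers should agree
eps = 1e-6
W2p = W2.copy(); W2p[0, 0] += eps
print(grad, (loss(W2p) - loss(W2)) / eps)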


Backpropagation

Example 2

Which paths to the output are affected by w_{1,12}?

Backprop from J to w_{1,12}

The further a weight is from J, the more complicated its update

[Figure: network with inputs x1, x2, x3, two hidden layers, a single output y = g, and error J; w_{1,12} connects input x1 to neuron 2 of the first hidden layer]

Backpropagation

Example 2

With a single output, $J = \frac{1}{2}(t - y)^2$, and the output is $y = f(net_3)$ with $net_3 = \sum_k w_{3,k}\, y_{2,k}$

Each neuron in the second hidden layer computes $y_{2,k} = f(net_{2,k})$ with $net_{2,k} = \sum_j w_{2,jk}\, y_{1,j}$, and each neuron in the first hidden layer computes $y_{1,j} = f(net_{1,j})$ with $net_{1,j} = \sum_i w_{1,ij}\, x_i$. Let $f'(net) = \frac{\partial f(net)}{\partial net}$

$w_{1,12}$ reaches the output through $y_{1,2}$, and from there through every neuron $k$ of the second hidden layer, so the chain rule sums over all of these paths:

$\frac{\partial J}{\partial w_{1,12}} = \frac{\partial J}{\partial y}\, f'(net_3) \sum_k \Big[ w_{3,k}\, f'(net_{2,k})\, w_{2,2k} \Big]\, f'(net_{1,2})\, x_1$

$= -(t - y)\, f'(net_3) \Big( \sum_k w_{3,k}\, f'(net_{2,k})\, w_{2,2k} \Big) f'(net_{1,2})\, x_1$

The further the weight is from J, the more factors of $f'$ and the more summed paths the chain rule accumulates


Training Algorithm

The order in which training samples are presented affects the learning of an ANN

Two update methods:

Batch training

One update, all samples

The sample order has no effect

Stochastic training

One update, one sample

Samples can be chosen randomly to avoid the influence of the sample order (see the sketch below)
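A minimal sketch of the two update styles; the least-squares gradient and data here are placeholders standing in for backprop on an MLP:

import numpy as np

def batch_epoch(w, X, T, grad, eta):
    # One update computed from ALL samples: order cannot matter
    return w - eta * grad(w, X, T)

def stochastic_epoch(w, X, T, grad, eta, rng):
    # One update per sample; shuffle so the sample order has no
    # systematic influence on learning
    for i in rng.permutation(len(X)):
        w = w - eta * grad(w, X[i:i+1], T[i:i+1])
    return w

# Placeholder gradient: least squares for a linear model y = X w
def grad(w, X, T):
    return X.T @ (X @ w - T) / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3)); T = rng.normal(size=8)
w = np.zeros(3)
w = batch_epoch(w, X, T, grad, 0.1)
w = stochastic_epoch(w, X, T, grad, 0.1, rng)
print(w)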


Training Algorithm

What is the objective of a classifier?

To classify training samples accurately?

Training Error (Empirical Error, R_emp): computable — the error on the training set; this is the training objective

To classify unseen samples accurately?

Generalization Error (R_gen): non-computable — it can only be estimated; this is the ultimate objective

The training and ultimate objectives are different


Training Algorithm

Is a classifier with a smaller training error preferable?

The general answer is NO

It conflicts with the true objective function (all unseen samples)

[Figure: training vs. generalization error over learning iterations — underfitting, best model, overfitting]


Training Algorithm

The complexity of the model can also be considered in the objective function

The complexity is affected by:

Network architecture (number of layers and neurons)

Values of the parameters

Although there may be many parameters, the model can be simplified by setting some of them to zero


Training Algorithm

Regularization

Regularization term Ω(f):

Measures the smoothness of the boundary and the complexity of a classifier

λ: tradeoff parameter

May sacrifice accuracy on the training set for the simplicity of the classifier

Minimize: R_emp(f) + λ Ω(f)   (training error + tradeoff × regularization term)

Training Algorithm

Regularization

Minimize: R_emp(f) + λ Ω(f)

λ = 0: similar to the traditional training objective function — the regularization term has no effect

0 < λ < ∞: an f with good generalization ability can be found if a suitable λ is chosen

λ → ∞: dominated by the regularization term — the smoothest classifier is found


Training Algorithm

Regularization

Many solutions may obtain the same empirical error:

the solution (boundary) is not unique

Considering regularization reduces the number of solutions

The ill-posed problem may be solved

Training Algorithm: Regularization

Weight Decay

A well-known regularization

Measures the magnitude of the weights

Smaller weights → smaller output changes (recall net = ∑wz) → a smoother classifier

The objective function becomes

Minimize: R_emp + λ Σ w²   (summing over all weights; a sketch follows)
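A minimal sketch, assuming the penalty λ Σ w² (so its gradient contributes 2λw to each step):

import numpy as np

def weight_decay_step(w, error_grad, eta, lam):
    # Minimize  J = R_emp + lam * sum(w**2)
    # The penalty's gradient is 2 * lam * w, so every step also
    # shrinks ("decays") the weights toward zero
    return w - eta * (error_grad + 2.0 * lam * w)

w = np.array([2.0, -1.5, 0.1])
print(weight_decay_step(w, np.zeros(3), eta=0.1, lam=0.5))
# With a zero error gradient the weights just shrink: [1.8, -1.35, 0.09]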


Practical Techniques

Some issues in designing an MLP NN:

Target Value

Scaling Input

Input Data Type

Architecture

Initializing Weights

Learning Rates

Momentum

Stopped Training


Practical Techniques

Target Value

Two-class problem:

1 output

1 for class 1; −1 for class 2

Multi-class problem:

c outputs

One-hot encoding (a bit string containing only one 1):

yi = 1 if x belongs to class i; otherwise yi = 0

Can we just use 1 output?

Label encoding:

Set y = i if x belongs to class i

Multi-class (one-hot vs. label encoding):

            One-hot           Label
  ID        y1   y2   y3      y
  x(1) CN   1    0    0       1
  x(2) CN   1    0    0       1
  x(3) UK   0    1    0       2
  x(4) US   0    0    1       3

Two-class (single output):

  ID        y
  x(1) CN   1
  x(2) CN   1
  x(3) UK   −1
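A small sketch that produces the one-hot targets in the table above:

import numpy as np

def one_hot(labels, classes):
    # Each label becomes a bit string containing exactly one 1
    Y = np.zeros((len(labels), len(classes)))
    for row, lab in enumerate(labels):
        Y[row, classes.index(lab)] = 1
    return Y

labels = ['CN', 'CN', 'UK', 'US']
print(one_hot(labels, ['CN', 'UK', 'US']))
# rows: [1 0 0], [1 0 0], [0 1 0], [0 0 1] -- the y1, y2, y3 columns above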


Practical Techniques

Target Value

Label encoding should not be used:

It represents categorical data by numerical data

But labels have no notion of order

E.g. it implies UK (2) is in between CN (1) and US (3)


Practical Techniques

Scaling Input

Features of different natures have different properties (range, mean, …)

E.g. a student's weight and height:

Weight: 40 – 100 (kg)

Height: 0.6 – 2.2 (m)

Learning proper ANN weights is difficult when feature properties differ

E.g. a change of 0.1 is huge for height but negligible for weight


Practical Techniques

Scaling Input

How can this influence be reduced?

Normalization (standardization): standardize the samples to have

the same range (e.g. 0 to 1, or −1 to 1)

the same variance (e.g. 1)

the same average (e.g. 0)
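A minimal sketch of both common normalizations, using the weight/height example:

import numpy as np

def standardize(X):
    # Per-feature zero mean and unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X, lo=0.0, hi=1.0):
    # Per-feature rescaling to the same range [lo, hi]
    Xmin, Xmax = X.min(axis=0), X.max(axis=0)
    return lo + (X - Xmin) / (Xmax - Xmin) * (hi - lo)

# Weight (kg) and height (m): three made-up students
X = np.array([[40.0, 0.6], [70.0, 1.6], [100.0, 2.2]])
print(standardize(X))
print(min_max(X))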


Practical Techniques

Input Data Type

ANNs only support numerical data; how do we deal with categorical data?

Nominal data: Blue, Red, Green, Purple

Ordinal data: Excellent, Good, Fair, Poor

"Excellent" − "Good" ≠ "Good" − "Fair"

As with class IDs, one-hot encoding should be used

Drawback: generates huge, sparse features

1 value generates 1 feature

Many 0s


Practical Techniques

Architecture

Backpropagation suffers from the vanishing gradient problem

This is why a deep architecture cannot be obtained by simply extending an ANN

An ANN with traditional learning cannot be too deep

It usually contains at most 3 to 4 hidden layers

Otherwise, the results are not good


Practical Techniques

Architecture

How are the numbers of layers and neurons determined?

General concept:

The more complicated the problem, the more complicated the model

Empirical method (ad hoc):

Evaluate a setting by training a classifier with it

Pruning method:

Train a complicated classifier and remove the unnecessary structures


Practical Techniques

Initializing Weights

If all weights are set to 0 initially, learning can never start:

the input does not affect the output (∑wx = 0)

Weights are therefore initialized randomly

Data normalization is important here


Practical Techniques

Initializing Weights

Initialization depends on the activation function

Sigmoid:

If the initial w is too small, net = ∑wz may be small and the function is almost linear (we want a non-linear mapping)

If the initial w is too large, net may be large and the hidden units saturate (output always 0 or 1), which kills gradients

[Figure: sigmoid a(net) with saturate / linear / saturate regions]
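One common heuristic (an assumption here, not given on the slide) draws weights uniformly from ±1/√n_in, so net = ∑wz starts away from both the saturated and the purely linear extremes:

import numpy as np

def init_weights(n_in, n_out, rng):
    # Small random values scaled by fan-in: not all-zero (learning can
    # start), not too large (sigmoids do not saturate), not too small
    # (the units are not all effectively linear)
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

rng = np.random.default_rng(0)
W1 = init_weights(3, 4, rng)   # input -> hidden
W2 = init_weights(4, 2, rng)   # hidden -> output
print(W1)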


Practical Techniques

Learning Rates

Small learning rate:

Ensures convergence

Low learning speed

May get stuck in a local minimum

Large learning rate:

High learning speed

May never converge

Unstable


Practical Techniques

Learning Rates

Let η_opt be the optimal learning rate, which leads to the local error minimum in one step

[Figure: gradient descent with η < η_opt — slower convergence; η = η_opt — converge in one step; η_opt < η < 2η_opt — oscillate but slowly converge; η > 2η_opt — diverge; a toy demonstration follows]
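For J(w) = w²/2 the curvature is 1, so η_opt = 1; a toy sketch (not from the slides) showing all four regimes:

# Gradient descent on J(w) = w**2 / 2, whose gradient is w.
# For this J, one step with eta = 1 lands exactly at the minimum.
def descend(eta, w=1.0, steps=5):
    path = [w]
    for _ in range(steps):
        w = w - eta * w          # w <- w - eta * dJ/dw
        path.append(round(w, 3))
    return path

print(descend(0.5))   # eta < eta_opt: converges slowly
print(descend(1.0))   # eta = eta_opt: minimum in one step
print(descend(1.5))   # eta_opt < eta < 2*eta_opt: oscillates but converges
print(descend(2.5))   # eta > 2*eta_opt: diverges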


Practical Techniques

Momentum

What is momentum?

Moving objects tend to keep moving unless acted upon by outside forces

In the BP algorithm, the learning rule is altered to include some fraction α of the previous weight update:

$w^{(t+1)} = w^{(t)} + (1 - \alpha)\, \Delta w^{(t)} + \alpha\, \Delta w^{(t-1)}$

α is the tradeoff (momentum) parameter, blending the current and the previous weight update
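A direct transcription of this update rule in Python (the gradient function, η, and α here are placeholder choices):

import numpy as np

def train_with_momentum(w, grad, eta=0.1, alpha=0.9, steps=100):
    prev_dw = np.zeros_like(w)
    for _ in range(steps):
        dw = -eta * grad(w)                         # current update
        w = w + (1 - alpha) * dw + alpha * prev_dw  # blend with previous update
        prev_dw = dw
    return w

# Toy example: J(w) = w**2 / 2 in each coordinate, so grad(w) = w
print(train_with_momentum(np.array([1.0, -2.0]), grad=lambda w: w))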

Practical Techniques

Momentum

Faster acceleration:

Without momentum: $w^{(t+1)} = w^{(t)} + \Delta w^{(t)}$

With momentum: $w^{(t+1)} = w^{(t)} + (1 - \alpha)\, \Delta w^{(t)} + \alpha\, \Delta w^{(t-1)}$

[Figure: on a long shallow slope of J(w), successive updates reinforce each other, so the weight moves faster with momentum]

Practical Techniques

Momentum

Escape from a local minimum:

[Figure: without momentum, w stops at a local minimum of J; with momentum, the fraction of the previous update carries w over the bump]

Practical Techniques

Momentum

Faster convergence:

[Figure: near the minimum, momentum damps the oscillation across the valley, so fewer steps are needed than without momentum]


Practical Techniques

Stopped Training

Stopping training before gradient descent is complete may avoid overfitting

A far more effective method is to stop training when the error on a separate validation set reaches a minimum

[Figure: training error keeps decreasing with training; validation error, which tracks the generalization error, reaches a minimum — the stopping point]

Algorithm (a sketch follows):

1. Separate the original training set into two sets: a new training set and a validation set

2. Use the new training set to train the classifier

3. Evaluate the classifier on the validation set at the end of each epoch
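A minimal sketch of this algorithm; train_epoch, error, and the patience rule are stand-ins and assumptions, since the slide only says to stop at the validation minimum:

import numpy as np

def early_stopping(w, train_epoch, error, train_set, val_set, max_epochs=1000):
    # Keep the weights that scored best on the validation set; stop once
    # the validation error has not improved for `patience` epochs
    best_err, best_w, patience, bad = np.inf, w, 10, 0
    for _ in range(max_epochs):
        w = train_epoch(w, train_set)      # step 2: train on the new training set
        err = error(w, val_set)            # step 3: evaluate at the end of each epoch
        if err < best_err:
            best_err, best_w, bad = err, w, 0
        else:
            bad += 1
            if bad >= patience:            # validation error has passed its minimum
                break
    return best_w

# usage sketch: best = early_stopping(w0, train_epoch, error, train_set, val_set)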