Neural Networks and Machine Learning Applications
CSC 563
Prof. Mohamed Batouche
Computer Science Department, CCIS – King Saud University
Riyadh, Saudi Arabia
mbatouche@ccis.ksu.edu.sa
Artificial Complex Systems
Artificial Neural Networks: Perceptrons and Multi-Layer Perceptrons (MLP)
Artificial Neural Networks
Perceptron
The Perceptron

[Figure: a perceptron with inputs x1, …, x5, weights w0, …, w5, a summation unit, and output y]

y = sgn( Σ_{i=0..n} w_i x_i )

Initialization: w_i = 0 for i = 0, …, n

The first model of a biological neuron.
Artificial Neuron: Perceptron
It is a step function based on a linear combination of real-valued inputs: if the combination is above a threshold it outputs a 1, otherwise it outputs a –1.
[Figure: perceptron unit with bias input x0 = 1, inputs x1, x2, …, xn, weights w0, w1, …, wn, a summation Σ, and a threshold output in {1, –1}]
Perceptron: activation rule
O(x1, x2, …, xn) = 1 if w0 + w1x1 + w2x2 + … + wnxn > 0
                   –1 otherwise

To simplify, we can represent the function as follows:

O(X) = sgn(WᵀX) where sgn(y) = 1 if y > 0, –1 otherwise
Activation rule: linear threshold (step) unit.
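As a minimal sketch in plain Matlab (the function name and calling convention are illustrative, not from the slides), the activation rule can be written directly:

% Minimal sketch of the perceptron activation rule (save as perceptron_output.m).
% W and X are column vectors; X(1) is the bias input x0 = 1.
function o = perceptron_output(W, X)
    if W' * X > 0      % is WᵀX above the threshold?
        o = 1;
    else
        o = -1;
    end
end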
What does a perceptron do?

For a perceptron with two input variables, x1 and x2, the equation WᵀX = 0 determines a line separating positive from negative examples.
[Figure: the line w1x1 + w2x2 + w0 = 0 in the (x1, x2) plane, and the corresponding perceptron computing y = sgn(w1x1 + w2x2 + w0)]
What does a perceptron do?

For a perceptron with n input variables, it draws a hyperplane as the decision boundary over the (n-dimensional) input space, classifying input patterns into two classes. The perceptron outputs 1 for instances lying on one side of the hyperplane and –1 for instances on the other side.
[Figure: the plane w1x1 + w2x2 + w3x3 + w0 = 0 separating the three-dimensional input space (x1, x2, x3)]
What can be represented using Perceptrons?

[Figure: linear decision boundaries realizing AND and OR over two inputs]

Representation theorem: perceptrons can only represent linearly separable functions. Examples: AND, OR, NOT.
Limits of the Perceptron

A perceptron can learn only examples that are called "linearly separable". These are examples that can be perfectly separated by a hyperplane.
[Figure: a set of + examples separable from – examples by a line (linearly separable) vs. interleaved + and – examples (non-linearly separable)]
Functions for Perceptron

Perceptrons can learn many boolean functions: AND, OR, NAND, NOR, but not XOR.
AND: a perceptron over inputs x1, x2 with bias input x0 = 1 and weights w0 = –0.8, w1 = 0.5, w2 = 0.5.
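A quick sketch in plain Matlab (no toolbox; the loop is illustrative) verifies these weights:

% Verify that w0 = -0.8, w1 = w2 = 0.5 implement AND (output in {-1, 1}).
W = [-0.8; 0.5; 0.5];
for x1 = 0:1
    for x2 = 0:1
        o = sign(W' * [1; x1; x2]);   % positive only when x1 = x2 = 1
        fprintf('%d AND %d -> %d\n', x1, x2, o);
    end
end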
Learning Perceptrons

• Learning is a process by which the free parameters of a neural network are adapted through stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.
• In the case of perceptrons, we use supervised learning.
• Learning a perceptron means finding the right values for W that satisfy the input examples {(input_i, target_i)}.
• The hypothesis space of a perceptron is the space of all weight vectors.
Learning Perceptrons

Principle of learning using the perceptron rule:

1. A set of training examples is given: {(x, t)} where x is the input and t the target output [supervised learning].
2. Examples are presented to the network.
3. For each example, the network gives an output o.
4. If there is an error, the hyperplane is moved in order to correct the output error.
5. When all training examples are correctly classified, stop learning.
Learning Perceptrons

More formally, the algorithm for learning perceptrons is as follows:

1. Assign random values to the weight vector.
2. Apply the perceptron rule to every training example.
3. Are all training examples correctly classified?
   Yes: quit. No: go back to step 2.
Perceptron Training Rule

For a new training example [X = (x1, x2, …, xn), t], update each weight according to this rule:

w_i = w_i + Δw_i where Δw_i = η (t – o) x_i

t: target output
o: output generated by the perceptron
η: constant called the learning rate (e.g., 0.1)
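Putting the rule into a training loop gives the following sketch in plain Matlab (the function name, argument layout, and the 0-as-negative convention are assumptions for illustration, not the course's code):

% Sketch of the perceptron training rule (save as train_perceptron.m).
% X: m-by-n matrix of examples (one per row); t: m-by-1 targets in {-1, +1};
% eta: learning rate. Assumes the examples are linearly separable.
function W = train_perceptron(X, t, eta)
    [m, n] = size(X);
    Xb = [ones(m, 1), X];              % prepend bias input x0 = 1
    W  = rand(n + 1, 1) - 0.5;         % small random initial weights
    done = false;
    while ~done                        % repeat until no example is misclassified
        done = true;
        for i = 1:m
            o = sign(Xb(i, :) * W);
            if o == 0, o = -1; end     % assumed convention: treat 0 as -1
            if o ~= t(i)
                W = W + eta * (t(i) - o) * Xb(i, :)';   % w <- w + eta (t - o) x
                done = false;
            end
        end
    end
end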
Perceptron Training Rule

Comments about the perceptron training rule:

• If the example is correctly classified, the term (t – o) equals zero, and no update on the weight is necessary.
• If the perceptron outputs –1 and the real answer is 1, the weight is increased.
• If the perceptron outputs 1 and the real answer is –1, the weight is decreased.
• Provided the examples are linearly separable and a small value for η is used, the rule is proved to classify all training examples correctly.
Perceptron Training Rule

Consider the following example (two classes: red and green):
Random initialization of perceptron weights …

Apply the perceptron training rule iteratively on the different examples:

[Figures: the separating line is adjusted after each misclassified example]

All examples are correctly classified … stop learning.
The straight line w1x + w2y + w0 = 0 separates the two classes.

[Figure: final separating line between the red and green classes]
Matlab Demo

Perceptron training rule demo: learning the AND and OR functions; try to learn XOR with a perceptron.
Learning AND/OR operations

P = [ 0 0 1 1; ...      % input patterns
      0 1 0 1 ];
T = [ 0 1 1 1 ];        % desired outputs (OR; use T = [0 0 0 1] for AND)

net = newp([0 1; 0 1], 1);        % perceptron with two inputs in [0, 1]
net.adaptParam.passes = 35;
net = adapt(net, P, T);

x = [1; 1];
y = sim(net, x);
display(y);
[Figure: two-input perceptron with inputs x1, x2, weights w1, w2, bias w0, and output y]
Artificial Neural Networks
Multi-Layer Perceptron (MLP)
Solution for XOR: add a hidden layer!

[Figure: network with input nodes x1, x2, internal (hidden) nodes, and an output node computing x1 XOR x2]
Solution for XOR: add a hidden layer!

The problem is: how do we learn Multi-Layer Perceptrons?

Solution: the Backpropagation algorithm, invented by Rumelhart and colleagues in 1986.
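Before turning to backpropagation, a hand-wired sketch shows that one hidden layer suffices for XOR. The decomposition XOR = AND(OR, NAND) and all weights below are illustrative choices, not taken from the slides:

% Sketch: a two-layer network computing XOR with step units,
% using XOR(x1, x2) = AND( OR(x1, x2), NAND(x1, x2) ).
step  = @(z) double(z > 0);
OR_   = @(x) step(-0.3 + 0.5*x(1) + 0.5*x(2));   % hidden unit 1
NAND_ = @(x) step( 0.8 - 0.5*x(1) - 0.5*x(2));   % hidden unit 2
AND_  = @(h) step(-0.8 + 0.5*h(1) + 0.5*h(2));   % output unit
XOR_  = @(x) AND_([OR_(x), NAND_(x)]);
for x1 = 0:1
    for x2 = 0:1
        fprintf('%d XOR %d = %d\n', x1, x2, XOR_([x1 x2]));
    end
end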
Multi-Layer Perceptron

In contrast to perceptrons, multilayer networks can learn not only multiple decision boundaries, but the boundaries may be nonlinear.

[Figure: network with input nodes, internal (hidden) nodes, and output nodes]
Multi-Layer Perceptron: Decision Boundaries

Single-layer: half plane bounded by a hyperplane.
Two-layer: convex open or closed regions.
Three-layer: arbitrary regions (complexity limited by the number of neurons).

[Figure: example regions separating classes A and B achievable at each depth]
Example

[Figure: an example decision region over inputs x1 and x2]
One Single Unit

To make nonlinear partitions of the space we need to define each unit as a nonlinear function (unlike the perceptron). One solution is to use the sigmoid unit.

[Figure: unit with bias input x0 = 1, inputs x1, …, xn, weights w0, …, wn, net = Σ w_i x_i, and output O = σ(net) = 1 / (1 + e^(–net))]
Sigmoid or logistic function

O(x1, x2, …, xn) = σ(WX) where σ(WX) = 1 / (1 + e^(–WX))

Function σ is called the sigmoid or logistic function. This function is easy to differentiate and has the following property:

dσ(y) / dy = σ(y) (1 – σ(y))
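A small plain-Matlab sketch (illustrative, not from the slides) checks this derivative identity numerically:

% Numerically verify d(sigma)/dy = sigma(y) * (1 - sigma(y)).
sigma = @(y) 1 ./ (1 + exp(-y));
y = -2:0.5:2;
h = 1e-6;
numeric  = (sigma(y + h) - sigma(y - h)) / (2 * h);   % central difference
analytic = sigma(y) .* (1 - sigma(y));
disp(max(abs(numeric - analytic)));    % should be close to zero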
Hyperbolic Tangent activation function: Tanh

O(x1, x2, …, xn) = tanh(WX) where tanh(WX) = (e^(WX) – e^(–WX)) / (e^(WX) + e^(–WX))

Function tanh is called the hyperbolic tangent function. This function is easy to differentiate and has the following property:

d tanh(y) / dy = (1 + tanh(y)) (1 – tanh(y))
Learning Multi-Layer Perceptrons

Backpropagation algorithm:

Goal: to learn the weights for all links in an interconnected multilayer network.

We begin by defining our measure of error:

E(W) = ½ Σ_d Σ_k (t_kd – o_kd)² = ½ Σ_examples (t – o)² = ½ Err²

where k varies over the output nodes and d over the training examples.

The idea is to use gradient descent over the space of weights to find a global minimum (no guarantee).
Gradient Descent
Minimizing Error Using Steepest Descent

The main idea: find the way downhill and take a step.

[Figure: error curve E(x) with its minimum]

downhill = – dE/dx

x ← x – η dE/dx, where η is the step size.
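A one-dimensional sketch of this update in plain Matlab (the quadratic E and all constants are illustrative):

% Steepest descent on E(x) = (x - 3)^2, whose minimum is at x = 3.
E    = @(x) (x - 3).^2;
dEdx = @(x) 2 * (x - 3);     % dE/dx
eta  = 0.1;                  % step size
x    = 0;                    % starting point
for k = 1:50
    x = x - eta * dEdx(x);   % x <- x - eta * dE/dx
end
fprintf('x = %.4f, E(x) = %.6f\n', x, E(x));   % x close to 3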
Reduction of Squared Error

Gradient descent reduces the squared error by calculating the partial derivative of E with respect to each weight:

∂E/∂W_j = Err × ∂Err/∂W_j                       (chain rule for derivatives)
        = Err × ∂/∂W_j ( t – g(Σ_j W_j x_j) )    (expand the second Err to t – g(in); Σ_j W_j x_j is called "in")
        = – Err × g′(in) × x_j                   (because t is a constant, and the chain rule)

The weight is updated by η times this gradient of the error in weight space:

W_j ← W_j + η × Err × g′(in) × x_j

η is the learning rate, typically set to a small value such as 0.1. Note that the gradient of E over all the weights is a vector; the rule above gives its j-th component. The fact that the weight is updated in the correct direction (+/–) can be verified with examples.
Backpropagation Algorithm

Create a network with n_in input nodes, n_hidden internal nodes, and n_out output nodes.

Initialize all weights to small random numbers in the range –0.5 to 0.5.

Until the error is small, do:
  For each example X do:
    Propagate example X forward through the network
    Propagate errors backward through the network
Backpropagation Algorithm

[Figure: for each training pair (X, D), the input X = (x1, …, x5) is propagated forward to produce the output Y = (y1, …, y4); the errors E = (e1, …, e4) between Y and the desired output D are propagated backward]

In the classification phase, only the forward propagation step is used to classify patterns.
The Backpropagation Algorithm for Three-Layer Networks with Sigmoid Units

Initialize all weights in the network to small random numbers.
Until the weights converge (may take thousands of iterations) do:
  For each training example:
    Compute the network output vector o
    For each output unit i do:
      δ_i = o_i (1 – o_i) (t_i – o_i)                     [error gradient]
    For each hidden unit j do:
      δ_j = o_j (1 – o_j) Σ_{i ∈ outputs} W_{j,i} δ_i     [error backpropagation]
    Update each network weight from hidden unit j to output unit i:
      W_{j,i} ← W_{j,i} + η δ_i a_j
    Update each network weight from input k to hidden unit j:
      W_{k,j} ← W_{k,j} + η δ_j a_k
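A compact sketch of these updates in plain Matlab (the function name, matrix layout, and per-example update scheme are assumptions for illustration, not the course's code):

% Sketch of backpropagation for one hidden layer of sigmoid units
% (save as backprop.m). P: nIn-by-m inputs (columns are examples);
% T: nOut-by-m targets in (0, 1); eta: learning rate.
function [W1, W2] = backprop(P, T, nHidden, eta, nEpochs)
    [nIn, m] = size(P);
    nOut = size(T, 1);
    W1 = rand(nHidden, nIn + 1) - 0.5;    % input -> hidden (bias in column 1)
    W2 = rand(nOut, nHidden + 1) - 0.5;   % hidden -> output (bias in column 1)
    sigma = @(z) 1 ./ (1 + exp(-z));
    for epoch = 1:nEpochs
        for d = 1:m
            a0 = [1; P(:, d)];            % input activations plus bias
            a1 = [1; sigma(W1 * a0)];     % hidden activations plus bias
            o  = sigma(W2 * a1);          % output activations
            % delta_i = o_i (1 - o_i)(t_i - o_i) for each output unit
            d2 = o .* (1 - o) .* (T(:, d) - o);
            % delta_j = o_j (1 - o_j) sum_i W_ji delta_i (skip the bias row)
            d1 = a1(2:end) .* (1 - a1(2:end)) .* (W2(:, 2:end)' * d2);
            % W <- W + eta * delta * a  (the two update rules on this slide)
            W2 = W2 + eta * d2 * a1';
            W1 = W1 + eta * d1 * a0';
        end
    end
end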
The Problem of Overfitting

Approximation of a function y = f(x):

[Figure: approximations of y = f(x) using 2, 5, and 40 neurons in the hidden layer]

Overfitting is not detectable in the learning phase, so use cross-validation.
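A sketch of a simple hold-out validation, in the style of the toolbox code later in these slides (the 70/30 split, epoch count, and target function are illustrative):

% Hold-out validation sketch: train on 70% of the data, check the error
% on the held-out 30% to detect overfitting.
P = 0:0.1:10;  T = sin(P);
idx = randperm(numel(P));
nTr = round(0.7 * numel(P));
trI = idx(1:nTr);  vaI = idx(nTr+1:end);
net = newff([0 10], [8 1], {'tansig' 'purelin'});
net.trainParam.epochs = 500;
net = train(net, P(trI), T(trI));
trErr = mean((sim(net, P(trI)) - T(trI)).^2);   % training MSE
vaErr = mean((sim(net, P(vaI)) - T(vaI)).^2);   % validation MSE
fprintf('train MSE = %.4f, validation MSE = %.4f\n', trErr, vaErr);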
Application of ANNs

The general scheme when using ANNs is as follows:

[Figure: a stimulus is encoded into an input bit pattern, the network maps it to an output pattern, and the output pattern is decoded into a response]
Application: Digit Recognition
Matlab Demo

Function approximation and pattern recognition.
Learning XOR Operation: Matlab Code

P = [ 0 0 1 1; ...
      0 1 0 1 ];       % input patterns
T = [ 0 1 1 0 ];       % desired outputs (XOR)

net = newff([0 1; 0 1], [6 1], {'tansig' 'tansig'});   % 6 hidden units, 1 output
net.trainParam.epochs = 4850;
net = train(net, P, T);

X = [0; 1];            % a single input pattern (column vector)
Y = sim(net, X);
display(Y);
Function Approximation: Learning the Sine Function

P = 0:0.1:10;
T = sin(P) * 10.0;

net = newff([0.0 10.0], [8 1], {'tansig' 'purelin'});
plot(P, T); pause;
Y = sim(net, P);
plot(P, T, P, Y, 'o'); pause;    % output of the untrained network

net.trainParam.epochs = 4850;
net = train(net, P, T);

Y = sim(net, P);
plot(P, T, P, Y, 'o');           % output after training