Single Layer Neural Networks
Steve Renals
Informatics 2B — Learning and Data, Lecture 11, 26 February 2010
Overview
Today’s lecture
Linear discriminants and single-layer neural networks
Training the weights of a single-layer neural network directly
Gradient descent
Recap: Gaussians with equal covariance
Consider the special case in which the Gaussian pdfs for each class all share the same class-independent covariance matrix:

    \Sigma_c = \Sigma \quad \forall c

By dropping terms that are now constant we can simplify the discriminant function to

    y_c(x) = (\mu_c^T \Sigma^{-1}) x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \ln P(C_c)

This is a linear function of x. We can define two variables in terms of \mu_c, P(C_c) and \Sigma:

    w_c^T = \mu_c^T \Sigma^{-1} \qquad w_{c0} = -\frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \ln P(C_c)

Substituting w_c and w_{c0} into the expression for y_c(x):

    y_c(x) = w_c^T x + w_{c0}

Linear discriminant function
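As a concrete illustration of these formulas, here is a minimal NumPy sketch (my own; the function name and the example numbers are not from the lecture) that computes w_c and w_{c0} from \mu_c, \Sigma and P(C_c) and evaluates the linear discriminant:

    import numpy as np

    def linear_discriminant_params(mu, Sigma, prior):
        # w_c = Sigma^{-1} mu_c (column form of w_c^T = mu_c^T Sigma^{-1}, Sigma symmetric)
        # w_c0 = -0.5 mu_c^T Sigma^{-1} mu_c + ln P(C_c)
        Sigma_inv = np.linalg.inv(Sigma)
        w = Sigma_inv @ mu
        w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        return w, w0

    # Two classes sharing one covariance matrix (illustrative numbers)
    Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    w1, w10 = linear_discriminant_params(mu1, Sigma, 0.5)
    w2, w20 = linear_discriminant_params(mu2, Sigma, 0.5)

    x = np.array([1.0, 0.5])
    y1, y2 = w1 @ x + w10, w2 @ x + w20   # y_c(x) = w_c^T x + w_c0
    print("predicted class:", 1 if y1 > y2 else 2)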
Recap: Decision Regions: equal covariance
[Figure: decision regions for equal-covariance Gaussians, plotted over x_1 and x_2 (both axes from -8 to 8)]
Single-layer neural networks
Linear discriminant functions for a K-class problem

    y_k(x) = w_k^T x + w_{k0}

May be represented as a single-layer neural network
Single-layer neural networks
Linear discriminant functions for a K-class problem

    y_k(x) = w_k^T x + w_{k0}

May be represented as a single-layer neural network

Define a K × (d + 1) weight matrix W whose kth row is the weight vector w_k^T

The 0th column is given by the biases w_{k0}

If we define an additional input dimension x_0 = 1, which corresponds to the bias, then we may write:

    y = Wx

In terms of individual components:

    y_k = \sum_{i=0}^{d} w_{ki} x_i
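A minimal NumPy sketch of this forward pass (my own illustration, not part of the lecture), with the biases stored in column 0 of the weight matrix:

    import numpy as np

    K, d = 3, 4                          # number of classes, input dimension
    rng = np.random.default_rng(0)
    W = rng.normal(size=(K, d + 1))      # K x (d+1); column 0 holds the biases w_{k0}

    x = rng.normal(size=d)               # a raw d-dimensional input
    x_aug = np.concatenate(([1.0], x))   # prepend x_0 = 1 for the bias

    y = W @ x_aug                        # y = Wx, i.e. y_k = sum_{i=0}^{d} w_{ki} x_i
    print(y.shape)                       # (K,)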
Multiclass single-layer neural network
[Figure: single-layer network with inputs x_0 (the bias input), x_1, ..., x_d below, outputs y_1, ..., y_K above, and a weight w_{ki} on each connection from input x_i to output y_k]
Input vector x = (x_0, x_1, ..., x_d)

Output vector y = (y_1, ..., y_K)

Weight matrix W: w_{ki} is the weight from input x_i to output y_k
    y = Wx \qquad y_k = \sum_{i=0}^{d} w_{ki} x_i
Training set
We want to train the weight matrix W such that (for our training data) we minimize the error in the output vectors y given the input vectors x

Training set: a set of N input/output pairs {(x^n, t^n) : 1 ≤ n ≤ N}, where t^n = (t^n_1, ..., t^n_K) is the target output vector for input vector x^n

For a classification problem, if the correct class is c, then:

    t^n_c = 1 \qquad t^n_k = 0 \quad \forall k \neq c

(1-from-N coding)

We can write the network output vector as y^n(x^n; W) (explicitly showing the dependence on the weight matrix and the input vector)
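A small sketch of this target coding (my own illustration; the helper name is hypothetical, not from the lecture):

    import numpy as np

    def one_hot_targets(labels, K):
        # t^n_c = 1 for the correct class c of example n, 0 for every other class
        N = len(labels)
        T = np.zeros((N, K))
        T[np.arange(N), labels] = 1.0
        return T

    labels = np.array([0, 2, 1, 2])        # correct class index for each of N = 4 examples
    print(one_hot_targets(labels, K=3))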
Sum-of-squares error function
Training problem: set the weight matrix W such that y^n(x^n; W) is as close as possible to t^n for all n

To address this problem we use the notion of an error function between the target and the actual network output

The sum-of-squares error function computes the (squared) Euclidean distance between t^n and y^n, summed over the whole training set 1 ≤ n ≤ N:

    E(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} (y^n_k - t^n_k)^2
         = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{i=0}^{d} w_{ki} x^n_i - t^n_k \right)^2

This error function E(W) is a smooth function of the weights

Training involves setting the weight matrix W to minimize E(W)
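As an illustration (my own sketch; the variable layout is an assumption, not from the lecture), the error can be computed in a few lines once the bias-augmented inputs are stacked into an N × (d+1) matrix and the targets into an N × K matrix:

    import numpy as np

    def sum_of_squares_error(W, X_aug, T):
        # E(W) = 0.5 * sum_n sum_k (y^n_k - t^n_k)^2, with y^n = W x^n
        Y = X_aug @ W.T              # row n holds the output vector y^n
        return 0.5 * np.sum((Y - T) ** 2)

    rng = np.random.default_rng(1)
    N, d, K = 4, 3, 2
    X_aug = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    T = np.eye(K)[rng.integers(0, K, size=N)]     # one-hot targets
    W = rng.normal(size=(K, d + 1))
    print(sum_of_squares_error(W, X_aug, T))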
Minimizing the error function
    E(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{i=0}^{d} w_{ki} x^n_i - t^n_k \right)^2
We find the minimum by looking for where the derivatives of E with respect to the weights are 0

Since E is a quadratic function of the weights, the derivatives are linear functions of the weights

Solving for the weight values:

Exact approach: the pseudoinverse solution (using the pseudoinverse of the bias-augmented input matrix)

Iterative approaches: IRLS (Newton-Raphson), gradient descent

We will only consider gradient descent
Gradient descent (1)
Gradient descent can be used whenever it is possible to compute the derivatives of the error function E with respect to the parameters to be optimized, W

Basic idea: adjust the weights to move downhill in weight space

Weight space: a K · (d + 1)-dimensional space; a weight matrix W is a point in weight space

The gradient of the error in weight space:

    \nabla_W E = \left( \frac{\partial E}{\partial w_{10}}, \ldots, \frac{\partial E}{\partial w_{ki}}, \ldots, \frac{\partial E}{\partial w_{Kd}} \right)^T
Gradient descent (2)
Operation of gradient descent:
1. Start with a guess for the weight matrix W (small random numbers)
2. Update the weights by adjusting the weight matrix in the direction of -\nabla_W E
3. Recompute the error, and iterate
The update for weight wki at iteration τ + 1 is:
    w^{\tau+1}_{ki} = w^{\tau}_{ki} - \eta \frac{\partial E}{\partial w_{ki}}
The parameter η is the learning rate
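To make the update rule concrete, here is a minimal sketch (my own, not from the lecture) of gradient descent on the one-parameter error function E(w) = (w - 3)^2:

    # E(w) = (w - 3)^2 has derivative dE/dw = 2 (w - 3) and its minimum at w = 3
    w = 0.0        # initial guess
    eta = 0.1      # learning rate
    for tau in range(50):
        grad = 2.0 * (w - 3.0)
        w = w - eta * grad      # w^{tau+1} = w^{tau} - eta * dE/dw
    print(w)                    # close to 3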
Gradients for a single-layer neural network
    E(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{i=0}^{d} w_{ki} x^n_i - t^n_k \right)^2

To minimize E with respect to W we differentiate E with respect to each weight w_{ki}:

    \frac{\partial E}{\partial w_{ki}} = \sum_{n=1}^{N} \left( \sum_{j=0}^{d} w_{kj} x^n_j - t^n_k \right) x^n_i
                                       = \sum_{n=1}^{N} (y^n_k - t^n_k) x^n_i
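This analytic derivative can be checked numerically with a finite difference (my own sketch, not from the lecture):

    import numpy as np

    def error(W, X_aug, T):
        return 0.5 * np.sum((X_aug @ W.T - T) ** 2)

    def analytic_grad(W, X_aug, T):
        # dE/dw_{ki} = sum_n (y^n_k - t^n_k) x^n_i, written as a matrix product
        return (X_aug @ W.T - T).T @ X_aug

    rng = np.random.default_rng(0)
    N, d, K = 6, 3, 2
    X_aug = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    T = np.eye(K)[rng.integers(0, K, size=N)]
    W = rng.normal(size=(K, d + 1))

    eps = 1e-6
    W_plus = W.copy()
    W_plus[1, 0] += eps                     # perturb a single weight, w_{k=1, i=0}
    numeric = (error(W_plus, X_aug, T) - error(W, X_aug, T)) / eps
    print(np.isclose(numeric, analytic_grad(W, X_aug, T)[1, 0], rtol=1e-4))   # True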
Gradient descent for a single layer neural network (2)
If we define \delta^n_k as the difference between the network output and the target, we can write:

    \delta^n_k = y^n_k - t^n_k

    \frac{\partial E}{\partial w_{ki}} = \sum_{n=1}^{N} \delta^n_k x^n_i

The derivative for the weight connecting input i to output k is calculated using the product of the error at the output and the input value, summed over the whole training set

Combining the expression for the derivatives with the expression for the gradient descent update we have:

    w^{\tau+1}_{ki} = w^{\tau}_{ki} - \eta \frac{\partial E}{\partial w_{ki}} = w^{\tau}_{ki} - \eta \sum_{n=1}^{N} \delta^n_k x^n_i
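Putting the pieces together, here is a minimal batch gradient-descent training loop for the single-layer network (my own sketch; names and default values are illustrative, not from the lecture):

    import numpy as np

    def train_single_layer(X, T, eta=0.01, n_iters=200, seed=0):
        # X: N x d inputs, T: N x K one-hot targets
        N, d = X.shape
        K = T.shape[1]
        X_aug = np.hstack([np.ones((N, 1)), X])   # prepend x_0 = 1 for the bias
        rng = np.random.default_rng(seed)
        W = 0.01 * rng.normal(size=(K, d + 1))    # small random initial weights

        for tau in range(n_iters):
            Y = X_aug @ W.T                       # forward pass: y^n = W x^n
            Delta = Y - T                         # delta^n_k = y^n_k - t^n_k
            grad = Delta.T @ X_aug                # dE/dw_{ki} = sum_n delta^n_k x^n_i
            W = W - eta * grad                    # gradient descent update
        return W

With a small enough learning rate \eta this loop decreases the quadratic error E(W) at every iteration; too large a value makes the updates diverge.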
Schematic of gradient descent training
[Figure: the single-layer network, with the bias input x_0 and inputs x_1, ..., x_d feeding output y_k through weights w_{k0}, w_{k1}, ..., w_{kd}, annotated with the quantities computed during training]

For each training example the forward pass computes the output

    y_k = \sum_{j=0}^{d} w_{kj} x_j

the error at output k is

    \delta_k = y_k - t_k

and the contribution to the weight update is accumulated as

    \Delta w^{\tau}_{ki} = \Delta w^{\tau}_{ki} + \delta_k \cdot x_i
Interpreting the bias parameter
Derivative with respect to the bias (at the minimum):
    \frac{\partial E}{\partial w_{k0}} = \sum_{n=1}^{N} \left( \sum_{j=1}^{d} w_{kj} x^n_j + w_{k0} - t^n_k \right) = 0
If we write:
    \bar{x}_i = \frac{1}{N} \sum_{n=1}^{N} x^n_i \qquad \bar{t}_k = \frac{1}{N} \sum_{n=1}^{N} t^n_k
Then we may write the solution for the bias as
    w_{k0} = \bar{t}_k - \sum_{i=1}^{d} w_{ki} \bar{x}_i
The bias may be interpreted as compensating for the difference between the training-set mean of the targets and the training-set mean of the network outputs (computed without the bias term)
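A quick numerical check of this relationship (my own sketch, not from the lecture): at the exact least-squares solution, the fitted bias equals \bar{t}_k minus the weighted input means:

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, K = 50, 3, 2
    X = rng.normal(size=(N, d))
    T = np.eye(K)[rng.integers(0, K, size=N)]       # one-hot targets

    X_aug = np.hstack([np.ones((N, 1)), X])         # column 0 is the bias input x_0 = 1
    W, *_ = np.linalg.lstsq(X_aug, T, rcond=None)   # exact minimizer of the sum-of-squares error
    W = W.T                                         # K x (d+1); column 0 holds the biases w_{k0}

    x_bar = X.mean(axis=0)                          # training-set mean of the inputs
    t_bar = T.mean(axis=0)                          # training-set mean of the targets
    print(np.allclose(W[:, 0], t_bar - W[:, 1:] @ x_bar))   # True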
Summary
Training single-layer neural networks
Sum-of-squares error function
Gradient descent
Good coverage in Bishop, Neural Networks for Pattern Recognition (sections 3.1.3 and 3.4)