Single Layer Neural Networks
Steve Renals
Informatics 2B — Learning and Data, Lecture 11, 26 February 2010
Overview
Today’s lecture
Linear discriminants and single-layer neural networks
Training the weights of a single-layer neural network directly
Gradient descent
Recap: Gaussians with equal covariance
Consider the special case in which the Gaussian pdfs for each class all share the same class-independent covariance matrix:

    \Sigma_c = \Sigma \quad \forall c

By dropping terms that are now constant we can simplify the discriminant function to

    y_c(x) = (\mu_c^T \Sigma^{-1}) x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \ln P(C_c)

This is a linear function of x. We can define two variables in terms of \mu_c, P(C_c) and \Sigma:

    w_c^T = \mu_c^T \Sigma^{-1} \qquad w_{c0} = -\frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \ln P(C_c)

Substituting w_c and w_{c0} into the expression for y_c(x):

    y_c(x) = w_c^T x + w_{c0}

Linear discriminant function
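As a concrete illustration of these formulas, here is a minimal NumPy sketch (my own; the function name and the example numbers are not from the lecture) that computes w_c and w_{c0} from \mu_c, \Sigma and P(C_c) and evaluates the linear discriminant:

    import numpy as np

    def linear_discriminant_params(mu, Sigma, prior):
        # w_c = Sigma^{-1} mu_c (column form of w_c^T = mu_c^T Sigma^{-1}, Sigma symmetric)
        # w_c0 = -0.5 mu_c^T Sigma^{-1} mu_c + ln P(C_c)
        Sigma_inv = np.linalg.inv(Sigma)
        w = Sigma_inv @ mu
        w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        return w, w0

    # Two classes sharing one covariance matrix (illustrative numbers)
    Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    w1, w10 = linear_discriminant_params(mu1, Sigma, 0.5)
    w2, w20 = linear_discriminant_params(mu2, Sigma, 0.5)

    x = np.array([1.0, 0.5])
    y1, y2 = w1 @ x + w10, w2 @ x + w20   # y_c(x) = w_c^T x + w_c0
    print("predicted class:", 1 if y1 > y2 else 2)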
Recap: Decision Regions: equal covariance
[Figure: decision regions for equal-covariance Gaussians, plotted over x_1 and x_2 (both axes from -8 to 8)]
Single-layer neural networks
Linear discriminant functions for a K-class problem

    y_k(x) = w_k^T x + w_{k0}

May be represented as a single-layer neural network
Single-layer neural networks
Linear discriminant functions for a K-class problem

    y_k(x) = w_k^T x + w_{k0}

May be represented as a single-layer neural network

Define a K × (d + 1) weight matrix W whose kth row is the weight vector w_k^T

The 0th column is given by the biases w_{k0}

If we define an additional input dimension x_0 = 1, which corresponds to the bias, then we may write:

    y = Wx

In terms of individual components:

    y_k = \sum_{i=0}^{d} w_{ki} x_i
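A minimal NumPy sketch of this forward pass (my own illustration, not part of the lecture), with the biases stored in column 0 of the weight matrix:

    import numpy as np

    K, d = 3, 4                          # number of classes, input dimension
    rng = np.random.default_rng(0)
    W = rng.normal(size=(K, d + 1))      # K x (d+1); column 0 holds the biases w_{k0}

    x = rng.normal(size=d)               # a raw d-dimensional input
    x_aug = np.concatenate(([1.0], x))   # prepend x_0 = 1 for the bias

    y = W @ x_aug                        # y = Wx, i.e. y_k = sum_{i=0}^{d} w_{ki} x_i
    print(y.shape)                       # (K,)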
Multiclass single-layer neural network
[Figure: single-layer network with inputs x_0 (the bias input), x_1, ..., x_d below, outputs y_1, ..., y_K above, and a weight w_{ki} on each connection from input x_i to output y_k]
Input vector x = (x_0, x_1, ..., x_d)

Output vector y = (y_1, ..., y_K)

Weight matrix W: w_{ki} is the weight from input x_i to output y_k
    y = Wx \qquad y_k = \sum_{i=0}^{d} w_{ki} x_i
Training set
We want to train the weight matrix W such that (for our training data) we minimize the error in the output vectors y given the input vectors x

Training set: a set of N input/output pairs {(x^n, t^n) : 1 ≤ n ≤ N}, where t^n = (t^n_1, ..., t^n_K) is the target output vector for input vector x^n

For a classification problem, if the correct class is c, then:

    t^n_c = 1 \qquad t^n_k = 0 \quad \forall k \neq c

(1-from-N coding)

We can write the network output vector as y^n(x^n; W) (explicitly showing the dependence on the weight matrix and the input vector)
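A small sketch of this target coding (my own illustration; the helper name is hypothetical, not from the lecture):

    import numpy as np

    def one_hot_targets(labels, K):
        # t^n_c = 1 for the correct class c of example n, 0 for every other class
        N = len(labels)
        T = np.zeros((N, K))
        T[np.arange(N), labels] = 1.0
        return T

    labels = np.array([0, 2, 1, 2])        # correct class index for each of N = 4 examples
    print(one_hot_targets(labels, K=3))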
Sum-of-squares error function
Training problem: set the weight matrix W such that y^n(x^n; W) is as close as possible to t^n for all n

To address this problem we use the notion of an error function between the target and the actual network output

The sum-of-squares error function computes the (squared) Euclidean distance between t^n and y^n, summed over the whole training set 1 ≤ n ≤ N:

    E(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} (y^n_k - t^n_k)^2
         = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{i=0}^{d} w_{ki} x^n_i - t^n_k \right)^2

This error function E(W) is a smooth function of the weights

Training involves setting the weight matrix W to minimize E(W)
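As an illustration (my own sketch; the variable layout is an assumption, not from the lecture), the error can be computed in a few lines once the bias-augmented inputs are stacked into an N × (d+1) matrix and the targets into an N × K matrix:

    import numpy as np

    def sum_of_squares_error(W, X_aug, T):
        # E(W) = 0.5 * sum_n sum_k (y^n_k - t^n_k)^2, with y^n = W x^n
        Y = X_aug @ W.T              # row n holds the output vector y^n
        return 0.5 * np.sum((Y - T) ** 2)

    rng = np.random.default_rng(1)
    N, d, K = 4, 3, 2
    X_aug = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    T = np.eye(K)[rng.integers(0, K, size=N)]     # one-hot targets
    W = rng.normal(size=(K, d + 1))
    print(sum_of_squares_error(W, X_aug, T))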
Minimizing the error function
    E(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{i=0}^{d} w_{ki} x^n_i - t^n_k \right)^2
We find the minimum by looking for where the derivatives of E with respect to the weights are 0

Since E is a quadratic function of the weights, the derivatives are linear functions of the weights

Solving for the weight values:

Exact approach: the pseudoinverse solution (using the pseudoinverse of the bias-augmented input matrix)

Iterative approaches: IRLS (Newton-Raphson), gradient descent

We will only consider gradient descent
Gradient descent (1)
Gradient descent can be used whenever it is possible to compute the derivatives of the error function E with respect to the parameters to be optimized, W

Basic idea: adjust the weights to move downhill in weight space

Weight space: a K · (d + 1)-dimensional space; a weight matrix W is a point in weight space

The gradient of the error in weight space:

    \nabla_W E = \left( \frac{\partial E}{\partial w_{10}}, \ldots, \frac{\partial E}{\partial w_{ki}}, \ldots, \frac{\partial E}{\partial w_{Kd}} \right)^T
Gradient descent (2)
Operation of gradient descent:
1. Start with a guess for the weight matrix W (small random numbers)
2. Update the weights by adjusting the weight matrix in the direction of -\nabla_W E
3. Recompute the error, and iterate
The update for weight wki at iteration τ + 1 is:
    w^{\tau+1}_{ki} = w^{\tau}_{ki} - \eta \frac{\partial E}{\partial w_{ki}}
The parameter η is the learning rate
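To make the update rule concrete, here is a minimal sketch (my own, not from the lecture) of gradient descent on the one-parameter error function E(w) = (w - 3)^2:

    # E(w) = (w - 3)^2 has derivative dE/dw = 2 (w - 3) and its minimum at w = 3
    w = 0.0        # initial guess
    eta = 0.1      # learning rate
    for tau in range(50):
        grad = 2.0 * (w - 3.0)
        w = w - eta * grad      # w^{tau+1} = w^{tau} - eta * dE/dw
    print(w)                    # close to 3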
Gradients for a single-layer neural network
    E(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{i=0}^{d} w_{ki} x^n_i - t^n_k \right)^2

To minimize E with respect to W we differentiate E with respect to each weight w_{ki}:

    \frac{\partial E}{\partial w_{ki}} = \sum_{n=1}^{N} \left( \sum_{j=0}^{d} w_{kj} x^n_j - t^n_k \right) x^n_i
                                       = \sum_{n=1}^{N} (y^n_k - t^n_k) x^n_i
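This analytic derivative can be checked numerically with a finite difference (my own sketch, not from the lecture):

    import numpy as np

    def error(W, X_aug, T):
        return 0.5 * np.sum((X_aug @ W.T - T) ** 2)

    def analytic_grad(W, X_aug, T):
        # dE/dw_{ki} = sum_n (y^n_k - t^n_k) x^n_i, written as a matrix product
        return (X_aug @ W.T - T).T @ X_aug

    rng = np.random.default_rng(0)
    N, d, K = 6, 3, 2
    X_aug = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    T = np.eye(K)[rng.integers(0, K, size=N)]
    W = rng.normal(size=(K, d + 1))

    eps = 1e-6
    W_plus = W.copy()
    W_plus[1, 0] += eps                     # perturb a single weight, w_{k=1, i=0}
    numeric = (error(W_plus, X_aug, T) - error(W, X_aug, T)) / eps
    print(np.isclose(numeric, analytic_grad(W, X_aug, T)[1, 0], rtol=1e-4))   # True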
Gradient descent for a single layer neural network (2)
If we define \delta^n_k as the difference between the network output and the target, we can write:

    \delta^n_k = y^n_k - t^n_k

    \frac{\partial E}{\partial w_{ki}} = \sum_{n=1}^{N} \delta^n_k x^n_i

The derivative for the weight connecting input i to output k is calculated using the product of the error at the output and the input value, summed over the whole training set

Combining the expression for the derivatives with the expression for the gradient descent update we have:

    w^{\tau+1}_{ki} = w^{\tau}_{ki} - \eta \frac{\partial E}{\partial w_{ki}} = w^{\tau}_{ki} - \eta \sum_{n=1}^{N} \delta^n_k x^n_i
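Putting the pieces together, here is a minimal batch gradient-descent training loop for the single-layer network (my own sketch; names and default values are illustrative, not from the lecture):

    import numpy as np

    def train_single_layer(X, T, eta=0.01, n_iters=200, seed=0):
        # X: N x d inputs, T: N x K one-hot targets
        N, d = X.shape
        K = T.shape[1]
        X_aug = np.hstack([np.ones((N, 1)), X])   # prepend x_0 = 1 for the bias
        rng = np.random.default_rng(seed)
        W = 0.01 * rng.normal(size=(K, d + 1))    # small random initial weights

        for tau in range(n_iters):
            Y = X_aug @ W.T                       # forward pass: y^n = W x^n
            Delta = Y - T                         # delta^n_k = y^n_k - t^n_k
            grad = Delta.T @ X_aug                # dE/dw_{ki} = sum_n delta^n_k x^n_i
            W = W - eta * grad                    # gradient descent update
        return W

With a small enough learning rate \eta this loop decreases the quadratic error E(W) at every iteration; too large a value makes the updates diverge.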
Schematic of gradient descent training
[Figure: the single-layer network, with the bias input x_0 and inputs x_1, ..., x_d feeding output y_k through weights w_{k0}, w_{k1}, ..., w_{kd}, annotated with the quantities computed during training]

For each training example the forward pass computes the output

    y_k = \sum_{j=0}^{d} w_{kj} x_j

the error at output k is

    \delta_k = y_k - t_k

and the contribution to the weight update is accumulated as

    \Delta w^{\tau}_{ki} = \Delta w^{\tau}_{ki} + \delta_k \cdot x_i
Interpreting the bias parameter
Derivative with respect to the bias (at the minimum):
    \frac{\partial E}{\partial w_{k0}} = \sum_{n=1}^{N} \left( \sum_{j=1}^{d} w_{kj} x^n_j + w_{k0} - t^n_k \right) = 0
If we write:
    \bar{x}_i = \frac{1}{N} \sum_{n=1}^{N} x^n_i \qquad \bar{t}_k = \frac{1}{N} \sum_{n=1}^{N} t^n_k
Then we may write the solution for the bias as
    w_{k0} = \bar{t}_k - \sum_{i=1}^{d} w_{ki} \bar{x}_i
The bias may be interpreted as compensating for the difference between the training-set mean of the targets and the training-set mean of the network outputs (computed without the bias term)
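A quick numerical check of this relationship (my own sketch, not from the lecture): at the exact least-squares solution, the fitted bias equals \bar{t}_k minus the weighted input means:

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, K = 50, 3, 2
    X = rng.normal(size=(N, d))
    T = np.eye(K)[rng.integers(0, K, size=N)]       # one-hot targets

    X_aug = np.hstack([np.ones((N, 1)), X])         # column 0 is the bias input x_0 = 1
    W, *_ = np.linalg.lstsq(X_aug, T, rcond=None)   # exact minimizer of the sum-of-squares error
    W = W.T                                         # K x (d+1); column 0 holds the biases w_{k0}

    x_bar = X.mean(axis=0)                          # training-set mean of the inputs
    t_bar = T.mean(axis=0)                          # training-set mean of the targets
    print(np.allclose(W[:, 0], t_bar - W[:, 1:] @ x_bar))   # True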
Summary
Training single-layer neural networks
Sum-of-squares error function
Gradient descent
Good coverage in Bishop, Neural Networks for Pattern Recognition (sections 3.1.3 and 3.4)