
Artificial Neural Networks
Shreekanth Mandayam
Robi Polikar


Function Approximation

Constructing complicated functions from simple building blocks:

• Lego systems
• Fourier / wavelet transforms
• VLSI systems
• RBF networks


Function Approximation

[Figure: a scatter of sample data points (*); which function (?) generated them?]


Function Approximation vs. Classification

Classification can be thought of as a special case of function approximation. For a three-class problem, the classifier implements y = f(x), mapping the d-dimensional input x = [x1 … xd] to a 1- or c-dimensional output y:

• c-dimensional output: Class 1: [1 0 0], Class 2: [0 1 0], Class 3: [0 0 1]

• Scalar output: 1: Class 1, 2: Class 2, 3: Class 3
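To make the two target encodings above concrete, a minimal NumPy sketch (the example labels are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical labels for a three-class problem, coded 1..3
labels = np.array([1, 3, 2, 1])

# Scalar target encoding: the class label itself (1: Class 1, 2: Class 2, 3: Class 3)
scalar_targets = labels

# c-dimensional target encoding: Class 1 -> [1 0 0], Class 2 -> [0 1 0], Class 3 -> [0 0 1]
one_hot_targets = np.eye(3)[labels - 1]
print(one_hot_targets)
```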


Radial Basis Function Neural Networks

RBF networks, just like MLP networks, can therefore be used for classification and/or function approximation problems.

RBF networks have an architecture similar to that of MLPs, but they achieve this goal using a different strategy:

Input layer → nonlinear transformation layer (generates local receptive fields) → linear output layer


Nonlinear Receptive Fields

The hallmark of RBF networks is their use of nonlinear receptive fields. RBFs are universal approximators!

The receptive fields nonlinearly transform (map) the input feature space, where the input patterns are not linearly separable, to the hidden-unit space, where the mapped inputs may be linearly separable.

The hidden-unit space often needs to be of a higher dimensionality. Cover’s Theorem (1965) on the separability of patterns:

A complex pattern classification problem that is nonlinearly separable in a low-dimensional space is more likely to be linearly separable in a high-dimensional space.


The (you guessed it right) XOR Problem

Consider the nonlinear functions that map the input vector x = [x1 x2]^T to the φ1–φ2 space:

$\varphi_1(\mathbf{x}) = e^{-\|\mathbf{x}-\mathbf{t}_1\|^2}, \qquad \mathbf{t}_1 = [1 \;\; 1]^T$

$\varphi_2(\mathbf{x}) = e^{-\|\mathbf{x}-\mathbf{t}_2\|^2}, \qquad \mathbf{t}_2 = [0 \;\; 0]^T$

Input x   φ1(x)    φ2(x)
(1,1)     1        0.1353
(0,1)     0.3678   0.3678
(1,0)     0.3678   0.3678
(0,0)     0.1353   1

[Figure: the four XOR patterns plotted in the φ1–φ2 plane. (0,0) and (1,1) lie on one side and (0,1), (1,0) on the other, so a straight line separates the two classes.]

The nonlinear functions transformed a nonlinearly separable problem into a linearly separable one!
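A quick numerical check of the table above, as a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

# Gaussian mapping functions centered at t1 = [1, 1] and t2 = [0, 0]
t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
phi1 = lambda x: np.exp(-np.sum((x - t1) ** 2))
phi2 = lambda x: np.exp(-np.sum((x - t2) ** 2))

# Evaluate both mappings on the four XOR input patterns
for x in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    x = np.array(x, dtype=float)
    print(x, round(phi1(x), 4), round(phi2(x), 4))
# (1,1) -> 1.0, 0.1353; (0,1) and (1,0) -> 0.3679, 0.3679; (0,0) -> 0.1353, 1.0
```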


Initial Assessment

Using nonlinear functions, we can convert a nonlinearly separable problem into a linearly separable one.

From a function approximation perspective, this is equivalent to implementing a complex function (corresponding to the nonlinearly separable decision boundary) using simple functions (corresponding to the linearly separable decision boundary).

Implementing this procedure using a network architecture yields the RBF networks, if the nonlinear mapping functions are radial basis functions.

Radial Basis Functions:
• Radial: symmetric around its center
• Basis functions: a set of functions whose linear combination can generate an arbitrary function in a given function space


RBF Networks

[Figure: RBF network architecture with d input nodes x1 … xd, H hidden-layer RBFs (receptive fields) with first-layer weights u_ji, and c output nodes z1 … zc with output-layer weights w_kj.]

Each hidden (receptive field) unit computes, with u_J its center and σ the spread constant:

$y_J = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{u}_J\|^2}{2\sigma^2}\right)$

Each output unit uses a linear activation function:

$z_k = f(net_k) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right) = \sum_{j=1}^{H} w_{kj}\, y_j$
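Putting the two layers together, a minimal NumPy sketch of this forward pass (matrix names, shapes, and the example parameters are illustrative assumptions):

```python
import numpy as np

def rbf_forward(x, U, W, sigma):
    """Forward pass of an RBF network.
    x: (d,) input vector
    U: (H, d) RBF centers (first-layer weights u_ji)
    W: (c, H) output-layer weights w_kj
    sigma: spread constant
    """
    # Hidden layer: Gaussian receptive fields y_j = exp(-||x - u_j||^2 / (2 sigma^2))
    dists_sq = np.sum((U - x) ** 2, axis=1)
    y = np.exp(-dists_sq / (2.0 * sigma ** 2))
    # Output layer: linear activation z_k = sum_j w_kj * y_j
    return W @ y

# Tiny usage example with random (illustrative) parameters
rng = np.random.default_rng(0)
d, H, c = 3, 5, 2
z = rbf_forward(rng.normal(size=d), rng.normal(size=(H, d)),
                rng.normal(size=(c, H)), sigma=1.0)
print(z.shape)  # (2,)
```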


Principle of Operation

Hidden layer, with σ the spread constant and ‖·‖ the Euclidean norm:

$y_J = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{u}_J\|^2}{2\sigma^2}\right)$

Output layer:

$z_k = f(net_k) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right) = \sum_{j=1}^{H} w_{kj}\, y_j$

Unknowns: u_ji, w_kj, σ


Principle of Operation

What do these parameters represent? Physical meanings:

• φ: The radial basis function for the hidden layer. This is a simple nonlinear mapping function (typically Gaussian) that transforms the d-dimensional input patterns to a (typically higher) H-dimensional space. The complex decision boundary will be constructed from linear combinations (weighted sums) of these simple building blocks.

• u_ji: The weights joining the input layer to the hidden layer. These weights constitute the center points of the radial basis functions.

• σ: The spread constant(s). These values determine the spread (extent) of each radial basis function.

• w_kj: The weights joining the hidden and output layers. These are the weights used in obtaining the linear combination of the radial basis functions. They determine the relative amplitudes of the RBFs when they are combined to form the complex function.


RBFN Principle of Operation

[Figure: a complex function built as a weighted sum of Gaussian bumps. φ_J: the Jth RBF; u_J: center of the Jth RBF; σ_J: its spread; w_J: relative weight of the Jth RBF.]


How to Train?

There are various approaches for training RBF networks.

• Approach 1: Exact RBF. Guarantees correct classification of all training data instances. Requires N hidden-layer nodes, one for each training instance. No iterative training is involved. RBF centers (u) are fixed as the training data points, the spread as the variance of the data, and w are obtained by solving a set of linear equations.

• Approach 2: Fixed centers selected at random. Uses H < N hidden-layer nodes. No iterative training is involved. The spread is based on Euclidean metrics, and w are obtained by solving a set of linear equations.

• Approach 3: Centers are obtained from unsupervised learning (clustering). Spreads are obtained as variances of the clusters, and w are obtained through the LMS algorithm. Clustering (K-means) and LMS are iterative. This is the most commonly used procedure and typically provides good results.

• Approach 4: All unknowns are obtained from supervised learning.


Approach 1

Exact RBF: The first-layer weights u are set to the training data, U = X^T. That is, the Gaussians are centered at the training data instances.

The spread is chosen as

$\sigma = \frac{d_{max}}{\sqrt{2N}}$

where d_max is the maximum Euclidean distance between any two centers (not to be confused with the input dimensionality d), and N is the number of training data points. Note that H = N for this case.

The output of the kth RBF output neuron is then

$z_k = \sum_{j=1}^{N} w_{kj}\,\varphi\!\left(\|\mathbf{x}-\mathbf{x}_j\|\right)$  (multiple outputs)

$z = \sum_{j=1}^{N} w_{j}\,\varphi\!\left(\|\mathbf{x}-\mathbf{x}_j\|\right)$  (single output, weights w_j)

During training, we want the outputs to be equal to our desired targets. Without loss of any generality, assume that we are approximating a single-dimensional function, and let the unknown true function be f(x). The desired output for each input is then d_i = f(x_i), i = 1, 2, …, N.


Approach 1 (Cont.)

We then have a set of linear equations, which can be represented in matrix form:

$\sum_{j=1}^{N} w_j\,\varphi\!\left(\|\mathbf{x}_i-\mathbf{x}_j\|\right) = d_i, \qquad i = 1, 2, \ldots, N$

$\begin{bmatrix} \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\ \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_{N1} & \varphi_{N2} & \cdots & \varphi_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}, \qquad \varphi_{ij} = \varphi\!\left(\|\mathbf{x}_i-\mathbf{x}_j\|\right), \;\; i,j = 1,2,\ldots,N$

Define:

$\boldsymbol{\Phi} = \{\varphi_{ij}\;|\;i,j = 1,2,\ldots,N\}, \qquad \mathbf{w} = [w_1, w_2, \ldots, w_N]^T, \qquad \mathbf{d} = [d_1, d_2, \ldots, d_N]^T$

Then

$\boldsymbol{\Phi}\,\mathbf{w} = \mathbf{d} \;\;\Rightarrow\;\; \mathbf{w} = \boldsymbol{\Phi}^{-1}\mathbf{d}$

Is this matrix always invertible?
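The Exact RBF procedure thus reduces to building Φ from the training data and solving the linear system above. A minimal NumPy sketch under those assumptions (function names, the Gaussian kernel choice, and the test values are illustrative):

```python
import numpy as np

def exact_rbf_fit(X, d_targets):
    """Exact RBF (Approach 1): centers = training points, H = N.
    X: (N, dim) training inputs, d_targets: (N,) desired outputs."""
    N = X.shape[0]
    # Spread: sigma = d_max / sqrt(2N), d_max = max pairwise distance between centers
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sigma = dists.max() / np.sqrt(2.0 * N)
    # Interpolation matrix: Phi_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    Phi = np.exp(-dists ** 2 / (2.0 * sigma ** 2))
    # Output weights: w = Phi^{-1} d (solve the system rather than invert explicitly)
    w = np.linalg.solve(Phi, d_targets)
    return w, sigma

def exact_rbf_predict(x_new, X, w, sigma):
    dists = np.linalg.norm(X - x_new, axis=1)
    return np.exp(-dists ** 2 / (2.0 * sigma ** 2)) @ w

# One-dimensional usage example: interpolate a few samples of sin(x)
X = np.linspace(0, np.pi, 8).reshape(-1, 1)
w, sigma = exact_rbf_fit(X, np.sin(X).ravel())
print(exact_rbf_predict(np.array([1.0]), X, w, sigma))  # close to sin(1) ≈ 0.84
```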


Approach 1 (Cont.)

Micchelli’s Theorem (1986)

If $\{\mathbf{x}_i\}_{i=1}^{N}$ are a distinct set of points in d-dimensional space, then the N-by-N interpolation matrix Φ, with elements $\varphi_{ij} = \varphi\!\left(\|\mathbf{x}_i-\mathbf{x}_j\|\right)$ obtained from radial basis functions, is nonsingular, and hence can be inverted!

Note that the theorem is valid regardless of the value of N, the choice of the RBF (as long as it is an RBF), or what the data points may be, as long as they are distinct!

A large number of RBFs can be used (with $r = \|\mathbf{x}-\mathbf{x}_j\|$):

• Multiquadrics: $\varphi(r) = (r^2 + c^2)^{1/2}$, for some c > 0

• Inverse multiquadrics: $\varphi(r) = (r^2 + c^2)^{-1/2}$, for some c > 0

• Gaussian functions: $\varphi(r) = e^{-r^2/(2\sigma^2)}$, for some σ > 0
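For reference, the three kernels listed above as one-line NumPy functions (a small sketch; the parameter names are illustrative):

```python
import numpy as np

# r is the Euclidean distance ||x - x_j||; c and sigma are positive shape parameters
multiquadric         = lambda r, c: np.sqrt(r ** 2 + c ** 2)
inverse_multiquadric = lambda r, c: 1.0 / np.sqrt(r ** 2 + c ** 2)
gaussian             = lambda r, sigma: np.exp(-r ** 2 / (2.0 * sigma ** 2))

r = np.linspace(0.0, 3.0, 4)
print(gaussian(r, 1.0))      # localized: decays toward 0 as r grows
print(multiquadric(r, 1.0))  # unbounded: grows with r
```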


Approach 1 (Cont.)

The Gaussian is the most commonly used RBF (why…?). Note that

$\varphi(r) \to 0 \;\text{ as }\; r \to \infty$

Gaussian RBFs are localized functions, unlike the sigmoids used by MLPs!

[Figure: function approximation using Gaussian (localized) radial basis functions vs. sigmoidal basis functions.]


Exact RBF Properties

Using localized functions typically makes RBF networks more suitable for function approximation problems. Why?

Since the first-layer weights are set to the input patterns, the second-layer weights are obtained by solving a set of linear equations, and the spread is computed from the data, no iterative training is involved!

Guaranteed to correctly classify all training data points! However, since we are using as many receptive fields as the number of data points, the solution is overdetermined if the underlying physical process does not have as many degrees of freedom → Overfitting!

The importance of σ: too small a σ will also cause overfitting; too large a σ will fail to characterize rapid changes in the signal.


Too many Receptive Fields?

In order to reduce the artificial complexity of the RBF network, we need to use a smaller number of receptive fields.

How about using a subset of the training data, say M < N of them? These M data points will then constitute the M receptive field centers. How do we choose these M points…?

• At random → Approach 2

• Unsupervised training (K-means) → Approach 3: the centers are selected through self-organization, clustering where the data are more densely populated. Determining M is usually heuristic.

For Approach 2, the hidden-layer outputs and spread are

$y_{ji} = \exp\!\left(-\frac{M}{d_{max}^2}\,\|\mathbf{x}_i - \mathbf{x}_j\|^2\right), \quad i = 1,2,\ldots,N, \;\; j = 1,2,\ldots,M, \qquad \sigma = \frac{d_{max}}{\sqrt{2M}}$

Output-layer weights are determined as they were in Approach 1, through solving a set of M linear equations!
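A minimal NumPy sketch of Approach 2 under the formulas above, with the output weights obtained as a least-squares solution of the resulting linear system (names and defaults are illustrative):

```python
import numpy as np

def fixed_centers_rbf_fit(X, d_targets, M, seed=0):
    """Approach 2: M centers picked at random from the N training points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], size=M, replace=False)]
    d_max = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2).max()
    # y_ji = exp(-(M / d_max^2) * ||x_i - x_j||^2), an N x M design matrix
    Y = np.exp(-(M / d_max ** 2) *
               np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2)
    # N equations in M unknowns: solve for the output weights in the least-squares sense
    w, *_ = np.linalg.lstsq(Y, d_targets, rcond=None)
    return centers, d_max, w
```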


K-Means Unsupervised Clustering Algorithm (Approach 3)

• Choose the number of clusters, M.

• Initialize the M cluster centers to the first M training data points: t_k = x_k, k = 1, 2, …, M.

• Repeat
  – At iteration n, group all patterns with the cluster whose center is closest:
    $C(\mathbf{x}) = \arg\min_k \|\mathbf{x}(n)-\mathbf{t}_k(n)\|, \quad k = 1,2,\ldots,M$
    (t_k(n): center of the kth RBF at the nth iteration)
  – Compute the centers of all clusters after the regrouping:
    $\mathbf{t}_k = \frac{1}{M_k}\sum_{\mathbf{x}_j \in C_k} \mathbf{x}_j$
    (new cluster center for the kth RBF, where M_k is the number of instances in the kth cluster and the sum runs over the instances grouped in that cluster)

• Until there is no change in cluster centers from one iteration to the next.

An alternate k-means algorithm is given in Haykin (p. 301).

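A minimal NumPy sketch of the clustering loop above (illustrative names; the iteration cap is an added safeguard):

```python
import numpy as np

def kmeans(X, M, max_iter=100):
    """K-means clustering: returns the M cluster centers t_k and the final grouping."""
    centers = X[:M].copy()                       # initialize to the first M training points
    for _ in range(max_iter):
        # Group every pattern with the closest center: C(x) = argmin_k ||x - t_k||
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # Recompute each center as the mean of the instances grouped in that cluster
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                                for k in range(M)])
        if np.allclose(new_centers, centers):    # stop when the centers no longer change
            return new_centers, labels
        centers = new_centers
    return centers, labels
```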


Determining the Output Weights: LMS Algorithm (Approach 3)

The LMS algorithm is used to minimize the cost function

$E(\mathbf{w}) = \frac{1}{2}\, e^2(n)$

where e(n) is the error at iteration n:

$e(n) = d(n) - \mathbf{y}^T(n)\,\mathbf{w}(n)$

Using the steepest (gradient) descent method:

$\frac{\partial E}{\partial \mathbf{w}} = e(n)\,\frac{\partial e(n)}{\partial \mathbf{w}} = -e(n)\,\mathbf{y}(n)$

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\, e(n)\,\mathbf{y}(n)$

Instance-based LMS algorithm pseudocode (for a single output):

• Initialize the weights w_j to some small random values, j = 1, 2, …, M.

• Repeat
  – Choose the next training pair (x, d).
  – Compute the network output at iteration n: $z(n) = \sum_{j=1}^{M} w_j\,\varphi_j(\mathbf{x}) = \mathbf{w}^T\mathbf{y}$
  – Compute the error: $e(n) = d(n) - z(n)$
  – Update the weights: $\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\, e(n)\,\mathbf{y}(n)$

• Until the weights converge to a steady set of values.

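A minimal NumPy sketch of the instance-based LMS loop above (the learning rate, epoch cap, and stopping tolerance are illustrative assumptions):

```python
import numpy as np

def lms_train(Y, d_targets, eta=0.05, epochs=200, tol=1e-6, seed=0):
    """LMS training of the output weights.
    Y: (N, M) hidden-layer outputs, one row y(n) per training pattern; d_targets: (N,)."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.standard_normal(Y.shape[1])   # small random initial weights
    for _ in range(epochs):
        w_old = w.copy()
        for y_n, d_n in zip(Y, d_targets):       # one training pair (x, d) at a time
            e_n = d_n - w @ y_n                  # error: e(n) = d(n) - w^T y(n)
            w = w + eta * e_n * y_n              # update: w(n+1) = w(n) + eta * e(n) * y(n)
        if np.linalg.norm(w - w_old) < tol:      # weights converged to a steady set of values
            break
    return w
```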


Supervised RBF Training

Approach 4: This is the most general form. All parameters, the receptive field centers (first-layer weights), the output-layer weights, and the spread constants, are learned through iterative supervised training using the LMS / gradient descent algorithm, minimizing the cost function

$E = \frac{1}{2}\sum_{j=1}^{N} e_j^2, \qquad e_j = d_j - \sum_{i=1}^{M} w_i\, G\!\left(\|\mathbf{x}_j-\mathbf{t}_i\|_{C_i}\right)$

Gradient descent update rules are derived for the weights w_i, the centers t_i, and the spreads; in those rules, G’ represents the first derivative of the function G with respect to its argument.
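As one possible concrete form, here is a hedged NumPy sketch of a single gradient-descent step over all parameters, assuming Gaussian RBFs with a scalar spread per hidden unit (names and the learning rate are illustrative):

```python
import numpy as np

def supervised_rbf_step(X, d, T, sigma, w, lr=0.01):
    """One gradient-descent step on all RBF parameters (Approach 4 sketch).
    X: (N, dim) inputs, d: (N,) targets, T: (M, dim) centers,
    sigma: (M,) spreads, w: (M,) output weights."""
    diff = X[:, None, :] - T[None, :, :]                   # x_j - t_i, shape (N, M, dim)
    sq = np.sum(diff ** 2, axis=2)                         # ||x_j - t_i||^2, shape (N, M)
    Phi = np.exp(-sq / (2.0 * sigma ** 2))                 # Gaussian RBF outputs, shape (N, M)
    e = d - Phi @ w                                        # errors e_j, shape (N,)
    # Gradients of E = 0.5 * sum_j e_j^2 with respect to each parameter group
    grad_w = -Phi.T @ e
    grad_T = -(e[:, None, None] * w[None, :, None] * Phi[:, :, None]
               * diff / sigma[None, :, None] ** 2).sum(axis=0)
    grad_sigma = -(e[:, None] * w[None, :] * Phi * sq / sigma ** 3).sum(axis=0)
    # Simultaneous update of weights, centers, and spreads
    return w - lr * grad_w, T - lr * grad_T, sigma - lr * grad_sigma
```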


RBF MATLAB Demo


RBF Lab (Due: Friday, March 15)

1. Implement the Exact RBF approach in MATLAB (writing your own code) on a simple one-dimensional function approximation problem, as well as a classification problem. Generate your own function approximation example, and use the IRIS database for classification (from the UCI ML repository). Compare your results to those of MATLAB's built-in function.

2. Implement Approach 2, using the code you generated for Q1.

3. Implement Approach 3. Write your own K-means and LMS codes. Compare your results to those of MATLAB's newrb() function, both for function approximation and classification problems.

4. Apply your algorithms to the Dominant VOC gas sensing database (available in \\galaxy\public1\polikar\PR_Clinic\Databases for PR Class).