Lecture 3: Introduction to Radial Basis Function Networks


  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    1/139

Introduction to Radial Basis Function Networks

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    2/139

Content Overview

- The Models of Function Approximator
- The Radial Basis Function Networks
- RBFNs for Function Approximation
- The Projection Matrix
- Learning the Kernels
- Bias-Variance Dilemma
- The Effective Number of Parameters
- Model Selection
- Incremental Operations

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    3/139

Introduction to Radial Basis Function Networks

Overview

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    4/139

Typical Applications of NN

Pattern Classification: $l = f(x)$, where $x \in X \subseteq R^n$ and the label $l \in C = \{1, 2, \dots, N\}$.

Function Approximation: $y = f(x)$, where $x \in X \subseteq R^n$ and $y \in Y \subseteq R^m$.

Time-Series Forecasting: $x_t = f(x_{t-1}, x_{t-2}, x_{t-3}, \dots)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    5/139

Function Approximation

Unknown function: $f : X \to Y$, with $X \subseteq R^n$ and $Y \subseteq R^m$.

Approximator: $\hat{f} : X \to Y$, intended to approximate the unknown $f$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    6/139

Supervised Learning

[Block diagram: training pairs $(x_i, y_i)$, $i = 1, 2, \dots$, are produced by the unknown function; the neural network outputs $\hat{y}_i$, and the error $e_i = y_i - \hat{y}_i$ drives learning.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    7/139

Neural Networks as Universal Approximators

Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy.

  Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.

Like feedforward neural networks with a single hidden layer of sigmoidal units, RBF networks can be shown to be universal approximators.

  Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257.

  Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    8/139

Statistics vs. Neural Networks

Statistics               Neural Networks
model                    network
estimation               learning
regression               supervised learning
interpolation            generalization
observations             training set
parameters               (synaptic) weights
independent variables    inputs
dependent variables      outputs
ridge regression         weight decay

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    9/139

Introduction to Radial Basis Function Networks

The Model of Function Approximator

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    10/139

Linear Models

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

where the $w_i$ are weights and the $\phi_i(\cdot)$ are fixed basis functions.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    11/139

Linear Models

[Network diagram: inputs $x = (x_1, x_2, \dots, x_n)$ (feature vectors) feed $m$ hidden units $\phi_1, \phi_2, \dots, \phi_m$ (feature extraction / transformation); the output $y$ is the linearly weighted sum with weights $w_1, w_2, \dots, w_m$.]

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    12/139


  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    13/139

Example Linear Models

Polynomial: $f(x) = \sum_i w_i x^i$, with basis $\phi_i(x) = x^i$, $i = 0, 1, 2, \dots$

Fourier Series: $f(x) = \sum_k w_k \exp(j 2\pi k x)$, with basis $\phi_k(x) = \exp(j 2\pi k x)$, $k = 0, 1, 2, \dots$

Both are instances of $f(x) = \sum_{i} w_i \phi_i(x)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    14/139

Single-Layer Perceptrons as Universal Approximators

[Same network diagram, with sigmoidal hidden units.]

With a sufficient number of sigmoidal units, it can be a universal approximator.

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    15/139

Radial Basis Function Networks as Universal Approximators

[Same network diagram, with radial-basis-function hidden units.]

With a sufficient number of radial-basis-function units, it can also be a universal approximator.

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    16/139

Non-Linear Models

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

Here both the weights $w_i$ and the basis functions $\phi_i(\cdot)$ are adjusted by the learning process.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    17/139

Introduction to Radial Basis Function Networks

The Radial Basis Function Networks

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    18/139

Radial Basis Functions

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

Three parameters for a radial function $\phi_i(x) = \phi(\|x - x_i\|)$:

- Center $x_i$
- Distance measure $r = \|x - x_i\|$
- Shape $\phi$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    19/139

Typical Radial Functions

Gaussian: $\phi(r) = e^{-r^2 / 2\sigma^2}$, $\sigma > 0$ and $r \in \mathbb{R}$

Hardy Multiquadratic: $\phi(r) = \sqrt{r^2 + c^2}$, $c > 0$ and $r \in \mathbb{R}$

Inverse Multiquadratic: $\phi(r) = c / \sqrt{r^2 + c^2}$, $c > 0$ and $r \in \mathbb{R}$
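To make these radial functions concrete, here is a minimal NumPy sketch (an illustration, not part of the original slides); the function names gaussian, multiquadric, inverse_multiquadric, and rbf are our own.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2)), sigma > 0
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadric(r, c=1.0):
    # Hardy multiquadratic: phi(r) = sqrt(r^2 + c^2), c > 0
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = c / sqrt(r^2 + c^2), c > 0
    return c / np.sqrt(r**2 + c**2)

def rbf(x, center, phi=gaussian, **kwargs):
    # A radial basis function centered at x_i: phi_i(x) = phi(||x - x_i||)
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(center))
    return phi(r, **kwargs)
```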

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    20/139

Gaussian Basis Function (σ = 0.5, 1.0, 1.5)

$\phi(r) = e^{-r^2 / 2\sigma^2}$, $\sigma > 0$ and $r \in \mathbb{R}$

[Plot: Gaussian curves for σ = 0.5, 1.0, 1.5.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    21/139

Inverse Multiquadratic

$\phi(r) = c / \sqrt{r^2 + c^2}$, $c > 0$ and $r \in \mathbb{R}$

[Plot over $r \in [-10, 10]$: curves for c = 1, 2, 3, 4, 5; each peaks at 1 at r = 0 and broadens as c increases.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    22/139

Most General RBF

$\phi_i(x) = \phi\big((x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\big)$

[Figure: elliptical receptive fields of three such kernels.]

The basis $\{\phi_i : i = 1, 2, \dots\}$ is "near" orthogonal.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    23/139

Properties of RBFs

- On-center, off-surround.
- Analogies with localized receptive fields found in several biological structures, e.g., visual cortex, ganglion cells.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    24/139

The Topology of RBF

[Network diagram: inputs $x_1, \dots, x_n$ (feature vectors); hidden units perform projection; output units $y_1, \dots, y_m$ perform interpolation.]

As a function approximator: $y = f(x)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    25/139

The Topology of RBF

[Network diagram: inputs $x_1, \dots, x_n$; hidden units respond to subclasses; output units $y_1, \dots, y_m$ correspond to classes.]

As a pattern classifier.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    26/139

Introduction to Radial Basis Function Networks

RBFNs for Function Approximation

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    27/139

The Idea

[Plot: an unknown function of x to approximate, with training data points.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    28/139

The Idea

[Plot: the unknown function and training data, with basis functions (kernels) placed along the x-axis.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    29/139

The Idea

[Plot: the function learned from the kernels.]

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    30/139

The Idea

[Plot: the learned function, the kernels, and a non-training sample.]

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    31/139

The Idea

[Plot: the learned function evaluated at a non-training sample.]

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    32/139

Radial Basis Function Networks as Universal Approximators

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

[Network diagram: inputs $x = (x_1, \dots, x_n)$, kernels, weights $w_1, \dots, w_m$.]

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$.  Goal: $y^{(k)} \approx f(x^{(k)})$ for all $k$.

$\min \; SSE = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 = \sum_{k=1}^{p} \Big(y^{(k)} - \sum_{i=1}^{m} w_i \phi_i(x^{(k)})\Big)^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    33/139

Learn the Optimal Weight Vector

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$.  Goal: $y^{(k)} \approx f(x^{(k)})$ for all $k$, with $f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$.

$\min \; SSE = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 = \sum_{k=1}^{p} \Big(y^{(k)} - \sum_{i=1}^{m} w_i \phi_i(x^{(k)})\Big)^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    34/139

Regularization

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$.  Goal: $y^{(k)} \approx f(x^{(k)})$ for all $k$.

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2, \quad \lambda_i \ge 0$

If regularization is unneeded, set $\lambda_i = 0$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    35/139

Learn the Optimal Weight Vector

Minimize $C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$, with $f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$.

$\dfrac{\partial C}{\partial w_j} = 2 \sum_{k=1}^{p} \big(f(x^{(k)}) - y^{(k)}\big)\, \phi_j(x^{(k)}) + 2 \lambda_j w_j = 0$

$\Rightarrow \sum_{k=1}^{p} f^*(x^{(k)})\, \phi_j(x^{(k)}) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)}\, \phi_j(x^{(k)})$, where $f^*(x) = \sum_{i=1}^{m} w_i^* \phi_i(x)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    36/139

Learn the Optimal Weight Vector

Define $\phi_j = \big(\phi_j(x^{(1)}), \dots, \phi_j(x^{(p)})\big)^T$, $f^* = \big(f^*(x^{(1)}), \dots, f^*(x^{(p)})\big)^T$, $y = \big(y^{(1)}, \dots, y^{(p)}\big)^T$.

Then $\sum_{k} f^*(x^{(k)})\, \phi_j(x^{(k)}) + \lambda_j w_j^* = \sum_{k} y^{(k)}\, \phi_j(x^{(k)})$ becomes

$\phi_j^T f^* + \lambda_j w_j^* = \phi_j^T y, \quad j = 1, \dots, m$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    37/139

Learn the Optimal Weight Vector

Stacking $\phi_j^T f^* + \lambda_j w_j^* = \phi_j^T y$ for $j = 1, \dots, m$, define

$\Phi = [\phi_1, \phi_2, \dots, \phi_m], \quad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m), \quad w^* = (w_1^*, w_2^*, \dots, w_m^*)^T$

$\Rightarrow \Phi^T f^* + \Lambda w^* = \Phi^T y$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    38/139

Learn the Optimal Weight Vector

$f^* = \big(f^*(x^{(1)}), f^*(x^{(2)}), \dots, f^*(x^{(p)})\big)^T$, with $f^*(x^{(k)}) = \sum_{i=1}^{m} w_i^* \phi_i(x^{(k)})$.

In matrix form, $f^* = \Phi w^*$, where $\Phi$ is the $p \times m$ design matrix with entries $\Phi_{ki} = \phi_i(x^{(k)})$.

Substituting into $\Phi^T f^* + \Lambda w^* = \Phi^T y$:

$\Phi^T \Phi\, w^* + \Lambda w^* = \Phi^T y$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    39/139

Learn the Optimal Weight Vector

$\Phi^T \Phi\, w^* + \Lambda w^* = \Phi^T y \;\Rightarrow\; (\Phi^T \Phi + \Lambda)\, w^* = \Phi^T y$

$w^* = (\Phi^T \Phi + \Lambda)^{-1} \Phi^T y = A^{-1} \Phi^T y$

Variance matrix: $A^{-1} = (\Phi^T \Phi + \Lambda)^{-1}$.  Design matrix: $\Phi$, with $\Phi_{ki} = \phi_i(x^{(k)})$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    40/139

Summary

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$.  Model: $y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$.

$\phi_j = \big(\phi_j(x^{(1)}), \phi_j(x^{(2)}), \dots, \phi_j(x^{(p)})\big)^T, \quad y = \big(y^{(1)}, y^{(2)}, \dots, y^{(p)}\big)^T$

$\Phi = [\phi_1, \phi_2, \dots, \phi_m], \quad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m), \quad w^* = (w_1^*, w_2^*, \dots, w_m^*)^T$

$w^* = (\Phi^T \Phi + \Lambda)^{-1} \Phi^T y = A^{-1} \Phi^T y$
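As a concrete illustration of this summary, the sketch below builds the design matrix from Gaussian kernels and solves $w^* = (\Phi^T\Phi + \Lambda)^{-1}\Phi^T y$ with standard ridge ($\Lambda = \lambda I$); the centers, width, and toy data are placeholders, not values from the slides.

```python
import numpy as np

def design_matrix(X, centers, sigma):
    # Phi[k, i] = phi_i(x^(k)) = exp(-||x^(k) - mu_i||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def rbf_weights(Phi, y, lam=0.0):
    # w* = (Phi^T Phi + lam I)^(-1) Phi^T y
    m = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(m)
    return np.linalg.solve(A, Phi.T @ y)

# Toy usage with placeholder data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0.0, 0.1, size=50)
centers = X[:20].copy()                 # e.g., a subset of inputs as centers
Phi = design_matrix(X, centers, sigma=0.1)
w = rbf_weights(Phi, y, lam=1e-3)
y_hat = Phi @ w                         # fitted values f*(x^(k))
```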

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    41/139

Introduction to Radial Basis Function Networks

The Projection Matrix

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    42/139

The Empirical-Error Vector

$w^* = A^{-1} \Phi^T y$

[Diagram: the unknown function generates targets $y$ for the inputs $x^{(k)}$; the network with kernels $\phi_1, \dots, \phi_m$ and weights $w_1^*, \dots, w_m^*$ produces the fitted values.]

$f^* = \Phi w^*$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    43/139

The Empirical-Error Vector

$w^* = A^{-1} \Phi^T y, \quad f^* = \Phi w^*$

Error vector: $e = y - f^* = y - \Phi w^* = y - \Phi A^{-1} \Phi^T y = \big(I_p - \Phi A^{-1} \Phi^T\big)\, y = P y$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    44/139

Sum-Squared-Error

Error vector: $e = P y$, where $P = I_p - \Phi A^{-1} \Phi^T$, $A = \Phi^T \Phi + \Lambda$, $\Phi = [\phi_1, \phi_2, \dots, \phi_m]$.

$SSE = \sum_{k=1}^{p} \big(y^{(k)} - f^*(x^{(k)})\big)^2 = (Py)^T (Py) = y^T P^2 y$

If $\Lambda = 0$, the RBFN's learning algorithm is to minimize SSE (MSE).

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    45/139

The Projection Matrix

$e = P y, \quad P = I_p - \Phi A^{-1} \Phi^T, \quad A = \Phi^T \Phi + \Lambda, \quad SSE = y^T P^2 y$

When $\Lambda = 0$: $e \perp \mathrm{span}(\phi_1, \phi_2, \dots, \phi_m)$, and $P(Py) = Pe = e = Py$, so $P$ is a projection matrix and $SSE = y^T P y$.
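These identities are easy to check numerically. The following sketch (illustrative data only) verifies idempotence, orthogonality of the error vector, and the SSE identity for the unregularized case.

```python
import numpy as np

def projection_matrix(Phi, lam=0.0):
    # P = I_p - Phi A^{-1} Phi^T, with A = Phi^T Phi + lam * I_m
    p, m = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(m)
    return np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 5))
y = rng.normal(size=30)

P = projection_matrix(Phi, lam=0.0)
e = P @ y                                   # error vector e = P y
print(np.allclose(P @ P, P))                # True: with Lambda = 0, P is idempotent
print(np.allclose(Phi.T @ e, 0.0))          # True: e is orthogonal to span(phi_1, ..., phi_m)
print(np.isclose(e @ e, y @ P @ y))         # True: SSE = y^T P^2 y = y^T P y when Lambda = 0
```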

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    46/139

Introduction to Radial Basis Function Networks

Learning the Kernels

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    47/139

RBFNs as Universal Approximators

[Network diagram: inputs $x_1, \dots, x_n$; kernels $\phi_1, \dots, \phi_m$; outputs $y_1, \dots, y_l$ with weights $w_{ij}$.]

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$

Kernels: $\phi_j(x) = \exp\left(-\dfrac{\|x - \mu_j\|^2}{2\sigma_j^2}\right)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    48/139

What to Learn?

- Weights $w_{ij}$'s
- Centers $\mu_j$'s of the $\phi_j$'s
- Widths $\sigma_j$'s of the $\phi_j$'s
- Number of $\phi_j$'s  (Model Selection)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    49/139

One-Stage Learning

Gradient descent on the squared error, for the Gaussian kernels above:

$\Delta w_{ij} = \eta_1 \big(y_i^{(k)} - f_i(x^{(k)})\big)\, \phi_j(x^{(k)})$

$\Delta \mu_j = \eta_2\, \dfrac{x^{(k)} - \mu_j}{\sigma_j^2}\, \phi_j(x^{(k)}) \sum_{i=1}^{l} w_{ij} \big(y_i^{(k)} - f_i(x^{(k)})\big)$

$\Delta \sigma_j = \eta_3\, \dfrac{\|x^{(k)} - \mu_j\|^2}{\sigma_j^3}\, \phi_j(x^{(k)}) \sum_{i=1}^{l} w_{ij} \big(y_i^{(k)} - f_i(x^{(k)})\big)$

where $f_i(x^{(k)}) = \sum_j w_{ij}\, \phi_j(x^{(k)})$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    50/139

One-Stage Learning

$f_i(x^{(k)}) = \sum_j w_{ij}\, \phi_j(x^{(k)})$

The simultaneous update of all three sets of parameters (weights, centers, widths) may be suitable for non-stationary environments or on-line settings. A code sketch follows.
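Below is a minimal single-output sketch of one such on-line gradient step for Gaussian kernels; the learning rates and variable names are our own illustrative choices, not values from the slides.

```python
import numpy as np

def one_stage_step(x, y, w, mu, sigma, etas=(0.1, 0.01, 0.01)):
    """One on-line gradient step on (x, y) for a single-output Gaussian RBFN.
    w: (m,) weights, mu: (m, n) centers, sigma: (m,) widths."""
    eta1, eta2, eta3 = etas
    d = x[None, :] - mu                          # (m, n) differences x - mu_j
    r2 = (d ** 2).sum(axis=1)                    # ||x - mu_j||^2
    phi = np.exp(-r2 / (2.0 * sigma**2))         # kernel activations phi_j(x)
    err = y - w @ phi                            # y - f(x)
    dw = eta1 * err * phi                                     # weight updates
    dmu = eta2 * err * (w * phi / sigma**2)[:, None] * d      # center updates
    dsigma = eta3 * err * w * phi * r2 / sigma**3             # width updates
    return w + dw, mu + dmu, sigma + dsigma
```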

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    51/139

Two-Stage Training

[Same network diagram.]

Step 1: Determine the centers $\mu_j$'s, widths $\sigma_j$'s, and number of the $\phi_j$'s.

Step 2: Determine the weights $w_{ij}$'s, e.g., using batch learning.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    52/139

    Train the Kernels

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    53/139

    Unsupervised Training

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    54/139

Methods

- Subset Selection: Random Subset Selection, Forward Selection, Backward Elimination
- Clustering Algorithms: K-means, LVQ
- Mixture Models: GMM

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    55/139

    Subset Selection

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    56/139

Random Subset Selection

- Randomly choose a subset of points from the training set as centers.
- Sensitive to the initially chosen points.
- Use adaptive techniques to tune the centers, widths, and number of points (see the sketch below).
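A small sketch of random subset selection, paired with a simple nearest-neighbour width heuristic; the heuristic is our illustrative choice, not one prescribed by the slide.

```python
import numpy as np

def random_subset_centers(X, m, rng=None):
    # Randomly choose m training points as RBF centers.
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx]

def nearest_neighbor_widths(centers, scale=1.0):
    # Illustrative heuristic: width of each kernel = scale * distance to its nearest other center.
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return scale * d.min(axis=1)
```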

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    57/139

    Clustering Algorithms

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    58/139

    Clustering Algorithms

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    59/139

    Clustering Algorithms

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    60/139

Clustering Algorithms

[Figure: data grouped into four clusters, labeled 1 to 4, each providing a candidate center.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    61/139

Introduction to Radial Basis Function Networks

Bias-Variance Dilemma

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    62/139

Goal Revisited

Ultimate goal (generalization): minimize the prediction error.

Goal of our learning procedure: minimize the empirical error.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    63/139

Badness of Fit

Underfitting: a model (e.g., network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, leading to underfitting. It produces excessive bias in the outputs.

Overfitting: a model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting. It produces excessive variance in the outputs.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    64/139

Underfitting/Overfitting Avoidance

- Model selection
- Jittering
- Early stopping
- Weight decay (regularization, ridge regression, Bayesian learning)
- Combining networks

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    65/139

Best Way to Avoid Overfitting

- Use lots of training data, e.g., 30 times as many training cases as there are weights in the network.
- For noise-free data, 5 times as many training cases as weights may be sufficient.
- Don't arbitrarily reduce the number of weights for fear of underfitting.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    66/139

Badness of Fit

[Figure: an underfit curve vs. an overfit curve on the same data.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    67/139

Badness of Fit

[Figure: an underfit curve vs. an overfit curve on the same data.]

However, it's not really a dilemma.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    68/139

Bias-Variance Dilemma

            Underfit          Overfit
Bias        Large bias        Small bias
Variance    Small variance    Large variance

However, it's not really a dilemma.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    69/139

Bias-Variance Dilemma

            Underfit          Overfit
Bias        Large bias        Small bias
Variance    Small variance    Large variance

More on overfitting:
- It easily leads to predictions that are far beyond the range of the training data.
- It produces wild predictions in multilayer perceptrons even with noise-free data.

It's not really a dilemma.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    70/139

Bias-Variance Dilemma

It's not really a dilemma.

[Figure: as the fit moves from more underfit to more overfit, bias decreases and variance increases.]

The mean of the bias = ?

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    71/139

Bias-Variance Dilemma

[Figure: a set of functions (e.g., depending on the number of hidden nodes used); the true model plus noise; the solutions obtained with training sets 1, 2, and 3; the bias of each solution relative to the true model.]

The mean of the bias = ?  The variance of the bias = ?

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    72/139

Bias-Variance Dilemma

[Figure: the set of functions, the true model plus noise, and the resulting bias and variance.]

To reduce the variance, reduce the effective number of parameters.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    73/139

Model Selection

[Figure: the set of functions, the true model plus noise, and the resulting bias and variance.]

Reduce the number of hidden nodes.

Goal: $\min \; E\big[(y - f(x))^2 \mid x\big]$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    74/139

Bias-Variance Dilemma

The true model: $y = g(x) + \varepsilon(x)$, approximated by $f(x)$.

[Figure: the set of functions, the true model $g(x)$ plus noise, and the bias of $f(x)$.]

Goal: $\min \; E\big[(y - f(x))^2 \mid x\big]$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    75/139

Bias-Variance Dilemma

Goal: $\min \; E\big[(y - f(x))^2 \mid x\big]$, with $y = g(x) + \varepsilon$.

$E\big[(y - f(x))^2 \mid x\big] = E\big[(g(x) + \varepsilon - f(x))^2 \mid x\big]$
$= E\big[(g(x) - f(x))^2 \mid x\big] + 2\,E\big[\varepsilon\,(g(x) - f(x)) \mid x\big] + E[\varepsilon^2]$

The cross term is 0 and $E[\varepsilon^2]$ is a constant, so

$E\big[(y - f(x))^2 \mid x\big] = E[\varepsilon^2] + E\big[(g(x) - f(x))^2 \mid x\big]$

and minimizing $E\big[(y - f(x))^2 \mid x\big]$ is equivalent to minimizing $E\big[(g(x) - f(x))^2 \mid x\big]$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    76/139

Bias-Variance Dilemma

$E\big[(g(x) - f(x))^2 \mid x\big] = E\big[(g(x) - E[f(x)] + E[f(x)] - f(x))^2 \mid x\big]$
$= E\big[(g(x) - E[f(x)])^2\big] + E\big[(E[f(x)] - f(x))^2\big] + 2\,E\big[(g(x) - E[f(x)])(E[f(x)] - f(x))\big]$

The cross term vanishes:
$E\big[(g(x) - E[f(x)])(E[f(x)] - f(x))\big] = (g(x) - E[f(x)])\,\big(E[f(x)] - E[f(x)]\big) = 0$

$\Rightarrow E\big[(g(x) - f(x))^2 \mid x\big] = \big(g(x) - E[f(x)]\big)^2 + E\big[(f(x) - E[f(x)])^2\big]$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    77/139

Bias-Variance Dilemma

$E\big[(y - f(x))^2 \mid x\big] = E[\varepsilon^2] + \big(g(x) - E[f(x)]\big)^2 + E\big[(f(x) - E[f(x)])^2\big]$
= noise + bias² + variance

The noise term cannot be minimized; the goal is to minimize both bias² and variance.
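This decomposition can be checked numerically. The sketch below repeatedly fits a small model to fresh noisy samples of a known target and estimates noise, bias², and variance at one query point; the target function, estimator, and constants are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * np.pi * x)        # "true" model g(x) (illustrative choice)
sigma_noise, degree, x0 = 0.2, 3, 0.3      # noise level, model capacity, query point

preds = []
for _ in range(2000):                      # many independent training sets
    x = rng.uniform(0, 1, 30)
    y = g(x) + rng.normal(0, sigma_noise, 30)
    coeffs = np.polyfit(x, y, degree)      # f(x): polynomial fit of fixed degree
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias2 = (g(x0) - preds.mean()) ** 2        # (g(x0) - E[f(x0)])^2
variance = preds.var()                     # E[(f(x0) - E[f(x0)])^2]
expected_err = sigma_noise**2 + bias2 + variance   # approximates E[(y - f(x0))^2 | x0]
print(bias2, variance, expected_err)
```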

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    78/139

Model Complexity vs. Bias-Variance

$E\big[(y - f(x))^2 \mid x\big] = E[\varepsilon^2] + \big(g(x) - E[f(x)]\big)^2 + E\big[(f(x) - E[f(x)])^2\big]$ = noise + bias² + variance

[Figure: noise, bias², and variance as functions of model complexity (capacity).]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    79/139

Bias-Variance Dilemma

$E\big[(y - f(x))^2 \mid x\big]$ = noise + bias² + variance, plotted against model complexity (capacity).

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    80/139

Example (Polynomial Fits): $y = x + 2\exp(-16 x^2) + \varepsilon$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    81/139

    Example (Polynomial Fits)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    82/139

Example (Polynomial Fits)

[Figure: polynomial fits of degree 1, 5, 10, and 15 to the same data.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    83/139

Introduction to Radial Basis Function Networks

The Effective Number of Parameters

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    84/139

Variance Estimation

Mean: $\mu$ (in general, not available).

Variance: $\hat{\sigma}^2 = \dfrac{1}{p} \sum_{i=1}^{p} (x_i - \mu)^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    85/139

Variance Estimation

Mean: $\bar{x} = \dfrac{1}{p} \sum_{i=1}^{p} x_i$

Variance: $s^2 = \dfrac{1}{p-1} \sum_{i=1}^{p} (x_i - \bar{x})^2$   (loses 1 degree of freedom)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    86/139

Simple Linear Regression

$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$

Minimize $SSE = \sum_{i=1}^{p} (y_i - \hat{y}_i)^2$

[Figure: data points and the fitted regression line.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    87/139

Simple Linear Regression

$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Minimize $SSE = \sum_{i=1}^{p} (y_i - \hat{y}_i)^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    88/139

Mean Squared Error (MSE)

$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - 2}$   (loses 2 degrees of freedom)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    89/139

Variance Estimation

$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - m}$   (loses m degrees of freedom)

m: the number of parameters of the model

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    90/139

The Number of Parameters

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

[Network diagram with weights $w_1, \dots, w_m$.]

Number of degrees of freedom: m.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    91/139

The Effective Number of Parameters (γ)

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

The projection matrix: $P = I_p - \Phi A^{-1} \Phi^T, \quad A = \Phi^T \Phi + \Lambda, \quad \Lambda \ge 0$

$\gamma = p - \mathrm{trace}(P) \le m$

Facts: $\mathrm{trace}(A + B) = \mathrm{trace}(A) + \mathrm{trace}(B)$, and $\mathrm{trace}(AB) = \mathrm{trace}(BA)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    92/139

The Effective Number of Parameters (γ)

The projection matrix: $P = I_p - \Phi A^{-1} \Phi^T, \quad A = \Phi^T \Phi + \Lambda, \quad \Lambda \ge 0$

The effective number of parameters: $\gamma = p - \mathrm{trace}(P)$.

Pf) For $\Lambda = 0$:
$p - \mathrm{trace}(P) = p - \mathrm{trace}\big(I_p - \Phi A^{-1} \Phi^T\big) = \mathrm{trace}\big(\Phi A^{-1} \Phi^T\big)$
$= \mathrm{trace}\big(A^{-1} \Phi^T \Phi\big) = \mathrm{trace}\big((\Phi^T\Phi)^{-1} \Phi^T \Phi\big) = \mathrm{trace}(I_m) = m$

So with no regularization, $\gamma = p - \mathrm{trace}(P) = m$.
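A short sketch computing $\gamma = p - \mathrm{trace}(P)$ for standard ridge regularization; as $\lambda$ grows, $\gamma$ falls below m (the data here are placeholders).

```python
import numpy as np

def effective_parameters(Phi, lam):
    # gamma = p - trace(P), with P = I_p - Phi A^{-1} Phi^T and A = Phi^T Phi + lam I
    p, m = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    return p - np.trace(P)

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 10))
for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, effective_parameters(Phi, lam))   # lam = 0 gives gamma = m = 10
```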

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    93/139

Regularization

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$   (empirical error, SSE, plus the model's penalty)

Penalize models with large weights.

The effective number of parameters: $\gamma = p - \mathrm{trace}(P)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    94/139

Regularization

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$   (SSE plus a penalty on large weights)

Without penalty ($\lambda_i = 0$), there are m degrees of freedom available to minimize SSE (the cost), and the effective number of parameters is $\gamma = m$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    95/139

Regularization

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$   (SSE plus a penalty on large weights)

With penalty ($\lambda_i > 0$), the freedom to minimize SSE is reduced, and the effective number of parameters becomes $\gamma < m$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    96/139

Variance Estimation

$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - \gamma}$   (loses γ degrees of freedom)

The effective number of parameters: $\gamma = p - \mathrm{trace}(P)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    97/139

Variance Estimation

$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - \gamma} = \dfrac{SSE}{\mathrm{trace}(P)}$, since $\gamma = p - \mathrm{trace}(P)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    98/139

Introduction to Radial Basis Function Networks

Model Selection

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    99/139

Model Selection

- Goal: choose the fittest model.
- Criterion: least prediction error.
- Main tools (to estimate model fitness): cross-validation, the projection matrix.
- Methods: weight decay (ridge regression), pruning and growing RBFNs.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    100/139

Empirical Error vs. Model Fitness

Ultimate goal (generalization): minimize the prediction error (MSE).

Goal of our learning procedure: minimize the empirical error.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    101/139

Estimating Prediction Error

- When you have plenty of data, use independent test sets: e.g., train different models on the same training set and choose the best model by comparing them on the test set.
- When data is scarce, use cross-validation or the bootstrap.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    102/139

Cross-Validation

- The simplest and most widely used method for estimating prediction error.
- Partition the original set in several different ways and compute an average score over the different partitions, e.g.:
  - K-fold cross-validation
  - Leave-one-out cross-validation
  - Generalized cross-validation

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    103/139

K-Fold CV

- Split the set D of available input-output patterns into k mutually exclusive subsets, say D1, D2, ..., Dk.
- Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di. (A code sketch follows.)
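A minimal sketch of the k-fold loop; the fit and predict callables are placeholders for whatever learning algorithm is being evaluated.

```python
import numpy as np

def k_fold_indices(p, k, rng=None):
    # Split indices {0, ..., p-1} into k mutually exclusive subsets D_1, ..., D_k.
    idx = np.random.default_rng(rng).permutation(p)
    return np.array_split(idx, k)

def k_fold_cv(X, y, fit, predict, k=5):
    # Train on D \ D_i and test on D_i, i = 1..k; return the average squared error.
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        err = y[test_idx] - predict(model, X[test_idx])
        scores.append(np.mean(err ** 2))
    return np.mean(scores)
```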

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    104/139

K-Fold CV

[Figure: the available data split into folds; one fold serves as the test set.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    105/139

K-Fold CV

[Figure: k rotations of the split D1, D2, D3, ..., Dk; in each rotation one Di is the test set and the remaining folds form the training set, giving an estimate $\hat{\sigma}_i^2$ per rotation.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    106/139

Leave-One-Out CV

- Split the p available input-output patterns into a training set of size p-1 and a test set of size 1.
- Average the squared error on the left-out pattern over the p possible ways of partitioning.
- A special case of k-fold CV (k = p).

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    107/139

Error Variance Predicted by LOO

Available input-output patterns: $D = \{(x_k, y_k) : k = 1, 2, \dots, p\}$

Training sets of LOO: $D_{-i} = D \setminus \{(x_i, y_i)\}, \quad i = 1, \dots, p$

$f_{-i}$: the function learned using $D_{-i}$ as training set.

The estimate for the variance of the prediction error using LOO:

$\hat{\sigma}^2_{LOO} = \dfrac{1}{p} \sum_{i=1}^{p} \big(y_i - f_{-i}(x_i)\big)^2$   (squared error on the left-out element)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    108/139

Error Variance Predicted by LOO

$\hat{\sigma}^2_{LOO} = \dfrac{1}{p} \sum_{i=1}^{p} \big(y_i - f_{-i}(x_i)\big)^2$

$f_{-i}$ is, for the given model, the function with least empirical error on $D_{-i}$.

$\hat{\sigma}^2_{LOO}$ serves as an index of the model's fitness; we want to find a model that also minimizes it.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    109/139

Error Variance Predicted by LOO

$\hat{\sigma}^2_{LOO} = \dfrac{1}{p} \sum_{i=1}^{p} \big(y_i - f_{-i}(x_i)\big)^2$   (squared error on the left-out element)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    110/139

Error Variance Predicted by LOO

$\hat{\sigma}^2_{LOO} = \dfrac{1}{p} \sum_{i=1}^{p} \big(y_i - f_{-i}(x_i)\big)^2$   (squared error on the left-out element)

In terms of the projection matrix:

$\hat{\sigma}^2_{LOO} = p^{-1}\, y^T P \,\big[\mathrm{diag}(P)\big]^{-2}\, P y$
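A small sketch of this shortcut: for a linear-in-the-weights model, the leave-one-out residual for pattern k equals the ordinary residual divided by the k-th diagonal entry of P, so no refitting is needed (P and y are assumed to come from the helpers sketched earlier).

```python
import numpy as np

def loo_variance(P, y):
    # sigma^2_LOO = (1/p) * y^T P diag(P)^{-2} P y
    d = np.diag(P)                    # diagonal entries P_kk
    e = P @ y                         # empirical errors e = P y
    return np.mean((e / d) ** 2)      # (1/p) * sum_k (e_k / P_kk)^2
```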

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    111/139

Generalized Cross-Validation

Approximate $\mathrm{diag}(P) \approx \dfrac{\mathrm{trace}(P)}{p}\, I_p$. Then

$\hat{\sigma}^2_{GCV} = \dfrac{p\, y^T P^2 y}{\big(\mathrm{trace}(P)\big)^2}$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    112/139

More Criteria Based on CV

UEV (unbiased estimate of variance): $\hat{\sigma}^2_{UEV} = \dfrac{y^T P^2 y}{p - \gamma}$

FPE (final prediction error, Akaike's information criterion): $\hat{\sigma}^2_{FPE} = \dfrac{p + \gamma}{p\,(p - \gamma)}\, y^T P^2 y$

GCV (generalized CV): $\hat{\sigma}^2_{GCV} = \dfrac{p\, y^T P^2 y}{\big(\mathrm{trace}(P)\big)^2}$

BIC (Bayesian information criterion): $\hat{\sigma}^2_{BIC} = \dfrac{p + (\ln p - 1)\gamma}{p\,(p - \gamma)}\, y^T P^2 y$

$\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$
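All four criteria are simple functions of P and y, so they can be computed together, as in this sketch (following the formulas as reconstructed above).

```python
import numpy as np

def selection_criteria(P, y):
    # UEV, FPE, GCV, BIC variance estimates from the projection matrix P.
    p = len(y)
    gamma = p - np.trace(P)                   # effective number of parameters
    yPPy = y @ P @ P @ y                      # y^T P^2 y
    uev = yPPy / (p - gamma)
    fpe = (p + gamma) / (p * (p - gamma)) * yPPy
    gcv = p * yPPy / np.trace(P) ** 2
    bic = (p + (np.log(p) - 1) * gamma) / (p * (p - gamma)) * yPPy
    return uev, fpe, gcv, bic
```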

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    113/139

More Criteria Based on CV

Expressed in terms of UEV:

$\hat{\sigma}^2_{FPE} = \dfrac{p + \gamma}{p}\, \hat{\sigma}^2_{UEV}, \quad \hat{\sigma}^2_{GCV} = \dfrac{p}{p - \gamma}\, \hat{\sigma}^2_{UEV}, \quad \hat{\sigma}^2_{BIC} = \dfrac{p + (\ln p - 1)\gamma}{p}\, \hat{\sigma}^2_{UEV}$

$\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    114/139

More Criteria Based on CV

Each criterion depends on the projection matrix $P$, which in turn depends on the regularization parameter(s) $\lambda$. How $P$ is obtained for standard ridge regression is reviewed next.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    115/139

Regularization

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$   (SSE plus a penalty on large weights)

Standard Ridge Regression: $\lambda_i = \lambda, \; \forall i$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    116/139

Regularization

Standard Ridge Regression ($\lambda_i = \lambda, \; \forall i$):

$\min_{w} \; C(w, \lambda) = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \lambda \sum_{i=1}^{m} w_i^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    117/139

Solution Review

$w^* = A^{-1} \Phi^T y, \quad A = \Phi^T \Phi + \lambda I_m$

$P = I_p - \Phi A^{-1} \Phi^T$   (used to compute the model selection criteria)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    118/139

Example

$y(x) = (1 - x + 2x^2)\, e^{-x^2} + \varepsilon, \quad \varepsilon \sim N(0, 0.2^2), \quad p = 50$

Width of RBF: $r = 0.5$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    119/139

Example

[Plot: the criteria $\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$ for the same data ($p = 50$, $\varepsilon \sim N(0, 0.2^2)$, RBF width $r = 0.5$).]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    120/139

Example

[Plot: the four model selection criteria as functions of the regularization parameter, for the same data.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    121/139

Optimizing the Regularization Parameter

Re-Estimation Formula:

$\hat{\lambda} = \dfrac{y^T P^2 y \;\; \mathrm{trace}\big(A^{-1} - \lambda A^{-2}\big)}{w^{*T} A^{-1} w^{*} \;\; \mathrm{trace}(P)}$
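One way to read the re-estimation procedure as code is the fixed-point iteration sketched below; it uses the formula as reconstructed above, and the stopping rule and helper names are our own.

```python
import numpy as np

def reestimate_lambda(Phi, y, lam=0.01, n_iter=100, tol=1e-8):
    # Iterate the re-estimation formula for the global ridge parameter lambda.
    p, m = Phi.shape
    for _ in range(n_iter):
        A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
        P = np.eye(p) - Phi @ A_inv @ Phi.T
        w = A_inv @ Phi.T @ y
        new_lam = (y @ P @ P @ y) * np.trace(A_inv - lam * A_inv @ A_inv) \
                  / (w @ A_inv @ w * np.trace(P))
        if abs(new_lam - lam) < tol:     # converged to a fixed point
            break
        lam = new_lam
    return lam
```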

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    122/139

Local Ridge Regression

Re-Estimation Formula:

$\hat{\lambda} = \dfrac{y^T P^2 y \;\; \mathrm{trace}\big(A^{-1} - \lambda A^{-2}\big)}{w^{*T} A^{-1} w^{*} \;\; \mathrm{trace}(P)}$

Example: $y = \sin(12x) + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad m = p = 50$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    123/139

Example

Width of RBF: $r = 0.05$, $\gamma = p - \mathrm{trace}(P)$.

$y = \sin(12x) + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad m = p = 50$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    124/139

Example

Width of RBF: $r = 0.05$, $\gamma = p - \mathrm{trace}(P)$, $y = \sin(12x) + \varepsilon$, $\varepsilon \sim N(0, 0.1^2)$, $m = p = 50$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    125/139

Example (width of RBF r = 0.05)

There are two local minima.

Using the above re-estimation formula, the iteration gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.

$y = \sin(12x) + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad m = p = 50$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    126/139

Example (width of RBF r = 0.05)

There are two local minima.

Starting from $\lambda(0) = 0.01$: $\lim_{t \to \infty} \lambda(t) \approx 0.1$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    127/139

Example (width of RBF r = 0.05)

There are two local minima.

Starting from $\lambda(0) = 10^{-5}$: $\lim_{t \to \infty} \lambda(t) \approx 2.1 \times 10^{-4}$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    128/139

Example (width of RBF r = 0.05)

RMSE: root mean squared error. In a real case, it is not available.

$y = \sin(12x) + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad m = p = 50$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    129/139

Example (width of RBF r = 0.05)

RMSE: root mean squared error. In a real case, it is not available.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    130/139

Local Ridge Regression

Standard Ridge Regression: $\min\; C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \lambda \sum_{i=1}^{m} w_i^2$

Local Ridge Regression: $\min\; C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    131/139

Local Ridge Regression

Standard Ridge Regression: $\min\; C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \lambda \sum_{i=1}^{m} w_i^2$

Local Ridge Regression: $\min\; C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$

$\lambda_j \to \infty$ implies that $\phi_j(\cdot)$ can be removed.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    132/139

The Solutions

$w^* = A^{-1} \Phi^T y, \quad P = I_p - \Phi A^{-1} \Phi^T$   (used to compute the model selection criteria)

- Linear regression: $A = \Phi^T \Phi$
- Standard ridge regression: $A = \Phi^T \Phi + \lambda I_m$
- Local ridge regression: $A = \Phi^T \Phi + \Lambda$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    133/139

Optimizing the Regularization Parameters

Incremental Operation

$P$: the current projection matrix.
$P_{-j}$: the projection matrix obtained by removing $\phi_j(\cdot)$.

$P = P_{-j} - \dfrac{P_{-j}\, \phi_j \phi_j^T\, P_{-j}}{\lambda_j + \phi_j^T P_{-j} \phi_j}$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    134/139

Optimizing the Regularization Parameters

$\hat{\sigma}^2_{GCV} = \dfrac{p\, y^T P^2 y}{\big(\mathrm{trace}(P)\big)^2}$

Solve $\dfrac{\partial \hat{\sigma}^2_{GCV}}{\partial \lambda_j} = 0$ subject to $\lambda_j \ge 0$, using $P = P_{-j} - \dfrac{P_{-j}\, \phi_j \phi_j^T\, P_{-j}}{\lambda_j + \phi_j^T P_{-j} \phi_j}$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    135/139

Optimizing the Regularization Parameters

Solving $\partial \hat{\sigma}^2_{GCV} / \partial \lambda_j = 0$ subject to $\lambda_j \ge 0$ yields a closed-form, piecewise solution for $\lambda_j$ expressed in terms of $y$, $\phi_j$, $P_{-j}$, and $\mathrm{trace}(P_{-j})$: depending on the quantities involved, the optimum is a finite positive value, $0$, or $\infty$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    136/139

Optimizing the Regularization Parameters

In the piecewise solution, the case $\lambda_j = \infty$ corresponds to removing $\phi_j(\cdot)$ from the network.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    137/139

The Algorithm

- Initialize the $\lambda_i$'s, e.g., by performing standard ridge regression.
- Repeat the following until GCV converges:
  - Randomly select $j$ and compute the optimal $\hat{\lambda}_j$.
  - Perform local ridge regression with $\hat{\lambda}_j$ if GCV is reduced.
  - If $\hat{\lambda}_j \to \infty$, remove $\phi_j(\cdot)$.

(A code sketch of this loop follows.)
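A high-level sketch of the loop. Here optimal_lambda_j stands in for the closed-form GCV solution discussed above and gcv for the GCV score; both are left abstract, and the sweep structure and names are our own.

```python
import numpy as np

def local_ridge_selection(Phi, y, lambdas, optimal_lambda_j, gcv, max_sweeps=20):
    """Greedy local ridge regression.
    lambdas: initial regularization parameters (e.g., from standard ridge regression).
    optimal_lambda_j(Phi, y, lambdas, j): GCV-optimal lambda_j with the others fixed.
    gcv(Phi, y, lambdas): GCV score for the current parameters."""
    rng = np.random.default_rng(0)
    best = gcv(Phi, y, lambdas)
    for _ in range(max_sweeps):
        improved = False
        for j in rng.permutation(Phi.shape[1]):       # randomly visit each kernel
            trial = lambdas.copy()
            trial[j] = optimal_lambda_j(Phi, y, lambdas, j)
            score = gcv(Phi, y, trial)
            if score < best:                          # keep the change only if GCV drops
                lambdas, best, improved = trial, score, True
                # trial[j] == np.inf effectively removes kernel phi_j from the model
        if not improved:                              # GCV has converged
            break
    return lambdas
```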

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    138/139

References

- Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," International Joint Conference on Artificial Intelligence (IJCAI).
- Orr, M. J. L. (April 1996). Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html
  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    139/139

Introduction to Radial Basis Function Networks

Incremental Operations