Introduction to Radial Basis Function Networks
Contents
• Overview
• The Models of Function Approximator
• The Radial Basis Function Networks
• RBFN's for Function Approximation
• The Projection Matrix
• Learning the Kernels
• Bias-Variance Dilemma
• The Effective Number of Parameters
• Model Selection
Introduction to Radial Basis Function Networks
Overview
Linear models have been studied in statistics for about 200 years and the theory is applicable to RBF networks which are just one particular type of linear model.
However, the fashion for neural networks, which started in the mid-1980s, has given rise to new names for concepts already familiar to statisticians.
Typical Applications of NN
• Pattern classification: $l = f(\mathbf{x})$, $\mathbf{x} \in X \subseteq \mathbb{R}^n$, $l \in C = \{1, 2, \dots, N\}$
• Function approximation: $\mathbf{y} = f(\mathbf{x})$, $\mathbf{x} \in X \subseteq \mathbb{R}^n$, $\mathbf{y} \in Y \subseteq \mathbb{R}^m$
• Time-series forecasting: $x_t = f(x_{t-1}, x_{t-2}, x_{t-3}, \dots)$
Function Approximation
Unknown function: $f: X \to Y$, with $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}^m$.
Approximator: $\hat{f}: X \to Y$.
Supervised Learning
(Figure: for $i = 1, 2, \dots$, each input $\mathbf{x}_i$ is fed both to the unknown function, producing the target $\mathbf{y}_i$, and to the neural network, producing $\hat{\mathbf{y}}_i$; the error $\mathbf{e}_i = \mathbf{y}_i - \hat{\mathbf{y}}_i$ drives the learning.)
Neural Networks as Universal Approximators
Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy.
Like feedforward neural networks with a single hidden layer of sigmoidal units, RBF networks can be shown to be universal approximators.
Statistics vs. Neural Networks
Statistics | Neural Networks
Model | Network
Estimation | Learning
Regression | Supervised learning
Interpolation | Generalization
Observations | Training set
Parameters | (Synaptic) weights
Independent variables | Inputs
Dependent variables | Outputs
Ridge regression | Weight decay
Introduction to Radial Basis Function Networks
The Models of Function Approximator
Linear Models
$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$
Weights: $w_i$. Fixed basis functions: $\phi_i(\mathbf{x})$.
Linear Models
(Network diagram: the inputs $x_1, x_2, \dots, x_n$ form the feature vector $\mathbf{x}$; the hidden units $\phi_1, \phi_2, \dots, \phi_m$ perform decomposition, feature extraction, and transformation; the output unit forms the linearly weighted output.)
$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}), \qquad \text{with weights } w_1, w_2, \dots, w_m.$$
Example Linear Models
Polynomial:
$$f(x) = \sum_i w_i\,x^i, \qquad \phi_i(x) = x^i, \quad i = 0, 1, 2, \dots$$
Fourier series:
$$f(x) = \sum_k \tilde{w}_k \exp\!\big(j\,2\pi k \nu_0 x\big), \qquad \phi_k(x) = \exp\!\big(j\,2\pi k \nu_0 x\big), \quad k = 0, 1, 2, \dots$$
Single-Layer Perceptrons as Universal Approximators
With a sufficient number of sigmoidal hidden units, a single-hidden-layer perceptron is a universal approximator:
$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}), \qquad \phi_i \text{ sigmoidal.}$$
Radial Basis Function Networks as Universal Approximators
With a sufficient number of radial-basis-function hidden units, it can also be a universal approximator:
$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}), \qquad \phi_i \text{ radial.}$$
Non-Linear Models
$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$
Both the weights $w_i$ and the basis functions $\phi_i$ are adjusted by the learning process.
Introduction to Radial Basis Function Networks
The Radial Basis Function Networks
Radial Basis Functions
$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}), \qquad \phi_i(\mathbf{x}) = \phi\big(\|\mathbf{x} - \mathbf{x}_i\|\big)$$
Three parameters for a radial function:
– Center: $\mathbf{x}_i$
– Distance measure: $r = \|\mathbf{x} - \mathbf{x}_i\|$
– Shape: $\phi$
Typical Radial Functions
Gaussian:
$$\phi(r) = e^{-r^2/(2\sigma^2)}, \qquad \sigma > 0$$
Hardy multiquadratic (1971):
$$\phi(r) = \sqrt{r^2 + c^2}, \qquad c > 0$$
Inverse multiquadratic:
$$\phi(r) = \frac{c}{\sqrt{r^2 + c^2}}, \qquad c > 0$$
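As a quick illustration (not part of the original slides), the three radial functions can be written directly in NumPy; the width `sigma` and shape parameter `c` are free choices:

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2))
    return np.exp(-(r ** 2) / (2.0 * sigma ** 2))

def multiquadric(r, c=1.0):
    # Hardy's multiquadratic: phi(r) = sqrt(r^2 + c^2)
    return np.sqrt(r ** 2 + c ** 2)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = c / sqrt(r^2 + c^2); equals 1 at r = 0 for every c
    return c / np.sqrt(r ** 2 + c ** 2)

r = np.linspace(-10.0, 10.0, 201)
print(gaussian(r, sigma=0.5).max(), inverse_multiquadric(r, c=3.0).max())
```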
Gaussian Basis Function ($\sigma = 0.5, 1.0, 1.5$)
(Figure: plot of $\phi(r) = e^{-r^2/(2\sigma^2)}$ for the three widths.)
Inverse Multiquadratic
(Figure: plot of $\phi(r) = c/\sqrt{r^2 + c^2}$ over $r \in [-10, 10]$ for $c = 1, 2, 3, 4, 5$; all curves peak at 1 at $r = 0$.)
Most General RBF
$$\phi_i(\mathbf{x}) = \phi\Big(\big(\mathbf{x} - \boldsymbol{\mu}_i\big)^T \Sigma_i^{-1} \big(\mathbf{x} - \boldsymbol{\mu}_i\big)\Big)$$
(Figure: three such kernels with centers $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\mu}_3$.)
The basis $\{\phi_i : i = 1, 2, \dots\}$ is "nearly" orthogonal.
Properties of RBF’s
• On-center, off-surround behavior.
• Analogies with the localized receptive fields found in several biological structures, e.g., the visual cortex and ganglion cells.
The Topology of RBF
(Network diagram: inputs $x_1, \dots, x_n$ form the feature vector; hidden units perform a projection; output units $y_1, \dots, y_m$ perform an interpolation.)
As a function approximator: $\mathbf{y} = f(\mathbf{x})$.
The Topology of RBF
(Network diagram: inputs $x_1, \dots, x_n$ form the feature vector; hidden units represent subclasses; output units $y_1, \dots, y_m$ represent classes.)
As a pattern classifier.
Introduction to Radial Basis Function Networks
RBFN's for Function Approximation
Radial Basis Function Networks
Radial basis function (RBF) networks are feed-forward networks trained using a supervised training algorithm.
The activation function is selected from a class of functions called basis functions.
They usually train much faster than back-propagation (BP) networks.
They are less susceptible to problems with non-stationary inputs.
Radial Basis Function Networks
Popularized by Broomhead and Lowe (1988) and Moody and Darken (1989), RBF networks have proven to be a useful neural network architecture.
The major difference between RBF and BP is the behavior of the single hidden layer.
Rather than using the sigmoidal or S-shaped activation function as in BP, the hidden units in RBF networks use a Gaussian or some other basis kernel function.
The idea
(Figure sequence: the unknown function to approximate and its training data; basis functions (kernels) placed along the input axis; the learned function
$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$
superimposed on the data; and its value at a non-training sample.)
Radial Basis Function Networks as Universal Approximators
$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$
(Network diagram: inputs $x_1, \dots, x_n$, kernels $\phi_1, \dots, \phi_m$, weights $w_1, \dots, w_m$, output $y$.)
Training set: $\mathcal{T} = \big\{(\mathbf{x}^{(k)}, y^{(k)})\big\}_{k=1}^{p}$
Goal: $y^{(k)} \approx f(\mathbf{x}^{(k)})$ for all $k$.
Minimize
$$\mathrm{SSE} = \sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)^2 = \sum_{k=1}^{p}\Big(y^{(k)} - \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}^{(k)})\Big)^2$$
Regularization
Add a penalty term to the cost:
$$C = \sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)^2 + \sum_{i=1}^{m}\lambda_i w_i^2, \qquad \lambda_i \ge 0.$$
If regularization is not needed, set $\lambda_i = 0$.
Learn the Optimal Weight Vector
Minimize
$$C = \sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)^2 + \sum_{i=1}^{m}\lambda_i w_i^2, \qquad f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}).$$
Setting the derivative to zero:
$$\frac{\partial C}{\partial w_j} = -2\sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)\phi_j(\mathbf{x}^{(k)}) + 2\lambda_j w_j = 0$$
$$\Rightarrow\quad \sum_{k=1}^{p} f^*(\mathbf{x}^{(k)})\,\phi_j(\mathbf{x}^{(k)}) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)}\,\phi_j(\mathbf{x}^{(k)}), \qquad f^*(\mathbf{x}) = \sum_{i=1}^{m} w_i^*\,\phi_i(\mathbf{x}).$$
Learn the Optimal Weight Vector
Define
$$\boldsymbol{\varphi}_j = \big(\phi_j(\mathbf{x}^{(1)}), \dots, \phi_j(\mathbf{x}^{(p)})\big)^T, \qquad \mathbf{f}^* = \big(f^*(\mathbf{x}^{(1)}), \dots, f^*(\mathbf{x}^{(p)})\big)^T, \qquad \mathbf{y} = \big(y^{(1)}, \dots, y^{(p)}\big)^T.$$
Then
$$\boldsymbol{\varphi}_j^T \mathbf{f}^* + \lambda_j w_j^* = \boldsymbol{\varphi}_j^T \mathbf{y}, \qquad j = 1, \dots, m.$$
Learn the Optimal Weight Vector
Stacking the $m$ equations and defining
$$\boldsymbol{\Phi} = [\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2, \dots, \boldsymbol{\varphi}_m], \qquad \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m), \qquad \mathbf{w}^* = (w_1^*, w_2^*, \dots, w_m^*)^T,$$
we obtain
$$\boldsymbol{\Phi}^T \mathbf{f}^* + \boldsymbol{\Lambda}\mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}.$$
Learn the Optimal Weight Vector
Since $f^*(\mathbf{x}^{(k)}) = \sum_{i=1}^{m} w_i^*\,\phi_i(\mathbf{x}^{(k)})$ for $k = 1, \dots, p$,
$$\mathbf{f}^* = \begin{pmatrix} f^*(\mathbf{x}^{(1)}) \\ f^*(\mathbf{x}^{(2)}) \\ \vdots \\ f^*(\mathbf{x}^{(p)}) \end{pmatrix}
= \begin{pmatrix}
\phi_1(\mathbf{x}^{(1)}) & \phi_2(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)}) \\
\phi_1(\mathbf{x}^{(2)}) & \phi_2(\mathbf{x}^{(2)}) & \cdots & \phi_m(\mathbf{x}^{(2)}) \\
\vdots & \vdots & & \vdots \\
\phi_1(\mathbf{x}^{(p)}) & \phi_2(\mathbf{x}^{(p)}) & \cdots & \phi_m(\mathbf{x}^{(p)})
\end{pmatrix}
\begin{pmatrix} w_1^* \\ w_2^* \\ \vdots \\ w_m^* \end{pmatrix}
= \boldsymbol{\Phi}\mathbf{w}^*,$$
so
$$\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w}^* + \boldsymbol{\Lambda}\mathbf{w}^* = \boldsymbol{\Phi}^T\mathbf{y}.$$
Learn the Optimal Weight Vector
$$\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w}^* + \boldsymbol{\Lambda}\mathbf{w}^* = \boldsymbol{\Phi}^T\mathbf{y}
\;\Rightarrow\; \big(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \boldsymbol{\Lambda}\big)\mathbf{w}^* = \boldsymbol{\Phi}^T\mathbf{y}
\;\Rightarrow\; \mathbf{w}^* = \mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{y},$$
where $\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \boldsymbol{\Lambda}$.
$\mathbf{A}^{-1}$: the variance matrix. $\boldsymbol{\Phi}$: the design matrix.
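A minimal NumPy sketch of this batch solution, assuming Gaussian kernels with a single shared width `sigma` and centers chosen in advance (both assumptions, not fixed by the slides):

```python
import numpy as np

def design_matrix(X, centers, sigma):
    # Phi[k, j] = phi_j(x^(k)) = exp(-||x^(k) - mu_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def optimal_weights(Phi, y, lam):
    # w* = A^{-1} Phi^T y,  A = Phi^T Phi + Lambda (here Lambda = lam * I)
    m = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(m)
    return np.linalg.solve(A, Phi.T @ y)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)
centers = X[:10]                       # e.g., a random subset of inputs as centers
Phi = design_matrix(X, centers, sigma=0.3)
w = optimal_weights(Phi, y, lam=1e-3)
y_hat = Phi @ w                        # fitted values f*(x^(k))
```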
Introduction to Radial Basis Function Networks
The Projection Matrix
The Empirical-Error Vector
$$\mathbf{w}^* = \mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{y}, \qquad \mathbf{f}^* = \boldsymbol{\Phi}\mathbf{w}^*$$
(Figure: the unknown function produces the targets $\mathbf{y}$ at the training inputs $\mathbf{x}^{(k)}$; the network with kernels $\phi_1, \dots, \phi_m$ and optimal weights $w_1^*, \dots, w_m^*$ produces the fit $\mathbf{f}^*$.)
The empirical-error vector:
$$\mathbf{e} = \mathbf{y} - \mathbf{f}^* = \mathbf{y} - \boldsymbol{\Phi}\mathbf{w}^* = \mathbf{y} - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{y} = \big(\mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T\big)\mathbf{y} = \mathbf{P}\mathbf{y}$$
Sum-Squared-Error
$$\mathbf{e} = \mathbf{P}\mathbf{y}, \qquad \mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T, \qquad \mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \boldsymbol{\Lambda}, \qquad \boldsymbol{\Phi} = [\boldsymbol{\varphi}_1, \dots, \boldsymbol{\varphi}_m]$$
$$\mathrm{SSE} = \sum_{k=1}^{p}\Big(y^{(k)} - f^*(\mathbf{x}^{(k)})\Big)^2 = (\mathbf{P}\mathbf{y})^T(\mathbf{P}\mathbf{y}) = \mathbf{y}^T\mathbf{P}^2\mathbf{y}$$
If $\boldsymbol{\Lambda} = \mathbf{0}$, the RBFN learning algorithm minimizes the SSE (equivalently, the MSE).
The Projection Matrix
When $\boldsymbol{\Lambda} = \mathbf{0}$, the error vector $\mathbf{e} = \mathbf{P}\mathbf{y}$ is orthogonal to $\mathrm{span}(\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2, \dots, \boldsymbol{\varphi}_m)$ and $\mathbf{P}$ is a projection matrix:
$$\mathbf{P}\mathbf{e} = \mathbf{P}(\mathbf{P}\mathbf{y}) = \mathbf{P}\mathbf{y} = \mathbf{e}, \qquad \mathrm{SSE} = \mathbf{y}^T\mathbf{P}\mathbf{y}.$$
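A small self-contained numerical check of these identities; a toy random matrix stands in for an actual kernel design matrix:

```python
import numpy as np

def projection_matrix(Phi, lam_vec):
    # P = I_p - Phi A^{-1} Phi^T,  A = Phi^T Phi + Lambda
    p = Phi.shape[0]
    A = Phi.T @ Phi + np.diag(lam_vec)
    return np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 8))          # toy design matrix (p=50, m=8)
y = rng.standard_normal(50)

P = projection_matrix(Phi, np.zeros(8))     # Lambda = 0
e = P @ y                                   # empirical-error vector e = P y
print(np.allclose(P @ P, P))                # P is idempotent when Lambda = 0
print(np.isclose(e @ e, y @ P @ y))         # SSE = y^T P^2 y = y^T P y
```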
Introduction to Radial Basis Function Networks
Learning the Kernels
RBFN’s as Universal Approximators
(Network diagram: inputs $x_1, \dots, x_n$; Gaussian kernels $\phi_1, \dots, \phi_m$; weights $w_{ij}$; outputs $y_1, \dots, y_l$.)
Training set: $\mathcal{T} = \big\{(\mathbf{x}^{(k)}, \mathbf{y}^{(k)})\big\}_{k=1}^{p}$
Kernels:
$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$$
What to Learn?
• Weights $w_{ij}$'s
• Centers $\boldsymbol{\mu}_j$'s of the $\phi_j$'s
• Widths $\sigma_j$'s of the $\phi_j$'s
• Number of $\phi_j$'s (model selection)
One-Stage Learning
$$\Delta w_{ij} = \eta_1\Big(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\Big)\,\phi_j(\mathbf{x}^{(k)}), \qquad f_i(\mathbf{x}^{(k)}) = \sum_{j=1}^{m} w_{ij}\,\phi_j(\mathbf{x}^{(k)})$$
$$\Delta\boldsymbol{\mu}_j = \eta_2\,\frac{\phi_j(\mathbf{x}^{(k)})}{\sigma_j^2}\sum_{i=1}^{l} w_{ij}\Big(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\Big)\big(\mathbf{x}^{(k)} - \boldsymbol{\mu}_j\big)$$
$$\Delta\sigma_j = \eta_3\,\frac{\phi_j(\mathbf{x}^{(k)})}{\sigma_j^3}\sum_{i=1}^{l} w_{ij}\Big(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\Big)\big\|\mathbf{x}^{(k)} - \boldsymbol{\mu}_j\big\|^2$$
The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or on-line settings.
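A sketch of one on-line update step implementing the three rules above, assuming Gaussian kernels; the learning rates `eta` and the toy dimensions are illustrative assumptions:

```python
import numpy as np

def one_stage_update(x, y, W, mu, sigma, eta=(0.05, 0.05, 0.05)):
    # x: (n,) input, y: (l,) target, W: (l, m) weights,
    # mu: (m, n) centers, sigma: (m,) widths
    diff = x - mu                                               # (m, n)
    phi = np.exp(-(diff ** 2).sum(axis=1) / (2 * sigma ** 2))   # phi_j(x)
    err = y - W @ phi                                           # y_i - f_i(x)
    g = W.T @ err                                               # sum_i w_ij (y_i - f_i)
    W = W + eta[0] * np.outer(err, phi)                         # weight rule
    mu = mu + eta[1] * ((g * phi / sigma ** 2)[:, None]) * diff # center rule
    sigma = sigma + eta[2] * g * phi * (diff ** 2).sum(axis=1) / sigma ** 3  # width rule
    return W, mu, sigma

# one toy step with n=2 inputs, m=3 kernels, l=1 output
rng = np.random.default_rng(0)
W, mu, sigma = rng.standard_normal((1, 3)), rng.standard_normal((3, 2)), np.ones(3)
W, mu, sigma = one_stage_update(np.array([0.1, -0.2]), np.array([0.5]), W, mu, sigma)
```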
Two-Stage Training
Step 1: Train the kernels. Determine the centers $\boldsymbol{\mu}_j$'s, the widths $\sigma_j$'s, and the number of $\phi_j$'s.
Step 2: Determine the weights $w_{ij}$'s, e.g., using batch learning.
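A compact sketch of the two stages, using a few Lloyd (k-means) iterations for the centers, a simple pairwise-distance heuristic for a shared width, and the batch ridge solution for the weights; the heuristics are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

def two_stage_train(X, y, m, n_iter=20, lam=1e-3, rng=np.random.default_rng(0)):
    # Stage 1: choose centers by k-means-style iterations; set one common width.
    centers = X[rng.choice(len(X), size=m, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # heuristic shared width: half the average pairwise distance between centers
    sigma = np.mean(np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1))) / 2
    # Stage 2: fix the kernels and solve the linear weights in batch (ridge solution).
    Phi = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)
    return centers, sigma, w

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 1)); y = np.sin(3 * X[:, 0])
centers, sigma, w = two_stage_train(X, y, m=8)
```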
Unsupervised Training
Methods:
• Subset selection
  – Random subset selection
  – Forward selection
  – Backward elimination
• Clustering algorithms
  – K-means
  – LVQ
• Mixture models
  – GMM
Subset Selection: Random Subset Selection
• Randomly choose a subset of points from the training set as centers.
• Sensitive to the initially chosen points.
• Use adaptive techniques to tune the centers, the widths, and the number of points.
Clustering Algorithms
(Figure: data points partitioned into four clusters, 1 to 4.)
Introduction to Radial Basis Function Networks
Bias-Variance Dilemma
Goal Revisited
• Ultimate goal (generalization): minimize the prediction error.
• Goal of our learning procedure: minimize the empirical error.
Badness of Fit
Underfitting
– A model (e.g., a network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, leading to underfitting.
– Produces excessive bias in the outputs.
Overfitting
– A model (e.g., a network) that is too complex may fit the noise, not just the signal, leading to overfitting.
– Produces excessive variance in the outputs.
Underfitting/Overfitting Avoidance
• Model selection
• Jittering
• Early stopping
• Weight decay
  – Regularization
  – Ridge regression
• Bayesian learning
• Combining networks
Best Way to Avoid Overfitting
• Use lots of training data, e.g.:
  – 30 times as many training cases as there are weights in the network;
  – for noise-free data, 5 times as many training cases as weights may be sufficient.
• Don't arbitrarily reduce the number of weights, or you risk underfitting.
Badness of Fit
(Figure: example fits illustrating an underfit model and an overfit model.)
Bias-Variance Dilemma
Underfit: large bias, small variance. Overfit: small bias, large variance.
However, it's not really a dilemma.
Bias-Variance Dilemma
More on overfitting:
– Overfitting can easily lead to predictions far beyond the range of the training data.
– It can produce wild predictions in multilayer perceptrons even with noise-free data.
Bias-Variance Dilemma
It's not really a dilemma.
(Figure: as the model fits the data more closely, bias decreases and variance increases; fitting too little underfits, fitting too much overfits.)
Bias-Variance Dilemma
(Figure: a set of candidate functions, e.g., determined by the number of hidden nodes used; the true model plus noise; and the solutions obtained with training sets 1, 2, and 3, each with its own bias. What are the mean and the variance of the bias over training sets?)
Model Selection
(Figure: the set of candidate functions, e.g., determined by the number of hidden nodes used; the true model plus noise; and the resulting bias and variance.)
– Reduce the effective number of parameters.
– Reduce the number of hidden nodes.
Bias-Variance Dilemma
(Figure: the true model $g(\mathbf{x})$, the observations $y = g(\mathbf{x}) + \varepsilon$, and an approximating function $f(\mathbf{x})$.)
Goal:
$$\min\; E\big[(y - f(\mathbf{x}))^2\big]$$
Bias-Variance Dilemma
Goal: $\min E\big[(y - f(\mathbf{x}))^2\big]$, with $y = g(\mathbf{x}) + \varepsilon$.
$$E\big[(y - f(\mathbf{x}))^2\big] = E\big[(g(\mathbf{x}) + \varepsilon - f(\mathbf{x}))^2\big]
= E\big[(g(\mathbf{x}) - f(\mathbf{x}))^2\big] + E[\varepsilon^2] + 2\,E\big[\varepsilon\,(g(\mathbf{x}) - f(\mathbf{x}))\big]$$
The last term is zero and $E[\varepsilon^2]$ is a constant, so
$$\min\; E\big[(y - f(\mathbf{x}))^2\big] \iff \min\; E\big[(g(\mathbf{x}) - f(\mathbf{x}))^2\big].$$
Bias-Variance Dilemma
$$E\big[(g(\mathbf{x}) - f(\mathbf{x}))^2\big] = E\Big[\big(g(\mathbf{x}) - E[f(\mathbf{x})] + E[f(\mathbf{x})] - f(\mathbf{x})\big)^2\Big]$$
$$= E\Big[\big(g(\mathbf{x}) - E[f(\mathbf{x})]\big)^2\Big] + E\Big[\big(E[f(\mathbf{x})] - f(\mathbf{x})\big)^2\Big]
+ 2\,E\Big[\big(g(\mathbf{x}) - E[f(\mathbf{x})]\big)\big(E[f(\mathbf{x})] - f(\mathbf{x})\big)\Big],$$
and the cross term vanishes because $E\big[E[f(\mathbf{x})] - f(\mathbf{x})\big] = 0$. Therefore
$$E\big[(y - f(\mathbf{x}))^2\big] = E[\varepsilon^2] + E\big[(g(\mathbf{x}) - f(\mathbf{x}))^2\big].$$
Bias-Variance Dilemma
$$E\big[(y - f(\mathbf{x}))^2\big]
= \underbrace{E[\varepsilon^2]}_{\text{noise}}
+ \underbrace{\big(E[f(\mathbf{x})] - g(\mathbf{x})\big)^2}_{\text{bias}^2}
+ \underbrace{E\Big[\big(f(\mathbf{x}) - E[f(\mathbf{x})]\big)^2\Big]}_{\text{variance}}$$
The noise term cannot be minimized; the goal is to minimize both the bias$^2$ and the variance.
Model Complexity vs. Bias-Variance
$$E\big[(y - f(\mathbf{x}))^2\big] = E[\varepsilon^2] + \big(E[f(\mathbf{x})] - g(\mathbf{x})\big)^2 + E\Big[\big(f(\mathbf{x}) - E[f(\mathbf{x})]\big)^2\Big]$$
(Figure: as model complexity (capacity) increases, the bias$^2$ term decreases while the variance term increases; the noise level stays constant.)
Example (Polynomial Fits)
Target function: $y = x^2\,e^{-16x^2}$.
(Figure: polynomial fits of degree 1, 5, 10, and 15 to the sampled data.)
Introduction to Radial Basis Function Networks
The Effective Number of Parameters
Variance Estimation
Mean: $\mu$ (in general, not available).
Variance:
$$\hat{\sigma}^2 = \frac{1}{p}\sum_{i=1}^{p}(x_i - \mu)^2$$
Variance Estimation
Mean:
$$\hat{\mu} = \bar{x} = \frac{1}{p}\sum_{i=1}^{p} x_i$$
Variance:
$$\hat{\sigma}^2 = s^2 = \frac{1}{p-1}\sum_{i=1}^{p}(x_i - \bar{x})^2$$
We lose 1 degree of freedom.
Simple Linear Regression
$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
(Figure: scatter of $(x, y)$ data around the regression line.)
Simple Linear Regression
$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
Minimize
$$\mathrm{SSE} = \sum_{i=1}^{p}\big(y_i - \hat{y}_i\big)^2$$
(Figure: fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.)
Mean Squared Error (MSE)
$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{p - 2}$$
We lose 2 degrees of freedom ($\hat{\beta}_0$ and $\hat{\beta}_1$).
Variance Estimation
$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{p - m}$$
We lose $m$ degrees of freedom, where $m$ is the number of parameters of the model.
The Number of Parameters
$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$
Number of degrees of freedom: $m$ (the weights $w_1, \dots, w_m$).
The Effective Number of Parameters ($\gamma$)
$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$
The projection matrix: $\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T$, with $\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \boldsymbol{\Lambda}$.
When $\boldsymbol{\Lambda} = \mathbf{0}$:
$$\gamma = p - \mathrm{trace}(\mathbf{P}) = m$$
Facts: $\mathrm{trace}(\mathbf{A} + \mathbf{B}) = \mathrm{trace}(\mathbf{A}) + \mathrm{trace}(\mathbf{B})$ and $\mathrm{trace}(\mathbf{A}\mathbf{B}) = \mathrm{trace}(\mathbf{B}\mathbf{A})$.
Proof (for $\boldsymbol{\Lambda} = \mathbf{0}$, so $\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi}$):
$$\mathrm{trace}(\mathbf{P}) = \mathrm{trace}\big(\mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T\big)
= p - \mathrm{trace}\big(\boldsymbol{\Phi}(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\big)
= p - \mathrm{trace}\big((\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)
= p - \mathrm{trace}(\mathbf{I}_m) = p - m.$$
Hence $\gamma = p - \mathrm{trace}(\mathbf{P}) = m$.
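A quick numerical check of $\gamma = p - \mathrm{trace}(\mathbf{P})$; a toy random matrix stands in for the kernel design matrix $\boldsymbol{\Phi}$:

```python
import numpy as np

def effective_params(Phi, lam_vec):
    # gamma = p - trace(P),  P = I_p - Phi A^{-1} Phi^T,  A = Phi^T Phi + Lambda
    p, m = Phi.shape
    A = Phi.T @ Phi + np.diag(lam_vec)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    return p - np.trace(P)

rng = np.random.default_rng(2)
Phi = rng.standard_normal((40, 6))
print(effective_params(Phi, np.zeros(6)))       # equals m = 6 when Lambda = 0
print(effective_params(Phi, np.full(6, 10.0)))  # less than 6 with regularization
```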
Regularization
$$C = \underbrace{\sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)^2}_{\text{empirical error (SSE)}}
+ \underbrace{\sum_{i=1}^{m}\lambda_i w_i^2}_{\text{model's complexity penalty}}$$
The penalty term penalizes models with large weights.
The effective number of parameters: $\gamma = p - \mathrm{trace}(\mathbf{P})$.
Without penalty ($\lambda_i = 0$), there are $m$ degrees of freedom available to minimize the SSE (cost), and the effective number of parameters is $\gamma = m$.
With penalty ($\lambda_i > 0$), the freedom to minimize the SSE is reduced, and the effective number of parameters becomes $\gamma < m$.
Variance Estimation
$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{p - \gamma}$$
We lose $\gamma$ degrees of freedom, where $\gamma = p - \mathrm{trace}(\mathbf{P})$ is the effective number of parameters.
Since $p - \gamma = \mathrm{trace}(\mathbf{P})$, equivalently
$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{\mathrm{trace}(\mathbf{P})}.$$
Introduction to Radial Basis Function Networks
Model Selection
Model Selection
• Goal: choose the fittest model.
• Criteria: least prediction error.
• Main tools (to estimate model fitness):
  – Cross-validation
  – Projection matrix
• Methods:
  – Weight decay (ridge regression)
  – Pruning and growing RBFNs
Empirical Error vs. Model Fitness
• Ultimate goal (generalization): minimize the prediction error.
• Goal of our learning procedure: minimize the empirical error (MSE).
Estimating Prediction Error
• When you have plenty of data, use independent test sets: e.g., use the same training set to train different models, and choose the best model by comparing them on the test set.
• When data is scarce, use:
  – Cross-validation
  – Bootstrap
Cross Validation
Simplest and most widely used method for estimating prediction error.
Partition the original set in several different ways and compute an average score over the different partitions, e.g.:
– K-fold cross-validation
– Leave-one-out cross-validation
– Generalized cross-validation
K-Fold CV
Split the set $D$ of available input-output patterns into $k$ mutually exclusive subsets $D_1, D_2, \dots, D_k$.
Train and test the learning algorithm $k$ times; each time it is trained on $D \setminus D_i$ and tested on $D_i$.
K-Fold CV
(Figure: the available data are split into blocks $D_1, D_2, D_3, \dots, D_k$; in each of the $k$ rounds, one block is the test set and the remaining blocks form the training set, yielding an estimate of $\sigma^2$.)
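A generic K-fold CV sketch; `fit` and `predict` are placeholder callables for whatever model is being scored (e.g., an RBFN trained as in the earlier sketches):

```python
import numpy as np

def k_fold_cv(X, y, k, fit, predict):
    # fit(X_train, y_train) -> model;  predict(model, X_test) -> y_hat
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, k)
    mse = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        mse.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return np.mean(mse)   # CV estimate of the prediction-error variance
```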
Leave-One-Out CV
Split the $p$ available input-output patterns into a training set of size $p-1$ and a test set of size 1.
Average the squared error on the left-out pattern over the $p$ possible ways of partitioning.
A special case of k-fold CV.
Error Variance Predicted by LOO
A special case of k-fold CV.
$$\hat{\sigma}^2_{LOO} = \frac{1}{p}\sum_{i=1}^{p}\big(y_i - f_i(\mathbf{x}_i)\big)^2$$
where
– $D = \{(\mathbf{x}_k, y_k): k = 1, 2, \dots, p\}$ is the set of available input-output patterns;
– $D_i = D \setminus \{(\mathbf{x}_i, y_i)\}$, $i = 1, \dots, p$, are the training sets of LOO;
– $f_i$ is the function learned using $D_i$ as the training set;
– $\big(y_i - f_i(\mathbf{x}_i)\big)^2$ is the squared error on the left-out element.
For a given model, $f_i$ is the function with least empirical error on $D_i$, and $\hat{\sigma}^2_{LOO}$ serves as an index of the model's fitness: we want to find a model that also minimizes it.
Error Variance Predicted by LOO
$$\hat{\sigma}^2_{LOO} = \frac{1}{p}\sum_{i=1}^{p}\big(y_i - f_i(\mathbf{x}_i)\big)^2
= \frac{1}{p}\,\mathbf{y}^T\mathbf{P}\,\big(\mathrm{diag}(\mathbf{P})\big)^{-2}\,\mathbf{P}\,\mathbf{y}$$
Generalized Cross-Validation
$$\hat{\sigma}^2_{LOO} = \frac{1}{p}\,\mathbf{y}^T\mathbf{P}\,\big(\mathrm{diag}(\mathbf{P})\big)^{-2}\,\mathbf{P}\,\mathbf{y}$$
Approximating $\mathrm{diag}(\mathbf{P}) \approx \dfrac{\mathrm{trace}(\mathbf{P})}{p}\,\mathbf{I}_p$ gives
$$\hat{\sigma}^2_{GCV} = \frac{p\,\mathbf{y}^T\mathbf{P}^2\mathbf{y}}{\big(\mathrm{trace}\,\mathbf{P}\big)^2}.$$
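Both estimates follow directly from $\mathbf{P}$ and $\mathbf{y}$; a minimal sketch, assuming $\mathbf{P}$ has been formed as in the earlier examples:

```python
import numpy as np

def loo_and_gcv(P, y):
    # sigma2_LOO = (1/p) y^T P diag(P)^{-2} P y ; sigma2_GCV = p y^T P^2 y / (trace P)^2
    p = len(y)
    ratios = (P @ y) / np.diag(P)            # (P y)_i / P_ii
    sigma2_loo = (ratios @ ratios) / p
    sigma2_gcv = p * ((P @ y) @ (P @ y)) / np.trace(P) ** 2
    return sigma2_loo, sigma2_gcv

rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 5)); y = rng.standard_normal(30)
P = np.eye(30) - Phi @ np.linalg.solve(Phi.T @ Phi + 0.1 * np.eye(5), Phi.T)
print(loo_and_gcv(P, y))
```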
More Criteria Based on CV
UEV (unbiased estimate of variance):
$$\hat{\sigma}^2_{UEV} = \frac{\mathbf{y}^T\mathbf{P}^2\mathbf{y}}{p - \gamma}$$
FPE (final prediction error, Akaike's information criterion):
$$\hat{\sigma}^2_{FPE} = \frac{p + \gamma}{p\,(p - \gamma)}\,\mathbf{y}^T\mathbf{P}^2\mathbf{y}$$
GCV (generalized cross-validation):
$$\hat{\sigma}^2_{GCV} = \frac{p\,\mathbf{y}^T\mathbf{P}^2\mathbf{y}}{\big(\mathrm{trace}\,\mathbf{P}\big)^2}$$
BIC (Bayesian information criterion):
$$\hat{\sigma}^2_{BIC} = \frac{p + (\ln p - 1)\,\gamma}{p\,(p - \gamma)}\,\mathbf{y}^T\mathbf{P}^2\mathbf{y}$$
More Criteria Based on CV
Expressed relative to UEV:
$$\hat{\sigma}^2_{FPE} = \frac{p + \gamma}{p}\,\hat{\sigma}^2_{UEV}, \qquad
\hat{\sigma}^2_{GCV} = \frac{p}{p - \gamma}\,\hat{\sigma}^2_{UEV}, \qquad
\hat{\sigma}^2_{BIC} = \frac{p + (\ln p - 1)\,\gamma}{p}\,\hat{\sigma}^2_{UEV}$$
$$\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$$
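A sketch computing the four estimates from $\mathbf{P}$ and $\mathbf{y}$, following the formulas above:

```python
import numpy as np

def selection_criteria(P, y):
    p = len(y)
    gamma = p - np.trace(P)                  # effective number of parameters
    sse = y @ P @ P @ y                      # y^T P^2 y
    uev = sse / (p - gamma)
    fpe = (p + gamma) / p * uev
    gcv = p / (p - gamma) * uev
    bic = (p + (np.log(p) - 1.0) * gamma) / p * uev
    return uev, fpe, gcv, bic
```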
Regularization
$$C = \underbrace{\sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)^2}_{\text{empirical error (SSE)}}
+ \underbrace{\sum_{i=1}^{m}\lambda_i w_i^2}_{\text{model's complexity penalty}}$$
Standard ridge regression: $\lambda_i = \lambda$ for all $i$, so the penalty becomes $\lambda\sum_{i=1}^{m} w_i^2$.
Solution Review
$$\min\; C = \sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)^2 + \lambda\sum_{i=1}^{m} w_i^2$$
$$\mathbf{w}^* = \mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{y}, \qquad
\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I}_m, \qquad
\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T \;\;\text{(used to compute the model selection criteria)}$$
Example
$$y(x) = \big(1 - x - 2x^2\big)\,e^{-x^2}, \qquad p = 50, \qquad \varepsilon \sim N(0,\,0.2^2)$$
Width of RBF: $r = 0.5$.
(Figures: the four estimates $\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$ evaluated for the example above.)
Optimizing the Regularization Parameter
Re-Estimation Formula
$$\hat{\lambda} = \frac{\mathbf{y}^T\mathbf{P}^2\mathbf{y}\;\mathrm{trace}\big(\mathbf{A}^{-1} - \hat{\lambda}\mathbf{A}^{-2}\big)}{\hat{\mathbf{w}}^T\mathbf{A}^{-1}\hat{\mathbf{w}}\;\mathrm{trace}(\mathbf{P})}$$
Local Ridge Regression
Re-Estimation Formula
$$\hat{\lambda} = \frac{\mathbf{y}^T\mathbf{P}^2\mathbf{y}\;\mathrm{trace}\big(\mathbf{A}^{-1} - \hat{\lambda}\mathbf{A}^{-2}\big)}{\hat{\mathbf{w}}^T\mathbf{A}^{-1}\hat{\mathbf{w}}\;\mathrm{trace}(\mathbf{P})}$$
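A sketch of the fixed-point iteration implied by the re-estimation formula as reconstructed above; the convergence point depends on the starting value, as the examples below illustrate:

```python
import numpy as np

def reestimate_lambda(Phi, y, lam0=1e-2, n_iter=50):
    p, m = Phi.shape
    lam = lam0
    for _ in range(n_iter):
        A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
        P = np.eye(p) - Phi @ A_inv @ Phi.T
        w = A_inv @ Phi.T @ y
        # lambda <- y^T P^2 y * trace(A^-1 - lambda A^-2) / (w^T A^-1 w * trace(P))
        lam = (y @ P @ P @ y) * np.trace(A_inv - lam * A_inv @ A_inv) \
              / ((w @ A_inv @ w) * np.trace(P))
    return lam
```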
Example
$$y = \sin(12x), \qquad \varepsilon \sim N(0,\,0.1^2), \qquad m = p = 50$$
Width of RBF: $r = 0.05$. Effective number of parameters: $\gamma = p - \mathrm{trace}(\mathbf{P})$.
Example
Same setting: $y = \sin(12x)$, $\varepsilon \sim N(0,\,0.1^2)$, $m = p = 50$, RBF width $r = 0.05$.
There are two local minima. Using the above re-estimation formula, the iteration gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.
Example
Starting from $\hat{\lambda}(0) = 0.01$, the iteration converges to $\lim_{t\to\infty}\hat{\lambda}(t) = 0.1$.
Example
Starting from $\hat{\lambda}(0) = 10^{-5}$, it converges instead to $\lim_{t\to\infty}\hat{\lambda}(t) = 2.1\times 10^{-4}$.
Example
Same setting as above.
(Figure: the RMSE, root mean squared error, of the fit against the true function; in a real case it is not available.)
Local Ridge Regression
Standard ridge regression:
$$\min\; C = \sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)^2 + \lambda\sum_{i=1}^{m} w_i^2$$
Local ridge regression:
$$\min\; C = \sum_{k=1}^{p}\Big(y^{(k)} - f(\mathbf{x}^{(k)})\Big)^2 + \sum_{i=1}^{m}\lambda_i w_i^2$$
$\lambda_j \to \infty$ implies that $\phi_j(\cdot)$ can be removed.
The Solutions
$$\mathbf{w}^* = \mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{y}, \qquad
\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T \;\;\text{(used to compute the model selection criteria)}$$
– Linear regression: $\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi}$
– Standard ridge regression: $\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I}_m$
– Local ridge regression: $\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \boldsymbol{\Lambda}$
Optimizing the Regularization Parameters
Incremental Operation
$\mathbf{P}$: the current projection matrix. $\mathbf{P}_j$: the projection matrix obtained by removing $\phi_j(\cdot)$.
$$\mathbf{P} = \mathbf{P}_j - \frac{\mathbf{P}_j\boldsymbol{\varphi}_j\boldsymbol{\varphi}_j^T\mathbf{P}_j}{\lambda_j + \boldsymbol{\varphi}_j^T\mathbf{P}_j\boldsymbol{\varphi}_j}$$
$$\hat{\sigma}^2_{GCV} = \frac{p\,\mathbf{y}^T\mathbf{P}^2\mathbf{y}}{\big(\mathrm{trace}\,\mathbf{P}\big)^2}$$
Optimizing the Regularization Parameters
Solve
$$\frac{\partial\hat{\sigma}^2_{GCV}}{\partial\lambda_j} = 0 \qquad \text{subject to } \lambda_j \ge 0,$$
substituting the incremental expression for $\mathbf{P}$ in terms of $\mathbf{P}_j$ given above.
Optimizing the Regularization Parameters
This yields a closed-form solution for $\hat{\lambda}_j$ in terms of the scalars $\mathbf{y}^T\mathbf{P}_j^2\mathbf{y}$, $\mathbf{y}^T\mathbf{P}_j\boldsymbol{\varphi}_j$, $\boldsymbol{\varphi}_j^T\mathbf{P}_j\boldsymbol{\varphi}_j$, $\boldsymbol{\varphi}_j^T\mathbf{P}_j^2\boldsymbol{\varphi}_j$, and $\mathrm{trace}(\mathbf{P}_j)$. Depending on the sign conditions among these quantities, the optimum $\hat{\lambda}_j$ is a finite positive value, $0$, or $\infty$; when $\hat{\lambda}_j = \infty$, the kernel $\phi_j(\cdot)$ is removed.
The Algorithm
• Initialize the $\lambda_i$'s, e.g., by performing standard ridge regression.
• Repeat the following until GCV converges:
  – Randomly select $j$ and compute $\hat{\lambda}_j$.
  – Perform local ridge regression with $\hat{\lambda}_j$.
  – If GCV is reduced, keep $\hat{\lambda}_j$; if $\hat{\lambda}_j = \infty$, remove $\phi_j(\cdot)$.
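A rough sketch of this loop; since the closed form for $\hat{\lambda}_j$ is not reproduced above, the GCV-optimal value is located here by a simple grid search, with a very large $\lambda_j$ standing in for $\lambda_j = \infty$ (kernel removal):

```python
import numpy as np

def gcv_score(Phi, y, lam):
    p, m = Phi.shape
    A = Phi.T @ Phi + np.diag(lam)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    return p * (y @ P @ P @ y) / np.trace(P) ** 2

def local_ridge(Phi, y, lam, n_sweeps=5):
    # Grid search over candidate lambda_j values; 1e12 plays the role of infinity.
    lam = np.asarray(lam, float).copy()
    candidates = np.concatenate([np.logspace(-8, 8, 33), [1e12]])
    rng = np.random.default_rng(0)
    for _ in range(n_sweeps):
        for j in rng.permutation(Phi.shape[1]):
            best = gcv_score(Phi, y, lam)
            for cand in candidates:
                trial = lam.copy(); trial[j] = cand
                score = gcv_score(Phi, y, trial)
                if score < best:          # keep the change only if GCV is reduced
                    best, lam = score, trial
    return lam
```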