CSCI 5521: Paul Schrater
Radial Basis Function Networks
Introduction

In this lecture we will look at RBFs: networks where the activation of the hidden units is based on the distance between the input vector and a prototype vector.

Radial basis functions have a number of interesting properties:
Strong connections to other disciplines: function approximation, regularization theory, density estimation, and interpolation in the presence of noise [Bishop, 1995].
RBFs allow for a straightforward interpretation of the internal representation produced by the hidden layer.
RBFs have training algorithms that are significantly faster than those for MLPs.
RBFs can be implemented as support vector machines.
Radial Basis Functions

Discriminant function flexibility: nonlinear, but with a set of linear parameters at each layer; provably general function approximators given sufficient nodes.
Error function: mixed. Different error criteria are typically used for the hidden vs. output layers (hidden: input approximation; output: training error).
Optimization: simple least squares; with hybrid training the solution is unique.
Two-Layer Nonlinear NN

The nonlinear functions are radially symmetric kernels with free parameters.
Input to Hidden Mapping
Each region of feature space by means of a radially symmetricfunction Activation of a hidden unit is determined by the DISTANCE between the input
vector x and a prototype vector
Choice of radial basis function Although several forms of radial basis may be used, Gaussian kernels are
most commonly used The Gaussian kernel may have a fullcovariance structure, which requires D(D+3)/2
parameters to be learned
or a diagonal structure, with only (D+1) independent parameters
In practice, a tradeoff exists between using a small number of basis with manyparameters or a larger number of less flexible functions [Bishop, 1995]
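A minimal sketch of the (D+1)-parameter case, a Gaussian with a single shared width; the names gaussian_rbf, mu, and sigma are illustrative, not from the slides:

import numpy as np

def gaussian_rbf(x, mu, sigma):
    # Spherical Gaussian basis: D center components plus one shared width,
    # i.e. the (D+1)-parameter case mentioned above.
    d2 = np.sum((x - mu) ** 2)              # squared distance to the prototype
    return np.exp(-d2 / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
print(gaussian_rbf(x, mu, sigma=1.5))       # activation decays with distance from mu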
Builds up a function out of Spheres
RBF vs. NN
RBFs have their origins in techniques for performing exact interpolation. These techniques place a basis function at each of the training examples and compute the coefficients w_i so that the mixture model has zero error at the examples.
Formally: Exact Interpolation
Goal: find a function h(x) such that h(x_i) = t_i for inputs x_i and targets t_i.

y = h(x) = \sum_{i=1}^{n} w_i \, \phi(\lVert x - x_i \rVert)

Form

\begin{bmatrix}
\phi_{11} & \phi_{12} & \cdots & \phi_{1n} \\
\phi_{21} & \phi_{22} & \cdots & \phi_{2n} \\
\vdots & \vdots & & \vdots \\
\phi_{n1} & \phi_{n2} & \cdots & \phi_{nn}
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}
=
\begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_n \end{bmatrix},
\qquad \phi_{ij} = \phi(\lVert x_i - x_j \rVert)

Solve for w:

\Phi \vec{w} = \vec{t}, \qquad \vec{w} = (\Phi^T \Phi)^{-1} \Phi^T \vec{t}
Recall the kernel trick.
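A minimal numpy sketch of the exact-interpolation construction, assuming Gaussian basis functions (the function names and the sigma value are illustrative):

import numpy as np

def exact_interpolation_weights(X, t, sigma=1.0):
    # One Gaussian basis per training point; solve Phi w = t exactly.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise ||x_i - x_j||^2
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))                      # Phi_ij = phi(||x_i - x_j||)
    return np.linalg.solve(Phi, t)                              # nonsingular if the x_i are distinct

def interpolate(x_new, X, w, sigma=1.0):
    d2 = np.sum((x_new - X) ** 2, axis=-1)                      # distances to every training point
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ w                 # h(x) = sum_i w_i phi(||x - x_i||)

X = np.array([[0.0], [1.0], [2.0]])
t = np.array([0.0, 1.0, 0.0])
w = exact_interpolation_weights(X, t)
print(interpolate(np.array([1.0]), X, w))                       # ~1.0: zero error at the examples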
Solving Conditions

Micchelli's Theorem: if the points x_i are distinct, then the matrix Φ will be nonsingular.

Mhaskar and Micchelli, "Approximation by superposition of sigmoidal and radial basis functions," Advances in Applied Mathematics, 13, 350-373, 1992.
Radial Basis Function Networks

With n outputs and m basis functions:

y_k = h_k(x) = \sum_{i=1}^{m} w_{ki} \, \phi_i(x) + w_{k0},
\qquad \text{e.g. } \phi_i(x) = \exp\!\left( -\frac{\lVert x - \mu_i \rVert^2}{2 \sigma_i^2} \right)

= \sum_{i=0}^{m} w_{ki} \, \phi_i(x), \qquad \phi_0(x) = 1

Rewrite this as:

\vec{y} = W \vec{\phi}(x),
\qquad \vec{\phi}(x) = \begin{bmatrix} \phi_0(x) \\ \vdots \\ \phi_m(x) \end{bmatrix}
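A forward-pass sketch of this two-layer form, assuming Gaussian basis functions; the names below are illustrative:

import numpy as np

def rbf_forward(X, centers, sigmas, W):
    # y = W phi(x), with phi_0(x) = 1 absorbing the bias term w_k0.
    # X: (L, D) inputs, centers: (m, D), sigmas: (m,), W: (n, m+1).
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)  # (L, m) squared distances
    Phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))                  # hidden-unit activations
    Phi = np.hstack([np.ones((X.shape[0], 1)), Phi])                  # prepend phi_0 = 1
    return Phi @ W.T                                                  # (L, n) network outputs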
Learning

RBFs are commonly trained following a hybrid procedure that operates in two stages:

Unsupervised selection of RBF centers. RBF centers are selected so as to match the distribution of training examples in the input feature space. This is the critical step in training, normally performed in a slow, iterative manner; a number of strategies are used to solve this problem.

Supervised computation of output weights. Hidden-to-output weight vectors are determined so as to minimize the sum-squared error between the RBF outputs and the desired targets. Since the outputs are linear, the optimal weights can be computed using fast linear least-squares methods.
Least-Squares Solution for Output Weights

Given L input vectors x_l with labels t_l, form the least-squares error function:

E = \sum_{l=1}^{L} \sum_{k=1}^{n} \left\{ y_k(x_l) - t_{lk} \right\}^2 , \qquad t_{lk} \text{ is the } k\text{th value of the target vector}

E = \sum_{l=1}^{L} (\vec{y}(x_l) - \vec{t}_l)^T (\vec{y}(x_l) - \vec{t}_l)
  = \sum_{l=1}^{L} (\vec{\phi}(x_l) W^T - \vec{t}_l)^T (\vec{\phi}(x_l) W^T - \vec{t}_l)

Stacking the rows \vec{\phi}(x_l) into \Phi and the targets \vec{t}_l into T:

E = (\Phi W^T - T)^T (\Phi W^T - T) = W \Phi^T \Phi W^T - 2 W \Phi^T T + T^T T

\frac{\partial E}{\partial W} = 0 = 2 \Phi^T \Phi W^T - 2 \Phi^T T
\quad \Rightarrow \quad
W^T = (\Phi^T \Phi)^{-1} \Phi^T T
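A sketch of the closed-form weight solve in numpy; lstsq is used in place of the explicit inverse, and the names are illustrative:

import numpy as np

def fit_output_weights(Phi, T):
    # Solve W^T = (Phi^T Phi)^{-1} Phi^T T for the hidden-to-output weights.
    # Phi: (L, m+1) matrix of basis activations (first column phi_0 = 1),
    # T:   (L, n) matrix whose rows are the target vectors t_l.
    Wt, *_ = np.linalg.lstsq(Phi, T, rcond=None)   # numerically safer than the explicit inverse
    return Wt.T                                    # W: (n, m+1)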
Solving for the Input Parameters

We now have a linear solution, if we know the input parameters. How do we set the input parameters?
Gradient descent, like backprop.
Density-estimation viewpoint: by looking at the meaning of the RBF network for classification, the input parameters can be estimated via density estimation and/or clustering techniques.
We will skip these in lecture.
Unsupervised Methods Overview

Random selection of centers: the simplest approach is to randomly select a number of training examples as RBF centers. This method has the advantage of being very fast, but the network will likely require an excessive number of centers. Once the center positions have been selected, the spread parameters σ_j can be estimated, for instance, from the average distance between neighboring centers.

Clustering: alternatively, RBF centers may be obtained with a clustering procedure such as the k-means algorithm (see the sketch after this list). The spread parameters can be computed as before, or from the sample covariance of the examples of each cluster.

Density estimation: the position of the RBF centers may also be obtained by modeling the feature-space density with a Gaussian Mixture Model, using the Expectation-Maximization algorithm. The spread parameters for each center are automatically obtained from the covariance matrices of the corresponding Gaussian components.
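A rough sketch of the clustering option: plain k-means followed by a nearest-center spread heuristic (the names and the particular heuristic are illustrative choices, not from the slides):

import numpy as np

def kmeans_centers(X, m, n_iter=50, seed=0):
    # Pick m RBF centers with plain k-means, then set each spread from the
    # distance to the nearest other center (one common heuristic).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(d2, axis=1)                      # assign points to nearest center
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)    # move center to cluster mean
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    sigmas = dists.min(axis=1)                              # spread = nearest-center distance
    return centers, sigmas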
Probabilistic Interpretation of RBFs for Classification

Expand the class posterior by introducing a mixture over the hidden basis functions.
Probabilistic RBF, conditional form

In this form, the basis function outputs are the posterior probabilities of the jth feature (hidden unit) given the input, and the weights are the class probabilities given the jth feature value.
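Written out, this is the standard mixture decomposition the slide describes; the P(j|x) and P(C_k|j) notation is assumed here rather than copied from the slide:

P(C_k \mid x) = \sum_{j=1}^{m} P(C_k \mid j)\, P(j \mid x) = \sum_{j=1}^{m} w_{kj}\, \phi_j(x),
\qquad \phi_j(x) = P(j \mid x), \quad w_{kj} = P(C_k \mid j)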
Regularized Spline Fits

Given

\Phi = \begin{bmatrix}
\phi_1(X_1) & \cdots & \phi_N(X_1) \\
\vdots & & \vdots \\
\phi_1(X_M) & \cdots & \phi_N(X_M)
\end{bmatrix},
\qquad y_{pred} = \Phi \vec{w}

make a penalty on large curvature:

\Omega_{ij} = \int \left( \frac{d^2 \phi_i(x)}{dx^2} \right) \left( \frac{d^2 \phi_j(x)}{dx^2} \right) dx

Minimize

L = (\vec{y} - \Phi \vec{w})^T (\vec{y} - \Phi \vec{w}) + \lambda \, \vec{w}^T \Omega \vec{w}

Solution (generalized ridge regression):

\vec{w} = (\Phi^T \Phi + \lambda \Omega)^{-1} \Phi^T \vec{y}
Equivalent Kernel

The linear regression solution admits a kernel interpretation:

\vec{w} = (\Phi^T \Phi + \lambda \Omega)^{-1} \Phi^T \vec{y}

y_{pred}(x) = \vec{\phi}(x)^T \vec{w} = \vec{\phi}(x)^T (\Phi^T \Phi + \lambda \Omega)^{-1} \Phi^T \vec{y}

Let (\Phi^T \Phi + \lambda \Omega)^{-1} = S_\lambda . Then

y_{pred}(x) = \vec{\phi}(x)^T S_\lambda \sum_i \vec{\phi}(x_i)\, y_i
= \sum_i \vec{\phi}(x)^T S_\lambda \vec{\phi}(x_i)\, y_i
= \sum_i K(x, x_i)\, y_i
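A sketch of the equivalent-kernel computation under the same assumptions, with illustrative names:

import numpy as np

def equivalent_kernel_row(phi_x, Phi, Omega, lam):
    # K(x, x_i) = phi(x)^T S_lambda phi(x_i), with S_lambda = (Phi^T Phi + lam*Omega)^{-1}.
    # phi_x: (N,) features of the query point, Phi: (L, N) features of the training points.
    S = np.linalg.inv(Phi.T @ Phi + lam * Omega)
    return phi_x @ S @ Phi.T          # length-L vector of kernel weights K(x, x_i)

# The prediction is then a weighted sum of the observed targets:
#   y_pred(x) = sum_i K(x, x_i) * y_i == equivalent_kernel_row(phi_x, Phi, Omega, lam) @ y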
Eigenanalysis and DF

An eigenanalysis of the effective kernel gives the effective degrees of freedom (DF), i.e. the effective number of basis functions.

Evaluate K at all pairs of points:

K_{ij} = K(x_i, x_j)

Eigenanalysis:

K = V D V^{-1} = U S U^T
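A sketch of one common way to read off the effective DF: the sum of the eigenvalues (the trace) of the evaluated kernel matrix. Taking the trace as the DF measure is an assumption here, not stated on the slide:

import numpy as np

def effective_df(K):
    # Effective degrees of freedom from an eigenanalysis of K_ij = K(x_i, x_j):
    # taken here as the sum of the eigenvalues, i.e. the trace of the smoother.
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)   # symmetrize for numerical stability
    return eigvals.sum()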
Equivalent Basis Functions
Computing Error Bars on Predictions
It is important to be able to compute error bars on linear regression predictions. We can do that by propagating the error in the measured values to the predicted values.
Given the Gram matrix

\Phi = \begin{bmatrix}
\phi_1(X_1) & \cdots & \phi_N(X_1) \\
\vdots & & \vdots \\
\phi_1(X_M) & \cdots & \phi_N(X_M)
\end{bmatrix}

Simple least squares:

\vec{w} = (\Phi^T \Phi)^{-1} \Phi^T \vec{y}

Predictions:

y_{pred} = \Phi \vec{w} = \Phi (\Phi^T \Phi)^{-1} \Phi^T \vec{y} = S \vec{y}

Error:

\mathrm{cov}[y_{pred}] = \mathrm{cov}[S \vec{y}] = S \, \mathrm{cov}[\vec{y}] \, S^T = S (\sigma^2 I) S^T = \sigma^2 S S^T
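A sketch of the error-bar computation, assuming cov[y] = σ²I as above (the function name and σ² argument are illustrative):

import numpy as np

def prediction_error_bars(Phi, sigma2):
    # Propagate the measurement noise cov[y] = sigma2 * I through the smoother
    # S = Phi (Phi^T Phi)^{-1} Phi^T, giving cov[y_pred] = sigma2 * S S^T.
    S = Phi @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T
    cov_pred = sigma2 * S @ S.T
    return np.sqrt(np.diag(cov_pred))   # one standard-error bar per prediction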