Radial Basis Function Networks
Radial Basis Function Networks
- A special type of ANN with three layers: input layer, hidden layer, output layer
- The mapping from the input to the hidden layer is nonlinear
- The mapping from the hidden to the output layer is linear
PR , ANN, & ML 2
Comparison

    Multi-layer perceptron        RBF Networks
    ----------------------        ------------
    Multiple hidden layers        Single hidden layer
    Nonlinear mapping             Nonlinear + linear
    W: inner product              W: distance
    Global mapping                Local mapping
    Warp classifiers              Warp data
    Stochastic approximation      Curve fitting
Another View: Curve Fitting
- We try to estimate a mapping from patterns into classes: f(patterns) -> classes, f(X) -> d
- Patterns are represented as feature vectors X
- Classes are decisions d
- Training samples: f(Xi) = di, i = 1, ..., n
- Interpolation of f based on the samples
[Figure: training samples d_i plotted over the (x1, x2) feature plane, with an interpolated surface through them]
Yet Another View: Warping Data
- If the problem is not linearly separable, an MLP will use multiple neurons to define complicated decision boundaries (warp classifiers)
- An alternative is to warp the data into a higher-dimensional space where they are much more likely to be linearly separable (a single perceptron will do)
- This is very similar to the idea of the Support Vector Machine
Example: XOR
Warp the XOR inputs with two Gaussian basis functions centered at (1,1) and (0,0):

    phi1(x) = e^(-||x - [1,1]^t||^2)
    phi2(x) = e^(-||x - [0,0]^t||^2)

In the (phi1, phi2) plane, the four points (0,0), (1,1), (0,1), (1,0) become linearly separable.
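The warping above can be checked numerically. This is a minimal sketch (not from the slides) that applies the two Gaussian basis functions to the four XOR points; in the warped coordinates the class-0 points (0,0) and (1,1) land above the line phi1 + phi2 = 1 and the class-1 points below it, so a single linear unit suffices.

```python
import numpy as np

# The four XOR inputs and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])  # XOR

# phi1(x) = exp(-||x - [1,1]||^2), phi2(x) = exp(-||x - [0,0]||^2)
t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
phi1 = np.exp(-np.sum((X - t1) ** 2, axis=1))
phi2 = np.exp(-np.sum((X - t2) ** 2, axis=1))

# In the (phi1, phi2) plane, phi1 + phi2 separates the two classes linearly.
for x, p1, p2, y in zip(X, phi1, phi2, labels):
    print(x, round(p1, 3), round(p2, 3), y)
```

Evaluating the sums shows phi1 + phi2 ≈ 1.135 for the class-0 points and ≈ 0.736 for the class-1 points, so the line phi1 + phi2 = 1 separates them.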
More Examples
A Pure Interpolation Approach
- Given: (Xi, di), i = 1, ..., n
- Desired: f(Xi) = di
- Solution: f(X), with f(Xi) = di
- Radial basis function solution:

    f(X) = sum_i wi phi(X, Xi)

- phi(X, Xi) is the general form; shift invariance requires phi(X - Xi); rotation invariance requires phi(||X - Xi||)
- Examples:

    Multiquadrics:          phi(r) = (r^2 + c^2)^(1/2)
    Inverse multiquadrics:  phi(r) = (r^2 + c^2)^(-1/2)
    Gaussian:               phi(r) = e^(-r^2 / (2 sigma^2))
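The three basis functions listed above can be written directly. A small sketch, with illustrative values for the shape parameters c and sigma (the slides leave them free):

```python
import numpy as np

# The three radial basis functions from the slide; c and sigma are free
# shape parameters, set to 1.0 here purely for illustration.
def multiquadric(r, c=1.0):
    return np.sqrt(r ** 2 + c ** 2)

def inverse_multiquadric(r, c=1.0):
    return 1.0 / np.sqrt(r ** 2 + c ** 2)

def gaussian(r, sigma=1.0):
    return np.exp(-r ** 2 / (2 * sigma ** 2))

r = np.linspace(0, 3, 7)
print(gaussian(r))       # decays with distance: a local response
print(multiquadric(r))   # grows with distance: a global response
```

The contrast in the printed values illustrates the local-vs-global distinction drawn in the MLP/RBF comparison earlier.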
Graphical Interpretation
- Each neuron responds based on the distance to the center of its receptive field
- The bottom level is a nonlinear mapping
- The top level is a linear weighted sum:

    f(X) = sum_i wi phi(X - Xi)

[Figure: network with inputs x1, x2, ..., xm, hidden units phi(X - X1), phi(X - X2), ..., phi(X - Xn), and output weights w1, ..., wn]
Other Alternatives: Global
- Lagrange polynomials:

    y = f(x) = sum_{k=0}^{n} y_k L_{n,k}(x)

    L_{n,k}(x) = [(x - x_0) ... (x - x_{k-1})(x - x_{k+1}) ... (x - x_n)] /
                 [(x_k - x_0) ... (x_k - x_{k-1})(x_k - x_{k+1}) ... (x_k - x_n)]
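The Lagrange formula above translates almost literally into code. A minimal sketch (the nodes and values are made up for illustration):

```python
import numpy as np

def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        # L_{n,k}(x): product over all nodes except x_k, normalized so that
        # L_{n,k}(x_k) = 1 and L_{n,k}(x_j) = 0 for j != k
        num = np.prod([x - xj for j, xj in enumerate(xs) if j != k])
        den = np.prod([xk - xj for j, xj in enumerate(xs) if j != k])
        total += yk * num / den
    return total

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 7.0]               # samples of y = x^2 + x + 1 at the nodes
print(lagrange_eval(xs, ys, 1.5))  # reproduces the quadratic: 4.75
```

With n+1 nodes the interpolant is the unique degree-n polynomial through the samples, which is why it recovers the quadratic exactly here.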
Other Alternatives: Local
- Bezier basis
- B-spline basis
B-Spline Interpolation
- A big subject in mathematics
- Used in many disciplines: approximation, pattern recognition, computer graphics
- As far as pattern recognition is concerned:
  - Determine the order of the spline (DOFs)
  - Knot vectors (partition into intervals)
  - Fitting in each interval
Interpolation Solution

    f(X) = sum_i wi phi(X, Xi)

In matrix form, Phi W = D with Phi_ij = phi(Xi, Xj):

    | phi_11  phi_12  ...  phi_1n | | w1 |   | d1 |
    | phi_21  phi_22  ...  phi_2n | | w2 | = | d2 |
    |  ...                        | | .. |   | .. |
    | phi_n1  phi_n2  ...  phi_nn | | wn |   | dn |

    W = Phi^(-1) D

- Phi is symmetric
- Phi is invertible (if all Xi's are distinct)
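The exact interpolation solve is a few lines in practice. A sketch, assuming a Gaussian basis and a made-up target function for the d_i:

```python
import numpy as np

# Pure interpolation: build Phi_ij = phi(||X_i - X_j||) with a Gaussian basis
# and solve Phi W = D exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))           # 20 distinct sample points
d = np.sin(X[:, 0]) + np.cos(X[:, 1])      # target values d_i (illustrative)

# Pairwise squared distances, then the symmetric Gaussian kernel matrix
r2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Phi = np.exp(-r2 / 2.0)                    # invertible for distinct X_i
w = np.linalg.solve(Phi, d)

# The interpolant passes through every training sample.
f_train = Phi @ w
print(np.max(np.abs(f_train - d)))         # ~0 up to numerical error
```

Note that `solve` is preferred over explicitly forming Phi^(-1); it is cheaper and numerically safer, which matters because Gaussian kernel matrices can be ill-conditioned.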
Practical Issue: Accuracy (cont.)
- The phi function represents the Green's function for a certain differential operator
- When it is shift and rotation invariant, we can write phi(X, Xi) as G(||X - Xi||); again, the Gaussian kernel is a popular choice here
Practical Issues
- Accuracy: what if the data are noisy?
- Speed: what if there are many sample points?
- Training: what is the training procedure?
Practical Issue: Accuracy
- When data are noisy, pure interpolation represents a form of "overfitting"
- Need a stabilizing (or smoothing, regularization) term
- The solution should achieve two things: good fitting and smoothness
Practical Issue: Accuracy (cont.)
Minimize a regularized error:

    E(f) = (1/2) sum_{i=1}^{n} (d_i - f(X_i))^2 + (lambda/2) ||Df||^2

With f(X) = sum_{j=1}^{m} w_j G(X, T_j):

    E(W) = (1/2) sum_{i=1}^{n} (d_i - sum_{j=1}^{m} w_j G(X_i, T_j))^2 + (lambda/2) ||Df||^2

    (G^T G + lambda G_0) W = G^T D
    W = (G^T G + lambda G_0)^(-1) G^T D

- The solution is rooted in regularization theory, which is way beyond the scope of this course (read the papers on the class Web site for more details)
- Try to minimize the error as a weighted sum of two terms which impose the fitting and the smoothness constraints
Sidebar I
- It can be proven that the MAP estimator (Bayes' rule) gives the same results as the regularized RBF solution
- The un-regularized (fitting) solution assumes the same prior for every f

    E(f) = (1/2) sum_i (d_i - f(X_i))^2 + (lambda/2) ||Df||^2

    P(f|X) = P(X|f) P(f) / P(X)
    -log P(f|X) = -log P(X|f) - log P(f) + c

- The data term -log P(X|f) corresponds to the fitting error, and the prior term -log P(f) to the smoothness penalty
Sidebar II
- Regularization is also similar to (or called) ridge regression in statistics
- The problem here is to fit a model to data without overfitting
- In the linear case, we have

    w_ridge = argmin_w sum_i (y_i - w_0 - sum_j x_ij w_j)^2 + lambda sum_j w_j^2

  or, equivalently,

    w_ridge = argmin_w sum_i (y_i - w_0 - sum_j x_ij w_j)^2
    subject to sum_j w_j^2 <= s
Intuition
- When variables x_i are highly correlated, their coefficients become poorly determined, with high variance
- E.g., a wildly large positive coefficient on one can be canceled by a similarly large negative coefficient on its correlated cousin
- A size constraint is helpful
- Caveat: the constraint is problem dependent
Solution to Ridge Regression
Similar to regularization:

    w_ridge = argmin_W (Y - XW)^T (Y - XW) + lambda W^T W

    d/dW [ (Y - XW)^T (Y - XW) + lambda W^T W ] = 0
    -2 X^T (Y - XW) + 2 lambda W = 0
    (X^T X + lambda I) W = X^T Y
    w_ridge = (X^T X + lambda I)^(-1) X^T Y
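The closed form above is a one-liner with a linear solver. A sketch on synthetic data (the design matrix, true weights, and noise level are all made up for illustration):

```python
import numpy as np

# Ridge regression: w_ridge = (X^T X + lambda I)^(-1) X^T Y
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.01 * rng.standard_normal(50)

lam = 0.1
# Solve the normal equations rather than inverting explicitly
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y)
print(w_ridge)   # close to w_true, slightly shrunk toward zero
```

The lambda term shrinks the coefficients toward zero relative to the ordinary least-squares fit, which is exactly the size constraint motivated on the Intuition slide.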
Ugly Math
With the SVD X = U Sigma V^T:

    Y_hat = X w_ridge = X (X^T X + lambda I)^(-1) X^T Y
          = U Sigma V^T (V Sigma^T Sigma V^T + lambda V V^T)^(-1) V Sigma^T U^T Y
          = U Sigma (Sigma^T Sigma + lambda I)^(-1) Sigma^T U^T Y
          = sum_i u_i [ sigma_i^2 / (sigma_i^2 + lambda) ] u_i^T Y
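The shrinkage identity above can be verified numerically: the direct ridge fit and the SVD form with per-mode factors sigma_i^2 / (sigma_i^2 + lambda) must agree. A sketch on random data:

```python
import numpy as np

# Check: X w_ridge == sum_i u_i [sigma_i^2 / (sigma_i^2 + lambda)] u_i^T Y
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))
Y = rng.standard_normal(30)
lam = 0.5

# Direct closed-form ridge fit
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ Y)
fit_direct = X @ w_ridge

# SVD form: each left singular direction is shrunk by sigma^2/(sigma^2+lambda)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s ** 2 / (s ** 2 + lam)
fit_svd = U @ (shrink * (U.T @ Y))

print(np.max(np.abs(fit_direct - fit_svd)))   # ~0: the two forms agree
```

The `shrink` vector makes the physical interpretation on the next slide concrete: directions with small singular values (less spread-out dimensions) contribute proportionally less to the fit.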
Physical Interpretation
- The singular values of X represent the spread of the data along different body-fitting dimensions
- To estimate Y (= X w_ridge), regularization minimizes the contribution from less spread-out dimensions
- Less spread-out dimensions usually have much larger variance (high-dimension eigen modes)
- Trace X(X^T X + lambda I)^(-1) X^T is called the effective degrees of freedom
More Details
- Trace X(X^T X + lambda I)^(-1) X^T is called the effective degrees of freedom
- It controls how many eigen modes are actually used or active
- Different methods are possible:
  - Shrinking smoother: contributions are scaled
  - Projection smoother: contributions are used (1) or not used (0)
Practical Issue: Speed
- When there are many training samples, the G and Phi matrices are of size n by n
- Inverting such a matrix is O(n^3)
- Remedy: reduce the number of bases used
Practical Issue: Speed (cont.)
Use m < n centers T_j:

    f(X) = sum_{j=1}^{m} w_j G(X, T_j)

    E = (1/2) sum_{i=1}^{n} (d_i - sum_{j=1}^{m} w_j G(X_i, T_j))^2 + (lambda/2) ||Df||^2

    (G^T G + lambda G_0) W = G^T D

where G is n x m and G_0 is m x m:

    G   = | G(X_1, T_1)  G(X_1, T_2)  ...  G(X_1, T_m) |
          | G(X_2, T_1)  G(X_2, T_2)  ...  G(X_2, T_m) |
          |  ...                                       |
          | G(X_n, T_1)  G(X_n, T_2)  ...  G(X_n, T_m) |

    G_0 = | G(T_1, T_1)  G(T_1, T_2)  ...  G(T_1, T_m) |
          |  ...                                       |
          | G(T_m, T_1)  G(T_m, T_2)  ...  G(T_m, T_m) |
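The reduced-basis system is cheap to assemble and solve. A sketch, assuming a Gaussian G, randomly chosen centers, and an illustrative 1-D target with noise:

```python
import numpy as np

# Reduced-basis regularized RBF fit: m < n centers T_j,
# G is n x m, G_0 is m x m, solve (G^T G + lambda G_0) W = G^T D.
rng = np.random.default_rng(3)
n, m, lam = 200, 20, 1e-3
X = rng.uniform(-3, 3, (n, 1))
D = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(n)   # noisy samples

T = X[rng.choice(n, m, replace=False)]   # centers chosen randomly from data

def gauss(A, B, sigma=1.0):
    """Gaussian kernel matrix G(A_i, B_j) for two point sets."""
    r2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-r2 / (2 * sigma ** 2))

G = gauss(X, T)        # n x m
G0 = gauss(T, T)       # m x m
W = np.linalg.solve(G.T @ G + lam * G0, G.T @ D)

print(np.sqrt(np.mean((G @ W - D) ** 2)))   # small training RMSE
```

The m x m solve is O(m^3) rather than O(n^3), which is the whole point of the reduced basis; the lambda G_0 term keeps the fit from chasing the noise.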
Practical Issue: Training
- How can the centers of the radial basis functions for the reduced basis set be determined?
- Chosen randomly
- Training involves finding the wi, using SVD
Training with K-means
- Use unsupervised clustering
- Find where the data are clustered: that is where the radial basis functions should be placed
- Done with k-means
K-Means Algorithm (fixed # of clusters)
1. Arbitrarily pick N cluster centers; assign samples to the nearest center
2. Compute the sample mean of each cluster
3. Reassign samples to the cluster with the nearest mean (for all samples)
4. Repeat if there are changes; otherwise stop
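The four steps above can be sketched in a few lines. A minimal implementation on made-up two-cluster data (initialization and data are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means following the slide: assign, re-mean, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # arbitrary initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each sample to the nearest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        # compute the sample mean of each cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):                  # no changes: stop
            break
        centers = new
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.1, (30, 2)),
               rng.normal((5, 5), 0.1, (30, 2))])
centers, labels = kmeans(X, 2)
print(np.round(centers, 1))
```

For RBF training, the returned `centers` would serve directly as the basis-function centers T_j.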
Training with Gradient Descent
Error expression:

    E(n) = (1/2) sum_{i=1}^{n} (d_i - sum_{j=1}^{m} w_j(n) G(||X_i - T_j(n)||))^2

The free variables in the error expression are:
- Weights
- Center locations
- Basis spreads
Effect of Weights

    E(n) = (1/2) sum_{i=1}^{n} (d_i - sum_{j=1}^{m} w_j(n) G(||X_i - T_j(n)||))^2

    dE(n)/dw_j(n) = -sum_{i=1}^{n} e_i(n) G(||X_i - T_j(n)||)

    where e_i(n) = d_i - sum_{j=1}^{m} w_j(n) G(||X_i - T_j(n)||)

    w_j(n+1) = w_j(n) - eta_1 dE(n)/dw_j(n)
Effect of Center Positions

    E(n) = (1/2) sum_{i=1}^{n} (d_i - sum_{j=1}^{m} w_j(n) G(||X_i - T_j(n)||))^2

    dE(n)/dT_j(n) = 2 w_j(n) sum_{i=1}^{n} e_i(n) G'(||X_i - T_j(n)||) Sigma_j^(-1) (X_i - T_j(n))

    T_j(n+1) = T_j(n) - eta_2 dE(n)/dT_j(n)
Effect of Basis Spread

    E(n) = (1/2) sum_{i=1}^{n} (d_i - sum_{j=1}^{m} w_j(n) G(||X_i - T_j(n)||))^2

    dE(n)/dSigma_j^(-1)(n) = -w_j(n) sum_{i=1}^{n} e_i(n) G'(||X_i - T_j(n)||) Q_ij(n)

    Q_ij(n) = (X_i - T_j(n)) (X_i - T_j(n))^T

    Sigma_j^(-1)(n+1) = Sigma_j^(-1)(n) - eta_3 dE(n)/dSigma_j^(-1)(n)
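The three updates (weights, centers, spreads) can be sketched together. To keep the code short this sketch uses a scalar precision beta_j = 1/(2 sigma_j^2) per basis instead of the full covariance Sigma_j from the slides (an assumption made here for illustration), and the data, learning rates, and basis count are all made up:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, (100, 1))
d = np.sin(2 * X[:, 0])                   # illustrative target
n = len(X)

m = 8
w = 0.1 * rng.standard_normal(m)          # weights
T = rng.uniform(-2, 2, (m, 1))            # center locations
beta = np.ones(m)                         # scalar spread parameters

eta_w, eta_t, eta_b = 0.2, 0.02, 0.01
for _ in range(2000):
    diff = X[:, None, :] - T[None, :, :]  # n x m x 1
    r2 = np.sum(diff ** 2, axis=-1)       # n x m squared distances
    G = np.exp(-beta * r2)
    e = d - G @ w                         # errors e_i
    # effect of weights: dE/dw_j = -sum_i e_i G_ij
    w += eta_w * (G.T @ e) / n
    # effect of center positions (chain rule: dG_ij/dT_j = 2 beta_j G_ij diff)
    coef = e[:, None] * (w * beta * G)    # n x m
    T += eta_t * 2 * np.einsum('ij,ijk->jk', coef, diff) / n
    # effect of basis spread: dG_ij/dbeta_j = -r2_ij G_ij
    beta -= eta_b * np.sum(e[:, None] * (w * G) * r2, axis=0) / n
    beta = np.maximum(beta, 1e-3)         # keep spreads valid

print(np.sqrt(np.mean(e ** 2)))           # training RMSE after descent
```

Different learning rates per parameter group mirror the eta_1, eta_2, eta_3 of the slides; in practice the weights are usually the easiest to fit, while centers and spreads move more slowly.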
Details
- A lot of theoretical development results are omitted here
  - E.g., the relation to kernel regression and SVM
- A lot of tuning considerations are not covered here
  - E.g., how to determine lambda?
- This is an active research area
Examples

544,000 data points with 80,000 centers; accuracy of 1.4x10^-6 for all data points
Problem Definition
- Given a point cloud of data
  - From a laser range scanner, or CT, MR, etc.
- Find a single analytical surface approximation
  - Or an inside-outside function
- Range data are s(X) = 0
  - Outside is s(X) > 0
  - Inside is s(X) < 0
- Just the sample data s(X) = 0 is not enough
  - s can be a trivial zero function
  - Need off-surface data generation
Procedures
1. Off-surface data generation
2. Choose a subset from the interpolation nodes x_i and fit an RBF only to these
3. Evaluate the residual e_i = f_i - s(x_i)
4. If max(e_i) < accuracy, then stop
5. Else append new centers where e_i is large
6. Re-fit the RBF and go back to step 3
More Results

[Figure: reconstructions with less smoothing vs. more smoothing]