Lecture 3: Introduction to Radial Basis Function Networks


  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    1/139

Introduction to Radial Basis Function Networks

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    2/139

Content Overview

- The Models of Function Approximator
- The Radial Basis Function Networks
- RBFNs for Function Approximation
- The Projection Matrix
- Learning the Kernels
- Bias-Variance Dilemma
- The Effective Number of Parameters
- Model Selection
- Incremental Operations

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    3/139

Introduction to Radial Basis Function Networks

Overview

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    4/139

Typical Applications of NN

Pattern Classification: $l = f(x)$, where $x \in X \subseteq R^n$ and the label $l \in C = \{1, 2, \dots, N\}$.

Function Approximation: $y = f(x)$, where $x \in X \subseteq R^n$ and $y \in Y \subseteq R^m$.

Time-Series Forecasting: $x_t = f(x_{t-1}, x_{t-2}, x_{t-3}, \dots)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    5/139

Function Approximation

Unknown function: $f : X \to Y$, with $X \subseteq R^n$ and $Y \subseteq R^m$.

Approximator: $\hat{f} : X \to Y$, intended to approximate the unknown $f$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    6/139

Supervised Learning

[Block diagram: training pairs $(x_i, y_i)$, $i = 1, 2, \dots$, are produced by the unknown function; the neural network outputs $\hat{y}_i$, and the error $e_i = y_i - \hat{y}_i$ drives learning.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    7/139

Neural Networks as Universal Approximators

Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy.

  Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.

Like feedforward neural networks with a single hidden layer of sigmoidal units, RBF networks can be shown to be universal approximators.

  Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257.

  Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    8/139

Statistics vs. Neural Networks

Statistics               Neural Networks
model                    network
estimation               learning
regression               supervised learning
interpolation            generalization
observations             training set
parameters               (synaptic) weights
independent variables    inputs
dependent variables      outputs
ridge regression         weight decay

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    9/139

Introduction to Radial Basis Function Networks

The Model of Function Approximator

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    10/139

Linear Models

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

where the $w_i$ are weights and the $\phi_i(\cdot)$ are fixed basis functions.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    11/139

Linear Models

[Network diagram: inputs $x = (x_1, x_2, \dots, x_n)$ (feature vectors) feed $m$ hidden units $\phi_1, \phi_2, \dots, \phi_m$ (feature extraction / transformation); the output $y$ is the linearly weighted sum with weights $w_1, w_2, \dots, w_m$.]

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    12/139


  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    13/139

Example Linear Models

Polynomial: $f(x) = \sum_i w_i x^i$, with basis $\phi_i(x) = x^i$, $i = 0, 1, 2, \dots$

Fourier Series: $f(x) = \sum_k w_k \exp(j 2\pi k x)$, with basis $\phi_k(x) = \exp(j 2\pi k x)$, $k = 0, 1, 2, \dots$

Both are instances of $f(x) = \sum_{i} w_i \phi_i(x)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    14/139

Single-Layer Perceptrons as Universal Approximators

[Same network diagram, with sigmoidal hidden units.]

With a sufficient number of sigmoidal units, it can be a universal approximator.

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    15/139

Radial Basis Function Networks as Universal Approximators

[Same network diagram, with radial-basis-function hidden units.]

With a sufficient number of radial-basis-function units, it can also be a universal approximator.

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    16/139

Non-Linear Models

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

Here both the weights $w_i$ and the basis functions $\phi_i(\cdot)$ are adjusted by the learning process.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    17/139

Introduction to Radial Basis Function Networks

The Radial Basis Function Networks

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    18/139

Radial Basis Functions

$f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

Three parameters for a radial function $\phi_i(x) = \phi(\|x - x_i\|)$:

- Center $x_i$
- Distance measure $r = \|x - x_i\|$
- Shape $\phi$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    19/139

Typical Radial Functions

Gaussian: $\phi(r) = e^{-r^2 / 2\sigma^2}$, $\sigma > 0$ and $r \in \mathbb{R}$

Hardy Multiquadratic: $\phi(r) = \sqrt{r^2 + c^2}$, $c > 0$ and $r \in \mathbb{R}$

Inverse Multiquadratic: $\phi(r) = c / \sqrt{r^2 + c^2}$, $c > 0$ and $r \in \mathbb{R}$
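To make these radial functions concrete, here is a minimal NumPy sketch (an illustration, not part of the original slides); the function names gaussian, multiquadric, inverse_multiquadric, and rbf are our own.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2)), sigma > 0
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadric(r, c=1.0):
    # Hardy multiquadratic: phi(r) = sqrt(r^2 + c^2), c > 0
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = c / sqrt(r^2 + c^2), c > 0
    return c / np.sqrt(r**2 + c**2)

def rbf(x, center, phi=gaussian, **kwargs):
    # A radial basis function centered at x_i: phi_i(x) = phi(||x - x_i||)
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(center))
    return phi(r, **kwargs)
```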

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    20/139

Gaussian Basis Function (σ = 0.5, 1.0, 1.5)

$\phi(r) = e^{-r^2 / 2\sigma^2}$, $\sigma > 0$ and $r \in \mathbb{R}$

[Plot: Gaussian curves for σ = 0.5, 1.0, 1.5.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    21/139

Inverse Multiquadratic

$\phi(r) = c / \sqrt{r^2 + c^2}$, $c > 0$ and $r \in \mathbb{R}$

[Plot over $r \in [-10, 10]$: curves for c = 1, 2, 3, 4, 5; each peaks at 1 at r = 0 and broadens as c increases.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    22/139

Most General RBF

$\phi_i(x) = \phi\big((x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\big)$

[Figure: elliptical receptive fields of three such kernels.]

The basis $\{\phi_i : i = 1, 2, \dots\}$ is "near" orthogonal.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    23/139

Properties of RBFs

- On-center, off-surround.
- Analogies with localized receptive fields found in several biological structures, e.g., visual cortex, ganglion cells.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    24/139

The Topology of RBF

[Network diagram: inputs $x_1, \dots, x_n$ (feature vectors); hidden units perform projection; output units $y_1, \dots, y_m$ perform interpolation.]

As a function approximator: $y = f(x)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    25/139

The Topology of RBF

[Network diagram: inputs $x_1, \dots, x_n$; hidden units respond to subclasses; output units $y_1, \dots, y_m$ correspond to classes.]

As a pattern classifier.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    26/139

Introduction to Radial Basis Function Networks

RBFNs for Function Approximation

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    27/139

The Idea

[Plot: an unknown function of x to approximate, with training data points.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    28/139

The Idea

[Plot: the unknown function and training data, with basis functions (kernels) placed along the x-axis.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    29/139

The Idea

[Plot: the function learned from the kernels.]

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    30/139

The Idea

[Plot: the learned function, the kernels, and a non-training sample.]

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    31/139

The Idea

[Plot: the learned function evaluated at a non-training sample.]

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    32/139

Radial Basis Function Networks as Universal Approximators

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

[Network diagram: inputs $x = (x_1, \dots, x_n)$, kernels, weights $w_1, \dots, w_m$.]

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$.  Goal: $y^{(k)} \approx f(x^{(k)})$ for all $k$.

$\min \; SSE = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 = \sum_{k=1}^{p} \Big(y^{(k)} - \sum_{i=1}^{m} w_i \phi_i(x^{(k)})\Big)^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    33/139

Learn the Optimal Weight Vector

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$.  Goal: $y^{(k)} \approx f(x^{(k)})$ for all $k$, with $f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$.

$\min \; SSE = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 = \sum_{k=1}^{p} \Big(y^{(k)} - \sum_{i=1}^{m} w_i \phi_i(x^{(k)})\Big)^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    34/139

Regularization

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$.  Goal: $y^{(k)} \approx f(x^{(k)})$ for all $k$.

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2, \quad \lambda_i \ge 0$

If regularization is unneeded, set $\lambda_i = 0$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    35/139

Learn the Optimal Weight Vector

Minimize $C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$, with $f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$.

$\dfrac{\partial C}{\partial w_j} = 2 \sum_{k=1}^{p} \big(f(x^{(k)}) - y^{(k)}\big)\, \phi_j(x^{(k)}) + 2 \lambda_j w_j = 0$

$\Rightarrow \sum_{k=1}^{p} f^*(x^{(k)})\, \phi_j(x^{(k)}) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)}\, \phi_j(x^{(k)})$, where $f^*(x) = \sum_{i=1}^{m} w_i^* \phi_i(x)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    36/139

Learn the Optimal Weight Vector

Define $\phi_j = \big(\phi_j(x^{(1)}), \dots, \phi_j(x^{(p)})\big)^T$, $f^* = \big(f^*(x^{(1)}), \dots, f^*(x^{(p)})\big)^T$, $y = \big(y^{(1)}, \dots, y^{(p)}\big)^T$.

Then $\sum_{k} f^*(x^{(k)})\, \phi_j(x^{(k)}) + \lambda_j w_j^* = \sum_{k} y^{(k)}\, \phi_j(x^{(k)})$ becomes

$\phi_j^T f^* + \lambda_j w_j^* = \phi_j^T y, \quad j = 1, \dots, m$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    37/139

Learn the Optimal Weight Vector

Stacking $\phi_j^T f^* + \lambda_j w_j^* = \phi_j^T y$ for $j = 1, \dots, m$, define

$\Phi = [\phi_1, \phi_2, \dots, \phi_m], \quad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m), \quad w^* = (w_1^*, w_2^*, \dots, w_m^*)^T$

$\Rightarrow \Phi^T f^* + \Lambda w^* = \Phi^T y$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    38/139

Learn the Optimal Weight Vector

$f^* = \big(f^*(x^{(1)}), f^*(x^{(2)}), \dots, f^*(x^{(p)})\big)^T$, with $f^*(x^{(k)}) = \sum_{i=1}^{m} w_i^* \phi_i(x^{(k)})$.

In matrix form, $f^* = \Phi w^*$, where $\Phi$ is the $p \times m$ design matrix with entries $\Phi_{ki} = \phi_i(x^{(k)})$.

Substituting into $\Phi^T f^* + \Lambda w^* = \Phi^T y$:

$\Phi^T \Phi\, w^* + \Lambda w^* = \Phi^T y$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    39/139

Learn the Optimal Weight Vector

$\Phi^T \Phi\, w^* + \Lambda w^* = \Phi^T y \;\Rightarrow\; (\Phi^T \Phi + \Lambda)\, w^* = \Phi^T y$

$w^* = (\Phi^T \Phi + \Lambda)^{-1} \Phi^T y = A^{-1} \Phi^T y$

Variance matrix: $A^{-1} = (\Phi^T \Phi + \Lambda)^{-1}$.  Design matrix: $\Phi$, with $\Phi_{ki} = \phi_i(x^{(k)})$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    40/139

Summary

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$.  Model: $y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$.

$\phi_j = \big(\phi_j(x^{(1)}), \phi_j(x^{(2)}), \dots, \phi_j(x^{(p)})\big)^T, \quad y = \big(y^{(1)}, y^{(2)}, \dots, y^{(p)}\big)^T$

$\Phi = [\phi_1, \phi_2, \dots, \phi_m], \quad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m), \quad w^* = (w_1^*, w_2^*, \dots, w_m^*)^T$

$w^* = (\Phi^T \Phi + \Lambda)^{-1} \Phi^T y = A^{-1} \Phi^T y$
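As a concrete illustration of this summary, the sketch below builds the design matrix from Gaussian kernels and solves $w^* = (\Phi^T\Phi + \Lambda)^{-1}\Phi^T y$ with standard ridge ($\Lambda = \lambda I$); the centers, width, and toy data are placeholders, not values from the slides.

```python
import numpy as np

def design_matrix(X, centers, sigma):
    # Phi[k, i] = phi_i(x^(k)) = exp(-||x^(k) - mu_i||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def rbf_weights(Phi, y, lam=0.0):
    # w* = (Phi^T Phi + lam I)^(-1) Phi^T y
    m = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(m)
    return np.linalg.solve(A, Phi.T @ y)

# Toy usage with placeholder data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0.0, 0.1, size=50)
centers = X[:20].copy()                 # e.g., a subset of inputs as centers
Phi = design_matrix(X, centers, sigma=0.1)
w = rbf_weights(Phi, y, lam=1e-3)
y_hat = Phi @ w                         # fitted values f*(x^(k))
```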

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    41/139

Introduction to Radial Basis Function Networks

The Projection Matrix

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    42/139

The Empirical-Error Vector

$w^* = A^{-1} \Phi^T y$

[Diagram: the unknown function generates targets $y$ for the inputs $x^{(k)}$; the network with kernels $\phi_1, \dots, \phi_m$ and weights $w_1^*, \dots, w_m^*$ produces the fitted values.]

$f^* = \Phi w^*$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    43/139

The Empirical-Error Vector

$w^* = A^{-1} \Phi^T y, \quad f^* = \Phi w^*$

Error vector: $e = y - f^* = y - \Phi w^* = y - \Phi A^{-1} \Phi^T y = \big(I_p - \Phi A^{-1} \Phi^T\big)\, y = P y$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    44/139

Sum-Squared-Error

Error vector: $e = P y$, where $P = I_p - \Phi A^{-1} \Phi^T$, $A = \Phi^T \Phi + \Lambda$, $\Phi = [\phi_1, \phi_2, \dots, \phi_m]$.

$SSE = \sum_{k=1}^{p} \big(y^{(k)} - f^*(x^{(k)})\big)^2 = (Py)^T (Py) = y^T P^2 y$

If $\Lambda = 0$, the RBFN's learning algorithm is to minimize SSE (MSE).

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    45/139

The Projection Matrix

$e = P y, \quad P = I_p - \Phi A^{-1} \Phi^T, \quad A = \Phi^T \Phi + \Lambda, \quad SSE = y^T P^2 y$

When $\Lambda = 0$: $e \perp \mathrm{span}(\phi_1, \phi_2, \dots, \phi_m)$, and $P(Py) = Pe = e = Py$, so $P$ is a projection matrix and $SSE = y^T P y$.
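These identities are easy to check numerically. The following sketch (illustrative data only) verifies idempotence, orthogonality of the error vector, and the SSE identity for the unregularized case.

```python
import numpy as np

def projection_matrix(Phi, lam=0.0):
    # P = I_p - Phi A^{-1} Phi^T, with A = Phi^T Phi + lam * I_m
    p, m = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(m)
    return np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 5))
y = rng.normal(size=30)

P = projection_matrix(Phi, lam=0.0)
e = P @ y                                   # error vector e = P y
print(np.allclose(P @ P, P))                # True: with Lambda = 0, P is idempotent
print(np.allclose(Phi.T @ e, 0.0))          # True: e is orthogonal to span(phi_1, ..., phi_m)
print(np.isclose(e @ e, y @ P @ y))         # True: SSE = y^T P^2 y = y^T P y when Lambda = 0
```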

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    46/139

Introduction to Radial Basis Function Networks

Learning the Kernels

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    47/139

RBFNs as Universal Approximators

[Network diagram: inputs $x_1, \dots, x_n$; kernels $\phi_1, \dots, \phi_m$; outputs $y_1, \dots, y_l$ with weights $w_{ij}$.]

Training set: $T = \{(x^{(k)}, y^{(k)})\}_{k=1}^{p}$

Kernels: $\phi_j(x) = \exp\left(-\dfrac{\|x - \mu_j\|^2}{2\sigma_j^2}\right)$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    48/139

What to Learn?

- Weights $w_{ij}$'s
- Centers $\mu_j$'s of the $\phi_j$'s
- Widths $\sigma_j$'s of the $\phi_j$'s
- Number of $\phi_j$'s  (Model Selection)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    49/139

One-Stage Learning

Gradient descent on the squared error, for the Gaussian kernels above:

$\Delta w_{ij} = \eta_1 \big(y_i^{(k)} - f_i(x^{(k)})\big)\, \phi_j(x^{(k)})$

$\Delta \mu_j = \eta_2\, \dfrac{x^{(k)} - \mu_j}{\sigma_j^2}\, \phi_j(x^{(k)}) \sum_{i=1}^{l} w_{ij} \big(y_i^{(k)} - f_i(x^{(k)})\big)$

$\Delta \sigma_j = \eta_3\, \dfrac{\|x^{(k)} - \mu_j\|^2}{\sigma_j^3}\, \phi_j(x^{(k)}) \sum_{i=1}^{l} w_{ij} \big(y_i^{(k)} - f_i(x^{(k)})\big)$

where $f_i(x^{(k)}) = \sum_j w_{ij}\, \phi_j(x^{(k)})$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    50/139

One-Stage Learning

$f_i(x^{(k)}) = \sum_j w_{ij}\, \phi_j(x^{(k)})$

The simultaneous update of all three sets of parameters (weights, centers, widths) may be suitable for non-stationary environments or on-line settings. A code sketch follows.
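Below is a minimal single-output sketch of one such on-line gradient step for Gaussian kernels; the learning rates and variable names are our own illustrative choices, not values from the slides.

```python
import numpy as np

def one_stage_step(x, y, w, mu, sigma, etas=(0.1, 0.01, 0.01)):
    """One on-line gradient step on (x, y) for a single-output Gaussian RBFN.
    w: (m,) weights, mu: (m, n) centers, sigma: (m,) widths."""
    eta1, eta2, eta3 = etas
    d = x[None, :] - mu                          # (m, n) differences x - mu_j
    r2 = (d ** 2).sum(axis=1)                    # ||x - mu_j||^2
    phi = np.exp(-r2 / (2.0 * sigma**2))         # kernel activations phi_j(x)
    err = y - w @ phi                            # y - f(x)
    dw = eta1 * err * phi                                     # weight updates
    dmu = eta2 * err * (w * phi / sigma**2)[:, None] * d      # center updates
    dsigma = eta3 * err * w * phi * r2 / sigma**3             # width updates
    return w + dw, mu + dmu, sigma + dsigma
```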

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    51/139

Two-Stage Training

[Same network diagram.]

Step 1: Determine the centers $\mu_j$'s, widths $\sigma_j$'s, and number of the $\phi_j$'s.

Step 2: Determine the weights $w_{ij}$'s, e.g., using batch learning.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    52/139

    Train the Kernels

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    53/139

    Unsupervised Training

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    54/139

Methods

- Subset Selection: Random Subset Selection, Forward Selection, Backward Elimination
- Clustering Algorithms: K-means, LVQ
- Mixture Models: GMM

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    55/139

    Subset Selection

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    56/139

Random Subset Selection

- Randomly choose a subset of points from the training set as centers.
- Sensitive to the initially chosen points.
- Use adaptive techniques to tune the centers, widths, and number of points (see the sketch below).
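A small sketch of random subset selection, paired with a simple nearest-neighbour width heuristic; the heuristic is our illustrative choice, not one prescribed by the slide.

```python
import numpy as np

def random_subset_centers(X, m, rng=None):
    # Randomly choose m training points as RBF centers.
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx]

def nearest_neighbor_widths(centers, scale=1.0):
    # Illustrative heuristic: width of each kernel = scale * distance to its nearest other center.
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return scale * d.min(axis=1)
```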

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    57/139

    Clustering Algorithms

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    58/139

    Clustering Algorithms

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    59/139

    Clustering Algorithms

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    60/139

Clustering Algorithms

[Figure: data grouped into four clusters, labeled 1 to 4, each providing a candidate center.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    61/139

Introduction to Radial Basis Function Networks

Bias-Variance Dilemma

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    62/139

Goal Revisited

Ultimate goal (generalization): minimize the prediction error.

Goal of our learning procedure: minimize the empirical error.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    63/139

Badness of Fit

Underfitting: a model (e.g., network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, leading to underfitting. It produces excessive bias in the outputs.

Overfitting: a model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting. It produces excessive variance in the outputs.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    64/139

Underfitting/Overfitting Avoidance

- Model selection
- Jittering
- Early stopping
- Weight decay (regularization, ridge regression, Bayesian learning)
- Combining networks

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    65/139

Best Way to Avoid Overfitting

- Use lots of training data, e.g., 30 times as many training cases as there are weights in the network.
- For noise-free data, 5 times as many training cases as weights may be sufficient.
- Don't arbitrarily reduce the number of weights for fear of underfitting.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    66/139

Badness of Fit

[Figure: an underfit curve vs. an overfit curve on the same data.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    67/139

Badness of Fit

[Figure: an underfit curve vs. an overfit curve on the same data.]

However, it's not really a dilemma.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    68/139

Bias-Variance Dilemma

            Underfit          Overfit
Bias        Large bias        Small bias
Variance    Small variance    Large variance

However, it's not really a dilemma.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    69/139

Bias-Variance Dilemma

            Underfit          Overfit
Bias        Large bias        Small bias
Variance    Small variance    Large variance

More on overfitting:
- It easily leads to predictions that are far beyond the range of the training data.
- It produces wild predictions in multilayer perceptrons even with noise-free data.

It's not really a dilemma.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    70/139

Bias-Variance Dilemma

It's not really a dilemma.

[Figure: as the fit moves from more underfit to more overfit, bias decreases and variance increases.]

The mean of the bias = ?

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    71/139

Bias-Variance Dilemma

[Figure: a set of functions (e.g., depending on the number of hidden nodes used); the true model plus noise; the solutions obtained with training sets 1, 2, and 3; the bias of each solution relative to the true model.]

The mean of the bias = ?  The variance of the bias = ?

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    72/139

Bias-Variance Dilemma

[Figure: the set of functions, the true model plus noise, and the resulting bias and variance.]

To reduce the variance, reduce the effective number of parameters.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    73/139

Model Selection

[Figure: the set of functions, the true model plus noise, and the resulting bias and variance.]

Reduce the number of hidden nodes.

Goal: $\min \; E\big[(y - f(x))^2 \mid x\big]$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    74/139

Bias-Variance Dilemma

The true model: $y = g(x) + \varepsilon(x)$, approximated by $f(x)$.

[Figure: the set of functions, the true model $g(x)$ plus noise, and the bias of $f(x)$.]

Goal: $\min \; E\big[(y - f(x))^2 \mid x\big]$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    75/139

Bias-Variance Dilemma

Goal: $\min \; E\big[(y - f(x))^2 \mid x\big]$, with $y = g(x) + \varepsilon$.

$E\big[(y - f(x))^2 \mid x\big] = E\big[(g(x) + \varepsilon - f(x))^2 \mid x\big]$
$= E\big[(g(x) - f(x))^2 \mid x\big] + 2\,E\big[\varepsilon\,(g(x) - f(x)) \mid x\big] + E[\varepsilon^2]$

The cross term is 0 and $E[\varepsilon^2]$ is a constant, so

$E\big[(y - f(x))^2 \mid x\big] = E[\varepsilon^2] + E\big[(g(x) - f(x))^2 \mid x\big]$

and minimizing $E\big[(y - f(x))^2 \mid x\big]$ is equivalent to minimizing $E\big[(g(x) - f(x))^2 \mid x\big]$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    76/139

Bias-Variance Dilemma

$E\big[(g(x) - f(x))^2 \mid x\big] = E\big[(g(x) - E[f(x)] + E[f(x)] - f(x))^2 \mid x\big]$
$= E\big[(g(x) - E[f(x)])^2\big] + E\big[(E[f(x)] - f(x))^2\big] + 2\,E\big[(g(x) - E[f(x)])(E[f(x)] - f(x))\big]$

The cross term vanishes:
$E\big[(g(x) - E[f(x)])(E[f(x)] - f(x))\big] = (g(x) - E[f(x)])\,\big(E[f(x)] - E[f(x)]\big) = 0$

$\Rightarrow E\big[(g(x) - f(x))^2 \mid x\big] = \big(g(x) - E[f(x)]\big)^2 + E\big[(f(x) - E[f(x)])^2\big]$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    77/139

Bias-Variance Dilemma

$E\big[(y - f(x))^2 \mid x\big] = E[\varepsilon^2] + \big(g(x) - E[f(x)]\big)^2 + E\big[(f(x) - E[f(x)])^2\big]$
= noise + bias² + variance

The noise term cannot be minimized; the goal is to minimize both bias² and variance.
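This decomposition can be checked numerically. The sketch below repeatedly fits a small model to fresh noisy samples of a known target and estimates noise, bias², and variance at one query point; the target function, estimator, and constants are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * np.pi * x)        # "true" model g(x) (illustrative choice)
sigma_noise, degree, x0 = 0.2, 3, 0.3      # noise level, model capacity, query point

preds = []
for _ in range(2000):                      # many independent training sets
    x = rng.uniform(0, 1, 30)
    y = g(x) + rng.normal(0, sigma_noise, 30)
    coeffs = np.polyfit(x, y, degree)      # f(x): polynomial fit of fixed degree
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias2 = (g(x0) - preds.mean()) ** 2        # (g(x0) - E[f(x0)])^2
variance = preds.var()                     # E[(f(x0) - E[f(x0)])^2]
expected_err = sigma_noise**2 + bias2 + variance   # approximates E[(y - f(x0))^2 | x0]
print(bias2, variance, expected_err)
```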

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    78/139

Model Complexity vs. Bias-Variance

$E\big[(y - f(x))^2 \mid x\big] = E[\varepsilon^2] + \big(g(x) - E[f(x)]\big)^2 + E\big[(f(x) - E[f(x)])^2\big]$ = noise + bias² + variance

[Figure: noise, bias², and variance as functions of model complexity (capacity).]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    79/139

Bias-Variance Dilemma

$E\big[(y - f(x))^2 \mid x\big]$ = noise + bias² + variance, plotted against model complexity (capacity).

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    80/139

Example (Polynomial Fits): $y = x + 2\exp(-16 x^2) + \varepsilon$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    81/139

    Example (Polynomial Fits)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    82/139

Example (Polynomial Fits)

[Figure: polynomial fits of degree 1, 5, 10, and 15 to the same data.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    83/139

Introduction to Radial Basis Function Networks

The Effective Number of Parameters

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    84/139

Variance Estimation

Mean: $\mu$ (in general, not available).

Variance: $\hat{\sigma}^2 = \dfrac{1}{p} \sum_{i=1}^{p} (x_i - \mu)^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    85/139

Variance Estimation

Mean: $\bar{x} = \dfrac{1}{p} \sum_{i=1}^{p} x_i$

Variance: $s^2 = \dfrac{1}{p-1} \sum_{i=1}^{p} (x_i - \bar{x})^2$   (loses 1 degree of freedom)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    86/139

Simple Linear Regression

$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$

Minimize $SSE = \sum_{i=1}^{p} (y_i - \hat{y}_i)^2$

[Figure: data points and the fitted regression line.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    87/139

Simple Linear Regression

$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Minimize $SSE = \sum_{i=1}^{p} (y_i - \hat{y}_i)^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    88/139

Mean Squared Error (MSE)

$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - 2}$   (loses 2 degrees of freedom)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    89/139

Variance Estimation

$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - m}$   (loses m degrees of freedom)

m: the number of parameters of the model

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    90/139

The Number of Parameters

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

[Network diagram with weights $w_1, \dots, w_m$.]

Number of degrees of freedom: m.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    91/139

The Effective Number of Parameters (γ)

$y = f(x) = \sum_{i=1}^{m} w_i \phi_i(x)$

The projection matrix: $P = I_p - \Phi A^{-1} \Phi^T, \quad A = \Phi^T \Phi + \Lambda, \quad \Lambda \ge 0$

$\gamma = p - \mathrm{trace}(P) \le m$

Facts: $\mathrm{trace}(A + B) = \mathrm{trace}(A) + \mathrm{trace}(B)$, and $\mathrm{trace}(AB) = \mathrm{trace}(BA)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    92/139

The Effective Number of Parameters (γ)

The projection matrix: $P = I_p - \Phi A^{-1} \Phi^T, \quad A = \Phi^T \Phi + \Lambda, \quad \Lambda \ge 0$

The effective number of parameters: $\gamma = p - \mathrm{trace}(P)$.

Pf) For $\Lambda = 0$:
$p - \mathrm{trace}(P) = p - \mathrm{trace}\big(I_p - \Phi A^{-1} \Phi^T\big) = \mathrm{trace}\big(\Phi A^{-1} \Phi^T\big)$
$= \mathrm{trace}\big(A^{-1} \Phi^T \Phi\big) = \mathrm{trace}\big((\Phi^T\Phi)^{-1} \Phi^T \Phi\big) = \mathrm{trace}(I_m) = m$

So with no regularization, $\gamma = p - \mathrm{trace}(P) = m$.
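A short sketch computing $\gamma = p - \mathrm{trace}(P)$ for standard ridge regularization; as $\lambda$ grows, $\gamma$ falls below m (the data here are placeholders).

```python
import numpy as np

def effective_parameters(Phi, lam):
    # gamma = p - trace(P), with P = I_p - Phi A^{-1} Phi^T and A = Phi^T Phi + lam I
    p, m = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    return p - np.trace(P)

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 10))
for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, effective_parameters(Phi, lam))   # lam = 0 gives gamma = m = 10
```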

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    93/139

Regularization

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$   (empirical error, SSE, plus the model's penalty)

Penalize models with large weights.

The effective number of parameters: $\gamma = p - \mathrm{trace}(P)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    94/139

Regularization

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$   (SSE plus a penalty on large weights)

Without penalty ($\lambda_i = 0$), there are m degrees of freedom available to minimize SSE (the cost), and the effective number of parameters is $\gamma = m$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    95/139

Regularization

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$   (SSE plus a penalty on large weights)

With penalty ($\lambda_i > 0$), the freedom to minimize SSE is reduced, and the effective number of parameters becomes $\gamma < m$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    96/139

Variance Estimation

$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - \gamma}$   (loses γ degrees of freedom)

The effective number of parameters: $\gamma = p - \mathrm{trace}(P)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    97/139

Variance Estimation

$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - \gamma} = \dfrac{SSE}{\mathrm{trace}(P)}$, since $\gamma = p - \mathrm{trace}(P)$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    98/139

Introduction to Radial Basis Function Networks

Model Selection

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    99/139

Model Selection

- Goal: choose the fittest model.
- Criterion: least prediction error.
- Main tools (to estimate model fitness): cross-validation, the projection matrix.
- Methods: weight decay (ridge regression), pruning and growing RBFNs.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    100/139

Empirical Error vs. Model Fitness

Ultimate goal (generalization): minimize the prediction error (MSE).

Goal of our learning procedure: minimize the empirical error.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    101/139

Estimating Prediction Error

- When you have plenty of data, use independent test sets: e.g., train different models on the same training set and choose the best model by comparing them on the test set.
- When data is scarce, use cross-validation or the bootstrap.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    102/139

Cross-Validation

- The simplest and most widely used method for estimating prediction error.
- Partition the original set in several different ways and compute an average score over the different partitions, e.g.:
  - K-fold cross-validation
  - Leave-one-out cross-validation
  - Generalized cross-validation

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    103/139

K-Fold CV

- Split the set D of available input-output patterns into k mutually exclusive subsets, say D1, D2, ..., Dk.
- Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di. (A code sketch follows.)
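A minimal sketch of the k-fold loop; the fit and predict callables are placeholders for whatever learning algorithm is being evaluated.

```python
import numpy as np

def k_fold_indices(p, k, rng=None):
    # Split indices {0, ..., p-1} into k mutually exclusive subsets D_1, ..., D_k.
    idx = np.random.default_rng(rng).permutation(p)
    return np.array_split(idx, k)

def k_fold_cv(X, y, fit, predict, k=5):
    # Train on D \ D_i and test on D_i, i = 1..k; return the average squared error.
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        err = y[test_idx] - predict(model, X[test_idx])
        scores.append(np.mean(err ** 2))
    return np.mean(scores)
```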

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    104/139

K-Fold CV

[Figure: the available data split into folds; one fold serves as the test set.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    105/139

K-Fold CV

[Figure: k rotations of the split D1, D2, D3, ..., Dk; in each rotation one Di is the test set and the remaining folds form the training set, giving an estimate $\hat{\sigma}_i^2$ per rotation.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    106/139

Leave-One-Out CV

- Split the p available input-output patterns into a training set of size p-1 and a test set of size 1.
- Average the squared error on the left-out pattern over the p possible ways of partitioning.
- A special case of k-fold CV (k = p).

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    107/139

Error Variance Predicted by LOO

Available input-output patterns: $D = \{(x_k, y_k) : k = 1, 2, \dots, p\}$

Training sets of LOO: $D_{-i} = D \setminus \{(x_i, y_i)\}, \quad i = 1, \dots, p$

$f_{-i}$: the function learned using $D_{-i}$ as training set.

The estimate for the variance of the prediction error using LOO:

$\hat{\sigma}^2_{LOO} = \dfrac{1}{p} \sum_{i=1}^{p} \big(y_i - f_{-i}(x_i)\big)^2$   (squared error on the left-out element)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    108/139

Error Variance Predicted by LOO

$\hat{\sigma}^2_{LOO} = \dfrac{1}{p} \sum_{i=1}^{p} \big(y_i - f_{-i}(x_i)\big)^2$

$f_{-i}$ is, for the given model, the function with least empirical error on $D_{-i}$.

$\hat{\sigma}^2_{LOO}$ serves as an index of the model's fitness; we want to find a model that also minimizes it.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    109/139

Error Variance Predicted by LOO

$\hat{\sigma}^2_{LOO} = \dfrac{1}{p} \sum_{i=1}^{p} \big(y_i - f_{-i}(x_i)\big)^2$   (squared error on the left-out element)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    110/139

Error Variance Predicted by LOO

$\hat{\sigma}^2_{LOO} = \dfrac{1}{p} \sum_{i=1}^{p} \big(y_i - f_{-i}(x_i)\big)^2$   (squared error on the left-out element)

In terms of the projection matrix:

$\hat{\sigma}^2_{LOO} = p^{-1}\, y^T P \,\big[\mathrm{diag}(P)\big]^{-2}\, P y$
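A small sketch of this shortcut: for a linear-in-the-weights model, the leave-one-out residual for pattern k equals the ordinary residual divided by the k-th diagonal entry of P, so no refitting is needed (P and y are assumed to come from the helpers sketched earlier).

```python
import numpy as np

def loo_variance(P, y):
    # sigma^2_LOO = (1/p) * y^T P diag(P)^{-2} P y
    d = np.diag(P)                    # diagonal entries P_kk
    e = P @ y                         # empirical errors e = P y
    return np.mean((e / d) ** 2)      # (1/p) * sum_k (e_k / P_kk)^2
```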

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    111/139

Generalized Cross-Validation

Approximate $\mathrm{diag}(P) \approx \dfrac{\mathrm{trace}(P)}{p}\, I_p$. Then

$\hat{\sigma}^2_{GCV} = \dfrac{p\, y^T P^2 y}{\big(\mathrm{trace}(P)\big)^2}$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    112/139

More Criteria Based on CV

UEV (unbiased estimate of variance): $\hat{\sigma}^2_{UEV} = \dfrac{y^T P^2 y}{p - \gamma}$

FPE (final prediction error, Akaike's information criterion): $\hat{\sigma}^2_{FPE} = \dfrac{p + \gamma}{p\,(p - \gamma)}\, y^T P^2 y$

GCV (generalized CV): $\hat{\sigma}^2_{GCV} = \dfrac{p\, y^T P^2 y}{\big(\mathrm{trace}(P)\big)^2}$

BIC (Bayesian information criterion): $\hat{\sigma}^2_{BIC} = \dfrac{p + (\ln p - 1)\gamma}{p\,(p - \gamma)}\, y^T P^2 y$

$\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$
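All four criteria are simple functions of P and y, so they can be computed together, as in this sketch (following the formulas as reconstructed above).

```python
import numpy as np

def selection_criteria(P, y):
    # UEV, FPE, GCV, BIC variance estimates from the projection matrix P.
    p = len(y)
    gamma = p - np.trace(P)                   # effective number of parameters
    yPPy = y @ P @ P @ y                      # y^T P^2 y
    uev = yPPy / (p - gamma)
    fpe = (p + gamma) / (p * (p - gamma)) * yPPy
    gcv = p * yPPy / np.trace(P) ** 2
    bic = (p + (np.log(p) - 1) * gamma) / (p * (p - gamma)) * yPPy
    return uev, fpe, gcv, bic
```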

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    113/139

More Criteria Based on CV

Expressed in terms of UEV:

$\hat{\sigma}^2_{FPE} = \dfrac{p + \gamma}{p}\, \hat{\sigma}^2_{UEV}, \quad \hat{\sigma}^2_{GCV} = \dfrac{p}{p - \gamma}\, \hat{\sigma}^2_{UEV}, \quad \hat{\sigma}^2_{BIC} = \dfrac{p + (\ln p - 1)\gamma}{p}\, \hat{\sigma}^2_{UEV}$

$\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    114/139

More Criteria Based on CV

Each criterion depends on the projection matrix $P$, which in turn depends on the regularization parameter(s) $\lambda$. How $P$ is obtained for standard ridge regression is reviewed next.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    115/139

Regularization

$C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$   (SSE plus a penalty on large weights)

Standard Ridge Regression: $\lambda_i = \lambda, \; \forall i$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    116/139

Regularization

Standard Ridge Regression ($\lambda_i = \lambda, \; \forall i$):

$\min_{w} \; C(w, \lambda) = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \lambda \sum_{i=1}^{m} w_i^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    117/139

Solution Review

$w^* = A^{-1} \Phi^T y, \quad A = \Phi^T \Phi + \lambda I_m$

$P = I_p - \Phi A^{-1} \Phi^T$   (used to compute the model selection criteria)

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    118/139

Example

$y(x) = (1 - x + 2x^2)\, e^{-x^2} + \varepsilon, \quad \varepsilon \sim N(0, 0.2^2), \quad p = 50$

Width of RBF: $r = 0.5$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    119/139

Example

[Plot: the criteria $\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$ for the same data ($p = 50$, $\varepsilon \sim N(0, 0.2^2)$, RBF width $r = 0.5$).]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    120/139

Example

[Plot: the four model selection criteria as functions of the regularization parameter, for the same data.]

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    121/139

Optimizing the Regularization Parameter

Re-Estimation Formula:

$\hat{\lambda} = \dfrac{y^T P^2 y \;\; \mathrm{trace}\big(A^{-1} - \lambda A^{-2}\big)}{w^{*T} A^{-1} w^{*} \;\; \mathrm{trace}(P)}$
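One way to read the re-estimation procedure as code is the fixed-point iteration sketched below; it uses the formula as reconstructed above, and the stopping rule and helper names are our own.

```python
import numpy as np

def reestimate_lambda(Phi, y, lam=0.01, n_iter=100, tol=1e-8):
    # Iterate the re-estimation formula for the global ridge parameter lambda.
    p, m = Phi.shape
    for _ in range(n_iter):
        A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
        P = np.eye(p) - Phi @ A_inv @ Phi.T
        w = A_inv @ Phi.T @ y
        new_lam = (y @ P @ P @ y) * np.trace(A_inv - lam * A_inv @ A_inv) \
                  / (w @ A_inv @ w * np.trace(P))
        if abs(new_lam - lam) < tol:     # converged to a fixed point
            break
        lam = new_lam
    return lam
```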

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    122/139

Local Ridge Regression

Re-Estimation Formula:

$\hat{\lambda} = \dfrac{y^T P^2 y \;\; \mathrm{trace}\big(A^{-1} - \lambda A^{-2}\big)}{w^{*T} A^{-1} w^{*} \;\; \mathrm{trace}(P)}$

Example: $y = \sin(12x) + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad m = p = 50$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    123/139

Example

Width of RBF: $r = 0.05$, $\gamma = p - \mathrm{trace}(P)$.

$y = \sin(12x) + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad m = p = 50$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    124/139

Example

Width of RBF: $r = 0.05$, $\gamma = p - \mathrm{trace}(P)$, $y = \sin(12x) + \varepsilon$, $\varepsilon \sim N(0, 0.1^2)$, $m = p = 50$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    125/139

Example (width of RBF r = 0.05)

There are two local minima.

Using the above re-estimation formula, the iteration gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.

$y = \sin(12x) + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad m = p = 50$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    126/139

Example (width of RBF r = 0.05)

There are two local minima.

Starting from $\lambda(0) = 0.01$: $\lim_{t \to \infty} \lambda(t) \approx 0.1$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    127/139

Example (width of RBF r = 0.05)

There are two local minima.

Starting from $\lambda(0) = 10^{-5}$: $\lim_{t \to \infty} \lambda(t) \approx 2.1 \times 10^{-4}$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    128/139

Example (width of RBF r = 0.05)

RMSE: root mean squared error. In a real case, it is not available.

$y = \sin(12x) + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad m = p = 50$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    129/139

Example (width of RBF r = 0.05)

RMSE: root mean squared error. In a real case, it is not available.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    130/139

Local Ridge Regression

Standard Ridge Regression: $\min\; C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \lambda \sum_{i=1}^{m} w_i^2$

Local Ridge Regression: $\min\; C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    131/139

Local Ridge Regression

Standard Ridge Regression: $\min\; C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \lambda \sum_{i=1}^{m} w_i^2$

Local Ridge Regression: $\min\; C = \sum_{k=1}^{p} \big(y^{(k)} - f(x^{(k)})\big)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$

$\lambda_j \to \infty$ implies that $\phi_j(\cdot)$ can be removed.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    132/139

The Solutions

$w^* = A^{-1} \Phi^T y, \quad P = I_p - \Phi A^{-1} \Phi^T$   (used to compute the model selection criteria)

- Linear regression: $A = \Phi^T \Phi$
- Standard ridge regression: $A = \Phi^T \Phi + \lambda I_m$
- Local ridge regression: $A = \Phi^T \Phi + \Lambda$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    133/139

Optimizing the Regularization Parameters

Incremental Operation

$P$: the current projection matrix.
$P_{-j}$: the projection matrix obtained by removing $\phi_j(\cdot)$.

$P = P_{-j} - \dfrac{P_{-j}\, \phi_j \phi_j^T\, P_{-j}}{\lambda_j + \phi_j^T P_{-j} \phi_j}$

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    134/139

Optimizing the Regularization Parameters

$\hat{\sigma}^2_{GCV} = \dfrac{p\, y^T P^2 y}{\big(\mathrm{trace}(P)\big)^2}$

Solve $\dfrac{\partial \hat{\sigma}^2_{GCV}}{\partial \lambda_j} = 0$ subject to $\lambda_j \ge 0$, using $P = P_{-j} - \dfrac{P_{-j}\, \phi_j \phi_j^T\, P_{-j}}{\lambda_j + \phi_j^T P_{-j} \phi_j}$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    135/139

Optimizing the Regularization Parameters

Solving $\partial \hat{\sigma}^2_{GCV} / \partial \lambda_j = 0$ subject to $\lambda_j \ge 0$ yields a closed-form, piecewise solution for $\lambda_j$ expressed in terms of $y$, $\phi_j$, $P_{-j}$, and $\mathrm{trace}(P_{-j})$: depending on the quantities involved, the optimum is a finite positive value, $0$, or $\infty$.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    136/139

Optimizing the Regularization Parameters

In the piecewise solution, the case $\lambda_j = \infty$ corresponds to removing $\phi_j(\cdot)$ from the network.

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    137/139

The Algorithm

- Initialize the $\lambda_i$'s, e.g., by performing standard ridge regression.
- Repeat the following until GCV converges:
  - Randomly select $j$ and compute the optimal $\hat{\lambda}_j$.
  - Perform local ridge regression with $\hat{\lambda}_j$ if GCV is reduced.
  - If $\hat{\lambda}_j \to \infty$, remove $\phi_j(\cdot)$.

(A code sketch of this loop follows.)
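A high-level sketch of the loop. Here optimal_lambda_j stands in for the closed-form GCV solution discussed above and gcv for the GCV score; both are left abstract, and the sweep structure and names are our own.

```python
import numpy as np

def local_ridge_selection(Phi, y, lambdas, optimal_lambda_j, gcv, max_sweeps=20):
    """Greedy local ridge regression.
    lambdas: initial regularization parameters (e.g., from standard ridge regression).
    optimal_lambda_j(Phi, y, lambdas, j): GCV-optimal lambda_j with the others fixed.
    gcv(Phi, y, lambdas): GCV score for the current parameters."""
    rng = np.random.default_rng(0)
    best = gcv(Phi, y, lambdas)
    for _ in range(max_sweeps):
        improved = False
        for j in rng.permutation(Phi.shape[1]):       # randomly visit each kernel
            trial = lambdas.copy()
            trial[j] = optimal_lambda_j(Phi, y, lambdas, j)
            score = gcv(Phi, y, trial)
            if score < best:                          # keep the change only if GCV drops
                lambdas, best, improved = trial, score, True
                # trial[j] == np.inf effectively removes kernel phi_j from the model
        if not improved:                              # GCV has converged
            break
    return lambdas
```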

  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    138/139

References

- Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," International Joint Conference on Artificial Intelligence (IJCAI).
- Orr, M. J. L. (April 1996). Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html
  • 7/28/2019 Lecture3---Introdunction to Radial Basis Function Networks

    139/139

Introduction to Radial Basis Function Networks

Incremental Operations