Regression III


  • Radial Basis Function Networks

    CSCI 5521: Paul Schrater

  • Introduction

    In this lecture we will look at RBF networks: networks in which the activation of a hidden unit is based on the distance between the input vector and a prototype vector.

    Radial Basis Functions have a number of interesting properties:
    - Strong connections to other disciplines: function approximation, regularization theory, density estimation, and interpolation in the presence of noise [Bishop, 1995]
    - RBFs allow a straightforward interpretation of the internal representation produced by the hidden layer
    - RBFs have training algorithms that are significantly faster than those for MLPs
    - RBFs can be implemented as support vector machines

  • Radial Basis Functions

    Discriminant function flexibility: non-linear, but with sets of linear parameters at each layer; provably general function approximators given sufficient nodes.

    Error function: mixed. Different error criteria are typically used for the hidden vs. output layers (hidden: input approximation; output: training error).

    Optimization: simple least squares; with hybrid training the solution is unique.

  • Two-layer Non-linear NN

    The non-linear functions are radially symmetric kernels with free parameters.

  • Input to Hidden Mapping

    Each region of feature space is covered by means of a radially symmetric function; the activation of a hidden unit is determined by the DISTANCE between the input vector x and a prototype vector.

    Choice of radial basis function: although several forms of radial basis may be used, Gaussian kernels are most commonly used.
    - The Gaussian kernel may have a full-covariance structure, which requires D(D+3)/2 parameters to be learned (D mean parameters plus D(D+1)/2 for a symmetric covariance matrix)
    - or a simpler diagonal structure, with only (D+1) independent parameters

    In practice, a trade-off exists between using a small number of basis functions with many parameters or a larger number of less flexible functions [Bishop, 1995].
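    For concreteness, the sketch below (Python/numpy) evaluates one Gaussian basis function in the full-covariance case and in a simpler shared-variance case; the input, center, and covariance values are illustrative placeholders, not values from the slides.

```python
import numpy as np

# Minimal sketch of a single Gaussian basis function with full vs. diagonal
# covariance. The input x, center mu, and covariance values are placeholders.
def gaussian_rbf_full(x, mu, Sigma):
    """exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu)): full-covariance kernel."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

def gaussian_rbf_diag(x, mu, sigma2):
    """Simpler case: a center plus a shared (or per-dimension) variance."""
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / sigma2))

x = np.array([0.3, -1.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])            # D(D+1)/2 = 3 free covariance entries
print(gaussian_rbf_full(x, mu, Sigma), gaussian_rbf_diag(x, mu, 1.0))
```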

  • Builds up a Function out of Spheres

  • RBF vs. NN


  • Origins: Exact Interpolation

    RBFs have their origins in techniques for performing exact interpolation. These techniques place a basis function at each of the training examples of f(x) and compute the coefficients wk so that the mixture model has zero error at the examples.

  • Formally: Exact Interpolation

    Goal: find a function h(x) such that h(xi) = ti for inputs xi and targets ti:

    \[
    y = h(x) = \sum_{i=1}^{n} w_i\,\phi\!\left(\lVert x - x_i \rVert\right)
    \]

    Form the linear system

    \[
    \begin{pmatrix}
    \phi_{11} & \phi_{12} & \cdots & \phi_{1n} \\
    \phi_{21} & \phi_{22} & \cdots & \phi_{2n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \phi_{n1} & \phi_{n2} & \cdots & \phi_{nn}
    \end{pmatrix}
    \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}
    =
    \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_n \end{pmatrix},
    \qquad \phi_{ij} = \phi\!\left(\lVert x_i - x_j \rVert\right)
    \]

    that is, \(\Phi \vec{w} = \vec{t}\), and solve for w:

    \[
    \vec{w} = \left(\Phi^{T}\Phi\right)^{-1}\Phi^{T}\vec{t}
    \]

    Recall the kernel trick.
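    The following is a minimal numpy sketch of exact interpolation with a Gaussian basis; the training data and the kernel width sigma are illustrative placeholders, not values from the lecture.

```python
import numpy as np

# Exact RBF interpolation: one Gaussian basis function per training example.
x_train = np.array([0.0, 1.0, 2.5, 4.0])   # inputs x_i (placeholder data)
t_train = np.array([0.0, 0.8, -0.3, 1.2])  # targets t_i
sigma = 1.0                                # assumed kernel width

def phi(r, sigma):
    """Gaussian radial basis function of the distance r."""
    return np.exp(-r**2 / (2.0 * sigma**2))

# Phi_ij = phi(||x_i - x_j||)
Phi = phi(np.abs(x_train[:, None] - x_train[None, :]), sigma)

# Solve Phi w = t exactly; Phi is square and, by Micchelli's theorem,
# nonsingular when the x_i are distinct.
w = np.linalg.solve(Phi, t_train)

def h(x):
    """Interpolant h(x) = sum_i w_i phi(||x - x_i||); zero error at the x_i."""
    return phi(np.abs(x - x_train), sigma) @ w

print(np.array([h(x) for x in x_train]))   # reproduces t_train up to numerics
```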

  • Solving Conditions

    Micchelli's theorem: if the points xi are distinct, then the matrix Φ will be nonsingular.

    Mhaskar and Micchelli, "Approximation by superposition of sigmoidal and radial basis functions," Advances in Applied Mathematics, 13, 350-373, 1992.


  • Radial Basis Function Networks

    With n outputs and m basis functions:

    \[
    y_k = h_k(x) = \sum_{i=1}^{m} w_{ki}\,\phi_i(x) + w_{k0},
    \qquad \text{e.g. } \phi_i(x) = \exp\!\left(-\frac{\lVert x - \mu_i \rVert^{2}}{2\sigma_i^{2}}\right)
    \]
    \[
    = \sum_{i=0}^{m} w_{ki}\,\phi_i(x), \qquad \phi_0(x) = 1
    \]

    Rewrite this as:

    \[
    \vec{y} = W\,\vec{\phi}(x), \qquad
    \vec{\phi}(x) = \begin{pmatrix} \phi_0(x) \\ \vdots \\ \phi_m(x) \end{pmatrix}
    \]
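    A minimal sketch of this forward pass, assuming Gaussian basis functions; the centers, widths, and weights below are illustrative placeholders.

```python
import numpy as np

# Minimal sketch of the forward pass y = W phi(x), assuming Gaussian bases.
# The centers, widths, and weights below are illustrative placeholders.
def rbf_features(x, centers, sigmas):
    """phi(x) with phi_0(x) = 1 as the bias basis function."""
    dists2 = np.sum((x[None, :] - centers) ** 2, axis=1)
    phi = np.exp(-dists2 / (2.0 * sigmas ** 2))
    return np.concatenate(([1.0], phi))            # length m + 1

def rbf_forward(x, centers, sigmas, W):
    """y_k = sum_{i=0..m} W[k, i] * phi_i(x)."""
    return W @ rbf_features(x, centers, sigmas)

centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])   # m = 3 centers, D = 2
sigmas = np.array([1.0, 1.0, 0.5])                          # one width per basis
W = np.random.randn(2, 4)                                   # n = 2 outputs, m + 1 columns
print(rbf_forward(np.array([0.5, 0.5]), centers, sigmas, W))
```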

  • Learning

    RBFs are commonly trained with a hybrid procedure that operates in two stages.

    Unsupervised selection of RBF centers
    - RBF centers are selected so as to match the distribution of training examples in the input feature space
    - This is the critical step in training, normally performed in a slow, iterative manner
    - A number of strategies are used to solve this problem

    Supervised computation of output weights
    - Hidden-to-output weight vectors are determined so as to minimize the sum-squared error between the RBF outputs and the desired targets
    - Since the outputs are linear in these weights, the optimal weights can be computed with fast linear least-squares methods

  • Least Squares Solution for Output Weights

    Given L input vectors xl with labels tl, form the least-squares error function:

    \[
    E = \sum_{l=1}^{L}\sum_{k=1}^{n} \left\{ y_k(x_l) - t_{lk} \right\}^{2},
    \qquad t_{lk} \text{ is the } k\text{th value of the target vector}
    \]
    \[
    E = \sum_{l=1}^{L} \left(\vec{y}(x_l) - \vec{t}_l\right)^{T}\left(\vec{y}(x_l) - \vec{t}_l\right)
      = \sum_{l=1}^{L} \left(\vec{\phi}(x_l)^{T} W^{T} - \vec{t}_l^{\,T}\right)^{T}\left(\vec{\phi}(x_l)^{T} W^{T} - \vec{t}_l^{\,T}\right)
    \]

    Stacking the \(\vec{\phi}(x_l)^{T}\) into the rows of \(\Phi\) and the \(\vec{t}_l^{\,T}\) into the rows of \(T\):

    \[
    E = \left(\Phi W^{T} - T\right)^{T}\left(\Phi W^{T} - T\right)
      = W\Phi^{T}\Phi W^{T} - 2W\Phi^{T}T + T^{T}T
    \]
    \[
    \frac{\partial E}{\partial W} = 0 = 2\Phi^{T}\Phi W^{T} - 2\Phi^{T}T
    \quad\Rightarrow\quad
    W^{T} = \left(\Phi^{T}\Phi\right)^{-1}\Phi^{T}T
    \]
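    A minimal numpy sketch of this supervised stage, assuming the centers and widths are already fixed; the data, center selection, and the design_matrix helper are illustrative placeholders.

```python
import numpy as np

# Supervised stage: with centers and widths fixed, the output weights follow
# from linear least squares. All data below are placeholders.
def design_matrix(X, centers, sigmas):
    """Rows are phi(x_l)^T, with phi_0 = 1 as the bias column."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)   # L x m
    Phi = np.exp(-d2 / (2.0 * sigmas ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])                 # L x (m + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                          # L = 50 inputs, D = 2
T = rng.normal(size=(50, 3))                          # targets, n = 3 outputs
centers = X[rng.choice(50, size=5, replace=False)]    # m = 5 centers (random pick)
sigmas = np.full(5, 1.0)

Phi = design_matrix(X, centers, sigmas)
# W^T = (Phi^T Phi)^{-1} Phi^T T, computed stably with a least-squares solver
Wt, *_ = np.linalg.lstsq(Phi, T, rcond=None)          # shape (m + 1) x n
W = Wt.T                                              # n x (m + 1), so y = W phi(x)
```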

  • Solving for Input Parameters

    We now have a linear solution if we know the input parameters. How do we set the input parameters?
    - Gradient descent, as in back-propagation
    - Density estimation viewpoint: by looking at the meaning of the RBF network for classification, the input parameters can be estimated via density estimation and/or clustering techniques

    We will skip these in lecture.

  • Unsupervised Methods Overview

    Random selection of centers
    - The simplest approach is to randomly select a number of training examples as RBF centers
    - This method has the advantage of being very fast, but the network will likely require an excessive number of centers
    - Once the center positions have been selected, the spread parameters σj can be estimated, for instance, from the average distance between neighboring centers

    Clustering
    - Alternatively, RBF centers may be obtained with a clustering procedure such as the k-means algorithm
    - The spread parameters can be computed as before, or from the sample covariance of the examples of each cluster

    Density estimation
    - The positions of the RBF centers may also be obtained by modeling the feature-space density with a Gaussian Mixture Model fitted by the Expectation-Maximization algorithm
    - The spread parameters for each center are then obtained automatically from the covariance matrices of the corresponding Gaussian components
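    One possible sketch of the clustering route, using scipy's kmeans2 (assuming scipy is available); the data, the number of centers, and the spread heuristic (mean distance to the other centers) are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Unsupervised center selection via k-means, with spreads set from the average
# distance to the other centers (a simple variant of the neighbor-distance
# heuristic). X is placeholder data.
X = np.random.randn(200, 2)                      # training inputs
m = 8                                            # number of RBF centers

centers, labels = kmeans2(X, m, minit='++')

pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
np.fill_diagonal(pairwise, np.nan)
sigmas = np.nanmean(pairwise, axis=1)            # one spread per center
```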

  • Probabilistic Interpretation of RBFs for Classification

    Introduce a mixture model over the input density and expand the class posterior over its components:

    \[
    P(C_k \mid x) = \sum_{j} P(C_k \mid j)\, P(j \mid x)
    \]

  • Probabilistic Interpretation (continued)

    where
    - the basis function outputs are the posterior probabilities of the jth component (feature) given x
    - the weights are the class probabilities given the jth component (feature) value
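    A small sketch of this reading under an assumed two-component Gaussian mixture: the basis outputs are component posteriors P(j | x) and the output weights are a class-given-component table P(C_k | j). All parameter values here are hypothetical placeholders.

```python
import numpy as np

# Basis outputs as component posteriors P(j | x); weights as P(C_k | j).
pi = np.array([0.5, 0.5])                     # component priors (placeholder)
mu = np.array([[0.0, 0.0], [2.0, 2.0]])       # component means
var = np.array([1.0, 1.0])                    # isotropic component variances

def component_posteriors(x):
    """phi_j(x) = P(j | x) under the assumed two-component Gaussian mixture."""
    d2 = np.sum((x - mu) ** 2, axis=1)
    joint = pi * np.exp(-0.5 * d2 / var) / (2.0 * np.pi * var)   # P(j) p(x | j), D = 2
    return joint / joint.sum()

P_class_given_comp = np.array([[0.9, 0.2],    # P(C_0 | j) for j = 0, 1
                               [0.1, 0.8]])   # P(C_1 | j) for j = 0, 1

x = np.array([0.5, 0.3])
print(P_class_given_comp @ component_posteriors(x))   # P(C_k | x)
```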

  • Regularized Spline Fits

    Given

    \[
    \Phi = \begin{pmatrix}
    \phi_1(X_1) & \cdots & \phi_N(X_1) \\
    \vdots & & \vdots \\
    \phi_1(X_M) & \cdots & \phi_N(X_M)
    \end{pmatrix},
    \qquad y_{\mathrm{pred}} = \Phi w
    \]

    minimize

    \[
    L = \left(y - \Phi w\right)^{T}\left(y - \Phi w\right) + \lambda\, w^{T}\Omega w
    \]

    where \(\Omega\) makes a penalty on large curvature:

    \[
    \Omega_{ij} = \int \left(\frac{d^{2}\phi_i(x)}{dx^{2}}\right)\left(\frac{d^{2}\phi_j(x)}{dx^{2}}\right) dx
    \]

    Solution (generalized ridge regression):

    \[
    w = \left(\Phi^{T}\Phi + \lambda\Omega\right)^{-1}\Phi^{T}\vec{y}
    \]


  • Equivalent Kernel

    Linear regression solutions admit a kernel interpretation:

    \[
    w = \left(\Phi^{T}\Phi + \lambda\Omega\right)^{-1}\Phi^{T}\vec{y},
    \qquad
    y_{\mathrm{pred}}(x) = \vec{\phi}(x)^{T} w
      = \vec{\phi}(x)^{T}\left(\Phi^{T}\Phi + \lambda\Omega\right)^{-1}\Phi^{T}\vec{y}
    \]

    Let \(S_\lambda = \left(\Phi^{T}\Phi + \lambda\Omega\right)^{-1}\), so that \(w = S_\lambda \Phi^{T}\vec{y} = S_\lambda \sum_i \vec{\phi}(x_i)\,y_i\). Then

    \[
    y_{\mathrm{pred}}(x) = \vec{\phi}(x)^{T} S_\lambda \sum_i \vec{\phi}(x_i)\, y_i
      = \sum_i \vec{\phi}(x)^{T} S_\lambda\, \vec{\phi}(x_i)\, y_i
      = \sum_i K(x, x_i)\, y_i
    \]
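    A small numpy check of this identity; Phi, Omega, y, and lam are placeholders in the same spirit as the ridge-regression sketch above.

```python
import numpy as np

# Equivalent-kernel view: the fitted values can be written as K y with
# K[i, j] = phi(x_i)^T S_lambda phi(x_j).
M, N = 100, 12
Phi = np.random.randn(M, N)
y = np.random.randn(M)
Omega = np.eye(N)
lam = 0.1

S_lam = np.linalg.inv(Phi.T @ Phi + lam * Omega)   # N x N

# Equivalent kernel evaluated between all pairs of training points
K = Phi @ S_lam @ Phi.T                            # M x M smoother matrix

# The two routes to the fitted values agree: Phi w versus K y.
w = S_lam @ Phi.T @ y
assert np.allclose(Phi @ w, K @ y)
```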

  • Eigenanalysis and Degrees of Freedom

    An eigenanalysis of the effective (equivalent) kernel gives the effective degrees of freedom (DF), i.e. the number of basis functions effectively in use.

    Evaluate K at all pairs of training points:

    \[
    K_{ij} = K(x_i, x_j)
    \]

    Eigenanalysis:

    \[
    K = V D V^{-1} = U S U^{T}
    \]
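    A minimal sketch of this eigenanalysis; the matrices are placeholders as in the previous sketches, and the use of the eigenvalue sum (the trace of the smoother) as the effective DF is one common summary, stated here as an assumption rather than the slides' definition.

```python
import numpy as np

# Eigenanalysis of the equivalent kernel evaluated at the training points.
M, N = 100, 12
Phi = np.random.randn(M, N)
Omega = np.eye(N)
lam = 0.1

K = Phi @ np.linalg.inv(Phi.T @ Phi + lam * Omega) @ Phi.T   # symmetric M x M

eigvals, U = np.linalg.eigh(K)            # K = U S U^T for symmetric K
effective_df = eigvals.sum()              # equals np.trace(K)
print(effective_df)
```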

  • Equivalent Basis Functions


  • Computing Error Bars on Predictions

    It is important to be able to compute error bars on linear regression predictions. We can do that by propagating the error in the measured values to the predicted values.

    Given the Gram matrix

    \[
    \Phi = \begin{pmatrix}
    \phi_1(X_1) & \cdots & \phi_N(X_1) \\
    \vdots & & \vdots \\
    \phi_1(X_M) & \cdots & \phi_N(X_M)
    \end{pmatrix}
    \]

    the simple least-squares predictions are

    \[
    w = \left(\Phi^{T}\Phi\right)^{-1}\Phi^{T}\vec{y},
    \qquad
    y_{\mathrm{pred}} = \Phi w = \Phi\left(\Phi^{T}\Phi\right)^{-1}\Phi^{T}\vec{y} = S\vec{y}
    \]

    and the propagated error is

    \[
    \operatorname{cov}[y_{\mathrm{pred}}] = \operatorname{cov}[S\vec{y}]
      = S\operatorname{cov}[\vec{y}]\,S^{T}
      = S\left(\sigma^{2} I\right)S^{T}
      = \sigma^{2} S S^{T}
    \]
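    A minimal numpy sketch of this error propagation; Phi, y, and the noise variance sigma2 are illustrative placeholders.

```python
import numpy as np

# Error propagation for least-squares predictions:
# cov[y_pred] = sigma^2 * S S^T with S = Phi (Phi^T Phi)^{-1} Phi^T.
M, N = 100, 12
Phi = np.random.randn(M, N)
y = np.random.randn(M)
sigma2 = 0.25                                   # assumed measurement noise variance

S = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T    # smoother ("hat") matrix
y_pred = S @ y
cov_pred = sigma2 * S @ S.T                     # propagated covariance
error_bars = np.sqrt(np.diag(cov_pred))         # one-sigma error bar per prediction
```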
