Regression III

Radial Basis Function Networks
CSCI 5521: Paul Schrater

Introduction
• In this lecture we look at RBFs: networks in which the activation of a hidden unit is based on the distance between the input vector and a prototype vector.
• Radial basis functions have a number of interesting properties:
  - Strong connections to other disciplines: function approximation, regularization theory, density estimation, and interpolation in the presence of noise [Bishop, 1995].
  - RBFs allow a straightforward interpretation of the internal representation produced by the hidden layer.
  - RBFs have training algorithms that are significantly faster than those for MLPs.
  - RBFs can be implemented as support vector machines.

Radial Basis Functions
• Discriminant function flexibility: non-linear, but with a set of linear parameters at each layer; provably general function approximators given sufficiently many nodes.
• Error function: mixed; different error criteria are typically used for the hidden and output layers (hidden: input approximation; output: training error).
• Optimization: simple least squares; with hybrid training the solution is unique.

Two-layer Non-linear NN
• The non-linear hidden functions are radially symmetric kernels with free parameters.

Input to Hidden Mapping
• Each region of feature space is represented by means of a radially symmetric function: the activation of a hidden unit is determined by the DISTANCE between the input vector x and a prototype vector \mu.
• Choice of radial basis function: although several forms may be used, Gaussian kernels are the most common.
  - A Gaussian kernel may have a full-covariance structure, which requires D(D+3)/2 parameters to be learned, or a diagonal structure, with only (D+1) independent parameters.
  - In practice there is a trade-off between a small number of basis functions with many parameters and a larger number of less flexible functions [Bishop, 1995].

Builds up a function out of spheres (figure)

RBF vs. NN (figure)

RBFs have their origins in techniques for performing exact interpolation: these techniques place a basis function at each of the training examples of f(x) and compute the coefficients w_k so that the mixture model has zero error at the examples.

Formally: Exact Interpolation
Goal: find a function h(x) such that h(x_i) = t_i for all inputs x_i and targets t_i:

  y = h(x) = \sum_{i=1}^{n} w_i \, \phi(\|x - x_i\|)

Writing \Phi_{ij} = \phi(\|x_i - x_j\|), the interpolation conditions take the matrix form

  \Phi w = t,

and solving for the weights gives

  w = (\Phi^T \Phi)^{-1} \Phi^T t.

(Recall the kernel trick.)
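
The following is a minimal NumPy sketch of this exact-interpolation recipe; the Gaussian kernel with a single hand-picked width sigma, the use of a least-squares solver in place of the explicit pseudoinverse, and the toy sine data are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

# Exact RBF interpolation: place one Gaussian basis function at every training
# point and choose the weights so the model has zero error at the examples.
# The kernel width sigma below is a hand-picked value for this toy example.

def gaussian(r, sigma):
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

def fit_exact_interpolant(X, t, sigma):
    # Phi_ij = phi(||x_i - x_j||): kernel evaluated on all pairs of training points
    Phi = gaussian(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1), sigma)
    # For distinct x_i the matrix Phi is nonsingular, so Phi w = t can be solved
    # exactly; lstsq is used instead of an explicit inverse for numerical stability.
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def interpolant(X_train, w, X_new, sigma):
    Phi_new = gaussian(np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=-1), sigma)
    return Phi_new @ w

# Toy usage: interpolate 20 samples of a sine curve; the fit is exact at the samples.
X = np.linspace(0.0, 1.0, 20)[:, None]
t = np.sin(2.0 * np.pi * X[:, 0])
w = fit_exact_interpolant(X, t, sigma=0.05)
print(np.allclose(interpolant(X, w, X, sigma=0.05), t))   # True: zero error at the examples
```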

Solving conditions
• Micchelli's Theorem: if the points x_i are distinct, then the matrix \Phi will be nonsingular.
• Mhaskar and Micchelli, "Approximation by superposition of sigmoidal and radial basis functions," Advances in Applied Mathematics, 13, 350-373, 1992.

Radial Basis Function Networks
With n outputs and m basis functions:

  y_k = h_k(x) = \sum_{i=1}^{m} w_{ki} \, \phi_i(x) + w_{k0}, \qquad \text{e.g. } \phi_i(x) = \exp\!\left( -\frac{\|x - \mu_i\|^2}{2\sigma_i^2} \right).

Absorbing the bias into a constant basis function \phi_0(x) = 1,

  y_k = \sum_{i=0}^{m} w_{ki} \, \phi_i(x), \qquad \text{or, in vector form,} \qquad y = W \phi(x), \quad \phi(x) = \big( \phi_0(x), \ldots, \phi_m(x) \big)^T.

Learning
• RBFs are commonly trained with a hybrid procedure that operates in two stages.
• Unsupervised selection of RBF centers:
  - Centers are selected so as to match the distribution of training examples in the input feature space.
  - This is the critical step in training, normally performed in a slow, iterative manner; a number of strategies are used to solve this problem.
• Supervised computation of output weights:
  - Hidden-to-output weight vectors are determined so as to minimize the sum-squared error between the RBF outputs and the desired targets.
  - Since the outputs are linear in the weights, the optimal weights can be computed with fast, linear least-squares methods.

Least squares solution for output weights
Given L input vectors x_l with targets t_l, form the least-squares error

  E = \sum_{l=1}^{L} \sum_{k=1}^{n} \big\{ y_k(x_l) - t_{lk} \big\}^2 = \sum_{l=1}^{L} \big( y(x_l) - t_l \big)^T \big( y(x_l) - t_l \big),

where t_{lk} is the kth component of the lth target vector. Stacking the examples row-wise into a design matrix \Phi and a target matrix T,

  E = \operatorname{tr}\!\big[ (\Phi W^T - T)^T (\Phi W^T - T) \big] = \operatorname{tr}\!\big[ W \Phi^T \Phi W^T - 2 W \Phi^T T + T^T T \big],

  \frac{\partial E}{\partial W} = 0 \;\Rightarrow\; \Phi^T \Phi W^T = \Phi^T T \;\Rightarrow\; W^T = (\Phi^T \Phi)^{-1} \Phi^T T.
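
Assuming the centers mu and spreads sigma have already been fixed by one of the unsupervised methods covered on the next slides, the supervised stage above reduces to a single linear solve. A minimal NumPy sketch, with array shapes and names chosen for illustration:

```python
import numpy as np

# Supervised stage of the hybrid procedure: with the centers mu (m x D) and
# spreads sigma (m,) held fixed, the output weights follow from least squares,
# W^T = (Phi^T Phi)^{-1} Phi^T T. Shapes and names are assumptions of this sketch.

def design_matrix(X, mu, sigma):
    """Rows are [1, phi_1(x), ..., phi_m(x)]; the constant phi_0 = 1 carries the bias w_k0."""
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)    # L x m
    Phi = np.exp(-sq_dists / (2.0 * sigma[None, :] ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])                  # L x (m+1)

def solve_output_weights(X, T, mu, sigma):
    Phi = design_matrix(X, mu, sigma)
    Wt, *_ = np.linalg.lstsq(Phi, T, rcond=None)   # least-squares solution of Phi W^T = T
    return Wt                                      # (m+1) x n

def rbf_forward(X, Wt, mu, sigma):
    return design_matrix(X, mu, sigma) @ Wt        # y = W phi(x), one row per input
```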

Solving for input parameters
• We now have a linear solution, provided we know the input parameters (the centers and spreads). How do we set the input parameters?
  - Gradient descent, much as in back-prop.
  - Density-estimation viewpoint: from the meaning of the RBF network for classification, the input parameters can be estimated via density estimation and/or clustering techniques.
• We will skip the details in lecture.

Unsupervised methods overview
• Random selection of centers
  - The simplest approach is to randomly select a number of training examples as RBF centers.
  - This method has the advantage of being very fast, but the network will likely require an excessive number of centers.
  - Once the center positions have been selected, the spread parameters \sigma_j can be estimated, for instance, from the average distance between neighboring centers.
• Clustering
  - Alternatively, RBF centers may be obtained with a clustering procedure such as the k-means algorithm.
  - The spread parameters can be computed as before, or from the sample covariance of the examples in each cluster.
• Density estimation
  - The positions of the RBF centers may also be obtained by modeling the feature-space density with a Gaussian mixture model fit by the Expectation-Maximization algorithm.
  - The spread parameters for each center are then obtained automatically from the covariance matrices of the corresponding Gaussian components.

Probabilistic interpretation of RBFs for classification (figure: introduce a mixture over feature values)

Prob RBF conditionals
• Basis function outputs: posterior probabilities of the jth feature value.
• Weights: class probability given the jth feature value.

Regularized Spline Fits
Given the design matrix \Phi with entries \Phi_{mi} = \phi_i(X_m), the predictions are y_{\text{pred}} = \Phi w. Minimize

  L = (y - \Phi w)^T (y - \Phi w) + \lambda \, w^T \Omega \, w,

where \Omega places a penalty on large curvature,

  \Omega_{ij} = \int \frac{d^2 \phi_i(x)}{dx^2} \, \frac{d^2 \phi_j(x)}{dx^2} \, dx.

The solution is

  w = (\Phi^T \Phi + \lambda \Omega)^{-1} \Phi^T y \qquad \text{(generalized ridge regression).}

Equivalent Kernel
Linear-regression solutions admit a kernel interpretation:

  w = (\Phi^T \Phi + \lambda \Omega)^{-1} \Phi^T y, \qquad y_{\text{pred}}(x) = \phi(x)^T w = \phi(x)^T (\Phi^T \Phi + \lambda \Omega)^{-1} \Phi^T y.

Letting S_\lambda = (\Phi^T \Phi + \lambda \Omega)^{-1} and using \Phi^T y = \sum_i \phi(x_i) \, y_i,

  y_{\text{pred}}(x) = \phi(x)^T S_\lambda \sum_i \phi(x_i) \, y_i = \sum_i K(x, x_i) \, y_i, \qquad K(x, x') = \phi(x)^T S_\lambda \, \phi(x').

Eigenanalysis and DF
• An eigenanalysis of the effective kernel gives the effective degrees of freedom (DF), i.e. the effective number of basis functions (a short sketch follows below).

  Evaluate K at all training points, K_{ij} = K(x_i, x_j), and diagonalize: K = V D V^{-1} = U S U^T.

Equivalent basis functions (figure)
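
One way to make the equivalent-kernel and degrees-of-freedom ideas above concrete is to form the smoother matrix S = \Phi (\Phi^T \Phi + \lambda \Omega)^{-1} \Phi^T, whose entries are the equivalent kernel evaluated at the training points, and sum its eigenvalues; the trace of S is a standard reading of the effective degrees of freedom. The sketch below assumes a plain ridge penalty \Omega = I as a stand-in for the curvature penalty, purely to keep the example self-contained.

```python
import numpy as np

# Equivalent kernel and effective degrees of freedom for the penalized fit
# w = (Phi^T Phi + lam * Omega)^{-1} Phi^T y. Omega defaults to the identity
# (plain ridge) here as a stand-in for the curvature penalty on the slides.

def smoother_matrix(Phi, lam, Omega=None):
    if Omega is None:
        Omega = np.eye(Phi.shape[1])
    S_lam = np.linalg.inv(Phi.T @ Phi + lam * Omega)
    return Phi @ S_lam @ Phi.T        # y_pred = S y; entry (i, j) is K(x_i, x_j)

def effective_df(Phi, lam, Omega=None):
    S = smoother_matrix(Phi, lam, Omega)
    eigvals = np.linalg.eigvalsh((S + S.T) / 2.0)   # eigenanalysis of the effective kernel
    return float(eigvals.sum())                     # trace(S): effective number of basis functions
```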

Computing Error Bars on Predictions
It is important to be able to compute error bars on linear-regression predictions. We can do so by propagating the error in the measured values through to the predicted values. Given the design matrix \Phi with \Phi_{mi} = \phi_i(X_m), simple least squares gives

  w = (\Phi^T \Phi)^{-1} \Phi^T y,

so the predictions are

  y_{\text{pred}} = \Phi w = \Phi (\Phi^T \Phi)^{-1} \Phi^T y = S \, y,

and propagating the noise in y gives the error

  \operatorname{cov}[y_{\text{pred}}] = \operatorname{cov}[S y] = S \operatorname{cov}[y] \, S^T = S (\sigma^2 I) S^T = \sigma^2 S S^T.
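
A short NumPy sketch of this propagation, assuming independent target noise with known variance sigma2 (in practice sigma2 would itself have to be estimated, for example from the residuals):

```python
import numpy as np

# Error bars on least-squares predictions: y_pred = S y with
# S = Phi (Phi^T Phi)^{-1} Phi^T, so cov[y_pred] = sigma2 * S S^T.

def prediction_error_bars(Phi, y, sigma2):
    S = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
    y_pred = S @ y
    cov_pred = sigma2 * (S @ S.T)
    return y_pred, np.sqrt(np.diag(cov_pred))   # one-sigma error bar for each prediction
```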