
A Leave-K-Out Cross-Validation Scheme for Unsupervised Kernel Regression

Stefan Klanke and Helge Ritter

{sklanke,helge}@techfak.uni-bielefeld.de

Neuroinformatics Group

Faculty of Technology

University of Bielefeld

P.O. Box 10 01 31

33501 Bielefeld, Germany

Abstract

We show how to employ leave-K-out cross-validation in Unsupervised Kernel Regression, a recent method for learning of nonlinear manifolds. We thereby generalize an already present regularization method, yielding more flexibility without additional computational cost. We demonstrate our method on both toy and real data.

1 Introduction

Unsupervised Kernel Regression (UKR) is a recent approach for the learning of principal manifolds. It has been introduced as an unsupervised counterpart of the Nadaraya-Watson kernel regression estimator in [1]. Probably the most important feature of UKR is the ability to include leave-one-out cross-validation (LOO-CV) at no additional cost. In this work, we show how extending LOO-CV to leave-K-out cross-validation (LKO-CV) gives rise to a more flexible regularization approach, while keeping the computational efficiency.

The paper is organized as follows: In the next section we recall the UKR algorithm and briefly review its existing regularization approaches. After that, we introduce our generalization to LKO-CV as well as a simple complementary regularizer. Then, we report some results of our experiments, and finally we conclude with some remarks on the method and an outlook on further work.


2 The UKR Algorithm

In classical (supervised) kernel regression, the Nadaraya-Watson estimator [2, 3]

    f(x) = \sum_i y_i \frac{K(x - x_i)}{\sum_j K(x - x_j)}    (1)

is used to describe a smooth mapping y = f(x) that generalizes the relation between available input and output data samples {x_i} and {y_i}. Here, K(·) is a density kernel function, e.g. the Gaussian kernel K(v) ∝ exp(−‖v‖²/(2h²)), where h is a bandwidth parameter which controls the smoothness of the mapping.
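
For concreteness, the following minimal sketch (not from the paper) evaluates the estimator (1) with a Gaussian kernel; NumPy, row-wise sample storage, and the function name nadaraya_watson are illustrative assumptions.

import numpy as np

def nadaraya_watson(x, X_train, Y_train, h=1.0):
    # Nadaraya-Watson estimate f(x) as in Eq. (1), using a Gaussian kernel.
    # X_train: (N, d) inputs x_i as rows; Y_train: (N, D) outputs y_i as rows.
    d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances ||x - x_i||^2
    w = np.exp(-d2 / (2.0 * h ** 2))          # kernel values K(x - x_i)
    w /= w.sum()                              # divide by sum_j K(x - x_j)
    return Y_train.T @ w                      # locally weighted average of the y_i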

In unsupervised learning, one seeks both a faithful lower dimensional representation (latent variables) X = (x_1, x_2, ..., x_N) of an observed data set Y = (y_1, y_2, ..., y_N) and a corresponding functional relationship. UKR addresses this problem by using (1) as the mapping from latent space to data space, whereby the latent variables take the role of the input data and are treated as parameters of the regression function. By introducing a vector b(·) ∈ R^N of basis functions, the latter can conveniently be written as

    f(x; X) = \sum_i y_i \frac{K(x - x_i)}{\sum_j K(x - x_j)} = Y b(x; X) .    (2)

While the bandwidth parameter h is crucial in classical kernel regression, here we can set h = 1, because the scaling of X itself is free. Thus, UKR requires no additional parameters besides the choice of a density kernel (which is known to be of relatively small importance in classical kernel regression). This distinguishes UKR from many other algorithms (e.g. [4, 5]) that, albeit using a similar form of regression function, need an a priori specification of many parameters (e.g. the number of basis functions).

Training an UKR manifold, that is, finding optimal latent variables X, involves gradient-based minimization of the reconstruction error (or empirical risk)

    R(X) = \frac{1}{N} \sum_i \|y_i - f(x_i; X)\|^2 = \frac{1}{N} \|Y - Y B(X)\|_F^2 ,    (3)

where the N × N matrix of basis functions B(X) is given by

    (B(X))_{ij} = b_i(x_j) = \frac{K(x_i - x_j)}{\sum_k K(x_k - x_j)} .    (4)
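
As an illustration, here is a minimal sketch (not from the paper) of how B(X) and R(X) from (3)-(4) might be computed, assuming a Gaussian kernel with h = 1 and the column-wise storage X ∈ R^{q×N}, Y ∈ R^{D×N} used above; the function names are ours.

import numpy as np

def ukr_basis_matrix(X):
    # Matrix of basis functions B(X), Eq. (4); X holds the latent variables
    # column-wise, shape (q, N). Gaussian kernel with h = 1.
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # pairwise squared distances
    K = np.exp(-0.5 * d2)
    return K / K.sum(axis=0, keepdims=True)                    # each column sums to 1

def ukr_risk(X, Y):
    # Reconstruction error R(X), Eq. (3); Y holds the data column-wise, shape (D, N).
    N = Y.shape[1]
    return np.sum((Y - Y @ ukr_basis_matrix(X)) ** 2) / N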

To avoid getting stuck in poor local minima, one can incorporate nonlinear spectral embedding methods (e.g. [6, 7, 8]) to find good initializations.

It is easy to see that without any form of regularization, (3) can be trivially minimized to R(X) = 0 by moving the x_i infinitely apart from each other. In this case, since K(·) is a density function, ‖x_i − x_j‖ → ∞ for all i ≠ j implies that K(x_i − x_j) → δ_{ij} K(0), and thus B(X) becomes the N × N identity matrix.

2.1 Existing regularization approaches

2.1.1 Extension of latent space

A straightforward way to prevent the aforementioned trivial interpolation solution and to control the complexity of an UKR model is to restrict the latent variables to lie within a certain allowed (finite) domain 𝒳, e.g. a sphere of radius R. Training of the UKR model then means solving the optimization problem

    minimize  R(X) = \frac{1}{N} \|Y - Y B(X)\|_F^2   subject to  \|x_i\| \le R  for all i.    (5)

A closely related, but softer and numerically easier method is to add a penalty term to the reconstruction error (3) and minimize R_e(X, λ) = R(X) + λ S(X) with S(X) = \sum_i \|x_i\|^2. Other penalty terms (e.g. the L_p-norm) are possible.

With the above formalism, the model complexity can be directly controlled by the pre-factor λ or the parameterization of 𝒳. However, one normally has no information about how to choose these parameters. Larger values of λ lead to stronger overlap of the density kernels and thus to smoother manifolds, but it is not clear how to select λ to achieve a certain degree of smoothness.
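
Continuing the earlier sketch (again an illustration with hypothetical function names, not the authors' code), the penalized objective R_e(X, λ) might look as follows.

import numpy as np

def extension_penalty(X):
    # S(X) = sum_i ||x_i||^2, penalizing the extension of the latent space (X column-wise).
    return np.sum(X ** 2)

def penalized_risk(X, Y, lam):
    # R_e(X, lambda) = R(X) + lambda * S(X), reusing ukr_risk() from the earlier sketch.
    return ukr_risk(X, Y) + lam * extension_penalty(X)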

2.1.2 Density in latent space

The denominator in (1) is proportional to the Rosenblatt-Parzen density estimator p(x) = \frac{1}{N} \sum_{i=1}^{N} K(x - x_i). Stronger overlap of the kernel functions coincides with higher densities in latent space, which gives rise to another method for complexity control. As in the last paragraph, the density p(x) can be used both in a constrained minimization of R(X) subject to p(x_i) ≥ η for all i, or in the form of a penalty function with some pre-factor λ. Compared to a regularization based on the extension of latent space, the density-based regularization tends to work more locally and allows a clustered structure of the latent variables (non-contiguous manifolds). Again, suitable values for λ and η can be difficult to specify.

2.1.3 Leave-one-out cross-validation

Perhaps the strongest feature of UKR is the ability to include leave-one-out cross-validation (LOO-CV) without additional computational cost. Instead of minimizing the reconstruction error of a UKR model including the complete dataset, in LOO-CV each data vector y_i has to be reconstructed without using y_i itself:

    R_{cv}(X) = \frac{1}{N} \sum_i \|y_i - f_{-i}(x_i; X)\|^2 = \frac{1}{N} \|Y - Y B_{cv}(X)\|_F^2    (6)

    f_{-i}(x) = \sum_{m \ne i} y_m \frac{K(x - x_m)}{\sum_{j \ne i} K(x - x_j)}    (7)

For the computation of the matrix of basis functions B_cv, this just means zeroing the diagonal elements before normalizing the column sums to 1. A similar strategy also works for calculating the gradient of (6).
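
In code, this amounts to one extra line compared to the earlier sketch (again only an illustrative sketch with the same conventions and ours-only function names):

import numpy as np

def ukr_basis_matrix_loo(X):
    # LOO-CV matrix B_cv(X): zero the diagonal before normalizing the column
    # sums to 1, so that y_i does not take part in its own reconstruction (Eqs. (6)-(7)).
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    K = np.exp(-0.5 * d2)
    np.fill_diagonal(K, 0.0)
    return K / K.sum(axis=0, keepdims=True)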

As long as the dataset is not totally degenerate (e.g. every y_i occurring at least twice), LOO-CV can be used as a built-in automatic complexity control. However, under certain circumstances LOO-CV can severely undersmooth the manifold, particularly in the case of densely sampled noisy data. See Fig. 1 (first plot, K = 1) for a UKR curve fitted to a sample of a noisy spiral distribution as a result of minimizing the LOO-CV error (6).

2.1.4 Regularization by special loss functions

Recently, we showed in [9] how to regularize UKR manifolds by incorporating general loss functions instead of the squared Euclidean error in (3). In particular, the ε-insensitive loss is favorable if one has information about the level of noise present in the data.

3 UKR with Leave-K-Out Cross-Validation

Generally, leave-K-out cross-validation consists of forming several subsets from a dataset, each missing a different set of K patterns. These K patterns are used to validate a model that is trained with the corresponding subset. The resulting models are then combined (e.g. averaged) to create a model for the complete dataset. The special case K = 1 is identical to LOO-CV.

Since UKR comes with LOO-CV “for free”, it is interesting to investigate whether the concept is applicable for K > 1. To this end, we first have to specify how to form the subsets. With the aim to both maximize and equally distribute the effect of omitting K data vectors at a time on how UKR fits the manifold, we opt to reconstruct each data vector without itself and its K−1 nearest neighbors. Concerning this, please recall that the UKR function (2) computes a locally weighted average of the dataset. Therefore, normally, each data vector is mainly reconstructed from its neighbors. By omitting the immediate neighbors, we shift the weight to data vectors farther away, which forces the kernel centers x_i to huddle closer together and thus leads to a smoother regression function.


Please note that in contrast to standard LKO-CV, this procedure yields N different subsets of size N − K, each being responsible for the reconstruction of one data vector. A corresponding objective function, which automatically combines the subset models, can be stated as

    R_{lko}(X) = \frac{1}{N} \sum_i \|y_i - f_i(x_i; X)\|^2 = \frac{1}{N} \|Y - Y B_{lko}(X)\|_F^2    (8)

    f_i(x) = \sum_{m \notin N_i} y_m \frac{K(x - x_m)}{\sum_{j \notin N_i} K(x - x_j)} ,    (9)

where N_i denotes the index set of neighbors excluded for reconstructing y_i.

In principle, we may consider neighborhoods both in latent space and data space, since a good mapping will preserve the topology anyway. However, it is much simpler to regard only the original neighborhood relationships in data space, because these are fixed. The latent space neighborhoods may change with every training step, and thus have to be recomputed. Furthermore, convergence is not guaranteed anymore, because the latent variables X might jump between two “optimal” states belonging to different neighborhood structures.

As with LOO-CV, data space neighborhood LKO-CV can be implemented in UKR at nearly no additional cost. All one has to do is zero certain components of the matrix B_lko before normalizing its column sums to 1. In particular, set b_ij = 0 if i ∈ N_j, with fixed and precomputed index sets N_j.
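
A sketch of this step, reusing the conventions of the earlier fragments (illustrative only; the function names data_space_neighbors and ukr_basis_matrix_lko are ours, and the neighbor sets are determined once in data space, as described above):

import numpy as np

def data_space_neighbors(Y, K):
    # Index sets N_j: each data vector y_j together with its K-1 nearest
    # neighbors in data space (Y column-wise, shape (D, N)).
    d2 = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)
    return [np.argsort(d2[:, j])[:K] for j in range(Y.shape[1])]

def ukr_basis_matrix_lko(X, neighbors):
    # LKO-CV matrix B_lko(X): set b_ij = 0 for i in N_j, then normalize the
    # column sums to 1 (Eqs. (8)-(9)).
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    K = np.exp(-0.5 * d2)
    for j, Nj in enumerate(neighbors):
        K[Nj, j] = 0.0
    return K / K.sum(axis=0, keepdims=True)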

One might argue that the whole idea seems somewhat strange, especially if the UKR model is initialized by a spectral embedding method (e.g. LLE) which takes into account some K′ nearest neighbors for constructing the lower dimensional representation. Thus, in a way, UKR with LKO-CV works against its initialization method. On the other hand, this can be viewed as being complementary. Furthermore, our experiments not only show that the idea is sound, but even indicate that selecting K = K′ is not a bad choice at all.

3.1 How to get smooth borders

As we will show in the next section, LKO-CV does work well in the interior of a manifold, but not at its borders. This results naturally from the topology: at the borders of a 1D manifold (that is, at the ends of a curve), for example, all K neighbors lie in the same direction. Thus, the nearest data points taking part in reconstructing the end points are exceptionally far away. If, after training, the curve is sampled by evaluating the normal UKR function (2), the ends get very wiggly, especially for larger K.


To overcome this problem, we propose to employ an additional regularizer that smoothes at the borders without disturbing LKO-CV in regions that are already fine. For this purpose, penalizing the extension of latent space (e.g. by using a penalty term of the form S(X) = \|X\|_F^2) is a bad choice, since this would affect the manifold as a whole and not only the borders. The same argument applies to a penalty term of the form S(X) = -\sum_i \log p(x_i), which favors high densities and thus again smoothes the complete manifold. A possible choice, however, is to penalize the variance of the density in latent space. For this, we apply the following penalty term:

    S(X) = \frac{1}{N} \sum_i \left( p(x_i) - \bar{p}(X) \right)^2 ,  \qquad  \bar{p}(X) = \frac{1}{N} \sum_j p(x_j) .    (10)
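
A minimal sketch of this penalty under the same conventions as before (the kernel's normalization constant is omitted, which only rescales the pre-factor λ; the function name is ours):

import numpy as np

def density_variance_penalty(X):
    # Penalty S(X), Eq. (10): variance of the latent density estimates p(x_i)
    # over the latent points (X column-wise, Gaussian kernel with h = 1).
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    K = np.exp(-0.5 * d2)
    p = K.mean(axis=0)                  # p(x_i) = (1/N) sum_j K(x_i - x_j)
    return np.mean((p - p.mean()) ** 2)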

The UKR model is thus regularized by two factors: 1) the LKO parameter K determines the overall smoothness, and 2) the penalty term S(X), scaled by an appropriate pre-factor λ, ensures that the smoothness is evenly distributed.

Because these regularizers have more or less independent goals, one may hope that the results show considerable robustness towards the choice of λ. Indeed, for a UKR model of a noisy spiral (Fig. 2), there was no visual difference between results for λ = 0.001 and λ = 0.0001. Only a much smaller value (λ = 10^-6) led to wiggly ends again.
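
Putting the pieces together, the combined objective R_lko(X) + λ S(X) could be minimized as sketched below. This is only an illustration: it reuses the helper functions from the earlier sketches and substitutes a generic quasi-Newton optimizer with numerical gradients for the RPROP steps used in the experiments (Section 4).

import numpy as np
from scipy.optimize import minimize

def fit_ukr_lko(X0, Y, neighbors, lam=1e-4, maxiter=500):
    # Fine-tune the latent variables X by minimizing R_lko(X) + lambda * S(X).
    q, N = X0.shape
    def objective(x_flat):
        X = x_flat.reshape(q, N)
        B = ukr_basis_matrix_lko(X, neighbors)
        r_lko = np.sum((Y - Y @ B) ** 2) / N        # Eq. (8)
        return r_lko + lam * density_variance_penalty(X)
    res = minimize(objective, X0.ravel(), method='L-BFGS-B', options={'maxiter': maxiter})
    return res.x.reshape(q, N)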

4 Experiments

In all following experiments, we trained the UKR manifolds in a common way: for initialization, we calculated multiple LLE [6] solutions corresponding to different neighborhood sizes K′, which we compared with respect to their LKO-CV error (8) after a coarse optimization of their overall scale. While this procedure may seem rather computationally expensive, it greatly enhances the robustness, because LLE and other nonlinear spectral embedding methods can depend critically on the choice of K′. In our experiments, the best LLE neighborhood size K′ did not depend on which LKO neighborhood size K we used. Further fine-tuning was done by gradient-based minimization, applying 500 RPROP [10] steps. For simplicity, we used only the Gaussian kernel in latent space.
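
A hedged sketch of this initialization step is given below. It reuses the helpers from the earlier sketches and substitutes scikit-learn's LLE implementation and SciPy's bounded scalar minimization for the coarse scale search; these particular tools are assumptions, not the authors' setup, and the RPROP fine-tuning is not shown.

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.manifold import LocallyLinearEmbedding

def select_lle_initialization(Y, K, lle_sizes, q=1):
    # Compare LLE initializations (different K') by their LKO-CV error (8)
    # after a coarse optimization of the overall scale; return the best one.
    neighbors = data_space_neighbors(Y, K)
    best_err, best_X = np.inf, None
    for Kp in lle_sizes:
        X0 = LocallyLinearEmbedding(n_neighbors=Kp, n_components=q).fit_transform(Y.T).T
        def lko_error(s):
            B = ukr_basis_matrix_lko(s * X0, neighbors)
            return np.sum((Y - Y @ B) ** 2) / Y.shape[1]
        res = minimize_scalar(lko_error, bounds=(1e-3, 1e3), method='bounded')
        if res.fun < best_err:
            best_err, best_X = res.fun, res.x * X0
    return best_X, best_err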

4.1 Noisy Spiral

As a first example, we fitted a UKR model to a 2D “noisy spiral” toy dataset, which contains 300 samples with noise distributed uniformly in the interval [−0.1; 0.1]. We tried LLE neighborhood sizes K′ = 4, ..., 12, of which K′ = 7 led to the best initialization. Fig. 1 shows the results for different values of the LKO-CV parameter K as indicated in the plots. Note how the manifold gets smoother for larger K, without suffering from too much bias towards the inner of the spiral. A bit problematic are the manifold's ends, which get quite wiggly for larger K. Note that K = K′ = 7 yields a satisfactory level of smoothness.

Figure 1: UKR model of a noisy spiral using LKO-CV, with K ∈ {1, 2, 3, 4, 7, 10, 13, 16, 20, 24} as indicated in the panels. The data points are depicted as grey dots; the black curve shows the manifold which results from sampling f(x; X).

To show the effect of the density variance penalty term (10), we repeated the experiment adding the penalty with pre-factors λ = 10^-3, 10^-4 and 10^-6. Fig. 2 shows the results for λ = 10^-4, which are visually identical to those for λ = 10^-3. However, a pre-factor of only 10^-6 turned out to be too small, resulting in wiggly-ended curves similar to those in Fig. 1.

Figure 2: UKR model of a noisy spiral using both LKO-CV and the density variance penalty term (10) scaled by a pre-factor λ = 10^-4, with K ∈ {1, 2, 3, 4, 7, 10, 13, 16, 20, 24} as indicated in the panels.

Some insight into the influence of the density variance penalty is provided by Fig. 3: most of the latent variables stay in the same region, but the outliers (the little bumps to the far left and right) are drawn towards the center, compacting the occupied latent space. Figure 4 shows a magnified comparison of the UKR models (K = 24) with and without the penalty term. In addition to the original data points and the resulting curve, it also depicts the data as it is reconstructed during training, that is, using the LKO function (9). Note that these LKO reconstructions show a strong bias towards the inner of the spiral, which is not present in the final mapping (2) based on the complete data set.


Figure 3: Comparison of latent densities for UKR models of a noisy spiral using a) only LKO-CV (K = 24, depicted in black) and b) LKO-CV together with the density variance penalty (K = 24, λ = 10^-4, depicted in gray). The curves result from sampling p(x); the dots indicate the latent variable positions.


Figure 4: Comparison of UKR models of a noisy spiral. Left: pure LKO-CV (K = 24). Right: with additional density variance penalty (λ = 10^-4). The dots depict the observed data points, the black curve depicts the manifold, and the gray pluses depict the LKO reconstructions (9).

4.2 Noisy Swiss Roll

As a second experiment, we fitted a UKR model to a noisy “Swiss Roll” dataset. We first computed LLE solutions with K′ = 3, ..., 18, of which K′ = 7 was selected as the best initialization for all UKR models. Figure 5 shows the dataset as reconstructed with LOO-CV (K = 1) and LKO-CV (K = 7). Instead of comparing the results for multiple K's visually again, we projected the reconstructed datasets onto the underlying data model (i.e. the smooth continuous “Swiss Roll”). Figure 6 shows the resulting mean distance as a function of K. The minimum is at K = 9, with our proposed automatic choice K = K′ = 7 being nearly as good.


Figure 5: UKR reconstruction of a “Swiss Roll”. Left: K = 1 (LOO-CV). Right: K = 7. The black dots depict the UKR reconstructions, whereas the gray dots depict their projection (along the black lines) onto the underlying smooth data model. Note the much smaller projection error (distance to the “true” manifold) in the right plot.


Figure 6: Mean projection error, i.e. mean distance between LKO-CV-UKR reconstructions and their projections onto the underlying smooth data manifold, as a function of K. The corresponding projection error of the observed (noisy) data points is 0.498. Please note that the y-axis does not start at 0.

4.3 USPS Digits

To show that LKO-CV-UKR also works with higher dimensional data, our last experiment deals with the USPS handwritten digits. In particular, we work with the subset corresponding to the digit “2”, which contains 731 data vectors in 256 dimensions (16×16 pixel gray-scale images). As with the “Swiss Roll”, we compared the results of LOO-CV and LKO-CV with K = K′ = 12, that is, we chose the LKO parameter to be identical to the automatically selected LLE neighborhood size. Both models use the density variance penalty with a pre-factor λ = 0.01 (larger than before, because the data's variance is larger, too). Figure 7 visualizes the resulting manifolds (we chose a 2D embedding) by sampling f(x; X) in latent space and depicting the function value as the corresponding image. Note the smaller extension in latent space and the blurrier images of the model belonging to K = 12 (right plot).


Figure 7: UKR model of the USPS digit “2”, shown by evaluating f(x; X) on a 20×20 grid enclosing the latent variables. Grid positions of low density p(x) are left blank. Left: K = 1 (LOO-CV). Right: K = 12.

5 Conclusion

In this work, we described how leave-K-out cross-validation (LKO-CV) can be employed in the manifold learning method UKR, generalizing the already present LOO-CV regularization. We demonstrated our approach on both synthetic and real data. When used with pre-calculated data space neighborhoods, LKO-CV involves nearly no additional computational cost, but can yield favorable results. This was revealed especially in the noisy “Swiss Roll” experiment, where LKO-CV significantly reduced the projection error, i.e. the mean distance between the reconstructed (de-noised) dataset and the “true” underlying manifold.

While we gave no final answer to the question of how to choose the new regularization parameter K, our experiments indicate that simply setting K = K′ (the neighborhood size of the best LLE solution, which UKR can automatically detect) yields satisfactory results. In addition, we showed how a complementary regularizer, which is based on penalizing a high variance of the latent density, can further enhance the UKR models trained with LKO-CV. By promoting an even distribution of smoothness, this regularizer diminishes the problem of rather wiggly manifold borders, which otherwise may result from a pure LKO-CV regularization. When used as a penalty term, the complementary regularizer is quite robust towards the choice of an appropriate pre-factor.

Further work may address other possibilities to deal with the border problem, e.g. a smart local adaptation of the neighborhood parameter K. We also successfully experimented with leave-R-out CV, a scheme in which not a fixed number of neighbors is left out, but all neighbors within a sphere of fixed size. Finally, it will be interesting to see how UKR with LKO-CV performs in real applications.

References

[1] Meinicke, P., Klanke, S., Memisevic, R., Ritter, H.: Principal surfaces from Unsupervised Kernel Regression. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(9) (2005) 1379–1391

[2] Nadaraya, E.A.: On estimating regression. Theory of Probability and Its Applications 10 (1964) 186–190

[3] Watson, G.S.: Smooth regression analysis. Sankhya Series A 26 (1964) 359–372

[4] Bishop, C.M., Svensen, M., Williams, C.K.I.: GTM: The Generative Topographic Mapping. Neural Computation 10(1) (1998) 215–234

[5] Smola, A.J., Williamson, R.C., Mika, S., Schölkopf, B.: Regularized Principal Manifolds. Lecture Notes in Computer Science 1572 (1999) 214–229

[6] Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by Locally Linear Embedding. Science 290 (2000) 2323–2326

[7] Belkin, M., Niyogi, P.: Laplacian Eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6) (2003) 1373–1396

[8] Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000) 2319–2323


[9] Klanke, S., Ritter, H.: Variants of Unsupervised Kernel Regression: General loss functions. In: Proc. European Symposium on Artificial Neural Networks (2006) to appear

[10] Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In: Proc. of the IEEE Intl. Conf. on Neural Networks (1993) 586–591
