
Dimension Reduction for Regression with

Reproducing Kernel Hilbert Spaces

Kenji Fukumizu∗

[email protected]
Institute of Statistical Mathematics

Francis R. Bach
[email protected]

University of California, Berkeley

Michael I. Jordan
[email protected]

University of California, Berkeley

February 26, 2003

Draft. Not for circulation.

Abstract

We propose a novel method of dimension reduction for regression using reproducing kernel Hilbert spaces. For a regression problem, where the statistical dependence of a variable Y on explanatory variables X is analyzed, the purpose of dimension reduction is to find the effective subspace of X, which retains all the statistical information on Y. Using a formulation of conditional independence in terms of covariance operators on reproducing kernel Hilbert spaces, we theoretically derive the kernel generalized variance as a contrast function for estimating the effective subspace. Unlike many conventional methods of dimension reduction for regression, the proposed method requires neither assumptions on the distribution of X and Y, nor a parametric model of the regressor. The effectiveness of the method is verified by comparative experiments with conventional methods.

Keywords: regression, dimension reduction, variable selection, feature selection, kernel, conditional independence.

∗This work was done while the author was visiting the University of California, Berkeley.


1 Introduction

One of the most important statistical methods is regression, which tries to determine the statistical dependence of a random variable Y on explanatory variables X. It includes many problems in statistics and machine learning, such as pattern classification, function estimation, and exploratory data analysis.

Dimension reduction of the explanatory variables is often employed in conjunction with regression to achieve a compact representation of the relation between X and Y. In data analysis such as multiple regression, variable selection and dimension reduction are popular tools for finding a subset or linear combinations of variables that explain Y most effectively. In addition, dimension reduction is a basic technique for treating very high-dimensional data, which raise problems in computational cost and estimation accuracy. As the need to analyze high-dimensional data such as images, text, and DNA microarrays grows, dimension reduction and variable selection for high-dimensional data have attracted much attention (Kambhatla and Leen 1997; Deerwester et al. 1990; Golub et al. 1999).

This paper proposes Kernel Dimension reduction for Regression (KDR), a novel semiparametric method of dimension reduction for regression using reproducing kernel Hilbert spaces (Aronszajn 1950). Since the success of support vector machines (Boser, Guyon, and Vapnik 1992), the methodology of reproducing kernels has been developed in many fields of data analysis (Vapnik et al. 1997; Scholkopf et al. 1998). Bach and Jordan (2002a) proposed kernel ICA, a kernel method for independent component analysis (ICA). Unlike many other kernel methods, kernel ICA uses reproducing kernels only to define a contrast function for independence in a semiparametric framework, in which the unestimated functions are taken from reproducing kernel Hilbert spaces. The KDR method extends this idea to conditional independence by using covariance operators on reproducing kernel Hilbert spaces.

For a regression problem with a random variable Y and explanatory variables X, our problem is to find a projection Π_S of X onto a subspace S such that the conditional probability of Y given X is completely captured by the conditional probability of Y given Π_S X. This is equivalent to finding a projection Π_S which makes Y and (I − Π_S)X conditionally independent given Π_S X. We show mathematically that the kernel generalized variance (Bach and Jordan 2002a) gives an objective function that measures the extent of this conditional independence. The kernel generalized variance was originally proposed for kernel ICA, and has since been applied to


tree-dependent component analysis (Bach and Jordan 2002b) and to learning in general graphical models (Bach and Jordan 2003). In the latter two cases, it is used as a surrogate for the mutual information to measure the conditional independence described by a graph, and it shows good performance in practice. In dimension reduction for regression, the kernel generalized variance is proved to be a rigorous criterion for conditional independence.

The KDR method does not require the strong assumptions on the data structure that are often necessary for conventional dimension reduction methods. Apart from regression, the most famous method of dimension reduction is principal component analysis (PCA). Although PCA is sometimes used as preprocessing for regression, the subspace obtained without using Y does not necessarily give the effective subspace for regression. Among methods incorporating both X and Y, canonical correlation analysis (CCA) and partial least squares (PLS, Hoskuldsson 1988; Helland 1988) have been used for dimension reduction in regression (Fung et al. 2002; Nguyen and Rocke 2002). Since they assume a linear structure in the model, they do not work for data with strong, unknown nonlinearity in the regressor. Sliced inverse regression (SIR, Li 1991), principal Hessian directions (pHd, Li 1992), and sliced average variance estimation (SAVE, Cook and Weisberg 1991; Cook and Yin 2001) are semiparametric methods without assumptions on the regressor (see also Cook 1998). However, they place strong restrictions on the probability distribution of the explanatory variables. If these assumptions do not hold, there is no guarantee of finding the subspace. Projection pursuit regression (Friedman and Stuetzle 1981), ACE, and additive models (Breiman and Friedman 1985; Hastie and Tibshirani 1986) also provide a methodology for dimension reduction and variable selection, in which the additive model E[Y|X] = g_1(β_1^T X) + · · · + g_K(β_K^T X) is assumed for the regressor. There are also nonparametric approaches (Samarov 1993; Hristache et al. 2001), which estimate the derivative of the regressor to achieve dimension reduction, based on the fact that the derivative of the conditional expectation E[Y|B^T x] with respect to x belongs to the subspace spanned by the columns of B. However, nonparametric estimation of derivatives suffers severely from the curse of dimensionality for high-dimensional X. In contrast to these conventional methods, the KDR method needs no assumptions on the regressor or the distribution of the explanatory variables. We compare the performance of some of these methods experimentally in Section 4.

This paper is organized as follows. In Section 2, we specify the problem of dimension reduction for regression and describe its relation to conditional independence and mutual information. Section 3 theoretically


derives the contrast function used to estimate the effective subspace for regression and describes the KDR method. All the mathematical details used in Section 3 are given in the Appendix. In Section 4, we experimentally verify the effectiveness of our method on various data sets and compare it with conventional methods. Section 5 concludes the paper and discusses future work.

2 Dimension reduction for regression

We consider a regression problem in which Y is an ℓ-dimensional random vector and X is an m-dimensional explanatory variable. The variable Y may be either continuous or discrete. The probability density function of Y given X is denoted by p_{Y|X}(y|x). Assume that there is an r-dimensional subspace S ⊂ R^m such that

p_{Y|X}(y|x) = p_{Y|Π_S X}(y|Π_S x),   (1)

for all x and y, where Π_S is the orthogonal projection of R^m onto S. The subspace S is called the effective subspace for regression.

The problem discussed in this paper is to find the effective subspace S given an i.i.d. sample (X_1, Y_1), . . . , (X_n, Y_n), which follows the conditional probability in eq.(1) and a fixed marginal probability P_X of X. The crux of the problem is that we have no a priori knowledge of the regressor and place no assumptions or models on the conditional probability p_{Y|X} in finding the subspace S.

The effective subspace can be characterized in terms of conditional independence. Let (B, C) be an m × m orthogonal matrix such that the column vectors of B span the subspace S, and define U = B^T X and V = C^T X. Because (B, C) is an orthogonal matrix, we have

p_X(x) = p_{U,V}(u, v),     p_{X,Y}(x, y) = p_{U,V,Y}(u, v, y),   (2)

for the probability density functions. From eq.(2), eq.(1) is equivalent to

p_{Y|U,V}(y|u, v) = p_{Y|U}(y|u).   (3)

This shows that the effective subspace S is the one which makes Y and V conditionally independent given U (see Figure 1).

Mutual information gives another view of the equivalence between conditional independence and the existence of the effective subspace. From eq.(2), it is straightforward to see that

I(Y, X) = I(Y, U) + E_U[ I(Y|U, V|U) ],   (4)


[Figure 1: (a) graphical model relating X and Y; (b) the tree Y − U − V]

Figure 1: Graphical representation of dimension reduction for regression. The variables Y and V are conditionally independent given U, where X = (U, V).

where I(Z, W) denotes the mutual information defined by

I(Z, W) := ∫∫ p_{Z,W}(z, w) log [ p_{Z,W}(z, w) / ( p_Z(z) p_W(w) ) ] dz dw.   (5)

Because eq.(1) means I(Y, X) = I(Y, U), the effective subspace S is characterized as the one which preserves the mutual information with Y under the projection onto that space, or equivalently, which gives I(Y|U, V|U) = 0. This is again the conditional independence of Y and V given U.

The expression in eq.(4) can be understood using the mutual information of a tree, called the T-mutual information in Bach and Jordan (2002b). The T-mutual information I^T for the tree Y − U − V (Figure 1 (b)) is given by

I^T = I(Y, U, V) − I(Y, U) − I(U, V).   (6)

This is equal to the KL divergence between the probability of (Y, U, V) and its projection onto the family of probabilities defined on the tree, that is, probabilities that satisfy Y ⊥⊥ V | U. Using eq.(2), we can easily see that I(Y, U, V) = I(Y, X) + I(U, V), and obtain

I^T = I(Y, X) − I(Y, U) = E_U[ I(Y|U, V|U) ].   (7)

Thus, dimension reduction for regression is a special case of minimizing the T-mutual information for a fixed tree structure.
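For completeness, the identity I(Y, U, V) = I(Y, X) + I(U, V) used above follows directly from eq.(2), since x = (u, v):

I(Y, U, V) = ∫∫ p(y, x) log [ p(y, x) / ( p(y) p_U(u) p_V(v) ) ] dy dx
           = ∫∫ p(y, x) log [ p(y, x) / ( p(y) p_X(x) ) ] dy dx + ∫ p_X(x) log [ p_X(x) / ( p_U(u) p_V(v) ) ] dx
           = I(Y, X) + I(U, V).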


3 Kernel method of dimension reduction for regression

3.1 Covariance operators on reproducing kernel Hilbert spaces

We use covariance operators on reproducing kernel Hilbert spaces to derive an objective function for dimension reduction. While covariance operators are generally defined for random variables in Banach spaces (Vakhania et al. 1987; Baker 1973), the theory is much simpler for reproducing kernel Hilbert spaces. We summarize only the basic mathematical facts in this subsection and defer the details to the Appendix. Let (H, k) be a reproducing kernel Hilbert space of functions on a set Ω with a positive definite kernel k : Ω × Ω → R. The inner product of H is denoted by ⟨·, ·⟩_H. We consider only real Hilbert spaces for simplicity. The most important property of reproducing kernel Hilbert spaces is the reproducing property:

⟨f, k(·, x)⟩_H = f(x)   for all x ∈ Ω and f ∈ H.   (8)

We use the Gaussian kernel

k(x_1, x_2) = exp( −‖x_1 − x_2‖² / σ² ),   (9)

which corresponds to a Hilbert space of smooth functions.

Let (H_1, k_1) and (H_2, k_2) be reproducing kernel Hilbert spaces over measurable spaces (Ω_1, B_1) and (Ω_2, B_2), respectively, with k_1 and k_2 measurable. For a random vector (X, Y) on Ω_1 × Ω_2, the covariance operator from H_1 to H_2 is defined by the relation

⟨g, Σ_{YX} f⟩_{H_2} = E_{XY}[f(X)g(Y)] − E_X[f(X)] E_Y[g(Y)]   (10)

for all f ∈ H_1 and g ∈ H_2. Eq.(10) implies that the covariance of f(X) and g(Y) is obtained by applying the linear operator Σ_{YX} and taking the inner product. See the Appendix for the basic properties of covariance operators.
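As a concrete illustration (our own sketch, not part of the original text), the following code estimates the right-hand side of eq.(10) empirically for functions f and g lying in the span of the kernel sections at the sample points, using the Gaussian kernel of eq.(9); the toy data and all names are our own choices.

    import numpy as np

    def gauss_gram(Z, sigma):
        # Gram matrix G_ij = exp(-||z_i - z_j||^2 / sigma^2), the Gaussian kernel of eq.(9)
        sq = np.sum(Z**2, axis=1)
        return np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / sigma**2)

    rng = np.random.default_rng(0)
    n = 500
    X = rng.normal(size=(n, 2))
    Y = np.tanh(X[:, :1]) + 0.1 * rng.normal(size=(n, 1))

    GX, GY = gauss_gram(X, 1.0), gauss_gram(Y, 1.0)

    # f = sum_j alpha_j k1(., X_j),  g = sum_j beta_j k2(., Y_j)
    alpha = rng.normal(size=n) / n
    beta = rng.normal(size=n) / n
    f_vals, g_vals = GX @ alpha, GY @ beta     # f(X_i), g(Y_i) via the reproducing property

    # empirical version of <g, Sigma_YX f> = E[f(X)g(Y)] - E[f(X)] E[g(Y)]
    cov_fg = np.mean(f_vals * g_vals) - np.mean(f_vals) * np.mean(g_vals)
    print(cov_fg)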

Covariance operators provide a useful tool for discussing conditional probabilities and, in turn, conditional independence. As we show in Corollary 3 of the Appendix, we obtain the following relation between the conditional expectation and the covariance operators, given that Σ_{XX} is invertible¹:

E_{Y|X}[g(Y) | X] = Σ_{XX}^{-1} Σ_{XY} g   for all g ∈ H_2,   (11)

¹Even if Σ_{XX} is not invertible, a similar fact holds. See Corollary 3.


In eq.(16), the inequality should be understood as the partial order of self-adjoint operators. From these relations, the effective subspace S gives the solution of the following minimization problem:

min_S  Σ_{YY|U},   subject to U = Π_S X.   (18)
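Here Σ_{YY|U} denotes the conditional covariance operator of Y given U; as in eq.(20) below and eq.(33) of the Appendix, it is defined by

Σ_{YY|U} := Σ_{YY} − Σ_{YU} Σ_{UU}^{-1} Σ_{UY}.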

To derive an objective function from eq.(18), we have to estimate the conditional covariance operator from the given data, and choose a specific way to evaluate closeness to zero in the order of self-adjoint operators. For the estimation of the operator, we follow exactly the same approach as the derivation of kernel ICA (Bach and Jordan 2002a). Let K_Y be the centralized Gram matrix (Bach and Jordan 2002a; Scholkopf et al. 1998) defined by

K_Y = ( I_n − (1/n) 1_n 1_n^T ) G_Y ( I_n − (1/n) 1_n 1_n^T ),   (19)

where (G_Y)_{ij} = k_1(Y_i, Y_j) is the Gram matrix and 1_n = (1, . . . , 1)^T is the vector with all elements equal to 1. The matrices K_U and K_V are defined similarly using {U_i}_{i=1}^n and {V_i}_{i=1}^n, respectively. The empirical conditional covariance matrix Σ̂_{YY|U} is defined by

Σ̂_{YY|U} := Σ̂_{YY} − Σ̂_{YU} Σ̂_{UU}^{-1} Σ̂_{UY} = (K_Y + εI_n)² − K_Y K_U (K_U + εI_n)^{-2} K_U K_Y,   (20)

where ε > 0 is a regularization constant, used as in Bach and Jordan (2002a) to keep the matrix nondegenerate.

The size of Σ̂_{YY|U} in the order of symmetric matrices can be evaluated by its determinant. Although there are other possible choices, such as the trace or the largest eigenvalue, we use only the determinant in this paper. Using the Schur decomposition det(A − B C^{-1} B^T) = det[ A  B ; B^T  C ] / det C, the determinant of Σ̂_{YY|U} can be written as

det Σ̂_{YY|U} = det Σ̂_{[YU][YU]} / det Σ̂_{UU},   (21)

where Σ̂_{[YU][YU]} is defined by

Σ̂_{[YU][YU]} = [ Σ̂_{YY}   Σ̂_{YU}
                  Σ̂_{UY}   Σ̂_{UU} ]
              = [ (K_Y + εI_n)²   K_Y K_U
                  K_U K_Y         (K_U + εI_n)² ].   (22)

Then, the contrast function of dimension reduction for regression is given by

min_S  det Σ̂_{[YU][YU]} / ( det Σ̂_{YY} det Σ̂_{UU} ).   (23)


The constant det Σ̂_{YY} is included in the denominator just for normalization.

Eq.(23) is exactly the same as the kernel generalized variance of Y and U (Bach and Jordan 2002a), whose negative logarithm has been used as an approximation of the mutual information through an analogy with Gaussian variables. Since the objective of our problem is to maximize the mutual information I(Y, U), the above derivation theoretically justifies minimizing the kernel generalized variance as a method of maximizing the mutual information of Y and U in this specific case.
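As an illustration of eqs.(19)–(23) (our own sketch; the function and variable names are not from the paper), the following code evaluates the contrast function from a sample, using log-determinants for numerical stability. Minimizing kdr_contrast over m × r matrices B with orthonormal columns corresponds to eq.(23).

    import numpy as np

    def gauss_gram(Z, sigma):
        # Gaussian Gram matrix, eq.(9)
        sq = np.sum(Z**2, axis=1)
        return np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / sigma**2)

    def centered_gram(Z, sigma):
        # centered Gram matrix K = H G H with H = I - (1/n) 1 1^T, eq.(19)
        n = Z.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ gauss_gram(Z, sigma) @ H

    def kdr_contrast(B, X, Y, sigma=1.0, eps=0.1):
        # log of eq.(23): log det Sigma_[YU][YU] - log det Sigma_YY - log det Sigma_UU
        n = X.shape[0]
        U = X @ B                                  # rows are U_i = B^T X_i
        KY = centered_gram(Y, sigma)
        KU = centered_gram(U, sigma)
        I = np.eye(n)
        Syy = (KY + eps * I) @ (KY + eps * I)      # diagonal blocks of eq.(22)
        Suu = (KU + eps * I) @ (KU + eps * I)
        Syu = KY @ KU
        top = np.block([[Syy, Syu], [Syu.T, Suu]])
        return (np.linalg.slogdet(top)[1]
                - np.linalg.slogdet(Syy)[1]
                - np.linalg.slogdet(Suu)[1])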

The KDR method can be viewed as a semiparametric method, because we assume nothing about the conditional probability p_{Y|X}. The problem has infinitely many degrees of freedom in principle, which are handled by the infinite-dimensional reproducing kernel Hilbert spaces. The significance of the kernel approach is, as eq.(14) indicates, that the functional part can ultimately be separated out by the conditional covariance operator, which contains all the information on the regression dependence of the random variables.

For the optimization of the matrix B specifying the subspace S, any nonlinear optimization technique can be used. We use a line search method with gradients to minimize eq.(23). One problem with our method is the computational cost of handling matrices of size n × n. We can use the computational techniques developed for kernel ICA, since the objective function is exactly the same. In particular, when the number of samples is large, the incomplete Cholesky decomposition (Bach and Jordan 2002a) works very efficiently to avoid multiplications of n × n matrices. Another problem is the existence of local optima, as in many nonlinear optimization problems. To alleviate this problem, we use an annealing technique, in which the scale parameter σ of the Gaussian kernel is decreased gradually during the iterations of the optimization. For larger σ, the graph of the contrast function has fewer local optima, which makes optimization easier. The search becomes more accurate as σ is decreased.
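A minimal sketch of this optimization (our own illustration, assuming the kdr_contrast function from the previous listing): the gradient is approximated by finite differences, the columns of B are re-orthonormalized by a QR step after each update, and σ is annealed from large to small. The step sizes and schedule below are illustrative, not the authors' settings.

    import numpy as np

    def numerical_grad(f, B, h=1e-4):
        # finite-difference gradient of the scalar function f at the matrix B
        G = np.zeros_like(B)
        for idx in np.ndindex(B.shape):
            E = np.zeros_like(B)
            E[idx] = h
            G[idx] = (f(B + E) - f(B - E)) / (2 * h)
        return G

    def kdr_fit(X, Y, r, sigmas=(4.0, 2.0, 1.0), n_steps=50, lr=0.1, eps=0.1, seed=0):
        # gradient descent on eq.(23) with annealing of the Gaussian kernel width sigma
        rng = np.random.default_rng(seed)
        B, _ = np.linalg.qr(rng.normal(size=(X.shape[1], r)))   # random orthonormal start
        for sigma in sigmas:                                    # large sigma first (annealing)
            f = lambda M: kdr_contrast(M, X, Y, sigma=sigma, eps=eps)  # from the previous listing
            for _ in range(n_steps):
                B = B - lr * numerical_grad(f, B)
                B, _ = np.linalg.qr(B)                          # keep the columns orthonormal
        return B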

4 Experimental results

We verify the effectiveness of the proposed method through experiments and compare it with conventional methods: SIR, pHd, CCA, and PLS. For the experiments with SIR and pHd, the implementation for R (Weisberg 2002) is used.²

²http://www.jstatsoft.org


[Figure 2: scatter plots A: (X1, Y), A: (X1, X2) (top row) and B: (X1, Y), B: (X1, X2) (bottom row)]

Figure 2: Data A and B. One-dimensional Y depends only on X1 in X = (X1, X2).

4.1 Synthesized data

The first data sets, A and B, comprise one-dimensional Y and two-dimensional X = (X_1, X_2). One hundred i.i.d. samples are generated by

A:  Y ∼ 1 / (1 + exp(−X_1)) + Z,
B:  Y ∼ 2 exp(−X_1²) + Z,

where Z ∼ N(0, 0.1²), and X = (X_1, X_2) follows a normal distribution for A and a normal mixture with two components for B. The effective subspace is spanned by B_0 = (1, 0)^T in both cases. The data sets are depicted in Figure 2.
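A sketch of how such data can be generated (our own code; the mixture parameters for the distribution of X in Data B are not specified in the text and are chosen here arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100

    # Data A: X ~ 2-D normal, Y = 1/(1 + exp(-X1)) + noise
    XA = rng.normal(size=(n, 2))
    YA = 1.0 / (1.0 + np.exp(-XA[:, 0])) + 0.1 * rng.normal(size=n)

    # Data B: X ~ two-component normal mixture (illustrative parameters), Y = 2 exp(-X1^2) + noise
    means = np.array([[-2.0, -2.0], [2.0, 2.0]])
    comp = rng.integers(0, 2, size=n)
    XB = means[comp] + rng.normal(size=(n, 2))
    YB = 2.0 * np.exp(-XB[:, 0] ** 2) + 0.1 * rng.normal(size=n)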

Table 1 shows the angles between B_0 and the estimated direction. For Data A, all the methods except PLS give a good estimate of B_0. Data B is surprisingly difficult for the conventional methods, because the distribution of X is not spherical and the regressor has strong nonlinearity. The KDR method succeeds in finding the correct direction accurately for both data sets.


                    SIR       pHd       CCA      PLS      Kernel
A: angle (rad.)    0.0087   -0.1971    0.0099   0.2736   -0.0014
B: angle (rad.)   -1.5101   -0.9951   -0.1818   0.4554    0.0052

Table 1: Angles between the true and the estimated spaces for Data A and B.

         SIR(10)   SIR(15)   SIR(20)   SIR(25)    pHd     Kernel
R(b1)     0.987     0.993     0.988     0.990     0.110    0.999
R(b2)     0.421     0.705     0.480     0.526     0.859    0.984

Table 2: Correlation coefficients for Data C. SIR(m) indicates SIR with m slices.

Data C has 300 samples of 17-dimensional X and one-dimensional Y, generated by

C:  Y ∼ 0.9 X_1 + 0.2 / (1 + X_17) + Z,   (24)

where Z ∼ N(0, 0.01²) and X follows the uniform distribution on [0, 1]^17. The effective subspace is given by b_1 = (1, 0, . . . , 0) and b_2 = (0, . . . , 0, 1). We compare the KDR method with SIR and pHd only; CCA and PLS cannot find a 2-dimensional subspace, because Y is one-dimensional. To evaluate the accuracy of the results, we use the multiple correlation coefficient

R(b) = max_{β∈B}  β^T Σ_{XX} b / √( β^T Σ_{XX} β · b^T Σ_{XX} b ),   (b ∈ B_0),   (25)

which is used in Li (1991). Table 2 shows the results, in which the KDR method outperforms the others in finding the weak contribution of the second direction.
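A sketch of Data C and of the multiple correlation coefficient in eq.(25), computed for an estimated basis B against a true direction b (our own code; names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 300, 17
    X = rng.uniform(0.0, 1.0, size=(n, m))
    Y = 0.9 * X[:, 0] + 0.2 / (1.0 + X[:, 16]) + 0.01 * rng.normal(size=n)   # eq.(24)

    def multiple_correlation(B, b, X):
        # R(b) of eq.(25): maximize beta^T S b / sqrt(beta^T S beta) over beta in span(B)
        S = np.cov(X, rowvar=False)              # empirical Sigma_XX
        A = B.T @ S @ B
        v = B.T @ S @ b
        num2 = v @ np.linalg.solve(A, v)         # closed form for the squared maximum
        return np.sqrt(num2 / (b @ S @ b))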

4.2 Real data: Variable selection

The KDR method can also be applied to variable selection, although it does not select variables directly but rather provides a linear subspace. For variable selection, we compare the values of the kernel generalized variance for all subspaces spanned by combinations of a fixed number of selected variables. This gives a reasonable way to select variables, because for a subset


W = {X_{j1}, . . . , X_{jK}} ⊂ {X_1, . . . , X_m}, the variables Y and W^C are conditionally independent given W if and only if Y and Π_{W^C} X are conditionally independent given Π_W X, where Π_W and Π_{W^C} are the orthogonal projections onto the subspaces spanned by W and W^C, respectively. If we try to select K variables among M explanatory variables, the total number of evaluations is the binomial coefficient (M choose K), which is tractable for a moderate number of variables and data size. A sketch of this exhaustive search is given at the end of this subsection.

We apply this method of variable selection to the Boston Housing data (Harrison and Rubinfeld 1978) and the Ozone data (Breiman and Friedman 1985), which have often been used as typical examples for variable selection. Tables 3 and 4 give detailed descriptions of the data sets. There are 506 samples in the Boston Housing data, for which the variable MV, the median value of house prices in a tract, is estimated using the other 13 variables. We use the corrected version of the data set given by Gilley and Pace (1996). In the Ozone data of 330 samples, the variable UPO3, the ozone concentration, is to be predicted from the other 9 variables.

Table 5 shows the best three sets of four variables, that is, those attaining the smallest values of the kernel generalized variance. For the Boston data, RM and LSTAT are included in all three sets in Table 5, and PTRATIO and TAX are included in two of them. This observation agrees well with the analysis using alternating conditional expectation (ACE) by Breiman and Friedman (1985), which gives RM, LSTAT, PTRATIO, and TAX as the four major contributors. The original motivation for the Boston data set was to investigate the influence of the nitrogen oxide concentration (NOX) on house prices (Harrison and Rubinfeld 1978). In accordance with the previous studies, our analysis also shows the relatively small contribution of NOX. For the Ozone data, all three sets include HMDT, SBTP, and IBHT. The variables IBTP, DGPG, and VDHT are each chosen in one of the sets. This is in fair accordance with the previous results of Breiman and Friedman (1985) and Li et al. (2000); the former concludes by ACE that SBTP, IBHT, DGPG, and VSTY are the most influential, and the latter selects HMDT, IBHT, and DGPG using a pHd-based method.
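A sketch of the exhaustive subset search described above, assuming the kdr_contrast function from Section 3 (our own illustration):

    import numpy as np
    from itertools import combinations

    def select_variables(X, Y, K, sigma=1.0, eps=0.1):
        # evaluate the kernel generalized variance, eq.(23), for every K-subset of variables
        # and return the subsets sorted by their contrast values (smaller is better)
        m = X.shape[1]
        results = []
        for subset in combinations(range(m), K):
            B = np.zeros((m, K))
            B[list(subset), list(range(K))] = 1.0            # coordinate projection onto the subset
            value = kdr_contrast(B, X, Y, sigma=sigma, eps=eps)   # defined in Section 3
            results.append((value, subset))
        return sorted(results)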

4.3 Real data: classification

We apply the kernel method of dimension reduction to classification problems. Many conventional methods of dimension reduction for regression are not suitable for classification. In SIR, the dimensionality of the effective subspace must be less than the number of classes, because SIR uses the average of X within slices of the variable Y. In binary classification, for example, only


Variable    Description
MV          median value of owner-occupied home
CRIM        crime rate by town
ZN          proportion of town's residential land zoned for lots greater than 25,000 square feet
INDUS       proportion of nonretail business acres per town
CHAS        Charles River dummy (= 1 if tract bounds the Charles River, 0 otherwise)
NOX         nitrogen oxide concentration in pphm
RM          average number of rooms in owner units
AGE         proportion of owner units built prior to 1940
DIS         weighted distances to five employment centers in the Boston region
RAD         index of accessibility to radial highways
TAX         full property tax rate ($/$10,000)
PTRATIO     pupil-teacher ratio by town school district
B           black proportion of population
LSTAT       proportion of population that is lower status

Table 3: Boston Housing data.

Variable    Description
UPO3        upland ozone concentration (ppm)
VDHT        Vandenburg 500 millibar height (m)
HMDT        humidity (percent)
IBHT        inversion base height (ft.)
DGPG        Daggett pressure gradient (mmhg)
IBTP        inversion base temperature (F)
SBTP        Sandburg Air Force Base temperature (C)
VSTY        visibility (miles)
WDSP        wind speed (mph)
DAY         day of the year

Table 4: Ozone data.


Boston      1st  2nd  3rd
CRIM        X
ZN
INDUS
CHAS
NOX
RM          X    X    X
AGE
DIS         X
RAD
TAX         X    X
PTRATIO     X    X
B
LSTAT       X    X    X
KGV         .1768  .1770  .1815

Ozone       1st  2nd  3rd
VDHT        X
HMDT        X    X    X
IBHT        X    X    X
DGPG        X
IBTP        X
SBTP        X    X    X
VSTY
WDSP
DAY
KGV         .2727  .2736  .2758

Table 5: Variable selection using the proposed kernel method.

a one-dimensional subspace can be found, because at most two slices are available. The methods CCA and PLS have a similar limitation on the dimensionality of the effective subspace: they cannot find a subspace of dimensionality larger than that of Y. We therefore compare the results of KDR only with pHd, which is applicable to binary classification problems. Cook and Lee (1999) discuss dimension reduction methods for binary classification and propose difference of covariances (DOC). They theoretically compare pHd and DOC, and show that these methods are the same in binary classification if the population proportion of one class is 1/2, which is almost the case in our experiments.

In the first experiment, we show the visualization capability of the dimension reduction methods. We use the Wine data set from the UCI machine learning repository (Murphy and Aha 1994) to see how the projection onto a lower-dimensional space gives an effective description of the data. The Wine data consist of 178 samples with 13 variables and a label with three classes. We apply the KDR method, CCA, PLS, SIR, and pHd. Figure 3 shows the projection onto the 2-dimensional subspace estimated by each method. The KDR method separates the data into the three classes most completely, while CCA also shows perfect separation. We can see that the data are already nonlinearly separable in the two-dimensional space. The other methods do not separate the classes completely.


Data set                   dim. of X   training sample   test sample
Heart-disease                  13            149              148
Ionosphere                     34            151              200
Breast-cancer-Wisconsin        30            200              369

Table 6: Data description for the binary classification task.

Next, using binary classification problems, we investigate how much information on Y is preserved in the subspace. After reducing the dimensionality, we build a classification boundary using support vector machines (SVM) and compare its accuracy with that of an SVM trained on the full-dimensional X. Because the SVM is among the best known methods for many binary classification problems, it is reasonable to evaluate the class information retained in the data by the classification boundary of an SVM. In our experiments with SVM, the Matlab Support Vector Toolbox by S. Gunn is used.³ We use the data sets Heart-disease,⁴ Ionosphere, and Wisconsin-breast-cancer from the UCI repository. The data are described in Table 6.

Figure 4 shows the classification rates on the test sets for subspaces of various dimensionalities. We can see that KDR retains good separation ability even in low-dimensional subspaces, while pHd is much worse in low dimensions. It is noteworthy that on the Ionosphere data set the classification rates for dimensions 5, 10, and 20 outperform those for the full dimensionality. These results show that the kernel method successfully finds an effective subspace which preserves the class information even in the very low-dimensional cases.
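A sketch of this evaluation protocol (our own illustration; scikit-learn's SVC is used here in place of the Matlab toolbox mentioned above, and B is assumed to be a basis of the subspace estimated by KDR or pHd). Comparing the returned accuracy across dimensionalities of B, and against B = I for the full-dimensional baseline, gives the kind of comparison reported in Figure 4.

    import numpy as np
    from sklearn.svm import SVC

    def svm_accuracy_after_projection(B, X_train, y_train, X_test, y_test):
        # train an SVM on the data projected onto the columns of B and return test accuracy
        clf = SVC(kernel="rbf")
        clf.fit(X_train @ B, y_train)
        return clf.score(X_test @ B, y_test)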

5 Conclusion

We have proposed KDR, a method of dimension reduction for regression using kernels. We have derived a contrast function which is exactly the same as the kernel generalized variance (Bach and Jordan 2002a). One theoretical contribution of this paper is that we have mathematically justified the kernel generalized variance as a contrast function for conditional independence in the specific case of dimension reduction for regression.

³http://www.isis.ecs.soton.ac.uk/resources/svminfo/
⁴We use the Cleveland data set, created by Dr. Robert Detrano of the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation. Although the original data set has five classes, we use only "no presence" (0) and "presence" (1–4) as the binary class labels. Samples with missing values are removed in our experiments.


[Figure 3: 2-D projections of the Wine data; panels (a) KDR, (b) CCA, (c) PLS, (d) SIR, (e) pHd]

Figure 3: Wine data. Projections onto the estimated 2-dimensional subspaces. The three plotting symbols represent the three classes.


[Figure 4: panels (a) Heart-disease, (b) Ionosphere, (c) Wisconsin Breast Cancer; horizontal axis: number of variables; vertical axis: classification rate (%); curves: Kernel, pHd, and all variables]

Figure 4: Classification accuracy of the SVM on test data after dimension reduction.


Instead of the Gaussian approximation of mutual information, which was used as an explanation of the kernel generalized variance in previous work, we have used covariance operators to formulate conditional independence.

The experiments have shown that the KDR method has wide applicability and good performance in finding an effective subspace. Unlike many conventional methods, the method does not require any assumption on the regressor or on the distribution of X, and it shows better performance in complex and realistic problems. In particular, in classification tasks the classification rate remains high even when the dimensionality is reduced to a very small number. From these results, we can see that KDR is very promising in many practical applications of dimension reduction for regression.

One problem with KDR is local minima in the optimization of the contrast function. While we have used annealing to avoid them, there is no theoretical guarantee of escaping them, and some trial and error may be needed to find the best parameters. While this problem is common to nonlinear optimization, further improvement is desirable to make the method easier to apply.

Although our justification has been confined to dimension reduction, the success of the kernel generalized variance for trees (Bach and Jordan 2002b) and graphical models (Bach and Jordan 2003) suggests that this criterion may give a valid objective function for a wider class of problems. It is very interesting to seek sufficient conditions under which conditional independence can be captured by the kernel generalized variance. This is also a direction for future research.

Acknowledgments

The authors thank Dr. Noboru Murata of Waseda University and Dr. Motoaki Kawanabe of Fraunhofer FIRST for their helpful comments on an early version of this work.

Appendix

A Cross-covariance operators on reproducing kernel Hilbert spaces and independence of random variables

A.1 Cross-covariance operators

While cross-covariance operators are generally defined for random variables on Banach spaces (Vakhania et al. 1987; Baker 1973), they are more


easily defined on reproducing kernel Hilbert spaces (RKHS). In this subsection, we summarize some basic mathematical facts used in Sections 3.1 and 3.2. While we discuss only real Hilbert spaces, the extension to the complex case is easy.

Theorem 1. Let (Ω_1, B_1) and (Ω_2, B_2) be measurable spaces, and (H_1, k_1) and (H_2, k_2) be reproducing kernel Hilbert spaces on Ω_1 and Ω_2, respectively, with k_1 and k_2 measurable. Suppose we have a random vector (X, Y) on Ω_1 × Ω_2 such that E_X[k_1(X, X)] and E_Y[k_2(Y, Y)] are finite. Then, there uniquely exists an operator Σ_{YX} from H_1 to H_2 such that

⟨g, Σ_{YX} f⟩_{H_2} = E_{XY}[f(X)g(Y)] − E_X[f(X)] E_Y[g(Y)]   (26)

holds for all f ∈ H_1 and g ∈ H_2. This operator is called the cross-covariance operator.

Proof. The operator is obviously unique if it exists. By Riesz's representation theorem (see Reed and Simon 1980, Theorem II.4, for example), the existence of Σ_{YX} f ∈ H_2 for a fixed f can be proved by showing that the right-hand side of eq.(26) is a bounded linear functional on H_2. Linearity is obvious, and boundedness is shown by

| E_{XY}[f(X)g(Y)] − E_X[f(X)] E_Y[g(Y)] |
  ≤ E_{XY}| ⟨k_1(·, X), f⟩_{H_1} ⟨k_2(·, Y), g⟩_{H_2} | + E_X| ⟨k_1(·, X), f⟩_{H_1} | · E_Y| ⟨k_2(·, Y), g⟩_{H_2} |
  ≤ E_{XY}[ ‖k_1(·, X)‖_{H_1} ‖f‖_{H_1} ‖k_2(·, Y)‖_{H_2} ‖g‖_{H_2} ] + E_X[ ‖k_1(·, X)‖_{H_1} ‖f‖_{H_1} ] E_Y[ ‖k_2(·, Y)‖_{H_2} ‖g‖_{H_2} ]
  ≤ { E_X[k_1(X, X)]^{1/2} E_Y[k_2(Y, Y)]^{1/2} + E_X[k_1(X, X)^{1/2}] E_Y[k_2(Y, Y)^{1/2}] } ‖f‖_{H_1} ‖g‖_{H_2}.   (27)

For the last inequality, ‖k(·, x)‖_H² = k(x, x) is used. The linearity of the map Σ_{YX} follows from the uniqueness part of Riesz's representation theorem.

From eq.(27), Σ_{YX} is bounded, and by definition we see that Σ_{YX}^* = Σ_{XY}, where A^* denotes the adjoint of A. If the two RKHS are the same, the operator Σ_{XX} is called the covariance operator. A covariance operator Σ_{XX} is bounded, self-adjoint, and trace-class.

In an RKHS, in a manner similar to finite-dimensional Gaussian random variables, conditional expectations can be expressed by cross-covariance operators.

Theorem 2. Let (H_1, k_1) and (H_2, k_2) be RKHS on measurable spaces Ω_1 and Ω_2, respectively, with k_1 and k_2 measurable, and let (X, Y) be a random


vector on Ω_1 × Ω_2. Assume that E_X[k_1(X, X)] and E_Y[k_2(Y, Y)] are finite, and that for all g ∈ H_2 the conditional expectation E_{Y|X}[g(Y) | X = ·] is included in H_1. Then, for all g ∈ H_2 we have

Σ_{XX} E_{Y|X}[g(Y) | X] = Σ_{XY} g,   (28)

where Σ_{XX} and Σ_{XY} are the covariance and cross-covariance operators.

Proof. For any f ∈ H_1, we have

⟨f, Σ_{XX} E_{Y|X}[g(Y)|X]⟩_{H_1}
  = E_X[ f(X) E_{Y|X}[g(Y)|X] ] − E_X[f(X)] E_X[ E_{Y|X}[g(Y)|X] ]
  = E_{XY}[f(X)g(Y)] − E_X[f(X)] E_Y[g(Y)] = ⟨f, Σ_{XY} g⟩_{H_1}.

This completes the proof.

Corollary 3. Let Σ_{XX}^{-1} be the right inverse of Σ_{XX} on (Ker Σ_{XX})^⊥. Under the same assumptions as in Theorem 2, we have

⟨f, Σ_{XX}^{-1} Σ_{XY} g⟩ = ⟨f, E_{Y|X}[g(Y) | X]⟩   (29)

for all f ∈ (Ker Σ_{XX})^⊥ and g ∈ H_2. In particular, if Ker Σ_{XX} = {0}, we have

Σ_{XX}^{-1} Σ_{XY} g = E_{Y|X}[g(Y) | X].   (30)

Proof. Note that the product Σ_{XX}^{-1} Σ_{XY} is well defined, because Range Σ_{XY} ⊂ Range Σ_{XX} = (Ker Σ_{XX})^⊥. The first inclusion follows from the expression Σ_{XY} = Σ_{XX}^{1/2} V Σ_{YY}^{1/2} with a bounded operator V (Baker 1973, Theorem 1), and the second equality holds for any self-adjoint operator. Take f = Σ_{XX} h ∈ Range Σ_{XX}. Then, Theorem 2 gives

⟨f, Σ_{XX}^{-1} Σ_{XY} g⟩ = ⟨h, Σ_{XX} Σ_{XX}^{-1} Σ_{XX} E_{Y|X}[g(Y) | X]⟩ = ⟨h, Σ_{XX} E_{Y|X}[g(Y) | X]⟩ = ⟨f, E_{Y|X}[g(Y) | X]⟩.

This completes the proof.

The assumption E_{Y|X}[g(Y)|X = ·] ∈ H_1 in Theorem 2 can be simplified so that it can be checked without considering arbitrary g.

Proposition 4. Under the conditions of Theorem 2, if there exists C > 0 such that

E_{Y|X}[k_2(y_1, Y)|X = x_1] E_{Y|X}[k_2(y_2, Y)|X = x_2] ≤ C k_1(x_1, x_2) k_2(y_1, y_2)   (31)

for all x_1, x_2 ∈ Ω_1 and y_1, y_2 ∈ Ω_2, then for all g ∈ H_2 the conditional expectation E_{Y|X}[g(Y)|X = ·] is included in H_1.


Proof. See Theorem 2.3.13 in Alpay (2001).

For a function f in an RKHS, the expectation of f(X) can be formulated as the inner product of f with a fixed element. Let (Ω, B) be a measurable space, and (H, k) be an RKHS on Ω with k measurable. Note that for a random variable X on Ω, the linear functional f ↦ E_X[f(X)] is bounded if E_X[k(X, X)] exists. By Riesz's theorem, there is u ∈ H such that ⟨u, f⟩_H = E_X[f(X)] for all f ∈ H. If we define E_X[k(·, X)] ∈ H to be this element u, we formally obtain the equality

⟨E_X[k(·, X)], f⟩_H = E_X[⟨k(·, X), f⟩_H],   (32)

which expresses the interchangeability of the expectation over X and the inner product. While the expectation E_X[k(·, X)] can in general be defined as an integral with respect to the distribution on H induced by k(·, X), in a reproducing kernel Hilbert space the element E_X[k(·, X)] can be obtained formally as above.
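A small numerical illustration of eq.(32) (our own sketch): the empirical counterpart of E_X[k(·, X)] is the average of the kernel sections at the sample points, and pairing it with f = k(·, x_0) via the inner product reproduces an estimate of E_X[f(X)].

    import numpy as np

    def gauss_kernel(x, z, sigma=1.0):
        # Gaussian kernel of eq.(9), broadcasting over rows of x
        return np.exp(-np.sum((x - z) ** 2, axis=-1) / sigma**2)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))            # sample defining the empirical mean element
    x0 = np.array([0.3, -0.5])               # f = k(., x0) is an element of H

    # <E_X[k(., X)], f>_H estimated by the empirical mean element: (1/n) sum_i k(X_i, x0)
    lhs = np.mean(gauss_kernel(X, x0))
    # E_X[<k(., X), f>_H] = E_X[k(X, x0)], estimated from a fresh, much larger sample
    rhs = np.mean(gauss_kernel(rng.normal(size=(200000, 2)), x0))
    print(lhs, rhs)                          # agree up to Monte Carlo error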

A.2 Conditional variance and conditional independence

We define the conditional (cross-)covariance operator and derive its relation to the conditional covariance of random variables. Let (H_1, k_1), (H_2, k_2), and (H_3, k_3) be RKHS on measurable spaces Ω_1, Ω_2, and Ω_3, respectively, and (X, Y, Z) be a random vector on Ω_1 × Ω_2 × Ω_3. The conditional cross-covariance operator of (X, Y) given Z is defined by

Σ_{YX|Z} := Σ_{YX} − Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX}.   (33)

Because Ker Σ_{ZZ} ⊂ Ker Σ_{YZ}, which follows from the fact that Σ_{YZ} = Σ_{YY}^{1/2} V Σ_{ZZ}^{1/2} for some bounded operator V (Baker 1973, Theorem 1), the operator Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX} is uniquely defined, even though Σ_{ZZ}^{-1} is not unique. With this abuse of notation, we write Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX} when cross-covariance operators are discussed.

The conditional cross-covariance operator is related to the conditional covariance of the random variables.

Proposition 5. Let (H_1, k_1), (H_2, k_2), and (H_3, k_3) be RKHS on measurable spaces Ω_1, Ω_2, and Ω_3, respectively, with k_i measurable, and let (X, Y, Z) be a measurable random vector on Ω_1 × Ω_2 × Ω_3 such that E_X[k_1(X, X)], E_Y[k_2(Y, Y)], and E_Z[k_3(Z, Z)] are finite. It is assumed that E_{X|Z}[f(X)|Z] and E_{Y|Z}[g(Y)|Z] are included in H_3 for all f ∈ H_1 and g ∈ H_2. Then, for


all f ∈ H_1 and g ∈ H_2, we have

⟨g, Σ_{YX|Z} f⟩_{H_2} = E_{XY}[f(X)g(Y)] − E_Z[ E_{X|Z}[f(X)|Z] E_{Y|Z}[g(Y)|Z] ]
                      = E_Z[ Cov_{XY|Z}( f(X), g(Y) | Z ) ].   (34)

Proof. From the decomposition Σ_{YZ} = Σ_{YY}^{1/2} V Σ_{ZZ}^{1/2}, we have Σ_{ZY} g ∈ (Ker Σ_{ZZ})^⊥. Then, by Corollary 3, we obtain

⟨g, Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX} f⟩ = ⟨Σ_{ZY} g, Σ_{ZZ}^{-1} Σ_{ZX} f⟩ = ⟨Σ_{ZY} g, E_{X|Z}[f(X)|Z]⟩
                               = E_{YZ}[ g(Y) E_{X|Z}[f(X)|Z] ] − E_X[f(X)] E_Y[g(Y)].

From this equation, the proposition is proved by

⟨g, Σ_{YX|Z} f⟩ = E_{XY}[f(X)g(Y)] − E_X[f(X)] E_Y[g(Y)]
                  − E_{YZ}[ g(Y) E_{X|Z}[f(X)|Z] ] + E_X[f(X)] E_Y[g(Y)]
                = E_{XY}[f(X)g(Y)] − E_Z[ E_{X|Z}[f(X)|Z] E_{Y|Z}[g(Y)|Z] ].   (35)

The following definition is important for stating our main theorem. Let (Ω, B) be a measurable space, (H, k) be an RKHS over Ω with k measurable and bounded, and S be the set of all probabilities on (Ω, B). The RKHS H is called probability-determining if the map

S ∋ P ↦ ( f ↦ E_{X∼P}[f(X)] ) ∈ H*   (36)

is one-to-one, where H* is the dual space of H. From Riesz's theorem, H is probability-determining if and only if the map

S ∋ P ↦ E_{X∼P}[k(·, X)] ∈ H

is one-to-one. Theorem 2 of Bach and Jordan (2002a) shows the following fact:

Theorem 6 (Bach and Jordan 2002a). For an arbitrary σ > 0, the reproducing kernel Hilbert space with Gaussian kernel k(x, y) = exp(−‖x − y‖²/σ) on R^m is probability-determining.

Recall that for two RKHS H_1 and H_2 on Ω_1 and Ω_2, respectively, the direct product H_1 ⊗ H_2 is the RKHS on Ω_1 × Ω_2 with the positive definite kernel k_1 k_2 (see Aronszajn 1950). The relation between conditional independence and the conditional covariance operator is given by the following theorem:


Theorem 7. Let (H_{11}, k_{11}), (H_{12}, k_{12}), and (H_2, k_2) be reproducing kernel Hilbert spaces on measurable spaces Ω_{11}, Ω_{12}, and Ω_2, respectively, with continuous and bounded kernels. Let (X, Y) = (Z, W, Y) be a random vector on Ω_{11} × Ω_{12} × Ω_2, where X = (Z, W), and let H_1 = H_{11} ⊗ H_{12} be the direct product. It is assumed that E_{Y|Z}[g(Y)|Z] ∈ H_{11} and E_{Y|X}[g(Y)|X] ∈ H_1 for all g ∈ H_2. Then, we have

Σ_{YY|Z} ≥ Σ_{YY|X},   (37)

where the inequality refers to the order of self-adjoint operators, and

Σ_{YY|X} = Σ_{YY|Z}  ⟺  Y ⊥⊥ W | Z.   (38)

Proof. The right-hand side of eq.(38) is equivalent to P_{Y|X} = P_{Y|Z}, where P_{Y|X} and P_{Y|Z} are the conditional probabilities of Y given X and given Z, respectively. Taking the expectation of the well-known equality

V_{Y|Z}[g(Y)|Z] = E_{W|Z}[ V_{Y|Z,W}[g(Y)|Z, W] ] + V_{W|Z}[ E_{Y|Z,W}[g(Y)|Z, W] ]   (39)

with respect to Z, we derive

E_Z[ V_{Y|Z}[g(Y)|Z] ] = E_X[ V_{Y|X}[g(Y)|X] ] + E_Z[ V_{W|Z}[ E_{Y|X}[g(Y)|X] ] ].   (40)

Since the last term of eq.(40) is non-negative, Proposition 5 gives eq.(37). Equality holds if and only if V_{W|Z}[ E_{Y|X}[g(Y)|X] ] = 0 for almost every Z, which means that E_{Y|X}[g(Y)|X] does not depend on W almost surely. This is equivalent to

E_{Y|X}[g(Y)|X] = E_{Y|Z}[g(Y)|Z]   (41)

for almost every Z and W. Because H_2 is probability-determining, this means P_{Y|X} = P_{Y|Z}.

A.3 Conditional covariance and conditional independence

Theorem 7 gives a condition for conditional independence using the conditional covariance operator. Another formulation is possible with a conditional cross-covariance operator.

Let (Ω_1, B_1), (Ω_2, B_2), and (Ω_3, B_3) be measurable spaces, and (X, Y, Z) be a random vector on Ω_1 × Ω_2 × Ω_3 with law P_{XYZ}. We define a probability E_Z[P_{X|Z} ⊗ P_{Y|Z}] on Ω_1 × Ω_2 by

E_Z[P_{X|Z} ⊗ P_{Y|Z}](A × B) = E_Z[ E_{X|Z}[χ_A|Z] E_{Y|Z}[χ_B|Z] ],   (42)

where χ_A is the characteristic function of a measurable set A. This is canonically extended to all product-measurable sets in Ω_1 × Ω_2.


Theorem 8. Let (Ω_i, B_i) (i = 1, 2, 3) be measurable spaces, (H_i, k_i) be an RKHS on Ω_i with kernel measurable and bounded, and (X, Y, Z) be a random vector on Ω_1 × Ω_2 × Ω_3. It is assumed that E_{X|Z}[f(X)|Z] and E_{Y|Z}[g(Y)|Z] belong to H_3 for all f ∈ H_1 and g ∈ H_2, and that H_1 ⊗ H_2 is probability-determining. Then, we have

Σ_{YX|Z} = O  ⟺  P_{XY} = E_Z[P_{X|Z} ⊗ P_{Y|Z}].   (43)

Proof. The right-to-left direction is trivial from Proposition 5 and the definition of E_Z[P_{X|Z} ⊗ P_{Y|Z}]. The left-hand side yields E_Z[ E_{X|Z}[f(X)|Z] E_{Y|Z}[g(Y)|Z] ] = E_{XY}[f(X)g(Y)] for all f ∈ H_1 and g ∈ H_2. By the definition of H_1 ⊗ H_2, we then have E_{(X′,Y′)∼Q}[h(X′, Y′)] = E_{XY}[h(X, Y)] for all h ∈ H_1 ⊗ H_2, where Q = E_Z[P_{X|Z} ⊗ P_{Y|Z}]. This implies the right-hand side, because H_1 ⊗ H_2 is probability-determining.

The right-hand side of eq.(43) is weaker than the conditional independence of X and Y given Z. However, if Z is a part of X, it does give conditional independence.

Corollary 9. Let (H_{11}, k_{11}), (H_{12}, k_{12}), and (H_2, k_2) be reproducing kernel Hilbert spaces on measurable spaces Ω_{11}, Ω_{12}, and Ω_2, respectively, with kernels measurable and bounded. Let (X, Y) = (Z, W, Y) be a random vector on Ω_{11} × Ω_{12} × Ω_2, where X = (Z, W), and let H_1 = H_{11} ⊗ H_{12} be the direct product. It is assumed that E_{X|Z}[f(X)|Z] and E_{Y|Z}[g(Y)|Z] belong to H_{11} for all f ∈ H_1 and g ∈ H_2, and that H_1 ⊗ H_2 is probability-determining. Then, we have

Σ_{YX|Z} = O  ⟺  Y ⊥⊥ W | Z.   (44)

Proof. For any measurable sets A ⊂ Ω_{11}, B ⊂ Ω_{12}, and C ⊂ Ω_2, we have, in general,

E_Z[ E_{X|Z}[χ_{A×B}(Z, W)|Z] E_{Y|Z}[χ_C(Y)|Z] ] − E_{XY}[ χ_{A×B}(Z, W) χ_C(Y) ]
  = E_Z[ E_{W|Z}[χ_B(W)|Z] χ_A(Z) E_{Y|Z}[χ_C(Y)|Z] ] − E_Z[ E_{WY|Z}[χ_B(W)χ_C(Y)|Z] χ_A(Z) ]
  = ∫_A { P_{W|Z}(B|z) P_{Y|Z}(C|z) − P_{WY|Z}(B × C|z) } dP_Z(z).   (45)

From Theorem 8, the left-hand side of eq.(44) is equivalent to E_Z[P_{X|Z} ⊗ P_{Y|Z}] = P_{XY}, which implies that the last integral in eq.(45) is zero for all A. This means P_{W|Z}(B|z) P_{Y|Z}(C|z) − P_{WY|Z}(B × C|z) = 0 for P_Z-almost every z. Thus, Y and W are conditionally independent given Z. The converse is trivial.

Note that the operator on the left-hand side of eq.(44) is not Σ_{YW|Z} but Σ_{YX|Z}, which is defined on the direct product H_{11} ⊗ H_{12}.


References

Alpay, D. (2001). The Schur Algorithm, Reproducing Kernel Spaces and System Theory. American Mathematical Society.

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 (3), 337–404.

Bach, F. and M. Jordan (2002a). Kernel independent component analysis. Journal of Machine Learning Research 3, 1–48.

Bach, F. and M. Jordan (2002b). Tree-dependent component analysis. In Uncertainty in Artificial Intelligence: Proceedings of the Eighteenth Conference (UAI-2002).

Bach, F. and M. Jordan (2003). Learning graphical models with Mercer kernels. In Advances in Neural Information Processing Systems 15. MIT Press.

Baker, C. (1973). Joint measures and cross-covariance operators. Trans. Amer. Math. Soc. 186, 273–289.

Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), 5th Annual ACM Workshop on COLT, Pittsburgh, PA, pp. 144–152. ACM Press.

Breiman, L. and J. H. Friedman (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80, 580–598.

Cook, R. and H. Lee (1999). Dimension reduction in regressions with a binary response. Journal of the American Statistical Association 94, 1187–1200.

Cook, R. and S. Weisberg (1991). Discussion of Li (1991). Journal of the American Statistical Association 86, 328–332.

Cook, R. D. (1998). Regression Graphics. Wiley Inter-Science.

Cook, R. D. and X. Yin (2001). Dimension reduction and visualization in discriminant analysis (with discussion). Australian & New Zealand Journal of Statistics 43 (2), 147–199.

Deerwester, S. C., S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), 391–407.

Friedman, J. H. and W. Stuetzle (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.


Fung, W., X. He, L. Liu, and P. Shi (2002). Dimension reduction based on canonical correlation. Statistica Sinica 12 (4), 1093–1114.

Gilley, O. and K. R. Pace (1996). On the Harrison and Rubinfeld data. Journal of Environmental Economics and Management 31, 403–405.

Golub, T. R., D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Harrison, D. and D. Rubinfeld (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management 5, 81–102.

Hastie, T. and R. Tibshirani (1986). Generalized additive models. Statistical Science 1, 297–318.

Helland, I. S. (1988). On the structure of partial least squares. Communications in Statistics - Simulation and Computation 17 (2), 581–607.

Hoskuldsson, A. (1988). PLS regression methods. Journal of Chemometrics 2, 211–228.

Hristache, M., A. Juditsky, J. Polzehl, and V. Spokoiny (2001). Structure adaptive approach for dimension reduction. The Annals of Statistics 29 (6), 1537–1566.

Kambhatla, N. and T. K. Leen (1997). Dimension reduction by local principal component analysis. Neural Computation 9, 1493–1516.

Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association 86, 316–342.

Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. Journal of the American Statistical Association 87, 1025–1039.

Li, K.-C., H.-H. Lue, and C.-H. Chen (2000). Interactive tree-structured regression via principal Hessian directions. Journal of the American Statistical Association 95 (450), 547–560.

Murphy, P. and D. Aha (1994). UCI repository of machine learning databases. Technical report, University of California, Irvine, Department of Information and Computer Science. http://www.ics.uci.edu/˜mlearn/MLRepository.html.


Nguyen, D. V. and D. M. Rocke (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18 (1), 39–50.

Reed, M. and B. Simon (1980). Functional Analysis. Academic Press.

Samarov, A. M. (1993). Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association 88 (423), 836–847.

Scholkopf, B., A. Smola, and K.-R. Muller (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319.

Vakhania, N., V. Tarieladze, and S. Chobanyan (1987). Probability Distributions on Banach Spaces. D. Reidel Publishing Company.

Vapnik, V., S. Golowich, and A. Smola (1997). Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9, Cambridge, MA, pp. 281–287. MIT Press.

Weisberg, S. (2002). Dimension reduction regression in R. Journal of Statistical Software 7 (1).
