Optimal Transport in Reproducing Kernel Hilbert Spaces: Theory and Applications

Zhen Zhang, Student Member, IEEE, Mianzhi Wang, Student Member, IEEE,

and Arye Nehorai, Life Fellow, IEEE

Abstract—In this paper, we present a mathematical and computational framework for comparing and matching distributions in reproducing kernel Hilbert spaces (RKHS). This framework, called optimal transport in RKHS, is a generalization of the optimal transport problem in input spaces to (potentially) infinite-dimensional feature spaces. We provide a computable formulation of Kantorovich's optimal transport in RKHS. In particular, we explore the case in which data distributions in RKHS are Gaussian, obtaining closed-form expressions of both the estimated Wasserstein distance and optimal transport map via kernel matrices. Based on these expressions, we generalize the Bures metric on covariance matrices to infinite-dimensional settings, providing a new metric between covariance operators. Moreover, we extend the correlation alignment problem to Hilbert spaces, giving a new strategy for matching distributions in RKHS. Empirically, we apply the derived formulas under the Gaussianity assumption to image classification and domain adaptation. In both tasks, our algorithms yield state-of-the-art performance, demonstrating the effectiveness and potential of our framework.

Index Terms—Optimal transport, reproducing kernel Hilbert spaces, kernel methods, optimal transport map, Wasserstein distance, Wasserstein geometry, covariance operator, image classification, domain adaptation


    1 INTRODUCTION

THE popularity of optimal transport (OT) has grown dramatically in recent years. Techniques built upon optimal transport have achieved great success in many applications, including computer vision [1], [2], [3], [4], [5], statistical machine learning [6], [7], [8], [9], geometry processing [10], [11], [12], fluid mechanics [13], and optimal control.

As the name suggests, OT aims at finding an optimal strategy of transporting mass from source locations to target locations. More specifically, assume we are given a pile of sand, modeled by the probability measure $\mu$, and a hole with the same volume, modeled by the probability measure $\nu$ (see Fig. 1). We also have a cost function $c(x, y)$ (usually a distance function named the "ground distance") describing how much it costs to move one unit of mass from location $x$ to location $y$. The OT problem corresponds to finding the optimal transport map $T$ (or plan) to minimize the total cost of filling up the hole.

Given the two probability measures $\mu$ and $\nu$, the optimal transport map can be considered as the most efficient map transferring $\mu$ to $\nu$, in the sense of minimizing the total transport cost. This map has been successfully applied to color transfer [3], Bayesian inference [14], and domain adaptation [7], [8]. The total minimal cost can be viewed as the

discrepancy, the so-called Wasserstein distance, between $\mu$ and $\nu$. Intuitively, if $\mu$ and $\nu$ are similar, the transportation cost will be small. Different from other discrepancies, such as the K-L divergence and the $L_2$ distance, the Wasserstein distance incorporates the geometric information of the underlying support through the cost function. Because of its geometric characteristics, the Wasserstein distance provides a powerful framework for comparing and analyzing probability distributions [1], [15]. Moreover, in some machine learning problems, it has also been used to define a loss function for generative models to improve their stability and interpretability [6], [16]. There are references exploiting the case where OT operates on Gaussian measures. In [17], textures are modeled by Gaussian measures, and synthetic textures are obtained via OT mixing. In [18], an elegant framework is proposed for comparing and interpolating Gaussian mixture models.

All the works mentioned above exploit the machinery of OT in the original input spaces (usually Euclidean spaces $\mathbb{R}^n$). However, the OT problem in reproducing kernel Hilbert spaces (RKHS) has not been widely investigated. In this paper, we propose a theoretical and computational framework to bridge this gap. The motivations are the following.

1) There are various ways to represent data, such as strings [19], graphs [20], proteins [21], automata [22], and lattices [23]. For some of these representations, we have access only to data-dependent kernel functions characterizing the affinity relations between examples, instead of a ground distance or cost function. Thus, it is not straightforward to formulate the OT problem for such datasets. Sometimes, even

The authors are with the Department of Electrical and Systems Engineering, Washington University in St. Louis, Saint Louis, MO 63130. E-mail: {zhen.zhang, mianzhi.wang, nehorai}@wustl.edu.

Manuscript received 2 Oct. 2017; revised 13 Jan. 2019; accepted 27 Feb. 2019. Date of publication 4 Mar. 2019; date of current version 3 June 2020. (Corresponding author: Arye Nehorai.) Recommended for acceptance by C. H. Lampert. Digital Object Identifier no. 10.1109/TPAMI.2019.2903050


for metric spaces (like Riemannian manifolds), kernels are more powerful than distance functions in measuring the similarity between points [24].

2) There is a huge number of machine learning algorithms formulated in RKHS, due to its capability of capturing nonlinear structures. The performance of these algorithms depends highly on the data distributions in feature spaces. Hence, it is of vital importance to develop a general framework to analyze and match RKHS probability measures.

Following the common procedure for kernel-based methods, we first map data into an RKHS through a feature map $\phi$, and then formulate OT in the resulting space. Because the feature map is implicit, we have no access to the pushforward measures¹ on the RKHS, which makes the problem different from OT in the original input space. The key point of our work is taking advantage of the interplay between kernel functions and probability measures to develop computable formulations and expressions.

Since the straightforward formulation of OT in RKHS involves the implicit feature map, we propose an equivalent and computable formulation in which the problem of OT between pushforward measures on RKHS can be fully determined by the kernel function. It will be seen that the alternative formulation can be viewed as the OT problem in the original space, with the cost function induced by the kernel. We name the corresponding Wasserstein distance the "kernel Wasserstein distance" (KW for short).

For the case in which pushforward measures are Gaussian, we use kernel matrices to derive closed-form expressions of the empirical Wasserstein distance and optimal transport map, which we term the "kernel Gauss-Wasserstein distance" (KGW for short) and the "kernel Gauss-optimal transport map" (KGOT for short), respectively. If the expectations of two Gaussian measures are the same, then KGW introduces a distance between covariance operators, generalizing the Bures metric on covariance matrices to infinite-dimensional settings. We term this distance the "kernel Bures distance" (KB for short). More interestingly, the KB distance does not require covariance operators to be strictly positive (or invertible), which makes it rather appealing since the estimated covariance operators from finite samples are always rank-deficient. The KGOT map is a continuous linear operator. It introduces a new alignment strategy for RKHS distributions by forcing the covariance operator of the source distribution to approach that of the target distribution.

Empirically, we apply the tools developed under the Gaussianity assumption to image classification and domain adaptation tasks. In image classification, we represent each image with a collection of feature samples (the so-called "ensemble" [25]), then employ the KGW or KB distance to quantify the difference between them. In domain adaptation, we solve the domain shift issue in RKHS. That is, we use the KGOT map to transport the samples in the source domain to the target domain to reduce the distribution difference. The promising results for both tasks demonstrate the strong capability of our framework in comparing and matching distributions.

Here, we provide insights on our strategy in the above applications. Our approaches are based on the results obtained from optimal transport between Gaussian distributions on RKHS. As mentioned above, one favorable property of RKHS Gaussian distributions is that we can obtain closed-form solutions. Moreover, it has been both numerically and theoretically justified that after nonlinear kernel (e.g., RBF kernel) transformations, data are more likely to be Gaussian [26]. This phenomenon is exploited by many kernel-based methods. For example, in [27], [28], probabilistic kernel PCA is formulated based on a latent Gaussian model in RKHS. In [26], Fisher discriminant analysis is implemented in feature spaces by assuming that RKHS samples belonging to different classes follow Gaussian distributions with the same covariance operator but different means. In [29], the Gaussianity of RKHS data is assumed in order to compute the mutual information. More detailed discussions of this assumption can be found in [25], [30], and [26]. On the other hand, our approaches can also be interpreted from the perspective of Hilbert space embeddings, without the Gaussianity assumption in RKHS. The KGW distance and the KGOT map operate only on RKHS means and covariance operators, which are informative enough to characterize data distributions. Therefore, the problem of comparing and matching distributions can be naturally solved by comparing and aligning kernel means and covariance operators.

Contributions. The contributions of our work are summarized as follows. (1) We introduce a systematic framework for optimal transport in RKHS, including both theoretical and computational formulations. (2) Assuming Gaussianity in RKHS, we derive closed-form expressions of the estimated Wasserstein distance and optimal transport map via Gram matrices. (3) We apply our formulations to the tasks of image classification and domain adaptation. On several challenging datasets, our methods outperform state-of-the-art approaches, demonstrating the effectiveness and potential of our framework.

Related Work. From the mathematical perspective, our work lies at the intersection of two topics: reproducing kernel Hilbert spaces [31] and optimal transport [32]. The topological properties of RKHS, which are the cornerstones of our work, are systematically characterized in [31]. Formulating OT in abstract spaces is considered in [33], [34], and [35]. In [34] and [35], general expressions of the Wasserstein distance between Gaussian measures on Hilbert spaces are derived. All the works above provide rigorous foundations for our framework. We will show how the theorems from

    Fig. 1. Illustration of the optimal transport problem.

1. Given a probability measure $\mu$ on the input space, mapping the data through the implicit map $\phi$, we are interested in the data distribution in RKHS. Such a distribution is called the pushforward measure, denoted as $\phi_\#\mu$, satisfying that for any subset $A$ in RKHS, $\phi_\#\mu(A) = \mu(\phi^{-1}(A))$.


RKHS and OT elegantly interact with each other to advance the construction of "OT in RKHS". In fact, RKHS provides a platform for the theory of OT in abstract spaces to be applied in real-world problems. A closely related recent work can be found in [36], where the authors propose a Wasserstein distance based framework for the statistical analysis of Gaussian processes (GPs). They formulate the OT problem in the space of GPs, which is essentially an RKHS.

From the empirical perspective, there are several related approaches for image classification and domain adaptation.

In image classification, the strategy of representing images with collections of feature vectors has attracted increasing attention. The subsequent procedure of quantifying the dissimilarities between such ensembles is actually the crucial problem in image classification. The related algorithms dealing with this problem can be roughly categorized into two classes: covariance matrix-based approaches and covariance operator-based approaches. The methods belonging to the first class, such as [37] and [38], exploit second-order statistics constructed in the original input spaces, characterizing the differences by comparing covariance matrices. The methods in the second class, such as [39], [40], and [41], encode ensembles with infinite-dimensional RKHS covariance operators, and compute kernelized versions of divergences or distances between them. Covariance operator-based approaches usually achieve better performance, since covariance operators can capture nonlinear correlations. Remarkably, all the above approaches take advantage of the non-Euclidean geometry of covariance matrices and covariance operators, which is usually quite favorable in computer vision problems [42]. In our work, we derive a computable expression of the kernel Bures distance between covariance operators, which generalizes the Wasserstein geometry to the infinite-dimensional RKHS. Moreover, the KB distance also achieves promising results.

Domain shift, which occurs when the training (source) and testing (target) datasets follow different distributions, usually results in poor performance of the trained model on the target domain. It is a fundamental problem in statistics and machine learning, and often arises in real-world applications. There are many strategies to deal with this issue. For example, the methods in [43], [44] aim at identifying a domain-invariant subspace where the source and target distributions are similar. The works in [45] and [46] exploit the intermediate subspaces treated as points on a geodesic curve of the Grassmann manifold. The authors either sample a finite number of subspaces or integrate along the geodesics to model the domain shift. In [47], an algorithm is introduced for minimizing the distribution difference through reweighting samples. More recently, OT-based methods [7], [8] have been proposed. In [7], the authors use OT to find a non-rigid transformation to align the source and target distributions. They propose several regularization schemes to improve the regularity of the learned mapping. In [8], the authors formulate an optimization problem to learn an explicit transformation that approximates the OT map, so it can generalize to out-of-sample patterns. We develop our method from a significantly different view. The methodological difference is that we match distributions in RKHS, while all the works mentioned above attempt to reduce the dissimilarity of distributions in the original input space. Thanks to the Gaussianity of data in RKHS, we can conduct the alignment with the KGOT map, a continuous linear operator having an explicit expression. Regularity can be guaranteed by its continuity and linearity. In [48], the task of matching RKHS distributions is formulated as aligning kernel matrices. However, kernel matrices may have different sizes, and their rows/columns do not necessarily correspond. To tackle such problems, the authors introduce the "surrogate kernel". Different from [48], our KGOT map directly operates on covariance operators, which is more intuitive and straightforward, totally avoiding the above problems. In addition, if we select the linear kernel, i.e., $k(x, y) = x^T y$, our approach degenerates to aligning covariance matrices, which is similar to CORAL [49].

Organization. In Section 2, we provide the background of RKHS and optimal transport in Euclidean spaces. Sections 3 and 4 form the core of our work, where we develop the computational framework of OT in RKHS, together with closed-form expressions of the empirical KGW distance and KGOT map. In Section 5, we describe the details of applying the derived formulas to image classification and domain adaptation, respectively. In Section 6, we report the experimental results on real datasets. In the Supplementary Material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109//TPAMI.2019.2903050, we provide the proofs of all mathematical results in this paper, along with further technical discussions and more experimental results. In Table 1, we list the notations introduced in the paper.

    2 BACKGROUND

In this section, we first introduce the reproducing kernel Hilbert space. Next, we review two classical formulations of the optimal transport problem in $\mathbb{R}^n$. We then discuss the relevant conclusions for the special case in which the probability measures are Gaussian.

We use $\|\cdot\|_2$ to denote the Euclidean distance. We use $\Pr(\mathbb{R}^n)$ to indicate the set of Borel probability measures on

TABLE 1
Notations

Symbol (Acronym): Meaning

$d_W(\mu, \nu)$ (–): The Wasserstein distance between probability measures $\mu$ and $\nu$ on $\mathbb{R}^n$.
$d_{GaW}(\mu, \nu)$ (GaW): A pseudo-metric on measures with finite first- and second-order moments. If $\mu$ and $\nu$ are Gaussian, $d_{GaW}(\mu, \nu)$ is just the corresponding Wasserstein distance between $\mu$ and $\nu$.
$d_B(\Sigma_1, \Sigma_2)$ (–): The Bures metric between positive semidefinite matrices $\Sigma_1$ and $\Sigma_2$.
$T_G$ (–): The optimal transport map between Gaussian measures on $\mathbb{R}^n$.
$d^{\mathcal{H}}_W(\mu, \nu)$ (KW): The Wasserstein distance between probability measures $\phi_\#\mu$ and $\phi_\#\nu$ on RKHS.
$d^{\mathcal{H}}_{GW}(\mu, \nu)$ (KGW): The Wasserstein distance between Gaussian measures $\phi_\#\mu$ and $\phi_\#\nu$ on RKHS.
$d^{\mathcal{H}}_B(R_1, R_2)$ (KB): The kernel Bures distance between RKHS covariance operators $R_1$ and $R_2$.
$T^{\mathcal{H}}_G$ (KGOT): The optimal transport map between Gaussian measures on RKHS.


$\mathbb{R}^n$, and use $\Pr(\mathbb{R}^n \times \mathbb{R}^n)$ to indicate the set of Borel probability measures on the product space $\mathbb{R}^n \times \mathbb{R}^n$.

    2.1 Reproducing Kernel Hilbert Spaces

Let $\mathcal{X}$ be a nonempty set, and let $\mathcal{H}$ be a Hilbert space of $\mathbb{R}$-valued functions defined on $\mathcal{X}$. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$, and $\mathcal{H}$ is a reproducing kernel Hilbert space, if $k$ satisfies:

1) $\forall x \in \mathcal{X}$, $k(\cdot, x) \in \mathcal{H}$,
2) $\forall x \in \mathcal{X}$, $f \in \mathcal{H}$, $\langle f, k(\cdot, x)\rangle_{\mathcal{H}} = f(x)$.

Define the implicit feature map $\phi : \mathcal{X} \to \mathcal{H}$ as $\phi(x) = k(\cdot, x)$. Then we have $\langle \phi(x), \phi(y)\rangle_{\mathcal{H}} = k(x, y)$, $\forall x, y \in \mathcal{X}$.

It can be easily shown that $k$ is positive definite. On the other hand, the Moore-Aronszajn theorem says that any positive definite kernel $k$ is associated with a unique RKHS.

2.2 Optimal Transport in $\mathbb{R}^n$

    2.2.1 Two Formulations

Monge's Formulation. Given two probability measures $\mu, \nu \in \Pr(\mathbb{R}^n)$, Monge's problem is to find a transport map $T : \mathbb{R}^n \to \mathbb{R}^n$ that pushes $\mu$ to $\nu$ (denoted as $T_\#\mu = \nu$) so as to minimize the total transport cost. The problem is formulated as

$$\inf_{T_\#\mu = \nu} \int_{\mathbb{R}^n} \|\tilde{x} - T(\tilde{x})\|_2^2 \, d\mu(\tilde{x}), \qquad (1)$$

where $\|\tilde{x} - T(\tilde{x})\|_2^2$ is the cost function, reflecting the geometric information of the underlying supports. The physical meaning of Monge's formulation is illustrated in Fig. 1.

However, in some cases, this formulation is ill-posed, in the sense that the existence of $T$ cannot be guaranteed. A typical example is where $\mu$ is a Dirac measure but $\nu$ is not: there is no such $T$ transferring $\mu$ to $\nu$. To tackle this issue, Kantorovich gives a relaxed version of OT.

Kantorovich's Formulation. Kantorovich's formulation of OT is a relaxation of Monge's. In Kantorovich's formulation, the objective function is minimized over all transport plans instead of transport maps. It can be written as follows:

$$\inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} \|\tilde{x} - \tilde{y}\|_2^2 \, d\pi(\tilde{x}, \tilde{y}), \qquad (2)$$

where $\Pi(\mu, \nu)$ is the set of joint probability measures on $\mathbb{R}^n \times \mathbb{R}^n$ with marginals $\mu$ and $\nu$.

The transport plan $\pi(\tilde{x}, \tilde{y})$ is a joint probability measure describing the amount of mass transported from location $\tilde{x}$ to location $\tilde{y}$. Different from Monge's problem, Kantorovich's formulation allows splitting the mass. That is, the mass at one location can be divided and transported to multiple destinations. It can be proved [32] that the square root of the minimal cost of (2) defines a metric on $\Pr(\mathbb{R}^n)$. This metric is the so-called Wasserstein distance, denoted as $d_W(\mu, \nu)$. That is,

$$d_W(\mu, \nu) \triangleq \inf_{\pi \in \Pi(\mu, \nu)} \Big[ \int_{\mathbb{R}^n \times \mathbb{R}^n} \|\tilde{x} - \tilde{y}\|_2^2 \, d\pi(\tilde{x}, \tilde{y}) \Big]^{\frac{1}{2}}. \qquad (3)$$

2.2.2 OT between Gaussian Measures on $\mathbb{R}^n$

The following theorem provides a lower bound for the Wasserstein distance between arbitrary measures $\mu$ and $\nu$, together with a condition under which the lower bound is achieved. The lower bound is just the Wasserstein distance between Gaussian measures, named the "Gauss-Wasserstein distance" (GaW for short, and not to be confused with the Gromov-Wasserstein distance, denoted by GW).

Theorem 1 (See [50]). Let $\mu$ and $\nu$ be two probability measures on $\mathbb{R}^n$ with finite first- and second-order moments. Let $\tilde{m}_\mu$ and $\tilde{m}_\nu$, and $\Sigma_\mu$ and $\Sigma_\nu$, be the corresponding expectations and covariance matrices, respectively. Write

$$d_{GaW}(\mu, \nu) = \Big[ \|\tilde{m}_\mu - \tilde{m}_\nu\|_2^2 + \mathrm{tr}\big(\Sigma_\mu + \Sigma_\nu - 2\Sigma_{\mu\nu}\big) \Big]^{\frac{1}{2}}, \qquad (4)$$

where $\Sigma_{\mu\nu} = (\Sigma_\mu^{\frac{1}{2}} \Sigma_\nu \Sigma_\mu^{\frac{1}{2}})^{\frac{1}{2}}$. Then,

1) $d_{GaW}(\mu, \nu) \le d_W(\mu, \nu)$, and
2) equality holds if both $\mu$ and $\nu$ are Gaussian.

Remark 1. $(\cdot)^{\frac{1}{2}}$ denotes the principal matrix square root, i.e., for any positive semi-definite (PSD) matrix $\Sigma$, write the eigendecomposition $\Sigma = U\Lambda U^T$; then $\Sigma^{\frac{1}{2}} = U\Lambda^{\frac{1}{2}}U^T$.

The function $d_{GaW}$ can be considered a pseudo-metric on probability measures with finite first- and second-order moments. Based on conclusion (2), we see that if $\mu$ and $\nu$ are Gaussian, then $d_{GaW}(\mu, \nu)$ is just the corresponding Wasserstein distance. Hence, $d_{GaW}$ defines a metric on the set of all Gaussian measures, which are uniquely characterized by the first two order statistics. In the case that $\mu$ and $\nu$ have the same expectation, $d_{GaW}$ introduces a metric on covariance matrices, which is known as the Bures metric [51].
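For concreteness, the closed-form expression (4) can be evaluated directly from the two means and covariances. Below is a minimal numerical sketch assuming NumPy and SciPy are available; the function name gauss_wasserstein and the toy inputs are illustrative, not from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def gauss_wasserstein(m1, S1, m2, S2):
    """Closed-form Gauss-Wasserstein distance of Eq. (4):
    d_GaW^2 = ||m1 - m2||^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    S1_half = np.real(sqrtm(S1))                      # principal square root of S1
    cross = np.real(sqrtm(S1_half @ S2 @ S1_half))    # (S1^{1/2} S2 S1^{1/2})^{1/2}
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return np.sqrt(max(d2, 0.0))                      # clip tiny negative round-off

# Example: two 2-D Gaussians
m1, m2 = np.zeros(2), np.ones(2)
S1, S2 = np.eye(2), np.diag([2.0, 0.5])
print(gauss_wasserstein(m1, S1, m2, S2))
```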

Corollary 1. Let $\mathrm{Sym}^+(n)$ be the set of all positive semi-definite matrices of size $n \times n$. Then

$$d_B(\Sigma_1, \Sigma_2) = \Big[ \mathrm{tr}\big(\Sigma_1 + \Sigma_2 - 2\Sigma_{12}\big) \Big]^{\frac{1}{2}}, \qquad (5)$$

where $\Sigma_{12} = (\Sigma_1^{\frac{1}{2}} \Sigma_2 \Sigma_1^{\frac{1}{2}})^{\frac{1}{2}}$, defines a metric on $\mathrm{Sym}^+(n)$.

Note that $d_B$ defines a metric on PSD matrices, including rank-deficient ones. This is a rather desirable property in practice, because the dimension of the samples is sometimes larger than the sample size, which results in rank-deficiency of the estimated covariance matrices. $d_B$ is well-defined on such matrices, without any regularization operations.

Usually, given PSD matrices $\Sigma_1$ and $\Sigma_2$, we have $d_B(\Sigma_1, \Sigma_2) \neq d_B(0, \Sigma_1 - \Sigma_2)$, which implies that the Bures metric exploits the non-Euclidean geometry of $\mathrm{Sym}^+(n)$. Such a geometry is the so-called Wasserstein geometry [52], in which $d_B$ is just the geodesic distance function.

Different from the Wasserstein distance, the optimal transport map between Gaussian measures usually needs to take the rank of the covariance matrices into account. We start from the ideal case where the covariance matrices of both $\mu$ and $\nu$ are of full rank.

Theorem 2 (See [50]). Let $\mu$ and $\nu$ be two Gaussian measures on $\mathbb{R}^n$ whose covariance matrices are of full rank. Let $\tilde{m}_\mu$ and $\tilde{m}_\nu$, and $\Sigma_\mu$ and $\Sigma_\nu$, denote the respective expectations and covariance matrices. Then the optimal transport map $T_G$ between $\mu$ and $\nu$ exists, and can be written as

$$T_G(\tilde{x}) = \Sigma_\mu^{-\frac{1}{2}} \Sigma_{\mu\nu} \Sigma_\mu^{-\frac{1}{2}} (\tilde{x} - \tilde{m}_\mu) + \tilde{m}_\nu. \qquad (6)$$


We can see that in the full-rank case, the most "efficient" map transferring one Gaussian measure to another is affine. However, if the covariance matrix is rank-deficient, which corresponds to the case where the Gaussian measure concentrates on a low-dimensional affine subspace of $\mathbb{R}^n$, the conclusions in the above theorem do not necessarily hold. Even the existence of the optimal transport map cannot be guaranteed. A simple example: if $\Sigma_\mu$ is rank-deficient but $\Sigma_\nu$ is of full rank, it is impossible to find an affine map transferring the Gaussian measure $\mu$ to $\nu$. To tackle this issue, we first project the data with distribution $\nu$ onto the range space of $\Sigma_\mu$, where the Gaussian measure $\mu$ concentrates and $\Sigma_\mu$ is regular, and then formulate the OT problem, as described in the next theorem.

Theorem 3. Let $\mu$ and $\nu$ be two Gaussian measures defined on $\mathbb{R}^n$. Let $\bar{\mu}$ and $\bar{\nu}$ be the corresponding centered Gaussian measures derived from $\mu$ and $\nu$, respectively, by translation. Let $P_\mu$ be the projection matrix onto $\mathrm{Im}(\Sigma_\mu)$. Then the optimal transport map $T_G$ from $\bar{\mu}$ to $P_{\mu\#}\bar{\nu}$ is linear and self-adjoint, and can be written as

$$T_G(\tilde{x}) = (\Sigma_\mu^{\frac{1}{2}})^{\dagger} \Sigma_{\mu\nu} (\Sigma_\mu^{\frac{1}{2}})^{\dagger} \tilde{x}. \qquad (7)$$

Remark 2. "$\dagger$" denotes the Moore-Penrose inverse. $\mathrm{Im}(\Sigma)$ denotes the image of the linear transform $\Sigma$, i.e., $\mathrm{Im}(\Sigma) = \{\Sigma\tilde{x},\ \tilde{x} \in \mathbb{R}^n\}$.

Generally speaking, different from Theorem 2, the map $T_G$ in (7) in fact transfers $\bar{\mu}$ to $P_{\mu\#}\bar{\nu}$, the projected version of $\bar{\nu}$, instead of to $\bar{\nu}$ itself. In the special case where the Gaussian measures $\mu$ and $\nu$ satisfy $\mathrm{Im}(\Sigma_\nu) \subseteq \mathrm{Im}(\Sigma_\mu)$, $\bar{\nu}$ remains the same under the projection onto $\mathrm{Im}(\Sigma_\mu)$, i.e., $P_{\mu\#}\bar{\nu} = \bar{\nu}$, so $T_G$ in (7) is just the optimal transport map from $\bar{\mu}$ to $\bar{\nu}$. This result, as an extended version of Theorem 2, was also developed in [35].
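The maps (6) and (7) are likewise straightforward to realize numerically. Below is a hedged sketch that uses the Moore-Penrose inverse of $\Sigma_\mu^{1/2}$ so that the full-rank and rank-deficient cases share one implementation; gaussian_ot_map and the example covariances are hypothetical.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(S_mu, S_nu, m_mu=None, m_nu=None):
    """Linear part of the Gaussian optimal transport map.
    Full-rank case (Eq. (6)): A = S_mu^{-1/2} (S_mu^{1/2} S_nu S_mu^{1/2})^{1/2} S_mu^{-1/2};
    the rank-deficient case (Eq. (7)) replaces the inverse square root with the
    Moore-Penrose inverse of S_mu^{1/2}."""
    S_half = np.real(sqrtm(S_mu))
    S_cross = np.real(sqrtm(S_half @ S_nu @ S_half))   # Sigma_{mu,nu}
    S_half_pinv = np.linalg.pinv(S_half)               # equals S_mu^{-1/2} when full rank
    A = S_half_pinv @ S_cross @ S_half_pinv
    if m_mu is None:                                   # centered measures, Eq. (7)
        return lambda x: x @ A.T
    return lambda x: (x - m_mu) @ A.T + m_nu           # affine map, Eq. (6)

# Example: transporting samples of N(0, S_mu) toward N(0, S_nu)
rng = np.random.default_rng(0)
S_mu, S_nu = np.diag([1.0, 0.2]), np.diag([0.5, 2.0])
X = rng.multivariate_normal(np.zeros(2), S_mu, size=1000)
Xt = gaussian_ot_map(S_mu, S_nu)(X)
print(np.cov(Xt.T))   # empirical covariance should approach S_nu
```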

    3 KANTOROVICH’S OT IN RKHS

This section introduces Kantorovich's optimal transport problem in RKHS. In the first part, we provide an equivalent and computable formulation of this problem. In the second part, we discuss the OT optimization problem on empirical distributions.

    3.1 The Formulation of OT in RKHS

Let the input space $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable space with a Borel $\sigma$-algebra $\mathcal{B}_{\mathcal{X}}$, and let $\Pr(\mathcal{X})$ be the set of Borel probability measures on $\mathcal{X}$. Let $k$ be a positive definite kernel on $\mathcal{X} \times \mathcal{X}$, and let $(\mathcal{H}_K, \mathcal{B}_{\mathcal{H}_K})$ be the reproducing kernel Hilbert space generated by $k$. Let $\phi : \mathcal{X} \to \mathcal{H}_K$ be the corresponding feature map. For any $\mu \in \Pr(\mathcal{X})$, let $\phi_\#\mu$ be the pushforward measure of $\mu$.

Given $\mu, \nu \in \Pr(\mathcal{X})$, the Kantorovich optimal transport between the pushforward measures $\phi_\#\mu$ and $\phi_\#\nu$ on $\mathcal{H}_K$ is written as

$$d_W(\phi_\#\mu, \phi_\#\nu) = \Big[ \inf_{\pi_K \in \Pi(\phi_\#\mu, \phi_\#\nu)} \int_{\mathcal{H}_K \times \mathcal{H}_K} \|u - v\|_{\mathcal{H}_K}^2 \, d\pi_K(u, v) \Big]^{\frac{1}{2}}, \qquad (8)$$

where $\Pi(\phi_\#\mu, \phi_\#\nu)$ is the set of joint probability measures on $\mathcal{H}_K \times \mathcal{H}_K$ with marginals $\phi_\#\mu$ and $\phi_\#\nu$.

Eq. (8) is a natural analog of (3). However, (8) is formulated through an implicit nonlinear map, whose expression we usually cannot access, making it difficult to use directly. We next provide an equivalent and computable formulation, whose form is fully determined by the kernel function.

Theorem 4. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a Borel space, and let the reproducing kernel $k$ be measurable. Given $\mu, \nu \in \Pr(\mathcal{X})$, we write

$$d^{\mathcal{H}}_W(\mu, \nu) = \Big[ \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d^2(x, y) \, d\pi(x, y) \Big]^{\frac{1}{2}}, \qquad (9)$$

where $d^2(x, y) = \|\phi(x) - \phi(y)\|_{\mathcal{H}_K}^2 = k(x, x) + k(y, y) - 2k(x, y)$. Then,

1) $d^{\mathcal{H}}_W(\mu, \nu) = d_W(\phi_\#\mu, \phi_\#\nu)$, and
2) if $\pi^*$ is a minimizer of (9), then $(\phi, \phi)_\#\pi^*$ is a minimizer of (8), where $(\phi, \phi) : \mathcal{X} \times \mathcal{X} \to \mathcal{H}_K \times \mathcal{H}_K$ is defined as $(\phi, \phi)(x, y) = (\phi(x), \phi(y))$.

If the feature map $\phi$ is injective, the equivalence between (9) and (8) can be easily justified by applying the measure transform formula twice. In addition, with an injective $\phi$, $d(x, y)$ is a distance function on $\mathcal{X}$. Consequently, $d^{\mathcal{H}}_W(\mu, \nu)$ defines a metric on $\Pr(\mathcal{X})$. However, in many cases, feature maps are not injective. For example, consider the kernel $k(\tilde{x}, \tilde{y}) = \exp\!\big(-\frac{\|A\tilde{x} - A\tilde{y}\|_2^2}{2\sigma^2}\big)$ satisfying $\ker(A) \neq \{0\}$, which is pretty common in the setting of Mahalanobis metric learning. The corresponding feature map $\phi$ is non-injective, since for any $\tilde{x}$, $\tilde{y}$ satisfying $\tilde{x} - \tilde{y} \in \ker(A)$, we have $\|\phi(\tilde{x}) - \phi(\tilde{y})\|_{\mathcal{H}_K}^2 = 0$. In Theorem 4, we in fact present a more general conclusion, requiring only that the feature map be measurable. The central idea behind Theorem 4 is applying the "transformation-invariant property of minimal metrics" [33]. We provide the detailed proof in the supplementary material.

    3.2 Discrete Optimal Transport

In most applications, we have access only to the empirical measures or histograms, $\hat{\mu} = \sum_{i=1}^n \hat{\mu}_i \delta_{x_i}$ and $\hat{\nu} = \sum_{j=1}^m \hat{\nu}_j \delta_{y_j}$, where $\delta_{x_i}$ (or $\delta_{y_j}$) is the Dirac measure centered at $x_i$ (or $y_j$), and $\hat{\mu}_i$ (or $\hat{\nu}_j$) is the probability mass associated with $x_i$ (or $y_j$). The discrete version of (9) can be written as

$$\min_{P \in U_{nm}} \mathrm{tr}(P^T D), \qquad (10)$$

where $U_{nm}$ denotes the set of $n \times m$ nonnegative matrices representing the probabilistic couplings whose marginals are $\hat{\mu}$ and $\hat{\nu}$, i.e., $U_{nm} = \{P \in \mathbb{R}^{n \times m}_+ \mid P\mathbf{1}_m = \hat{\mu},\; P^T\mathbf{1}_n = \hat{\nu}\}$, and $D$ denotes the $n \times m$ cost matrix, with $D_{i,j} = k(x_i, x_i) + k(y_j, y_j) - 2k(x_i, y_j)$.
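Problem (10) is an ordinary discrete OT problem once the kernel-induced cost matrix $D$ is formed. The sketch below assumes the POT library (its exact solver ot.emd) is available; kernel_cost_matrix, kernel_wasserstein, and rbf are illustrative helper names, not the paper's implementation.

```python
import numpy as np
import ot  # Python Optimal Transport (POT), assumed available

def kernel_cost_matrix(Kxx, Kyy, Kxy):
    """Kernel-induced squared RKHS distances: D_ij = k(x_i,x_i) + k(y_j,y_j) - 2 k(x_i,y_j)."""
    return np.diag(Kxx)[:, None] + np.diag(Kyy)[None, :] - 2.0 * Kxy

def kernel_wasserstein(Kxx, Kyy, Kxy, mu_hat=None, nu_hat=None):
    """Empirical KW distance: solve the discrete problem (10) with the kernel cost."""
    n, m = Kxy.shape
    mu_hat = np.full(n, 1.0 / n) if mu_hat is None else mu_hat
    nu_hat = np.full(m, 1.0 / m) if nu_hat is None else nu_hat
    D = kernel_cost_matrix(Kxx, Kyy, Kxy)
    P = ot.emd(mu_hat, nu_hat, D)        # optimal coupling P in U_nm
    return np.sqrt(np.sum(P * D))

def rbf(X, Y, sigma=2.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, (100, 2)), rng.normal(1, 1, (100, 2))
print(kernel_wasserstein(rbf(X, X), rbf(Y, Y), rbf(X, Y)))
```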

    4 OT BETWEEN GAUSSIAN MEASURES ON RKHS

In this section, we provide the mathematical computations of OT under the condition that the pushforward measures on RKHS are Gaussian.

Let $\mu$ be a Borel probability measure on $\mathcal{X}$. We assume that the mean, $m_\mu = \mathbb{E}_{X \sim \mu}(\phi(X))$, and the covariance operator, $R_\mu = \mathbb{E}_{X \sim \mu}\big((\phi(X) - m_\mu) \otimes (\phi(X) - m_\mu)\big)$,² exist and are bounded with respect to the Hilbert norm and the Hilbert-Schmidt norm (see [53]), respectively. We note that $m_\mu$ is an element of $\mathcal{H}_K$, and $R_\mu$ is a self-adjoint, nonnegative operator on $\mathcal{H}_K$, belonging to the tensor product space $\mathcal{H}_K \otimes \mathcal{H}_K$. If the data distributions in RKHS (the corresponding pushforward measures) are Gaussian, the conclusions in RKHS are similar to the ones in Euclidean spaces.

Proposition 1. Assume that the hypotheses in Theorem 4 hold. Let $\mu, \nu \in \Pr(\mathcal{X})$. Let $m_\mu$ and $m_\nu$, and $R_\mu$ and $R_\nu$, be the corresponding means and covariance operators, respectively. Write

$$d^{\mathcal{H}}_{GW}(\mu, \nu) = \Big[ \|m_\mu - m_\nu\|_{\mathcal{H}_K}^2 + \mathrm{tr}\big(R_\mu + R_\nu - 2R_{\mu\nu}\big) \Big]^{\frac{1}{2}}, \qquad (11)$$

where $R_{\mu\nu} = (R_\mu^{\frac{1}{2}} R_\nu R_\mu^{\frac{1}{2}})^{\frac{1}{2}}$. Then,

1) $d^{\mathcal{H}}_{GW}(\mu, \nu) \le d^{\mathcal{H}}_W(\mu, \nu)$, and
2) equality holds if both $\phi_\#\mu$ and $\phi_\#\nu$ are Gaussian.

Remark 3.

1) The square root of a nonnegative, self-adjoint, and compact operator $G$ is defined as $G^{\frac{1}{2}} = \sum_{i \ge 1} \sqrt{\lambda_i(G)}\, \varphi_i(G) \otimes \varphi_i(G)$, where $\lambda_i(G)$ and $\varphi_i(G)$ are the eigenvalues and eigenfunctions of $G$.
2) The trace of a trace-class operator $G$ on a separable Hilbert space $\mathcal{H}$ is defined as $\mathrm{tr}(G) = \sum_{i=1}^{\dim(\mathcal{H})} \langle G e_i, e_i \rangle$, where $\{e_i\}_{i=1}^{\dim(\mathcal{H})}$ is an orthonormal system of $\mathcal{H}$.

It can be seen that KGW serves as a lower bound for KW, which reveals the connection between the general and Gaussian cases of the Wasserstein distance in RKHS. Analogous to Corollary 1, we generalize the Wasserstein geometry on PSD matrices to infinite-dimensional settings and obtain the kernel Bures distance, $d^{\mathcal{H}}_B$, between RKHS covariance operators.

Corollary 2. Let $\mathrm{Sym}^+(\mathcal{H}_K) \subset \mathcal{H}_K \otimes \mathcal{H}_K$ be the set of nonnegative, self-adjoint, and trace-class operators on $\mathcal{H}_K$. Then

$$d^{\mathcal{H}}_B(R_1, R_2) = \Big[ \mathrm{tr}\big(R_1 + R_2 - 2R_{12}\big) \Big]^{\frac{1}{2}}, \qquad (12)$$

where $R_{12} = (R_1^{\frac{1}{2}} R_2 R_1^{\frac{1}{2}})^{\frac{1}{2}}$, defines a metric on $\mathrm{Sym}^+(\mathcal{H}_K)$.

The kernel Gauss-Wasserstein distance, $d^{\mathcal{H}}_{GW}$, consists of two terms. The first term is just the squared maximum mean discrepancy (MMD) [54], i.e., $\mathrm{MMD}^2(\mu, \nu) = \|m_\mu - m_\nu\|_{\mathcal{H}_K}^2$, measuring the distance between the centers of the data in RKHS. The second term, $d^{\mathcal{H}}_B$, quantifies the difference between the dispersions of the data in RKHS.

If the kernel $k$ is characteristic [54], KGW actually induces a metric on $\Pr(\mathcal{X})$, which can be concluded from the perspective of kernel embeddings of distributions. Because $k$ is characteristic [54], the kernel mean embedding of any $\mu \in \Pr(\mathcal{X})$, i.e., $\mu \to m_\mu \in \mathcal{H}_K$, is injective, which leads to the injectiveness of the embedding $\mu \to (m_\mu, R_\mu) \in \mathcal{H}_K \times \mathrm{Sym}^+(\mathcal{H}_K)$. Since KGW is a metric on $\mathcal{H}_K \times \mathrm{Sym}^+(\mathcal{H}_K)$, KGW induces a metric on $\Pr(\mathcal{X})$. In the next part, we explore the informativeness of covariance operators, and discuss how $d^{\mathcal{H}}_B$ quantifies the discrepancy between distributions. To do this, we first introduce the 3-splitting property of measures.

Definition 1. Let $\mu \in \Pr(\mathcal{X})$. If there exist disjoint subsets $V_1$, $V_2$, and $V_3$, satisfying $\mathcal{X} = V_1 \cup V_2 \cup V_3$ and $\mu(V_1), \mu(V_2), \mu(V_3) > 0$, then we say $\mu$ satisfies the 3-splitting property.

Note that the 3-splitting property is rather mild, in the sense that it precludes only measures concentrating on one or two singletons, i.e., $\mu = \epsilon\delta_x + (1 - \epsilon)\delta_y$, $\epsilon \in [0, 1]$. Let $\Pr_s(\mathcal{X})$ be the set of Borel measures satisfying the 3-splitting property. The following theorem presents the injectiveness of the mapping from $\Pr_s(\mathcal{X})$ to $\mathrm{Sym}^+(\mathcal{H}_K)$. Consequently, combined with the fact that KB is a metric on $\mathrm{Sym}^+(\mathcal{H}_K)$, we conclude that KB induces a metric on $\Pr_s(\mathcal{X})$.

Theorem 5. Let the measurable space $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be locally compact and Hausdorff. Let $k$ be a $c_0$-universal reproducing kernel.³ Then the embedding $\mu \to R_\mu$, $\forall \mu \in \Pr_s(\mathcal{X})$, is injective.

To demonstrate why the 3-splitting property is required, we provide a counterexample. For $\epsilon \in [0, 1]$, let $\mu = \epsilon\delta_x + (1 - \epsilon)\delta_y$ and $\nu = (1 - \epsilon)\delta_x + \epsilon\delta_y$. Clearly, neither $\mu$ nor $\nu$ satisfies the 3-splitting property, and the corresponding RKHS covariance operators are the same, i.e., $R_\mu = R_\nu = \epsilon(1 - \epsilon)\big(\phi(x) - \phi(y)\big) \otimes \big(\phi(x) - \phi(y)\big)$. Thus, in this case, $d^{\mathcal{H}}_B$ cannot distinguish $\mu$ and $\nu$. We also note that if $\mathcal{X}$ is $\mathbb{R}^n$, many popular kernels, such as the Gaussian, Laplacian, and B1-spline kernels, are $c_0$-universal [55].

As for the optimal transport map, we consider the rank-deficient case, since the ranks of the estimated covariance operators are always finite. The conclusions are quite similar to those of Theorem 3. The only difference is that we are working on the pushforward measures on RKHS.

Proposition 2. Given $\mu, \nu \in \Pr(\mathcal{X})$, assume the pushforward measures $\phi_\#\mu$ and $\phi_\#\nu$ on RKHS are Gaussian. Let $\bar{\mu}_\phi$ and $\bar{\nu}_\phi$ be the respective centered measures of $\phi_\#\mu$ and $\phi_\#\nu$. Let $P_\mu$ be the projection operator onto $\mathrm{Im}(R_\mu)$. Then the kernel Gauss-optimal transport map $T^{\mathcal{H}}_G$ between $\bar{\mu}_\phi$ and $P_{\mu\#}(\bar{\nu}_\phi)$ is a linear and self-adjoint operator, and can be written as

$$T^{\mathcal{H}}_G(u) = (R_\mu^{\frac{1}{2}})^{\dagger} R_{\mu\nu} (R_\mu^{\frac{1}{2}})^{\dagger} u, \quad \forall u \in \mathcal{H}_K. \qquad (13)$$

For almost all kernel methods, the core task is transferring expressions involving implicit feature maps to kernel-based expressions. After doing this, one can carry out computations using the "kernel trick". In the next two subsections, we provide explicit expressions of the estimated KGW distance (11) and KGOT map (13) via kernel matrices, which are two of the main contributions of this paper.

    4.1 The Empirical Estimation of the KGW Distance

Let $X = [x_1, x_2, \ldots, x_n]$ and $Y = [y_1, y_2, \ldots, y_m]$ be two sample matrices from two probability measures $\mu$ and $\nu$,

2. The tensor product of Hilbert spaces $\mathcal{H} \otimes \mathcal{H}$ is isomorphic to the space of Hilbert-Schmidt operators, and is defined such that $(u \otimes v)w = \langle v, w\rangle_{\mathcal{H}}\, u$, $\forall u, v, w \in \mathcal{H}$.

3. We refer to [55] or the supplementary material, available online, for the definition of the $c_0$-universal kernel.


respectively. Let $\Phi_X = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$ and $\Phi_Y = [\phi(y_1), \phi(y_2), \ldots, \phi(y_m)]$ be the two corresponding mapped data matrices. Let $K_{XX}$, $K_{XY}$, and $K_{YY}$ be the kernel matrices defined by $(K_{XX})_{ij} = k(x_i, x_j)$, $(K_{XY})_{ij} = k(x_i, y_j)$, and $(K_{YY})_{ij} = k(y_i, y_j)$. Let $H_n = I_{n \times n} - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T$ and $H_m = I_{m \times m} - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T$ be two centering matrices. Then the empirical means, $\hat{m}_\mu$ and $\hat{m}_\nu$, are estimated as $\hat{m}_\mu = \frac{1}{n}\Phi_X\mathbf{1}_n$ and $\hat{m}_\nu = \frac{1}{m}\Phi_Y\mathbf{1}_m$. The empirical covariance operators, $\hat{R}_\mu$ and $\hat{R}_\nu$, are estimated as $\hat{R}_\mu = \frac{1}{n}\Phi_X H_n \Phi_X^T$ and $\hat{R}_\nu = \frac{1}{m}\Phi_Y H_m \Phi_Y^T$.

Proposition 3. The empirical kernel Gauss-Wasserstein distance is

$$\hat{d}^{\mathcal{H}}_{GW}(\mu, \nu) = \Big[ \tfrac{1}{n}\mathrm{tr}(K_{XX}) + \tfrac{1}{m}\mathrm{tr}(K_{YY}) - \tfrac{2}{mn}\mathbf{1}_n^T K_{XY} \mathbf{1}_m - \tfrac{2}{\sqrt{mn}}\|H_n K_{XY} H_m\|_* \Big]^{\frac{1}{2}}. \qquad (14)$$

The kernel Bures distance between $\hat{R}_\mu$ and $\hat{R}_\nu$ is

$$d^{\mathcal{H}}_B(\hat{R}_\mu, \hat{R}_\nu) = \Big[ \tfrac{1}{n}\mathrm{tr}(K_{XX}H_n) + \tfrac{1}{m}\mathrm{tr}(K_{YY}H_m) - \tfrac{2}{\sqrt{mn}}\|H_n K_{XY} H_m\|_* \Big]^{\frac{1}{2}}. \qquad (15)$$

Remark 4. $\|\cdot\|_*$ denotes the nuclear norm, i.e., $\|A\|_* = \sum_{i=1}^r \sigma_i(A)$, where $\sigma_i(A)$ are the singular values of the matrix $A$.

Computational Complexity. For convenience, we assume the sample sizes are the same, i.e., $m = n$. It takes $O(n^2)$ operations to compute the first three terms of $\hat{d}^{\mathcal{H}}_{GW}$. If we write $\mathrm{tr}(K_{XX}H_n) = \mathrm{tr}(K_{XX}) - \frac{1}{n}\mathbf{1}_n^T K_{XX} \mathbf{1}_n$ (similarly for $K_{YY}$), it takes $O(n^2)$ operations to compute the first two terms of $\hat{d}^{\mathcal{H}}_B$. Now we consider the last term, $\|H_n K_{XY} H_n\|_*$, in both $\hat{d}^{\mathcal{H}}_{GW}$ and $\hat{d}^{\mathcal{H}}_B$. To avoid large-scale matrix multiplications, we write $H_n K_{XY} H_n = K_{XY} + \frac{1}{n^2}(\mathbf{1}_n^T K_{YX} \mathbf{1}_n)\mathbf{1}_n\mathbf{1}_n^T - \frac{1}{n}\mathbf{1}_n(\mathbf{1}_n^T K_{XY}) - \frac{1}{n}(K_{XY}\mathbf{1}_n)\mathbf{1}_n^T$, whose complexity is $O(n^2)$. Moreover, the nuclear norm requires $O(n^3)$ operations.
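Expressions (14) and (15) involve only Gram matrices and one nuclear norm, so they can be implemented in a few lines. Below is a sketch assuming NumPy; empirical_kgw_kb and rbf are illustrative names, not the paper's code.

```python
import numpy as np

def empirical_kgw_kb(Kxx, Kyy, Kxy):
    """Empirical KGW distance, Eq. (14), and kernel Bures distance, Eq. (15),
    computed from the Gram matrices K_XX, K_YY, K_XY."""
    n, m = Kxy.shape
    Hn = np.eye(n) - np.ones((n, n)) / n            # centering matrices H_n, H_m
    Hm = np.eye(m) - np.ones((m, m)) / m
    nuc = np.linalg.norm(Hn @ Kxy @ Hm, ord='nuc')  # ||H_n K_XY H_m||_*
    kb2 = (np.trace(Kxx @ Hn) / n + np.trace(Kyy @ Hm) / m
           - 2.0 * nuc / np.sqrt(n * m))
    kgw2 = (np.trace(Kxx) / n + np.trace(Kyy) / m
            - 2.0 * Kxy.sum() / (n * m) - 2.0 * nuc / np.sqrt(n * m))
    return np.sqrt(max(kgw2, 0.0)), np.sqrt(max(kb2, 0.0))

def rbf(X, Y, sigma=2.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, (200, 3)), rng.normal(1, 1, (150, 3))
print(empirical_kgw_kb(rbf(X, X), rbf(Y, Y), rbf(X, Y)))
```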

    4.2 The Empirical Estimation of the KGOT Map

Proposition 4. Let $X$ and $Y$ be data matrices sampled from $\mu$ and $\nu$, respectively. Then the empirical projection operator onto $\mathrm{Im}(\hat{R}_\mu)$ is

$$\hat{P}_\mu = \Phi_X H_n C_{XX}^{\dagger} H_n \Phi_X^T, \qquad (16)$$

and the empirical Gauss-optimal transport map from $\bar{\mu}_\phi$ to $P_{\mu\#}(\bar{\nu}_\phi)$ is

$$\hat{T}^{\mathcal{H}}_G = \sqrt{\frac{n}{m}}\, \Phi_X H_n C_{XX}^{\dagger} C_{XYYX}^{\frac{1}{2}} C_{XX}^{\dagger} H_n \Phi_X^T, \qquad (17)$$

where

$$C_{XX} = H_n K_{XX} H_n, \qquad (18a)$$
$$C_{XYYX} = H_n K_{XY} H_m K_{YX} H_n. \qquad (18b)$$

Both (16) and (17) are computable expressions. That is, given any element $u \in \mathcal{H}_K$, we can directly apply our formulations to obtain the corresponding images $\hat{P}_\mu(u)$ and $\hat{T}^{\mathcal{H}}_G(u)$. Moreover, we emphasize that the estimated KGOT map plays the role of aligning RKHS covariance operators, as summarized in the next proposition.

Proposition 5.

$$\hat{T}^{\mathcal{H}}_G \hat{R}_\mu \hat{T}^{\mathcal{H}}_G = \hat{P}_\mu \hat{R}_\nu \hat{P}_\mu. \qquad (19)$$

Eq. (19) can be interpreted in the following way. First, data sampled from $\bar{\nu}_\phi$ are projected onto $\mathrm{Im}(\hat{R}_\mu)$, the image of $\hat{R}_\mu$. The resulting covariance operator is $\hat{P}_\mu \hat{R}_\nu \hat{P}_\mu$. Next, data sampled from $\bar{\mu}_\phi$, which already concentrate on $\mathrm{Im}(\hat{R}_\mu)$, are "transported" by the KGOT map $\hat{T}^{\mathcal{H}}_G$. The corresponding covariance operator becomes $\hat{T}^{\mathcal{H}}_G \hat{R}_\mu \hat{T}^{\mathcal{H}}_G$. By doing this, the covariance operators are aligned, which leads to similar distributions of the two transformed datasets in RKHS.

Regularization. Since the smallest eigenvalues of the kernel matrix $C_{XX}$ are usually close to zero, the Moore-Penrose inverse $C_{XX}^{\dagger}$ is ill-conditioned. There are several methods that can be used to deal with this issue. (1) One can invert only the top $d$ eigenvalues of $C_{XX}$ and set the other eigenvalues to zero. However, the drawback is that it is usually difficult to select such a cutoff. (2) One can use a regularized version of $C_{XX}$. That is, we can use $(C_{XX} + \epsilon I)^{-1}$ to replace $C_{XX}^{\dagger}$. This is an ad hoc strategy in practice because it is more efficient to select the regularizer $\epsilon$. However, this method destroys the low-rank structure of $C_{XX}^{\dagger}$. (3) One can also use $(C_{XX}^2 + \epsilon I)^{-1} C_{XX}$ [56] to approximate $C_{XX}^{\dagger}$, based on the fact that $\lim_{\epsilon \to 0}(C_{XX}^2 + \epsilon I)^{-1} C_{XX} = C_{XX}^{\dagger}$. In our experiments, we use this strategy, since it not only is efficient to implement, but also preserves the low-rank structure of $C_{XX}^{\dagger}$.
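A minimal sketch of strategy (3), assuming a symmetric PSD input matrix; reg_pinv is a hypothetical helper name, and the tolerance values are arbitrary.

```python
import numpy as np

def reg_pinv(C, eps=1e-6):
    """Approximate the Moore-Penrose inverse C^dagger by (C^2 + eps*I)^{-1} C,
    i.e., regularization strategy (3) above, for a symmetric PSD matrix C."""
    return np.linalg.solve(C @ C + eps * np.eye(C.shape[0]), C)

# Quick check on a rank-deficient PSD matrix: the approximation approaches
# numpy's pinv as eps shrinks, while remaining well conditioned to compute.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
C = A @ A.T                              # rank-10 PSD matrix of size 50 x 50
for eps in (1e-2, 1e-4, 1e-6):
    print(eps, np.linalg.norm(reg_pinv(C, eps) - np.linalg.pinv(C)))
```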

    5 APPLICATIONS

In this section, we apply the developed formulas (14) and (15), and (17), to image classification and domain adaptation, respectively.

    5.1 Image Classification

    5.1.1 Proposed Approach

Each image is represented by a collection of (pixel-wise) feature samples, which can be low-level features or learned features extracted from deep neural networks. We apply the KGW (or the KB) distance to solve the core problem of measuring the difference between image representations. In other words, the distance between two images is the kernel Gauss-Wasserstein (or the kernel Bures) distance between the two corresponding feature collections. After obtaining the distances between all pairs of images, we employ the kernel SVM as the final classifier, as sketched below. Our approach is schematized in Fig. 2. Note that the above procedures make up a two-layer kernel machine. The first-layer kernel, $K_1$, is used to compute the KGW (or the KB) distance, while the second-layer kernel, $K_2$, is for the kernel SVM.
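The following sketch illustrates this two-layer construction on synthetic data, assuming scikit-learn's precomputed-kernel SVM; the helper names (kgw, second_layer_kernel) and all hyperparameter values are placeholders rather than the settings used in Section 6.

```python
import numpy as np
from sklearn.svm import SVC

def rbf(X, Y, sigma1=2.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / sigma1**2)                       # first-layer kernel k1

def kgw(A, B, sigma1=2.0):
    """Empirical KGW distance, Eq. (14), between two feature collections A and B."""
    Kxx, Kyy, Kxy = rbf(A, A, sigma1), rbf(B, B, sigma1), rbf(A, B, sigma1)
    n, m = Kxy.shape
    Hn = np.eye(n) - np.ones((n, n)) / n
    Hm = np.eye(m) - np.ones((m, m)) / m
    nuc = np.linalg.norm(Hn @ Kxy @ Hm, ord='nuc')
    d2 = Kxx.trace() / n + Kyy.trace() / m - 2 * Kxy.sum() / (n * m) - 2 * nuc / np.sqrt(n * m)
    return np.sqrt(max(d2, 0.0))

def second_layer_kernel(images, sigma1, sigma2, gamma=1e-4):
    """K2(I_i, I_j) = exp(-kgw^2 / sigma2^2), plus a small diagonal regularizer
    since K2 is not necessarily positive definite."""
    N = len(images)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            D[i, j] = D[j, i] = kgw(images[i], images[j], sigma1)
    return np.exp(-D**2 / sigma2**2) + gamma * np.eye(N)

# Tiny synthetic example: 20 "images", each a 100 x 5 collection of feature samples
rng = np.random.default_rng(0)
imgs = [rng.normal(c, 1.0, (100, 5)) for c in np.repeat([0.0, 1.5], 10)]
labels = np.repeat([0, 1], 10)
K2 = second_layer_kernel(imgs, sigma1=2.0, sigma2=1.0)
clf = SVC(C=10, kernel='precomputed').fit(K2, labels)
print(clf.score(K2, labels))
```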

    5.2 Domain Adaptation

    5.2.1 Problem Formulation

A domain adaptation task involves two data domains: the source domain and the target domain. The source domain is composed of labeled data $\{X^s, l^s\} = \{(x^s_i, l^s_i)\}_{i=1}^{N_s}$, which can be used to train a reliable classifier. The target domain refers to the unlabeled data $Y^t = \{y^t_j\}_{j=1}^{N_t}$, whose statistical and geometrical characteristics are different. Domain adaptation aims to adapt the classifier trained on the source domain to the target domain.

    5.2.2 Proposed Approach

Central Idea. We employ the OT map to transport the RKHS data in the source domain to the target domain, and then we train a classifier based on the transported data. We adopt the Gaussianity assumption in RKHS. Hence, the problem of matching distributions in RKHS can be solved by aligning the corresponding covariance operators. Because of the rank-deficiency issue, we first project the target data onto $\mathrm{Im}(\hat{R}_s)$, and then apply the estimated KGOT map (17). The procedure is schematized in Fig. 3.

Preprocessing with PCA. The source and target data are acquired in different scenarios, which probably results in geometrical distortions, especially for visual datasets [44]. To alleviate this issue, we first apply Principal Component Analysis (PCA) to the raw data to construct consistent feature representations. That is, we concatenate the source and target samples to form a large data matrix, from which we obtain the joint principal components. We use the scores of the data points on these principal components as the new representations. Note that many state-of-the-art algorithms for visual datasets, like domain invariant projection [44] and subspace alignment [57], adopt PCA as a preprocessing procedure. And some algorithms, like transfer subspace learning [58] and joint distribution alignment [43], are formulated by solving optimization problems motivated by PCA and its variants. We emphasize that with the PCA preprocessing procedure, the subspace mismatch issue might be reduced, since the joint principal subspaces involve the geometrical information of both the source and target samples. However, the statistical distribution difference may still be large. We solve the distribution mismatch problem in RKHS with the KGOT map.

Technical Details.⁴ Let $\Phi^s_X$ and $\Phi^t_Y$ denote the source and target samples in RKHS, respectively. Then $\Phi^s_X H_{N_s}$ and $\Phi^t_Y H_{N_t}$ are the corresponding centered samples. After being projected onto $\mathrm{Im}(\hat{R}_s)$, the projection of the target data is

$$\hat{P}_s(\Phi^t_Y H_{N_t}) = \Phi^s_X H_{N_s} C_{XX}^{\dagger} C_{XY}. \qquad (20)$$

After being transported to the target domain, the source data becomes

$$\hat{T}^{\mathcal{H}}_G(\Phi^s_X H_{N_s}) = \sqrt{\frac{N_s}{N_t}}\, \Phi^s_X H_{N_s} C_{XX}^{\dagger} C_{XYYX}^{\frac{1}{2}} C_{XX}^{\dagger} C_{XX}. \qquad (21)$$

Then, the inner product matrix between the projected target samples and the transported source samples is

$$\mathrm{Inn}_{ts} = \big(\hat{P}_s(\Phi^t_Y H_{N_t})\big)^T \big(\hat{T}^{\mathcal{H}}_G(\Phi^s_X H_{N_s})\big) = \sqrt{\frac{N_s}{N_t}}\, C_{YX} C_{XX}^{\dagger} C_{XYYX}^{\frac{1}{2}} C_{XX}^{\dagger} C_{XX}. \qquad (22)$$

Similarly,

$$\mathrm{Inn}_{ss} = \frac{N_s}{N_t}\, C_{XYYX}^{\frac{1}{2}} C_{XX}^{\dagger} C_{XYYX}^{\frac{1}{2}}, \qquad (23a)$$
$$\mathrm{Inn}_{tt} = C_{YX} C_{XX}^{\dagger} C_{XY}. \qquad (23b)$$

So, after distribution matching, we obtain a domain-invariant kernel matrix, $K_{\mathrm{New}}$, and a distance matrix, $\mathrm{Dist}_{ts}$, i.e.,

$$K_{\mathrm{New}} = \begin{bmatrix} \mathrm{Inn}_{ss} & (\mathrm{Inn}_{ts})^T \\ \mathrm{Inn}_{ts} & \mathrm{Inn}_{tt} \end{bmatrix}, \qquad (24)$$

$$\mathrm{Dist}_{ts} = \mathbf{1}_{N_t}\big(\mathrm{diag}(\mathrm{Inn}_{ss})\big)^T + \big(\mathrm{diag}(\mathrm{Inn}_{tt})\big)\mathbf{1}_{N_s}^T - 2\,\mathrm{Inn}_{ts}. \qquad (25)$$

Domain-Invariant Kernel Machines. After nonlinear correlation alignment, the new kernel matrix (24) can be used in any kernel-based learning algorithm. For example, in kernel ridge regression, the predicted labels for the target dataset $Y^t$ are

$$\tilde{l}_Y = (\mathrm{Inn}_{ts})(\mathrm{Inn}_{ss} + \gamma I_{N_s})^{-1}\tilde{l}_X. \qquad (26)$$

Fig. 2. We first represent each image $I_i$ by a collection of feature samples $A_i$. Next, we compute the KGW (or the KB) distances between all pairs of images. Finally, we apply a kernel SVM to conduct classification.

Fig. 3. (a) The labeled dataset, $X^s$, in the source domain and the unlabeled dataset, $Y^t$, in the target domain. Dots and stars represent different classes; (b) Map $X^s$ and $Y^t$ to the RKHS $\mathcal{H}_K$, and center the mapped data. (The centered source dataset $\Phi^s_X H_{N_s}$ lies in $\mathrm{Im}(\hat{R}_s)$); (c) Project the target dataset $\Phi^t_Y H_{N_t}$ onto $\mathrm{Im}(\hat{R}_s)$. The projection is $\hat{P}_s(\Phi^t_Y H_{N_t})$; (d) Apply the KGOT map to transport the source data to the target domain. The transported data is $\hat{T}^{\mathcal{H}}_G(\Phi^s_X H_{N_s})$. Now $\hat{T}^{\mathcal{H}}_G(\Phi^s_X H_{N_s})$ and $\hat{P}_s(\Phi^t_Y H_{N_t})$ are similarly distributed on $\mathrm{Im}(\hat{R}_s) \subset \mathcal{H}_K$. Finally, train a classifier using $\hat{T}^{\mathcal{H}}_G(\Phi^s_X H_{N_s})$, and apply the resultant classifier to $\hat{P}_s(\Phi^t_Y H_{N_t})$.

4. We provide the detailed derivations of the mathematical results (20), (21), (22), and (23) in the Supplementary Material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2019.2903050.


For the kernel support vector machine, after training a classifier on the source partition $(\mathrm{Inn}_{ss}, \tilde{l}_X)$, we can predict the labels of the target by

$$\tilde{l}_Y = (\mathrm{Inn}_{ts})(\tilde{\alpha} \circ \tilde{l}_X) + \tilde{b}, \qquad (27)$$

where $\tilde{\alpha}$ is the vector of Lagrange multipliers, $\circ$ is the Hadamard product, and $\tilde{b}$ is the bias.

With (24) or (25), the K-nearest neighbors classifier in RKHS can also be constructed. There are two ways to quantify the affinity between points in RKHS: the inner product and the distance. That is, given any target data point $y^t_j$, we can identify its nearest neighbors by finding the maximal values in the $j$th row of $\mathrm{Inn}_{ts}$ or the minimal values in the $j$th row of $\mathrm{Dist}_{ts}$. In practice, for different datasets and kernel functions, the best choice of affinity characterization also differs. We view the choice as a "hyperparameter" and use a cross-validation strategy to choose $\mathrm{Inn}_{ts}$ or $\mathrm{Dist}_{ts}$.
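Since the whole pipeline (20)-(26) reduces to Gram-matrix algebra, it can be sketched compactly. The code below approximates $C_{XX}^{\dagger}$ with the regularized inverse of Section 4.2 and finishes with the kernel ridge regression prediction (26); the function kgot_align and the toy source/target data are hypothetical, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def rbf(X, Y, sigma=2.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def kgot_align(Xs, Yt, sigma=2.0, eps=1e-6):
    """Inner-product matrices (22), (23a), (23b) after transporting the source data
    with the empirical KGOT map; C_XX^dagger is approximated by (C^2 + eps I)^{-1} C."""
    Ns, Nt = len(Xs), len(Yt)
    Hs = np.eye(Ns) - np.ones((Ns, Ns)) / Ns
    Ht = np.eye(Nt) - np.ones((Nt, Nt)) / Nt
    Cxx = Hs @ rbf(Xs, Xs, sigma) @ Hs                  # Eq. (18a)
    Cxy = Hs @ rbf(Xs, Yt, sigma) @ Ht
    Cxyyx = Cxy @ Cxy.T                                 # Eq. (18b)
    Cxx_pinv = np.linalg.solve(Cxx @ Cxx + eps * np.eye(Ns), Cxx)
    S = np.real(sqrtm(Cxyyx))                           # C_XYYX^{1/2}
    Inn_ts = np.sqrt(Ns / Nt) * Cxy.T @ Cxx_pinv @ S @ Cxx_pinv @ Cxx   # Eq. (22)
    Inn_ss = (Ns / Nt) * S @ Cxx_pinv @ S                               # Eq. (23a)
    Inn_tt = Cxy.T @ Cxx_pinv @ Cxy                                     # Eq. (23b)
    return Inn_ss, Inn_ts, Inn_tt

# Toy data: labeled source, unlabeled (shifted) target; predict with Eq. (26)
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (80, 3))
ls = rng.integers(0, 2, 80).astype(float)
Yt = rng.normal(0.5, 1.0, (60, 3))
Inn_ss, Inn_ts, Inn_tt = kgot_align(Xs, Yt)
gamma = 1e-2
l_pred = Inn_ts @ np.linalg.solve(Inn_ss + gamma * np.eye(len(Xs)), ls)  # Eq. (26)
print(l_pred[:5])
```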

    6 EXPERIMENTS

This section is divided into three parts. The experiments in the first part numerically demonstrate and validate the mathematical results developed in the paper. In the second part, we study the behavior of the kernel Gauss-Wasserstein distance and the kernel Bures distance in virus, texture, material, and scene classification. In the third part, we evaluate our approach for domain adaptation on three benchmark datasets in the context of object recognition and document classification.

    6.1 Toy Examples

This section includes two experiments with simulated data. In the first experiment, we numerically demonstrate our claim that the KGW distance is a lower bound of the KW distance (see Proposition 1). In the second experiment, we demonstrate that the KGOT map can match data distributions in RKHS. In both experiments, we use the RBF kernel $k(\tilde{x}, \tilde{y}) = \exp\!\big(-\frac{\|\tilde{x} - \tilde{y}\|_2^2}{2\sigma^2}\big)$ and choose $\sigma = 2$.

    6.1.1 Synthetic Data I

We consider two classes of Gaussian distributions $\mu(m) = \mathcal{N}(m\mathbf{1}, I)$ and $\nu(m) = \mathcal{N}(-m\mathbf{1}, I)$ on $\mathbb{R}^2$, parameterized by a real number $m$ taking values in $\{0.1, 0.2, \ldots, 3\}$. For each $m$, we draw 100 independent samples from $\mu(m)$, denoted as $X(m) = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{100}](m)$, and 100 independent samples from $\nu(m)$, denoted as $Y(m) = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{100}](m)$. We use expression (14) to compute the empirical KGW distance, and (10) to compute the empirical KW distance. For each $m$, the results are averaged over 50 repetitions. The results are shown in Fig. 4. Clearly, KGW is less than KW.
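A compact sketch reproducing this comparison (without the 50-repetition averaging), assuming the POT library for the discrete problem (10); the seed and helper names are arbitrary.

```python
import numpy as np
import ot  # POT, assumed available; ot.emd solves the discrete problem (10) exactly

def rbf(X, Y, sigma=2.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
for m in (0.5, 1.5, 3.0):                          # a few values of the separation parameter m
    X = rng.normal(m, 1.0, (100, 2))               # 100 samples of N(m*1, I)
    Y = rng.normal(-m, 1.0, (100, 2))              # 100 samples of N(-m*1, I)
    Kxx, Kyy, Kxy = rbf(X, X), rbf(Y, Y), rbf(X, Y)
    n = 100
    H = np.eye(n) - np.ones((n, n)) / n
    nuc = np.linalg.norm(H @ Kxy @ H, ord='nuc')
    kgw2 = (Kxx.trace() + Kyy.trace()) / n - 2 * Kxy.sum() / n**2 - 2 * nuc / n   # Eq. (14)
    D = np.diag(Kxx)[:, None] + np.diag(Kyy)[None, :] - 2 * Kxy                   # kernel cost
    P = ot.emd(np.full(n, 1 / n), np.full(n, 1 / n), D)                           # Eq. (10)
    kw2 = np.sum(P * D)
    print(f"m={m}: KGW={np.sqrt(max(kgw2, 0)):.3f} <= KW={np.sqrt(kw2):.3f}")
```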

    6.1.2 Synthetic Data II

In this experiment, we construct the source data matrix $X^s = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{500}] \in \mathbb{R}^{3 \times 500}$ by independently drawing 1500 samples from the exponential distribution $p(x) = \exp(-x)$ and arranging them in a $3 \times 500$ matrix. We construct the target data matrix $Y^t = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{500}] \in \mathbb{R}^{3 \times 500}$ by independently drawing samples from the uniform distribution on $[-2, -1]^3$. These two datasets are visualized in Fig. 5a. Mapping these samples to the RKHS, we investigate the performance of the KGOT map. The centered source and target sample sets in RKHS are $\Phi^s_X H_{500}$ and $\Phi^t_Y H_{500}$, respectively. We aim to numerically demonstrate that $\hat{T}^{\mathcal{H}}_G(\Phi^s_X H_{500})$ (see (21)) and $\hat{P}_s(\Phi^t_Y H_{500})$ (see (20)) are similarly distributed in RKHS. For the sake of visualization, we choose a coordinate system $(l_i)_{i=1}^3$, in which $l_i$ is taken to be the evaluation functional at the point $x_i$, i.e., $l_i(f) = f(x_i) = \langle \phi(x_i), f\rangle_{\mathcal{H}_K}$, $\forall f \in \mathcal{H}_K$. The coordinates of $\hat{T}^{\mathcal{H}}_G(\Phi^s_X H_{500})$ are

Fig. 4. The estimated KGW and KW distances between the Gaussian distributions $\mathcal{N}(m\mathbf{1}, I)$ and $\mathcal{N}(-m\mathbf{1}, I)$.

Fig. 5. (a) The source dataset $X^s$ and the target dataset $Y^t$; (b) The representations of the datasets $\hat{T}^{\mathcal{H}}_G(\Phi^s_X H_{N_s})$ and $\hat{P}_s(\Phi^t_Y H_{N_t})$ under the coordinate system $(l_i)_{i=1}^3$.


$$\tilde{X}_s = \begin{bmatrix} \phi^T(x_1) \\ \phi^T(x_2) \\ \phi^T(x_3) \end{bmatrix} \Phi_{sX} H_{500}\, C_{XX}^{\dagger} C_{XYYX}^{1/2} C_{XX}^{\dagger} C_{XX} = K_{XX}^{13} H_{500}\, C_{XX}^{\dagger} C_{XYYX}^{1/2} C_{XX}^{\dagger} C_{XX} \in \mathbb{R}^{3 \times 500}, \qquad (28)$$

where $K_{XX}^{13}$ denotes the first three rows of $K_{XX}$. The coordinates of $\hat{P}_s(\Phi_{tY} H_{500})$ are

$$\tilde{Y}_t = \begin{bmatrix} \phi^T(x_1) \\ \phi^T(x_2) \\ \phi^T(x_3) \end{bmatrix} \Phi_{sX} H_{500}\, (C_{XX})^{\dagger} C_{XY} = K_{XX}^{13} H_{500}\, C_{XX}^{\dagger} C_{XY} \in \mathbb{R}^{3 \times 500}. \qquad (29)$$

We visualize the new data points $\tilde{X}_s$ and $\tilde{Y}_t$ in Fig. 5b. It can be seen that the distributions of $\tilde{X}_s$ and $\tilde{Y}_t$ are quite close to each other.
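The data-generation step of this experiment can be sketched as follows (Python/NumPy). Applying the KGOT map via Eqs. (28)-(29) additionally requires the kernel covariance-type matrices $C_{XX}$, $C_{XY}$ defined earlier in the paper, which are not reproduced here; the centering matrix $H_N = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$ is our assumption for the meaning of $H_{500}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Source: 1500 draws from the exponential distribution p(x) = exp(-x),
# arranged into a 3 x 500 matrix (columns are the samples x_1, ..., x_500).
X_s = rng.exponential(scale=1.0, size=1500).reshape(3, 500)

# Target: samples drawn uniformly from [-2, -1]^3, arranged as a 3 x 500 matrix.
Y_t = rng.uniform(low=-2.0, high=-1.0, size=(3, 500))

# Centering matrix H_N = I - (1/N) * 1 1^T, used to center the feature maps in RKHS
# (assumed definition of H_500 in the text above).
N = 500
H_N = np.eye(N) - np.ones((N, N)) / N
```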

    6.2 Image Classification

In this section, we evaluate the performance of the KGW distance and the KB distance on the multiple-categories image classification task. As described in Section 5.1, our method involves two kernels, for both of which we employ the RBF kernel, i.e., $k_1(\vec{x}, \vec{y}) = \exp\!\left(-\frac{\|\vec{x}-\vec{y}\|_2^2}{\sigma_1^2}\right)$ and $k_2(I_1, I_2) = \exp\!\left(-\frac{(\hat{d}_{HGW})^2}{\sigma_2^2}\right)$ (or $k_2(I_1, I_2) = \exp\!\left(-\frac{(\hat{d}_{HB})^2}{\sigma_2^2}\right)$). Note that $k_2$ is not necessarily positive definite. We regularize the corresponding kernel matrices by adding a small diagonal term, $\gamma I$, as in [59]. For the hyperparameters, we set $\sigma_1^2$ to be the median of the squared Euclidean distances between all the samples, and we set $\gamma = 10^{-4}$. In each experiment, we choose a small subset of the training dataset to tune $\sigma_2^2$, which takes values in $\{0.1, 0.2, 0.6, 1, 2, 4\} \cdot M$, where $M$ is the median of all the squared KGW or KB distances. The tradeoff parameter $C$ of SVM is taken in $\{0.1, 1, 10, 100, 1000\}$.
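A minimal sketch of the median heuristic for $\sigma_1^2$ and of the regularized level-2 Gram matrix is given below (Python/NumPy and SciPy; function names are ours, not the paper's implementation).

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic(X):
    """sigma_1^2 = median of the squared Euclidean distances between all samples in X."""
    return np.median(pdist(X, metric="sqeuclidean"))

def level2_kernel(dist, sigma2_sq, gamma=1e-4):
    """Level-2 Gram matrix k2(I_i, I_j) = exp(-d_ij^2 / sigma2^2) built from a matrix of
    pairwise KGW or KB distances d_ij, plus a small ridge gamma*I because k2 need not be
    positive definite."""
    K2 = np.exp(-dist**2 / sigma2_sq)
    return K2 + gamma * np.eye(K2.shape[0])
```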

    6.2.1 Data Preparation

We use four benchmark image datasets: the Kylberg virus dataset [60], the Kylberg texture dataset [61], the UIUC dataset [62], and the TinyGraz03 dataset [63]. We consider both the low-level features and the deep features.

Low-level features. The Kylberg virus dataset contains 15 classes of virus. Each class has 100 grayscale images of $41 \times 41$ pixels. We follow the experimental protocol in [39]. At each pixel $(u, v)$, we extract a 25-dimensional feature vector, i.e.,

$$\vec{x}_{u,v} = \left[ I_{u,v},\ \frac{\partial I}{\partial u},\ \frac{\partial I}{\partial v},\ \frac{\partial^2 I}{\partial u^2},\ \frac{\partial^2 I}{\partial v^2},\ \left|G^{0,0}_{u,v}\right|,\ \ldots,\ \left|G^{3,4}_{u,v}\right| \right]^T,$$

where $I_{u,v}$ is the intensity at $(u, v)$, $\frac{\partial I}{\partial u}$ ($\frac{\partial I}{\partial v}$) is the derivative of $I$ in the horizontal (vertical) direction, $\frac{\partial^2 I}{\partial u^2}$ ($\frac{\partial^2 I}{\partial v^2}$) is the second-order derivative in the horizontal (vertical) direction, $G^{O,S}_{u,v}$ is the response of the Gabor wavelet [64] with orientation $O$ taking values in $\{0, 1, 2, 3\}$ and scale $S$ taking values in $\{0, 1, 2, 3, 4\}$, and $|\cdot|$ denotes the magnitude. To reduce the computational burden, for each image we use 1000 samples out of the total $41 \times 41 = 1681$ observations as the representation. In each class of virus, we randomly select 90 images as the training set and use the remaining ones as the testing set.

The Kylberg texture dataset contains 28 categories of textures. Each category has 160 grayscale images taken with and without rotation. Following the protocol in [39], we resize each image to $128 \times 128$ pixels, and compute 1024 observations on a coarse grid (i.e., every 4 pixels in the horizontal and vertical directions). At each pixel $(u, v)$, we extract a 5-dimensional feature vector, i.e.,

$$\vec{x}_{u,v} = \left[ I_{u,v},\ \frac{\partial I}{\partial u},\ \frac{\partial I}{\partial v},\ \frac{\partial^2 I}{\partial u^2},\ \frac{\partial^2 I}{\partial v^2} \right]^T.$$

We randomly select five images in each category as the training set and use the remaining ones as the testing set.
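A minimal sketch of extracting the 5-dimensional per-pixel descriptors defined above is given below (Python/NumPy). The axis convention (which image axis corresponds to the horizontal direction $u$) and the use of finite differences via `np.gradient` are our assumptions, not the paper's exact pipeline.

```python
import numpy as np

def texture_features(image, step=4):
    """Per-pixel 5-dim descriptors [I, dI/du, dI/dv, d2I/du2, d2I/dv2],
    sampled on a coarse grid (every `step` pixels) of a 2-D grayscale image."""
    I = image.astype(float)
    dI_du, dI_dv = np.gradient(I)            # first-order derivatives along the two axes
    d2I_du2 = np.gradient(dI_du, axis=0)     # second-order derivative, first axis
    d2I_dv2 = np.gradient(dI_dv, axis=1)     # second-order derivative, second axis
    stack = np.stack([I, dI_du, dI_dv, d2I_du2, d2I_dv2], axis=-1)
    return stack[::step, ::step].reshape(-1, 5)   # observations on the coarse grid
```

For a 128 x 128 image with `step=4`, this yields the 32 x 32 = 1024 observations mentioned above.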

The UIUC dataset contains 18 categories of materials, each of which has 12 images. Following [65], at each pixel, we extract a 19-dimensional feature vector, i.e.,

$$\vec{x}_{u,v} = \left[ I^R_{u,v},\ I^G_{u,v},\ I^B_{u,v},\ \frac{\partial I}{\partial u},\ \frac{\partial I}{\partial v},\ \frac{\partial^2 I}{\partial u^2},\ \frac{\partial^2 I}{\partial v^2},\ \left|G^{0,0}_{u,v}\right|,\ \ldots,\ \left|G^{2,3}_{u,v}\right| \right]^T.$$

We randomly select 1000 feature vectors of each image as its representation. As in [65], we randomly select half of the images in each class as the training data, and use the rest as the testing data.

For all the above three datasets, we repeat the corresponding random training/testing split procedure 10 times and report the average accuracy and the standard deviation.

The TinyGraz03 dataset contains 20 classes of outdoor scenes, each of which has at least 40 images of size $32 \times 32$. Following [65], at each pixel, we extract a 7-dimensional feature vector, i.e.,

$$\vec{x}_{u,v} = \left[ I^R_{u,v},\ I^G_{u,v},\ I^B_{u,v},\ \frac{\partial I}{\partial u},\ \frac{\partial I}{\partial v},\ \frac{\partial^2 I}{\partial u^2},\ \frac{\partial^2 I}{\partial v^2} \right]^T.$$

We use the training/testing split recommended in [63].

Deep features. For the Kylberg virus, UIUC, and TinyGraz03 datasets, we also conduct experiments using the hypercolumn descriptor [66] extracted from a deep convolutional neural network. To obtain the hypercolumn descriptors, we normalize and resize each image to a fixed size of $224 \times 224 \times 3$ (in the format of $W \times H \times C$) and feed it into a pre-trained AlexNet [67], [68]. We then extract the feature maps from the maxpool2 layer and the conv4 layer. The sizes of these features are $13 \times 13 \times 192$ and $13 \times 13 \times 256$, respectively. We concatenate the feature maps extracted from the maxpool2 and conv4 layers. As a result, each image is represented by $13 \times 13 = 169$ feature vectors of dimension $192 + 256 = 448$.
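A sketch of this hypercolumn extraction using torchvision's pretrained AlexNet is given below. The layer indices (5 for the second max-pooling layer, 8 for the fourth convolution) follow torchvision's AlexNet definition, and the ImageNet normalization constants are our assumption; the authors' exact pretrained weights and preprocessing may differ.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

alexnet = models.alexnet(weights="DEFAULT").eval()
features = alexnet.features          # Sequential of conv / ReLU / max-pool layers

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def hypercolumn(img):
    """Return 169 hypercolumn vectors of dimension 192 + 256 = 448 per image,
    by concatenating the maxpool2 (192 x 13 x 13) and conv4 (256 x 13 x 13) maps."""
    x = preprocess(img).unsqueeze(0)          # 1 x 3 x 224 x 224
    maxpool2 = conv4 = None
    for i, layer in enumerate(features):
        x = layer(x)
        if i == 5:                            # output of the second max-pooling layer
            maxpool2 = x                      # 1 x 192 x 13 x 13
        if i == 8:                            # output of the fourth convolutional layer
            conv4 = x                         # 1 x 256 x 13 x 13
            break
    stacked = torch.cat([maxpool2, conv4], dim=1)        # 1 x 448 x 13 x 13
    return stacked.squeeze(0).reshape(448, -1).T         # 169 x 448
```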

    6.2.2 Experimental Results

We compare our approaches with the following state-of-the-art approaches: (1) MMD-based methods [69], denoted as MMD1 and MMD2, where the level-1 kernels (i.e., the embedding kernels) of both MMD1 and MMD2 are the RBF kernel, and the level-2 kernels of MMD1 and MMD2 are linear and RBF, respectively; (2) RKHS Bregman divergence [39], denoted as SH; (3) covariance discriminative learning [37], denoted as CDL. For all the approaches, we use SVM as the final classifier.

We report the classification results in Table 2. For most tasks, our approaches KGW-SVM and KB-SVM outperform the baseline methods. We see that covariance operator based approaches, like the kernel Bures distance and the kernel Bregman divergence, have superior performance to MMD-based methods. One reason is that covariance operator based methods exploit the intrinsic Riemannian structure of positive operators, which is usually favorable for computer vision. Note that by integrating with the deep hypercolumn descriptor, our KB-SVM approach achieves a very high classification accuracy of 72 percent on the challenging TinyGraz03 dataset, whose correct recognition rate by humans is 30 percent.

    6.3 Domain Adaptation

In this section, we conduct experiments for visual object recognition and document classification to evaluate our approach.

    6.3.1 Datasets

Three benchmark datasets are considered: COIL20, Office-Caltech, and Reuters-21578. In total, we have 32 adaptation tasks.

The COIL20 dataset contains a total of 1,440 grayscale images of 20 classes of objects. The images of each object were taken at a pose interval of 5 degrees. Consequently, each object has 72 images. Each image in COIL20 is $32 \times 32$ pixels with 256 gray levels. We adopt the public dataset released by Long [43]. The total dataset is partitioned into two subsets, COIL1 and COIL2. COIL1 consists of all the images taken in the directions of [0°, 85°] or [180°, 265°]. COIL2 consists of all the images taken in the directions of [90°, 175°] or [270°, 355°]. There are two domain adaptation tasks, i.e., C1 → C2 and C2 → C1.

The Office-Caltech dataset is an increasingly popular benchmark dataset for visual domain adaptation. It contains the images of ten classes of objects taken from four domains: 958 images downloaded from Amazon, 1,123 images gathered from the web image search (Caltech-256), 157 images taken with a DSLR camera, and 295 images from webcams. In total, they form 12 domain adaptation tasks, e.g., A → C, A → D, ..., W → D. We consider two types of features: the SURF features and the DeCAF6 deep learning features. The SURF features represent each image with an 800-bin normalized histogram whose codebook is trained from a subset of Amazon images. We use the public dataset released by Gong [45]. The DeCAF6 features [70], extracted from the 6th layer of a convolutional neural network, represent each image with a 4,096-dimensional vector.

The Reuters-21578 dataset has three top categories, i.e., orgs, places, and people, each of which has many subcategories. Samples that belong to different subcategories are treated as drawn from different domains. Therefore, we can construct six cross-domain document datasets: orgs vs people, people vs orgs, orgs vs places, places vs orgs, people vs places, and places vs people. We adopt the preprocessed version of Reuters-21578, which contains 3,461 documents represented by 4,771-dimensional features.

In summary, we have constructed $2 + 12 \times 2 + 6 = 32$ domain adaptation tasks.

    6.3.2 Methods

We compare our approach with many state-of-the-art algorithms: (1) 1-nearest neighbor classifier without adaptation (NN), (2) standard support vector machine (SVM), (3) principal components analysis (PCA), (4) optimal transport with entropy regularization (OT-IT) [7], (5) geodesic flow kernel (GFK) [45], (6) joint distribution alignment (JDA) [43], (7) correlation alignment (CORAL) [49], (8) transferable component analysis (TCA) [71], (9) subspace alignment (SA) [57], (10) domain invariant projection (DIP) [44], (11) surrogate kernel machine (SKM) [48], and (12) kernel mean matching (KMM) [47].

In the object recognition tasks, we apply all the algorithms to the data after PCA preprocessing, and use NN as the final classifier. Note that the choice of $\mathrm{Inn}_{ts}$ or $\mathrm{Dist}_{ts}$ for KGOT and CORAL is marked by a subscript. In the document classification tasks, we apply all the algorithms to the raw data, and use SVM as the final classifier.

    6.3.3 Implementation Details

In order to fairly compare the above methods, we adopt the evaluation protocol introduced in [71] and [43]. That is, we use the whole labeled data in the source domain for training a classifier ("full training" protocol). To choose hyperparameters for all the methods, we randomly select a very small subset of the target samples to tune parameters. We consider the following parameter ranges. For algorithms involving subspace learning, we search for the best dimension $k$ in $\{10, 15, \ldots, 40\}$. For algorithms involving regularization parameters, we search for the best ones in $\{0.01, 0.1, 1, 2, 10, 100\}$. For the tradeoff parameter $C$ in SVM, we select the best $C$ in $\{0.01, 0.1, 1, 10, 50, 100, 1000\}$.
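The tuning protocol can be sketched as a grid search scored on a small random subset of labeled target samples (Python; `fit_and_score` is a placeholder for whichever adaptation method and final classifier are being tuned, not a function defined in the paper).

```python
import numpy as np
from itertools import product

def tune_on_target_subset(fit_and_score, Xs, ys, Xt, yt, rng, frac=0.05):
    """Pick hyperparameters by scoring on a small random subset of labeled target samples.

    fit_and_score(params, Xs, ys, Xt_sub, yt_sub) : placeholder callable that trains on the
    full labeled source data and returns accuracy on the given target subset.
    """
    idx = rng.choice(len(Xt), size=max(1, int(frac * len(Xt))), replace=False)
    Xt_sub, yt_sub = Xt[idx], yt[idx]

    grid = {
        "dim": [10, 15, 20, 25, 30, 35, 40],          # subspace dimension
        "reg": [0.01, 0.1, 1, 2, 10, 100],            # regularization weight
        "C":   [0.01, 0.1, 1, 10, 50, 100, 1000],     # SVM tradeoff parameter
    }
    best_params, best_acc = None, -np.inf
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        acc = fit_and_score(params, Xs, ys, Xt_sub, yt_sub)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params
```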

    6.3.4 Experimental Results

The experimental results on the 32 domain adaptation tasks are reported in Tables 3, 4, 5, and 6. For each task, the best result is highlighted in bold. Overall, our KGOT-based approaches achieve better performance than the baseline methods.

TABLE 2
Classification Accuracy (in %) on the Kylberg Virus, Texture, UIUC, and TinyGraz03 Datasets

Methods       | low-level features                           | deep features                      | Mean
              | Virus       Texture     UIUC       TinyGraz03 | Virus      UIUC       TinyGraz03   |
KGW-SVM       | 80.4±3.8    93.1±0.9    48.4±3.6   61         | 84.1±1.3   60.3±1.7   70           | 71.0
KB-SVM        | 78.7±3.1    93.9±1.2    48.2±2.5   58         | 82.4±1.8   59.1±2.0   72           | 70.3
MMD1-SVM      | 43.3±6.2    59.3±2.3    23.3±3.8   32         | 68.7±0.7   51.4±3.7   47           | 46.4
MMD2-SVM      | 71.8±3.1    93.3±1.1    43.3±6.1   51         | 82.9±1.5   56.5±3.9   70           | 67.0
Bregman-SVM   | 81.2±2.9    91.4±1.3    45.4±3.0   59         | 82.7±1.4   53.9±2.8   68           | 68.8
CDL-SVM       | 69.5±3.1    79.9±1.1    36.3±2.0   41         | 83.7±1.3   58.3±2.9   70           | 62.7


On the Office-Caltech dataset with the SURF features, the average recognition accuracy of our approach is 18.04 percent higher than that of the 1NN algorithm without domain adaptation, which demonstrates the power of aligning RKHS covariance operators in tackling domain shift issues. On the Reuters-21578 dataset, KGOT's average classification accuracy significantly exceeds the best competing method's by 3.62 percent. On average, KGOT has superior performance to CORAL, because KGOT aligns the covariance descriptors in the nonlinear feature space, which can capture high-order statistics.

    7 CONCLUSION AND FUTURE WORK

In this paper, we presented a novel, theoretically robust, and computational framework, namely optimal transport in reproducing kernel Hilbert spaces, for comparing and matching distributions in RKHS.

TABLE 3
Recognition Accuracies (in %) on the COIL20 Dataset

Task      NN     PCA    OT-IT  GFK    JDA    CORAL_Dist  TCA    SA     KGOT_Dist
C1 → C2   83.33  86.53  85.69  88.06  90.28  89.31       89.31  88.19  93.06
C2 → C1   84.72  87.22  86.39  88.33  88.06  89.17       89.72  88.75  89.72
Mean      84.03  86.88  86.04  88.20  89.17  89.24       89.52  88.47  91.39

TABLE 4
Recognition Accuracies (in %) on the Office-Caltech Dataset with the SURF Features

Task    NN     PCA    OT-IT  GFK    JDA    CORAL_Inn  TCA    DIP    KGOT_Inn
A → C   26.00  35.98  36.24  42.68  37.67  34.73      42.74  39.98  39.89
A → D   25.48  32.48  35.03  40.52  36.94  28.66      37.58  39.49  42.04
A → W   29.83  34.24  42.71  42.37  40.34  35.93      40.00  38.64  42.03
C → A   23.70  37.89  45.41  41.13  43.42  46.45      46.76  41.75  49.37
C → D   25.48  39.49  45.86  45.22  52.87  43.95      47.13  45.22  50.96
C → W   25.76  34.92  42.71  40.34  43.05  36.27      40.68  37.29  43.05
D → A   28.50  33.72  33.92  34.34  33.29  34.13      34.86  33.82  37.06
D → C   26.27  31.17  31.43  33.48  30.99  31.61      32.95  30.99  34.64
D → W   63.39  81.36  87.46  85.08  92.20  83.73      91.19  84.41  87.46
W → A   22.96  30.79  37.58  33.09  37.06  39.46      29.44  30.38  38.00
W → C   19.86  30.19  32.41  30.90  29.92  33.66      30.72  26.09  36.60
W → D   59.24  80.89  89.17  90.45  89.81  78.34      89.15  91.72  91.72
Mean    31.37  41.93  46.66  46.63  47.30  43.91      46.93  44.98  49.41

TABLE 5
Recognition Accuracies (in %) on the Office-Caltech Dataset with the Deep Features

Task    NN     PCA    OT-IT  GFK    JDA    CORAL_Inn  TCA    SA     KGOT_Inn
A → C   83.70  79.43  83.26  78.09  83.26  85.31      83.08  80.59  85.66
A → D   80.25  80.89  84.08  84.71  80.25  80.80      82.17  89.17  86.62
A → W   74.58  70.85  77.29  76.27  77.97  76.27      80.34  83.05  82.37
C → A   89.98  89.46  88.73  89.14  90.08  91.13      90.50  89.35  91.44
C → D   86.62  87.90  90.45  88.54  91.08  86.62      86.62  90.45  92.36
C → W   78.64  81.36  88.47  80.34  83.73  81.12      79.66  81.36  87.12
D → A   85.70  89.14  83.30  89.04  91.54  88.73      91.65  87.06  91.75
D → C   79.16  78.01  83.97  78.36  82.37  80.41      83.53  81.39  85.57
D → W   99.66  98.64  98.31  99.32  100    99.32      98.98  99.32  99.32
W → A   77.14  83.30  88.94  83.92  88.62  82.05      83.72  83.72  89.67
W → C   74.80  78.72  79.07  76.22  81.30  78.72      79.79  79.79  84.95
W → D   100    100    99.36  100    100    100        100    100    100
Mean    84.19  84.81  87.10  85.33  87.52  85.87      86.67  87.10  89.74

TABLE 6
Recognition Accuracies (in %) on the Reuters-21578 Dataset

Tasks              SVM    PCA    OT-IT  GFK    TCA    CORAL  SKM    KMM    KGOT
Orgs vs People     77.57  80.87  80.96  82.04  81.37  76.07  79.55  77.81  82.04
People vs Orgs     80.43  81.81  84.31  82.30  84.39  76.71  82.22  82.94  85.77
Orgs vs Places     69.89  73.63  76.22  73.92  74.59  73.15  74.50  71.52  76.22
Places vs Orgs     65.16  65.35  75.10  76.57  73.33  64.66  69.88  67.91  77.95
People vs Places   61.19  62.95  66.85  64.90  62.12  59.05  63.60  61.65  72.98
Places vs People   60.26  68.89  61.75  65.55  60.54  60.26  60.35  59.52  71.96
Mean               69.08  72.25  74.20  74.21  72.72  68.32  71.68  70.23  77.82


Assuming Gaussianity in RKHS, we obtained closed-form expressions of both the empirical Wasserstein distance and optimal transport map, which respectively generalize the covariance descriptor comparison and alignment problems from Euclidean spaces to (potentially) infinite-dimensional feature spaces. Empirically, we applied our formulations to image classification and domain adaptation. For both tasks, our approaches achieve state-of-the-art results.

Our approaches are rather flexible in the sense that they can be naturally integrated with other machine learning topics, such as kernel learning, metric learning, and subspace/manifold learning. Moreover, our approaches support various data representations, such as proteins, strings, and graphs. Therefore, they have great potential to succeed in many applications where kernel functions are well-defined.

In future work, we intend to conduct ensemble classification and transfer learning on other types of datasets. We are also interested in further improving the performance of the proposed approaches for domain adaptation. We plan to modify our formulations of OT in RKHS, enabling them to align the joint distributions of features and labels between different domains.

    ACKNOWLEDGMENTS

This work was supported in part by AFOSR Grant FA9550-16-1-0386.

REFERENCES
[1] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vis., vol. 40, no. 2, pp. 99–121, 2000.
[2] O. Pele and M. Werman, "Fast and robust earth mover's distances," in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 460–467.
[3] S. Ferradans, N. Papadakis, J. Rabin, G. Peyré, and J.-F. Aujol, "Regularized discrete optimal transport," in Proc. Int. Conf. Scale Space Variational Methods Comput. Vis., 2013, pp. 428–439.
[4] S. Kolouri, Y. Zou, and G. K. Rohde, "Sliced Wasserstein kernels for probability distributions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5258–5267.
[5] A. Gramfort, G. Peyré, and M. Cuturi, "Fast optimal transport averaging of neuroimaging data," in Proc. Int. Conf. Inf. Process. Med. Imag., 2015, pp. 261–272.
[6] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223.
[7] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, "Optimal transport for domain adaptation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1853–1865, Sep. 2017.
[8] M. Perrot, N. Courty, R. Flamary, and A. Habrard, "Mapping estimation for discrete optimal transport," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4197–4205.
[9] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio, "Learning with a Wasserstein loss," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2053–2061.
[10] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas, "Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains," ACM Trans. Graph., vol. 34, no. 4, 2015, Art. no. 66.
[11] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister, "Sliced and Radon Wasserstein barycenters of measures," J. Math. Imag. Vis., vol. 51, no. 1, pp. 22–45, 2015.
[12] G. Peyré, M. Cuturi, and J. Solomon, "Gromov-Wasserstein averaging of kernel and distance matrices," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2664–2672.
[13] J. A. Carrillo, L. C. Ferreira, and J. C. Precioso, "A mass-transportation approach to a one dimensional fluid mechanics model with nonlocal velocity," Adv. Math., vol. 231, no. 1, pp. 306–327, 2012.
[14] T. A. El Moselhy and Y. M. Marzouk, "Bayesian inference with optimal maps," J. Comput. Phys., vol. 231, no. 23, pp. 7815–7850, 2012.
[15] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, "From word embeddings to document distances," in Proc. Int. Conf. Mach. Learn., 2015, pp. 957–966.
[16] G. Montavon, K.-R. Müller, and M. Cuturi, "Wasserstein training of restricted Boltzmann machines," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3718–3726.
[17] S. Ferradans, G.-S. Xia, G. Peyré, and J.-F. Aujol, "Static and dynamic texture mixing using optimal transport," in Proc. Int. Conf. Scale Space Variational Methods Comput. Vis., 2013, pp. 137–148.
[18] Y. Chen, T. T. Georgiou, and A. Tannenbaum, "Optimal transport for Gaussian mixture models," IEEE Access, vol. 7, pp. 6269–6278, 2019.
[19] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, "Text classification using string kernels," J. Mach. Learn. Res., vol. 2, pp. 419–444, 2002.
[20] R. I. Kondor and J. D. Lafferty, "Diffusion kernels on graphs and other discrete input spaces," in Proc. 19th Int. Conf. Mach. Learn., 2002, pp. 315–322.
[21] A. Ben-Hur and W. S. Noble, "Kernel methods for predicting protein-protein interactions," Bioinformatics, vol. 21, suppl. 1, pp. i38–i46, Jun. 2005.
[22] C. Cortes, P. Haffner, and M. Mohri, "Rational kernels: Theory and algorithms," J. Mach. Learn. Res., vol. 5, pp. 1035–1062, Aug. 2004.
[23] C. Cortes, P. Haffner, and M. Mohri, "Lattice kernels for spoken-dialog classification," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Apr. 2003, vol. 1, pp. I-628.
[24] Z. Zhang, M. Wang, Y. Xiang, and A. Nehorai, "Geometry-adapted Gaussian random field regression," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2017, pp. 6528–6532.
[25] S. K. Zhou and R. Chellappa, "From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 6, pp. 917–929, Jun. 2006.
[26] S.-Y. Huang and C.-R. Hwang, "Kernel Fisher discriminant analysis in Gaussian reproducing kernel Hilbert spaces," Inst. Stat. Sci., Academia Sinica, Taipei, Taiwan, Tech. Rep., 2006.
[27] Z. Zhang, G. Wang, D.-Y. Yeung, and J. T. Kwok, "Probabilistic kernel principal component analysis," Dept. Comput. Sci., The Hong Kong Univ. Sci. Technol., Hong Kong, Tech. Rep. HKUST-CS04-03, 2004.
[28] M. Alvarez and R. Henao, "Probabilistic kernel principal component analysis through time," in Proc. Int. Conf. Neural Inf. Process., 2006, pp. 747–754.
[29] F. R. Bach and M. I. Jordan, "Learning graphical models with Mercer kernels," in Proc. Adv. Neural Inf. Process. Syst., 2003, pp. 1033–1040.
[30] R. Kondor and T. Jebara, "A kernel between sets of vectors," in Proc. 20th Int. Conf. Mach. Learn., 2003, pp. 361–368.
[31] A. Berlinet and C. Thomas-Agnan, Reproducing Kernel Hilbert Spaces in Probability and Statistics. Berlin, Germany: Springer, 2011.
[32] C. Villani, Topics in Optimal Transportation. Providence, RI, USA: American Mathematical Society, 2003, vol. 58.
[33] S. T. Rachev and L. Ruschendorf, "A transformation property of minimal metrics," Theory Probability Appl., vol. 35, no. 1, pp. 110–117, 1991.
[34] J. Cuesta-Albertos, C. Matrán-Bea, and A. Tuero-Diaz, "On lower bounds for the L2-Wasserstein metric in a Hilbert space," J. Theoretical Probability, vol. 9, no. 2, pp. 263–283, 1996.
[35] M. Gelbrich, "On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces," Mathematische Nachrichten, vol. 147, no. 1, pp. 185–203, 1990.
[36] A. Mallasto and A. Feragen, "Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5665–5674.
[37] R. Wang, H. Guo, L. S. Davis, and Q. Dai, "Covariance discriminative learning: A natural and efficient approach to image set classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2496–2503.
[38] O. Tuzel, F. Porikli, and P. Meer, "Pedestrian detection via classification on Riemannian manifolds," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1713–1727, Oct. 2008.
[39] M. Harandi, M. Salzmann, and F. Porikli, "Bregman divergences for infinite dimensional covariance matrices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1003–1010.
[40] M. H. Quang, M. San Biagio, and V. Murino, "Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 388–396.
[41] H. Q. Minh, "Affine-invariant Riemannian distance between infinite-dimensional covariance operators," in Proc. Int. Conf. Netw. Geometric Sci. Inf., 2015, pp. 30–38.
[42] X. Pennec, P. Fillard, and N. Ayache, "A Riemannian framework for tensor computing," Int. J. Comput. Vis., vol. 66, no. 1, pp. 41–66, 2006.
[43] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, "Transfer feature learning with joint distribution adaptation," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2200–2207.
[44] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann, "Unsupervised domain adaptation by domain invariant projection," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 769–776.
[45] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2066–2073.
[46] R. Gopalan, R. Li, and R. Chellappa, "Domain adaptation for object recognition: An unsupervised approach," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 999–1006.
[47] J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola, "Correcting sample selection bias by unlabeled data," in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 601–608.
[48] K. Zhang, V. Zheng, Q. Wang, J. Kwok, Q. Yang, and I. Marsic, "Covariate shift in Hilbert space: A solution via surrogate kernels," in Proc. Int. Conf. Mach. Learn., 2013, pp. 388–395.
[49] B. Sun, J. Feng, and K. Saenko, "Return of frustratingly easy domain adaptation," in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 2058–2065.
[50] D. Dowson and B. Landau, "The Fréchet distance between multivariate normal distributions," J. Multivariate Anal., vol. 12, no. 3, pp. 450–455, 1982.
[51] R. Bhatia, T. Jain, and Y. Lim, "On the Bures-Wasserstein distance between positive definite matrices," Expositiones Mathematicae, 2018.
[52] A. Takatsu, "Wasserstein geometry of Gaussian measures," Osaka J. Math., vol. 48, no. 4, pp. 1005–1026, 2011.
[53] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, "Measuring statistical dependence with Hilbert-Schmidt norms," in Proc. 16th Int. Conf. Algorithmic Learn. Theory, 2005, pp. 63–77.
[54] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," J. Mach. Learn. Res., vol. 13, pp. 723–773, 2012.
[55] B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet, "Universality, characteristic kernels and RKHS embedding of measures," J. Mach. Learn. Res., vol. 12, pp. 2389–2410, Jul. 2011.
[56] K. Fukumizu, L. Song, and A. Gretton, "Kernel Bayes' rule: Bayesian inference with positive definite kernels," J. Mach. Learn. Res., vol. 14, no. 1, pp. 3753–3783, 2013.
[57] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, "Unsupervised visual domain adaptation using subspace alignment," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2960–2967.
[58] Y. Xu, X. Fang, J. Wu, X. Li, and D. Zhang, "Discriminative transfer subspace learning via low-rank and sparse representation," IEEE Trans. Image Process., vol. 25, no. 2, pp. 850–863, Feb. 2016.
[59] M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2292–2300.
[60] G. Kylberg, M. Uppström, K.-O. Hedlund, G. Borgefors, and I.-M. Sintorn, "Segmentation of virus particle candidates in transmission electron microscopy images," J. Microscopy, vol. 245, no. 2, pp. 140–147, 2012.
[61] G. Kylberg, "The Kylberg texture dataset v. 1.0," Centre for Image Analysis, Swedish Univ. Agricultural Sci. and Uppsala Univ., Uppsala, Sweden, External report (Blue series) 35, Sept. 2011. [Online]. Available: http://www.cb.uu.se/~gustaf/texture/
[62] Z. Liao, J. Rock, Y. Wang, and D. Forsyth, "Non-parametric filtering for geometric detail extraction and material representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 963–970.
[63] A. Wendel and A. Pinz, "Scene categorization from tiny images," in Workshop Austrian Assoc. Pattern Recognit., 2007, pp. 49–56.
[64] T. S. Lee, "Image representation using 2D Gabor wavelets," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 10, pp. 959–971, Oct. 1996.
[65] M. Faraki, M. T. Harandi, and F. Porikli, "Approximate infinite-dimensional region covariance descriptors for image classification," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 1364–1368.
[66] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 447–456.
[67] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," CoRR, vol. abs/1404.5997, Apr. 2014.
[68] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[69] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf, "Learning from distributions via support measure machines," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 10–18.
[70] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. Int. Conf. Mach. Learn., 2014, pp. 647–655.
[71] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011.

Zhen Zhang (S'17) received the BSc degree from the University of Science and Technology of China in 2014. He is currently working toward the PhD degree in the Preston M. Green Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO, under the guidance of Dr. A. Nehorai. His research interests include the areas of machine learning and computer vision. He is a student member of the IEEE.

Mianzhi Wang (S'15) received the BSc degree in electronic engineering from Fudan University, Shanghai, China, in 2013. He is currently working toward the PhD degree in the Preston M. Green Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO, under the guidance of Dr. A. Nehorai. His research interests include the areas of statistical signal processing for sensor arrays, optimization, and machine learning. He is a student member of the IEEE.

Arye Nehorai (S'80-M'83-SM'90-F'94-LF'17) received the BSc and MSc degrees from the Technion, Israel, and the PhD degree from Stanford University, California. He is the Eugene and Martha Lohman Professor of Electrical Engineering in the Preston M. Green Department of Electrical and Systems Engineering (ESE) at Washington University in St. Louis (WUSTL). He served as chair of this department from 2006 to 2016. Under his leadership, the undergraduate enrollment has more than tripled and the masters enrollment has grown seven-fold. He is also a professor in the Division of Biology and Biomedical Sciences (DBBS), the Division of Biostatistics, the Department of Biomedical Engineering, and the Department of Computer Science and Engineering, and director of the Center for Sensor Signal and Information Processing at WUSTL. Prior to serving at WUSTL, he was a faculty member at Yale University and the University of Illinois at Chicago. He served as editor-in-chief of the IEEE Transactions on Signal Processing from 2000 to 2002. From 2003 to 2005 he was the vice president (Publications) of the IEEE Signal Processing Society (SPS), the chair of the Publications Board, and a member of the Executive Committee of this Society. He was the founding editor of the special columns on Leadership Reflections in IEEE Signal Processing Magazine from 2003 to 2006. He received the 2006 IEEE SPS Technical Achievement Award and the 2010 IEEE SPS Meritorious Service Award. He was elected distinguished lecturer of the IEEE SPS for a term lasting from 2004 to 2005. He received several best paper awards in IEEE journals and conferences. In 2001 he was named University Scholar of the University of Illinois. He was the principal investigator of the Multidisciplinary University Research Initiative (MURI) project titled Adaptive Waveform Diversity for Full Spectral Dominance from 2005 to 2010. He has been a fellow of the IEEE since 1994, a fellow of the Royal Statistical Society since 1996, and a fellow of AAAS since 2012.

    " For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/csdl.

    1754 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 42, NO. 7, JULY 2020

    Authorized licensed use limited to: WASHINGTON UNIVERSITY LIBRARIES. Downloaded on June 06,2020 at 03:01:16 UTC from IEEE Xplore. Restrictions apply.

    http: //www.cb.uu.se/~gustaf/texture/

    /ColorImageDict > /JPEG2000ColorACSImageDict > /JPEG2000ColorImageDict > /AntiAliasGrayImages false /CropGrayImages true /GrayImageMinResolution 150 /GrayImageMinResolutionPolicy /OK /DownsampleGrayImages false /GrayImageDownsampleType /Bicubic /GrayImageResolution 300 /GrayImageDepth -1 /GrayImageMinDownsampleDepth 2 /GrayImageDownsampleThreshold 1.50000 /EncodeGrayImages true /GrayImageFilter /DCTEncode /AutoFilterGrayImages false /GrayImageAutoFilterStrategy /JPEG /GrayACSImageDict > /GrayImageDict > /JPEG2000GrayACSImageDict > /JPEG2000GrayImageDict > /AntiAliasMonoImages false /CropMonoImages true /MonoImageMinResolution 1200 /MonoImageMinResolutionPolicy /OK /DownsampleMonoImages false /MonoImageDownsampleType /Bicubic /MonoImageResolution 600 /MonoImageDepth -1 /MonoImageDownsampleThreshold 1.50000 /EncodeMonoImages true /MonoImageFilter /CCITTFaxEncode /MonoImageDict > /AllowPSXObjects false /CheckCompliance [ /None ] /PDFX1aCheck false /PDFX3Check false /PDFXCompliantPDFOnly false /PDFXNoTrimBoxError true /PDFXTrimBoxToMediaBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXSetBleedBoxToMediaBox true /PDFXBleedBoxToTrimBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXOutputIntentProfile (None) /PDFXOutputConditionIdentifier () /PDFXOutputCondition () /PDFXRegistryName () /PDFXTrapped /False

    /CreateJDFFile false /Description >>> setdistillerparams> setpagedevice