Upload
others
View
10
Download
0
Embed Size (px)
Citation preview
On the Relation Between Universality,Characteristic Kernels and RKHS Embedding of
Measures
Bharath K. Sriperumbudur⋆, Kenji Fukumizu† andGert R. G. Lanckriet⋆
⋆UC San Diego †The Institute of Statistical Mathematics
AISTATS 2010
Outline
RKHS embedding of probability measures
Characteristic kernels
Universal kernels
Various notions of universality
Novel characterization of universality
Relation to RKHS embedding of signed measures
RKHS Embedding of Probability Measures
Input space : X
Feature space : H (with reproducing kernel, k)
Feature map : Φ
Φ : X → H x 7→ Φ(x) := k(·, x)
RKHS Embedding of Probability Measures
Input space : X
Feature space : H (with reproducing kernel, k)
Feature map : Φ
Φ : X → H x 7→ Φ(x) := k(·, x)
Extension to probability measures:
P 7→ Φ(P) :=
∫
X
k(·, x) dP(x)
RKHS Embeddings of Probability Measures
Input space : X
Feature space : H (with reproducing kernel, k)
Feature map : Φ
Φ : X → H x 7→ Φ(x) := k(·, x)
Extension to probability measures:
P 7→ Φ(P) :=
∫
X
k(·, x) dP(x)
︸ ︷︷ ︸EY∼P[Φ(Y )]=EY∼P[k(·,Y )]
RKHS Embeddings of Probability Measures
Input space : X
Feature space : H (with reproducing kernel, k)
Feature map : Φ
Φ : X → H x 7→ Φ(x) := k(·, x)
Extension to probability measures:
P 7→ Φ(P) :=
∫
X
k(·, x) dP(x)
Advantage: Φ(P) can distinguish P by high-order moments.
k(y , x) = c0 + c1(xy) + c2(xy)2 + · · · (ci 6= 0) e.g. k(y , x) = exy
Φ(P)(y) = c0 + c1
(∫
X
x dP(x)
)y + c2
(∫
X
x2 dP(x)
)y2 + · · ·
Applications
Two-sample problem:
Given random samples X1, . . . ,Xm and Y1, . . . ,Yn drawn i.i.d.from P and Q, respectively.
Determine: are P and Q different?
Applications
Two-sample problem:
Given random samples X1, . . . ,Xm and Y1, . . . ,Yn drawn i.i.d.from P and Q, respectively.
Determine: are P and Q different?
γ(P,Q) = ‖Φ(P)− Φ(Q)‖H : distance metric between P and Q.
H0 : P = Q H0 : γ(P,Q) = 0≡
H1 : P 6= Q H1 : γ(P,Q) > 0
Test: Say H0 if γ(P,Q) < ε. Otherwise say H1.
Applications
Two-sample problem:
Given random samples X1, . . . ,Xm and Y1, . . . ,Yn drawn i.i.d.from P and Q, respectively.
Determine: are P and Q different?
γ(P,Q) = ‖Φ(P)− Φ(Q)‖H : distance metric between P and Q.
H0 : P = Q H0 : γ(P,Q) = 0≡
H1 : P 6= Q H1 : γ(P,Q) > 0
Test: Say H0 if γ(P,Q) < ε. Otherwise say H1.
Other applications:
Hypothesis testing : Independence test, Goodness of fit test, etc.
Feature selection, message passing, density estimation, etc.
Characteristic Kernels
Define: k is characteristic if
P 7→
∫
X
k(·, x) dP(x) is injective.
In other words,∫
X
k(·, x) dP(x) =
∫
X
k(·, x) dQ(x) ⇔ P = Q.
When k(·, x) = e√−1〈·,x〉, Φ(P) is the characteristic function of P.
Characteristic Kernels
Define: k is characteristic if
P 7→
∫
X
k(·, x) dP(x) is injective.
In other words,∫
X
k(·, x) dP(x) =
∫
X
k(·, x) dQ(x) ⇔ P = Q.
When k(·, x) = e√−1〈·,x〉, Φ(P) is the characteristic function of P.
Not all kernels are characteristic, e.g., k(x , y) = xT y .
µP = µQ ; P = Q
When is k characteristic? [Gretton et al., 2007,Sriperumbudur et al., 2008, Fukumizu et al., 2008,Fukumizu et al., 2009, Sriperumbudur et al., 2009].
Universal Kernels Regularization approach to supervised learning
minf∈H
1
n
n∑
i=1
ℓ(f (xi ), yi ) + λΩ[f ], (1)
where λ > 0 and (xi , yi )ni=1 is the training data.
Universal Kernels Regularization approach to supervised learning
minf∈H
1
n
n∑
i=1
ℓ(f (xi ), yi ) + λΩ[f ], (1)
where λ > 0 and (xi , yi )ni=1 is the training data.
Representer theorem : The solution to (1) is of the form
f =
n∑
i=1
cik(·, xi ),
where cini=1 ⊂ R are the parameters typically obtained from the
training data.
Universal Kernels Regularization approach to supervised learning
minf∈H
1
n
n∑
i=1
ℓ(f (xi ), yi ) + λΩ[f ], (1)
where λ > 0 and (xi , yi )ni=1 is the training data.
Representer theorem : The solution to (1) is of the form
f =
n∑
i=1
cik(·, xi ),
where cini=1 ⊂ R are the parameters typically obtained from the
training data.
Question: Can f approximate any target function arbitrarily “well”as n → ∞?
Universal Kernels Regularization approach to supervised learning
minf∈H
1
n
n∑
i=1
ℓ(f (xi ), yi ) + λΩ[f ], (1)
where λ > 0 and (xi , yi )ni=1 is the training data.
Representer theorem : The solution to (1) is of the form
f =
n∑
i=1
cik(·, xi ),
where cini=1 ⊂ R are the parameters typically obtained from the
training data.
Question: Can f approximate any target function arbitrarily “well”as n → ∞?
We need H to be “dense” in the space of target functions — k isuniversal.
Various Notions of Universality
Prior work
c-universality [Steinwart, 2001]
cc-universality [Micchelli et al., 2006]
Proposed notion: c0-universality
Characterization of c-, cc- and c0-universality : Relation to RKHSembedding of measures
Translation invariant kernels on Rd
Radial kernels on Rd
c-universality [Steinwart, 2001]
X : compact metric space
k : continuous on X × X
Target function space : C (X ), continuous functions on X
Define k to be c-universal if H is dense in C (X ) w.r.t. the uniform norm(‖f ‖u := supx∈X |f (x)|).
c-universality [Steinwart, 2001]
X : compact metric space
k : continuous on X × X
Target function space : C (X ), continuous functions on X
Define k to be c-universal if H is dense in C (X ) w.r.t. the uniform norm(‖f ‖u := supx∈X |f (x)|).
Sufficient conditions are obtained based on the Stone-Weierstraßtheorem. Not easy to check!
Examples: Gaussian and Laplacian kernels on any compact subset ofRd .
c-universality [Steinwart, 2001]
X : compact metric space
k : continuous on X × X
Target function space : C (X ), continuous functions on X
Define k to be c-universal if H is dense in C (X ) w.r.t. the uniform norm(‖f ‖u := supx∈X |f (x)|).
Sufficient conditions are obtained based on the Stone-Weierstraßtheorem. Not easy to check!
Examples: Gaussian and Laplacian kernels on any compact subset ofRd .
Issue: X is compact which excludes many interesting spaces, such as Rd .
cc-universality [Micchelli et al., 2006]
X : Hausdorff space
k : continuous on X × X
Target function space : C (X )
Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.
cc-universality [Micchelli et al., 2006]
X : Hausdorff space
k : continuous on X × X
Target function space : C (X )
Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.
In other words, for any compact set Z ⊂ X , H|Z := f|Z : f ∈ H isdense in C (Z ) w.r.t. ‖ · ‖u.
cc-universality [Micchelli et al., 2006]
X : Hausdorff space
k : continuous on X × X
Target function space : C (X )
Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.
Necessary and sufficient conditions are obtained, which are relatedto the injectivity of RKHS embedding of measures.
Examples: Gaussian, Laplacian and Sinc kernels on Rd .
cc-universality [Micchelli et al., 2006]
X : Hausdorff space
k : continuous on X × X
Target function space : C (X )
Define k to be cc-universal if H is dense in C (X ) endowed with thetopology of compact convergence.
Necessary and sufficient conditions are obtained, which are relatedto the injectivity of RKHS embedding of measures.
Examples: Gaussian, Laplacian and Sinc kernels on Rd .
Issue: Topology of compact convergence is weaker than the topology ofuniform convergence.
Proposed Notion: c0-universality
X : locally compact Hausdorff (LCH) space
Target function space : C0(X ), the space of bounded continuousfunctions that “vanish at infinity” (for every ǫ > 0,x ∈ X : |f (x)| ≥ ǫ is compact).
k is bounded and k(·, x) ∈ C0(X ) for all x ∈ X .
Proposed Notion: c0-universality
X : locally compact Hausdorff (LCH) space
Target function space : C0(X ), the space of bounded continuousfunctions that “vanish at infinity” (for every ǫ > 0,x ∈ X : |f (x)| ≥ ǫ is compact).
k is bounded and k(·, x) ∈ C0(X ) for all x ∈ X .
Define k to be c0-universal if H is dense in C0(X ) w.r.t. ‖ · ‖u.
Handles non-compact X and ensures uniform convergence overentire X .
Embedding Characterization of UniversalityTheorem
k is c0-universal if and only if
µ 7→
∫
X
k(·, x) dµ(x), µ ∈ Mb(X ),
is injective. Mb(X ) is the space of finite signed Radon measures onX .
Embedding Characterization of UniversalityTheorem
k is c0-universal if and only if
µ 7→
∫
X
k(·, x) dµ(x), µ ∈ Mb(X ),
is injective. Mb(X ) is the space of finite signed Radon measures onX .
k is cc-universal if and only if
µ 7→
∫
X
k(·, x) dµ(x), µ ∈ Mbc(X ),
is injective. Mbc(X ) = µ ∈ Mb(X ) | supp(µ) is compact.
Embedding Characterization of UniversalityTheorem
k is c0-universal if and only if
µ 7→
∫
X
k(·, x) dµ(x), µ ∈ Mb(X ),
is injective. Mb(X ) is the space of finite signed Radon measures onX .
k is cc-universal if and only if
µ 7→
∫
X
k(·, x) dµ(x), µ ∈ Mbc(X ),
is injective. Mbc(X ) = µ ∈ Mb(X ) | supp(µ) is compact.
k is c-universal if and only if
µ 7→
∫
X
k(·, x) dµ(x), µ ∈ Mb(X ),
is injective.
Postive Definite Characterization of Universality
Theorem
k is c0-universal (resp. c-universal) if and only if
∫
X
∫
X
k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.
Postive Definite Characterization of Universality
Theorem
k is c0-universal (resp. c-universal) if and only if
∫
X
∫
X
k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.
k is cc-universal if and only if
∫
X
∫
X
k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mbc(X )\0.
Postive Definite Characterization of Universality
Theorem
k is c0-universal (resp. c-universal) if and only if
∫
X
∫
X
k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mb(X )\0.
k is cc-universal if and only if
∫
X
∫
X
k(x , y) dµ(x) dµ(y) > 0, ∀µ ∈ Mbc(X )\0.
If k is c-, cc- or c0-universal, then it is strictly positive definite.
X is an LCH space: Summary
Translation Invariant Kernels on Rd
X = Rd and k(x , y) = ψ(x − y), where
ψ(x) =
∫
Rd
e√−1xTω dΛ(ω), x ∈ Rd ,
and Λ is a non-negative finite Borel measure.
Translation Invariant Kernels on Rd
X = Rd and k(x , y) = ψ(x − y), where
ψ(x) =
∫
Rd
e√−1xTω dΛ(ω), x ∈ Rd ,
and Λ is a non-negative finite Borel measure.
Theorem
k is c0-universal if and only if supp(Λ) = Rd .
k is c0-universal if and only if it is characteristic.
If supp(Λ) has a non-empty interior, then k is cc-universal.[Micchelli et al., 2006]
Examples
Gaussian kernel: ψ(x) = e−x2/2σ2
; Ψ(ω) = σe−σ2ω2/2; dΛ(ω) = Ψ(ω) dω.
−4 −3 −2 −1 0 1 2 3 40
1
x
ψ(x
)
−4 −3 −2 −1 0 1 2 3 40
σ
Ψ(ω
)
Laplacian kernel: ψ(x) = e−σ|x|; Ψ(ω) =√
2π
σσ2+ω2 .
−4 −3 −2 −1 0 1 2 3 40
1
x
ψ(x
)
−4 −3 −2 −1 0 1 2 3 40
(2/πσ2)1/2
Ψ(ω
)
Examples
B1-spline kernel: ψ(x) = (1− |x |)1[−1,1](x); Ψ(ω) = 2√
2√π
sin2(ω2)
ω2 .
−3 −2 −1 0 1 2 30
1
x
ψ(x
)
0
0.2
0.4
0.6
0.8
1
−8π −6π −4π −2π 0 2π 4π 6π 8π
(2π)
1/2 Ψ
(ω)
Sinc kernel: ψ(x) = sin(σx)x
; Ψ(ω) =√
π21[−σ,σ](ω).
−0.2
0
0.2
0.4
0.6
0.8
1
−6π −5π −4π −3π −2π −π 0 π 2π 3π 4π 5π 6π
ψ(x
)
Ψ(ω
)
−3 −2 −1 0 1 2 30
(π/2)1/2
Translation Invariant Kernels on Rd : Summary
Radial Kernels on Rd
Let
k(x , y) =
∫
[0,∞)
e−t‖x−y‖22 dν(t),
where ν is a finite non-negative Borel measure on [0,∞).
Examples: Gaussian kernel, Inverse multi-quadratic kernel,k(x , y) = (c2 + ‖x − y‖22)
−β , β > d2 , c > 0, etc.
Radial Kernels on Rd
Let
k(x , y) =
∫
[0,∞)
e−t‖x−y‖22 dν(t),
where ν is a finite non-negative Borel measure on [0,∞).
Examples: Gaussian kernel, Inverse multi-quadratic kernel,k(x , y) = (c2 + ‖x − y‖22)
−β , β > d2 , c > 0, etc.
TheoremThe following conditions are equivalent.
supp(ν) 6= 0.
k is c0-universal.
k is cc-universal.
k is characteristic.
k is strictly pd.
Radial Kernels on Rd : Summary
Summary
Characteristic kernel
Injective RKHS embedding of probability measures.
Applications: Hypothesis testing, feature selection, etc.
Universal kernel
Consistency of learning algorithms.
Injective RKHS embedding of finite signed Radon measures.
Clarified the relation between various notions of universality andcharacteristic kernels.
References
Fukumizu, K., Gretton, A., Sun, X., and Scholkopf, B. (2008).Kernel measures of conditional dependence.In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 489–496,Cambridge, MA. MIT Press.
Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Scholkopf, B. (2009).Characteristic kernels on groups and semigroups.In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages473–480.
Gretton, A., Borgwardt, K. M., Rasch, M., Scholkopf, B., and Smola, A. (2007).A kernel method for the two sample problem.In Scholkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MITPress.
Micchelli, C. A., Xu, Y., and Zhang, H. (2006).Universal kernels.Journal of Machine Learning Research, 7:2651–2667.
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G. R. G., and Scholkopf, B. (2009).Kernel choice and classifiability for RKHS embeddings of probability distributions.In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A., editors, Advances in Neural Information Processing
Systems 22, pages 1750–1758. MIT Press.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Scholkopf, B. (2008).Injective Hilbert space embeddings of probability measures.In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122.
Steinwart, I. (2001).On the influence of the kernel on the consistency of support vector machines.Journal of Machine Learning Research, 2:67–93.