Kernel Choice and Classifiability for RKHS
Embeddings of Probability Distributions
Bharath K. Sriperumbudur⋆, Kenji Fukumizu†, Arthur Gretton‡,×, Gert R. G. Lanckriet⋆ and Bernhard Schölkopf×
⋆UC San Diego  †The Institute of Statistical Mathematics  ‡CMU  ×MPI for Biological Cybernetics
NIPS 2009
RKHS Embeddings of Probability Measures
◮ Input space: X
◮ Feature space: H
◮ Feature map: Φ : X → H, x ↦ Φ(x).
Extension to probability measures:
P ↦ Φ(P)
Distance between P and Q:
γ(P, Q) = ‖Φ(P) − Φ(Q)‖H.
Applications
Two-sample problem:
◮ Given random samples {X1, …, Xm} and {Y1, …, Yn} drawn i.i.d. from P and Q, respectively.
◮ Determine: are P and Q different?
◮ γ(P, Q) : distance metric between P and Q.
H0 : P = Q  ≡  H0 : γ(P, Q) = 0
H1 : P ≠ Q  ≡  H1 : γ(P, Q) > 0
◮ Test: Say H0 if γ(P, Q) < ε. Otherwise say H1.
Applications
◮ Hypothesis testing
◮ Testing for independence and conditional independence
◮ Goodness of fit test
◮ Density estimation : quality of the estimate, convergence results.
◮ Central limit theorems
◮ Information theory
Popular examples:
◮ Kullback-Leibler divergence
◮ Total-variation distance (metric)
◮ Hellinger distance
◮ χ2-distance
The above examples are special instances of Csiszár's φ-divergence.
Integral Probability Metrics
◮ The integral probability metric [Müller, 1997] between P and Q is defined as

γF(P, Q) = sup_{f∈F} |EPf − EQf|.
◮ Many popular probability metrics can be obtained by appropriately choosing F:
◮ Total variation distance: F = {f : ‖f‖∞ ≤ 1}.
◮ Wasserstein distance: F = {f : ‖f‖L ≤ 1}, where ‖f‖L is the Lipschitz seminorm.
◮ Dudley metric: F = {f : ‖f‖L + ‖f‖∞ ≤ 1}.
◮ Well studied in statistics and probability theory.
F is a Reproducing Kernel Hilbert Space
◮ H : reproducing kernel Hilbert space (RKHS).
◮ k : measurable, bounded, real-valued reproducing kernel.
◮ F : the unit ball in H, i.e., F = {f : ‖f‖H ≤ 1}.
Maximum mean discrepancy (MMD) [Gretton et al., 2007]:

γk(P, Q) := γF(P, Q) = ‖EPk − EQk‖H,

where ‖·‖H denotes the RKHS norm.
RKHS embedding of probability measures:
P ↦ EPk =: Φ(P).
Advantages
◮ Easy to compute γk, unlike γF for other choices of F (see the sketch below).
◮ If k is measurable and bounded, γk(Pm, Qn) is a √(mn/(m + n))-consistent estimator of γk(P, Q) [Gretton et al., 2007].
◮ If k is translation invariant on Rd, the rate is independent of d.
◮ Easy to handle structured domains like graphs and strings.
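To make the first point concrete, here is a minimal NumPy sketch (not from the slides; the function names, bandwidth, and sample distributions are illustrative) of the biased empirical estimate of γk for a fixed Gaussian kernel, computed directly from Gram matrices:

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd(X, Y, sigma):
    """Biased (V-statistic) estimate of gamma_k(P_m, Q_n) for samples X ~ P, Y ~ Q."""
    m, n = len(X), len(Y)
    val = (gaussian_gram(X, X, sigma).sum() / m ** 2
           + gaussian_gram(Y, Y, sigma).sum() / n ** 2
           - 2.0 * gaussian_gram(X, Y, sigma).sum() / (m * n))
    return np.sqrt(max(val, 0.0))  # clip round-off noise below zero

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))  # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 1))  # sample from Q
print(mmd(X, Y, sigma=1.0))
```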
Characteristic Kernels
When is γk a metric?
γk(P, Q) = 0 ⇔ EPk = EQk ⇔ P = Q.
Define: k is characteristic if
EPk = EQk ⇔ P = Q.
◮ Not all kernels are characteristic, e.g., the linear kernel k(x, y) = xᵀy, for which

γk(P, Q) = ‖µP − µQ‖2,

so any two distributions with the same mean are indistinguishable.
◮ When is k characteristic? [Gretton et al., 2007, Sriperumbudur et al., 2008, Fukumizu et al., 2008, Fukumizu et al., 2009].
Outline
◮ Characterization of characteristic kernels (visit poster!)
◮ Choice of characteristic kernels
◮ Characteristic kernels and binary classification
Choice of Characteristic Kernels
Examples: Gaussian, Laplacian, B2l+1-splines, Poisson kernel, etc.
Suppose k is a Gaussian kernel, kσ(x, y) = exp(−‖x − y‖₂² / (2σ²)).
◮ γkσ is a function of σ, so {γkσ : σ ∈ R₊} is a family of metrics. Which one do we use in practice?
◮ Note that γkσ(P, Q) → 0 as σ → 0 or σ → ∞.
◮ Define

γ(P, Q) = sup_{σ∈R₊} γkσ(P, Q).
Classes of Characteristic Kernels
Generalized MMD:

γ(P, Q) := sup_{k∈K} γk(P, Q).
Examples of K:
◮ Kg := {e^{−σ‖x−y‖₂²} : x, y ∈ Rd, σ ∈ R₊}.
◮ Krbf := {∫₀^∞ e^{−λ‖x−y‖₂²} dµσ(λ) : x, y ∈ Rd, µσ ∈ M⁺, σ ∈ Σ ⊂ Rd}, where M⁺ is the set of finite nonnegative Borel measures µσ on R₊ that are not concentrated at zero.
◮ Klin := {kλ = ∑_{i=1}^l λi ki : kλ is pd, ∑_{i=1}^l λi = 1}.
◮ Kcon := {kλ = ∑_{i=1}^l λi ki : λi ≥ 0, ∑_{i=1}^l λi = 1}.
Computation
γ(P, Q) = sup_{k∈K} [ ∫∫ k(x, y) dP(x) dP(y) + ∫∫ k(x, y) dQ(x) dQ(y) − 2 ∫∫ k(x, y) dP(x) dQ(y) ]^{1/2}.
◮ Suppose {Xi}_{i=1}^m i.i.d.∼ P and {Yi}_{i=1}^n i.i.d.∼ Q.
◮ Let Pm := (1/m) ∑_{i=1}^m δXi and Qn := (1/n) ∑_{i=1}^n δYi, where δx denotes the Dirac measure at x.
◮ The empirical estimate of γ(P, Q):

γ(Pm, Qn) = sup_{k∈K} [ (1/m²) ∑_{i,j=1}^m k(Xi, Xj) + (1/n²) ∑_{i,j=1}^n k(Yi, Yj) − (2/mn) ∑_{i=1}^m ∑_{j=1}^n k(Xi, Yj) ]^{1/2}.
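For the Gaussian family Kg, the supremum over kernels can be approximated on a finite bandwidth grid. A minimal sketch reusing mmd and the samples X, Y from the earlier block (the grid is an illustrative stand-in for the exact supremum treated in the paper):

```python
def generalized_mmd(X, Y, sigmas):
    """Approximate gamma(P_m, Q_n) = sup over the Gaussian family by a grid over sigma."""
    return max(mmd(X, Y, s) for s in sigmas)

sigmas = np.logspace(-2, 2, 25)  # finite proxy for sigma ranging over (0, infinity)
print(generalized_mmd(X, Y, sigmas))
```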
Question
◮ When is γ a metric?
◮ Answer: if at least one k ∈ K is characteristic, then γ is a metric (since γ(P, Q) ≥ γk(P, Q) for every k ∈ K).
Question
◮ For a fixed k that is measurable and bounded, [Gretton et al., 2007] have shown that

|γk(Pm, Qn) − γk(P, Q)| = O(√((m + n)/mn)).
◮ When does γ(Pm, Qn) → γ(P, Q) a.s.? What is the rate of convergence?
Statistical Consistency: Result
Theorem
For any K with ν := sup_{k∈K, x∈M} k(x, x) < ∞, the following holds with probability at least 1 − δ:

|γ(Pm, Qn) − γ(P, Q)| ≤ √(8Um(K)/m) + √(8Un(K)/n) + (√(8ν) + √(36ν log(4/δ))) √((m + n)/mn),

where

Um(K) := E[ sup_{k∈K} |(1/m) ∑_{i<j} ρi ρj k(Xi, Xj)| ∣ X1, …, Xm ]

is the Rademacher chaos complexity and the ρi are Rademacher random variables.
Statistical Consistency: Result
Proposition
Suppose K is a VC-subgraph class. Then

|γ(Pm, Qn) − γ(P, Q)| = O(√((m + n)/mn)).

In addition, γ(Pm, Qn) → γ(P, Q) a.s.
Examples: [Ying and Campbell, 2009, Srebro and Ben-David, 2006]
◮ Kg , Krbf , Klin, Kcon, etc.
The Two-Sample Problem
◮ Given: {X1, …, Xm} i.i.d.∼ P and {Y1, …, Yn} i.i.d.∼ Q.
◮ Determine: are P and Q different?
◮ γ(P, Q) : distance metric between P and Q.
H0 : P = Q  ≡  H0 : γ(P, Q) = 0
H1 : P ≠ Q  ≡  H1 : γ(P, Q) > 0
◮ Test: Say H0 if γ(P, Q) < ε. Otherwise say H1.
◮ Good test: low Type-II error at a user-specified Type-I error (one way to calibrate ε is sketched below).
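The slides leave ε unspecified. A standard way to calibrate it at a target Type-I error α (an illustrative recipe, not spelled out in the slides) is a permutation test: recompute the statistic on random relabelings of the pooled sample and take the empirical (1 − α)-quantile as ε. A sketch, reusing mmd and the samples X, Y from above:

```python
def permutation_threshold(X, Y, sigma, alpha=0.05, n_perm=200, seed=0):
    """Empirical (1 - alpha)-quantile of mmd under random relabelings (a proxy for H0)."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m = len(X)
    stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        stats.append(mmd(Z[idx[:m]], Z[idx[m:]], sigma))
    return np.quantile(stats, 1.0 - alpha)

eps = permutation_threshold(X, Y, sigma=1.0)
print("reject H0" if mmd(X, Y, sigma=1.0) > eps else "accept H0")
```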
Experiments
◮ q = N(0, σq²).
◮ p(x) = q(x)(1 + sin(νx)), sampled below by rejection.
[Figure: the density q(x) (ν = 0), and the perturbed densities p(x) for ν = 2 and ν = 7.5.]
◮ k(x, y) = exp(−(x − y)²/σ).
◮ Test statistics: γ(Pm, Qm) and γk(Pm, Qm) for various σ.
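Since 0 ≤ 1 + sin(νx) ≤ 2, one can sample from p exactly by rejection from q, accepting a draw x ∼ q with probability (1 + sin(νx))/2. A minimal sketch (the sample size and parameter values are illustrative):

```python
def sample_p(n, nu, sigma_q=1.0, seed=0):
    """Draw n samples from p(x) = q(x)(1 + sin(nu * x)) by rejection from q = N(0, sigma_q^2)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x = rng.normal(0.0, sigma_q)
        if rng.uniform() < 0.5 * (1.0 + np.sin(nu * x)):  # accept w.p. p(x) / (2 q(x))
            out.append(x)
    return np.array(out).reshape(-1, 1)

Xp = sample_p(500, nu=2.0)                                     # sample from p
Xq = np.random.default_rng(1).normal(0.0, 1.0, size=(500, 1))  # sample from q
```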
Experiments
γ(P, Q)
[Figure: Type-I and Type-II error (in %) of the test based on γ(Pm, Qm), as a function of ν ∈ {0.5, 0.75, 1, 1.25, 1.5}.]
Experiments
γk(P, Q)
[Figure: Type-I error (left) and Type-II error (right), in %, of the test based on γk(Pm, Qm) as a function of log σ, for ν ∈ {0.5, 0.75, 1.0, 1.25, 1.5}.]
Outline
◮ Characterization of characteristic kernels (visit poster!)
◮ Choice of characteristic kernels
◮ Characteristic kernels and binary classification
γk and Parzen Window Classifier
Let
◮ RKHS (H, k): k measurable and bounded.
◮ Fk = {f : ‖f‖H ≤ 1}.
◮ P, Q : class-conditional distributions.
◮ RFk : Bayes risk of a classifier in Fk.
Then

γk(P, Q) = −RFk.

◮ The MMD between the class conditionals P and Q is the negative of the Bayes risk associated with a Parzen window classifier (sketched below).
◮ Characteristic k is important.
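The Parzen window classifier in question compares the two kernel window scores and takes the sign. A minimal sketch reusing gaussian_gram and the samples X, Y from the first block (the +1/−1 label convention is illustrative):

```python
def parzen_classify(T, X, Y, sigma):
    """Label each row of T by the sign of (1/m) sum_i k(t, X_i) - (1/n) sum_j k(t, Y_j)."""
    f = gaussian_gram(T, X, sigma).mean(axis=1) - gaussian_gram(T, Y, sigma).mean(axis=1)
    return np.where(f >= 0, 1, -1)  # +1 -> class P, -1 -> class Q

labels = parzen_classify(np.vstack([X, Y]), X, Y, sigma=1.0)
print((labels[:len(X)] == 1).mean(), (labels[len(X):] == -1).mean())  # per-class accuracy
```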
γk and Support Vector Machine
◮ RKHS (H, k): k measurable and bounded.
◮ Let fsvm be the solution to the program

inf_{f∈H} ‖f‖H  s.t.  Yi f(Xi) ≥ 1, ∀ i.

If k is characteristic, then

1/‖fsvm‖H ≤ (1/2) γk(Pm, Qn).
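A numerical sanity check of this margin bound (not from the slides: scikit-learn's soft-margin SVC with a very large C stands in for the hard-margin program, and it additionally fits a bias term, so treat this as an approximation under those assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(-2.0, 1.0, size=(100, 1))  # class P, label +1
Y = rng.normal(+2.0, 1.0, size=(100, 1))  # class Q, label -1
Z, lab = np.vstack([X, Y]), np.r_[np.ones(100), -np.ones(100)]

g = 0.5  # rbf_kernel computes exp(-g * ||x - y||^2)
clf = SVC(kernel="rbf", gamma=g, C=1e8).fit(Z, lab)  # large C ~ hard margin

# ||f_svm||_H^2 = a^T K a, with a the signed dual coefficients on the support vectors
a = clf.dual_coef_.ravel()
norm_f = np.sqrt(a @ rbf_kernel(Z[clf.support_], gamma=g) @ a)

Kxx, Kyy, Kxy = rbf_kernel(X, gamma=g), rbf_kernel(Y, gamma=g), rbf_kernel(X, Y, gamma=g)
mmd_k = np.sqrt(Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean())
print(1.0 / norm_f, "<=", 0.5 * mmd_k)  # margin vs. half the empirical MMD
```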
Achievability of Bayes Risk
◮ G⋆ : set of all real-valued measurable functions on M.
◮ (H, k) : RKHS with measurable and bounded k.
◮ Achievability of the Bayes risk:

inf_{g∈H} R(g) = inf_{g∈G⋆} R(g). (⋆⋆)
Under some technical conditions,
◮ (⋆⋆) ⇒ k is characteristic.
◮ Suppose 1 ∈ H. Then k characteristic ⇒ (⋆⋆).
Summary
◮ Characteristic kernel
◮ A class of kernels that characterize the probability measure associated with a random variable.
◮ MMD is a metric.
◮ How to choose characteristic kernels in practice?
◮ Generalized MMD.
◮ Performs better than MMD in a two-sample test.
◮ Characteristic kernels are important in binary classification.
◮ Parzen window classifier and hard-margin SVM.
◮ Achievability of Bayes risk.
Thank You
References
◮ Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA. MIT Press.
◮ Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 473–480.
◮ Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two sample problem. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press.
◮ Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443.
◮ Srebro, N. and Ben-David, S. (2006). Learning bounds for support vector machines with learned kernels. In Lugosi, G. and Simon, H. U., editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169–183.
◮ Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122.
◮ Ying, Y. and Campbell, C. (2009). Generalization bounds for learning the kernel. In Proc. of the 22nd Annual Conference on Learning Theory.