Kernel Choice and Classifiability for RKHS
Embeddings of Probability Distributions
Bharath K. Sriperumbudur⋆, Kenji Fukumizu†, Arthur Gretton‡,×, Gert R. G. Lanckriet⋆ and Bernhard Schölkopf×
⋆UC San Diego  †The Institute of Statistical Mathematics  ‡CMU  ×MPI for Biological Cybernetics
NIPS 2009
RKHS Embeddings of Probability Measures
◮ Input space: X
◮ Feature space: H
◮ Feature map: Φ : X → H, x ↦ Φ(x).
Extension to probability measures:
P ↦ Φ(P)
Distance between P and Q:
γ(P, Q) = ‖Φ(P) − Φ(Q)‖H.
Applications
Two-sample problem:
◮ Given random samples {X1, …, Xm} and {Y1, …, Yn} drawn i.i.d. from P and Q, respectively.
◮ Determine: are P and Q different?
◮ γ(P, Q) : distance metric between P and Q.
H0 : P = Q  ≡  H0 : γ(P, Q) = 0
H1 : P ≠ Q  ≡  H1 : γ(P, Q) > 0
◮ Test: Say H0 if γ(P, Q) < ε. Otherwise say H1.
Applications
◮ Hypothesis testing
◮ Testing for independence and conditional independence
◮ Goodness of fit test
◮ Density estimation : quality of the estimate, convergence results.
◮ Central limit theorems
◮ Information theory
Popular examples:
◮ Kullback-Leibler divergence
◮ Total-variation distance (metric)
◮ Hellinger distance
◮ χ2-distance
The above examples are special instances of Csiszár's φ-divergence.
Integral Probability Metrics
◮ The integral probability metric [Müller, 1997] between P and Q is defined as

γF(P, Q) = sup_{f∈F} |EPf − EQf|.
◮ Many popular probability metrics can be obtained by appropriately choosing F:
◮ Total variation distance: F = {f : ‖f‖∞ ≤ 1}.
◮ Wasserstein distance: F = {f : ‖f‖L ≤ 1}, where ‖f‖L is the Lipschitz seminorm.
◮ Dudley metric: F = {f : ‖f‖L + ‖f‖∞ ≤ 1}.
◮ Well studied in statistics and probability theory.
F is a Reproducing Kernel Hilbert Space
◮ H : reproducing kernel Hilbert space (RKHS).
◮ k : measurable, bounded, real-valued reproducing kernel.
◮ F : the unit ball in H, i.e., F = {f : ‖f‖H ≤ 1}.
Maximum mean discrepancy (MMD) [Gretton et al., 2007]:

γk(P, Q) := γF(P, Q) = ‖EPk − EQk‖H,

where ‖·‖H denotes the RKHS norm.
RKHS embedding of probability measures:
P ↦ EPk =: Φ(P).
Advantages
◮ Easy to compute γk, unlike γF for other choices of F (see the sketch below).
◮ If k is measurable and bounded, γk(Pm, Qn) is a √(mn/(m + n))-consistent estimator of γk(P, Q) [Gretton et al., 2007].
◮ If k is translation invariant on Rd, the rate is independent of d.
◮ Easy to handle structured domains like graphs and strings.
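To make the first point concrete, here is a minimal NumPy sketch (not from the slides; the function names, bandwidth, and sample distributions are illustrative) of the biased empirical estimate of γk for a fixed Gaussian kernel, computed directly from Gram matrices:

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd(X, Y, sigma):
    """Biased (V-statistic) estimate of gamma_k(P_m, Q_n) for samples X ~ P, Y ~ Q."""
    m, n = len(X), len(Y)
    val = (gaussian_gram(X, X, sigma).sum() / m ** 2
           + gaussian_gram(Y, Y, sigma).sum() / n ** 2
           - 2.0 * gaussian_gram(X, Y, sigma).sum() / (m * n))
    return np.sqrt(max(val, 0.0))  # clip round-off noise below zero

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))  # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 1))  # sample from Q
print(mmd(X, Y, sigma=1.0))
```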
Characteristic Kernels
When is γk a metric?
γk(P, Q) = 0 ⇔ EPk = EQk ⇔ P = Q.
Define: k is characteristic if
EPk = EQk ⇔ P = Q.
◮ Not all kernels are characteristic, e.g., the linear kernel k(x, y) = xᵀy, for which

γk(P, Q) = ‖µP − µQ‖2,

so any two distributions with the same mean are indistinguishable.
◮ When is k characteristic? [Gretton et al., 2007, Sriperumbudur et al., 2008, Fukumizu et al., 2008, Fukumizu et al., 2009].
Outline
◮ Characterization of characteristic kernels (visit poster!)
◮ Choice of characteristic kernels
◮ Characteristic kernels and binary classification
Choice of Characteristic Kernels
Examples: Gaussian, Laplacian, B2l+1-splines, Poisson kernel, etc.
Suppose k is a Gaussian kernel, kσ(x, y) = exp(−‖x − y‖₂² / (2σ²)).
◮ γkσ is a function of σ, so {γkσ : σ ∈ R₊} is a family of metrics. Which one do we use in practice?
◮ Note that γkσ(P, Q) → 0 as σ → 0 or σ → ∞.
◮ Define

γ(P, Q) = sup_{σ∈R₊} γkσ(P, Q).
Classes of Characteristic Kernels
Generalized MMD:

γ(P, Q) := sup_{k∈K} γk(P, Q).
Examples of K:
◮ Kg := {e^{−σ‖x−y‖₂²} : x, y ∈ Rd, σ ∈ R₊}.
◮ Krbf := {∫₀^∞ e^{−λ‖x−y‖₂²} dµσ(λ) : x, y ∈ Rd, µσ ∈ M⁺, σ ∈ Σ ⊂ Rd}, where M⁺ is the set of finite nonnegative Borel measures µσ on R₊ that are not concentrated at zero.
◮ Klin := {kλ = ∑_{i=1}^l λi ki : kλ is pd, ∑_{i=1}^l λi = 1}.
◮ Kcon := {kλ = ∑_{i=1}^l λi ki : λi ≥ 0, ∑_{i=1}^l λi = 1}.
Computation
γ(P, Q) = sup_{k∈K} [ ∫∫ k(x, y) dP(x) dP(y) + ∫∫ k(x, y) dQ(x) dQ(y) − 2 ∫∫ k(x, y) dP(x) dQ(y) ]^{1/2}.
◮ Suppose {Xi}_{i=1}^m i.i.d.∼ P and {Yi}_{i=1}^n i.i.d.∼ Q.
◮ Let Pm := (1/m) ∑_{i=1}^m δXi and Qn := (1/n) ∑_{i=1}^n δYi, where δx denotes the Dirac measure at x.
◮ The empirical estimate of γ(P, Q):

γ(Pm, Qn) = sup_{k∈K} [ (1/m²) ∑_{i,j=1}^m k(Xi, Xj) + (1/n²) ∑_{i,j=1}^n k(Yi, Yj) − (2/mn) ∑_{i=1}^m ∑_{j=1}^n k(Xi, Yj) ]^{1/2}.
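For the Gaussian family Kg, the supremum over kernels can be approximated on a finite bandwidth grid. A minimal sketch reusing mmd and the samples X, Y from the earlier block (the grid is an illustrative stand-in for the exact supremum treated in the paper):

```python
def generalized_mmd(X, Y, sigmas):
    """Approximate gamma(P_m, Q_n) = sup over the Gaussian family by a grid over sigma."""
    return max(mmd(X, Y, s) for s in sigmas)

sigmas = np.logspace(-2, 2, 25)  # finite proxy for sigma ranging over (0, infinity)
print(generalized_mmd(X, Y, sigmas))
```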
Question
◮ When is γ a metric?
◮ Answer: if at least one k ∈ K is characteristic, then γ is a metric (since γ(P, Q) ≥ γk(P, Q) for every k ∈ K).
Question
◮ For a fixed k that is measurable and bounded, [Gretton et al., 2007] have shown that

|γk(Pm, Qn) − γk(P, Q)| = O(√((m + n)/mn)).
◮ When does γ(Pm, Qn) → γ(P, Q) a.s.? What is the rate of convergence?
Statistical Consistency: Result
Theorem
For any K with ν := sup_{k∈K, x∈M} k(x, x) < ∞, the following holds with probability at least 1 − δ:

|γ(Pm, Qn) − γ(P, Q)| ≤ √(8Um(K)/m) + √(8Un(K)/n) + (√(8ν) + √(36ν log(4/δ))) √((m + n)/mn),

where

Um(K) := E[ sup_{k∈K} |(1/m) ∑_{i<j} ρi ρj k(Xi, Xj)| ∣ X1, …, Xm ]

is the Rademacher chaos complexity and the ρi are Rademacher random variables.
Statistical Consistency: Result
Proposition
Suppose K is a VC-subgraph class. Then

|γ(Pm, Qn) − γ(P, Q)| = O(√((m + n)/mn)).

In addition, γ(Pm, Qn) → γ(P, Q) a.s.
Examples: [Ying and Campbell, 2009, Srebro and Ben-David, 2006]
◮ Kg , Krbf , Klin, Kcon, etc.
The Two-Sample Problem
◮ Given: {X1, …, Xm} i.i.d.∼ P and {Y1, …, Yn} i.i.d.∼ Q.
◮ Determine: are P and Q different?
◮ γ(P, Q) : distance metric between P and Q.
H0 : P = Q  ≡  H0 : γ(P, Q) = 0
H1 : P ≠ Q  ≡  H1 : γ(P, Q) > 0
◮ Test: Say H0 if γ(P, Q) < ε. Otherwise say H1.
◮ Good test: low Type-II error at a user-specified Type-I error (one way to calibrate ε is sketched below).
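The slides leave ε unspecified. A standard way to calibrate it at a target Type-I error α (an illustrative recipe, not spelled out in the slides) is a permutation test: recompute the statistic on random relabelings of the pooled sample and take the empirical (1 − α)-quantile as ε. A sketch, reusing mmd and the samples X, Y from above:

```python
def permutation_threshold(X, Y, sigma, alpha=0.05, n_perm=200, seed=0):
    """Empirical (1 - alpha)-quantile of mmd under random relabelings (a proxy for H0)."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m = len(X)
    stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        stats.append(mmd(Z[idx[:m]], Z[idx[m:]], sigma))
    return np.quantile(stats, 1.0 - alpha)

eps = permutation_threshold(X, Y, sigma=1.0)
print("reject H0" if mmd(X, Y, sigma=1.0) > eps else "accept H0")
```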
Experiments
◮ q = N(0, σq²).
◮ p(x) = q(x)(1 + sin(νx)), sampled below by rejection.
[Figure: the density q(x) (ν = 0), and the perturbed densities p(x) for ν = 2 and ν = 7.5.]
◮ k(x, y) = exp(−(x − y)²/σ).
◮ Test statistics: γ(Pm, Qm) and γk(Pm, Qm) for various σ.
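Since 0 ≤ 1 + sin(νx) ≤ 2, one can sample from p exactly by rejection from q, accepting a draw x ∼ q with probability (1 + sin(νx))/2. A minimal sketch (the sample size and parameter values are illustrative):

```python
def sample_p(n, nu, sigma_q=1.0, seed=0):
    """Draw n samples from p(x) = q(x)(1 + sin(nu * x)) by rejection from q = N(0, sigma_q^2)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x = rng.normal(0.0, sigma_q)
        if rng.uniform() < 0.5 * (1.0 + np.sin(nu * x)):  # accept w.p. p(x) / (2 q(x))
            out.append(x)
    return np.array(out).reshape(-1, 1)

Xp = sample_p(500, nu=2.0)                                     # sample from p
Xq = np.random.default_rng(1).normal(0.0, 1.0, size=(500, 1))  # sample from q
```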
Experiments
γ(P, Q)
[Figure: Type-I and Type-II error (in %) of the test based on γ(Pm, Qm), as a function of ν ∈ {0.5, 0.75, 1, 1.25, 1.5}.]
Experiments
γk(P, Q)
[Figure: Type-I error (left) and Type-II error (right), in %, of the test based on γk(Pm, Qm) as a function of log σ, for ν ∈ {0.5, 0.75, 1.0, 1.25, 1.5}.]
Outline
◮ Characterization of characteristic kernels (visit poster!)
◮ Choice of characteristic kernels
◮ Characteristic kernels and binary classification
γk and Parzen Window Classifier
Let
◮ RKHS (H, k): k measurable and bounded.
◮ Fk = {f : ‖f‖H ≤ 1}.
◮ P, Q : class-conditional distributions.
◮ RFk : Bayes risk of a classifier in Fk.
Then

γk(P, Q) = −RFk.

◮ The MMD between the class conditionals P and Q is the negative of the Bayes risk associated with a Parzen window classifier (sketched below).
◮ Characteristic k is important.
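The Parzen window classifier in question compares the two kernel window scores and takes the sign. A minimal sketch reusing gaussian_gram and the samples X, Y from the first block (the +1/−1 label convention is illustrative):

```python
def parzen_classify(T, X, Y, sigma):
    """Label each row of T by the sign of (1/m) sum_i k(t, X_i) - (1/n) sum_j k(t, Y_j)."""
    f = gaussian_gram(T, X, sigma).mean(axis=1) - gaussian_gram(T, Y, sigma).mean(axis=1)
    return np.where(f >= 0, 1, -1)  # +1 -> class P, -1 -> class Q

labels = parzen_classify(np.vstack([X, Y]), X, Y, sigma=1.0)
print((labels[:len(X)] == 1).mean(), (labels[len(X):] == -1).mean())  # per-class accuracy
```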
γk and Support Vector Machine
◮ RKHS (H, k): k measurable and bounded.
◮ Let fsvm be the solution to the program

inf_{f∈H} ‖f‖H  s.t.  Yi f(Xi) ≥ 1, ∀ i.

If k is characteristic, then

1/‖fsvm‖H ≤ (1/2) γk(Pm, Qn).
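A numerical sanity check of this margin bound (not from the slides: scikit-learn's soft-margin SVC with a very large C stands in for the hard-margin program, and it additionally fits a bias term, so treat this as an approximation under those assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(-2.0, 1.0, size=(100, 1))  # class P, label +1
Y = rng.normal(+2.0, 1.0, size=(100, 1))  # class Q, label -1
Z, lab = np.vstack([X, Y]), np.r_[np.ones(100), -np.ones(100)]

g = 0.5  # rbf_kernel computes exp(-g * ||x - y||^2)
clf = SVC(kernel="rbf", gamma=g, C=1e8).fit(Z, lab)  # large C ~ hard margin

# ||f_svm||_H^2 = a^T K a, with a the signed dual coefficients on the support vectors
a = clf.dual_coef_.ravel()
norm_f = np.sqrt(a @ rbf_kernel(Z[clf.support_], gamma=g) @ a)

Kxx, Kyy, Kxy = rbf_kernel(X, gamma=g), rbf_kernel(Y, gamma=g), rbf_kernel(X, Y, gamma=g)
mmd_k = np.sqrt(Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean())
print(1.0 / norm_f, "<=", 0.5 * mmd_k)  # margin vs. half the empirical MMD
```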
Achievability of Bayes Risk
◮ G⋆ : set of all real-valued measurable functions on M.
◮ (H, k) : RKHS with measurable and bounded k.
◮ Achievability of the Bayes risk:

inf_{g∈H} R(g) = inf_{g∈G⋆} R(g). (⋆⋆)
Under some technical conditions,
◮ (⋆⋆) ⇒ k is characteristic.
◮ Suppose 1 ∈ H. Then k characteristic ⇒ (⋆⋆).
Summary
◮ Characteristic kernel
◮ A class of kernels that characterize the probability measure associated with a random variable.
◮ MMD is a metric.
◮ How to choose characteristic kernels in practice?
◮ Generalized MMD.
◮ Performs better than MMD in a two-sample test.
◮ Characteristic kernels are important in binary classification.
◮ Parzen window classifier and hard-margin SVM.
◮ Achievability of Bayes risk.
Thank You
References
◮ Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA. MIT Press.
◮ Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 473–480.
◮ Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two sample problem. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press.
◮ Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443.
◮ Srebro, N. and Ben-David, S. (2006). Learning bounds for support vector machines with learned kernels. In Lugosi, G. and Simon, H. U., editors, Proc. of the 19th Annual Conference on Learning Theory, pages 169–183.
◮ Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122.
◮ Ying, Y. and Campbell, C. (2009). Generalization bounds for learning the kernel. In Proc. of the 22nd Annual Conference on Learning Theory.