PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additive Model

Taiji Suzuki
The University of Tokyo, Department of Mathematical Informatics

Assistant Professors' Seminar (助教の会), 2012


Taiji Suzuki: PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additive Model. Conference on Learning Theory (COLT 2012), JMLR Workshop and Conference Proceedings 23, pp. 8.1–8.20, 2012.


Non-parametric Regression

(Figure: example data for non-parametric regression; horizontal axis [0, 1], vertical axis roughly −1 to 1.5.)

Sparse Additive Model

Problem Setting

$$y_i = f^o(x_i) + \xi_i, \quad (i = 1, \ldots, n),$$

where $f^o$ is the true function such that $\mathrm{E}[Y|X] = f^o(X)$.

$f^o$ is well approximated by a function $f^*$ with a sparse representation:

$$f^o(x) \simeq f^*(x) = \sum_{m=1}^{M} f^*_m(x^{(m)}),$$

where only a few components of $\{f^*_m\}_{m=1}^{M}$ are non-zero.


Multiple Kernel Learning

$$\min_{f_m \in \mathcal{H}_m} \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{m=1}^{M} f_m(x_i^{(m)})\Big)^2 + C\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m}$$

($\mathcal{H}_m$: Reproducing Kernel Hilbert Space (RKHS), explained later)

Extension of Group Lasso: each group is infinite dimensional.
Sparse solution.
Reduced to a finite dimensional optimization problem by the representer theorem (Sonnenburg et al., 2006; Rakotomamonjy et al., 2008; Suzuki & Tomioka, 2009).

Various types of regularization

L1-MKL (Lanckriet et al., 2004; Bach et al., 2004): sparse

$$\min_{f_m \in \mathcal{H}_m} L\Big(\sum_{m=1}^{M} f_m\Big) + C\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m}$$

L2-MKL: dense

$$\min_{f_m \in \mathcal{H}_m} L\Big(\sum_{m=1}^{M} f_m\Big) + C\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m}^2$$

Elasticnet-MKL (Tomioka & Suzuki, 2009)

$$\min_{f_m \in \mathcal{H}_m} L\Big(\sum_{m=1}^{M} f_m\Big) + C_1\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m} + C_2\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m}^2$$

ℓp-MKL (Kloft et al., 2009)

$$\min_{f_m \in \mathcal{H}_m} L\Big(\sum_{m=1}^{M} f_m\Big) + C_1\Big(\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m}^p\Big)^{2/p}$$

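The following is a minimal numerical sketch (not from the slides; the function and variable names are illustrative) of how these four penalties differ, taking the per-kernel RKHS norms $\|f_m\|_{\mathcal{H}_m}$ as given:

```python
import numpy as np

def mkl_penalties(h_norms, C=1.0, C1=1.0, C2=1.0, p=1.5):
    """Evaluate the four MKL regularizers for a vector of RKHS norms ||f_m||_{H_m}."""
    h = np.asarray(h_norms, dtype=float)
    return {
        "L1-MKL": C * np.sum(h),                              # sparse
        "L2-MKL": C * np.sum(h ** 2),                         # dense
        "Elasticnet-MKL": C1 * np.sum(h) + C2 * np.sum(h ** 2),
        "lp-MKL": C1 * np.sum(h ** p) ** (2.0 / p),
    }

# Example: a sparse norm profile is charged very differently by each penalty.
print(mkl_penalties([1.2, 0.0, 0.0, 0.5]))
```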

Sparsity VS accuracy

Figure: relation between accuracy and sparsity of Elasticnet-MKL for caltech101.

Figure: relation between accuracy and sparsity of ℓp-MKL (Cortes et al., 2009).

Restricted Eigenvalue Condition

To prove a fast convergence rate of MKL, we utilize the following condition, called the restricted eigenvalue condition (Bickel et al., 2009; Koltchinskii & Yuan, 2010; Suzuki, 2011; Suzuki & Sugiyama, 2012).

Restricted Eigenvalue Condition. There exists a constant $C > 0$ such that

$$0 < C < \beta_{\sqrt{d(I_0)}},$$

where

$$\beta_b(I) := \sup\Bigg\{\beta \ge 0 \;\Bigg|\; \beta \le \frac{\big\|\sum_{m=1}^{M} f_m\big\|_{L_2(\Pi)}^2}{\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2}, \quad \forall f \text{ such that } b\sum_{m\in I}\|f_m\|_{L_2(\Pi)} \ge \sum_{m\notin I}\|f_m\|_{L_2(\Pi)}\Bigg\}.$$

Interpretation: the $f_m$ are not totally correlated inside $I_0$ and between $I_0$ and $I_0^c$.

We investigate a Bayesian variant of MKL.

We show a fast learning rate for it without strong conditions on the design such as the restricted eigenvalue condition.


Our research = sparse aggregated estimation + Gaussian process

Aggregated estimator, exponential screening, model averaging (Leung & Barron, 2006; Rigollet & Tsybakov, 2011)

Gaussian process regression (Rasmussen & Williams, 2006; van der Vaart & van Zanten, 2008a; van der Vaart & van Zanten, 2008b; van der Vaart & van Zanten, 2011)

Aggregated Estimator

Assume that $Y \in \mathbb{R}^n$ is generated by the following model:

$$Y = \mu + \epsilon \simeq X\theta_0 + \epsilon,$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$ is an i.i.d. realization of Gaussian noise ($\mathcal{N}(0, \sigma^2)$).

We would like to estimate $\mu$ using a set of models $\mathcal{M}$. For each element $m \in \mathcal{M}$, we have a linear model with dimension $d_m$:

$$X_m \theta_m,$$

where $\theta_m \in \mathbb{R}^{d_m}$ and $X_m \in \mathbb{R}^{n \times d_m}$ consists of a subset of columns of a design matrix. Let $\hat{\theta}_m$ be the least squares estimator on a model $m \in \mathcal{M}$:

$$\hat{\theta}_m = (X_m^\top X_m)^\dagger X_m^\top Y.$$

Let $\hat{r}_m$ be an unbiased estimator of the risk $r_m = \mathrm{E}\|X_m\hat{\theta}_m - \mu\|^2$:

$$\hat{r}_m = \|Y - X_m\hat{\theta}_m\|^2 + \sigma^2(2d_m - n)$$

(we can show that $\mathrm{E}\hat{r}_m = r_m$). Define the aggregated estimator as

$$\hat{\mu} = \frac{\sum_{m\in\mathcal{M}} \pi_m \exp(-\hat{r}_m/4)\, X_m\hat{\theta}_m}{\sum_{m'\in\mathcal{M}} \pi_{m'} \exp(-\hat{r}_{m'}/4)},$$

a.k.a. the Akaike weight (Akaike, 1978; Akaike, 1979).

Theorem (Risk bound of the aggregated estimator (Leung & Barron, 2006)).

$$\mathrm{E}\Big[\frac{1}{n}\|\hat{\mu} - \mu\|^2\Big] \le \min_{m\in\mathcal{M}}\Big\{\frac{r_m}{n} + \frac{4}{n}\log(1/\pi_m)\Big\}.$$
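A minimal sketch of this aggregation, under illustrative assumptions (small $M$ so all column subsets can be enumerated, a uniform prior $\pi_m$, known noise level, and the temperature $1/4$ exactly as written above):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, M, sigma = 50, 6, 0.5
X = rng.standard_normal((n, M))
mu = X[:, :2] @ np.array([1.0, -2.0])            # true mean uses only 2 columns
Y = mu + sigma * rng.standard_normal(n)

# One sub-model per non-empty subset of columns.
models = [c for r in range(1, M + 1) for c in combinations(range(M), r)]
fits, risks = [], []
for cols in models:
    Xm = X[:, cols]
    theta_m = np.linalg.pinv(Xm) @ Y             # least squares on sub-model m
    fit = Xm @ theta_m
    # Unbiased risk estimate: ||Y - X_m theta_m||^2 + sigma^2 (2 d_m - n).
    risks.append(np.sum((Y - fit) ** 2) + sigma ** 2 * (2 * len(cols) - n))
    fits.append(fit)

r = np.array(risks)
w = np.exp(-(r - r.min()) / 4.0)                 # exponential (Akaike-type) weights
w /= w.sum()                                     # uniform pi_m cancels out
mu_hat = w @ np.array(fits)                      # aggregated estimator of mu
print("1/n ||mu_hat - mu||^2 =", np.mean((mu_hat - mu) ** 2))
```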

Application of the risk bound to sparse estimation (Rigollet & Tsybakov, 2011)

Let $X_{i,m} = x_i^{(m)}$ $(i = 1, \ldots, n,\ m = 1, \ldots, M)$. Let $\mathcal{M}$ be all subsets of $[M] = \{1, \ldots, M\}$, and for $I \subseteq [M]$

$$\pi_I := \begin{cases} \dfrac{1}{C}\Big(\dfrac{|I|}{2eM}\Big)^{|I|}, & \text{if } |I| \le R,\\[6pt] \dfrac{1}{2}, & \text{if } |I| = M,\\[6pt] 0, & \text{otherwise}, \end{cases}$$

where $R = \mathrm{rank}(X)$ and $C$ is a normalization constant. Under this setting, we call the aggregated estimator $\hat{\theta}$ the exponential screening estimator, denoted by $\hat{\theta}^{ES}$.

Theorem (Rigollet and Tsybakov (2011)).

Assume that $\|X_{:,m}\|_n \le 1$. Then we have

$$\mathrm{E}\big[\|X\hat{\theta}^{ES} - \mu\|_n^2\big] \le \min_{\theta\in\mathbb{R}^M}\Bigg\{\|X\theta - \mu\|_n^2 + C\Bigg[\frac{R}{n}\wedge\frac{\|\theta\|_0}{n}\log\Big(1 + \frac{eM}{\|\theta\|_0\vee 1}\Big)\Bigg]\Bigg\}.$$

Note that there is no assumption on the condition of the design $X$!

Moreover, we have the following.

Corollary (Rigollet and Tsybakov (2011)).

$$\mathrm{E}\big[\|X\hat{\theta}^{ES} - \mu\|_n^2\big] \le \min_{\theta\in\mathbb{R}^M}\big\{\|X\theta - \mu\|_n^2 + \varphi_{n,M}(\theta)\big\} + C''\,\frac{\log(M)}{n},$$

where $\varphi_{n,M}(\theta) = C'\Big[\frac{R}{n}\wedge\frac{\|\theta\|_0}{n}\log\big(1 + \frac{eM}{\|\theta\|_0\vee 1}\big)\wedge\frac{\|\theta\|_{\ell_1}}{\sqrt{n}}\sqrt{\log\big(1 + \frac{CM}{\|\theta\|_{\ell_1}\sqrt{n}}\big)}\Big]$.


Comparison with BIC type estimator

$$\hat{\theta}^{BIC} := \mathop{\mathrm{argmin}}_{\theta\in\mathbb{R}^M}\Bigg\{\|Y - X\theta\|_n^2 + \frac{2\sigma^2\|\theta\|_0}{n}\Big(1 + \frac{2+a}{1+a}\sqrt{L(\theta)} + \frac{1+a}{a}L(\theta)\Big)\Bigg\},$$

where $L(\theta) = 2\log\big(\frac{eM}{\|\theta\|_0\vee 1}\big)$ and $a > 0$ is a parameter.

Theorem (Bunea et al. (2007), Rigollet and Tsybakov (2011)).

Assume that $\|X_{:,m}\|_n \le 1$. Then we have

$$\mathrm{E}\big[\|X\hat{\theta}^{BIC} - \mu\|_n^2\big] \le (1+a)\min_{\theta\in\mathbb{R}^M}\Bigg\{\|X\theta - \mu\|_n^2 + C\,\frac{1+a}{a}\,\varphi_{n,M}(\theta)\Bigg\} + \frac{C\sigma^2}{n}.$$

The exponential screening estimator gives a smaller bound.

Comparison with Lasso estimator

$$\hat{\theta}^{Lasso} := \mathop{\mathrm{argmin}}_{\theta\in\mathbb{R}^M}\Bigg\{\|Y - X\theta\|_n^2 + \frac{\lambda\|\theta\|_{\ell_1}}{n}\Bigg\}.$$

Theorem (Rigollet and Tsybakov (2011)).

Assume that $\|X_{:,m}\|_n \le 1$. Then, if we set $\lambda = A\sigma\sqrt{\frac{\log(M)}{n}}$, we have

$$\mathrm{E}\big[\|X\hat{\theta}^{Lasso} - \mu\|_n^2\big] \le \min_{\theta\in\mathbb{R}^M}\Bigg\{\|X\theta - \mu\|_n^2 + C\,\frac{\|\theta\|_{\ell_1}}{\sqrt{n}}\sqrt{\log(M)}\Bigg\}.$$

The exponential screening estimator gives a smaller bound.

If the restricted eigenvalue condition is satisfied, then

$$\mathrm{E}\big[\|X\hat{\theta}^{Lasso} - \mu\|_n^2\big] \le (1+a)\min_{\theta\in\mathbb{R}^M}\Bigg\{\|X\theta - \mu\|_n^2 + \frac{C(1+a)}{a}\,\frac{\|\theta\|_0\log(M)}{n}\Bigg\}.$$


Gaussian Process Regression

Bayes' theorem

Statistical model: $\{p(x|\theta) \mid \theta \in \Theta\}$
Prior distribution: $\pi(\theta)$
Likelihood: $p(D_n|\theta) = \prod_{i=1}^{n} p(x_i|\theta)$
Posterior distribution (Bayes' theorem):

$$p(\theta|D_n) = \frac{p(D_n|\theta)\,\pi(\theta)}{\int p(D_n|\theta)\,\pi(\theta)\,d\theta}.$$

Bayes estimator:

$$\hat{g} := \mathop{\mathrm{argmin}}_{g}\int \mathrm{E}_{D_n:\theta}\big[R(\theta, g(D_n))\big]\,\pi(\theta)\,d\theta = \mathop{\mathrm{argmin}}_{g}\int\Big(\int R(\theta, g(D_n))\,p(\theta|D_n)\,d\theta\Big)\,p(D_n)\,dD_n$$

If $R(\theta, \theta') = \|\theta - \theta'\|^2$, then $\hat{g}(D_n) = \int \theta\, p(\theta|D_n)\,d\theta$.

(Figure: an example of Gaussian process regression; horizontal axis [0, 1], vertical axis roughly −6 to 8.)

A Gaussian process prior is a prior on a function $f = (f(x) : x \in \mathcal{X})$.

$$f \sim GP$$

means that every finite subset $(f(x_1), f(x_2), \ldots, f(x_j))$ $(j = 1, 2, \ldots)$ obeys a zero-mean multivariate normal distribution. We assume that $\sup_x |f(x)| < \infty$ and $f : \Omega \to \ell^\infty(\mathcal{X})$ is tight and Borel measurable.

Kernel function:

$$k(x, x') := \mathrm{E}_{f\sim GP}\big[f(x)f(x')\big].$$

Examples:

Linear kernel: $k(x, x') = x^\top x'$.
Gaussian kernel: $k(x, x') = \exp(-\|x - x'\|^2/2\sigma^2)$.
Polynomial kernel: $k(x, x') = (1 + x^\top x')^d$.

Gaussian Process Prior

(Figure: sample paths drawn from the Gaussian process prior on [0, 1]; panel (a) Gaussian kernel, panel (b) linear kernel.)
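A minimal sketch (an illustration, not from the slides; the kernel bandwidth and grid are arbitrary choices) of drawing such prior sample paths:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)

def gaussian_kernel(x, sigma=0.1):
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def linear_kernel(x):
    return np.outer(x, x)

for name, K in [("Gaussian", gaussian_kernel(x)), ("linear", linear_kernel(x))]:
    jitter = 1e-8 * np.eye(len(x))               # numerical stabilizer for the covariance
    paths = rng.multivariate_normal(np.zeros(len(x)), K + jitter, size=3)
    print(name, "kernel: drew", paths.shape[0], "sample paths of length", paths.shape[1])
```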

Estimation

Suppose $\{y_i\}_{i=1}^{n}$ are generated from the following model:

$$y_i = f^o(x_i) + \xi_i,$$

where $\xi_i$ is i.i.d. Gaussian noise $\mathcal{N}(0, \sigma^2)$.

Posterior distribution: let $\mathbf{f} = (f(x_1), \ldots, f(x_n))$, then

$$p(\mathbf{f}|D_n) = \frac{1}{C}\exp\Big(-n\,\frac{\|\mathbf{f} - Y\|_n^2}{2\sigma^2}\Big)\exp\Big(-\frac{1}{2}\mathbf{f}^\top K^{-1}\mathbf{f}\Big) = \frac{1}{C}\exp\Big(-\frac{1}{2}\big\|\mathbf{f} - (K + \sigma^2 I_n)^{-1}KY\big\|^2_{(K^{-1} + I_n/\sigma^2)}\Big),$$

where $K \in \mathbb{R}^{n\times n}$ is the Gram matrix ($K_{i,j} = k(x_i, x_j)$).

Posterior mean: $\bar{\mathbf{f}} = (K + \sigma^2 I_n)^{-1}KY$.
Posterior covariance: $K - K(K + \sigma^2 I_n)^{-1}K$.
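A minimal sketch of these closed-form posterior formulas on the training inputs (illustrative data and kernel, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 30, 0.1
x = np.sort(rng.uniform(0.0, 1.0, n))
f_true = np.sin(2 * np.pi * x)                   # an illustrative true function f^o
Y = f_true + sigma * rng.standard_normal(n)

d = x[:, None] - x[None, :]
K = np.exp(-d ** 2 / (2 * 0.1 ** 2))             # Gaussian-kernel Gram matrix

A = np.linalg.solve(K + sigma ** 2 * np.eye(n), K)   # (K + sigma^2 I_n)^{-1} K
posterior_mean = A @ Y                                # (K + sigma^2 I_n)^{-1} K Y
posterior_cov = K - K @ A                             # K - K (K + sigma^2 I_n)^{-1} K
print("training error of posterior mean:", np.mean((posterior_mean - f_true) ** 2))
```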

Gaussian Process Posterior

(Figure: panel (c) training data, panel (d) a posterior sample; both on [0, 1].)


Our interest

How fast does the posterior converge around the true function?

Reproducing Kernel Hilbert Space (RKHS)

The kernel function defines the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ as the completion of the linear space spanned by all functions

$$x \mapsto \sum_{i=1}^{I}\alpha_i k(x_i, x), \quad (I = 1, 2, \ldots)$$

relative to the RKHS norm $\|\cdot\|_{\mathcal{H}}$ induced by the inner product

$$\Big\langle \sum_{i=1}^{I}\alpha_i k(x_i, \cdot),\; \sum_{j=1}^{J}\alpha'_j k(x'_j, \cdot)\Big\rangle_{\mathcal{H}} = \sum_{i=1}^{I}\sum_{j=1}^{J}\alpha_i\alpha'_j k(x_i, x'_j).$$

Reproducibility: for $f \in \mathcal{H}$, the function value at $x$ is recovered as

$$f(x) = \langle f, k(x, \cdot)\rangle_{\mathcal{H}}.$$

See van der Vaart and van Zanten (2008b) for more detail.

The work by van der Vaart and van Zanten (2011)

van der Vaart and van Zanten (2011) gave the posterior convergence rate for the Matérn prior.

Matérn prior: for a regularity $\alpha > 0$, we define

$$k(x, x') = \int_{\mathbb{R}^d} e^{i\lambda^\top(x - x')}\psi(\lambda)\,d\lambda,$$

where $\psi : \mathbb{R}^d \to \mathbb{R}$ is the spectral density given by

$$\psi(\lambda) = \frac{1}{(1 + \|\lambda\|^2)^{\alpha + d/2}}.$$

The support is included in a Hölder space $C^{\alpha'}[0,1]^d$ for any $\alpha' < \alpha$.
The RKHS $\mathcal{H}$ is included in a Sobolev space $W^{\alpha+d/2}[0,1]^d$ with regularity $\alpha + d/2$.
For an infinite dimensional RKHS $\mathcal{H}$, the support of the prior is typically much larger than $\mathcal{H}$.

Convergence rate of the posterior: Matérn prior

Let $\bar{f}$ be the posterior mean.

Theorem (van der Vaart and van Zanten (2011)).

Let $f^* \in C^{\beta}[0,1]^d \cap W^{\beta}[0,1]^d$ for $\beta > 0$. Then, for the Matérn prior, we have

$$\mathrm{E}\big[\|\bar{f} - f^*\|_n^2\big] \le O\big(n^{-\frac{\alpha\wedge\beta}{\alpha + d/2}}\big).$$

The optimal rate is $O\big(n^{-\frac{\beta}{\beta + d/2}}\big)$.
The optimal rate is achieved only when $\alpha = \beta$.
The rate $n^{-\frac{\alpha\wedge\beta}{\alpha + d/2}}$ is tight.
→ If $f^* \in \mathcal{H}$ ($\beta = \alpha + d/2$), then the GP does not achieve the optimal rate.
→ Scale mixture is useful (van der Vaart & van Zanten, 2009).

Summary of existing results

GP is optimal only in quite restrictive situations ($\alpha = \beta$).
In particular, if $f^o \in \mathcal{H}$, the optimal rate cannot be achieved.
The analysis was given only for restricted classes such as Sobolev and Hölder classes.


Bayesian-MKL = sparse aggregated estimation + Gaussian process

Condition on design: does not require any special conditions such as the restricted eigenvalue condition.
Optimality: adaptively achieves the optimal rate for a wide class of true functions. In particular, even if $f^o \in \mathcal{H}$, it achieves the optimal rate.
Generality: the analysis is given for a general class of spaces, utilizing the notions of interpolation spaces and metric entropy.

Bayesian MKL

We estimate $f^o$ in a Bayesian manner. Let $f = (f_1, \ldots, f_M)$.

Prior of Bayesian MKL:

$$\Pi(df) = \sum_{J\subseteq\{1,\ldots,M\}} \pi_J \cdot \prod_{m\in J}\int_{\lambda_m\in\mathbb{R}_+} GP_m(df_m|\lambda_m)\,\mathcal{G}(d\lambda_m) \cdot \prod_{m\notin J}\delta_0(df_m)$$

$GP_m(\cdot|\lambda_m)$ with a scale parameter $\lambda_m$ is a scaled Gaussian process corresponding to the kernel function $k_{m,\lambda_m}$, where

$$k_{m,\lambda_m} = \frac{k_m}{\lambda_m},$$

for some fixed kernel function $k_m$.

$\mathcal{G}(\lambda_m) = \exp(-\lambda_m)$ (Gamma distribution: conjugate prior) → scale mixture.
Set all components $f_m$ for $m \notin J$ to 0.
Put a prior $\pi_J$ on each sub-model $J$.


$\pi_J$ is given as

$$\pi_J = \frac{\zeta^{|J|}}{\sum_{j=1}^{M}\zeta^{j}}\binom{M}{|J|}^{-1},$$

with some $\zeta \in (0, 1)$.
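A quick numerical sketch (illustrative, not from the slides) showing that this prior depends only on $|J|$ and sums to one over the non-empty sub-models:

```python
import math

def pi_J(size_J, M, zeta=0.5):
    """Prior weight of a sub-model J with |J| = size_J."""
    normalizer = sum(zeta ** j for j in range(1, M + 1))
    return (zeta ** size_J / normalizer) / math.comb(M, size_J)

M = 8
total = sum(math.comb(M, k) * pi_J(k, M) for k in range(1, M + 1))
print("total prior mass over non-empty J:", total)   # equals 1 by construction
```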

The estimator

The posterior: for some constant $\beta > 0$, the posterior probability measure is given as

$$\Pi(df|D_n) := \frac{\exp\Big(-\frac{\sum_{i=1}^{n}\big(y_i - \sum_{m=1}^{M} f_m(x_i)\big)^2}{\beta}\Big)}{\int \exp\Big(-\frac{\sum_{i=1}^{n}\big(y_i - \sum_{m=1}^{M} f_m(x_i)\big)^2}{\beta}\Big)\,\Pi(df)}\;\Pi(df),$$

for $f = (f_1, \ldots, f_M)$.

The estimator: the Bayesian estimator $\hat{f}$ (Bayesian-MKL estimator) is given as the expectation of the posterior:

$$\hat{f} = \int \sum_{m=1}^{M} f_m\, \Pi(df|y_1, \ldots, y_n).$$

Point

Model averaging
Scale mixture of Gaussian process prior


PAC-Bayesian Bound

Mean Squared Error

We want to bound the mean squared error:

$$\|f^o - \hat{f}\|_n^2 := \frac{1}{n}\sum_{i=1}^{n}\big(f^o(x_i) - \hat{f}(x_i)\big)^2,$$

where $\hat{f}$ is the Bayesian estimator. We utilize a PAC-Bayesian bound.

PAC-Bayesian Bound

Under some conditions on the noise (explained in the next slide), we have the following theorem.

Theorem (Dalalyan and Tsybakov (2008)).

For all probability measures $\rho$, we have

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] \le \int \|f - f^o\|_n^2\,d\rho(f) + \frac{\beta\,\mathcal{K}(\rho, \Pi)}{n},$$

where $\mathcal{K}(\rho, \Pi)$ is the KL-divergence between $\rho$ and $\Pi$:

$$\mathcal{K}(\rho, \Pi) := \int \log\Big(\frac{d\rho}{d\Pi}\Big)\,d\rho.$$

Noise Condition

Let

$$m_\xi(z) := -\mathrm{E}[\xi_1 1\{\xi_1 \le z\}],$$

where $1\{\cdot\}$ is the indicator function. Then we impose the following assumption on $m_\xi$.

Assumption.

$\mathrm{E}[\xi_1^2] < \infty$ and the measure $m_\xi(z)\,dz$ is absolutely continuous with respect to the density function $p_\xi(z)$ with a bounded Radon-Nikodym derivative, i.e., there exists a bounded function $g_\xi : \mathbb{R} \to \mathbb{R}_+$ such that

$$\int_a^b m_\xi(z)\,dz = \int_a^b g_\xi(z)\,p_\xi(z)\,dz, \quad \forall a, b \in \mathbb{R}.$$

The Gaussian noise $\mathcal{N}(0, \sigma^2)$ satisfies the assumption with $g_\xi(z) = \sigma^2$.
The uniform distribution on $[-a, a]$ satisfies the assumption with $g_\xi(z) = \max(a^2 - z^2, 0)/2$.
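For the Gaussian case, the claim $g_\xi(z) = \sigma^2$ can be checked directly (this verification is added here, not on the original slide): with $p_\xi(u) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-u^2/(2\sigma^2)}$ one has $\frac{d}{du}p_\xi(u) = -\frac{u}{\sigma^2}p_\xi(u)$, so

$$m_\xi(z) = -\int_{-\infty}^{z} u\,p_\xi(u)\,du = -\Big[-\sigma^2 p_\xi(u)\Big]_{-\infty}^{z} = \sigma^2 p_\xi(z),$$

and hence $\int_a^b m_\xi(z)\,dz = \int_a^b \sigma^2\,p_\xi(z)\,dz$, i.e., $g_\xi(z) = \sigma^2$.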

Main Result

Concentration Function

We define the concentration function as

$$\phi^{(m)}_{f^*_m}(\epsilon, \lambda_m) := \underbrace{\inf_{h\in\mathcal{H}_m : \|h - f^*_m\|_n \le \epsilon}\|h\|^2_{\mathcal{H}_{m,\lambda_m}}}_{\text{bias}} \;\underbrace{-\; \log GP_m(f : \|f\|_n \le \epsilon\,|\,\lambda_m)}_{\text{variance}}.$$

It is known that $\phi^{(m)}_{f^*_m}(\epsilon, \lambda_m) \sim -\log GP_m(f : \|f^*_m - f\|_n \le \epsilon\,|\,\lambda_m)$ (van der Vaart & van Zanten, 2011; van der Vaart & van Zanten, 2008b).

General Result

Let $I_0 := \{m \mid f^*_m \ne 0\}$, $\tilde{I}_0 := \{m \in I_0 \mid f^*_m \notin \mathcal{H}_m\}$, and $\kappa := \zeta(1 - \zeta)$.

Theorem (Convergence rate of Bayesian-MKL).

The convergence rate of Bayesian-MKL is bounded as

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] \le 2\|f^o - f^*\|_n^2 + C_1\inf_{\epsilon_m,\lambda_m>0}\Bigg[\sum_{m\in I_0}\Big(\epsilon_m^2 + \frac{1}{n}\phi^{(m)}_{f^*_m}(\epsilon_m, \lambda_m) + \frac{\lambda_m}{n} - \frac{\log(\lambda_m)}{n}\Big) + \sum_{\substack{m,m'\in\tilde{I}_0:\\ m\ne m'}}\epsilon_m\epsilon_{m'} + \frac{|I_0|}{n}\log\Big(\frac{Me}{\kappa|I_0|}\Big)\Bigg].$$

Outlook of the theorem

Let $\hat{\epsilon}_m^2 := \inf_{\epsilon_m,\lambda_m>0}\big(\epsilon_m^2 + \frac{1}{n}\phi^{(m)}_{f^*_m}(\epsilon_m, \lambda_m) + \frac{\lambda_m}{n} - \frac{\log(\lambda_m)}{n}\big)$ and suppose $f^o = f^*$. Typically $\hat{\epsilon}_m^2$ achieves the optimal learning rate for single kernel learning.

If $f^*_m \in \mathcal{H}_m$ for all $m$, then we have

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] = O\Bigg[\sum_{m\in I_0}\hat{\epsilon}_m^2 + \frac{|I_0|}{n}\log\Big(\frac{Me}{\kappa|I_0|}\Big)\Bigg].$$

→ Optimal learning rate for MKL. Note that we imposed no condition on the design such as the restricted eigenvalue condition.

If $f^*_m \notin \mathcal{H}_m$ for all $m \in I_0$, then we have

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] = O\Bigg[\Big(\sum_{m\in I_0}\hat{\epsilon}_m\Big)^2 + \frac{|I_0|}{n}\log\Big(\frac{Me}{\kappa|I_0|}\Big)\Bigg].$$

Outline of the Proof

Fix $\epsilon_m, \lambda_m > 0$ arbitrarily. Next we define a "representer" element $h_m \in \mathcal{H}_m$ that is close to $f^*_m$. If $f^*_m \in \mathcal{H}_m$, then set $h_m = f^*_m$. Otherwise, we take $h_m \in \mathcal{H}_{m,\lambda_m}$ such that

$$\|h_m\|^2_{\mathcal{H}_{m,\lambda_m}} \le 2\inf_{h\in\mathcal{H}_m : \|h - f^*_m\|_n \le \epsilon_m}\|h\|^2_{\mathcal{H}_{m,\lambda_m}}.$$

We substitute the following "dummy" posterior into $\rho$:

$$\rho(df) = \prod_{m\in I_0}\int_{\lambda_m/2 \le \lambda \le \lambda_m} \frac{GP_m(df_m - h_m\,|\,\lambda)\,1\{\|f_m - h_m\|_n \le \epsilon_m\}}{GP_m(\Delta f_m : \|\Delta f_m\|_n \le \epsilon_m\,|\,\lambda)}\; \frac{\mathcal{G}(d\lambda)}{\mathcal{G}(\{\lambda : \lambda_m/2 \le \lambda \le \lambda_m\})} \cdot \prod_{m\notin I_0}\delta_0(df_m).$$

One can show that the KL-divergence between $\rho$ and the prior $\Pi$ is bounded as

$$\frac{1}{n}\mathcal{K}(\rho, \Pi) \le C'_1\sum_{m\in I_0}\Big(\frac{1}{n}\phi^{(m)}_{f^*_m}(\epsilon_m, \lambda_m) + \frac{1}{n}\lambda_m - \frac{1}{n}\log(\lambda_m)\Big) + \frac{|I_0|}{n}\log\Big(\frac{Me}{|I_0|\kappa}\Big),$$

where $C'_1$ is a universal constant. A key to prove this is an infinite dimensional extension of the Brascamp-Lieb inequality (Brascamp & Lieb, 1976; Hargé, 2004). Since $\{\epsilon_m, \lambda_m\}_{m=1}^{M}$ are arbitrary, this gives the assertion.

Applications to Some Examples

Example 1: Metric Entropy Characterization (Correctly specified)

Define the $\epsilon$-covering number $N(B_{\mathcal{H}_m}, \epsilon, \|\cdot\|_n)$ as the number of $\|\cdot\|_n$-norm balls of radius $\epsilon$ needed to cover the unit ball $B_{\mathcal{H}_m}$ in $\mathcal{H}_m$. The metric entropy is its logarithm:

$$\log N(B_{\mathcal{H}_m}, \epsilon, \|\cdot\|_n).$$

We assume that there exists a real value $0 < s_m < 1$ such that

$$\log N(B_{\mathcal{H}_m}, \epsilon, \|\cdot\|_n) = O(\epsilon^{-2s_m}),$$

where $B_{\mathcal{H}_m}$ is the unit ball of the RKHS $\mathcal{H}_m$.

Theorem (Correctly specified).

If $f^*_m \in \mathcal{H}_m$ for all $m \in I_0$, then we have

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] \le C\Bigg[\sum_{m\in I_0}n^{-\frac{1}{1+s_m}} + \frac{|I_0|}{n}\log\Big(\frac{Me}{|I_0|\kappa}\Big)\Bigg] + 2\|f^o - f^*\|_n^2.$$

It is known that, if there is no scale mixture prior, the optimal rate $n^{-\frac{1}{1+s_m}}$ cannot be achieved (Castillo, 2008).

Example 1: Metric Entropy Characterization (Misspecified)

Let $[L_2(P_n), \mathcal{H}_m]_{\theta,\infty}$ be the interpolation space equipped with the norm

$$\|f\|^{(m)}_{\theta,\infty} := \sup_{t>0}\, t^{-\theta}\inf_{g_m\in\mathcal{H}_m}\big\{\|f - g_m\|_n + t\|g_m\|_{\mathcal{H}_m}\big\}.$$

One has

$$\mathcal{H}_m \hookrightarrow [L_2(P_n), \mathcal{H}_m]_{\theta,\infty} \hookrightarrow L_2(P_n).$$

(Figure: the interpolation space sits between $\mathcal{H}_m$ and $L_2$.)

Theorem (Misspecified).

If $f^*_m \in [L_2(P_n), \mathcal{H}_m]_{\theta,\infty}$ with $0 < \theta \le 1$, then we have

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] \le C\Bigg[\Big(\sum_{m\in I_0}n^{-\frac{1}{2(1+s_m/\theta)}}\Big)^2 + \frac{|I_0|}{n}\log\Big(\frac{Me}{|I_0|\kappa}\Big)\Bigg] + 2\|f^o - f^*\|_n^2.$$

Thanks to the scale mixture prior, the estimator adaptively achieves the optimal rate for $\theta \in (s_m, 1]$.


Example 2: Matérn prior

Suppose that $\mathcal{X}_m = [0,1]^{d_m}$. The Matérn priors on $\mathcal{X}_m$ correspond to the kernel function defined as

$$k_m(z, z') = \int_{\mathbb{R}^{d_m}} e^{is^\top(z - z')}\psi_m(s)\,ds,$$

where $\psi_m(s)$ is the spectral density given by $\psi_m(s) = (1 + \|s\|^2)^{-(\alpha_m + d_m/2)}$, for a smoothness parameter $\alpha_m > 0$.

Theorem (Matérn prior, correctly specified).

If $f^*_m \in \mathcal{H}_m$, then we have

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] \le C\Bigg[\sum_{m\in I_0}n^{-\frac{1}{1 + \frac{d_m}{2\alpha_m + d_m}}} + \frac{|I_0|}{n}\log\Big(\frac{Me}{|I_0|\kappa}\Big)\Bigg] + 2\|f^o - f^*\|_n^2.$$

Theorem (Matérn prior, misspecified).

If $f^*_m \in C^{\beta_m}[0,1]^{d_m} \cap W^{\beta_m}[0,1]^{d_m}$ and $\beta_m < \alpha_m + d_m/2$, then we have

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] \le C\Bigg[\Big(\sum_{m\in I_0}n^{-\frac{\beta_m}{2\beta_m + d_m}}\Big)^2 + \frac{|I_0|}{n}\log\Big(\frac{Me}{|I_0|\kappa}\Big)\Bigg] + 2\|f^o - f^*\|_n^2.$$

Although $f^*_m \notin \mathcal{H}_m$, the convergence rate achieves the optimal rate adaptively.

Example 3: Group Lasso

$\mathcal{X}_m$ is a finite dimensional Euclidean space: $\mathcal{X}_m = \mathbb{R}^{d_m}$. The kernel function corresponding to the Gaussian process prior is $k_m(x, x') = x^\top x'$:

$$f_m(x) = \mu^\top x, \quad \mu \sim \mathcal{N}(0, I_{d_m}).$$

Theorem (Group Lasso).

If $f^*_m = \mu_m^\top x$, then we have

$$\mathrm{E}_{Y_{1:n}|x_{1:n}}\big[\|\hat{f} - f^o\|_n^2\big] \le C\Bigg[\sum_{m\in I_0}\frac{d_m\log(n)}{n} + \frac{|I_0|}{n}\log\Big(\frac{Me}{|I_0|\kappa}\Big)\Bigg] + 2\|f^o - f^*\|_n^2.$$

This is rate optimal up to a $\log(n)$ factor.

Gaussian correlation conjecture

We use an infinite dimensional version of the following inequality (the Brascamp-Lieb inequality (Brascamp & Lieb, 1976; Hargé, 2004)):

$$\mathrm{E}\big[\langle X, \phi\rangle^2 \mid X \in A\big] \le \mathrm{E}\big[\langle X, \phi\rangle^2\big],$$

where $X \sim \mathcal{N}(0, \Sigma)$ and $A$ is a symmetric convex set centered at the origin.

Gaussian correlation conjecture:

$$\mu(A \cap B) \ge \mu(A)\mu(B),$$

where $\mu$ is any centered Gaussian measure and $A$ and $B$ are any two symmetric convex sets.

The Brascamp-Lieb inequality can be seen as an application of a particular case of the Gaussian correlation conjecture. See the survey by Li and Shao (2001) for more details.

Conclusion

We developed a PAC-Bayesian bound for the Gaussian process model and generalized it to the sparse additive model.
The optimal rate is achieved without any conditions on the design.
We have observed that Gaussian processes with scale mixtures adaptively achieve the minimax optimal rate in both correctly specified and misspecified situations.

References

Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30, 9–14.

Akaike, H. (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika, 66, 237–242.

Bach, F., Lanckriet, G., & Jordan, M. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. The 21st International Conference on Machine Learning (pp. 41–48).

Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37, 1705–1732.

Brascamp, H. J., & Lieb, E. H. (1976). On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. Journal of Functional Analysis, 22, 366–389.

Bunea, F., Tsybakov, A. B., & Wegkamp, M. H. (2007). Aggregation for Gaussian regression. The Annals of Statistics, 35, 1674–1697.

Castillo, I. (2008). Lower bounds for posterior rates with Gaussian process priors. Electronic Journal of Statistics, 2, 1281–1299.

Cortes, C., Mohri, M., & Rostamizadeh, A. (2009). L2 regularization for learning kernels. The 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009). Montreal, Canada.

Dalalyan, A., & Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72, 39–61.

Hargé, G. (2004). A convex/log-concave correlation inequality for Gaussian measure and an application to abstract Wiener spaces. Probability Theory and Related Fields, 130, 415–440.

Kloft, M., Brefeld, U., Sonnenburg, S., Laskov, P., Müller, K.-R., & Zien, A. (2009). Efficient and accurate ℓp-norm multiple kernel learning. Advances in Neural Information Processing Systems 22 (pp. 997–1005). Cambridge, MA: MIT Press.

Koltchinskii, V., & Yuan, M. (2010). Sparsity in multiple kernel learning. The Annals of Statistics, 38, 3660–3695.

Lanckriet, G., Cristianini, N., Ghaoui, L. E., Bartlett, P., & Jordan, M. (2004). Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5, 27–72.

Leung, G., & Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory, 52, 3396–3410.

Li, W. V., & Shao, Q.-M. (2001). Gaussian processes: inequalities, small ball probabilities and applications. Stochastic Processes: Theory and Methods, 19, 533–597.

Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9, 2491–2521.

Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. MIT Press.

Rigollet, P., & Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39, 731–771.

Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531–1565.

Suzuki, T. (2011). Unifying framework for fast learning rate of non-sparse multiple kernel learning. Advances in Neural Information Processing Systems 24 (pp. 1575–1583).

Suzuki, T., & Sugiyama, M. (2012). Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness. JMLR Workshop and Conference Proceedings 22 (pp. 1152–1183). Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012).

Suzuki, T., & Tomioka, R. (2009). SpicyMKL. arXiv:0909.5026.

Tomioka, R., & Suzuki, T. (2009). Sparsity-accuracy trade-off in MKL. NIPS 2009 Workshop: Understanding Multiple Kernel Learning Methods. Whistler. arXiv:1001.2615.

van der Vaart, A. W., & van Zanten, J. H. (2008a). Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36, 1435–1463.

van der Vaart, A. W., & van Zanten, J. H. (2008b). Reproducing kernel Hilbert spaces of Gaussian priors. Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, IMS Collections, 3, 200–222.

van der Vaart, A. W., & van Zanten, J. H. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics, 37, 2655–2675.

van der Vaart, A. W., & van Zanten, J. H. (2011). Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research, 12, 2095–2119.