
Page 1:

Intelligent Systems I

09 UNDERSTANDING KERNELS

– SOME ADVANCED ASPECTS –

Philipp Hennig & Stefan Harmeling

19. December 2013

Max Planck Institute for Intelligent Systems, Dept. of Empirical Inference


Page 2:

Recap
the course so far

1. intro: intelligence is reasoning under uncertainty

2. probability theory specifies the mathematics of uncertainty

3. graphical models: independence is crucial for probabilistic computations

4. Gaussians map probabilistic reasoning onto linear algebra

5. regression: Gaussian algebra allows learning functional relationships

6. kernels: it is possible to learn “infinitely complex” functions

7. classification:

▸ non-Gaussian likelihoods extend the functionality of Gaussian regression (e.g. to classification)

▸ inference is not analytic any more and requires approximations (e.g. Laplace)

8. SVM / kPCA: the feature map idea (“kernel trick”) can be used for all algorithms relying on inner products


Page 3:

Today
conclusion of the nonparametric / kernel part

connections between “frequentist” and “Bayesian” models
▸ connection between Gaussian and other regression methods

how powerful are nonparametric models?
▸ what connects kernels and positive definite matrices?
▸ what is the space of posterior means given the kernel?
▸ what is the connection between samples from a GP and the kernel?

nonparametric models have infinitely many parameters. Can they learn every function?


Page 4:

Gaussian posterior means are least-squares estimates
maximum a posteriori inference is regularised loss minimization

p(fX ∣ y) = p(y ∣ fX) p(fX) / p(y) = N(y; fX, σ²I) N(fX; mX, kXX) / N(y; mX, kXX + σ²I)

−2 log p(fX ∣ y) = (y − fX)⊺ σ⁻²I (y − fX) + (fX − mX)⊺ kXX⁻¹ (fX − mX) + const.
= σ⁻² ∥y − fX∥²_I + ∥fX − mX∥²_kXX + const.

▸ the GP posterior mean is identical to the weighted ℓ2-regularised least-squares estimate
▸ this estimate is also known as kernel ridge regression
▸ the kernel provides a weighting on fX
▸ more generally, regularizers are connected to priors in this sense
▸ this also means a lot of theoretical concepts translate.

But not all of them . . .
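A minimal numerical sketch of this identity at the training points (not the lecture's code; the kernel, data and noise level are illustrative choices, and the prior mean is taken to be mX = 0): the GP posterior mean and the minimiser of the regularised least-squares objective above are computed separately and coincide.

% Sketch: GP posterior mean at X vs. the l2-regularised least-squares estimate.
k = @(a,b) exp(-0.5*bsxfun(@minus,a(:),b(:)').^2);   % square-exponential kernel
X = [-2; -1; 0.5; 2];  y = [0.1; -0.8; 0.4; 0.9];  sigma2 = 0.01;
K = k(X,X);
gp_mean = K * ((K + sigma2*eye(4)) \ y);             % GP posterior mean at X
ridge   = (eye(4)/sigma2 + inv(K)) \ (y/sigma2);     % argmin of sigma^-2*||y - f||^2 + f'*inv(K)*f
[gp_mean, ridge]                                     % the two columns agree

With a nonzero prior mean, the same identity holds for fX − mX.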


Page 5:

Reproducing Kernel Hilbert Spaces
the very rough story

▸ posterior mean kxX(kXX + σ²I)⁻¹y = kxX α
▸ so we are interested in the space of functions

f(x) = ∑_{i=1}^N αi k(x, Xi)   for various Xi, N, α.


Page 6:

So What?
summary of the lecture so far:

▸ Gaussian process posterior means are identical to kernel least-squares.
▸ If you dislike having priors, you can’t use least squares!
▸ Gaussian process means are not magical!


Page 7:

The space “spanned by the kernel”

slightly sloppy definition: The reproducing kernel Hilbert space (RKHS) is the space of all functions

f(x) = ∑_{i=1}^N αi k(x, Xi)   for various Xi, N, α

the space spanned by the kernel.
▸ what does this mean?
▸ we need to understand how kernels connect to matrices

cov(f(x), f(x′)) = φ(x)⊺φ(x′) = ∑_i φi(x) φi(x′)   →   cov(f(x), f(x′)) = k(x, x′) = ∫ φc(x) φc(x′) dc

▸ for example, do kernels have “eigenvectors”?


Page 8:

Eigenfunctions
the infinite extension of eigenvectors

k(x, x′) = exp[−(x − x′)²/2]

k = @(a,b) exp(-0.5*bsxfun(@minus,a(:),b(:)').^2);   % kernel handle for the formula above
x = linspace(-100,100,400); [U,D] = eig(k(x,x)); plot(x,U);

[Plots: eigenvectors U over x, and eigenvalue spectrum λi against the index i]


Page 9:

Eigenfunctions
the infinite extension of eigenvectors

k(x, x′) = [1 + (x − x′)²/2]⁻¹

k = @(a,b) 1./(1 + 0.5*bsxfun(@minus,a(:),b(:)').^2);   % kernel handle for the formula above
x = linspace(-100,100,400); [U,D] = eig(k(x,x)); plot(x,U);

[Plots: eigenvectors U over x, and eigenvalue spectrum λi against the index i]


Page 10:

Eigenfunctions
the infinite extension of eigenvectors

Definition 1 (Eigenfunction)

A function φ that obeys

∫ k(x, x′) φ(x) dν(x) = λ φ(x′)

is called an eigenfunction of k with eigenvalue λ, with respect to the measure ν.

▸ This is analogous to ∑_j Aij vj = λ vi.
▸ There are often infinitely many eigenfunctions φi for a given (k, ν). We assume, w.l.o.g., they are sorted λ1 ≥ λ2 ≥ . . .
▸ eigenfunctions with differing eigenvalues are orthogonal:

∬ φi(x′) k(x′, x) φj(x) dx′ dx = λi ∫ φi(x′) φj(x′) dx′ = λj ∫ φi(x) φj(x) dx = δij λi ⋅ const.
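A discretised sketch of this definition (an assumed setup: uniform grid and Lebesgue measure on a finite interval, so the integral becomes a Riemann sum): eigenvectors of the gram matrix then play the role of eigenfunctions, with Δx times the matrix eigenvalue approximating λ.

% Nystrom-style sketch: discretised eigenfunctions of the square-exponential kernel.
k  = @(a,b) exp(-0.5*bsxfun(@minus,a(:),b(:)').^2);
x  = linspace(-10,10,400);  dx = x(2) - x(1);
K  = k(x,x);
[U,M] = eig(K);  [mu,idx] = sort(diag(M),'descend');  U = U(:,idx);
phi1    = U(:,1) / sqrt(dx);          % normalised so that sum(phi1.^2)*dx = 1
lambda1 = dx * mu(1);                 % discretised eigenvalue
max(abs(dx*(K*phi1) - lambda1*phi1))  % Riemann sum of the integral equals lambda1*phi1 up to round-off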


Page 11:

Eigendecompositions
Mercer’s theorem (James Mercer, 1883–1932)

▸ so is there something like an eigendecomposition?

A = UDU⊺ for all positive semidefinite A ∈ R^{N×N}.

Theorem 9.1 (Mercer)

Let (X, ν) be a finite measure space and k ∈ L∞(X × X, ν × ν) be a positive semidefinite kernel wrt. ν. Let φi be the normalized eigenfunctions of k wrt. ν associated with λi > 0. Then

▸ the {λi}∞_{i=1} are absolutely summable, and
▸ k(x, x′) = ∑_{i=1}^∞ λi φi(x) φi(x′) holds ν × ν almost everywhere, where the series converges absolutely and uniformly ν × ν almost everywhere.

Mercer, J. (1909), “Functions of positive and negative type and their connection with the theory of integral equations”, Philosophical Transactions of the Royal Society A 209: 415–446.
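A finite analogue of the theorem, as a sketch (grid and kernel are illustrative choices): the gram matrix equals the sum ∑_i λi ui ui⊺ over its eigenpairs, and because the spectrum decays quickly, a truncated sum already reconstructs it well.

% Finite analogue of Mercer's theorem: truncated eigen-expansions of a gram matrix.
k = @(a,b) exp(-0.5*bsxfun(@minus,a(:),b(:)').^2);
x = linspace(-10,10,200);
K = k(x,x);
[U,D] = eig(K);  [lam,idx] = sort(diag(D),'descend');  U = U(:,idx);
for m = [5 10 20 200]
    Km = U(:,1:m) * diag(lam(1:m)) * U(:,1:m)';       % K_m = sum over the m largest eigenpairs
    fprintf('m = %3d   ||K - K_m|| = %.2e\n', m, norm(K - Km));
end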


Page 12:

Bochner’s Theorem
stationary kernels are all “quite similar” (Salomon Bochner, 1899–1982)

Theorem 9.2 (Bochner; actually, a special case of it)

A Mercer kernel k is the covariance function of a stationary, mean-square continuous random process on R^D if and only if it can be represented, using τ = x − x′, as

k(τ) = ∫_{R^D} e^{2πi s⊺τ} dν(s).

▸ If ν has a density S(s), it is known as the spectral density. In this case, the kernel is the Fourier dual of the spectral density:

k(τ) = ∫ S(s) e^{2πi s⊺τ} ds

▸ The complex exponentials e^{iθ} = cos θ + i sin θ are the eigenfunctions of every stationary kernel, because they are orthogonal and

k(x, x′) = ∫ e^{2πi s⊺(x−x′)} S(s) ds = ∫ (e^{2πi s⊺x}) (e^{2πi s⊺x′})* S(s) ds.
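A small numerical check of this representation for the one-dimensional square-exponential kernel k(τ) = exp(−τ²/2), whose spectral density under the e^{2πisτ} convention is the standard Fourier pair S(s) = √(2π) exp(−2π²s²) (the grid and lag below are arbitrary):

% Sketch: integrating the spectral density against the complex exponential
% recovers the kernel value at lag tau.
s   = linspace(-3, 3, 2000);
S   = sqrt(2*pi) * exp(-2*pi^2*s.^2);
tau = 1.3;                                          % an arbitrary lag
k_from_S = real(trapz(s, S .* exp(2i*pi*s*tau)))    % integral of S(s) e^{2 pi i s tau} ds
k_direct = exp(-tau^2/2)                            % the same number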


Page 13:

So What?
summary of the lecture so far:

▸ Gaussian process posterior means are identical to kernel least-squares.
▸ If you dislike having priors, you can’t use least squares!
▸ Gaussian process means are not magical!
▸ Mercer kernels are like infinite positive definite matrices:
▸ they have eigenfunctions ∫ k(x, x′) φ(x′) dν(x′) = λ φ(x)
▸ they have eigen-decompositions k(x, x′) = ∑_i λi φi(x) φi(x′)
▸ all stationary kernels have the same eigenfunctions cos θ (+ i sin θ); they just differ in the eigenvalues


Page 14:

The reproducing kernel Hilbert space (RKHS)
the elegant but unwieldy definition

Definition 9.3 (reproducing kernel Hilbert space, proper definition)

Let H be a Hilbert space of functions f : X → R with inner product ⟨⋅, ⋅⟩ (and norm ∥f∥ = ⟨f, f⟩^{1/2}). Then H is called a reproducing kernel Hilbert space if there exists a function k : X × X → R such that

1. for every x, k(x,x′), as a function of x′, belongs to H.

2. k has the reproducing property: ⟨f(⋅), k(⋅, x)⟩ = f(x)

▸ k is like a probe picking out f’s from the inner product. It allows us to reproduce all the f’s making up H.

Theorem 9.4 (Moore-Aronszajn)

Given an index set X, for every positive definite k on X × X, there exists one and only one RKHS, and vice versa.

▸ kernels are directly connected with RKHSs (no measure required!)
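A finite-dimensional sketch of the reproducing property (grid, kernel and coefficients are arbitrary illustrative choices): on a grid, take the span of kernel sections f = Kα with inner product ⟨Kα, Kβ⟩ = α⊺Kβ; pairing f with the section k(⋅, xi) then returns the function value f(xi).

% Finite sketch of the reproducing property: <f, k(.,x_i)> = f(x_i).
k = @(a,b) exp(-0.5*bsxfun(@minus,a(:),b(:)').^2);
x = linspace(-5,5,40)';  K = k(x,x);
alpha = randn(40,1);  f = K*alpha;   % a function in the span of the kernel sections
i = 7;                               % probe at the 7th grid point
[alpha'*K(:,i), f(i)]                % <f, k(.,x_i)>  and  f(x_i) agree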

Page 15:

The reproducing kernel Hilbert space (RKHS)
a more manageable definition

Definition 9.5 (RKHS, the simple definition)

Let k be a Mercer kernel with eigenfunctions φi relative to ν [i.e. k(x, x′) = ∑_i λi φi(x) φi(x′)]. The RKHS is the space of functions

f(x) = ∑_{i=1}^∞ fi φi(x)   such that   ∑_{i=1}^∞ fi² / λi < ∞,   with   ⟨f, g⟩ = ∑_{i=1}^∞ fi gi / λi

To see that this is equivalent to the former definition, note

1. k ∈ H, because

⟨k(x, ⋅), k(x, ⋅)⟩ = ∑_i (λi φi(x)) (λi φi(x)) / λi = ∑_i λi φi(x) φi(x) = k(x, x) < ∞

2. reproducing property:

⟨f(⋅), k(⋅, x)⟩ = ∑_i fi (λi φi(x)) / λi = ∑_i fi φi(x) = f(x)
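A finite-dimensional sketch of this definition (the widely spaced grid is chosen only so that the gram matrix is well conditioned): for a function f = Kα on the grid, the norm α⊺Kα equals ∑_i fi²/λi in the eigenbasis of K.

% Sketch: RKHS norm of f = K*alpha, computed directly and via the eigenbasis.
k = @(a,b) exp(-0.5*bsxfun(@minus,a(:),b(:)').^2);
x = 2*(1:30);                        % widely spaced grid: K is well conditioned
K = k(x,x);
alpha = randn(30,1);  f = K*alpha;
[U,D] = eig(K);  lam = diag(D);
fc = U'*f;                           % coefficients of f in the eigenbasis of K
[alpha'*K*alpha, sum(fc.^2 ./ lam)]  % the two expressions agree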

Page 16:

So What?
summary of the lecture so far:

▸ Gaussian process posterior means are identical to kernel least-squares.
▸ If you dislike having priors, you can’t use least squares!
▸ Gaussian process means are not magical!
▸ Mercer kernels are like infinite positive definite matrices:
▸ they have eigenfunctions ∫ k(x, x′) φ(x′) dν(x′) = λ φ(x)
▸ they have eigen-decompositions k(x, x′) = ∑_i λi φi(x) φi(x′)
▸ all stationary kernels have the same eigenfunctions cos θ (+ i sin θ); they just differ in the eigenvalues
▸ Each kernel is associated with a Hilbert space of functions “spanned” (reproduced) by it. We can study this space to learn more about the power of kernel methods.


Page 17:

What about the posterior distribution?

▸ reminder: We were interested in the RKHS because the posterior mean is µ(x) = k(x,X)⊺α, so it lies in the RKHS.

▸ what about the probability distribution p(f) = GP(0, k)?


Page 18:

Sampling from a Gaussian using the Eigendecomposition
[V,D]=eig(Sigma); x = bsxfun(@plus,V * sqrt(D) * randn(N,S),mu);

[Scatter plot of the samples]

Σ = UDU⊺,   x = u ∼ N(0, I)

Page 19:

Sampling from a Gaussian using the Eigendecomposition
[V,D]=eig(Sigma); x = bsxfun(@plus,V * sqrt(D) * randn(N,S),mu);

[Scatter plot of the samples]

Σ = UDU⊺,   x = D^{1/2} u ∼ N(0, D)

Page 20:

Sampling from a Gaussian using the Eigendecomposition
[V,D]=eig(Sigma); x = bsxfun(@plus,V * sqrt(D) * randn(N,S),mu);

[Scatter plot of the samples]

Σ = UDU⊺,   x = U D^{1/2} u ∼ N(0, Σ)

Page 21:

Sampling from a Gaussian using the Eigendecomposition
[V,D]=eig(Sigma); x = bsxfun(@plus,V * sqrt(D) * randn(N,S),mu);

[Scatter plot of the samples]

Σ = UDU⊺,   x = U D^{1/2} u + µ ∼ N(µ, Σ)
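A runnable version of the one-line recipe above, with Sigma, mu, N and S filled in for a Gaussian process on a grid (the kernel, grid, jitter and number of samples are illustrative choices):

% Draw S samples from GP(0, k) on a grid via the eigendecomposition of Sigma.
k  = @(a,b) exp(-0.5*bsxfun(@minus,a(:),b(:)').^2);
xg = linspace(-8, 8, 200)';
N  = 200;  S = 5;  mu = zeros(N,1);
Sigma = k(xg,xg) + 1e-10*eye(N);     % small jitter for numerical stability
[V,D] = eig(Sigma);
f = bsxfun(@plus, V * sqrt(D) * randn(N,S), mu);
plot(xg, f);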

Page 22:

Drawing from a Gaussian process
GP draws are not in the RKHS!

▸ to sample f ∼ GP(0, k), draw fi ∼ N(0, λi) for all i = 1, . . . , N, then

f(x) = ∑_{i=1}^N fi φi(x)   ⇒   E[∥f∥²_H] = E[⟨f, f⟩_H] = ∑_{i=1}^N E[fi²] / λi = ∑_{i=1}^N 1 = N

▸ for nondegenerate kernels (N = ∞), GP samples are almost surely not in the RKHS!
▸ The posterior mean is more regular (usually: smoother) than almost all samples.
▸ samples from a GP are “just outside” of the RKHS in that they are almost surely not of finite norm, but of the right algebraic form.
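A sketch of the divergence argument with an assumed spectrum λi = 2⁻ⁱ (any summable, strictly positive spectrum behaves the same way): the average truncated RKHS norm of the draws grows like N.

% Truncated GP "samples" f_i ~ N(0, lambda_i); their mean RKHS norm equals N.
lam = 2.^-(1:50)';                                         % assumed spectrum
for N = [5 10 20 50]
    fi = bsxfun(@times, sqrt(lam(1:N)), randn(N, 2000));   % 2000 truncated draws
    rkhs2 = sum(bsxfun(@rdivide, fi.^2, lam(1:N)), 1);     % sum_i f_i^2 / lambda_i
    fprintf('N = %2d   mean ||f||_H^2 = %5.1f\n', N, mean(rkhs2));
end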


Page 23:

GPs are not distributions on the RKHS!
This is not just a technical point. Example: linear splines

▸ RKHS: piecewise linear, i.e. smooth almost everywhere
▸ GP samples: non-differentiable almost everywhere
▸ when you think about the mean, think of the RKHS. But remember that samples from a GP can be very different from the mean.

Page 24:

Estimation Power of Gaussian Regression
which functions can we learn?

An example thought process: Consider the square-exponential kernel

k(x, x′) = exp(−(x − x′)²/2)

▸ Bochner: the eigenfunctions are the Fourier basis {cos(ωx)}_ω
▸ the spectral density has support on all frequencies!

F[k(x − x′)] = exp(−ω²/2)

▸ lots of continuous functions f have an integrable Fourier transform¹
▸ so can we learn all such functions with k?

→ is GP(0, k) a consistent estimator on S?

¹ e.g. the Schwartz space S = {f ∈ C∞(R^D) ∣ sup_{x∈R^D} ∣x^a ∂^b f(x)∣ < ∞ ∀ a, b}


Page 25:

Consistency of Kernel Methods

Definition 9.6 (universal consistency)

A procedure mapping (x, y) ↦ f is said to be consistent for the probability measure µ(x, y) and the loss function L if the method’s risk converges to the minimal risk as the sample size increases:

∫ L[y, f(x)] dµ(x, y) → min_f ∫ L[y, f(x)] dµ(x, y)   as n → ∞

Methods that are consistent for every Borel probability measure µ(x, y) are called universally consistent.

▸ e.g. Bartlett et al., JASA 2005; Steinwart, 2005: various consistency results for SVMs, GP regression, logistic regression.


Page 26:

Universal Kernels
a slightly different statement

Definition 9.7 (Universal Kernel)

A kernel k acting on X is said to be universal if its RKHS lies dense in the space of all continuous functions.

[Shown on the slide: the first page of Micchelli, C.A., Xu, Y. & Zhang, H. (2006), “Universal Kernels”, Journal of Machine Learning Research 7: 2651–2667, which investigates conditions on the features of a continuous kernel so that it can approximate an arbitrary continuous target function uniformly on any compact subset of the input space.]


Page 27:

So can we learn every continuous function f : X → R using a GP prior with the square-exponential kernel?


Page 28:

Universal RKHSs
an experiment – prior

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Page 29:

Universal RKHSs
an experiment – 1 evaluation

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Page 30:

Universal RKHSs
an experiment – 2 evaluations

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Page 31:

Universal RKHSs
an experiment – 5 evaluations

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Page 32:

Universal RKHSs
an experiment – 10 evaluations

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Page 33:

Universal RKHSs
an experiment – 20 evaluations

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Page 34:

Universal RKHSs
an experiment – 50 evaluations

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Page 35:

Universal RKHSs
an experiment – 100 evaluations

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Page 36:

Universal RKHSs
an experiment – 500 evaluations

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]
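A sketch of an experiment in this spirit (the target function used in the lecture is not specified here; the continuous stand-in and all other settings below are assumed): GP regression with the square-exponential kernel on growing, randomly placed data sets.

% GP regression with the square-exponential kernel and growing data sets.
k = @(a,b) exp(-0.5*bsxfun(@minus,a(:),b(:)').^2);
f_true = @(t) 3*tanh(2*sin(2*t));                % assumed continuous target
xg = linspace(-8,8,400)';  sigma2 = 0.01;
for N = [1 2 5 10 20 50 100 500]
    X = 16*rand(N,1) - 8;                        % evaluation points in [-8, 8]
    y = f_true(X) + sqrt(sigma2)*randn(N,1);
    m = k(xg,X) * ((k(X,X) + sigma2*eye(N)) \ y);            % posterior mean
    fprintf('N = %3d   mean |m - f| = %.3f\n', N, mean(abs(m - f_true(xg))));
end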

Page 37:

Convergence Rates are Important
non-obvious aspects of f can ruin convergence (v.d. Vaart & v. Zanten, 2011)

[Log-log plot: ∥f̂ − f∥² against the number of function evaluations]

If f is “not well represented” by the kernel (has low prior density), the number of datapoints required to achieve ε error can be exponential in ε. Outside of the observation range, there are no guarantees at all.


Page 38:

An Analogy
representing π in Q

▸ Q is dense in R

π = 3 ⋅ 1/1 + 1 ⋅ 1/10 + 4 ⋅ 1/100 + 1 ⋅ 1/1000 + . . .   (decimal)
  = 4 ⋅ 1/1 − 4 ⋅ 1/3 + 4 ⋅ 1/5 − 4 ⋅ 1/7 + . . .   (Gregory-Leibniz)
  = 3 ⋅ 1/1 + 4 ⋅ 1/(2 ⋅ 3 ⋅ 4) − 4 ⋅ 1/(4 ⋅ 5 ⋅ 6) + 4 ⋅ 1/(6 ⋅ 7 ⋅ 8) − . . .   (Nilakantha)

[Plot: log10 error against the number of ‘datapoints’ (terms) for the decimal, Gregory-Leibniz, Nilakantha and Chudnovsky representations]
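The partial sums behind this comparison can be reproduced in a few lines (a sketch; the Chudnovsky series is omitted):

% Error of the first 15 partial sums of the three representations of pi above.
N = 15;  n = (1:N)';
p_dec = floor(pi*10.^(n-1)) ./ 10.^(n-1);                        % decimal truncation
p_gl  = cumsum(4*(-1).^(n-1) ./ (2*n-1));                        % Gregory-Leibniz
p_nil = 3 + cumsum(4*(-1).^(n-1) ./ ((2*n).*(2*n+1).*(2*n+2)));  % Nilakantha
semilogy(n, abs([p_dec p_gl p_nil] - pi));
legend('decimal','Gregory-Leibniz','Nilakantha');
xlabel('number of terms'); ylabel('absolute error');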


Page 39:

Understanding Frequentist and Bayesian statements
Gaussian / ℓ2 regression is an interesting case, because the exact same method is studied on both sides.

Bayesian: If this generative model is correct, this inference is optimal!

Frequentist: This estimator can learn everything given enough data!


Page 40:

Understanding Frequentist and Bayesian statements
Gaussian / ℓ2 regression is an interesting case, because the exact same method is studied on both sides.

[Plot: x ∈ [−8, 8], f ∈ [−5, 5]]

Bayesian: Well, you haven’t used the right prior!
Frequentist: Well, you haven’t collected ∞ samples yet!


Page 41:

Both views are useful, neither is perfect
probabilistic (Bayesian) vs asymptotic (frequentist) analysis

The “Bayesian” (probabilistic) view
▸ is particularly helpful for small datasets and extrapolation
▸ gives an intuition for model properties, assumptions
▸ allows hierarchical extension, “complete toolbox”
▸ can help build good models

The “frequentist” (asymptotic) view
▸ is particularly helpful for the large dataset limit, interpolation
▸ gives an intuition for model limitations
▸ can offer efficient computational “shortcuts”
▸ can help build general models

Frequentist: “If the assumptions are correct, this is the worst that could happen.”

Bayesian: “If the (slightly stronger) assumptions are correct, the posterior is the exact, optimal answer.”


Page 42:

Never say the following things:

▸ “Frequentist methods are better because they have no prior, so they make no assumptions. They let the data speak for itself.”

▸ “Bayesian methods are better because they tell you exactly how uncertain they are.”


Page 43:

Summary

▸ Mercer kernels are like infinite positive definite matrices
▸ each kernel is uniquely identified with a unique space called the reproducing kernel Hilbert space (RKHS).
▸ the posterior mean is in the RKHS
▸ samples from the GP are not in the RKHS
▸ some kernels have universal RKHSs. This means they can approximate every continuous function arbitrarily well. It does not mean they can represent every function equally well. Since no dataset is infinite, this matters.
▸ Some intuitions do not carry over well from the finite case. For example, kernels with the same eigenfunctions but different eigenvalue spectra can span quite different RKHSs.
▸ analogous (in some cases: identical) statements hold for logistic regression, kernel regression, SVMs, kernel PCA, etc.

How intelligent are these systems?
They can learn every possible function! But they find some functions exponentially easier to learn than others.

