Reproducing kernel Hilbert spaces
Nuno Vasconcelos, ECE Department, UCSD
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
Perceptron: the classifier implements the linear decision rule

$$h(x) = \mathrm{sgn}[g(x)] \quad \text{with} \quad g(x) = w^T x + b$$

this is appropriate when the classes are linearly separable; to deal with non-linear separability we introduce a kernel

[figure: a separating hyperplane g(x) = 0 with normal w and offset b between the two classes]
Kernel summary
1. D not linearly separable in X: apply a feature transformation Φ: X → Z, such that dim(Z) >> dim(X)
2. computing Φ(x) is too expensive:
• write your learning algorithm in dot-product form
• instead of Φ(x_i), we only need Φ(x_i)^T Φ(x_j) ∀ i,j
3. instead of computing Φ(x_i)^T Φ(x_j) ∀ i,j, define the "dot-product kernel"

$$K(x,z) = \Phi(x)^T \Phi(z)$$

and compute K(x_i, x_j) ∀ i,j directly
• note: the matrix

$$K = \begin{bmatrix} & \vdots & \\ \cdots & K(x_i, x_j) & \cdots \\ & \vdots & \end{bmatrix}$$

is called the "kernel" or Gram matrix
4. forget about Φ(x) and use K(x,z) from the start!
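to make step 3 concrete, here is a minimal NumPy sketch (mine, not from the lecture) that builds the Gram matrix directly from a kernel function, without ever forming Φ(x):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def gram_matrix(X, kernel):
    """Gram matrix K with K[i, j] = kernel(X[i], X[j])."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.randn(5, 3)              # 5 points in R^3
K = gram_matrix(X, gaussian_kernel)
print(K.shape, np.allclose(K, K.T))    # (5, 5) True: Gram matrices are symmetric
```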
Polynomial kernels
this makes a significant difference when K(x,z) is easier to compute than Φ(x)^T Φ(z)
e.g., we have seen that

$$K(x,z) = \left(x^T z\right)^2 = \Phi(x)^T \Phi(z), \quad \text{with } \Phi: \mathbb{R}^d \to \mathbb{R}^{d^2}$$

$$x \to \left(x_1 x_1, \ldots, x_1 x_d, x_2 x_1, \ldots, x_2 x_d, \ldots, x_d x_d\right)^T$$

while K(x,z) has complexity O(d), Φ(x)^T Φ(z) is O(d²); for K(x,z) = (x^T z)^k we go from O(d) to O(d^k)
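a quick numerical check of this identity (a sketch, not from the lecture): the O(d) kernel evaluation and the O(d²) explicit feature-space dot product agree

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all d^2 products x_i * x_j."""
    return np.outer(x, x).ravel()

d = 100
x, z = np.random.randn(d), np.random.randn(d)

k_trick = np.dot(x, z) ** 2              # O(d): the kernel trick
k_explicit = np.dot(phi(x), phi(z))      # O(d^2): explicit feature space
print(np.allclose(k_trick, k_explicit))  # True
```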
Question
in practice:
• pick a kernel from a library of known kernels
• we talked about
• the linear kernel K(x,z) = x^T z
• the Gaussian family

$$K(x,z) = e^{-\frac{\|x - z\|^2}{\sigma^2}}$$

• the polynomial family

$$K(x,z) = \left(1 + x^T z\right)^k, \quad k \in \{1, 2, \ldots\}$$

what if this is not good enough?
• how do I know if a function k(x,y) is a dot-product kernel?
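the three families above, written out as a small kernel library (a sketch; the signatures and defaults are my own):

```python
import numpy as np

def linear(x, z):
    """Linear kernel K(x, z) = x^T z."""
    return np.dot(x, z)

def gaussian(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial(x, z, k=2):
    """Polynomial kernel K(x, z) = (1 + x^T z)^k."""
    return (1.0 + np.dot(x, z)) ** k
```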
Dot-product kernels
let's start with the definition
Definition: a mapping

k: X × X → ℝ
(x,y) → k(x,y)

is a dot-product kernel if and only if

k(x,y) = <Φ(x), Φ(y)>

where Φ: X → H, H is a vector space, and <.,.> is a dot product in H
note that both H and <.,.> can be abstract, not necessarily ℝ^d
[figure: the feature map Φ sends the two classes (x's and o's), not separable in X, into a space H where they become linearly separable]
Dot product vs positive definite kernels
that is pretty abstract; how do I turn it into something I can compute?
Theorem: k(x,y), x,y ∈ X, is a dot-product kernel if and only if it is a positive definite kernel
this can be checked; let's start with the definition
Definition: k(x,y) is a positive definite kernel on X × X if, ∀ l and ∀ {x_1, ..., x_l}, x_i ∈ X, the Gram matrix

$$K = [K_{ij}] = [k(x_i, x_j)]$$

is positive definite.
Positive definite matrices
recall that (e.g. Linear Algebra and Its Applications, Strang)
Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be positive definite:
i) x^T A x ≥ 0, ∀ x ≠ 0
ii) all eigenvalues of A satisfy λ_i ≥ 0
iii) all upper-left submatrices A_k have non-negative determinant
iv) there is a matrix R with independent rows such that A = R^T R
(strictly, with the ≥ signs these are the conditions for positive semidefiniteness; the kernel literature conventionally calls this "positive definite")
upper-left submatrices:

$$A_1 = [a_{1,1}], \quad A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \quad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}, \quad \ldots$$
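condition ii) gives a direct numerical test; a minimal sketch (my own, with a small tolerance for floating-point error):

```python
import numpy as np

def is_positive_definite(K, tol=1e-10):
    """Test condition ii): K symmetric with all eigenvalues >= 0
    (positive semidefinite, which is what a positive definite *kernel* requires)."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# the Gram matrix of a Gaussian kernel is always positive definite:
X = np.random.randn(20, 2)
sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)               # sigma = 1
print(is_positive_definite(K))      # True
```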
Dot product vs positive definite kernels
the equivalence between dot-product and positive definite automatically proves some simple properties
• dot-product kernels are non-negative on the diagonal

$$k(x,x) \geq 0, \quad \forall x \in X$$

• dot-product kernels are symmetric

$$k(x,y) = k(y,x), \quad \forall x,y \in X$$

• Cauchy-Schwarz inequality for dot-product kernels

$$k^2(x,y) \leq k(x,x)\, k(y,y), \quad \forall x,y \in X$$

• note that this is just

$$-1 \leq \underbrace{\frac{k(x,y)}{\sqrt{k(x,x)\, k(y,y)}}}_{\cos\theta} \leq 1$$
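a quick sanity check of the Cauchy-Schwarz property on random pairs (a sketch, reusing the Gaussian kernel from the library above):

```python
import numpy as np

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert gaussian(x, y) ** 2 <= gaussian(x, x) * gaussian(y, y) + 1e-12
print("Cauchy-Schwarz held on 1000 random pairs")
```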
Dot product vs positive definite kernels
the proof actually gives insight into what the kernel does
• we defined H as the space spanned by linear combinations of the functions k(., x_i)

$$H = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i\, k(., x_i), \;\forall m, \;\forall x_i \in X \right\}$$

• e.g. the space of all linear combinations of Gaussians, when we adopt the Gaussian kernel

$$K(., x_i) = e^{-\frac{\|\,\cdot\, - x_i\|^2}{\sigma^2}}$$

• note: this is a function of the first argument; x_i is fixed
Dot product vs positive definite kernels
• if f(.) and g(.) ∈ H, with

$$f(.) = \sum_{i=1}^{m} \alpha_i\, k(., x_i) \qquad g(.) = \sum_{j=1}^{m'} \beta_j\, k(., x_j')$$

• we defined the dot-product <.,.>_* on H as

$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i\, \beta_j\, k(x_i, x_j')$$

• for the Gaussian kernel this is

$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i\, \beta_j\, e^{-\frac{\|x_i - x_j'\|^2}{\sigma^2}}$$

• a dot product in H
• a non-linear measure of similarity in X, somewhat related to likelihoods
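since f and g are fully specified by their coefficients and centers, <f,g>_* is a finite double sum; a minimal sketch (the function names are my own):

```python
import numpy as np

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def dot_star(alpha, Xf, beta, Xg, kernel=gaussian):
    """<f, g>_* for f = sum_i alpha[i] k(., Xf[i]), g = sum_j beta[j] k(., Xg[j])."""
    return sum(a * b * kernel(xi, xj)
               for a, xi in zip(alpha, Xf)
               for b, xj in zip(beta, Xg))
```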
The dot-product <.,.>_*
it is not hard to show that

$$\langle k(., x_i), k(., x_j) \rangle_* = k(x_i, x_j)$$

this means that

$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_* \quad \text{with} \quad \Phi: X \to H, \; x \to k(., x)$$

the feature transformation associated with the kernel k(x,y) sends the points x_i to the functions k(., x_i)
furthermore, <.,.>_* is itself a dot-product kernel on H × H
The reproducing property
with this definition of H and <.,.>_*

∀ f ∈ H, <k(.,x), f(.)>_* = f(x)

this is called the reproducing property
an analogy is to think of linear time-invariant systems:
• the dot product as a convolution
• k(.,x) as the Dirac delta
• f(.) as a system input
• the equation above is the basis of all linear time-invariant systems theory
we will see that it also plays a fundamental role in the theory (and use) of reproducing kernel Hilbert spaces
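the property can be exercised numerically: k(.,x) is itself an element of H (one center, coefficient 1), so <k(.,x), f>_* reduces to the double sum above. A sketch, reusing the hypothetical dot_star helper:

```python
import numpy as np

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def dot_star(alpha, Xf, beta, Xg, kernel=gaussian):
    return sum(a * b * kernel(xi, xj)
               for a, xi in zip(alpha, Xf)
               for b, xj in zip(beta, Xg))

rng = np.random.default_rng(1)
centers = rng.normal(size=(4, 2))   # the x_i defining f = sum_i alpha_i k(., x_i)
alpha = rng.normal(size=4)
x = rng.normal(size=2)              # evaluation point

f_x = sum(a * gaussian(x, xi) for a, xi in zip(alpha, centers))  # pointwise f(x)
repro = dot_star([1.0], [x], alpha, centers)                     # <k(., x), f>_*
print(np.allclose(f_x, repro))      # True: the dot product reproduces f(x)
```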
The big picture
when we use the Gaussian kernel

$$K(x, x_i) = e^{-\frac{\|x - x_i\|^2}{\sigma^2}}$$

• the point x_i ∈ X is mapped into the Gaussian G(x, x_i, σI)
• H is the space of all functions that are linear combinations of Gaussians
• the kernel is a dot product in H, and a non-linear similarity on X
• reproducing property on H: analogy to linear systems
[figure: Φ maps the sample points x_1, ..., x_n in X to Gaussian bumps centered at those points in H]
Hilbert spaces
Hilbert spaces play an important role in functional analysis; we will give a very informal review
H is a vector space with a dot product:
• this is known as a "dot-product space" or a "pre-Hilbert" space
the difference to a Hilbert space is mostly technical
Definition: a Hilbert space is a complete dot-product space
what do we mean by completeness?
Definition: S is complete if all Cauchy sequences in S converge
this is useful mostly to prove convergence results
Cauchy sequences
just for completeness (no pun intended)
Definition: a sequence {x_1, x_2, ...} in a normed space is a Cauchy sequence if

$$\forall \varepsilon > 0, \; \exists N \in \mathbb{N} \;\text{s.t.}\; \forall n', n'' \geq N, \; \|x_{n'} - x_{n''}\| \leq \varepsilon$$

you can picture this as the tail of the sequence eventually staying inside a tube of width ε
why might this not converge? well, the limit point could be outside the space
why is this important? complete ⇒ convergent = Cauchy
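a classic example of a Cauchy sequence whose limit escapes the space (my own illustration, not from the lecture): Newton's iteration for √2 stays inside the rationals ℚ, but its limit is irrational, so ℚ is not complete; ℝ is its completion

```python
from fractions import Fraction

x = Fraction(1)
for _ in range(6):
    x = (x + 2 / x) / 2     # x_{n+1} = (x_n + 2/x_n)/2, still an exact rational
print(float(x))             # 1.4142135623730951
print(float(x * x - 2))     # tiny residual: the sequence is Cauchy in Q,
                            # but its limit sqrt(2) is not in Q
```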
Reproducing kernel Hilbert spaces
to turn our pre-Hilbert space H into a Hilbert space we have to "complete" it with respect to the norm

$$\|f\| = \sqrt{\langle f, f \rangle_*}$$

this is just adding to H the limit points of all its Cauchy sequences
we keep denoting the completion of H by H
more than a Hilbert space, H becomes a reproducing kernel Hilbert space (RKHS)
Reproducing kernel Hilbert spaces
Definition: let H be a Hilbert space of functions f: X → ℝ. H is a RKHS endowed with the dot-product <.,.>_* if ∃ k: X × X → ℝ such that
1. k spans H, i.e. ∃ {x_i}, {α_i}, such that

$$H = \mathrm{span}\{k(., x_i)\} = \left\{ f(.) \;\middle|\; f(.) = \sum_i \alpha_i\, k(., x_i) \right\}$$

2. <f(.), k(.,x)>_* = f(x), ∀ f ∈ H
in summary:
• the H we built can easily be transformed into a RKHS by adding to it the limit points (functions) of all its Cauchy sequences
Question: is there a one-to-one mapping between kernels and RKHSs?
Kernels vs RKHSs
answer: the RKHS does indeed specify the kernel uniquely
proof:
• assume that kernels k_1 and k_2 both span H
• using the reproducing property
  • k_1(x',x) = <k_1(.,x), k_2(.,x')>_*
  • k_2(x,x') = <k_2(.,x'), k_1(.,x)>_*
• hence, from the symmetry of the dot product <.,.>_*, we must have
  k_1(x',x) = k_2(x,x')
  and from the symmetry of the kernel it follows that
  k_1(x,x') = k_2(x,x')
• since this holds for all x and x', the kernels are the same
Kernels vs RKHSs
however, it is not clear that a kernel uniquely specifies the RKHS
• there might be multiple feature transforms and dot-products that are consistent with a kernel
to study this we need to introduce Mercer kernels
Definition: a symmetric mapping k: X × X → ℝ such that

$$\int\!\!\int k(x,y)\, f(x)\, f(y)\, dx\, dy \geq 0, \quad \forall f \;\text{s.t.}\; \int f^2(x)\, dx < \infty \qquad (*)$$

is a Mercer kernel
Mercer kernels
why do we care about them? two reasons
1) Theorem: a kernel is positive definite (and dot-product) if and only if it is a Mercer kernel
proof that Mercer implies PD:
• consider any sequence {x_1, ..., x_n}, x_i ∈ X, and let

$$f(x) = \sum_{i=1}^{n} w_i\, \delta(x - x_i), \quad \text{s.t. } \sum_{i=1}^{n} w_i^2 < \infty$$

(an informal argument: δ is not square-integrable, but can be approximated by narrow square-integrable bumps)
• since $\int f^2(x)\, dx < \infty$, if k(x,y) is Mercer it follows from (*) that

$$0 \leq \int\!\!\int k(x,y)\, f(x)\, f(y)\, dx\, dy = \int\!\!\int k(x,y) \sum_{ij} w_i\, w_j\, \delta(x - x_i)\, \delta(y - x_j)\, dx\, dy$$
Mercer kernels

$$0 \leq \sum_{ij} w_i\, w_j \int\!\!\int k(x,y)\, \delta(x - x_i)\, \delta(y - x_j)\, dx\, dy = \sum_{ij} w_i\, w_j\, k(x_i, x_j) = w^T K w$$

• since this holds for any w, k(x,y) is PD
Mercer kernels
proof that PD implies Mercer:
• suppose there is a g(x) s.t.

$$\int\!\!\int k(x,y)\, g(x)\, g(y)\, dx\, dy = \varepsilon < 0$$

• consider a partition of X × X fine enough that

$$\left| \sum_{ij} k(x_i, x_j)\, g(x_i)\, g(x_j)\, \Delta_x \Delta_y - \int\!\!\int k(x,y)\, g(x)\, g(y)\, dx\, dy \right| \leq \frac{|\varepsilon|}{2}$$

• hence

$$\sum_{ij} k(x_i, x_j)\, g(x_i)\, g(x_j)\, \Delta_x \Delta_y < 0 \iff u^T K u < 0$$

with u = (g(x_1), ..., g(x_n))^T
• K is not PD and k(x,y) is not a PD kernel
• the Theorem follows by contradiction
Mercer kernels
2) they have the following property (stated without proof)
Theorem: let k: X × X → ℝ be a Mercer kernel. Then there exists an orthonormal set of functions

$$\int \phi_i(x)\, \phi_j(x)\, dx = \delta_{ij}$$

and a set of λ_i ≥ 0, such that

$$\text{i)} \;\; \int\!\!\int k^2(x,y)\, dx\, dy = \sum_{i=1}^{\infty} \lambda_i^2 < \infty$$

$$\text{ii)} \;\; k(x,y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y) \qquad (**)$$

intuition: think of this as a 2D Fourier series (+ Parseval)
Eigenfunctions
note: if we define the operator

$$(Tf)(x) = \int k(x,y)\, f(y)\, dy$$

then

$$(T\phi_i)(x) = \int k(x,y)\, \phi_i(y)\, dy = \sum_k \lambda_k\, \phi_k(x) \underbrace{\int \phi_k(y)\, \phi_i(y)\, dy}_{\delta_{ik}} = \lambda_i\, \phi_i(x)$$

the functions φ_i(x) are the eigenfunctions of T
again, a connection to linear systems (LTI if k(x,y) = h(x-y))
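in practice the eigenfunctions of T can be approximated from a finite sample: the eigenvectors of the Gram matrix, suitably scaled, play the role of the φ_i evaluated at the sample points (the Nyström method). A sketch of the idea, which is not in the lecture; the scaling convention is one common choice:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))       # sample points from X

sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq)                             # Gaussian Gram matrix, sigma = 1

n = X.shape[0]
evals, evecs = np.linalg.eigh(K / n)        # eigh: K is symmetric
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort eigenvalues descending

lambdas = evals                             # approximate lambda_i
phis = np.sqrt(n) * evecs                   # column i ~ phi_i evaluated at the samples
# empirical orthonormality: (1/n) sum_j phi_i(x_j) phi_k(x_j) ~ delta_ik
print(np.allclose(phis.T @ phis / n, np.eye(n)))   # True
```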
Mercer kernels
the eigenfunction decomposition gives us another way to design the feature transformation

$$\Phi: X \to l_2^d, \quad x \to \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T$$

where l_2^d is the space of vectors s.t. $\sum_i a_i^2 < \infty$, and d is the number of non-zero eigenvalues λ_i
clearly

$$k(x,y) = \Phi(x)^T\, \Phi(y)$$

i.e. there is a vector space (l_2^d) other than H s.t. k(x,y) is a dot product in that space
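continuing the Nyström sketch above: truncating to the top d eigenvalues gives an explicit finite-dimensional feature map whose ordinary dot products approximately rebuild the Gram matrix (same assumptions as before):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq)

n = X.shape[0]
evals, evecs = np.linalg.eigh(K / n)
evals, evecs = evals[::-1], evecs[:, ::-1]

d = 15                                       # keep the top d eigenvalues
top = np.maximum(evals[:d], 0.0)             # guard against tiny negative round-off
Phi = np.sqrt(n * top) * evecs[:, :d]        # Phi[j, i] ~ sqrt(lambda_i) phi_i(x_j)
print(np.max(np.abs(Phi @ Phi.T - K)))       # small: k(x,y) ~ Phi(x)^T Phi(y)
```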
The picture
this is a mapping from ℝ^n to ℝ^d
much more like a multi-layer Perceptron than before: the kernelized Perceptron can be drawn as a neural net whose hidden units compute the features Φ_1(.), ..., Φ_d(.)
[figure: Φ maps X to l_2^d, x_i → Φ(x_i); the kernelized Perceptron drawn as a network with input x, hidden units Φ_1(.), Φ_2(.), ..., Φ_d(.), and a linear output]
In summary
Reproducing kernel map:

$$\Phi: x \to k(., x)$$

$$H_K = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i\, k(., x_i) \right\}, \qquad \langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i\, \beta_j\, k(x_i, x_j')$$

Mercer kernel map:

$$\Phi: x \to \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T$$

$$H_M = l_2^d = \left\{ x \;\middle|\; \sum_{i=1}^{d} x_i^2 < \infty \right\}, \qquad \langle f, g \rangle_* = f^T g$$

where λ_i, φ_i are the eigenvalues and eigenfunctions of k(x,y)
two very different pictures of what the kernel does
are the two spaces really that different?
RK vs Mercer maps
note that for H_M we are writing

$$\Phi_M(x) = \sqrt{\lambda_1}\,\phi_1(x)\,\vec{e}_1 + \ldots + \sqrt{\lambda_d}\,\phi_d(x)\,\vec{e}_d$$

but, since the φ_i(.) are orthonormal, there is a 1-1 map

$$\Gamma: l_2^d \to \mathrm{span}\{\phi_k(.)\}, \quad \vec{e}_k \to \sqrt{\lambda_k}\,\phi_k(.)$$

and we can write

$$(\Gamma \circ \Phi_M)(x) = \lambda_1\,\phi_1(x)\,\phi_1(.) + \ldots + \lambda_d\,\phi_d(x)\,\phi_d(.) = k(., x) \qquad (***)$$

where the last equality follows from (**)
hence k(.,x) maps x into M = span{φ_k(.)}
RK vs Mercer maps
define the dot product in M so that

$$\langle \phi_j, \phi_k \rangle_{**} = \frac{1}{\lambda_j} \int \phi_j(x)\, \phi_k(x)\, dx = \frac{1}{\lambda_j}\, \delta_{jk}$$

then {φ_k(.)} is a basis, M is a vector space, and any function in M can be written as

$$f(x) = \sum_k \alpha_k\, \phi_k(x)$$

and

$$\langle f(.), k(., x) \rangle_{**} = \left\langle \sum_i \alpha_i\, \phi_i(.), \sum_j \lambda_j\, \phi_j(x)\, \phi_j(.) \right\rangle_{**} = \sum_{ij} \alpha_i\, \lambda_j\, \phi_j(x) \underbrace{\langle \phi_i(.), \phi_j(.) \rangle_{**}}_{\frac{1}{\lambda_j}\delta_{ij}} = \sum_i \alpha_i\, \phi_i(x) = f(x)$$

i.e., k is a reproducing kernel on M
RK vs Mercer maps
furthermore, since k(.,x) ∈ M, any functions of the form

$$f(.) = \sum_i \alpha_i\, k(., x_i) \qquad g(.) = \sum_j \beta_j\, k(., x_j)$$

are in M, and

$$\begin{aligned}
\langle f, g \rangle_{**} &= \sum_{ij} \alpha_i\, \beta_j\, \langle k(., x_i), k(., x_j) \rangle_{**} \\
&= \sum_{ij} \alpha_i\, \beta_j \left\langle \sum_l \lambda_l\, \phi_l(x_i)\, \phi_l(.), \sum_m \lambda_m\, \phi_m(x_j)\, \phi_m(.) \right\rangle_{**} \\
&= \sum_{ij} \alpha_i\, \beta_j \sum_{lm} \lambda_l\, \lambda_m\, \phi_l(x_i)\, \phi_m(x_j)\, \langle \phi_l(.), \phi_m(.) \rangle_{**} \\
&= \sum_{ij} \alpha_i\, \beta_j \sum_l \lambda_l\, \phi_l(x_i)\, \phi_l(x_j) = \sum_{ij} \alpha_i\, \beta_j\, k(x_i, x_j)
\end{aligned}$$

note that f, g ∈ H and this is the dot product we had in H
In summary
H ⊂ M, and <.,.>_* in H is the same as <.,.>_** in M
Question: is M ⊂ H?
• we need to show that any f(x) = Σ_k α_k φ_k(x) ∈ M can be written as a combination of the k(., x_k), i.e. belongs to H
• from (***),

$$k(., x) = \lambda_1\, \phi_1(x)\, \phi_1(.) + \ldots + \lambda_d\, \phi_d(x)\, \phi_d(.)$$

and, for any sequence {x_1, ..., x_d},

$$\begin{bmatrix} k(., x_1) \\ \vdots \\ k(., x_d) \end{bmatrix} = \underbrace{\begin{bmatrix} \lambda_1\, \phi_1(x_1) & \cdots & \lambda_d\, \phi_d(x_1) \\ \vdots & & \vdots \\ \lambda_1\, \phi_1(x_d) & \cdots & \lambda_d\, \phi_d(x_d) \end{bmatrix}}_{P} \begin{bmatrix} \phi_1(.) \\ \vdots \\ \phi_d(.) \end{bmatrix}$$

• if there is an invertible P, then each φ_k(.) is a linear combination of the k(., x_i), hence φ_k ∈ H and M ⊂ H
In summary
since λ_i > 0,

$$P = \underbrace{\begin{bmatrix} \phi_1(x_1) & \cdots & \phi_d(x_1) \\ \vdots & & \vdots \\ \phi_1(x_d) & \cdots & \phi_d(x_d) \end{bmatrix}}_{\Pi} \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_d \end{bmatrix}$$

is invertible when Π is. If Π is not invertible, then

$$\exists\, \alpha \neq 0 \;\text{s.t.}\; \sum_k \alpha_k\, \phi_k(x_i) = 0, \;\forall i$$

if there is no sequence {x_1, ..., x_d} for which Π is invertible, then

$$\exists\, \alpha \neq 0 \;\text{s.t.}\; \sum_k \alpha_k\, \phi_k(x) = 0, \;\forall x$$

and the φ_k(x) cannot be orthonormal (they would be linearly dependent). Hence there must be an invertible P, and M ⊂ H.
In summary
H = M, and <.,.>_* in H is the same as <.,.>_** in M
the reproducing kernel map and the Mercer kernel map lead to the same RKHS

Reproducing kernel map:

$$\Phi_r: x \to k(., x)$$

$$H_K = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i\, k(., x_i) \right\}, \qquad \langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i\, \beta_j\, k(x_i, x_j')$$

Mercer kernel map:

$$\Phi_M: x \to \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T$$

$$H_M = l_2^d = \left\{ x \;\middle|\; \sum_{i=1}^{d} x_i^2 < \infty \right\}, \qquad \langle f, g \rangle_* = f^T g$$

and the two maps are related by

$$\Gamma \circ \Phi_M = \Phi_r, \qquad \Gamma: l_2^d \to \mathrm{span}\{\phi_k(.)\}, \quad \vec{e}_k \to \sqrt{\lambda_k}\,\phi_k(.)$$