
Reproducing kernel Hilbert spaces and regularization

Nuno Vasconcelos, ECE Department, UCSD


Classification

a classification problem has two types of variables

• X - vector of observations (features) in the world
• Y - state (class) of the world

Perceptron: the classifier implements the linear decision rule

h(x) = sgn[g(x)],  with g(x) = wᵀx + b

appropriate when the classes are linearly separable

[figure: linear decision boundary g(x) = wᵀx + b, with normal vector w and offset b]

to deal with non-linear separability we introduce a kernel
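As an aside (not on the slide), a minimal numpy sketch of this decision rule; the weight vector w and bias b are hypothetical stand-ins for parameters learned by, e.g., Perceptron training:

```python
import numpy as np

def linear_decision(x, w, b):
    """Linear decision rule h(x) = sgn(g(x)), with g(x) = w'x + b."""
    g = np.dot(w, x) + b
    return np.sign(g)

# hypothetical learned parameters
w = np.array([1.0, -2.0])
b = 0.5
print(linear_decision(np.array([3.0, 1.0]), w, b))  # 1.0, i.e. class +1
```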


Types of kernels

these three are equivalent:

dot-product kernel ⇔ positive definite kernel ⇔ Mercer kernel


Dot-product kernels

Definition: a mapping

k: X × X → ℜ
(x,y) → k(x,y)

is a dot-product kernel if and only if

k(x,y) = <Φ(x),Φ(y)>

where Φ: X → H, H is a vector space, and <.,.> is a dot product in H

[figure: Φ maps the input space X, where the x and o classes are not linearly separable, into a feature space H where they are]
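To make the definition concrete (an addition, not on the slide): for the degree-2 polynomial kernel k(x,y) = (xᵀy)² on ℜ², the feature map Φ(x) = (x1², √2 x1x2, x2²) satisfies k(x,y) = <Φ(x),Φ(y)>; the test vectors below are arbitrary:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    """The same kernel evaluated directly in the input space."""
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))  # 1.0
print(k(x, y))                 # 1.0 -- k(x,y) = <phi(x),phi(y)>
```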


Positive definite and Mercer kernels

Definition: k(x,y) is a positive definite kernel on X × X if, ∀ l and ∀ x1, ..., xl, xi ∈ X, the Gram matrix

K = [k(xi,xj)]ij

is positive definite.

Definition: a symmetric mapping k: X × X → ℜ such that

∫∫ k(x,y) f(x) f(y) dx dy ≥ 0,  ∀ f s.t. ∫ f²(x) dx < ∞

is a Mercer kernel

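The Gram-matrix condition is easy to check numerically. A sketch (not from the slides) for the Gaussian kernel on a handful of random points; the sample and the bandwidth σ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))      # l = 5 arbitrary points in R^2
sigma = 1.0

# Gram matrix K_ij = k(x_i, x_j) for the Gaussian kernel
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / sigma ** 2)

# positive definiteness: all eigenvalues of K are (numerically) positive
print(np.linalg.eigvalsh(K).min() > -1e-10)  # True
```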


Two different pictures

different definitions lead to different interpretations of what the kernel does: the reproducing kernel map vs. the Mercer kernel map

Reproducing kernel map:

HK = { f(.) | f(.) = Σi=1..m αi k(.,xi) }

<f,g>* = Σi=1..m Σj=1..m' αi βj k(xi,xj')

Φ: x → k(.,x)

Mercer kernel map:

HM = l2ᵈ = { x | Σi=1..d xi² < ∞ }

<f,g>* = fᵀg

Φ: x → (√λ1 φ1(x), √λ2 φ2(x), ...)ᵀ

where λi > 0, φi are the eigenvalues and eigenfunctions of k(x,y)


The dot-product picture

when we use the Gaussian kernel

K(x,xi) = e^(−||x−xi||²/σ²)

• the point xi ∈ X is mapped into the Gaussian G(x,xi,σI)
• H is the space of all functions that are linear combinations of Gaussians
• the kernel is a dot product in H, and a non-linear similarity on X
• reproducing property on H: analogy to linear systems

[figure: each xi ∈ X is mapped to a Gaussian bump in H; the x and o classes become linearly separable in H]
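A sketch of this picture in code (an illustration, not from the slides): points of H are linear combinations of Gaussians centered at the data; the centers and coefficients below are arbitrary:

```python
import numpy as np

def k(x, y, sigma=1.0):
    """Gaussian kernel K(x,y) = exp(-||x-y||^2 / sigma^2)."""
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

# f(.) = sum_i alpha_i k(., x_i): a linear combination of Gaussian bumps
centers = [np.array([0.0]), np.array([1.5]), np.array([-1.0])]
alphas = [1.0, -0.5, 2.0]

def f(x):
    return sum(a * k(x, c) for a, c in zip(alphas, centers))

print(f(np.array([0.3])))   # evaluate one element of H at a point
```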


The Mercer picture

this is a mapping from ℜⁿ to ℜᵈ

[figure: Φ = (Φ1(.), Φ2(.), ..., Φd(.)) maps X into l2ᵈ, sending xi to Φ(xi)]

much more like a multi-layer Perceptron than before: the kernelized Perceptron as a neural net


The reproducing property

with this definition of HK and <.,.>*

∀ f ∈ HK, <k(.,x),f(.)>* = f(x)

this is called the reproducing property

an analogy is to think of linear time-invariant systems:
• the dot product as a convolution
• k(.,x) as the Dirac delta
• f(.) as a system input
• the equation above is the basis of all linear time-invariant systems theory

this leads to reproducing kernel Hilbert spaces
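The reproducing property can be verified numerically: for f(.) = Σi αi k(.,xi), the dot product <k(.,x),f(.)>* = Σi αi k(x,xi) is exactly the pointwise evaluation f(x). A sketch under those definitions (Gaussian kernel, arbitrary data):

```python
import numpy as np

def k(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

xs = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
alphas = [0.7, -1.2, 0.4]

def f(x):
    """f in H_K: f(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * k(xi, x) for a, xi in zip(alphas, xs))

def dot_star(al, xs, be, ys):
    """<f,g>_* = sum_ij alpha_i beta_j k(x_i, y_j)."""
    return sum(a * b * k(xi, yj) for a, xi in zip(al, xs)
                                 for b, yj in zip(be, ys))

x = np.array([0.5])
print(dot_star(alphas, xs, [1.0], [x]))  # <k(.,x), f>_* ...
print(f(x))                              # ... equals f(x)
```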


Reproducing kernel Hilbert spaces

Definition: a Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences)

Definition: let H be a Hilbert space of functions f: X → ℜ. H is a RKHS endowed with the dot product <.,.>* if ∃ k: X × X → ℜ such that

1. k spans H, i.e., H = span{k(.,x)} = { f(.) | f(.) = Σi αi k(.,xi) }
2. <f(.),k(.,x)>* = f(x), ∀ f ∈ H


Mercer kernels

how different are the spaces HK and HM?

Theorem: let k: X × X → ℜ be a Mercer kernel. Then there exists an orthonormal set of functions

∫ φi(x) φj(x) dx = δij

and a set of λi ≥ 0, such that

1) Σi=1..∞ λi² = ∫∫ k²(x,y) dx dy < ∞

2) k(x,y) = Σi=1..∞ λi φi(x) φi(y)   (**)
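A finite-sample analogue (an aside, not the continuous theorem itself): on a sample, the Gram matrix plays the role of k(x,y), and its eigendecomposition K = Σi λi vi viᵀ is the discrete counterpart of (**), with nonnegative eigenvalues as Mercer requires:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 1))
K = np.exp(-(X - X.T) ** 2)        # Gaussian Gram matrix on 6 points

lam, V = np.linalg.eigh(K)         # eigenvalues lam_i, eigenvectors v_i
K_rec = (V * lam) @ V.T            # sum_i lam_i v_i v_i^T

print(np.allclose(K, K_rec))       # True: K = sum_i lam_i v_i v_i^T
print((lam > -1e-10).all())        # True: lam_i >= 0 (numerically)
```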


RK vs Mercer maps

note that for HM we are writing

Φ(x) = √λ1 φ1(x) e1 + ... + √λd φd(x) ed

but, since the φi(.) are orthonormal, there is a 1-1 map

Γ: l2ᵈ → span{φk(.)},  ek → √λk φk(.)

and we can write, from (**),

(Γ ∘ Φ)(x) = λ1 φ1(x) φ1(.) + ... + λd φd(x) φd(.) = k(.,x)   (***)

hence k(.,x) maps x into M = span{φk(.)}


The Mercer picture

we have

[figure: Φ maps X into l2ᵈ, with basis e1, ..., ed; Γ then maps l2ᵈ into M = span{φk(.)}, so that (Γ ∘ Φ)(xi) = k(.,xi)]


RK vs Mercer maps

define the dot product in M so that

<φk,φj>** = (1/√(λk λj)) ∫ φk(x) φj(x) dx = δkj/λk

then φk(.) is a basis, M is a vector space, and any function in M can be written as

f(x) = Σk αk φk(x)

and

<f(.),k(.,x)>** = <Σi αi φi(.), Σj λj φj(x) φj(.)>**
               = Σij αi λj φj(x) <φi,φj>**
               = Σij αi λj φj(x) δij/λj
               = Σi αi φi(x) = f(x)

i.e., k is a reproducing kernel on M


RK vs Mercer maps

furthermore, since k(.,x) ∈ M, any functions of the form

f(.) = Σi αi k(.,xi),  g(.) = Σj βj k(.,xj')

are in M, and

<f,g>** = Σij αi βj <k(.,xi),k(.,xj')>**
        = Σij αi βj <Σl λl φl(xi) φl(.), Σm λm φm(xj') φm(.)>**
        = Σij αi βj Σlm λl λm φl(xi) φm(xj') δlm/λl
        = Σij αi βj Σl λl φl(xi) φl(xj')
        = Σij αi βj k(xi,xj')

note that f,g ∈ HK, and this is the dot product we had in HK


The Mercer picture

furthermore, note that for f in HM

f(x) = Σk αk φk(x)

and since

<φk,φj>** = (1/√(λk λj)) ∫ φk(x) φj(x) dx = δkj/λk

the dot product on HM is

<f,g>** = Σkl αk βl <φk,φl>** = Σk αk βk/λk


The Mercer picture

we have

[figure: Φ maps X into l2ᵈ; Γ maps l2ᵈ into M ⊂ HK, with (Γ ∘ Φ)(xi) = k(.,xi); in HK, <f,g>* = Σij αi βj k(xi,xj); in M, <f,g>** = Σk αk βk/λk]


RK vs Mercer maps

HK ⊂ M, and <.,.>* in HK is the same as <.,.>** in M

Question: is M ⊂ HK?
• need to show that any f(x) = Σk αk φk(x) ∈ M can be written as f(x) = Σk αk' k(x,xk) ∈ HK
• from (***), for any sequence x1, ..., xd,

[k(.,x1), ..., k(.,xd)]ᵀ = P [φ1(.), ..., φd(.)]ᵀ,  with Pij = λj φj(xi)

• if there is an invertible P, then the φk(.) are linear combinations of the k(.,xi), and M ⊂ HK


RK vs Mercer maps

since λi > 0,

P = Π diag(λ1, ..., λd),  with Π = [φj(xi)]ij

is invertible when Π is. If there is no sequence x1, ..., xd for which Π is invertible, then

∀ (x1, ..., xd) ∃ α ≠ 0 s.t. Σk αk φk(xi) = 0 ∀ i

i.e. ∃ α ≠ 0 s.t. Σk αk φk(x) = 0 ∀ x

and the φk(x) cannot be orthonormal. Hence there must be an invertible P, and M ⊂ HK.


The Mercer picture

we have

[figure: ΦM maps X into HM = l2ᵈ; Γ maps l2ᵈ into M = HK, with (Γ ∘ ΦM)(xi) = k(.,xi); <f,g>* = Σij αi βj k(xi,xj) in HK and <f,g>** = Σk αk βk/λk in M]


In summary

HK = M, and <.,.>* in HK is the same as <.,.>** in M

the reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an o.n. basis

Reproducing kernel map:

HK = { f(.) | f(.) = Σi=1..m αi k(.,xi) }

<f,g>* = Σi=1..m Σj=1..m' αi βj k(xi,xj')

Φ: x → k(.,x)

Mercer kernel map:

HM = l2ᵈ = { x | Σi=1..d xi² < ∞ }

<f,g>* = fᵀg

ΦM: x → (√λ1 φ1(x), √λ2 φ2(x), ...)ᵀ

Γ: l2ᵈ → span{φk(.)},  ek → √λk φk(.),  Γ ∘ ΦM = Φ

this turns out to be true for any RKHS obtained from a k(x,y) (1-1 relationship)


Regularization

Q: RKHS: why do we care?
A: regularization
• we want to do well outside of the training set
• minimizing empirical risk is not enough
• need to penalize complexity
• example: regression
• give me n points, I will give you a model of zero error (polynomial of order n-1)
• incredibly “wiggly”
• poor generalization
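The regression example is easy to reproduce (a sketch; the target function and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)  # n noisy samples

coeffs = np.polyfit(x, y, deg=n - 1)     # order n-1 polynomial: zero error
print(np.abs(np.polyval(coeffs, x) - y).max())   # ~0: zero empirical risk

x_fine = np.linspace(0, 1, 200)
print(np.abs(np.polyval(coeffs, x_fine)).max())  # typically >> max|y|: "wiggly"
```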


Regularization

we need to “regularize” the risk

Rreg[f] = Remp[f] + λ Ω[f]

• f is the function we are trying to learn
• λ > 0 is the regularization parameter
• Ω is a complexity-penalizing functional, larger when f is more “wiggly”

the regularized risk can be justified in various ways:
• constrained minimization
• Bayesian inference
• structural risk minimization
• ...


Constrained minimization

regularized risk as the solution to the problem

f* = argmin_f Remp[f]  subject to  Ω[f] ≤ t

we will talk a lot more about this
• the solution is the solution of the Lagrangian problem

f* = argmin_f Remp[f] + λ Ω[f]

• λ is a Lagrange multiplier
• changing λ is equivalent to changing t (the constraint on the complexity of the optimal solution)


Bayesian inference

in Bayesian inference we have
• a likelihood function PX|θ(x|θ)
• a prior Pθ(θ)
• and search for the maximum a posteriori (MAP) estimate

θ* = argmaxθ Pθ|X(θ|x)
   = argmaxθ PX|θ(x|θ) Pθ(θ) / PX(x)
   = argmaxθ PX|θ(x|θ) Pθ(θ)
   = argmaxθ [ log PX|θ(x|θ) + log Pθ(θ) ]


Bayesian inference

if PX|θ(x|f) = e^(−Remp[f]) and Pθ(f) = e^(−λΩ[f])
• the MAP estimate is the minimum of the regularized risk

f* = argmin_f Remp[f] + λ Ω[f]

• the prior Pθ(f) assigns low probability to functions f with large Ω[f]
• this reflects our prior belief that it is “unlikely” that the solution will be very complex

example:
• if Ω[f] = ||f − f0||², the prior is a Gaussian centered at f0 with variance proportional to 1/λ
• we believe that the most likely solution is f = f0
• the larger the λ, the more we penalize solutions which are different from this
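For one concrete instance of this correspondence (an added example, with assumptions beyond the slide): take Remp[f] = ||Y − Xw||² for a linear model and Ω = ||w||². The MAP estimate is then ridge regression, with the closed form below; the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
lam = 0.1

# minimizer of ||y - Xw||^2 + lam*||w||^2 == MAP under a Gaussian prior on w
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)   # close to (1, -2, 0.5) for small noise and small lam
```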


Structural risk minimization

• start from a nested collection of families of functions

S1 ⊂ ... ⊂ Sk

where Si = {hi(x,α), for all α}
• for each Si, find the function (set of parameters) that minimizes the empirical risk

Remp,i = minα (1/n) Σk=1..n L[yk, hi(xk,α)]

• select the function class such that

R* = mini [ Remp,i + Φ(hi) ]

• Φ(h) is a function of the VC dimension (complexity) of the family Si
• Vapnik and Chervonenkis have shown that this is equivalent to minimizing a bound on the risk, and provides generalization guarantees
• regularization with the “right” regularizer!


The regularizer

what is a good regularizer?

intuition: “wigglier” functions have a larger norm than smoother functions

for example, in HK we have

f(x) = Σi αi k(x,xi)
     = Σi αi Σl λl φl(xi) φl(x)
     = Σl [ λl Σi αi φl(xi) ] φl(x)
     = Σl cl φl(x)


The regularizer

and

||f||²* = <f,f>* = Σkl ck cl <φk,φl>** = Σkl ck cl δkl/λl = Σl cl²/λl

with cl = λl Σi αi φl(xi)

hence,
• ||f||² grows with the number of cl different from zero
• this is the case in which f is more complex, since it becomes a sum of more basis functions φl(x)
• identical to what happens in Fourier-type decompositions: more coefficients means more “high frequencies” or “less smoothness”
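This intuition can be checked numerically with the RKHS norm ||f||²* = αᵀKα (a sketch; kernel width, grid, and targets are arbitrary, and a small ridge term is added for numerical stability):

```python
import numpy as np

x = np.linspace(0, 1, 20)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.2 ** 2)   # Gaussian Gram matrix

y_smooth = np.sin(2 * np.pi * x)     # smooth target
y_wiggly = np.sin(12 * np.pi * x)    # "wiggly" target

for y in (y_smooth, y_wiggly):
    alpha = np.linalg.solve(K + 1e-6 * np.eye(len(x)), y)  # fit f to y
    print(alpha @ K @ alpha)   # ||f||_*^2: much larger for the wiggly target
```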


Regularization

OK, regularization is a good idea from multiple points of view

problem: minimizing the regularized risk

Rreg[f] = Remp[f] + λ Ω[f]

over the set of all functions seems like a nightmare

it turns out that it is not, under some reasonable conditions on the regularizer Ω; the key is the “representer theorem”


Representer theorem

Theorem: let
• Ω: [0, ∞) → ℜ be a strictly monotonically increasing function,
• H the RKHS associated with a kernel k(x,y),
• L[y,f(x)] a loss function

then, if

f* = argmin_f [ Σi=1..n L[yi, f(xi)] + λ Ω(||f||²) ]

f* admits a representation of the form

f* = Σi=1..n αi k(.,xi)   (i.e. f* ∈ H)


Proof
• decompose any f into the part contained in the span of the k(.,xi) and the part in the orthogonal complement

f(x) = f0(x) + f⊥(x) = Σi=1..n αi k(x,xi) + f⊥(x)

where f0 ∈ w0 and f⊥ ∈ w⊥, with

w0 = { f | f ∈ span{k(.,xi)} }
w⊥ = { g | <g,f> = 0, ∀ f ∈ w0 }

• then

Ω(||f||²) = Ω( ||Σi αi k(.,xi) + f⊥||² )
          = Ω( ||Σi αi k(.,xi)||² + ||f⊥||² )
          ≥ Ω( ||Σi αi k(.,xi)||² )    (Ω mon. increasing)


Proof
• this shows that the second term in

f* = argmin_f [ Σi=1..n L[yi, f(xi)] + λ Ω(||f||²) ]

is minimized by a function of the stated form
• the first, using the reproducing property

f(xi) = <f(.),k(.,xi)> = <f0(.),k(.,xi)> + <f⊥(.),k(.,xi)>
      = <f0(.),k(.,xi)>          (<f⊥(.),k(.,xi)> = 0)
      = Σj=1..n αj k(xj,xi)

is always a function of the stated form
• hence, the minimum must be of this form as well


The picture

due to the reproducing property, f(xi) = <f,k(.,xi)>, so any f' with the same projection f* onto span{k(.,xi)}:
• has ||f'|| > ||f*||
• is a more complex function
• which does not reduce the loss

hence, f* is the optimal solution

[figure: f' has the same projection f* onto the span of the k(.,xi), but a larger norm]


Relevance

the remarkable consequence of the theorem is that:
• we can reduce the minimization over the (infinite dimensional) space of functions
• to a minimization on a finite dimensional space!

to see this note that, because f* = Σi=1..n αi k(.,xi), we have

||f*||² = <f*,f*> = Σij αi αj <k(.,xi),k(.,xj)> = Σij αi αj k(xi,xj) = αᵀKα

f*(xj) = Σi αi k(xi,xj) = (Kα)j


Regularization

this proves the following theorem

Theorem:
• if Ω: [0, ∞) → ℜ is a strictly monotonically increasing function,
• then for any dot-product kernel k(x,y) and loss function L[y,f(x)], the solution of

f* = argmin_f [ Σi=1..n L[yi, f(xi)] + λ Ω(||f||²) ]

is

f* = Σi=1..n αi* k(.,xi)

with

α* = argminα [ L[Y, Kα] + λ Ω(αᵀKα) ]

K the Gram matrix [k(xi,xj)] and Y = (..., yi, ...)
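For the squared loss L[y,f(x)] = (y − f(x))² and Ω the identity, the finite-dimensional problem has the closed-form solution α* = (K + λI)⁻¹Y, i.e. kernel ridge regression. A sketch under that choice of loss (synthetic data, arbitrary kernel width):

```python
import numpy as np

def k(a, b, sigma=0.5):
    """Gaussian kernel; any Mercer kernel would do."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / sigma ** 2)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)
lam = 0.1

K = k(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)  # minimizes L[Y,Kα] + λ αᵀKα

x_new = np.linspace(0, 1, 5)
print(k(x_new, x) @ alpha)   # f*(x) = sum_i alpha_i k(x, x_i) at new points
```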


Note

the result that f* ∈ H holds for any norm, not just L2

however, the exact form of the problem will no longer be

α* = argminα [ L[Y, Kα] + λ Ω(αᵀKα) ]

the argument of the second term will have a different form



Projects

at this time you should have a reasonable idea of what you are going to do

four questions:
• what are you going to do?
• why should you do it?
• how are you going to do it?
• what results do you expect to get?

next thursday:
• no class; 10-12 minute meetings to discuss the four questions
• send me email saying if you are available at 3:00
• be prepared, no time to spare


Pointers

besides the books in the syllabus:
• Proceedings Neural Information Processing Systems
• Proceedings International Conference on Machine Learning
• Neural Computation
• IEEE Transactions on Pattern Analysis and Machine Intelligence
• Journal of Machine Learning Research
• IEEE Conference on Computer Vision and Pattern Recognition
• International Conference on Computer Vision
• IEEE Int. Conference Acoustics, Speech, Signal Processing
• IEEE International Conference Image Processing
• International Journal Computer Vision

Also the INSPEC database and Google


Project topics (mostly undergrad)
• classification of graphics vs natural images on google
• visualizing data manifolds
• comparison of LDA-based face recognition methods