Reproducing kernel Hilbert spaces and regularization
Nuno Vasconcelos, ECE Department, UCSD
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
Perceptron: the classifier implements the linear decision rule
h(x) = sgn[g(x)],   with   g(x) = w^T x + b
appropriate when the classes are linearly separable
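As a quick illustration (not part of the original slides; the weights and data below are arbitrary assumptions), the decision rule in a few lines of Python:

    import numpy as np

    def g(x, w, b):
        # linear discriminant g(x) = w^T x + b
        return np.dot(w, x) + b

    def h(x, w, b):
        # decision rule h(x) = sgn(g(x))
        return np.sign(g(x, w, b))

    # assumed (illustrative) parameters
    w = np.array([1.0, -2.0])
    b = 0.5
    print(h(np.array([3.0, 1.0]), w, b))   # prints 1.0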
to deal with non-linear separability we introduce a kernel
Types of kernels
these three are equivalent:
dot-product kernel ⇔ positive definite kernel ⇔ Mercer kernel
Dot-product kernels
Definition: a mapping
k: X × X → ℜ,   (x,y) → k(x,y)
is a dot-product kernel if and only if
k(x,y) = <Φ(x), Φ(y)>
where Φ: X → H, H is a vector space, and <.,.> is a dot-product in H
[figure: the feature map Φ takes points in X to a space H in which the two classes become linearly separable]
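To make the definition concrete, here is a small sketch (my illustration, not from the slides) checking k(x,y) = <Φ(x),Φ(y)> for the homogeneous second-order polynomial kernel k(x,y) = (x^T y)² on R², whose explicit feature map is Φ(x) = (x_1², √2 x_1 x_2, x_2²):

    import numpy as np

    def poly2_kernel(x, y):
        # homogeneous 2nd-order polynomial kernel k(x, y) = (x^T y)^2
        return np.dot(x, y) ** 2

    def poly2_features(x):
        # explicit feature map for the 2D case: Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    y = np.array([3.0, -1.0])
    print(poly2_kernel(x, y))                              # 1.0
    print(np.dot(poly2_features(x), poly2_features(y)))    # 1.0, the same value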
positive definite and Mercer kernels
Definition: k(x,y) is a positive definite kernel on X × X if, ∀ l and ∀ x_1, ..., x_l, x_i ∈ X, the Gram matrix
K = [K_ij],   K_ij = k(x_i, x_j)
is positive definite.
Definition: a symmetric mapping k: X × X → R such that
∫∫ k(x,y) f(x) f(y) dx dy ≥ 0,   ∀ f s.t. ∫ f(x)² dx < ∞
is a Mercer kernel
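A quick numerical check of the positive definite condition (an illustrative sketch, not part of the lecture; the kernel choice and sample points are assumptions): build the Gram matrix for a few points and inspect its eigenvalues, which should all be non-negative.

    import numpy as np

    def k(x, y):
        # inhomogeneous 2nd-order polynomial kernel, a standard positive definite kernel
        return (np.dot(x, y) + 1.0) ** 2

    X = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0], [3.0, 1.0]])   # assumed sample points
    K = np.array([[k(xi, xj) for xj in X] for xi in X])               # Gram matrix K_ij = k(x_i, x_j)

    print(np.linalg.eigvalsh(K))   # eigenvalues of the symmetric Gram matrix; all >= 0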
Two different pictures
different definitions lead to different interpretations of what the kernel does

Reproducing kernel map:
H_K = { f(.) | f(.) = Σ_{i=1}^m α_i k(., x_i) }
<f, g>* = Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x'_j)
Φ: x → k(., x)

Mercer kernel map:
H_M = l_2^d = { x | Σ_{i=1}^d x_i² < ∞ }
<f, g>* = f^T g
Φ: x → (√λ_1 φ_1(x), √λ_2 φ_2(x), ..., √λ_d φ_d(x))^T

where λ_i > 0 and φ_i are the eigenvalues and eigenfunctions of k(x,y)
The dot-product picture
when we use the Gaussian kernel
K(x, x_i) = exp( -||x - x_i||² / σ² )
• the point x_i ∈ X is mapped into the Gaussian G(x, x_i, σI)
• H is the space of all functions that are linear combinations of Gaussians
• the kernel is a dot product in H, and a non-linear similarity on X
• reproducing property on H: analogy to linear systems
[figure: Φ maps each x_i ∈ X to the function k(., x_i) ∈ H_K; the classes become separable in H_K]
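The following sketch (illustrative; the data, coefficients, and σ are assumptions) shows this picture in code: each point x_i contributes a Gaussian bump k(., x_i), and an element f of H is a weighted sum of those bumps.

    import numpy as np

    def gaussian_kernel(x, xi, sigma=1.0):
        # k(x, x_i) = exp(-||x - x_i||^2 / sigma^2)
        return np.exp(-np.sum((np.asarray(x) - np.asarray(xi)) ** 2) / sigma ** 2)

    def make_f(alphas, centers, sigma=1.0):
        # return f(.) = sum_i alpha_i k(., x_i), an element of H
        def f(x):
            return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alphas, centers))
        return f

    # assumed example coefficients and centers
    centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
    alphas = [1.0, -0.5]
    f = make_f(alphas, centers)
    print(f(np.array([0.5, 0.5])))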
The Mercer picture
this is a mapping from R^n to R^d (the images Φ(x_i) live in l_2^d)
[figure: Φ maps x_i ∈ X to the vector Φ(x_i) = (√λ_1 φ_1(x_i), ..., √λ_d φ_d(x_i)) ∈ l_2^d]
this is much more like a multi-layer Perceptron than before: the kernelized Perceptron as a neural net, with hidden units Φ_1(.), Φ_2(.), ..., Φ_d(.)
The reproducing property
with this definition of H_K and <.,.>*
∀ f ∈ H_K,   <k(.,x), f(.)>* = f(x)
this is called the reproducing property
an analogy is to think of linear time-invariant systems:
• the dot product as a convolution
• k(.,x) as the Dirac delta
• f(.) as a system input
• the equation above is the basis of all linear time-invariant systems theory
this leads to reproducing kernel Hilbert spaces
reproducing kernel Hilbert spaces
Definition: a Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences)
Definition: let H be a Hilbert space of functions f: X → R. H is a RKHS endowed with the dot-product <.,.>* if ∃ k: X × X → R such that
1. k spans H, i.e., H = span{k(., x)} = { f(.) | f(.) = Σ_i α_i k(., x_i), for some x_i, α_i }
2. <f(.), k(., x)>* = f(x),  ∀ f ∈ H
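A small numeric sanity check of property 2 (my sketch; the kernel, centers, and coefficients are assumptions): for f(.) = Σ_i α_i k(., x_i), applying the dot-product formula <f, g>* = Σ_ij α_i β_j k(x_i, x_j) with g = k(., x) gives Σ_i α_i k(x_i, x), which is exactly f(x).

    import numpy as np

    def k(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    centers = [np.array([0.0]), np.array([1.0]), np.array([2.5])]   # assumed x_i
    alphas = [0.3, -1.2, 0.7]                                       # assumed alpha_i

    def f(x):
        # evaluate f(x) = sum_i alpha_i k(x, x_i)
        return sum(a * k(x, xi) for a, xi in zip(alphas, centers))

    x = np.array([0.8])
    # k(., x) has a single coefficient 1 at center x, so the dot-product formula gives
    # <f(.), k(., x)>* = sum_i alpha_i * 1 * k(x_i, x)
    dot = sum(a * 1.0 * k(xi, x) for a, xi in zip(alphas, centers))
    print(dot, f(x))   # the two numbers coincide: the reproducing property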
Mercer kernels
how different are the spaces H_K and H_M?
Theorem: Let k: X × X → R be a Mercer kernel. Then there exists an orthonormal set of functions,
∫ φ_i(x) φ_j(x) dx = δ_ij
and a set of λ_i ≥ 0, such that
1) Σ_{i=1}^∞ λ_i² = ∫∫ k²(x,y) dx dy
2) k(x,y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y)    (**)
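In practice the λ_i and φ_i of (**) are often approximated from a finite sample via the eigendecomposition of the Gram matrix (the Nyström idea). A minimal sketch (my illustration, with an assumed Gaussian kernel and sample): the eigenvalues of K/n approximate the λ_i and the scaled eigenvectors give the φ_i at the sample points.

    import numpy as np

    def k(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 1))                 # assumed sample
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])

    lam, U = np.linalg.eigh(K)                            # ascending order
    lam, U = lam[::-1], U[:, ::-1]
    lam_approx = lam / n                                  # approximate Mercer eigenvalues
    phi_at_X = np.sqrt(n) * U                             # approximate phi_i at the sample points

    # check: sum_i lambda_i phi_i(x) phi_i(y) reconstructs k(x, y) on the sample
    K_rec = (phi_at_X * lam_approx) @ phi_at_X.T
    print(np.allclose(K_rec, K))                          # True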
RK vs Mercer maps
note that for H_M we are writing
Φ(x) = √λ_1 φ_1(x) e_1 + ... + √λ_d φ_d(x) e_d
but, since the φ_i(.) are orthonormal, there is a 1-1 map
Γ: l_2^d → span{φ_k(.)},   e_k → √λ_k φ_k(.)
and we can write, using (**),
Γ∘Φ(x) = λ_1 φ_1(x) φ_1(.) + ... + λ_d φ_d(x) φ_d(.) = k(., x)    (***)
hence k(., x) maps x into M = span{φ_k(.)}
The Mercer picture
we have
[figure: Φ maps x_i ∈ X into l_2^d (basis e_1, ..., e_d); Γ then maps l_2^d into M = span{φ_k(.)}, with Γ∘Φ(x_i) = k(., x_i)]
RK vs Mercer maps
define the dot product in M so that
<φ_k(.), φ_j(.)>** = (1/√(λ_k λ_j)) ∫ φ_k(x) φ_j(x) dx = δ_kj / λ_j
then φ_k(.) is a basis, M is a vector space, and any function in M can be written as
f(x) = Σ_k α_k φ_k(x)
and
<f(.), k(., x)>** = < Σ_i α_i φ_i(.), Σ_j λ_j φ_j(x) φ_j(.) >**
                  = Σ_ij α_i λ_j φ_j(x) <φ_i(.), φ_j(.)>**      (the last factor is δ_ij / λ_j)
                  = Σ_i α_i φ_i(x) = f(x)
i.e., k is a reproducing kernel on M
RK vs Mercer maps
furthermore, since k(.,x) ∈ M, any functions of the form
f(.) = Σ_i α_i k(., x_i),    g(.) = Σ_j β_j k(., x_j)
are in M, and
<f, g>** = Σ_ij α_i β_j <k(., x_i), k(., x_j)>**
         = Σ_ij α_i β_j < Σ_l λ_l φ_l(x_i) φ_l(.), Σ_m λ_m φ_m(x_j) φ_m(.) >**
         = Σ_ij α_i β_j Σ_{l,m} λ_l λ_m φ_l(x_i) φ_m(x_j) <φ_l(.), φ_m(.)>**
         = Σ_ij α_i β_j Σ_l λ_l φ_l(x_i) φ_l(x_j)
         = Σ_ij α_i β_j k(x_i, x_j)
note that f, g ∈ H_K and this is the dot product we had in H_K
The Mercer picture
furthermore, note that for f and g in H_M
f(x) = Σ_k α_k φ_k(x),    g(x) = Σ_l β_l φ_l(x)
and since
<φ_k(.), φ_j(.)>** = (1/√(λ_k λ_j)) ∫ φ_k(x) φ_j(x) dx = δ_kj / λ_j
the dot product on H_M is
<f, g>** = Σ_{k,l} α_k β_l <φ_k(.), φ_l(.)>** = Σ_k α_k β_k / λ_k
The Mercer picture
we have
[figure: Φ maps x_i ∈ X into l_2^d and Γ maps l_2^d into M ⊂ H_K, with Γ∘Φ(x_i) = k(., x_i); on M the dot product can be written either as <f, g>** = Σ_ij α_i β_j k(x_i, x_j) or as <f, g>** = Σ_k α_k β_k / λ_k]
RK vs Mercer maps
H_K ⊂ M, and <.,.>* in H_K is the same as <.,.>** in M
Question: is M ⊂ H_K?
• need to show that any f ∈ M is also in H_K
• from (***), f(x) = Σ_k α_k φ_k(x) ∈ M
and, for any sequence x_1, ..., x_d,
k(., x_i) = λ_1 φ_1(x_i) φ_1(.) + ... + λ_d φ_d(x_i) φ_d(.)
i.e., in matrix form,
[k(., x_1); ...; k(., x_d)] = P [φ_1(.); ...; φ_d(.)],   with   P_ij = λ_j φ_j(x_i)
• if there is an invertible P, then each φ_k(.) is a linear combination of the k(., x_i), hence in H_K, and M ⊂ H_K
RK vs Mercer maps
since λ_i > 0,
P = [φ_j(x_i)] diag(λ_1, ..., λ_d) = Π diag(λ_1, ..., λ_d)
is invertible when Π = [φ_j(x_i)] is. If there is no sequence x_1, ..., x_d for which Π is invertible, then
∀ i, ∃ α ≠ 0  s.t.  Σ_k α_k φ_k(x_i) = 0
i.e.
∀ x, ∃ α ≠ 0  s.t.  Σ_k α_k φ_k(x) = 0
and the φ_k(x) cannot be orthonormal. Hence there must be an invertible P, and M ⊂ H_K.
The Mercer picture
we have
[figure: Φ_M maps x_i ∈ X into H_M = l_2^d and Γ maps l_2^d onto M = H_K, with Γ∘Φ(x_i) = k(., x_i); the dot products <f, g>** = Σ_ij α_i β_j k(x_i, x_j) and <f, g>** = Σ_k α_k β_k / λ_k coincide]
In summary
H_K = M, and <.,.>* in H_K is the same as <.,.>** in M
the reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an o.n. basis

Reproducing kernel map:
H_K = { f(.) | f(.) = Σ_{i=1}^m α_i k(., x_i) }
<f, g>* = Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x'_j)
Φ: x → k(., x)

Mercer kernel map:
H_M = l_2^d = { x | Σ_{i=1}^d x_i² < ∞ }
<f, g>* = f^T g
Φ_M: x → (√λ_1 φ_1(x), √λ_2 φ_2(x), ..., √λ_d φ_d(x))^T
Γ: l_2^d → span{φ_k(.)},   e_k → √λ_k φ_k(.),   with   Γ∘Φ_M = Φ

this turns out to be true for any RKHS obtained from k(x,y) (1-1 relationship)
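A numerical illustration of this equivalence (my sketch; the kernel and data are assumptions): on a finite sample, the reproducing-kernel-map dot product Σ_ij α_i β_j k(x_i, x_j) and the Mercer-map dot product Σ_k a_k b_k / λ_k, computed with the empirical eigen-system of the Gram matrix, give the same number.

    import numpy as np

    def k(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 2))                       # assumed sample points
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    alpha = rng.normal(size=n)                         # f(.) = sum_i alpha_i k(., x_i)
    beta = rng.normal(size=n)                          # g(.) = sum_j beta_j k(., x_j)

    # reproducing-kernel-map dot product: <f,g>* = sum_ij alpha_i beta_j k(x_i, x_j)
    dot_rk = alpha @ K @ beta

    # Mercer-map dot product, using the empirical eigen-system of K
    lam, U = np.linalg.eigh(K)
    lam_hat = lam / n                                  # empirical Mercer eigenvalues
    phi = np.sqrt(n) * U                               # empirical phi_k at the sample points
    c = phi.T @ alpha                                  # c_k = sum_i alpha_i phi_k(x_i)
    d = phi.T @ beta
    # the Mercer coordinates of f and g are a_k = lam_k c_k and b_k = lam_k d_k,
    # so sum_k a_k b_k / lam_k = sum_k lam_k c_k d_k
    dot_mercer = np.sum(lam_hat * c * d)

    print(dot_rk, dot_mercer)                          # the two values coincide (up to round-off)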
Regularization
Q: RKHS: why do we care?
A: regularization
• we want to do well outside of the training set
• minimizing empirical risk is not enough
• need to penalize complexity
• example: regression
• give me n points, I will give you a model of zero error (polynomial of order n-1), as in the sketch below
• incredibly "wiggly"
• poor generalization
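A minimal sketch of that regression example (my illustration; the data are assumed): a degree n-1 polynomial fits n noisy samples exactly but typically oscillates between them and generalizes poorly, while a low-order fit has some training error and a much smaller test error.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)    # assumed noisy sample

    wiggly = np.polyfit(x, y, deg=n - 1)   # degree n-1: interpolates all points (zero empirical risk)
    smooth = np.polyfit(x, y, deg=3)       # low-order fit: nonzero empirical risk, much smoother

    x_test = np.linspace(0, 1, 200)
    y_true = np.sin(2 * np.pi * x_test)
    mse_wiggly = np.mean((np.polyval(wiggly, x_test) - y_true) ** 2)
    mse_smooth = np.mean((np.polyval(smooth, x_test) - y_true) ** 2)
    print(mse_wiggly, mse_smooth)          # compare generalization error of the two fits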
Regularization
we need to "regularize" the risk
R_reg[f] = R_emp[f] + λ Ω[f]
• f is the function we are trying to learn
• λ > 0 is the regularization parameter
• Ω is a complexity-penalizing functional
• larger when f is more "wiggly"
the regularized risk can be justified in various ways
• constrained minimization
• Bayesian inference
• structural risk minimization
• ...
Constrained minimization
regularized risk as the solution to the problem
f* = argmin_f R_emp[f]   subject to   Ω[f] ≤ t
we will talk a lot more about this
• the solution is the solution of the Lagrangian problem
f* = argmin_f R_emp[f] + λ Ω[f]
• λ is a Lagrange multiplier
• changing λ is equivalent to changing t (the constraint on the complexity of the optimal solution)
Bayesian inference
in Bayesian inference we have
• likelihood function P_{X|θ}(x|θ)
• prior P_θ(θ)
• and search for the maximum a posteriori (MAP) estimate
θ* = argmax_θ P_{θ|X}(θ|x)
   = argmax_θ P_{X|θ}(x|θ) P_θ(θ) / P_X(x)
   = argmax_θ P_{X|θ}(x|θ) P_θ(θ)
   = argmax_θ [ log P_{X|θ}(x|θ) + log P_θ(θ) ]
Bayesian inference
if P_{X|θ}(x|f) = e^{-R_emp[f]} and P_θ(f) = e^{-λΩ[f]}
• the MAP estimate is the minimum of the regularized risk
f* = argmin_f R_emp[f] + λ Ω[f]
• the prior P_θ(f) assigns low probability to functions f with large Ω[f]
• this reflects our prior belief that it is "unlikely" that the solution will be very complex
example:
• if Ω[f] = ||f - f_0||², the prior is a Gaussian centered at f_0 with variance proportional to 1/λ
• we believe that the most likely solution is f = f_0
• the larger the λ, the more we penalize solutions which are different from this
Structural risk minimization
• start from a nested collection of families of functions
S_1 ⊂ ... ⊂ S_k
where S_i = {h_i(x,α), for all α}
• for each S_i, find the function (set of parameters) that minimizes the empirical risk
R_emp^i = min_α (1/n) Σ_{k=1}^n L[y_k, h_i(x_k, α)]
• select the function class such that
R* = min_i [ R_emp^i + Φ(h_i) ]
• Φ(h) is a function of the VC dimension (complexity) of the family S_i
• Vapnik and Chervonenkis have shown that this is equivalent to minimizing a bound on the risk, and provides generalization guarantees
• regularization with the "right" regularizer!
The regularizer
what is a good regularizer?
intuition: "wigglier" functions have a larger norm than smoother functions
for example, in H_K we have
f(x) = Σ_i α_i k(x, x_i)
     = Σ_i α_i [ Σ_l λ_l φ_l(x_i) φ_l(x) ]
     = Σ_l [ λ_l Σ_i α_i φ_l(x_i) ] φ_l(x)
     = Σ_l c_l φ_l(x)
The regularizer
and
||f||² = <f, f>** = Σ_{k,l} c_k c_l <φ_k, φ_l>** = Σ_{k,l} c_k c_l δ_kl / λ_l = Σ_l c_l² / λ_l
with
c_l = λ_l Σ_i α_i φ_l(x_i)
hence,
• ||f||² grows with the number of c_l different from zero
• this is the case in which f is more complex, since it becomes a sum of more basis functions φ_l(x)
• identical to what happens in Fourier-type decompositions
• more coefficients means more "high frequencies" or "less smoothness"
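A small numeric illustration (my sketch; the Gaussian kernel and sample are assumptions) of ||f||² = Σ_l c_l² / λ_l as a complexity measure: two functions with the same energy Σ_l c_l² = 1, one putting all its weight on the smoothest (leading) eigenfunction and one spreading it over higher-frequency eigenfunctions, have very different norms.

    import numpy as np

    def k(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    X = np.linspace(-3, 3, 100).reshape(-1, 1)            # assumed sample of the input space
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    lam = np.linalg.eigvalsh(K)[::-1] / n                 # empirical Mercer eigenvalues, descending

    # two coefficient vectors with the same energy sum_l c_l^2 = 1:
    # "smooth": all weight on the leading eigenfunction; "wiggly": spread over higher-frequency ones
    norm2_smooth = 1.0 ** 2 / lam[0]
    norm2_wiggly = np.sum(0.2 / lam[5:10])

    print(norm2_smooth, norm2_wiggly)    # ||f||^2 = sum_l c_l^2 / lambda_l is much larger
                                         # for the "wiggly" choice of coefficients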
Regularization
OK, regularization is a good idea from multiple points of view
problem: minimizing the regularized risk
R_reg[f] = R_emp[f] + λ Ω[f]
over the set of all functions seems like a nightmare
it turns out that it is not, under some reasonable conditions on the regularizer Ω
the key is the "representer theorem"
Representer theorem
Theorem: Let
• Ω: [0, ∞) → ℜ be a strictly monotonically increasing function,
• H the RKHS associated with a kernel k(x,y),
• L[y, f(x)] a loss function.
then, if
f* = argmin_f [ Σ_{i=1}^n L[y_i, f(x_i)] + λ Ω(||f||²) ]
f* admits a representation of the form
f* = Σ_{i=1}^n α_i k(., x_i)     (i.e. f* ∈ H)
Proof
• decompose any f into the part contained in the span of the k(., x_i) and the part in the orthogonal complement
f(x) = f_0(x) + f_⊥(x) = Σ_{i=1}^n α_i k(x, x_i) + f_⊥(x)
where f_0 ∈ w_0 and f_⊥ ∈ w_⊥, with
w_0 = span{ k(., x_i) }
w_⊥ = { g | <g, f> = 0, ∀ f ∈ w_0 }
• then
Ω(||f||²) = Ω( ||Σ_i α_i k(., x_i) + f_⊥||² )
          = Ω( ||Σ_i α_i k(., x_i)||² + ||f_⊥||² )
          ≥ Ω( ||Σ_i α_i k(., x_i)||² )        (Ω mon. increasing)
Proof
• this shows that the second term in
f* = argmin_f [ Σ_{i=1}^n L[y_i, f(x_i)] + λ Ω(||f||²) ]
is minimized by a function of the stated form
• the first term, using the reproducing property,
f(x_i) = <f(.), k(., x_i)> = <f_0(.), k(., x_i)> + <f_⊥(.), k(., x_i)> = <f_0(.), k(., x_i)> = Σ_{j=1}^n α_j k(x_i, x_j)
(the f_⊥ term is zero since f_⊥ is orthogonal to k(., x_i))
is always a function of the stated form
• hence, the minimum must be of this form as well
The picture
due to the reproducing property:
• f(x_i) depends only on the component of f in the span of the k(., x_i)
if f* is the solution in that span, any f' with an extra orthogonal component
• has ||f'|| > ||f*||
• is a more complex function
• which does not reduce the loss
hence, f* is the optimal solution
[figure: f* lies in the span of the k(., x_i); f' adds an orthogonal component and has larger norm]
Relevance
the remarkable consequence of the theorem is that:
• we can reduce the minimization over the (infinite dimensional) space of functions
• to a minimization on a finite dimensional space!
to see this note that, because
f* = Σ_{i=1}^n α_i k(., x_i)
we have
||f*||² = <f*, f*> = Σ_ij α_i α_j <k(., x_i), k(., x_j)> = Σ_ij α_i α_j k(x_i, x_j) = α^T K α
f*(x_i) = Σ_j α_j k(x_i, x_j) = (K α)_i
Regularization
this proves the following theorem
Theorem:
• if Ω: [0, ∞) → ℜ is a strictly monotonically increasing function,
• then for any dot-product kernel k(x,y) and loss function L[y, f(x)]
the solution of
f* = argmin_f [ Σ_{i=1}^n L[y_i, f(x_i)] + λ Ω(||f||²) ]
is
f* = Σ_{i=1}^n α*_i k(., x_i)
with
α* = argmin_α [ Σ_{i=1}^n L[y_i, (Kα)_i] + λ Ω(α^T K α) ]
K the Gram matrix [k(x_i, x_j)] and Y = (..., y_i, ...)
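As a concrete special case (my sketch, not worked out in the slides): with the squared loss L[y, f(x)] = (y − f(x))² and Ω the identity, the finite-dimensional problem becomes min_α ||Y − Kα||² + λ α^T K α, i.e. kernel ridge regression, whose solution is α* = (K + λI)^{-1} Y when K + λI is invertible.

    import numpy as np

    def k(x, y, sigma=1.0):
        # Gaussian kernel (an assumed choice)
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    # assumed training data
    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 1, size=(20, 1)), axis=0)
    Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=20)

    lam = 0.1
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)    # alpha* = (K + lambda I)^{-1} Y

    def f_star(x):
        # f*(x) = sum_i alpha_i k(x, x_i)
        return sum(a * k(x, xi) for a, xi in zip(alpha, X))

    print(f_star(np.array([0.25])))   # prediction at a new point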
Note
the result that f* ∈ H holds for any norm, not just L2
however, the exact form of the problem will no longer be
α* = argmin_α [ Σ_{i=1}^n L[y_i, (Kα)_i] + λ Ω(α^T K α) ]
the argument of the second term will have a different form
Projects
at this time you should have a reasonable idea of what you are going to do
four questions
• what are you going to do?
• why should you do it?
• how are you going to do it?
• what results do you expect to get?
next Thursday:
• no class; 10-12 minute meetings to discuss the four questions
• send me email saying if you are available at 3:00
• be prepared, no time to spare
Pointers
besides the books in the syllabus
• Proceedings of Neural Information Processing Systems
• Proceedings of the International Conference on Machine Learning
• Neural Computation
• IEEE Transactions on Pattern Analysis and Machine Intelligence
• Journal of Machine Learning Research
• IEEE Conference on Computer Vision and Pattern Recognition
• International Conference on Computer Vision
• IEEE Int. Conference on Acoustics, Speech, and Signal Processing
• IEEE International Conference on Image Processing
• International Journal of Computer Vision
Also the INSPEC database and Google
Project topics (mostly undergrad)
• classification of graphics vs natural images on google
• visualizing data manifolds
• comparison of LDA-based face recognition methods