Reproducing kernel Hilbert spaces and regularization
Nuno Vasconcelos, ECE Department, UCSD
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
Perceptron: the classifier implements the linear decision rule
h(x) = sgn[g(x)],   with   g(x) = w^T x + b
appropriate when the classes are linearly separable
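As a quick illustration (not part of the original slides; the weights and data below are arbitrary assumptions), the decision rule in a few lines of Python:

    import numpy as np

    def g(x, w, b):
        # linear discriminant g(x) = w^T x + b
        return np.dot(w, x) + b

    def h(x, w, b):
        # decision rule h(x) = sgn(g(x))
        return np.sign(g(x, w, b))

    # assumed (illustrative) parameters
    w = np.array([1.0, -2.0])
    b = 0.5
    print(h(np.array([3.0, 1.0]), w, b))   # prints 1.0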
to deal with non-linear separability we introduce a kernel
Types of kernels
these three are equivalent:
dot-product kernel ⇔ positive definite kernel ⇔ Mercer kernel
Dot-product kernels
Definition: a mapping
k: X × X → ℜ,   (x,y) → k(x,y)
is a dot-product kernel if and only if
k(x,y) = <Φ(x), Φ(y)>
where Φ: X → H, H is a vector space, and <.,.> is a dot-product in H
[figure: the feature map Φ takes points in X to a space H in which the two classes become linearly separable]
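To make the definition concrete, here is a small sketch (my illustration, not from the slides) checking k(x,y) = <Φ(x),Φ(y)> for the homogeneous second-order polynomial kernel k(x,y) = (x^T y)² on R², whose explicit feature map is Φ(x) = (x_1², √2 x_1 x_2, x_2²):

    import numpy as np

    def poly2_kernel(x, y):
        # homogeneous 2nd-order polynomial kernel k(x, y) = (x^T y)^2
        return np.dot(x, y) ** 2

    def poly2_features(x):
        # explicit feature map for the 2D case: Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    y = np.array([3.0, -1.0])
    print(poly2_kernel(x, y))                              # 1.0
    print(np.dot(poly2_features(x), poly2_features(y)))    # 1.0, the same value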
positive definite and Mercer kernels
Definition: k(x,y) is a positive definite kernel on X × X if, ∀ l and ∀ x_1, ..., x_l, x_i ∈ X, the Gram matrix
K = [K_ij],   K_ij = k(x_i, x_j)
is positive definite.
Definition: a symmetric mapping k: X × X → R such that
∫∫ k(x,y) f(x) f(y) dx dy ≥ 0,   ∀ f s.t. ∫ f(x)² dx < ∞
is a Mercer kernel
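A quick numerical check of the positive definite condition (an illustrative sketch, not part of the lecture; the kernel choice and sample points are assumptions): build the Gram matrix for a few points and inspect its eigenvalues, which should all be non-negative.

    import numpy as np

    def k(x, y):
        # inhomogeneous 2nd-order polynomial kernel, a standard positive definite kernel
        return (np.dot(x, y) + 1.0) ** 2

    X = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0], [3.0, 1.0]])   # assumed sample points
    K = np.array([[k(xi, xj) for xj in X] for xi in X])               # Gram matrix K_ij = k(x_i, x_j)

    print(np.linalg.eigvalsh(K))   # eigenvalues of the symmetric Gram matrix; all >= 0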
Two different pictures
different definitions lead to different interpretations of what the kernel does

Reproducing kernel map:
H_K = { f(.) | f(.) = Σ_{i=1}^m α_i k(., x_i) }
<f, g>* = Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x'_j)
Φ: x → k(., x)

Mercer kernel map:
H_M = l_2^d = { x | Σ_{i=1}^d x_i² < ∞ }
<f, g>* = f^T g
Φ: x → (√λ_1 φ_1(x), √λ_2 φ_2(x), ..., √λ_d φ_d(x))^T

where λ_i > 0 and φ_i are the eigenvalues and eigenfunctions of k(x,y)
The dot-product picture
when we use the Gaussian kernel
K(x, x_i) = exp( -||x - x_i||² / σ² )
• the point x_i ∈ X is mapped into the Gaussian G(x, x_i, σI)
• H is the space of all functions that are linear combinations of Gaussians
• the kernel is a dot product in H, and a non-linear similarity on X
• reproducing property on H: analogy to linear systems
[figure: Φ maps each x_i ∈ X to the function k(., x_i) ∈ H_K; the classes become separable in H_K]
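The following sketch (illustrative; the data, coefficients, and σ are assumptions) shows this picture in code: each point x_i contributes a Gaussian bump k(., x_i), and an element f of H is a weighted sum of those bumps.

    import numpy as np

    def gaussian_kernel(x, xi, sigma=1.0):
        # k(x, x_i) = exp(-||x - x_i||^2 / sigma^2)
        return np.exp(-np.sum((np.asarray(x) - np.asarray(xi)) ** 2) / sigma ** 2)

    def make_f(alphas, centers, sigma=1.0):
        # return f(.) = sum_i alpha_i k(., x_i), an element of H
        def f(x):
            return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alphas, centers))
        return f

    # assumed example coefficients and centers
    centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
    alphas = [1.0, -0.5]
    f = make_f(alphas, centers)
    print(f(np.array([0.5, 0.5])))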
The Mercer picture
this is a mapping from R^n to R^d (the images Φ(x_i) live in l_2^d)
[figure: Φ maps x_i ∈ X to the vector Φ(x_i) = (√λ_1 φ_1(x_i), ..., √λ_d φ_d(x_i)) ∈ l_2^d]
this is much more like a multi-layer Perceptron than before: the kernelized Perceptron as a neural net, with hidden units Φ_1(.), Φ_2(.), ..., Φ_d(.)
The reproducing property
with this definition of H_K and <.,.>*
∀ f ∈ H_K,   <k(.,x), f(.)>* = f(x)
this is called the reproducing property
an analogy is to think of linear time-invariant systems:
• the dot product as a convolution
• k(.,x) as the Dirac delta
• f(.) as a system input
• the equation above is the basis of all linear time-invariant systems theory
this leads to reproducing kernel Hilbert spaces
reproducing kernel Hilbert spaces
Definition: a Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences)
Definition: let H be a Hilbert space of functions f: X → R. H is a RKHS endowed with the dot-product <.,.>* if ∃ k: X × X → R such that
1. k spans H, i.e., H = span{k(., x)} = { f(.) | f(.) = Σ_i α_i k(., x_i), for some x_i, α_i }
2. <f(.), k(., x)>* = f(x),  ∀ f ∈ H
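A small numeric sanity check of property 2 (my sketch; the kernel, centers, and coefficients are assumptions): for f(.) = Σ_i α_i k(., x_i), applying the dot-product formula <f, g>* = Σ_ij α_i β_j k(x_i, x_j) with g = k(., x) gives Σ_i α_i k(x_i, x), which is exactly f(x).

    import numpy as np

    def k(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    centers = [np.array([0.0]), np.array([1.0]), np.array([2.5])]   # assumed x_i
    alphas = [0.3, -1.2, 0.7]                                       # assumed alpha_i

    def f(x):
        # evaluate f(x) = sum_i alpha_i k(x, x_i)
        return sum(a * k(x, xi) for a, xi in zip(alphas, centers))

    x = np.array([0.8])
    # k(., x) has a single coefficient 1 at center x, so the dot-product formula gives
    # <f(.), k(., x)>* = sum_i alpha_i * 1 * k(x_i, x)
    dot = sum(a * 1.0 * k(xi, x) for a, xi in zip(alphas, centers))
    print(dot, f(x))   # the two numbers coincide: the reproducing property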
Mercer kernels
how different are the spaces H_K and H_M?
Theorem: Let k: X × X → R be a Mercer kernel. Then there exists an orthonormal set of functions,
∫ φ_i(x) φ_j(x) dx = δ_ij
and a set of λ_i ≥ 0, such that
1) Σ_{i=1}^∞ λ_i² = ∫∫ k²(x,y) dx dy
2) k(x,y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y)    (**)
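In practice the λ_i and φ_i of (**) are often approximated from a finite sample via the eigendecomposition of the Gram matrix (the Nyström idea). A minimal sketch (my illustration, with an assumed Gaussian kernel and sample): the eigenvalues of K/n approximate the λ_i and the scaled eigenvectors give the φ_i at the sample points.

    import numpy as np

    def k(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 1))                 # assumed sample
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])

    lam, U = np.linalg.eigh(K)                            # ascending order
    lam, U = lam[::-1], U[:, ::-1]
    lam_approx = lam / n                                  # approximate Mercer eigenvalues
    phi_at_X = np.sqrt(n) * U                             # approximate phi_i at the sample points

    # check: sum_i lambda_i phi_i(x) phi_i(y) reconstructs k(x, y) on the sample
    K_rec = (phi_at_X * lam_approx) @ phi_at_X.T
    print(np.allclose(K_rec, K))                          # True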
RK vs Mercer maps
note that for H_M we are writing
Φ(x) = √λ_1 φ_1(x) e_1 + ... + √λ_d φ_d(x) e_d
but, since the φ_i(.) are orthonormal, there is a 1-1 map
Γ: l_2^d → span{φ_k(.)},   e_k → √λ_k φ_k(.)
and we can write, using (**),
Γ∘Φ(x) = λ_1 φ_1(x) φ_1(.) + ... + λ_d φ_d(x) φ_d(.) = k(., x)    (***)
hence k(., x) maps x into M = span{φ_k(.)}
The Mercer picture
we have
[figure: Φ maps x_i ∈ X into l_2^d (basis e_1, ..., e_d); Γ then maps l_2^d into M = span{φ_k(.)}, with Γ∘Φ(x_i) = k(., x_i)]
RK vs Mercer maps
define the dot product in M so that
<φ_k(.), φ_j(.)>** = (1/√(λ_k λ_j)) ∫ φ_k(x) φ_j(x) dx = δ_kj / λ_j
then φ_k(.) is a basis, M is a vector space, and any function in M can be written as
f(x) = Σ_k α_k φ_k(x)
and
<f(.), k(., x)>** = < Σ_i α_i φ_i(.), Σ_j λ_j φ_j(x) φ_j(.) >**
                  = Σ_ij α_i λ_j φ_j(x) <φ_i(.), φ_j(.)>**      (the last factor is δ_ij / λ_j)
                  = Σ_i α_i φ_i(x) = f(x)
i.e., k is a reproducing kernel on M
RK vs Mercer maps
furthermore, since k(.,x) ∈ M, any functions of the form
f(.) = Σ_i α_i k(., x_i),    g(.) = Σ_j β_j k(., x_j)
are in M, and
<f, g>** = Σ_ij α_i β_j <k(., x_i), k(., x_j)>**
         = Σ_ij α_i β_j < Σ_l λ_l φ_l(x_i) φ_l(.), Σ_m λ_m φ_m(x_j) φ_m(.) >**
         = Σ_ij α_i β_j Σ_{l,m} λ_l λ_m φ_l(x_i) φ_m(x_j) <φ_l(.), φ_m(.)>**
         = Σ_ij α_i β_j Σ_l λ_l φ_l(x_i) φ_l(x_j)
         = Σ_ij α_i β_j k(x_i, x_j)
note that f, g ∈ H_K and this is the dot product we had in H_K
The Mercer picture
furthermore, note that for f and g in H_M
f(x) = Σ_k α_k φ_k(x),    g(x) = Σ_l β_l φ_l(x)
and since
<φ_k(.), φ_j(.)>** = (1/√(λ_k λ_j)) ∫ φ_k(x) φ_j(x) dx = δ_kj / λ_j
the dot product on H_M is
<f, g>** = Σ_{k,l} α_k β_l <φ_k(.), φ_l(.)>** = Σ_k α_k β_k / λ_k
The Mercer picture
we have
[figure: Φ maps x_i ∈ X into l_2^d and Γ maps l_2^d into M ⊂ H_K, with Γ∘Φ(x_i) = k(., x_i); on M the dot product can be written either as <f, g>** = Σ_ij α_i β_j k(x_i, x_j) or as <f, g>** = Σ_k α_k β_k / λ_k]
RK vs Mercer maps
H_K ⊂ M, and <.,.>* in H_K is the same as <.,.>** in M
Question: is M ⊂ H_K?
• need to show that any f ∈ M is also in H_K
• from (***), f(x) = Σ_k α_k φ_k(x) ∈ M
and, for any sequence x_1, ..., x_d,
k(., x_i) = λ_1 φ_1(x_i) φ_1(.) + ... + λ_d φ_d(x_i) φ_d(.)
i.e., in matrix form,
[k(., x_1); ...; k(., x_d)] = P [φ_1(.); ...; φ_d(.)],   with   P_ij = λ_j φ_j(x_i)
• if there is an invertible P, then each φ_k(.) is a linear combination of the k(., x_i), hence in H_K, and M ⊂ H_K
RK vs Mercer maps
since λ_i > 0,
P = [φ_j(x_i)] diag(λ_1, ..., λ_d) = Π diag(λ_1, ..., λ_d)
is invertible when Π = [φ_j(x_i)] is. If there is no sequence x_1, ..., x_d for which Π is invertible, then
∀ i, ∃ α ≠ 0  s.t.  Σ_k α_k φ_k(x_i) = 0
i.e.
∀ x, ∃ α ≠ 0  s.t.  Σ_k α_k φ_k(x) = 0
and the φ_k(x) cannot be orthonormal. Hence there must be an invertible P, and M ⊂ H_K.
The Mercer picture
we have
[figure: Φ_M maps x_i ∈ X into H_M = l_2^d and Γ maps l_2^d onto M = H_K, with Γ∘Φ(x_i) = k(., x_i); the dot products <f, g>** = Σ_ij α_i β_j k(x_i, x_j) and <f, g>** = Σ_k α_k β_k / λ_k coincide]
In summary
H_K = M, and <.,.>* in H_K is the same as <.,.>** in M
the reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an o.n. basis

Reproducing kernel map:
H_K = { f(.) | f(.) = Σ_{i=1}^m α_i k(., x_i) }
<f, g>* = Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x'_j)
Φ: x → k(., x)

Mercer kernel map:
H_M = l_2^d = { x | Σ_{i=1}^d x_i² < ∞ }
<f, g>* = f^T g
Φ_M: x → (√λ_1 φ_1(x), √λ_2 φ_2(x), ..., √λ_d φ_d(x))^T
Γ: l_2^d → span{φ_k(.)},   e_k → √λ_k φ_k(.),   with   Γ∘Φ_M = Φ

this turns out to be true for any RKHS obtained from k(x,y) (1-1 relationship)
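A numerical illustration of this equivalence (my sketch; the kernel and data are assumptions): on a finite sample, the reproducing-kernel-map dot product Σ_ij α_i β_j k(x_i, x_j) and the Mercer-map dot product Σ_k a_k b_k / λ_k, computed with the empirical eigen-system of the Gram matrix, give the same number.

    import numpy as np

    def k(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 2))                       # assumed sample points
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    alpha = rng.normal(size=n)                         # f(.) = sum_i alpha_i k(., x_i)
    beta = rng.normal(size=n)                          # g(.) = sum_j beta_j k(., x_j)

    # reproducing-kernel-map dot product: <f,g>* = sum_ij alpha_i beta_j k(x_i, x_j)
    dot_rk = alpha @ K @ beta

    # Mercer-map dot product, using the empirical eigen-system of K
    lam, U = np.linalg.eigh(K)
    lam_hat = lam / n                                  # empirical Mercer eigenvalues
    phi = np.sqrt(n) * U                               # empirical phi_k at the sample points
    c = phi.T @ alpha                                  # c_k = sum_i alpha_i phi_k(x_i)
    d = phi.T @ beta
    # the Mercer coordinates of f and g are a_k = lam_k c_k and b_k = lam_k d_k,
    # so sum_k a_k b_k / lam_k = sum_k lam_k c_k d_k
    dot_mercer = np.sum(lam_hat * c * d)

    print(dot_rk, dot_mercer)                          # the two values coincide (up to round-off)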
Regularization
Q: RKHS: why do we care?
A: regularization
• we want to do well outside of the training set
• minimizing empirical risk is not enough
• need to penalize complexity
• example: regression
• give me n points, I will give you a model of zero error (polynomial of order n-1), as in the sketch below
• incredibly "wiggly"
• poor generalization
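A minimal sketch of that regression example (my illustration; the data are assumed): a degree n-1 polynomial fits n noisy samples exactly but typically oscillates between them and generalizes poorly, while a low-order fit has some training error and a much smaller test error.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)    # assumed noisy sample

    wiggly = np.polyfit(x, y, deg=n - 1)   # degree n-1: interpolates all points (zero empirical risk)
    smooth = np.polyfit(x, y, deg=3)       # low-order fit: nonzero empirical risk, much smoother

    x_test = np.linspace(0, 1, 200)
    y_true = np.sin(2 * np.pi * x_test)
    mse_wiggly = np.mean((np.polyval(wiggly, x_test) - y_true) ** 2)
    mse_smooth = np.mean((np.polyval(smooth, x_test) - y_true) ** 2)
    print(mse_wiggly, mse_smooth)          # compare generalization error of the two fits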
Regularization
we need to "regularize" the risk
R_reg[f] = R_emp[f] + λ Ω[f]
• f is the function we are trying to learn
• λ > 0 is the regularization parameter
• Ω is a complexity-penalizing functional
• larger when f is more "wiggly"
the regularized risk can be justified in various ways
• constrained minimization
• Bayesian inference
• structural risk minimization
• ...
Constrained minimization
regularized risk as the solution to the problem
f* = argmin_f R_emp[f]   subject to   Ω[f] ≤ t
we will talk a lot more about this
• the solution is the solution of the Lagrangian problem
f* = argmin_f R_emp[f] + λ Ω[f]
• λ is a Lagrange multiplier
• changing λ is equivalent to changing t (the constraint on the complexity of the optimal solution)
Bayesian inference
in Bayesian inference we have
• likelihood function P_{X|θ}(x|θ)
• prior P_θ(θ)
• and search for the maximum a posteriori (MAP) estimate
θ* = argmax_θ P_{θ|X}(θ|x)
   = argmax_θ P_{X|θ}(x|θ) P_θ(θ) / P_X(x)
   = argmax_θ P_{X|θ}(x|θ) P_θ(θ)
   = argmax_θ [ log P_{X|θ}(x|θ) + log P_θ(θ) ]
Bayesian inference
if P_{X|θ}(x|f) = e^{-R_emp[f]} and P_θ(f) = e^{-λΩ[f]}
• the MAP estimate is the minimum of the regularized risk
f* = argmin_f R_emp[f] + λ Ω[f]
• the prior P_θ(f) assigns low probability to functions f with large Ω[f]
• this reflects our prior belief that it is "unlikely" that the solution will be very complex
example:
• if Ω[f] = ||f - f_0||², the prior is a Gaussian centered at f_0 with variance proportional to 1/λ
• we believe that the most likely solution is f = f_0
• the larger the λ, the more we penalize solutions which are different from this
Structural risk minimization
• start from a nested collection of families of functions
S_1 ⊂ ... ⊂ S_k
where S_i = {h_i(x,α), for all α}
• for each S_i, find the function (set of parameters) that minimizes the empirical risk
R_emp^i = min_α (1/n) Σ_{k=1}^n L[y_k, h_i(x_k, α)]
• select the function class such that
R* = min_i [ R_emp^i + Φ(h_i) ]
• Φ(h) is a function of the VC dimension (complexity) of the family S_i
• Vapnik and Chervonenkis have shown that this is equivalent to minimizing a bound on the risk, and provides generalization guarantees
• regularization with the "right" regularizer!
The regularizer
what is a good regularizer?
intuition: "wigglier" functions have a larger norm than smoother functions
for example, in H_K we have
f(x) = Σ_i α_i k(x, x_i)
     = Σ_i α_i [ Σ_l λ_l φ_l(x_i) φ_l(x) ]
     = Σ_l [ λ_l Σ_i α_i φ_l(x_i) ] φ_l(x)
     = Σ_l c_l φ_l(x)
The regularizer
and
||f||² = <f, f>** = Σ_{k,l} c_k c_l <φ_k, φ_l>** = Σ_{k,l} c_k c_l δ_kl / λ_l = Σ_l c_l² / λ_l
with
c_l = λ_l Σ_i α_i φ_l(x_i)
hence,
• ||f||² grows with the number of c_l different from zero
• this is the case in which f is more complex, since it becomes a sum of more basis functions φ_l(x)
• identical to what happens in Fourier-type decompositions
• more coefficients means more "high frequencies" or "less smoothness"
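A small numeric illustration (my sketch; the Gaussian kernel and sample are assumptions) of ||f||² = Σ_l c_l² / λ_l as a complexity measure: two functions with the same energy Σ_l c_l² = 1, one putting all its weight on the smoothest (leading) eigenfunction and one spreading it over higher-frequency eigenfunctions, have very different norms.

    import numpy as np

    def k(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    X = np.linspace(-3, 3, 100).reshape(-1, 1)            # assumed sample of the input space
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    lam = np.linalg.eigvalsh(K)[::-1] / n                 # empirical Mercer eigenvalues, descending

    # two coefficient vectors with the same energy sum_l c_l^2 = 1:
    # "smooth": all weight on the leading eigenfunction; "wiggly": spread over higher-frequency ones
    norm2_smooth = 1.0 ** 2 / lam[0]
    norm2_wiggly = np.sum(0.2 / lam[5:10])

    print(norm2_smooth, norm2_wiggly)    # ||f||^2 = sum_l c_l^2 / lambda_l is much larger
                                         # for the "wiggly" choice of coefficients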
Regularization
OK, regularization is a good idea from multiple points of view
problem: minimizing the regularized risk
R_reg[f] = R_emp[f] + λ Ω[f]
over the set of all functions seems like a nightmare
it turns out that it is not, under some reasonable conditions on the regularizer Ω
the key is the "representer theorem"
Representer theorem
Theorem: Let
• Ω: [0, ∞) → ℜ be a strictly monotonically increasing function,
• H the RKHS associated with a kernel k(x,y),
• L[y, f(x)] a loss function.
then, if
f* = argmin_f [ Σ_{i=1}^n L[y_i, f(x_i)] + λ Ω(||f||²) ]
f* admits a representation of the form
f* = Σ_{i=1}^n α_i k(., x_i)     (i.e. f* ∈ H)
Proof
• decompose any f into the part contained in the span of the k(., x_i) and the part in the orthogonal complement
f(x) = f_0(x) + f_⊥(x) = Σ_{i=1}^n α_i k(x, x_i) + f_⊥(x)
where f_0 ∈ w_0 and f_⊥ ∈ w_⊥, with
w_0 = span{ k(., x_i) }
w_⊥ = { g | <g, f> = 0, ∀ f ∈ w_0 }
• then
Ω(||f||²) = Ω( ||Σ_i α_i k(., x_i) + f_⊥||² )
          = Ω( ||Σ_i α_i k(., x_i)||² + ||f_⊥||² )
          ≥ Ω( ||Σ_i α_i k(., x_i)||² )        (Ω mon. increasing)
Proof
• this shows that the second term in
f* = argmin_f [ Σ_{i=1}^n L[y_i, f(x_i)] + λ Ω(||f||²) ]
is minimized by a function of the stated form
• the first term, using the reproducing property,
f(x_i) = <f(.), k(., x_i)> = <f_0(.), k(., x_i)> + <f_⊥(.), k(., x_i)> = <f_0(.), k(., x_i)> = Σ_{j=1}^n α_j k(x_i, x_j)
(the f_⊥ term is zero since f_⊥ is orthogonal to k(., x_i))
is always a function of the stated form
• hence, the minimum must be of this form as well
The picture
due to the reproducing property:
• f(x_i) depends only on the component of f in the span of the k(., x_i)
if f* is the solution in that span, any f' with an extra orthogonal component
• has ||f'|| > ||f*||
• is a more complex function
• which does not reduce the loss
hence, f* is the optimal solution
[figure: f* lies in the span of the k(., x_i); f' adds an orthogonal component and has larger norm]
Relevance
the remarkable consequence of the theorem is that:
• we can reduce the minimization over the (infinite dimensional) space of functions
• to a minimization on a finite dimensional space!
to see this note that, because
f* = Σ_{i=1}^n α_i k(., x_i)
we have
||f*||² = <f*, f*> = Σ_ij α_i α_j <k(., x_i), k(., x_j)> = Σ_ij α_i α_j k(x_i, x_j) = α^T K α
f*(x_i) = Σ_j α_j k(x_i, x_j) = (K α)_i
Regularization
this proves the following theorem
Theorem:
• if Ω: [0, ∞) → ℜ is a strictly monotonically increasing function,
• then for any dot-product kernel k(x,y) and loss function L[y, f(x)]
the solution of
f* = argmin_f [ Σ_{i=1}^n L[y_i, f(x_i)] + λ Ω(||f||²) ]
is
f* = Σ_{i=1}^n α*_i k(., x_i)
with
α* = argmin_α [ Σ_{i=1}^n L[y_i, (Kα)_i] + λ Ω(α^T K α) ]
K the Gram matrix [k(x_i, x_j)] and Y = (..., y_i, ...)
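As a concrete special case (my sketch, not worked out in the slides): with the squared loss L[y, f(x)] = (y − f(x))² and Ω the identity, the finite-dimensional problem becomes min_α ||Y − Kα||² + λ α^T K α, i.e. kernel ridge regression, whose solution is α* = (K + λI)^{-1} Y when K + λI is invertible.

    import numpy as np

    def k(x, y, sigma=1.0):
        # Gaussian kernel (an assumed choice)
        return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

    # assumed training data
    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 1, size=(20, 1)), axis=0)
    Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=20)

    lam = 0.1
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)    # alpha* = (K + lambda I)^{-1} Y

    def f_star(x):
        # f*(x) = sum_i alpha_i k(x, x_i)
        return sum(a * k(x, xi) for a, xi in zip(alpha, X))

    print(f_star(np.array([0.25])))   # prediction at a new point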
Note
the result that f* ∈ H holds for any norm, not just L2
however, the exact form of the problem will no longer be
α* = argmin_α [ Σ_{i=1}^n L[y_i, (Kα)_i] + λ Ω(α^T K α) ]
the argument of the second term will have a different form
Projects
at this time you should have a reasonable idea of what you are going to do
four questions
• what are you going to do?
• why should you do it?
• how are you going to do it?
• what results do you expect to get?
next Thursday:
• no class; 10-12 minute meetings to discuss the four questions
• send me email saying if you are available at 3:00
• be prepared, no time to spare
Pointers
besides the books in the syllabus
• Proceedings of Neural Information Processing Systems
• Proceedings of the International Conference on Machine Learning
• Neural Computation
• IEEE Transactions on Pattern Analysis and Machine Intelligence
• Journal of Machine Learning Research
• IEEE Conference on Computer Vision and Pattern Recognition
• International Conference on Computer Vision
• IEEE Int. Conference on Acoustics, Speech, and Signal Processing
• IEEE International Conference on Image Processing
• International Journal of Computer Vision
Also the INSPEC database and Google
Project topics (mostly undergrad)
• classification of graphics vs natural images on google
• visualizing data manifolds
• comparison of LDA-based face recognition methods