Reproducing kernel Hilbert spaces
Nuno Vasconcelos, ECE Department, UCSD
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
Perceptron: the classifier implements the linear decision rule

$$h(x) = \mathrm{sgn}[g(x)] \quad \text{with} \quad g(x) = w^T x + b$$

this is appropriate when the classes are linearly separable; to deal with non-linear separability we introduce a kernel

[figure: a separating hyperplane g(x) = 0 with normal w and offset b between the two classes]
Kernel summary
1. D not linearly separable in X: apply a feature transformation Φ: X → Z, such that dim(Z) >> dim(X)
2. computing Φ(x) is too expensive:
• write your learning algorithm in dot-product form
• instead of Φ(x_i), we only need Φ(x_i)^T Φ(x_j) ∀ i,j
3. instead of computing Φ(x_i)^T Φ(x_j) ∀ i,j, define the "dot-product kernel"

$$K(x,z) = \Phi(x)^T \Phi(z)$$

and compute K(x_i, x_j) ∀ i,j directly
• note: the matrix

$$K = \begin{bmatrix} & \vdots & \\ \cdots & K(x_i, x_j) & \cdots \\ & \vdots & \end{bmatrix}$$

is called the "kernel" or Gram matrix
4. forget about Φ(x) and use K(x,z) from the start!
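to make step 3 concrete, here is a minimal NumPy sketch (mine, not from the lecture) that builds the Gram matrix directly from a kernel function, without ever forming Φ(x):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def gram_matrix(X, kernel):
    """Gram matrix K with K[i, j] = kernel(X[i], X[j])."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.randn(5, 3)              # 5 points in R^3
K = gram_matrix(X, gaussian_kernel)
print(K.shape, np.allclose(K, K.T))    # (5, 5) True: Gram matrices are symmetric
```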
Polynomial kernels
this makes a significant difference when K(x,z) is easier to compute than Φ(x)^T Φ(z)
e.g., we have seen that

$$K(x,z) = \left(x^T z\right)^2 = \Phi(x)^T \Phi(z), \quad \text{with } \Phi: \mathbb{R}^d \to \mathbb{R}^{d^2}$$

$$x \to \left(x_1 x_1, \ldots, x_1 x_d, x_2 x_1, \ldots, x_2 x_d, \ldots, x_d x_d\right)^T$$

while K(x,z) has complexity O(d), Φ(x)^T Φ(z) is O(d²); for K(x,z) = (x^T z)^k we go from O(d) to O(d^k)
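a quick numerical check of this identity (a sketch, not from the lecture): the O(d) kernel evaluation and the O(d²) explicit feature-space dot product agree

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all d^2 products x_i * x_j."""
    return np.outer(x, x).ravel()

d = 100
x, z = np.random.randn(d), np.random.randn(d)

k_trick = np.dot(x, z) ** 2              # O(d): the kernel trick
k_explicit = np.dot(phi(x), phi(z))      # O(d^2): explicit feature space
print(np.allclose(k_trick, k_explicit))  # True
```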
Question
in practice:
• pick a kernel from a library of known kernels
• we talked about
• the linear kernel K(x,z) = x^T z
• the Gaussian family

$$K(x,z) = e^{-\frac{\|x - z\|^2}{\sigma^2}}$$

• the polynomial family

$$K(x,z) = \left(1 + x^T z\right)^k, \quad k \in \{1, 2, \ldots\}$$

what if this is not good enough?
• how do I know if a function k(x,y) is a dot-product kernel?
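the three families above, written out as a small kernel library (a sketch; the signatures and defaults are my own):

```python
import numpy as np

def linear(x, z):
    """Linear kernel K(x, z) = x^T z."""
    return np.dot(x, z)

def gaussian(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial(x, z, k=2):
    """Polynomial kernel K(x, z) = (1 + x^T z)^k."""
    return (1.0 + np.dot(x, z)) ** k
```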
Dot-product kernels
let's start with the definition
Definition: a mapping

k: X × X → ℝ
(x,y) → k(x,y)

is a dot-product kernel if and only if

k(x,y) = <Φ(x), Φ(y)>

where Φ: X → H, H is a vector space, and <.,.> is a dot product in H
note that both H and <.,.> can be abstract, not necessarily ℝ^d
[figure: the feature map Φ sends the two classes (x's and o's), not separable in X, into a space H where they become linearly separable]
Dot product vs positive definite kernels
that is pretty abstract; how do I turn it into something I can compute?
Theorem: k(x,y), x,y ∈ X, is a dot-product kernel if and only if it is a positive definite kernel
this can be checked; let's start with the definition
Definition: k(x,y) is a positive definite kernel on X × X if, ∀ l and ∀ {x_1, ..., x_l}, x_i ∈ X, the Gram matrix

$$K = [K_{ij}] = [k(x_i, x_j)]$$

is positive definite.
Positive definite matrices
recall that (e.g. Linear Algebra and Its Applications, Strang)
Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be positive definite:
i) x^T A x ≥ 0, ∀ x ≠ 0
ii) all eigenvalues of A satisfy λ_i ≥ 0
iii) all upper-left submatrices A_k have non-negative determinant
iv) there is a matrix R with independent rows such that A = R^T R
(strictly, with the ≥ signs these are the conditions for positive semidefiniteness; the kernel literature conventionally calls this "positive definite")
upper-left submatrices:

$$A_1 = [a_{1,1}], \quad A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \quad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}, \quad \ldots$$
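condition ii) gives a direct numerical test; a minimal sketch (my own, with a small tolerance for floating-point error):

```python
import numpy as np

def is_positive_definite(K, tol=1e-10):
    """Test condition ii): K symmetric with all eigenvalues >= 0
    (positive semidefinite, which is what a positive definite *kernel* requires)."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# the Gram matrix of a Gaussian kernel is always positive definite:
X = np.random.randn(20, 2)
sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)               # sigma = 1
print(is_positive_definite(K))      # True
```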
Dot product vs positive definite kernels
the equivalence between dot-product and positive definite automatically proves some simple properties
• dot-product kernels are non-negative on the diagonal

$$k(x,x) \geq 0, \quad \forall x \in X$$

• dot-product kernels are symmetric

$$k(x,y) = k(y,x), \quad \forall x,y \in X$$

• Cauchy-Schwarz inequality for dot-product kernels

$$k^2(x,y) \leq k(x,x)\, k(y,y), \quad \forall x,y \in X$$

• note that this is just

$$-1 \leq \underbrace{\frac{k(x,y)}{\sqrt{k(x,x)\, k(y,y)}}}_{\cos\theta} \leq 1$$
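a quick sanity check of the Cauchy-Schwarz property on random pairs (a sketch, reusing the Gaussian kernel from the library above):

```python
import numpy as np

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert gaussian(x, y) ** 2 <= gaussian(x, x) * gaussian(y, y) + 1e-12
print("Cauchy-Schwarz held on 1000 random pairs")
```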
Dot product vs positive definite kernels
the proof actually gives insight into what the kernel does
• we defined H as the space spanned by linear combinations of the functions k(., x_i)

$$H = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i\, k(., x_i), \;\forall m, \;\forall x_i \in X \right\}$$

• e.g. the space of all linear combinations of Gaussians, when we adopt the Gaussian kernel

$$K(., x_i) = e^{-\frac{\|\,\cdot\, - x_i\|^2}{\sigma^2}}$$

• note: this is a function of the first argument; x_i is fixed
Dot product vs positive definite kernels
• if f(.) and g(.) ∈ H, with

$$f(.) = \sum_{i=1}^{m} \alpha_i\, k(., x_i) \qquad g(.) = \sum_{j=1}^{m'} \beta_j\, k(., x_j')$$

• we defined the dot-product <.,.>_* on H as

$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i\, \beta_j\, k(x_i, x_j')$$

• for the Gaussian kernel this is

$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i\, \beta_j\, e^{-\frac{\|x_i - x_j'\|^2}{\sigma^2}}$$

• a dot product in H
• a non-linear measure of similarity in X, somewhat related to likelihoods
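since f and g are fully specified by their coefficients and centers, <f,g>_* is a finite double sum; a minimal sketch (the function names are my own):

```python
import numpy as np

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def dot_star(alpha, Xf, beta, Xg, kernel=gaussian):
    """<f, g>_* for f = sum_i alpha[i] k(., Xf[i]), g = sum_j beta[j] k(., Xg[j])."""
    return sum(a * b * kernel(xi, xj)
               for a, xi in zip(alpha, Xf)
               for b, xj in zip(beta, Xg))
```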
The dot-product <.,.>_*
it is not hard to show that

$$\langle k(., x_i), k(., x_j) \rangle_* = k(x_i, x_j)$$

this means that

$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_* \quad \text{with} \quad \Phi: X \to H, \; x \to k(., x)$$

the feature transformation associated with the kernel k(x,y) sends the points x_i to the functions k(., x_i)
furthermore, <.,.>_* is itself a dot-product kernel on H × H
The reproducing property
with this definition of H and <.,.>_*

∀ f ∈ H, <k(.,x), f(.)>_* = f(x)

this is called the reproducing property
an analogy is to think of linear time-invariant systems:
• the dot product as a convolution
• k(.,x) as the Dirac delta
• f(.) as a system input
• the equation above is the basis of all linear time-invariant systems theory
we will see that it also plays a fundamental role in the theory (and use) of reproducing kernel Hilbert spaces
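the property can be exercised numerically: k(.,x) is itself an element of H (one center, coefficient 1), so <k(.,x), f>_* reduces to the double sum above. A sketch, reusing the hypothetical dot_star helper:

```python
import numpy as np

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def dot_star(alpha, Xf, beta, Xg, kernel=gaussian):
    return sum(a * b * kernel(xi, xj)
               for a, xi in zip(alpha, Xf)
               for b, xj in zip(beta, Xg))

rng = np.random.default_rng(1)
centers = rng.normal(size=(4, 2))   # the x_i defining f = sum_i alpha_i k(., x_i)
alpha = rng.normal(size=4)
x = rng.normal(size=2)              # evaluation point

f_x = sum(a * gaussian(x, xi) for a, xi in zip(alpha, centers))  # pointwise f(x)
repro = dot_star([1.0], [x], alpha, centers)                     # <k(., x), f>_*
print(np.allclose(f_x, repro))      # True: the dot product reproduces f(x)
```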
The big picture
when we use the Gaussian kernel

$$K(x, x_i) = e^{-\frac{\|x - x_i\|^2}{\sigma^2}}$$

• the point x_i ∈ X is mapped into the Gaussian G(x, x_i, σI)
• H is the space of all functions that are linear combinations of Gaussians
• the kernel is a dot product in H, and a non-linear similarity on X
• reproducing property on H: analogy to linear systems
[figure: Φ maps the sample points x_1, ..., x_n in X to Gaussian bumps centered at those points in H]
Hilbert spaces
Hilbert spaces play an important role in functional analysis; we will give a very informal review
H is a vector space with a dot product:
• this is known as a "dot-product space" or a "pre-Hilbert" space
the difference to a Hilbert space is mostly technical
Definition: a Hilbert space is a complete dot-product space
what do we mean by completeness?
Definition: S is complete if all Cauchy sequences in S converge
this is useful mostly to prove convergence results
Cauchy sequences
just for completeness (no pun intended)
Definition: a sequence {x_1, x_2, ...} in a normed space is a Cauchy sequence if

$$\forall \varepsilon > 0, \; \exists N \in \mathbb{N} \;\text{s.t.}\; \forall n', n'' \geq N, \; \|x_{n'} - x_{n''}\| \leq \varepsilon$$

you can picture this as the tail of the sequence eventually staying inside a tube of width ε
why might this not converge? well, the limit point could be outside the space
why is this important? complete ⇒ convergent = Cauchy
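a classic example of a Cauchy sequence whose limit escapes the space (my own illustration, not from the lecture): Newton's iteration for √2 stays inside the rationals ℚ, but its limit is irrational, so ℚ is not complete; ℝ is its completion

```python
from fractions import Fraction

x = Fraction(1)
for _ in range(6):
    x = (x + 2 / x) / 2     # x_{n+1} = (x_n + 2/x_n)/2, still an exact rational
print(float(x))             # 1.4142135623730951
print(float(x * x - 2))     # tiny residual: the sequence is Cauchy in Q,
                            # but its limit sqrt(2) is not in Q
```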
Reproducing kernel Hilbert spaces
to turn our pre-Hilbert space H into a Hilbert space we have to "complete" it with respect to the norm

$$\|f\| = \sqrt{\langle f, f \rangle_*}$$

this is just adding to H the limit points of all its Cauchy sequences
we keep denoting the completion of H by H
more than a Hilbert space, H becomes a reproducing kernel Hilbert space (RKHS)
Reproducing kernel Hilbert spaces
Definition: let H be a Hilbert space of functions f: X → ℝ. H is a RKHS endowed with the dot-product <.,.>_* if ∃ k: X × X → ℝ such that
1. k spans H, i.e. ∃ {x_i}, {α_i}, such that

$$H = \mathrm{span}\{k(., x_i)\} = \left\{ f(.) \;\middle|\; f(.) = \sum_i \alpha_i\, k(., x_i) \right\}$$

2. <f(.), k(.,x)>_* = f(x), ∀ f ∈ H
in summary:
• the H we built can easily be transformed into a RKHS by adding to it the limit points (functions) of all its Cauchy sequences
Question: is there a one-to-one mapping between kernels and RKHSs?
Kernels vs RKHSs
answer: the RKHS does indeed specify the kernel uniquely
proof:
• assume that kernels k_1 and k_2 both span H
• using the reproducing property
  • k_1(x',x) = <k_1(.,x), k_2(.,x')>_*
  • k_2(x,x') = <k_2(.,x'), k_1(.,x)>_*
• hence, from the symmetry of the dot product <.,.>_*, we must have
  k_1(x',x) = k_2(x,x')
  and from the symmetry of the kernel it follows that
  k_1(x,x') = k_2(x,x')
• since this holds for all x and x', the kernels are the same
Kernels vs RKHSs
however, it is not clear that a kernel uniquely specifies the RKHS
• there might be multiple feature transforms and dot-products that are consistent with a kernel
to study this we need to introduce Mercer kernels
Definition: a symmetric mapping k: X × X → ℝ such that

$$\int\!\!\int k(x,y)\, f(x)\, f(y)\, dx\, dy \geq 0, \quad \forall f \;\text{s.t.}\; \int f^2(x)\, dx < \infty \qquad (*)$$

is a Mercer kernel
Mercer kernels
why do we care about them? two reasons
1) Theorem: a kernel is positive definite (and dot-product) if and only if it is a Mercer kernel
proof that Mercer implies PD:
• consider any sequence {x_1, ..., x_n}, x_i ∈ X, and let

$$f(x) = \sum_{i=1}^{n} w_i\, \delta(x - x_i), \quad \text{s.t. } \sum_{i=1}^{n} w_i^2 < \infty$$

(an informal argument: δ is not square-integrable, but can be approximated by narrow square-integrable bumps)
• since $\int f^2(x)\, dx < \infty$, if k(x,y) is Mercer it follows from (*) that

$$0 \leq \int\!\!\int k(x,y)\, f(x)\, f(y)\, dx\, dy = \int\!\!\int k(x,y) \sum_{ij} w_i\, w_j\, \delta(x - x_i)\, \delta(y - x_j)\, dx\, dy$$
Mercer kernels

$$0 \leq \sum_{ij} w_i\, w_j \int\!\!\int k(x,y)\, \delta(x - x_i)\, \delta(y - x_j)\, dx\, dy = \sum_{ij} w_i\, w_j\, k(x_i, x_j) = w^T K w$$

• since this holds for any w, k(x,y) is PD
Mercer kernels
proof that PD implies Mercer:
• suppose there is a g(x) s.t.

$$\int\!\!\int k(x,y)\, g(x)\, g(y)\, dx\, dy = \varepsilon < 0$$

• consider a partition of X × X fine enough that

$$\left| \sum_{ij} k(x_i, x_j)\, g(x_i)\, g(x_j)\, \Delta_x \Delta_y - \int\!\!\int k(x,y)\, g(x)\, g(y)\, dx\, dy \right| \leq \frac{|\varepsilon|}{2}$$

• hence

$$\sum_{ij} k(x_i, x_j)\, g(x_i)\, g(x_j)\, \Delta_x \Delta_y < 0 \iff u^T K u < 0$$

with u = (g(x_1), ..., g(x_n))^T
• K is not PD and k(x,y) is not a PD kernel
• the Theorem follows by contradiction
Mercer kernels
2) they have the following property (stated without proof)
Theorem: let k: X × X → ℝ be a Mercer kernel. Then there exists an orthonormal set of functions

$$\int \phi_i(x)\, \phi_j(x)\, dx = \delta_{ij}$$

and a set of λ_i ≥ 0, such that

$$\text{i)} \;\; \int\!\!\int k^2(x,y)\, dx\, dy = \sum_{i=1}^{\infty} \lambda_i^2 < \infty$$

$$\text{ii)} \;\; k(x,y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y) \qquad (**)$$

intuition: think of this as a 2D Fourier series (+ Parseval)
Eigenfunctions
note: if we define the operator

$$(Tf)(x) = \int k(x,y)\, f(y)\, dy$$

then

$$(T\phi_i)(x) = \int k(x,y)\, \phi_i(y)\, dy = \sum_k \lambda_k\, \phi_k(x) \underbrace{\int \phi_k(y)\, \phi_i(y)\, dy}_{\delta_{ik}} = \lambda_i\, \phi_i(x)$$

the functions φ_i(x) are the eigenfunctions of T
again, a connection to linear systems (LTI if k(x,y) = h(x-y))
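in practice the eigenfunctions of T can be approximated from a finite sample: the eigenvectors of the Gram matrix, suitably scaled, play the role of the φ_i evaluated at the sample points (the Nyström method). A sketch of the idea, which is not in the lecture; the scaling convention is one common choice:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))       # sample points from X

sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq)                             # Gaussian Gram matrix, sigma = 1

n = X.shape[0]
evals, evecs = np.linalg.eigh(K / n)        # eigh: K is symmetric
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort eigenvalues descending

lambdas = evals                             # approximate lambda_i
phis = np.sqrt(n) * evecs                   # column i ~ phi_i evaluated at the samples
# empirical orthonormality: (1/n) sum_j phi_i(x_j) phi_k(x_j) ~ delta_ik
print(np.allclose(phis.T @ phis / n, np.eye(n)))   # True
```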
Mercer kernels
the eigenfunction decomposition gives us another way to design the feature transformation

$$\Phi: X \to l_2^d, \quad x \to \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T$$

where l_2^d is the space of vectors s.t. $\sum_i a_i^2 < \infty$, and d is the number of non-zero eigenvalues λ_i
clearly

$$k(x,y) = \Phi(x)^T\, \Phi(y)$$

i.e. there is a vector space (l_2^d) other than H s.t. k(x,y) is a dot product in that space
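continuing the Nyström sketch above: truncating to the top d eigenvalues gives an explicit finite-dimensional feature map whose ordinary dot products approximately rebuild the Gram matrix (same assumptions as before):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq)

n = X.shape[0]
evals, evecs = np.linalg.eigh(K / n)
evals, evecs = evals[::-1], evecs[:, ::-1]

d = 15                                       # keep the top d eigenvalues
top = np.maximum(evals[:d], 0.0)             # guard against tiny negative round-off
Phi = np.sqrt(n * top) * evecs[:, :d]        # Phi[j, i] ~ sqrt(lambda_i) phi_i(x_j)
print(np.max(np.abs(Phi @ Phi.T - K)))       # small: k(x,y) ~ Phi(x)^T Phi(y)
```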
The picture
this is a mapping from ℝ^n to ℝ^d
much more like a multi-layer Perceptron than before: the kernelized Perceptron can be drawn as a neural net whose hidden units compute the features Φ_1(.), ..., Φ_d(.)
[figure: Φ maps X to l_2^d, x_i → Φ(x_i); the kernelized Perceptron drawn as a network with input x, hidden units Φ_1(.), Φ_2(.), ..., Φ_d(.), and a linear output]
In summary
Reproducing kernel map:

$$\Phi: x \to k(., x)$$

$$H_K = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i\, k(., x_i) \right\}, \qquad \langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i\, \beta_j\, k(x_i, x_j')$$

Mercer kernel map:

$$\Phi: x \to \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T$$

$$H_M = l_2^d = \left\{ x \;\middle|\; \sum_{i=1}^{d} x_i^2 < \infty \right\}, \qquad \langle f, g \rangle_* = f^T g$$

where λ_i, φ_i are the eigenvalues and eigenfunctions of k(x,y)
two very different pictures of what the kernel does
are the two spaces really that different?
RK vs Mercer maps
note that for H_M we are writing

$$\Phi_M(x) = \sqrt{\lambda_1}\,\phi_1(x)\,\vec{e}_1 + \ldots + \sqrt{\lambda_d}\,\phi_d(x)\,\vec{e}_d$$

but, since the φ_i(.) are orthonormal, there is a 1-1 map

$$\Gamma: l_2^d \to \mathrm{span}\{\phi_k(.)\}, \quad \vec{e}_k \to \sqrt{\lambda_k}\,\phi_k(.)$$

and we can write

$$(\Gamma \circ \Phi_M)(x) = \lambda_1\,\phi_1(x)\,\phi_1(.) + \ldots + \lambda_d\,\phi_d(x)\,\phi_d(.) = k(., x) \qquad (***)$$

where the last equality follows from (**)
hence k(.,x) maps x into M = span{φ_k(.)}
RK vs Mercer maps
define the dot product in M so that

$$\langle \phi_j, \phi_k \rangle_{**} = \frac{1}{\lambda_j} \int \phi_j(x)\, \phi_k(x)\, dx = \frac{1}{\lambda_j}\, \delta_{jk}$$

then {φ_k(.)} is a basis, M is a vector space, and any function in M can be written as

$$f(x) = \sum_k \alpha_k\, \phi_k(x)$$

and

$$\langle f(.), k(., x) \rangle_{**} = \left\langle \sum_i \alpha_i\, \phi_i(.), \sum_j \lambda_j\, \phi_j(x)\, \phi_j(.) \right\rangle_{**} = \sum_{ij} \alpha_i\, \lambda_j\, \phi_j(x) \underbrace{\langle \phi_i(.), \phi_j(.) \rangle_{**}}_{\frac{1}{\lambda_j}\delta_{ij}} = \sum_i \alpha_i\, \phi_i(x) = f(x)$$

i.e., k is a reproducing kernel on M
RK vs Mercer maps
furthermore, since k(.,x) ∈ M, any functions of the form

$$f(.) = \sum_i \alpha_i\, k(., x_i) \qquad g(.) = \sum_j \beta_j\, k(., x_j)$$

are in M, and

$$\begin{aligned}
\langle f, g \rangle_{**} &= \sum_{ij} \alpha_i\, \beta_j\, \langle k(., x_i), k(., x_j) \rangle_{**} \\
&= \sum_{ij} \alpha_i\, \beta_j \left\langle \sum_l \lambda_l\, \phi_l(x_i)\, \phi_l(.), \sum_m \lambda_m\, \phi_m(x_j)\, \phi_m(.) \right\rangle_{**} \\
&= \sum_{ij} \alpha_i\, \beta_j \sum_{lm} \lambda_l\, \lambda_m\, \phi_l(x_i)\, \phi_m(x_j)\, \langle \phi_l(.), \phi_m(.) \rangle_{**} \\
&= \sum_{ij} \alpha_i\, \beta_j \sum_l \lambda_l\, \phi_l(x_i)\, \phi_l(x_j) = \sum_{ij} \alpha_i\, \beta_j\, k(x_i, x_j)
\end{aligned}$$

note that f, g ∈ H and this is the dot product we had in H
In summary
H ⊂ M, and <.,.>_* in H is the same as <.,.>_** in M
Question: is M ⊂ H?
• we need to show that any f(x) = Σ_k α_k φ_k(x) ∈ M can be written as a combination of the k(., x_k), i.e. belongs to H
• from (***),

$$k(., x) = \lambda_1\, \phi_1(x)\, \phi_1(.) + \ldots + \lambda_d\, \phi_d(x)\, \phi_d(.)$$

and, for any sequence {x_1, ..., x_d},

$$\begin{bmatrix} k(., x_1) \\ \vdots \\ k(., x_d) \end{bmatrix} = \underbrace{\begin{bmatrix} \lambda_1\, \phi_1(x_1) & \cdots & \lambda_d\, \phi_d(x_1) \\ \vdots & & \vdots \\ \lambda_1\, \phi_1(x_d) & \cdots & \lambda_d\, \phi_d(x_d) \end{bmatrix}}_{P} \begin{bmatrix} \phi_1(.) \\ \vdots \\ \phi_d(.) \end{bmatrix}$$

• if there is an invertible P, then each φ_k(.) is a linear combination of the k(., x_i), hence φ_k ∈ H and M ⊂ H
In summary
since λ_i > 0,

$$P = \underbrace{\begin{bmatrix} \phi_1(x_1) & \cdots & \phi_d(x_1) \\ \vdots & & \vdots \\ \phi_1(x_d) & \cdots & \phi_d(x_d) \end{bmatrix}}_{\Pi} \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_d \end{bmatrix}$$

is invertible when Π is. If Π is not invertible, then

$$\exists\, \alpha \neq 0 \;\text{s.t.}\; \sum_k \alpha_k\, \phi_k(x_i) = 0, \;\forall i$$

if there is no sequence {x_1, ..., x_d} for which Π is invertible, then

$$\exists\, \alpha \neq 0 \;\text{s.t.}\; \sum_k \alpha_k\, \phi_k(x) = 0, \;\forall x$$

and the φ_k(x) cannot be orthonormal (they would be linearly dependent). Hence there must be an invertible P, and M ⊂ H.
In summary
H = M, and <.,.>_* in H is the same as <.,.>_** in M
the reproducing kernel map and the Mercer kernel map lead to the same RKHS

Reproducing kernel map:

$$\Phi_r: x \to k(., x)$$

$$H_K = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i\, k(., x_i) \right\}, \qquad \langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i\, \beta_j\, k(x_i, x_j')$$

Mercer kernel map:

$$\Phi_M: x \to \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T$$

$$H_M = l_2^d = \left\{ x \;\middle|\; \sum_{i=1}^{d} x_i^2 < \infty \right\}, \qquad \langle f, g \rangle_* = f^T g$$

and the two maps are related by

$$\Gamma \circ \Phi_M = \Phi_r, \qquad \Gamma: l_2^d \to \mathrm{span}\{\phi_k(.)\}, \quad \vec{e}_k \to \sqrt{\lambda_k}\,\phi_k(.)$$