EE269 Signal Processing for Machine Learning
Lecture 13
Instructor: Mert Pilanci
Stanford University
February 27, 2019
Gaussian regression models

- $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
- training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

$$p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$$

- $P(w)$: prior probability of $w$
- infinitely many models parametrized by $w$
- MAP: maximize $P(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w, b)\, P(w)$ over $w$
- independent observations: $= \prod_{i=1}^{n} P(y_i \mid x_i, w, b)\, P(w)$
- Maximum a Posteriori (MAP) estimate $w_{\mathrm{MAP}}$:

$$w_{\mathrm{MAP}} = \arg\max_w \prod_{i=1}^{n} P(y_i \mid x_i, w, b)\, P(w) = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i, w, b) + \log P(w)$$
Gaussian regression models

- $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
- training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

$$p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$$

$$w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i, w, b) + \log P(w)$$

- Gaussian prior on $w$: $P(w) = \mathcal{N}(0, t^2 I)$

$$w_{\mathrm{MAP}} = \arg\max_w \; -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^T x_i - b)^2 - \frac{1}{2t^2} \|w\|_2^2$$
Gaussian regression models

- $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
- training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

$$p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$$

$$w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i, w, b) + \log P(w)$$

- Gaussian prior on $w$: $P(w) = \mathcal{N}(0, t^2 I)$

$$w_{\mathrm{MAP}} = \arg\min_w \sum_{i=1}^{n} (y_i - w^T x_i - b)^2 + \frac{\sigma^2}{t^2} \|w\|_2^2$$

- $\ell_2$ regularization (Ridge regression); see the sketch below
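The closed form of this MAP/ridge estimate is $w = (X^T X + \lambda I)^{-1} X^T y$ with $\lambda = \sigma^2/t^2$. The snippet below is a minimal NumPy sketch of that formula, assuming centered data (the bias $b$ is omitted) and purely illustrative values for $n$, $d$, $\sigma$, and $t$.

```python
import numpy as np

# Toy data: n samples in d dimensions (illustrative values)
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
sigma, t = 0.5, 1.0                      # noise std and prior std (assumed)
y = X @ w_true + sigma * rng.standard_normal(n)

# MAP / ridge estimate with lambda = sigma^2 / t^2 (bias omitted, data centered)
lam = sigma**2 / t**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_map)
```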
Gaussian regression models

- Laplace prior: $P(w) \propto e^{-|w_1|/t} \cdots e^{-|w_d|/t}$

$$w_{\mathrm{MAP}} = \arg\min_w \sum_{i=1}^{n} (y_i - w^T x_i - b)^2 + \frac{\sigma^2}{t} \|w\|_1$$

- $\ell_1$ regularization (Lasso)
$\ell_2$ regularization (Ridge regression):

$$\min_w \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2 \qquad (1)$$

$\ell_1$ regularization (Lasso):

$$\min_w \; \|Xw - y\|_2^2 + \lambda \|w\|_1 \qquad (2)$$

Exponential density $e^{-|w|/t}$ vs Gaussian density $e^{-w^2/t^2}$
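To make the two penalties concrete, here is a minimal sketch, not from the slides, that solves (1) in closed form and (2) with plain ISTA (a gradient step on the quadratic term followed by soft-thresholding); the toy data, regularization weight, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 80, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d); w_true[:3] = [2.0, -1.5, 1.0]      # sparse ground truth
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam = 1.0

# (1) Ridge: closed form
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# (2) Lasso: ISTA, i.e. gradient step on ||Xw - y||^2 followed by soft-thresholding
eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)               # step size 1/L, L = 2*sigma_max(X)^2
w_lasso = np.zeros(d)
for _ in range(2000):
    grad = 2 * X.T @ (X @ w_lasso - y)
    z = w_lasso - eta * grad
    w_lasso = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

print("nonzeros (ridge):", np.sum(np.abs(w_ridge) > 1e-3))
print("nonzeros (lasso):", np.sum(np.abs(w_lasso) > 1e-3))
```

The lasso solution is typically sparse, while the ridge solution shrinks all coordinates but keeps them nonzero.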
Least Squares Regression and Duality

- in matrix-vector notation (redefine $\lambda \leftarrow n\lambda$):

$$\min_w \; \tfrac{1}{2}\|Xw - y\|_2^2 + \tfrac{\lambda}{2}\|w\|_2^2$$

- equivalent constrained problem:

$$\min_{z, w \,:\, Xw = z} \; \tfrac{1}{2}\|z - y\|_2^2 + \tfrac{\lambda}{2}\|w\|_2^2$$

- Dual problem:

$$\max_{\alpha} \; -\alpha^T \Big(\tfrac{1}{2\lambda} X X^T + \tfrac{1}{2} I\Big)\alpha + \alpha^T y$$

- KKT conditions imply $w^* = \frac{1}{\lambda} X^T \alpha^*$
- We can solve the dual in closed form: $\alpha^* = \big(\tfrac{1}{\lambda} X X^T + I\big)^{-1} y$ (verified numerically below)
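A quick numerical check, purely illustrative, that the dual closed form recovers the primal ridge solution $w = (X^T X + \lambda I)^{-1} X^T y$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 8
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.7

# Primal solution of min (1/2)||Xw - y||^2 + (lam/2)||w||^2
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: alpha* = (XX^T/lam + I)^{-1} y, then w* = X^T alpha* / lam
alpha = np.linalg.solve(X @ X.T / lam + np.eye(n), y)
w_dual = X.T @ alpha / lam

print(np.allclose(w_primal, w_dual))   # True
```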
Dual Least Squares Regression Problem

- Dual problem:

$$\max_{\alpha} \; -\alpha^T \Big(\tfrac{1}{2\lambda} X X^T + \tfrac{1}{2} I\Big)\alpha + \alpha^T y$$

- KKT conditions imply $w^* = \frac{1}{\lambda} X^T \alpha^*$
- We can solve the dual in closed form: $\alpha^* = \big(\tfrac{1}{\lambda} X X^T + I\big)^{-1} y$
- Given a test sample $x$, the prediction is
  $w^{*T} x = \frac{1}{\lambda}\Big(\sum_{i=1}^{n} x_i \alpha_i^*\Big)^T x = \frac{1}{\lambda} \sum_{i=1}^{n} \langle x_i, x\rangle\, \alpha_i^*$
- Kernel map $x \to \phi(x)$ and kernel matrix $K_{ij} = \kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle$
- Dual solution $\alpha^* = \big(\tfrac{1}{\lambda} K + I\big)^{-1} y$
- Prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^{n} \kappa(x_i, x)\, \alpha_i^*$ (see the sketch below)
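Putting the last three bullets together, a minimal kernel ridge regression sketch with a Gaussian kernel; the helper name, the bandwidth, and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (40, 1))                          # training inputs
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)
lam = 0.1

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K / lam + np.eye(len(y)), y)     # dual solution

X_test = np.linspace(-1, 1, 200)[:, None]
f_hat = gaussian_kernel(X_test, X) @ alpha / lam         # prediction (1/lam) * sum_i k(x_i, x) alpha_i
```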
Kernel Regression Application

- polynomial kernel $\kappa(x, y) = (1 + x^T y)^4$
- prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^{n} \kappa(x_i, x)\, \alpha_i^*$
Kernel Regression Application

- Gaussian kernel $\kappa(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$
- prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^{n} \kappa(x_i, x)\, \alpha_i^*$
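The same dual fit works with the polynomial kernel $(1 + x^T y)^4$ shown above; only the kernel function changes. A self-contained sketch with illustrative data:

```python
import numpy as np

def poly_kernel(A, B, degree=4):
    """Polynomial kernel k(a, b) = (1 + a^T b)^degree."""
    return (1.0 + A @ B.T) ** degree

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (40, 1))
y = X[:, 0] ** 3 - X[:, 0] + 0.05 * rng.standard_normal(40)
lam = 0.1

alpha = np.linalg.solve(poly_kernel(X, X) / lam + np.eye(len(y)), y)
X_test = np.linspace(-1, 1, 100)[:, None]
f_hat = poly_kernel(X_test, X) @ alpha / lam             # (1/lam) * sum_i k(x_i, x) alpha_i
```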
Reproducing Kernel Hilbert Space

- Mercer's Theorem: any positive definite kernel function can be represented in terms of eigenfunctions:

$$\kappa(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$$

- The functions $\phi_i(x)$ form an orthonormal basis for a function space

$$\mathcal{H}_k = \Big\{ f : f(x) = \sum_{i=1}^{\infty} f_i \phi_i(x), \;\; \sum_{i=1}^{\infty} \frac{f_i^2}{\lambda_i} < \infty \Big\}$$
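The Mercer eigenvalues can be approximated numerically by eigendecomposing the kernel matrix on a dense grid (a Nyström-style approximation with respect to the uniform measure on the grid). This is an illustrative sketch, not part of the slides, showing the rapid eigenvalue decay of the Gaussian kernel.

```python
import numpy as np

# Grid on [-1, 1]; eigenvalues of K/n approximate the Mercer eigenvalues lambda_i
# (with respect to the uniform measure on the grid)
n = 400
x = np.linspace(-1, 1, n)
sigma = 0.5
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma**2))

evals = np.linalg.eigvalsh(K / n)[::-1]   # sort descending
print(evals[:8])                          # rapid decay: only a few eigenfunctions matter
```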
Representer Theorem in Reproducing Kernel Hilbert Space

$$(*) \qquad \min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$

- Representer theorem: the optimal solution must have the form $f^*(x) = \sum_{i=1}^{n} \alpha_i \kappa(x, x_i)$
- Plugging in and applying the reproducing property $\langle \kappa(x_i, \cdot), \kappa(x_j, \cdot)\rangle_{\mathcal{H}} = \kappa(x_i, x_j)$, we get

$$(*) = \min_{\alpha} \; \|K\alpha - y\|_2^2 + \lambda\, \alpha^T K \alpha$$

- solution $\alpha^* = (K + \lambda I)^{-1} y$
- prediction $\hat{f}(x) = \sum_{i=1}^{n} \alpha_i^* \kappa(x, x_i)$
- the same prediction rule is obtained with dual ridge regression (checked numerically below)
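A quick numerical check, illustrative only, that the representer-theorem solution $(K + \lambda I)^{-1} y$ and the dual ridge rule $\frac{1}{\lambda}\big(\frac{1}{\lambda}K + I\big)^{-1} y$ coincide, confirming the last bullet:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((25, 2))
y = rng.standard_normal(25)
lam = 0.3
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, -1) / 2)   # Gaussian kernel, sigma = 1

alpha_rep = np.linalg.solve(K + lam * np.eye(25), y)          # representer theorem
alpha_dual = np.linalg.solve(K / lam + np.eye(25), y) / lam   # dual ridge regression
print(np.allclose(alpha_rep, alpha_dual))                     # True: same prediction rule
```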
Example: Gaussian Kernel

- Gaussian kernel $\kappa(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$

$$\kappa(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$$

$$\phi_i(x) \propto e^{-(c - a)x^2} H_i(x\sqrt{2c}), \qquad \lambda_i = b^i$$

where $a, b, c$ are functions of $\sigma$ and $H_i$ is the $i$th order Hermite polynomial.
Example: Gaussian Kernel

- Gaussian kernel $\kappa(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$

$$f(x) = \sum_{i=1}^{n} \alpha_i \kappa(x_i, x) = \sum_{i=1}^{n} \sum_{j=1}^{\infty} \lambda_j \phi_j(x_i) \phi_j(x) = \sum_{j=1}^{\infty} f_j \sqrt{\lambda_j}\, \phi_j(x)$$

where $f_j = \sum_{i=1}^{n} \alpha_i \sqrt{\lambda_j}\, \phi_j(x_i)$

$$\min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$

- For a function $h(x) = \sum_{i=1}^{\infty} h_i \phi_i(x)$, the norm $\|h\|_{\mathcal{H}_k}^2 = \langle h, h\rangle_{\mathcal{H}_k} = \sum_{i=1}^{\infty} \frac{h_i^2}{\lambda_i}$ enforces smoothness by penalizing rough eigenfunctions (small $\lambda_i$).
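For a kernel expansion $f = \sum_i \alpha_i \kappa(\cdot, x_i)$ the RKHS norm can be computed directly as $\|f\|_{\mathcal{H}_k}^2 = \alpha^T K \alpha$. The sketch below (illustrative data and bandwidth) shows that increasing $\lambda$ in the regularized fit drives this norm down, i.e., produces smoother solutions.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(8 * x) + 0.2 * rng.standard_normal(30)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1**2))   # Gaussian kernel, sigma = 0.1

for lam in [1e-3, 1e-1, 10.0]:
    alpha = np.linalg.solve(K + lam * np.eye(30), y)
    rkhs_norm_sq = alpha @ K @ alpha       # ||f||_H^2 for f = sum_i alpha_i k(., x_i)
    print(f"lambda={lam:g}  ||f||_H^2 = {rkhs_norm_sq:.3f}")
```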
Example: Sobolev Kernel (one dimensional signals)

$$\mathcal{H}_k = \Big\{ f : [0, 1] \to \mathbb{R} \;\Big|\; f(0) = 0, \; f \text{ absolutely continuous}, \; \int |f'(t)|^2\, dt < \infty \Big\}$$

- absolutely continuous $\Leftrightarrow$ $f'(t)$ exists almost everywhere
- $\mathcal{H}_k$ is a Reproducing Kernel Hilbert Space with kernel $\kappa(x, y) = \min(x, y)$ (see the sketch below)
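A minimal sketch, with illustrative data, of kernel ridge regression using $\kappa(x, y) = \min(x, y)$ on $[0, 1]$; since each $\min(\cdot, x_i)$ is piecewise linear, the fit is piecewise linear with knots at the training points and satisfies $f(0) = 0$.

```python
import numpy as np

rng = np.random.default_rng(7)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sqrt(x_train) + 0.05 * rng.standard_normal(20)    # target with f(0) = 0
lam = 0.1

K = np.minimum(x_train[:, None], x_train[None, :])              # Sobolev kernel k(x, y) = min(x, y)
alpha = np.linalg.solve(K + lam * np.eye(20), y_train)

x_test = np.linspace(0, 1, 200)
f_hat = np.minimum(x_test[:, None], x_train[None, :]) @ alpha   # piecewise-linear fit
```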
Sobolev Kernel vs Polynomial Kernel
Example: Sinc Kernel (one dimensional signals)

- Paley-Wiener space
- $\kappa(x, y) \triangleq \mathrm{sinc}(\alpha(x - y)) = \dfrac{\sin(\alpha(x - y))}{\alpha(x - y)}$
- $f(t)$: bandlimited functions
- related to the Shannon-Whittaker interpolation formula with uniform samples $x[n] = x(nT)$
- bandlimited interpolation:

$$x(t) = \sum_n x[n]\, \mathrm{sinc}\Big(\frac{t - nT}{T}\Big)$$
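A minimal sketch of the bandlimited interpolation formula above, with an illustrative sampling period and signal; note that np.sinc uses the normalized convention $\mathrm{sinc}(u) = \sin(\pi u)/(\pi u)$, the one that appears in the Shannon-Whittaker formula.

```python
import numpy as np

T = 0.1                                    # sampling period (assumed)
n = np.arange(0, 40)
x_n = np.sin(2 * np.pi * 1.5 * n * T)      # samples of a 1.5 Hz sinusoid, well below 1/(2T) = 5 Hz

t = np.linspace(0, 3.9, 1000)              # dense time grid for reconstruction
# x(t) = sum_n x[n] * sinc((t - nT)/T), with sinc(u) = sin(pi u)/(pi u)
x_t = np.sum(x_n[None, :] * np.sinc((t[:, None] - n[None, :] * T) / T), axis=1)
```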