EE269 Signal Processing for Machine Learning
Lecture 13
Instructor: Mert Pilanci
Stanford University
February 27, 2019
Gaussian regression models

- $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
- training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

$$p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$$

- $P(w)$: prior probability of $w$
- infinitely many models parametrized by $w$
- MAP: maximize $P(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w, b)\, P(w)$ over $w$
- independent observations: $= \prod_{i=1}^{n} P(y_i \mid x_i, w, b)\, P(w)$
- Maximum a Posteriori (MAP) estimate $w_{\mathrm{MAP}}$:

$$w_{\mathrm{MAP}} = \arg\max_w \prod_{i=1}^{n} P(y_i \mid x_i, w, b)\, P(w) = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i, w, b) + \log P(w)$$
Gaussian regression models

- $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
- training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

$$p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$$

$$w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i, w, b) + \log P(w)$$

- Gaussian prior on $w$: $P(w) = \mathcal{N}(0, t^2 I)$

$$w_{\mathrm{MAP}} = \arg\max_w \; -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^T x_i - b)^2 - \frac{1}{2t^2} \|w\|_2^2$$
Gaussian regression models

- $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
- training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

$$p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$$

$$w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i, w, b) + \log P(w)$$

- Gaussian prior on $w$: $P(w) = \mathcal{N}(0, t^2 I)$

$$w_{\mathrm{MAP}} = \arg\min_w \sum_{i=1}^{n} (y_i - w^T x_i - b)^2 + \frac{\sigma^2}{t^2} \|w\|_2^2$$

- $\ell_2$ regularization (Ridge regression); see the sketch below
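The closed form of this MAP/ridge estimate is $w = (X^T X + \lambda I)^{-1} X^T y$ with $\lambda = \sigma^2/t^2$. The snippet below is a minimal NumPy sketch of that formula, assuming centered data (the bias $b$ is omitted) and purely illustrative values for $n$, $d$, $\sigma$, and $t$.

```python
import numpy as np

# Toy data: n samples in d dimensions (illustrative values)
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
sigma, t = 0.5, 1.0                      # noise std and prior std (assumed)
y = X @ w_true + sigma * rng.standard_normal(n)

# MAP / ridge estimate with lambda = sigma^2 / t^2 (bias omitted, data centered)
lam = sigma**2 / t**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_map)
```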
Gaussian regression models

- Laplace prior: $P(w) \propto e^{-|w_1|/t} \cdots e^{-|w_d|/t}$

$$w_{\mathrm{MAP}} = \arg\min_w \sum_{i=1}^{n} (y_i - w^T x_i - b)^2 + \frac{\sigma^2}{t} \|w\|_1$$

- $\ell_1$ regularization (Lasso)
$\ell_2$ regularization (Ridge regression):

$$\min_w \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2 \qquad (1)$$

$\ell_1$ regularization (Lasso):

$$\min_w \; \|Xw - y\|_2^2 + \lambda \|w\|_1 \qquad (2)$$

Exponential density $e^{-|w|/t}$ vs Gaussian density $e^{-w^2/t^2}$
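To make the two penalties concrete, here is a minimal sketch, not from the slides, that solves (1) in closed form and (2) with plain ISTA (a gradient step on the quadratic term followed by soft-thresholding); the toy data, regularization weight, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 80, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d); w_true[:3] = [2.0, -1.5, 1.0]      # sparse ground truth
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam = 1.0

# (1) Ridge: closed form
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# (2) Lasso: ISTA, i.e. gradient step on ||Xw - y||^2 followed by soft-thresholding
eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)               # step size 1/L, L = 2*sigma_max(X)^2
w_lasso = np.zeros(d)
for _ in range(2000):
    grad = 2 * X.T @ (X @ w_lasso - y)
    z = w_lasso - eta * grad
    w_lasso = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

print("nonzeros (ridge):", np.sum(np.abs(w_ridge) > 1e-3))
print("nonzeros (lasso):", np.sum(np.abs(w_lasso) > 1e-3))
```

The lasso solution is typically sparse, while the ridge solution shrinks all coordinates but keeps them nonzero.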
Least Squares Regression and Duality

- in matrix-vector notation (redefine $\lambda \leftarrow n\lambda$):

$$\min_w \; \tfrac{1}{2}\|Xw - y\|_2^2 + \tfrac{\lambda}{2}\|w\|_2^2$$

- equivalent constrained problem:

$$\min_{z, w \,:\, Xw = z} \; \tfrac{1}{2}\|z - y\|_2^2 + \tfrac{\lambda}{2}\|w\|_2^2$$

- Dual problem:

$$\max_{\alpha} \; -\alpha^T \Big(\tfrac{1}{2\lambda} X X^T + \tfrac{1}{2} I\Big)\alpha + \alpha^T y$$

- KKT conditions imply $w^* = \frac{1}{\lambda} X^T \alpha^*$
- We can solve the dual in closed form: $\alpha^* = \big(\tfrac{1}{\lambda} X X^T + I\big)^{-1} y$ (verified numerically below)
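A quick numerical check, purely illustrative, that the dual closed form recovers the primal ridge solution $w = (X^T X + \lambda I)^{-1} X^T y$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 8
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.7

# Primal solution of min (1/2)||Xw - y||^2 + (lam/2)||w||^2
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: alpha* = (XX^T/lam + I)^{-1} y, then w* = X^T alpha* / lam
alpha = np.linalg.solve(X @ X.T / lam + np.eye(n), y)
w_dual = X.T @ alpha / lam

print(np.allclose(w_primal, w_dual))   # True
```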
Dual Least Squares Regression Problem

- Dual problem:

$$\max_{\alpha} \; -\alpha^T \Big(\tfrac{1}{2\lambda} X X^T + \tfrac{1}{2} I\Big)\alpha + \alpha^T y$$

- KKT conditions imply $w^* = \frac{1}{\lambda} X^T \alpha^*$
- We can solve the dual in closed form: $\alpha^* = \big(\tfrac{1}{\lambda} X X^T + I\big)^{-1} y$
- Given a test sample $x$, the prediction is
  $w^{*T} x = \frac{1}{\lambda}\Big(\sum_{i=1}^{n} x_i \alpha_i^*\Big)^T x = \frac{1}{\lambda} \sum_{i=1}^{n} \langle x_i, x\rangle\, \alpha_i^*$
- Kernel map $x \to \phi(x)$ and kernel matrix $K_{ij} = \kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle$
- Dual solution $\alpha^* = \big(\tfrac{1}{\lambda} K + I\big)^{-1} y$
- Prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^{n} \kappa(x_i, x)\, \alpha_i^*$ (see the sketch below)
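Putting the last three bullets together, a minimal kernel ridge regression sketch with a Gaussian kernel; the helper name, the bandwidth, and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (40, 1))                          # training inputs
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)
lam = 0.1

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K / lam + np.eye(len(y)), y)     # dual solution

X_test = np.linspace(-1, 1, 200)[:, None]
f_hat = gaussian_kernel(X_test, X) @ alpha / lam         # prediction (1/lam) * sum_i k(x_i, x) alpha_i
```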
Kernel Regression Application

- polynomial kernel $\kappa(x, y) = (1 + x^T y)^4$
- prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^{n} \kappa(x_i, x)\, \alpha_i^*$
Kernel Regression Application

- Gaussian kernel $\kappa(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$
- prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^{n} \kappa(x_i, x)\, \alpha_i^*$
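The same dual fit works with the polynomial kernel $(1 + x^T y)^4$ shown above; only the kernel function changes. A self-contained sketch with illustrative data:

```python
import numpy as np

def poly_kernel(A, B, degree=4):
    """Polynomial kernel k(a, b) = (1 + a^T b)^degree."""
    return (1.0 + A @ B.T) ** degree

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (40, 1))
y = X[:, 0] ** 3 - X[:, 0] + 0.05 * rng.standard_normal(40)
lam = 0.1

alpha = np.linalg.solve(poly_kernel(X, X) / lam + np.eye(len(y)), y)
X_test = np.linspace(-1, 1, 100)[:, None]
f_hat = poly_kernel(X_test, X) @ alpha / lam             # (1/lam) * sum_i k(x_i, x) alpha_i
```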
Reproducing Kernel Hilbert Space

- Mercer's Theorem: any positive definite kernel function can be represented in terms of eigenfunctions:

$$\kappa(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$$

- The functions $\phi_i(x)$ form an orthonormal basis for a function space

$$\mathcal{H}_k = \Big\{ f : f(x) = \sum_{i=1}^{\infty} f_i \phi_i(x), \;\; \sum_{i=1}^{\infty} \frac{f_i^2}{\lambda_i} < \infty \Big\}$$
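The Mercer eigenvalues can be approximated numerically by eigendecomposing the kernel matrix on a dense grid (a Nyström-style approximation with respect to the uniform measure on the grid). This is an illustrative sketch, not part of the slides, showing the rapid eigenvalue decay of the Gaussian kernel.

```python
import numpy as np

# Grid on [-1, 1]; eigenvalues of K/n approximate the Mercer eigenvalues lambda_i
# (with respect to the uniform measure on the grid)
n = 400
x = np.linspace(-1, 1, n)
sigma = 0.5
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma**2))

evals = np.linalg.eigvalsh(K / n)[::-1]   # sort descending
print(evals[:8])                          # rapid decay: only a few eigenfunctions matter
```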
Representer Theorem in Reproducing Kernel Hilbert Space

$$(*) \qquad \min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$

- Representer theorem: the optimal solution must have the form $f^*(x) = \sum_{i=1}^{n} \alpha_i \kappa(x, x_i)$
- Plugging in and applying the reproducing property $\langle \kappa(x_i, \cdot), \kappa(x_j, \cdot)\rangle_{\mathcal{H}} = \kappa(x_i, x_j)$, we get

$$(*) = \min_{\alpha} \; \|K\alpha - y\|_2^2 + \lambda\, \alpha^T K \alpha$$

- solution $\alpha^* = (K + \lambda I)^{-1} y$
- prediction $\hat{f}(x) = \sum_{i=1}^{n} \alpha_i^* \kappa(x, x_i)$
- the same prediction rule is obtained with dual ridge regression (checked numerically below)
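A quick numerical check, illustrative only, that the representer-theorem solution $(K + \lambda I)^{-1} y$ and the dual ridge rule $\frac{1}{\lambda}\big(\frac{1}{\lambda}K + I\big)^{-1} y$ coincide, confirming the last bullet:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((25, 2))
y = rng.standard_normal(25)
lam = 0.3
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, -1) / 2)   # Gaussian kernel, sigma = 1

alpha_rep = np.linalg.solve(K + lam * np.eye(25), y)          # representer theorem
alpha_dual = np.linalg.solve(K / lam + np.eye(25), y) / lam   # dual ridge regression
print(np.allclose(alpha_rep, alpha_dual))                     # True: same prediction rule
```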
Example: Gaussian Kernel

- Gaussian kernel $\kappa(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$

$$\kappa(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$$

$$\phi_i(x) \propto e^{-(c - a)x^2} H_i(x\sqrt{2c}), \qquad \lambda_i = b^i$$

where $a, b, c$ are functions of $\sigma$ and $H_i$ is the $i$th order Hermite polynomial.
Example: Gaussian Kernel

- Gaussian kernel $\kappa(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}$

$$f(x) = \sum_{i=1}^{n} \alpha_i \kappa(x_i, x) = \sum_{i=1}^{n} \sum_{j=1}^{\infty} \lambda_j \phi_j(x_i) \phi_j(x) = \sum_{j=1}^{\infty} f_j \sqrt{\lambda_j}\, \phi_j(x)$$

where $f_j = \sum_{i=1}^{n} \alpha_i \sqrt{\lambda_j}\, \phi_j(x_i)$

$$\min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$

- For a function $h(x) = \sum_{i=1}^{\infty} h_i \phi_i(x)$, the norm $\|h\|_{\mathcal{H}_k}^2 = \langle h, h\rangle_{\mathcal{H}_k} = \sum_{i=1}^{\infty} \frac{h_i^2}{\lambda_i}$ enforces smoothness by penalizing rough eigenfunctions (small $\lambda_i$).
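For a kernel expansion $f = \sum_i \alpha_i \kappa(\cdot, x_i)$ the RKHS norm can be computed directly as $\|f\|_{\mathcal{H}_k}^2 = \alpha^T K \alpha$. The sketch below (illustrative data and bandwidth) shows that increasing $\lambda$ in the regularized fit drives this norm down, i.e., produces smoother solutions.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(8 * x) + 0.2 * rng.standard_normal(30)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1**2))   # Gaussian kernel, sigma = 0.1

for lam in [1e-3, 1e-1, 10.0]:
    alpha = np.linalg.solve(K + lam * np.eye(30), y)
    rkhs_norm_sq = alpha @ K @ alpha       # ||f||_H^2 for f = sum_i alpha_i k(., x_i)
    print(f"lambda={lam:g}  ||f||_H^2 = {rkhs_norm_sq:.3f}")
```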
Example: Sobolev Kernel (one dimensional signals)

$$\mathcal{H}_k = \Big\{ f : [0, 1] \to \mathbb{R} \;\Big|\; f(0) = 0, \; f \text{ absolutely continuous}, \; \int |f'(t)|^2\, dt < \infty \Big\}$$

- absolutely continuous $\Leftrightarrow$ $f'(t)$ exists almost everywhere
- $\mathcal{H}_k$ is a Reproducing Kernel Hilbert Space with kernel $\kappa(x, y) = \min(x, y)$ (see the sketch below)
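A minimal sketch, with illustrative data, of kernel ridge regression using $\kappa(x, y) = \min(x, y)$ on $[0, 1]$; since each $\min(\cdot, x_i)$ is piecewise linear, the fit is piecewise linear with knots at the training points and satisfies $f(0) = 0$.

```python
import numpy as np

rng = np.random.default_rng(7)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sqrt(x_train) + 0.05 * rng.standard_normal(20)    # target with f(0) = 0
lam = 0.1

K = np.minimum(x_train[:, None], x_train[None, :])              # Sobolev kernel k(x, y) = min(x, y)
alpha = np.linalg.solve(K + lam * np.eye(20), y_train)

x_test = np.linspace(0, 1, 200)
f_hat = np.minimum(x_test[:, None], x_train[None, :]) @ alpha   # piecewise-linear fit
```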
Sobolev Kernel vs Polynomial Kernel
Example: Sinc Kernel (one dimensional signals)

- Paley-Wiener space
- $\kappa(x, y) \triangleq \mathrm{sinc}(\alpha(x - y)) = \dfrac{\sin(\alpha(x - y))}{\alpha(x - y)}$
- $f(t)$: bandlimited functions
- related to the Shannon-Whittaker interpolation formula with uniform samples $x[n] = x(nT)$
- bandlimited interpolation:

$$x(t) = \sum_n x[n]\, \mathrm{sinc}\Big(\frac{t - nT}{T}\Big)$$
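A minimal sketch of the bandlimited interpolation formula above, with an illustrative sampling period and signal; note that np.sinc uses the normalized convention $\mathrm{sinc}(u) = \sin(\pi u)/(\pi u)$, the one that appears in the Shannon-Whittaker formula.

```python
import numpy as np

T = 0.1                                    # sampling period (assumed)
n = np.arange(0, 40)
x_n = np.sin(2 * np.pi * 1.5 * n * T)      # samples of a 1.5 Hz sinusoid, well below 1/(2T) = 5 Hz

t = np.linspace(0, 3.9, 1000)              # dense time grid for reconstruction
# x(t) = sum_n x[n] * sinc((t - nT)/T), with sinc(u) = sin(pi u)/(pi u)
x_t = np.sum(x_n[None, :] * np.sinc((t[:, None] - n[None, :] * T) / T), axis=1)
```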