Learning From Data Lecture 18: Radial Basis Functions
Non-Parametric RBF, Parametric RBF, k-RBF-Network
M. Magdon-Ismail, CSCI 4100/6100


recap: Data Condensation and Nearest Neighbor Search

[Figure: condensing a training set S1 to a consistent subset S2]

Branch and bound for finding nearest neighbors.
Lloyd's algorithm for finding a good clustering.

© AML Creator: Malik Magdon-Ismail

Radial Basis Functions (RBF)

k-Nearest Neighbor: considers only the k nearest neighbors, and each neighbor has equal weight.

What about using all the data to compute g(x)?

RBF: uses all the data; data further away from x get less weight.

Weighting the Data Points: α_n

Test point x. α_n is the weight of x_n in g(x):

    α_n(x) = φ( ||x − x_n|| / r )

The weighting is a decreasing function of the distance ||x − x_n||, measured relative to a scale parameter r; the kernel φ determines how the weighting decreases with distance.

Most popular kernel, the Gaussian:

    φ(z) = e^{−z²/2}.

The window kernel, which mimics k-NN:

    φ(z) = 1 if z ≤ 1,  0 if z > 1.
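The two kernels and the weights α_n(x) are easy to compute; a minimal NumPy sketch (the function names are mine, not from the lecture):

```python
import numpy as np

def gaussian_kernel(z):
    # Most popular kernel: phi(z) = exp(-z^2 / 2)
    return np.exp(-0.5 * z ** 2)

def window_kernel(z):
    # Window kernel, mimics k-NN: phi(z) = 1 if z <= 1, else 0
    return (np.asarray(z) <= 1).astype(float)

def alpha(x, X, r, phi=gaussian_kernel):
    # alpha_n(x) = phi(||x - x_n|| / r) for every training point x_n
    return phi(np.linalg.norm(X - x, axis=1) / r)
```

Swapping `phi` changes the weighting profile without touching the rest of the computation.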

Nonparametric RBF – Regression

    α_n(x) = φ( ||x − x_n|| / r )

    g(x) = ∑_{n=1}^{N} [ α_n(x) / ∑_{m=1}^{N} α_m(x) ] · y_n

A weighted average of the target values y_n.
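A minimal sketch of the regression rule above, assuming the Gaussian kernel:

```python
import numpy as np

def rbf_regress(x, X, y, r):
    # g(x) = sum_n (alpha_n(x) / sum_m alpha_m(x)) * y_n,
    # with the Gaussian kernel phi(z) = exp(-z^2 / 2)
    z = np.linalg.norm(X - x, axis=1) / r
    a = np.exp(-0.5 * z ** 2)
    return float(np.dot(a / a.sum(), y))
```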

Nonparametric RBF – Classification

    g(x) = sign( ∑_{n=1}^{N} [ α_n(x) / ∑_{m=1}^{N} α_m(x) ] · y_n )

The sign of the weighted average of the target values.

Nonparametric RBF – Logistic Regression

    g(x) = ∑_{n=1}^{N} [ α_n(x) / ∑_{m=1}^{N} α_m(x) ] · ⟦y_n = +1⟧

A weighted average of ⟦y_n = +1⟧: an estimate of the probability that y = +1 at x.
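The classification and logistic-regression rules differ from the regression rule only in the final step; a sketch, again assuming the Gaussian kernel and labels y_n in {−1, +1}:

```python
import numpy as np

def rbf_weights(x, X, r):
    # Normalized weights alpha_n(x) / sum_m alpha_m(x), Gaussian kernel
    z = np.linalg.norm(X - x, axis=1) / r
    a = np.exp(-0.5 * z ** 2)
    return a / a.sum()

def rbf_classify(x, X, y, r):
    # Classification: g(x) = sign(weighted average of the y_n)
    return float(np.sign(rbf_weights(x, X, r) @ y))

def rbf_prob(x, X, y, r):
    # Logistic-regression version: weighted average of [y_n == +1]
    return float(rbf_weights(x, X, r) @ (y == 1).astype(float))
```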

Choice of Scale r

Nearest Neighbor. Choosing k: k = 3; k = √N; cross validation (CV).
[Figure: decision boundaries for k = 1, k = 3, k = 11]

Nonparametric RBF. Choosing r: r ∼ 1/(2N^{1/d}); cross validation (CV).
[Figure: fits for r = 0.01 (overfitting), r = 0.05, r = 0.5 (underfitting)]

Highlights of Nonparametric RBF

1. Simple (a 'smooth' version of the k-NN rule).
2. No training.
3. Near-optimal E_out.
4. Easy to justify the classification to a customer.
5. Can do classification, multi-class, regression, logistic regression.
6. Computationally demanding.

A good method!

Scaled Bumps on Each Data Point

    g(x) = ∑_{n=1}^{N} [ α_n(x) / ∑_{m=1}^{N} α_m(x) ] · y_n        (a weighted average of the y_n)

Rewrite as weighted bumps:

    g(x) = ∑_{n=1}^{N} [ y_n / ∑_{m=1}^{N} α_m(x) ] · φ( ||x − x_n|| / r )

         = ∑_{n=1}^{N} w_n(x) · φ( ||x − x_n|| / r )

A sum of bumps at the x_n, each scaled by w_n(x).

Parametric RBF – A Linear Model

Nonparametric RBF:

    g(x) = ∑_{n=1}^{N} w_n(x) · φ( ||x − x_n|| / r )

Only need to specify r.
[Figure: nonparametric fits for r = 0.1 and r = 0.3]

Parametric RBF:

    h(x) = ∑_{n=1}^{N} w_n · φ( ||x − x_n|| / r )

Fix r; determine the parameters w_n to fit the data. (Overfit the data?)
[Figure: parametric fits for r = 0.1 and r = 0.3]

RBF-Nonlinear Transform Depends on the Data

    h(x) = ∑_{n=1}^{N} w_n · φ( ||x − x_n|| / r ) = w^t z

    z = Φ(x) = [φ_1(x), φ_2(x), ..., φ_N(x)]^t,  where φ_n(x) = φ( ||x − x_n|| / r ).

    Z = [z_1^t; z_2^t; ...; z_N^t] = [Φ(x_1)^t; Φ(x_2)^t; ...; Φ(x_N)^t]   (one row per data point)

Fit the data (h(x_n) = y_n):

    w = Z†y = (Z^t Z)^{-1} Z^t y

[Figure: exact fits for r = 0.1 and r = 0.3]
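Fitting h(x_n) = y_n amounts to one pseudoinverse solve; a sketch with the Gaussian kernel and one bump per data point, so Z is N × N (function names are illustrative):

```python
import numpy as np

def rbf_matrix(Xa, Xb, r):
    # Z[n, j] = phi(||x_n - x_j|| / r), Gaussian phi
    D = np.linalg.norm(Xa[:, None, :] - Xb[None, :, :], axis=2)
    return np.exp(-0.5 * (D / r) ** 2)

def fit_exact(X, y, r):
    # Fit the data h(x_n) = y_n: w = Z^dagger y
    Z = rbf_matrix(X, X, r)
    return np.linalg.pinv(Z) @ y

def predict(Xq, X, w, r):
    return rbf_matrix(Xq, X, r) @ w
```

With N bumps for N points the fit interpolates the data exactly, which is precisely the overfitting risk the slide raises.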

Reducing the Number of Bumps: the k-RBF-Network

Nonparametric:

    g(x) = ∑_{n=1}^{N} w_n(x) · φ( ||x − x_n|| / r )

Parametric (N centers):

    h(x) = ∑_{n=1}^{N} w_n · φ( ||x − x_n|| / r )

k-RBF-Network (k centers):

    h(x) = w_0 + ∑_{j=1}^{k} w_j · φ( ||x − µ_j|| / r ) = w^t Φ(x)

    Φ(x)^t = [1, φ_1(x), ..., φ_k(x)],  where φ_j(x) = φ( ||x − µ_j|| / r ).

The hypothesis is linear in the weights w_j but nonlinear in the centers µ_j.

[Figure: network diagram. The input x feeds k units; unit j computes φ( ||x − µ_j|| / r ); the unit outputs, together with a constant 1, are combined with weights w_0, w_1, ..., w_k to produce h(x).]

Fitting the Data

Before: bumps were centered on the x_n; there was no choice.
Now: we may choose the bump centers µ_j.

Choose them to 'cover' the data, as the centers of k 'clusters'.

Given the bump centers, we have a linear model to determine the w_j.
That's 'easy'; we know how to do that.

Fitting the Data

Fitting the RBF-network to the data (given k, r):

1: Use the inputs X to determine k centers µ_1, ..., µ_k.

2: Compute the N × (k + 1) feature matrix

    Z = [z_1^t; z_2^t; ...; z_N^t] = [Φ(x_1)^t; Φ(x_2)^t; ...; Φ(x_N)^t],

   where Φ(x) = [1, φ_1(x), ..., φ_k(x)]^t and φ_j(x) = φ( ||x − µ_j|| / r ).
   Each row of Z is the RBF-feature corresponding to x_n (with dummy bias coordinate 1).

3: Fit the linear model Zw to y to determine the weights w*.
   Classification: PLA, pocket, linear programming, ...
   Regression: pseudoinverse.
   Logistic regression: gradient descent on the cross-entropy error.

Choose r using CV, or via a heuristic:

    r ∼ (radius of the data) / k^{1/d}

(so your clusters 'cover' the data).

Our Example

[Figure: fits with k = 4, r = 1/k and with k = 10, r = 1/k]

    w = (Z^t Z)^{-1} Z^t y

Use Regularization to Fight Overfitting

[Figure: k = 10, r = 1/k, unregularized vs. regularized fits]

    w = (Z^t Z + λI)^{-1} Z^t y
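The regularized weights are a one-line change from the pseudoinverse; a sketch (λ is the regularization parameter):

```python
import numpy as np

def fit_regularized(Z, y, lam):
    # w = (Z^t Z + lambda * I)^{-1} Z^t y
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```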

Reflecting on the k-RBF-Network

1. We derived it as a 'soft' generalization of the k-NN rule.
   It can also be derived from regularization theory, or from noisy interpolation theory.
2. Can use the nonparametric or the parametric version.
3. Given the centers, it is 'easy' to learn the weights using techniques from linear models: a linear model with an adaptable nonlinear transform.
4. We used uniform bumps; the bumps can also have different shapes Σ_j.
5. NEXT: how to better choose the centers: unsupervised learning.

A Peek at Unsupervised Learning

[Figure: the digits data plotted by average intensity and symmetry. Left: the 21-NN rule with 10 classes (digit labels shown). Right: a 10-cluster clustering of the same data.]