
Nonlinear models for Classification and Regression

Marco Trincavelli
21/11/2011
Mobile Robotics and Olfaction Lab
AASS Research Centre, Örebro University

State of the Art Methods of Data Modeling and Machine Learning, IMRIS program, Fall 2011

Acknowledgments

These slides have been adapted from the slides used in previous years for the Machine Learning course at Örebro University. My gratitude to the former teachers of this course, who provided me with their slides and greatly simplified my work:

Erik Berglund, Thorsteinn Rögnvaldsson

Repetition

1. Repetition

2. Nonlinear models for regression

3. Nonlinear models for classification

4. Artificial Neural Networks


Summary of previous lecture

Learning issues: Generalization, Bias & Variance, Hypothesis space, Cost of inputs.

Linear systems: Linear regression, LMS (adaptive filters), Simple perceptron, Gaussian PDF-based classifier, Logistic regression.

Summary Classification

What does classification mean?

Decision theory

Bayes rule

Linear classifiers

Simple Perceptron

Linear Gaussian Classifier

Logistic Regression

Summary Regression

The fixed regressor assumption: noise in output, static model

Bias & Variance

Error Measures

Analytical solution or learning (gradient descent)

Linear regressors: Linear regression, Ridge regression, LMS (on-line learning)

Bayes’ Rule

$$p(c_k, x) = p(x, c_k)$$

$$p(c_k \mid x) = \frac{p(x \mid c_k)\, p(c_k)}{p(x)}, \qquad \text{where } p(x) = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k)$$
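As a small numerical sketch of this rule (the likelihood and prior values below are made up for illustration):

```python
import numpy as np

# Hypothetical class-conditional densities p(x|c_k) evaluated at one point x,
# and priors p(c_k), for K = 3 classes.
likelihoods = np.array([0.05, 0.20, 0.10])    # p(x|c_k)
priors = np.array([0.5, 0.3, 0.2])            # p(c_k)

evidence = np.sum(likelihoods * priors)       # p(x) = sum_k p(x|c_k) p(c_k)
posteriors = likelihoods * priors / evidence  # Bayes' rule: p(c_k|x)

print(posteriors, posteriors.sum())           # the posteriors sum to 1
```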

Assumptions about the process

The ”fixed regressor” model:

$$y(n) = g[x(n)] + e(n)$$

x(n): observed input
y(n): observed output
g[x(n)]: true underlying function
e(n): i.i.d. noise process with zero mean

Data set: $D = \{x(n), y(n)\}_{n=1}^{N}$

Idealized regression

Use (find) an appropriate model family F. Find f(x) in F with minimum “distance” to g(x) (the “error”). Modify the model parameters until the “error” is minimized.

[Figure: the model family F (our hypothesis set) contains f(x); the target g(x) lies outside F, and the error is the “distance” between f(x) and g(x).]

Error I – Summed Square Error (SSE)

$$E_{SSE} = \sum_{n=1}^{N} \left[ f(x(n), w) - y(n) \right]^2$$

w are the parameters of the function f. SSE assumes zero-mean i.i.d. noise. SSE is the error measure used in least-squares fits.

Error II – Negative log-likelihood

$$E = -\ln P(D \mid w) = -\sum_{n=1}^{N} \ln p\left( f(x(n), w) - y(n) \right)$$

w are the parameters of the function f; D is the dataset. It is common to assume normally distributed noise, which leads to $E \propto E_{SSE}$.

Error III – The Bayesian error measure

Allows including a prior belief, expressed in p(w), about the function f(x, w). A common example is

$$p(w) \propto \exp\left( -\frac{\|w\|^2}{2\sigma_w^2} \right)$$

$$E = -\ln p(w \mid D) = -\ln \frac{p(D \mid w)\, p(w)}{p(D)} = L - \ln p(w) + \ln p(D)$$

where $L = -\ln p(D \mid w)$ is the negative log-likelihood.

Linear regression

We assume a linear process:

$$y(x) = w^{*T} x + e$$

We use a linear model family F:

$$y(x) = f(x, w) = w^T x$$

...and the goal is to make w = w*. Analytical solution:

$$w = (X^T X)^{-1} X^T y = X^{\dagger} y$$

Gradient Descent

$$\Delta w = -\eta \nabla_w E(w)$$

Go downhill. The learning rate η is set heuristically.
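A minimal sketch of this update rule on the squared error of a linear model (the synthetic data and the value of η are my own choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])  # bias + 1 input
w_true = np.array([0.5, -2.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)

w = np.zeros(2)
eta = 0.1  # learning rate, set heuristically
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= eta * grad                        # Δw = -η ∇_w E(w)
print(w)  # close to w_true
```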

Bias & Variance

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \sigma_{\varepsilon}^2$$

A comment on learning...

Learning can be done in two forms:

Storing the information as examples (e.g. a look-up table). This requires a ”distance” measure between samples.

Storing the information in the form of parameters w of a function (e.g. linear regression). This requires a parameter update equation (e.g. gradient descent).

There are intermediate forms of this, e.g. models that are updated locally around examples.

Nonlinear models for regression

1. Repetition

2. Nonlinear models for regression

3. Nonlinear models for classification

4. Artificial Neural Networks

Nonlinear regression

We assume a nonlinear process:

$$y(x) = g(x) + e$$

with i.i.d. noise e. We use a nonlinear model family F:

$$y(x) = f(x, w)$$

Polynomial model family

$$f(x; w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M$$

Linear in w: it reduces to the linear regression case, but with more variables.

Polynomial regression – 1 dimension

$$X = \begin{pmatrix} 1 & x(1) & \cdots & x(1)^M \\ 1 & x(2) & \cdots & x(2)^M \\ \vdots & \vdots & & \vdots \\ 1 & x(N) & \cdots & x(N)^M \end{pmatrix}$$

Analytic solution: $w = (X^T X)^{-1} X^T y = X^{\dagger} y$

Requires the estimation of M+1 parameters, where M is the order of the polynomial.
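A sketch of this fit in code, using the design matrix above and a least-squares solve (the data and the order M are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)  # toy nonlinear target

M = 5                                      # order of the polynomial
X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # w = X† y (M+1 parameters)
y_hat = X @ w                              # fitted values
print(w)
```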

Polynomial regression – 2 dimensions

$$f(x; w) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + \cdots$$

$$X = \begin{pmatrix} 1 & x_1(1) & x_2(1) & x_1(1)\,x_2(1) & \cdots & x_1(1)^M & x_2(1)^M \\ 1 & x_1(2) & x_2(2) & x_1(2)\,x_2(2) & \cdots & x_1(2)^M & x_2(2)^M \\ \vdots & & & & & & \vdots \\ 1 & x_1(N) & x_2(N) & x_1(N)\,x_2(N) & \cdots & x_1(N)^M & x_2(N)^M \end{pmatrix}$$

The number of parameters to estimate scales as $M^D$, where M is the order of the polynomial and D the dimensionality of the input space.

Example – Polynomial model

The true function is a Bessel function.

Generalized Linear model

$$f(x; w) = w_0 + w_1 h_1(x) + \cdots + w_M h_M(x)$$

Linear in the parameters w: reduces to the linear regression case, but with more variables. Requires a good guess on the basis functions $h_k(x)$.

Example – Generalized Linear model

$$f(x, w) = w_1 J_1(x) + w_2 J_2(x) + \ldots$$

where $J_k(x)$ is a Bessel function.

Fourier Series

Fourier series are another example of a generalized linear model:

$$f(x; w) = w_0 + \sum_k w_k \exp\left( i\, \alpha_k^T x \right)$$

K Nearest Neighbour regression

The prediction equals the y of the nearest neighbour (K = 1), or the average, mode, median, etc. of the y of the K nearest neighbours, or the weighted average of the y of the K nearest neighbours:

$$\hat{y}(x) = \sum_{k=1}^{K} w_k(r_k)\, y(m_k)$$

where $m_k$ is the index of the k:th neighbour and $r_k$ is the distance $\|x - x(m_k)\|$.
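A sketch of the unweighted variant ($w_k = 1/K$) on a made-up 1-D dataset:

```python
import numpy as np

def knn_regress(x_query, x_train, y_train, K=3):
    """Predict the average y of the K nearest neighbours of x_query."""
    r = np.abs(x_train - x_query)     # distances r_k = |x - x(m_k)| (1-D case)
    nearest = np.argsort(r)[:K]       # indices m_1, ..., m_K
    return np.mean(y_train[nearest])  # w_k(r_k) = 1/K for the K nearest

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 0.8, 0.9, 0.1, -0.7])
print(knn_regress(2.4, x_train, y_train, K=3))
```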

1 Nearest Neighbour

$$w_k(r_k) = \begin{cases} 1 & \text{for } k = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$m_1 = 2, \qquad \hat{y} = y(2)$$

3 Nearest Neighbours

$$m_1 = 2,\; m_2 = 4,\; m_3 = 5, \qquad \hat{y} = \frac{y(2) + y(4) + y(5)}{3}$$

$$w_k(r_k) = \begin{cases} 1/K & \text{for } k \le K \\ 0 & \text{otherwise} \end{cases}$$

Linear Interpolation

1/r is an interpolation kernel. Can consider all the observations in the dataset:

$$w_k(r_k) = \frac{1/r_k}{\sum_{m=1}^{N} 1/r_m}, \qquad \hat{y}(x) = \frac{\sum_{k=1}^{N} y(m_k)/r_k}{\sum_{k=1}^{N} 1/r_k}$$

Example: KNNR

Kernel Regression

Kernel functions around x(n). Example: the Nadaraya-Watson estimator (Bishop's book, Ch. 6.3.1):

$$f(x; w_0) = \frac{\sum_{n=1}^{N} y(n) \exp\left[ -r^2(x(n)) / 2w_0^2 \right]}{\sum_{n=1}^{N} \exp\left[ -r^2(x(n)) / 2w_0^2 \right]}$$

i.e. the weights are

$$w_k(r_k) = \frac{\exp\left[ -r_k^2 / 2w_0^2 \right]}{\sum_{n=1}^{N} \exp\left[ -r_n^2 / 2w_0^2 \right]}$$
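A sketch of the estimator on toy data (the width w0 is a guess here; in practice it is tuned, as noted two slides below):

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, w0=0.5):
    """Gaussian-kernel regression; w0 is the kernel width."""
    r2 = (x_train - x_query) ** 2           # squared distances r^2(x(n))
    k = np.exp(-r2 / (2 * w0 ** 2))         # kernel around each x(n)
    return np.sum(k * y_train) / np.sum(k)  # weighted average of the y(n)

x_train = np.linspace(0, 4, 9)
y_train = np.sin(x_train)
print(nadaraya_watson(2.4, x_train, y_train, w0=0.5))
```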

Example: Kernel Regression

Note on nonlinear regression

Polynomial regression and generalized linear regression are fitted using error-based learning. KNN regression is just a look-up method. Kernel regression is a combination: the ”width” (w0) is fitted using an iterative (often manual) method.

Nonlinear models for classification

1. Repetition

2. Nonlinear models for regression

3. Nonlinear models for classification

4. Artificial Neural Networks


Quadratic Gaussian Classifier

Assume $p(x \mid c_k)$ Gaussian with different means $u_k$ and different covariance matrices $\Sigma_k$. D is the dimension of the input space. Estimate the means and covariance matrices for the categories by maximizing the likelihood of the dataset $p(D \mid u_k, \Sigma_k)$:

$$p(x \mid c_k) = \frac{1}{(2\pi)^{D/2} \sqrt{\det(\Sigma_k)}} \exp\left( -\frac{1}{2} (x - u_k)^T \Sigma_k^{-1} (x - u_k) \right)$$

$$\hat{u}_k = \frac{1}{N_k} \sum_{x(n) \in c_k} x(n), \qquad \hat{\Sigma}_k = \frac{1}{N_k} \sum_{x(n) \in c_k} \left( x(n) - \hat{u}_k \right) \left( x(n) - \hat{u}_k \right)^T$$
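A sketch of these maximum-likelihood estimates and the resulting classifier on two toy 2-D classes (equal priors assumed; the means and covariances below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 0.5]], 200)  # class 0
X1 = rng.multivariate_normal([2, 1], [[0.4, 0.0], [0.0, 1.2]], 200)  # class 1

def fit_gaussian(X):
    u = X.mean(axis=0)                # ML estimate of the mean
    S = (X - u).T @ (X - u) / len(X)  # ML estimate of the covariance
    return u, S

def log_gauss(x, u, S):
    """Log of the Gaussian density p(x|c_k) from the slide."""
    d = x - u
    return (-0.5 * d @ np.linalg.solve(S, d)
            - 0.5 * np.log(np.linalg.det(S))
            - 0.5 * len(u) * np.log(2 * np.pi))

(u0, S0), (u1, S1) = fit_gaussian(X0), fit_gaussian(X1)
x = np.array([1.0, 0.5])
# With equal priors, pick the class with the larger likelihood.
print(0 if log_gauss(x, u0, S0) > log_gauss(x, u1, S1) else 1)
```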

Example: Quadratic Gaussian Classifier

Training error = 0.07%
Test error = 0.03%

Linear Gaussian class boundary

11399 green samples

2142 red samples

Training error = 0.06%

Test error = 0.10%

Quadratic logistic regression

$$f(x, w) = \frac{1}{1 + \exp\left[ -(x^T A x + b^T x + c) \right]}, \qquad w = \{A, b, c\}$$

Fit w by maximizing the conditional likelihood, as for linear logistic regression.

K Nearest Neighbours classification

Estimate the posterior probabilities according to the neighbours:

$$\hat{p}(c_j \mid x) = \frac{K_j}{K}$$

Maximum a posteriori classification:

$$\hat{c} = \arg\max_{c_j} \hat{p}(c_j \mid x)$$
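A sketch of K-NN MAP classification on made-up 2-D data:

```python
import numpy as np

def knn_classify(x_query, X_train, c_train, K=5):
    """MAP class from the K-NN posterior estimate p(c_j|x) = K_j / K."""
    r = np.linalg.norm(X_train - x_query, axis=1)  # distances to all samples
    nearest = np.argsort(r)[:K]                    # the K nearest neighbours
    labels, counts = np.unique(c_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # argmax_j K_j / K

X_train = np.array([[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2]], float)
c_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([1.6, 1.6]), X_train, c_train, K=3))
```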

Example: 1-NN classifier

Test error = 0.10%

Example: 5-NN classifier

Test error = 0.14%

Decision Trees

Split the data into smaller and smaller subsets. Each split increases node purity (measured, e.g., by entropy). Splits are usually made along variable axes, which generates a subdivision into “hypercubes”. Backwards pruning is important.

Example: Decision Tree

First cut along x1. Rule: if x1 < -0.1515 then red, otherwise green. No suitable cut along x2 after the first cut along x1.

Training error = 0.06%

Test error = 0.07%

Inductive learning of a Decision Tree

Simplest: construct a decision tree with one leaf for every example. Memory-based learning, not very good generalization.

Advanced: split on each variable so that the purity of each split increases (i.e. ideally only samples belonging to one class remain). A purity measure can be, for example, entropy:

$$\text{Entropy} = -\sum_i p(c_i) \ln p(c_i)$$

Entropy: a measure of “order”

The entropy is maximal when all possibilities are equally likely. The goal of the decision tree is to decrease the entropy in each node. Entropy is zero in a “pure” node, i.e. a node containing only samples belonging to one class.

Entropy function

Plot the entropy function for a 2-class problem, where the classes are yes and no, as a function of p(yes):

$$\text{Entropy} = -p(\text{yes}) \ln[p(\text{yes})] - p(\text{no}) \ln[p(\text{no})]$$
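A short sketch of that plot (using matplotlib; clipping p away from 0 and 1 avoids log(0)):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-9, 1 - 1e-9, 200)                # p(yes)
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)  # in nats

plt.plot(p, entropy)
plt.xlabel("p(yes)")
plt.ylabel("Entropy")
plt.show()  # maximal at p(yes) = 0.5, zero at p(yes) = 0 and 1
```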

Decision Tree learning algorithm

Create pure nodes whenever possible. If pure nodes are not possible, choose the split that leads to the largest decrease in entropy.

Decision Tree learning - 1

Apply the decision tree learning algorithm to the following data set with 10 features and 12 observations.

Decision Tree learning - 2

Dataset:

Variable to predict: TRUE or FALSE

Decision Tree learning result

True Decision Tree

Considerations – Inductive learning

The induced decision tree cannot be more complex than what the data support.

The tree was constructed based on perfect learning, i.e. we assume that there are no mistakes in the training data. This is often not a good idea!

It is probably good to stop learning before having pure nodes, or to prune some nodes and branches, and then estimate the a posteriori probabilities from the number of observations of different classes in the leaf:

$$\hat{p}(c_j \mid x) = \frac{K_j}{K}$$

How do we know that f ≈ g?

In other words, how do we know that what we learned is correct? Try f on a new test set of examples (cross validation)... and assume the ”principle of uniformity”, i.e. the result we get on this test data should be indicative of results on future data.

[Figure: learning curve for the decision tree algorithm on 100 randomly generated examples (test set) in the restaurant domain; the graph plots the average of 20 trials.]

Cross Validation

Split your data set into two parts, one for training your model and the other for validating your model. The error on the validation data is called the “validation error” ($E_{val}$):

$$E_{gen} \approx E_{val}$$

K fold Cross Validation

More accurate than using only one validation set. Fit your model K times and test it K times, then average the performance:

$$E_{gen} \approx E_{val} = \frac{1}{K} \sum_{k=1}^{K} E_{val}(k)$$
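A sketch of the procedure for a generic fit/error pair (the linear least-squares model here is just a placeholder):

```python
import numpy as np

def kfold_error(X, y, fit, error, K=5, seed=0):
    """Average validation error over K folds: (1/K) sum_k E_val(k)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])            # fit on K-1 folds
        errs.append(error(model, X[val], y[val]))  # test on the held-out fold
    return np.mean(errs)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, -2.0]) + 0.1 * rng.standard_normal(100)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
error = lambda w, X, y: np.mean((X @ w - y) ** 2)
print(kfold_error(X, y, fit, error, K=5))
```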

Leave One Out Cross Validation

Use every data point as a validation set. Leave One Out is a K fold cross validation with K equal to the number of observations:

$$E_{gen} \approx E_{val} = \frac{1}{K} \sum_{k=1}^{K} E_{val}(k)$$

Artificial Neural Networks

1. Repetition

2. Nonlinear models for regression

3. Nonlinear models for classification

4. Artificial Neural Networks

Problems with single perceptron

1. A single layer allows only a linear machine.
2. Perceptron learning oscillates if the data distributions overlap.
3. Step functions complicate learning with many perceptrons.


Solutions:

1. Use many perceptrons arranged in multiple layers.
2. Use a different learning algorithm.
3. Use smooth transfer functions.


Multilayer Perceptron, a.k.a. Artificial Neural Network (ANN)

The Multilayer Perceptron

Combine several single layer perceptrons. Each single layer perceptron uses a sigmoid-shaped transfer function like the logistic or hyperbolic tangent function:

$$\varphi(z) = \frac{1}{1 + \exp(-z)} \qquad \text{or} \qquad \varphi(z) = \tanh(z)$$

Transfer functions

$$\varphi(z) = \frac{1}{1 + \exp(-z)} \qquad\qquad \varphi(z) = \tanh(z)$$

Training a Multilayer Perceptron

The simplest algorithm for training a multilayer perceptron is the backpropagation algorithm:

1. Select small random weights w.
2. Until the halting condition:
   1. Select a random training example.
   2. Calculate the output of the hidden layer. (forward step)
   3. Calculate the output of the output layer. (forward step)
   4. Calculate the error for the output layer. (backwards step)
   5. Calculate the error for the hidden layer. (backwards step)
   6. Update the weights. (backwards step)


The backpropagation algorithm

The error is propagated backwards, whence the name of the algorithm.

Forward step: $x \rightarrow h(x) \rightarrow \hat{y}[h]$

Backwards step: $\delta_i \leftarrow \delta_j \leftarrow e$

Backpropagation is gradient descent on the SSE:


$$\Delta w(t) = -\eta \nabla_w E(t)$$
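A compact sketch of stochastic backpropagation for a 2-2-1 network with logistic units, trained on XOR (the architecture, data, η, and iteration count are my own choices; convergence is not guaranteed for every random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)  # XOR inputs
y = np.array([0.0, 1.0, 1.0, 0.0])                     # XOR targets

phi = lambda z: 1 / (1 + np.exp(-z))                   # logistic transfer
W1, b1 = rng.normal(0, 0.5, (2, 2)), np.zeros(2)       # small random weights
W2, b2 = rng.normal(0, 0.5, 2), 0.0
eta = 0.5

for _ in range(20000):
    n = rng.integers(4)                    # 1. pick a random training example
    x, t = X[n], y[n]
    h = phi(W1 @ x + b1)                   # 2. hidden-layer output   (forward)
    o = phi(W2 @ h + b2)                   # 3. output-layer output   (forward)
    d_o = (o - t) * o * (1 - o)            # 4. output-layer error    (backward)
    d_h = (W2 * d_o) * h * (1 - h)         # 5. hidden-layer error    (backward)
    W2 -= eta * d_o * h;  b2 -= eta * d_o  # 6. update the weights
    W1 -= eta * np.outer(d_h, x); b1 -= eta * d_h

print(np.round(phi(W2 @ phi(X @ W1.T + b1).T + b2), 2))  # ideally ~ [0 1 1 0]
```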

Example: 2-2-1 backpropagation

Training error = 0.20%

Test error = 0.24%

Converges after 3000 epochs (forever!!)

Speeding up BackProp: bold driver

Adaptive learning rate: if things are going well, increase speed; if things are going badly, decrease speed.

$$\Delta w(t) = -\eta(t) \nabla_w E(t)$$

$$\eta(t) = \begin{cases} 1.2\, \eta(t-1) & \text{if } E(t) < E(t-1) \\ 0.5\, \eta(t-1) & \text{if } E(t) \ge E(t-1) \end{cases}$$
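A sketch of the rule wrapped around the earlier gradient-descent loop (in this variant a failed step is simply not accepted before retrying with a smaller rate; the data are as before):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, -2.0]) + 0.1 * rng.standard_normal(100)

E = lambda w: np.mean((X @ w - y) ** 2)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w, eta = np.zeros(2), 0.01
for _ in range(200):
    w_new = w - eta * grad(w)
    if E(w_new) < E(w):   # going well: accept the step, increase speed
        w, eta = w_new, 1.2 * eta
    else:                 # going badly: reject the step, decrease speed
        eta = 0.5 * eta
print(w, eta)
```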

Example: bold driver

Backpropagation: fixed η. Bold driver: adaptive η.

Speeding up BackProp: momentum

Speed up if several steps are in the same direction; slow down if the steps change direction all the time. The update is a moving average of the updates calculated at every iteration.

Update rule for backpropagation:

$$\Delta w(t) = -\eta \nabla_w E(t)$$

Update rule for backpropagation with momentum:

$$\Delta w(t) = -\eta \nabla_w E(t) + \alpha\, \Delta w(t-1)$$
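A sketch of the momentum update on the same toy linear problem as before (η and α are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, -2.0]) + 0.1 * rng.standard_normal(100)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w, dw = np.zeros(2), np.zeros(2)
eta, alpha = 0.1, 0.9              # learning rate and momentum term
for _ in range(200):
    dw = -eta * grad(w) + alpha * dw  # Δw(t) = -η ∇_w E(t) + α Δw(t-1)
    w += dw
print(w)
```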

Example: momentum

Backpropagation without momentum vs. backpropagation with momentum.

Second order search

Go downhill. The step length is estimated from the curvature of the function:

$$\Delta w = -H^{-1} \nabla_w E$$

Second order learning algorithms

Adjust the step length analytically.

Jacobian: the matrix of the first partial derivatives of a function:

$$J(f)_{i,j} = \frac{\partial f_i}{\partial x_j}$$

Hessian: the matrix of the second partial derivatives of a function:

$$H(f)_{i,j} = \frac{\partial^2 f}{\partial x_i\, \partial x_j}$$

Why second order learning algorithms?

Taylor expansion of the error function:

$$E(w + \Delta w) \approx E(w) + \nabla E(w)^T \Delta w + \frac{1}{2} \Delta w^T H(w)\, \Delta w$$

Hessian matrix: $H(w) = \nabla \nabla^T E(w)$

Setting the first derivative of $E(w + \Delta w)$ w.r.t. $\Delta w$ equal to zero yields:

$$\nabla_{\Delta w} E(w + \Delta w) = 0 \;\Rightarrow\; \Delta w = -H^{-1} \nabla_w E$$

Example: second order learning

Solution in one step for a quadratic error function!

Levenberg-Marquardt algorithm

The Hessian is expensive to calculate, and second derivatives are often very noisy. Use a Jacobian-based approximation of the Hessian and apply regularization:

$$H \approx 2 \sum_n J^T(n)\, J(n)$$

$$\Delta w(t) = -\left[ \sum_n J^T(n, t)\, J(n, t) + \lambda I \right]^{-1} \nabla_w E(t)$$
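A sketch of Levenberg-Marquardt on a small nonlinear least-squares problem, fitting y = a·exp(b·x) (the model, data, and λ schedule are illustrative; the factor 2 in ∇E = 2 Jᵀe is absorbed into the solve):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 40)
y = 2.0 * np.exp(-1.5 * x) + 0.02 * rng.standard_normal(x.size)

def residuals(w):                 # e(n) = f(x(n), w) - y(n)
    a, b = w
    return a * np.exp(b * x) - y

def jacobian(w):                  # J(n) = de(n)/dw, one row per sample
    a, b = w
    return np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])

w, lam = np.array([1.0, -1.0]), 1e-2
for _ in range(50):
    e, J = residuals(w), jacobian(w)
    dw = -np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ e)
    if np.sum(residuals(w + dw) ** 2) < np.sum(e ** 2):
        w, lam = w + dw, 0.5 * lam  # step accepted: trust the model more
    else:
        lam = 2.0 * lam             # step rejected: regularize more
print(w)  # close to (2.0, -1.5)
```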

Example: 2-2-1 backpropagation (repeated for comparison)

Training error = 0.20%

Test error = 0.24%

Converges after 3000 epochs (forever!!)

Example: 2-2-1 Levenberg-Marquardt

Training error = 0.03%

Test error = 0.07%

Converges after 3 epochs

Example: 2-3-1 Levenberg-Marquardt

Training error = 0.17%

Test error = 0.16%

Converges after 150 epochs

ANN for nonlinear regression

No unique solution!

When the optimization algorithm terminates, nothing certifies that the global minimum of the error function has been reached. Often ANN training algorithms terminate in local minima.

Interpretation of ANN

Classification: nonlinear logistic regression,

$$\hat{p}(c \mid x) = \hat{y}(x) = \frac{1}{1 + \exp[-f(x)]}$$

Regression: projection pursuit regression,

$$\hat{y}(x) = \sum_j v_j\, h_j(w_j^T x)$$

Book Reading (Bishop)

Ch. 2.5

Ch. 4.1, 4.2, 4.3

Ch. 5.1, 5.2, 5.3, 5.4