
Nonlinear models for Classification and Regression

Marco Trincavelli
21/11/2011
Mobile Robotics and Olfaction Lab
AASS Research Centre, Örebro University

State of the Art Methods of Data Modeling and Machine Learning, IMRIS program, Fall 2011

Acknowledgments

These slides have been adapted from the slides used in previous years for the Machine Learning course at Örebro University. My gratitude to the former teachers of this course, who provided me with their slides and greatly simplified my work:

Erik Berglund, Thorsteinn Rögnvaldsson

Repetition

1. Repetition

2. Nonlinear models for regression

3. Nonlinear models for classification

4. Artificial Neural Networks


Summary of previous lecture

Learning issues: Generalization, Bias & Variance, Hypothesis space, Cost of inputs.

Linear systems: Linear regression, LMS (adaptive filters), Simple perceptron, Gaussian PDF-based classifier, Logistic regression.

Summary Classification

What does classification mean?

Decision theory

Bayes rule

Linear classifiers

Simple Perceptron

Linear Gaussian Classifier

Logistic Regression

Summary Regression

The fixed regressor assumption: noise in output, static model

Bias & Variance

Error Measures

Analytical solution or learning (gradient descent)

Linear regressors: Linear regression, Ridge regression, LMS (on-line learning)

Bayes’ Rule

$$p(c_k, x) = p(x, c_k)$$

$$p(c_k \mid x) = \frac{p(x \mid c_k)\, p(c_k)}{p(x)}, \qquad \text{where } p(x) = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k)$$
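As a small numerical sketch of this rule (the likelihood and prior values below are made up for illustration):

```python
import numpy as np

# Hypothetical class-conditional densities p(x|c_k) evaluated at one point x,
# and priors p(c_k), for K = 3 classes.
likelihoods = np.array([0.05, 0.20, 0.10])    # p(x|c_k)
priors = np.array([0.5, 0.3, 0.2])            # p(c_k)

evidence = np.sum(likelihoods * priors)       # p(x) = sum_k p(x|c_k) p(c_k)
posteriors = likelihoods * priors / evidence  # Bayes' rule: p(c_k|x)

print(posteriors, posteriors.sum())           # the posteriors sum to 1
```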

Assumptions about the process

The ”fixed regressor” model:

$$y(n) = g[x(n)] + e(n)$$

x(n): observed input
y(n): observed output
g[x(n)]: true underlying function
e(n): i.i.d. noise process with zero mean

Data set: $D = \{x(n), y(n)\}_{n=1}^{N}$

Idealized regression

Use (find) an appropriate model family F. Find f(x) in F with minimum “distance” to g(x) (the “error”). Modify the model parameters until the “error” is minimized.

[Figure: the model family F (our hypothesis set) contains f(x); the target g(x) lies outside F, and the error is the “distance” between f(x) and g(x).]

Error I – Summed Square Error (SSE)

$$E_{SSE} = \sum_{n=1}^{N} \left[ f(x(n), w) - y(n) \right]^2$$

w are the parameters of the function f. SSE assumes zero-mean i.i.d. noise. SSE is the error measure used in least-squares fits.

Error II – Negative log-likelihood

$$E = -\ln P(D \mid w) = -\sum_{n=1}^{N} \ln p\left( f(x(n), w) - y(n) \right)$$

w are the parameters of the function f; D is the dataset. It is common to assume normally distributed noise, which leads to $E \propto E_{SSE}$.

Error III – The Bayesian error measure

Allows including a prior belief, expressed in p(w), about the function f(x, w). A common example is

$$p(w) \propto \exp\left( -\frac{\|w\|^2}{2\sigma_w^2} \right)$$

$$E = -\ln p(w \mid D) = -\ln \frac{p(D \mid w)\, p(w)}{p(D)} = L - \ln p(w) + \ln p(D)$$

where $L = -\ln p(D \mid w)$ is the negative log-likelihood.

Linear regression

We assume a linear process:

$$y(x) = w^{*T} x + e$$

We use a linear model family F:

$$y(x) = f(x, w) = w^T x$$

...and the goal is to make w = w*. Analytical solution:

$$w = (X^T X)^{-1} X^T y = X^{\dagger} y$$

Gradient Descent

$$\Delta w = -\eta \nabla_w E(w)$$

Go downhill. The learning rate η is set heuristically.
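A minimal sketch of this update rule on the squared error of a linear model (the synthetic data and the value of η are my own choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])  # bias + 1 input
w_true = np.array([0.5, -2.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)

w = np.zeros(2)
eta = 0.1  # learning rate, set heuristically
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= eta * grad                        # Δw = -η ∇_w E(w)
print(w)  # close to w_true
```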

Bias & Variance

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \sigma_{\varepsilon}^2$$

A comment on learning...

Learning can be done in two forms:

Storing the information as examples (e.g. a look-up table). This requires a ”distance” measure between samples.

Storing the information in the form of parameters w of a function (e.g. linear regression). This requires a parameter update equation (e.g. gradient descent).

There are intermediate forms of this, e.g. models that are updated locally around examples.

Nonlinear models for regression

1. Repetition

2. Nonlinear models for regression

3. Nonlinear models for classification

4. Artificial Neural Networks

Nonlinear regression

We assume a nonlinear process:

$$y(x) = g(x) + e$$

with i.i.d. noise e. We use a nonlinear model family F:

$$y(x) = f(x, w)$$

Polynomial model family

$$f(x; w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M$$

Linear in w: it reduces to the linear regression case, but with more variables.

Polynomial regression – 1 dimension

$$X = \begin{pmatrix} 1 & x(1) & \cdots & x(1)^M \\ 1 & x(2) & \cdots & x(2)^M \\ \vdots & \vdots & & \vdots \\ 1 & x(N) & \cdots & x(N)^M \end{pmatrix}$$

Analytic solution: $w = (X^T X)^{-1} X^T y = X^{\dagger} y$

Requires the estimation of M+1 parameters, where M is the order of the polynomial.
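A sketch of this fit in code, using the design matrix above and a least-squares solve (the data and the order M are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)  # toy nonlinear target

M = 5                                      # order of the polynomial
X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # w = X† y (M+1 parameters)
y_hat = X @ w                              # fitted values
print(w)
```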

Polynomial regression – 2 dimensions

$$f(x; w) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + \cdots$$

$$X = \begin{pmatrix} 1 & x_1(1) & x_2(1) & x_1(1)\,x_2(1) & \cdots & x_1(1)^M & x_2(1)^M \\ 1 & x_1(2) & x_2(2) & x_1(2)\,x_2(2) & \cdots & x_1(2)^M & x_2(2)^M \\ \vdots & & & & & & \vdots \\ 1 & x_1(N) & x_2(N) & x_1(N)\,x_2(N) & \cdots & x_1(N)^M & x_2(N)^M \end{pmatrix}$$

The number of parameters to estimate scales as $M^D$, where M is the order of the polynomial and D the dimensionality of the input space.

Example – Polynomial model

The true function is a Bessel function.

Generalized Linear model

$$f(x; w) = w_0 + w_1 h_1(x) + \cdots + w_M h_M(x)$$

Linear in the parameters w: reduces to the linear regression case, but with more variables. Requires a good guess on the basis functions $h_k(x)$.

Example – Generalized Linear model

$$f(x, w) = w_1 J_1(x) + w_2 J_2(x) + \ldots$$

where $J_k(x)$ is a Bessel function.

Fourier Series

Fourier series are another example of a generalized linear model:

$$f(x; w) = w_0 + \sum_k w_k \exp\left( i\, \alpha_k^T x \right)$$

K Nearest Neighbour regression

The prediction equals the y of the nearest neighbour (K = 1), or the average, mode, median, etc. of the y of the K nearest neighbours, or the weighted average of the y of the K nearest neighbours:

$$\hat{y}(x) = \sum_{k=1}^{K} w_k(r_k)\, y(m_k)$$

where $m_k$ is the index of the k:th neighbour and $r_k$ is the distance $\|x - x(m_k)\|$.
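A sketch of the unweighted variant ($w_k = 1/K$) on a made-up 1-D dataset:

```python
import numpy as np

def knn_regress(x_query, x_train, y_train, K=3):
    """Predict the average y of the K nearest neighbours of x_query."""
    r = np.abs(x_train - x_query)     # distances r_k = |x - x(m_k)| (1-D case)
    nearest = np.argsort(r)[:K]       # indices m_1, ..., m_K
    return np.mean(y_train[nearest])  # w_k(r_k) = 1/K for the K nearest

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 0.8, 0.9, 0.1, -0.7])
print(knn_regress(2.4, x_train, y_train, K=3))
```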

1 Nearest Neighbour

$$w_k(r_k) = \begin{cases} 1 & \text{for } k = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$m_1 = 2, \qquad \hat{y} = y(2)$$

3 Nearest Neighbours

$$m_1 = 2,\; m_2 = 4,\; m_3 = 5, \qquad \hat{y} = \frac{y(2) + y(4) + y(5)}{3}$$

$$w_k(r_k) = \begin{cases} 1/K & \text{for } k \le K \\ 0 & \text{otherwise} \end{cases}$$

Linear Interpolation

1/r is an interpolation kernel. Can consider all the observations in the dataset:

$$w_k(r_k) = \frac{1/r_k}{\sum_{m=1}^{N} 1/r_m}, \qquad \hat{y}(x) = \frac{\sum_{k=1}^{N} y(m_k)/r_k}{\sum_{k=1}^{N} 1/r_k}$$

Example: KNNR

Kernel Regression

Kernel functions around x(n). Example: the Nadaraya-Watson estimator (Bishop's book, Ch. 6.3.1):

$$f(x; w_0) = \frac{\sum_{n=1}^{N} y(n) \exp\left[ -r^2(x(n)) / 2w_0^2 \right]}{\sum_{n=1}^{N} \exp\left[ -r^2(x(n)) / 2w_0^2 \right]}$$

i.e. the weights are

$$w_k(r_k) = \frac{\exp\left[ -r_k^2 / 2w_0^2 \right]}{\sum_{n=1}^{N} \exp\left[ -r_n^2 / 2w_0^2 \right]}$$
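A sketch of the estimator on toy data (the width w0 is a guess here; in practice it is tuned, as noted two slides below):

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, w0=0.5):
    """Gaussian-kernel regression; w0 is the kernel width."""
    r2 = (x_train - x_query) ** 2           # squared distances r^2(x(n))
    k = np.exp(-r2 / (2 * w0 ** 2))         # kernel around each x(n)
    return np.sum(k * y_train) / np.sum(k)  # weighted average of the y(n)

x_train = np.linspace(0, 4, 9)
y_train = np.sin(x_train)
print(nadaraya_watson(2.4, x_train, y_train, w0=0.5))
```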

Example: Kernel Regression

Note on nonlinear regression

Polynomial regression and generalized linear regression are fitted using error-based learning. KNN regression is just a look-up method. Kernel regression is a combination: the ”width” (w0) is fitted using an iterative (often manual) method.

Nonlinear models for classification

1. Repetition

2. Nonlinear models for regression

3. Nonlinear models for classification

4. Artificial Neural Networks


Quadratic Gaussian Classifier

Assume $p(x \mid c_k)$ Gaussian with different means $u_k$ and different covariance matrices $\Sigma_k$. D is the dimension of the input space. Estimate the means and covariance matrices for the categories by maximizing the likelihood of the dataset $p(D \mid u_k, \Sigma_k)$:

$$p(x \mid c_k) = \frac{1}{(2\pi)^{D/2} \sqrt{\det(\Sigma_k)}} \exp\left( -\frac{1}{2} (x - u_k)^T \Sigma_k^{-1} (x - u_k) \right)$$

$$\hat{u}_k = \frac{1}{N_k} \sum_{x(n) \in c_k} x(n), \qquad \hat{\Sigma}_k = \frac{1}{N_k} \sum_{x(n) \in c_k} \left( x(n) - \hat{u}_k \right) \left( x(n) - \hat{u}_k \right)^T$$
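A sketch of these maximum-likelihood estimates and the resulting classifier on two toy 2-D classes (equal priors assumed; the means and covariances below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 0.5]], 200)  # class 0
X1 = rng.multivariate_normal([2, 1], [[0.4, 0.0], [0.0, 1.2]], 200)  # class 1

def fit_gaussian(X):
    u = X.mean(axis=0)                # ML estimate of the mean
    S = (X - u).T @ (X - u) / len(X)  # ML estimate of the covariance
    return u, S

def log_gauss(x, u, S):
    """Log of the Gaussian density p(x|c_k) from the slide."""
    d = x - u
    return (-0.5 * d @ np.linalg.solve(S, d)
            - 0.5 * np.log(np.linalg.det(S))
            - 0.5 * len(u) * np.log(2 * np.pi))

(u0, S0), (u1, S1) = fit_gaussian(X0), fit_gaussian(X1)
x = np.array([1.0, 0.5])
# With equal priors, pick the class with the larger likelihood.
print(0 if log_gauss(x, u0, S0) > log_gauss(x, u1, S1) else 1)
```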

Example: Quadratic Gaussian Classifier

Training error = 0.07%
Test error = 0.03%

Linear Gaussian class boundary

11399 green samples

2142 red samples

Training error = 0.06%

Test error = 0.10%

Quadratic logistic regression

$$f(x, w) = \frac{1}{1 + \exp\left[ -(x^T A x + b^T x + c) \right]}, \qquad w = \{A, b, c\}$$

Fit w by maximizing the conditional likelihood, as for linear logistic regression.

K Nearest Neighbours classification

Estimate the posterior probabilities according to the neighbours:

$$\hat{p}(c_j \mid x) = \frac{K_j}{K}$$

Maximum a posteriori classification:

$$\hat{c} = \arg\max_{c_j} \hat{p}(c_j \mid x)$$
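A sketch of K-NN MAP classification on made-up 2-D data:

```python
import numpy as np

def knn_classify(x_query, X_train, c_train, K=5):
    """MAP class from the K-NN posterior estimate p(c_j|x) = K_j / K."""
    r = np.linalg.norm(X_train - x_query, axis=1)  # distances to all samples
    nearest = np.argsort(r)[:K]                    # the K nearest neighbours
    labels, counts = np.unique(c_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # argmax_j K_j / K

X_train = np.array([[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2]], float)
c_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([1.6, 1.6]), X_train, c_train, K=3))
```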

Example: 1-NN classifier

Test error = 0.10%

Example: 5-NN classifier

Test error = 0.14%

Decision Trees

Split the data into smaller and smaller subsets. Each split increases node purity (measured, e.g., by entropy). Splits are usually made along variable axes, which generates a subdivision into “hypercubes”. Backwards pruning is important.

Example: Decision Tree

First cut along x1. Rule: if x1 < -0.1515 then red, otherwise green. No suitable cut along x2 after the first cut along x1.

Training error = 0.06%

Test error = 0.07%

Inductive learning of a Decision Tree

Simplest: construct a decision tree with one leaf for every example. Memory-based learning, not very good generalization.

Advanced: split on each variable so that the purity of each split increases (i.e. ideally only samples belonging to one class remain). A purity measure can be, for example, entropy:

$$\text{Entropy} = -\sum_i p(c_i) \ln p(c_i)$$

Entropy: a measure of “order”

The entropy is maximal when all possibilities are equally likely. The goal of the decision tree is to decrease the entropy in each node. Entropy is zero in a “pure” node, i.e. a node containing only samples belonging to one class.

Entropy function

Plot the entropy function for a 2-class problem, where the classes are yes and no, as a function of p(yes):

$$\text{Entropy} = -p(\text{yes}) \ln[p(\text{yes})] - p(\text{no}) \ln[p(\text{no})]$$
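A short sketch of that plot (using matplotlib; clipping p away from 0 and 1 avoids log(0)):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-9, 1 - 1e-9, 200)                # p(yes)
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)  # in nats

plt.plot(p, entropy)
plt.xlabel("p(yes)")
plt.ylabel("Entropy")
plt.show()  # maximal at p(yes) = 0.5, zero at p(yes) = 0 and 1
```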

Decision Tree learning algorithm

Create pure nodes whenever possible. If pure nodes are not possible, choose the split that leads to the largest decrease in entropy.

Decision Tree learning - 1

Apply the decision tree learning algorithm to the following data set with 10 features and 12 observations.

Decision Tree learning - 2

Dataset:

Variable to predict: TRUE or FALSE

Decision Tree learning result

True Decision Tree

Considerations – Inductive learning

The induced decision tree cannot be more complex than what the data support.

The tree was constructed based on perfect learning, i.e. we assume that there are no mistakes in the training data. This is often not a good idea!

It is probably good to stop learning before having pure nodes, or to prune some nodes and branches, and then estimate the a posteriori probabilities from the number of observations of different classes in the leaf:

$$\hat{p}(c_j \mid x) = \frac{K_j}{K}$$

How do we know that f ≈ g?

In other words, how do we know that what we learned is correct? Try f on a new test set of examples (cross validation)... and assume the ”principle of uniformity”, i.e. the result we get on this test data should be indicative of results on future data.

[Figure: learning curve for the decision tree algorithm on 100 randomly generated examples (test set) in the restaurant domain; the graph plots the average of 20 trials.]

Cross Validation

Split your data set into two parts, one for training your model and the other for validating your model. The error on the validation data is called the “validation error” ($E_{val}$):

$$E_{gen} \approx E_{val}$$

K fold Cross Validation

More accurate than using only one validation set. Fit your model K times and test it K times, then average the performance:

$$E_{gen} \approx E_{val} = \frac{1}{K} \sum_{k=1}^{K} E_{val}(k)$$
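A sketch of the procedure for a generic fit/error pair (the linear least-squares model here is just a placeholder):

```python
import numpy as np

def kfold_error(X, y, fit, error, K=5, seed=0):
    """Average validation error over K folds: (1/K) sum_k E_val(k)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])            # fit on K-1 folds
        errs.append(error(model, X[val], y[val]))  # test on the held-out fold
    return np.mean(errs)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, -2.0]) + 0.1 * rng.standard_normal(100)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
error = lambda w, X, y: np.mean((X @ w - y) ** 2)
print(kfold_error(X, y, fit, error, K=5))
```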

Leave One Out Cross Validation

Use every data point as a validation set. Leave One Out is a K fold cross validation with K equal to the number of observations:

$$E_{gen} \approx E_{val} = \frac{1}{K} \sum_{k=1}^{K} E_{val}(k)$$

Artificial Neural Networks

1. Repetition

2. Nonlinear models for regression

3. Nonlinear models for classification

4. Artificial Neural Networks

Problems with single perceptron

1. A single layer allows only a linear machine.
2. Perceptron learning oscillates if the data distributions overlap.
3. Step functions complicate learning with many perceptrons.


Solutions:

1. Use many perceptrons arranged in multiple layers.
2. Use a different learning algorithm.
3. Use smooth transfer functions.


Multilayer Perceptron, a.k.a. Artificial Neural Network (ANN)

The Multilayer Perceptron

Combine several single layer perceptrons. Each single layer perceptron uses a sigmoid-shaped transfer function like the logistic or hyperbolic tangent function:

$$\varphi(z) = \frac{1}{1 + \exp(-z)} \qquad \text{or} \qquad \varphi(z) = \tanh(z)$$

Transfer functions

$$\varphi(z) = \frac{1}{1 + \exp(-z)} \qquad\qquad \varphi(z) = \tanh(z)$$

Training a Multilayer Perceptron

The simplest algorithm for training a multilayer perceptron is the backpropagation algorithm:

1. Select small random weights w.
2. Until the halting condition:
   1. Select a random training example.
   2. Calculate the output of the hidden layer. (forward step)
   3. Calculate the output of the output layer. (forward step)
   4. Calculate the error for the output layer. (backwards step)
   5. Calculate the error for the hidden layer. (backwards step)
   6. Update the weights. (backwards step)


The backpropagation algorithm

The error is propagated backwards, whence the name of the algorithm.

Forward step: $x \rightarrow h(x) \rightarrow \hat{y}[h]$

Backwards step: $\delta_i \leftarrow \delta_j \leftarrow e$

Backpropagation is gradient descent on the SSE:


$$\Delta w(t) = -\eta \nabla_w E(t)$$
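A compact sketch of stochastic backpropagation for a 2-2-1 network with logistic units, trained on XOR (the architecture, data, η, and iteration count are my own choices; convergence is not guaranteed for every random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)  # XOR inputs
y = np.array([0.0, 1.0, 1.0, 0.0])                     # XOR targets

phi = lambda z: 1 / (1 + np.exp(-z))                   # logistic transfer
W1, b1 = rng.normal(0, 0.5, (2, 2)), np.zeros(2)       # small random weights
W2, b2 = rng.normal(0, 0.5, 2), 0.0
eta = 0.5

for _ in range(20000):
    n = rng.integers(4)                    # 1. pick a random training example
    x, t = X[n], y[n]
    h = phi(W1 @ x + b1)                   # 2. hidden-layer output   (forward)
    o = phi(W2 @ h + b2)                   # 3. output-layer output   (forward)
    d_o = (o - t) * o * (1 - o)            # 4. output-layer error    (backward)
    d_h = (W2 * d_o) * h * (1 - h)         # 5. hidden-layer error    (backward)
    W2 -= eta * d_o * h;  b2 -= eta * d_o  # 6. update the weights
    W1 -= eta * np.outer(d_h, x); b1 -= eta * d_h

print(np.round(phi(W2 @ phi(X @ W1.T + b1).T + b2), 2))  # ideally ~ [0 1 1 0]
```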

Example: 2-2-1 backpropagation

Training error = 0.20%

Test error = 0.24%

Converges after 3000 epochs (forever!!)

Speeding up BackProp: bold driver

Adaptive learning rate: if things are going well, increase speed; if things are going badly, decrease speed.

$$\Delta w(t) = -\eta(t) \nabla_w E(t)$$

$$\eta(t) = \begin{cases} 1.2\, \eta(t-1) & \text{if } E(t) < E(t-1) \\ 0.5\, \eta(t-1) & \text{if } E(t) \ge E(t-1) \end{cases}$$
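A sketch of the rule wrapped around the earlier gradient-descent loop (in this variant a failed step is simply not accepted before retrying with a smaller rate; the data are as before):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, -2.0]) + 0.1 * rng.standard_normal(100)

E = lambda w: np.mean((X @ w - y) ** 2)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w, eta = np.zeros(2), 0.01
for _ in range(200):
    w_new = w - eta * grad(w)
    if E(w_new) < E(w):   # going well: accept the step, increase speed
        w, eta = w_new, 1.2 * eta
    else:                 # going badly: reject the step, decrease speed
        eta = 0.5 * eta
print(w, eta)
```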

Example: bold driver

Backpropagation: fixed η. Bold driver: adaptive η.

Speeding up BackProp: momentum

Speed up if several steps are in the same direction; slow down if the steps change direction all the time. The update is a moving average of the updates calculated at every iteration.

Update rule for backpropagation:

$$\Delta w(t) = -\eta \nabla_w E(t)$$

Update rule for backpropagation with momentum:

$$\Delta w(t) = -\eta \nabla_w E(t) + \alpha\, \Delta w(t-1)$$
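A sketch of the momentum update on the same toy linear problem as before (η and α are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, -2.0]) + 0.1 * rng.standard_normal(100)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w, dw = np.zeros(2), np.zeros(2)
eta, alpha = 0.1, 0.9              # learning rate and momentum term
for _ in range(200):
    dw = -eta * grad(w) + alpha * dw  # Δw(t) = -η ∇_w E(t) + α Δw(t-1)
    w += dw
print(w)
```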

Example: momentum

Backpropagation without momentum vs. backpropagation with momentum.

Second order search

Go downhill. The step length is estimated from the curvature of the function:

$$\Delta w = -H^{-1} \nabla_w E$$

Second order learning algorithms

Adjust the step length analytically.

Jacobian: the matrix of the first partial derivatives of a function:

$$J(f)_{i,j} = \frac{\partial f_i}{\partial x_j}$$

Hessian: the matrix of the second partial derivatives of a function:

$$H(f)_{i,j} = \frac{\partial^2 f}{\partial x_i\, \partial x_j}$$

Why second order learning algorithms?

Taylor expansion of the error function:

$$E(w + \Delta w) \approx E(w) + \nabla E(w)^T \Delta w + \frac{1}{2} \Delta w^T H(w)\, \Delta w$$

Hessian matrix: $H(w) = \nabla \nabla^T E(w)$

Setting the first derivative of $E(w + \Delta w)$ w.r.t. $\Delta w$ equal to zero yields:

$$\nabla_{\Delta w} E(w + \Delta w) = 0 \;\Rightarrow\; \Delta w = -H^{-1} \nabla_w E$$

Example: second order learning

Solution in one step for a quadratic error function!

Levenberg-Marquardt algorithm

The Hessian is expensive to calculate, and second derivatives are often very noisy. Use a Jacobian-based approximation of the Hessian and apply regularization:

$$H \approx 2 \sum_n J^T(n)\, J(n)$$

$$\Delta w(t) = -\left[ \sum_n J^T(n, t)\, J(n, t) + \lambda I \right]^{-1} \nabla_w E(t)$$
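A sketch of Levenberg-Marquardt on a small nonlinear least-squares problem, fitting y = a·exp(b·x) (the model, data, and λ schedule are illustrative; the factor 2 in ∇E = 2 Jᵀe is absorbed into the solve):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 40)
y = 2.0 * np.exp(-1.5 * x) + 0.02 * rng.standard_normal(x.size)

def residuals(w):                 # e(n) = f(x(n), w) - y(n)
    a, b = w
    return a * np.exp(b * x) - y

def jacobian(w):                  # J(n) = de(n)/dw, one row per sample
    a, b = w
    return np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])

w, lam = np.array([1.0, -1.0]), 1e-2
for _ in range(50):
    e, J = residuals(w), jacobian(w)
    dw = -np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ e)
    if np.sum(residuals(w + dw) ** 2) < np.sum(e ** 2):
        w, lam = w + dw, 0.5 * lam  # step accepted: trust the model more
    else:
        lam = 2.0 * lam             # step rejected: regularize more
print(w)  # close to (2.0, -1.5)
```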

Example: 2-2-1 backpropagation (repeated for comparison)

Training error = 0.20%

Test error = 0.24%

Converges after 3000 epochs (forever!!)

Example: 2-2-1 Levenberg-Marquardt

Training error = 0.03%

Test error = 0.07%

Converges after 3 epochs

Example: 2-3-1 Levenberg-Marquardt

Training error = 0.17%

Test error = 0.16%

Converges after 150 epochs

ANN for nonlinear regression

No unique solution!

When the optimization algorithm terminates, nothing certifies that the global minimum of the error function has been reached. Often ANN training algorithms terminate in local minima.

Interpretation of ANN

Classification: nonlinear logistic regression,

$$\hat{p}(c \mid x) = \hat{y}(x) = \frac{1}{1 + \exp[-f(x)]}$$

Regression: projection pursuit regression,

$$\hat{y}(x) = \sum_j v_j\, h_j(w_j^T x)$$

Book Reading (Bishop)

Ch. 2.5

Ch. 4.1, 4.2, 4.3

Ch. 5.1, 5.2, 5.3, 5.4