
Page 1

Data mining and statistical learning - lecture 12

Neural networks (NN) and Multivariate Adaptive Regression Splines (MARS)

Different types of neural networks

Considerations in neural network modelling

Multivariate Adaptive Regression Splines

Page 2

Feed-forward neural network

• Input layer
• Hidden layer(s)
• Output layer

[Figure: network diagram with inputs x1 … xp, hidden units z1 … zM, and outputs f1 … fK]

Page 3

Terminology

• Feed-forward network – nodes in one layer are connected to nodes in the next layer

• Recurrent network – nodes in one layer may be connected to nodes in a previous layer or within the same layer

Page 4

Multilayer perceptrons

• Any number of inputs
• Any number of outputs
• One or more hidden layers with any number of units
• Linear combinations of the outputs from one layer form the inputs to the following layer
• Sigmoid activation functions in the hidden layers

[Figure: network diagram with inputs x1 … xp, hidden units z1 … zM, and outputs f1 … fK]

Page 5

Parameters in a multilayer perceptron

• C1, C2 : combination functions
• σ, g : activation functions
• α0m : bias of hidden unit m; β0k : bias of output unit k
• αim, βjk : connection weights

$$z_m = \sigma(C_1) = \sigma\big(\alpha_{0m} + \alpha_m^T X\big), \quad m = 1, \dots, M$$

$$f_k = g_k(C_2) = g_k\big(\beta_{0k} + \beta_k^T Z\big), \quad k = 1, \dots, K$$

[Figure: network diagram with inputs x1 … xp, hidden units z1 … zM, and outputs f1 … fK]
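To make the notation above concrete, here is a minimal NumPy sketch of the forward pass through a single-hidden-layer perceptron, assuming a sigmoid activation in the hidden layer and an identity output activation; the function and variable names (mlp_forward, alpha, beta, and so on) are illustrative and not taken from the lecture.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mlp_forward(X, alpha0, alpha, beta0, beta):
    """Forward pass of a single-hidden-layer perceptron.

    X      : (N, p) inputs
    alpha0 : (M,) hidden-unit biases,  alpha : (p, M) input-to-hidden weights
    beta0  : (K,) output biases,       beta  : (M, K) hidden-to-output weights
    """
    Z = sigmoid(alpha0 + X @ alpha)   # z_m = sigma(alpha_0m + alpha_m^T x)
    F = beta0 + Z @ beta              # identity activation g_k in the output layer
    return Z, F

# tiny example: p = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z, F = mlp_forward(X,
                   rng.normal(size=4), rng.normal(size=(3, 4)),
                   rng.normal(size=2), rng.normal(size=(4, 2)))
print(F.shape)  # (5, 2)
```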

Page 6

Least squares fitting of neural networks

Consider a simple perceptron (no hidden layer)

$$f_k(X) = g\big(\alpha_{0k} + \alpha_k^T X\big) = g\big(\alpha_{0k} + \alpha_{k1}X_1 + \dots + \alpha_{kp}X_p\big), \quad k = 1, \dots, K$$

[Figure: network diagram with inputs x1 … xp connected directly to outputs f1, f2, …, fK]

Find the weights and biases minimizing the error function

$$R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \big(y_{ik} - f_k(x_i)\big)^2$$
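When the output activation g is the identity, minimizing R(θ) for a simple perceptron reduces to ordinary linear least squares, so a minimal sketch (the helper names are my own) can lean on NumPy's least-squares solver:

```python
import numpy as np

def fit_simple_perceptron(X, Y):
    """Least-squares fit of f_k(X) = alpha_0k + alpha_k^T X (identity activation).
    X: (N, p) inputs, Y: (N, K) targets; returns biases (K,) and weights (p, K)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s for the biases
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # minimizes sum_k sum_i (y_ik - f_k(x_i))^2
    return coef[0], coef[1:]

def sum_of_squared_errors(Y, F):
    """R(theta) = sum_k sum_i (y_ik - f_k(x_i))^2."""
    return np.sum((Y - F) ** 2)
```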

Page 7

Alternative measures of fit

• For regression we normally use the sum of squared errors as the measure of fit

• For classification we use either squared errors or cross-entropy (deviance), and the corresponding classifier is argmax_k f_k(x)

• The measure of fit can also be adapted to specific distributions, such as Poisson distributions

$$R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \big(y_{ik} - f_k(x_i)\big)^2$$

$$R(\theta) = -\sum_{k=1}^{K} \sum_{i=1}^{N} y_{ik} \log f_k(x_i)$$
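A hedged sketch of the two criteria, with Y a one-hot indicator matrix and F the matrix of predicted class probabilities (both names are assumptions of this example):

```python
import numpy as np

def squared_error(Y, F):
    """R(theta) = sum_k sum_i (y_ik - f_k(x_i))^2."""
    return np.sum((Y - F) ** 2)

def cross_entropy(Y, F, eps=1e-12):
    """R(theta) = -sum_k sum_i y_ik * log f_k(x_i); eps guards against log(0)."""
    return -np.sum(Y * np.log(F + eps))

def classify(F):
    """The corresponding classifier argmax_k f_k(x), applied row-wise."""
    return np.argmax(F, axis=1)
```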

Page 8

Combination and activation functions

• Combination function
  – Linear combination: $\alpha_{0m} + \sum_j \alpha_{jm} x_j$
  – Radial combination: $\alpha_{0m}^{2} \sum_j (x_j - \alpha_{jm})^{2}$
• Activation function in the hidden layer
  – Identity
  – Sigmoid
• Activation function in the output layer
  – Softmax: $g_k(T) = \dfrac{\exp(T_k)}{\sum_{l=1}^{K} \exp(T_l)}$, where $T_k = \beta_{0k} + \beta_k^T z$
  – Identity
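As an illustration, a numerically safe softmax in NumPy might look like the following sketch (subtracting the row maximum leaves g_k(T) unchanged but avoids overflow):

```python
import numpy as np

def softmax(T):
    """g_k(T) = exp(T_k) / sum_l exp(T_l), applied row-wise to an (N, K) array T."""
    T = T - T.max(axis=1, keepdims=True)   # stabilization; does not change the ratios
    E = np.exp(T)
    return E / E.sum(axis=1, keepdims=True)
```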

Page 9

Ordinary radial basis function networks (ORBF)

• Input and output layers and one hidden layer
• Hidden layer: combination function = radial; activation function = exponential or softmax
• Output layer: combination function = linear; activation function = any, normally identity

[Figure: network diagram with inputs x1 … xp, hidden units z1 … zM, and outputs f1 … fK]
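A minimal sketch of such a hidden layer, assuming the radial combination is scaled by a squared width/bias parameter per unit (the exact scaling convention is not shown on the slide, so treat it as an assumption):

```python
import numpy as np

def orbf_hidden_layer(X, centers, widths):
    """Hidden layer of an ordinary radial basis function network.
    X: (N, p) inputs, centers: (M, p), widths: (M,).
    Returns (N, M) activations exp(-width_m^2 * ||x - center_m||^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances (N, M)
    return np.exp(-(widths ** 2) * d2)                              # exponential activation
```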

Page 10

Issues in neural network modelling

• Preliminary training – learning with different initial weights (since multiple local minima are possible)

• Scaling of the inputs is important (standardization)

• The number of nodes in the hidden layer(s)

• The choice of activation function in the output layer
  – Interval target: identity
  – Nominal target: softmax
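For the scaling point in particular, a minimal standardization sketch (column-wise mean 0 and standard deviation 1) could be:

```python
import numpy as np

def standardize(X):
    """Scale each input column to mean 0 and standard deviation 1 before training."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # guard against constant columns
    return (X - mean) / std, mean, std
```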

Page 11

Overcoming over-fitting

1. Early stopping

2. Adding a penalty function

Objective function = error function + penalty term

Penalty term (weight decay): $\sum_{mk} \beta_{mk}^{2} + \sum_{im} \alpha_{im}^{2}$
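A sketch of the penalized objective, with lam standing for a tuning constant that controls the amount of shrinkage (the symbol is my own, not from the slide):

```python
import numpy as np

def penalized_objective(error, alpha, beta, lam):
    """Objective = error function + lam * (sum of squared weights), i.e. weight decay."""
    penalty = np.sum(alpha ** 2) + np.sum(beta ** 2)
    return error + lam * penalty
```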

Page 12

MARS: Multivariate Adaptive Regression Splines

An adaptive procedure for regression that can be regarded as a generalization of stepwise linear regression

Page 13

Reflected pair of functions

with a knot at the value x1

[Figure: the reflected pair (x − x1)+ and (x1 − x)+ plotted against x, with the knot at x1]
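The reflected pair is just a positive-part (hinge) function and its mirror image; a one-line sketch:

```python
import numpy as np

def reflected_pair(x, knot):
    """Returns the pair (x - t)_+ and (t - x)_+ for knot t."""
    return np.maximum(x - knot, 0.0), np.maximum(knot - x, 0.0)

x = np.linspace(0.0, 1.0, 11)
left, right = reflected_pair(x, knot=0.4)   # knot value chosen only for illustration
```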

Page 14

Reflected pairs of functions

with knots at the values x1 and x2

[Figure: the reflected pairs (x − x1)+, (x1 − x)+, (x − x2)+ and (x2 − x)+ plotted against x, with knots at x1 and x2]

Page 15

MARS with a single input X

taking the values x1, …, xN

Form the collection

of basis functions

Construct models of the form

where each hm(X) is a function in C or a product of two or more such functions

$$C = \big\{\, (X - t)_+,\; (t - X)_+ \;:\; t \in \{x_1, x_2, \dots, x_N\} \,\big\}$$

$$f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$$
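A hedged sketch of the single-input case: build the collection C from the observed values and evaluate f(X) as an intercept plus a weighted sum of selected basis functions (products of basis functions, used in the multivariate case, are omitted here):

```python
import numpy as np

def basis_collection(x_values):
    """The collection C of reflected pairs with knots at the observed values x_1, ..., x_N."""
    return ([lambda X, t=t: np.maximum(X - t, 0.0) for t in x_values] +
            [lambda X, t=t: np.maximum(t - X, 0.0) for t in x_values])

def mars_predict(X, beta0, betas, basis_functions):
    """f(X) = beta_0 + sum_m beta_m * h_m(X) for the selected basis functions h_m."""
    return beta0 + sum(b * h(X) for b, h in zip(betas, basis_functions))
```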

Page 16

MARS model with a single input X

taking the values x1, x2

[Figure: E(Y) as a piecewise-linear function of X, with knots at x1 and x2]

Page 17

MARS model with a single input X

taking the values x1, x2

[Figure: E(Y) as a piecewise-linear function of X, with knots at x1 and x2]

Page 18

MARS: Multivariate Adaptive Regression Splines

At each stage we consider, as a new basis-function pair, all products of functions already in the model with one of the reflected pairs in the set C

Although each basis function depends only on a single Xj, it is considered as a function over the entire input space

Page 19

MARS: Multivariate Adaptive Regression Splines

- model selection

MARS models typically overfit the data, so a backward deletion procedure is applied

The size of the model is determined by Generalized Cross Validation

An upper limit can be set on the order of interaction
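For reference, the generalized cross-validation criterion has the familiar form below, where M(λ) is the effective number of parameters of a model with λ terms (it charges extra degrees of freedom for the selected knots); this is the standard formulation rather than a quotation from the slides:

$$\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \big( y_i - \hat{f}_{\lambda}(x_i) \big)^{2}}{\big( 1 - M(\lambda)/N \big)^{2}}$$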

Page 20

The MARS model can be viewed as a generalization

of the classification and regression tree (CART)

[Figure: rectangular partition of the (x1, x2) input space of the kind produced by recursive binary splitting in CART]

Page 21

Some characteristics of different learning methods

Characteristic | Neural networks | Trees | MARS
Natural handling of data of “mixed” type | Poor | Good | Good
Handling of missing values | Poor | Good | Good
Robustness to outliers in input space | Poor | Good | Poor
Insensitive to monotone transformations of inputs | Poor | Good | Poor
Computational scalability (large N) | Poor | Good | Good
Ability to deal with irrelevant inputs | Poor | Good | Good
Ability to extract linear combinations of features | Good | Poor | Poor
Interpretability | Poor | Fair | Good
Predictive power | Good | Poor | Fair

Page 22

Separating hyperplane

[Figure: two classes of points in the (x1, x2) plane separated by a straight line]

$$\{\, x : \beta_0 + \beta^T x = 0 \,\}$$

Page 23

Optimal separating hyperplane

- support vector classifier

[Figure: separating hyperplane $\beta_0 + \beta^T x = 0$ with the margin between the two classes indicated]

Find the hyperplane that creates the biggest margin between the training points of class 1 and those of class −1

Page 24

Formulation

of the optimization problem

$$\max_{\beta,\, \beta_0,\, \|\beta\| = 1} C \quad \text{subject to} \quad y_i(x_i^T \beta + \beta_0) \ge C, \quad i = 1, \dots, N$$

When $\|\beta\| = 1$, $x_i^T \beta + \beta_0$ is the signed distance from $x_i$ to the decision boundary; y = 1 for one of the groups and y = −1 for the other one

Page 25

Two equivalent formulations

of the optimization problem

$$\max_{\beta,\, \beta_0,\, \|\beta\| = 1} C \quad \text{subject to} \quad y_i(x_i^T \beta + \beta_0) \ge C, \quad i = 1, \dots, N$$

$$\min_{\beta,\, \beta_0} \|\beta\| \quad \text{subject to} \quad y_i(x_i^T \beta + \beta_0) \ge 1, \quad i = 1, \dots, N$$
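In practice the optimal separating hyperplane can be computed with standard software; the following sketch uses scikit-learn, which is not mentioned in the lecture, and approximates the hard-margin solution by using a very large cost parameter C:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds labelled -1 and +1 (synthetic example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0.25, 0.25], scale=0.05, size=(20, 2)),
               rng.normal(loc=[0.75, 0.75], scale=0.05, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C ~ hard-margin classifier
beta, beta0 = clf.coef_[0], clf.intercept_[0]    # hyperplane beta^T x + beta0 = 0
print(beta, beta0, clf.support_)                 # support vectors lie on the margin
```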

Page 26

Characteristics of the support vector classifier

Points well inside their class boundary do not play a big role in shaping the decision boundary

Cf. linear discriminant analysis (LDA) for which the decision boundary is determined by the covariance matrix of the class distributions and their centroids

Page 27

Support vector machines

using basis expansions (polynomials, splines)

[Figure: two classes separated by a linear boundary in the transformed feature space with coordinates h1(x) and h2(x)]

$$f(x) = \beta_0 + \beta^T h(x) = 0$$
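As an illustration (again with scikit-learn, which is only one possible implementation), a polynomial kernel plays the role of an implicit basis expansion h(x):

```python
import numpy as np
from sklearn.svm import SVC

# Points inside a circle vs. outside it: not linearly separable in (x1, x2),
# but separable after a quadratic expansion of the inputs.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.where((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2 < 0.1, 1, -1)

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(X, y)
print(clf.score(X, y))   # training accuracy of the expanded-feature classifier
```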

Page 28

Characteristics of support vector machines

The dimension of the enlarged feature space can be very large

Overfitting is prevented by a built-in shrinkage of beta coefficients

Irrelevant inputs can create serious problems

Page 29

The SVM as a penalization method

Misclassification: f(x) < 0 when y = 1, or f(x) > 0 when y = −1

Loss function:

$$\sum_{i=1}^{N} \big[1 - y_i f(x_i)\big]_+$$

Loss function + penalty:

$$\sum_{i=1}^{N} \big[1 - y_i f(x_i)\big]_+ + \lambda \|\beta\|^{2}$$
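A direct NumPy sketch of the penalized hinge-loss criterion for a linear f(x) = β0 + β^T x (the function name and the lam symbol are my own):

```python
import numpy as np

def svm_objective(beta0, beta, X, y, lam):
    """Hinge loss + penalty: sum_i [1 - y_i f(x_i)]_+ + lam * ||beta||^2,
    with f(x) = beta0 + beta^T x and labels y in {-1, +1}."""
    f = beta0 + X @ beta
    hinge = np.maximum(0.0, 1.0 - y * f)
    return hinge.sum() + lam * np.dot(beta, beta)
```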

Page 30

The SVM as a penalization method

Minimizing the loss function + penalty

$$\sum_{i=1}^{N} \big[1 - y_i f(x_i)\big]_+ + \lambda \|\beta\|^{2}$$

is equivalent to fitting a support vector machine to the data

The penalty factor $\lambda$ is a function of the constant providing an upper bound on

$$\sum_{i=1}^{N} \xi_i$$

Page 31

Some characteristics of different learning methods

Characteristic | Neural networks | Support vector machines | Trees | MARS
Natural handling of data of “mixed” type | Poor | Poor | Good | Good
Handling of missing values | Poor | Poor | Good | Good
Robustness to outliers in input space | Poor | Poor | Good | Poor
Insensitive to monotone transformations of inputs | Poor | Poor | Good | Poor
Computational scalability (large N) | Poor | Poor | Good | Good
Ability to deal with irrelevant inputs | Poor | Poor | Good | Good
Ability to extract linear combinations of features | Good | Good | Poor | Poor
Interpretability | Poor | Poor | Fair | Good
Predictive power | Good | Good | Poor | Fair