Data mining and statistical learning - lecture 12
Neural networks (NN) and
Multivariate Adaptive Regression Splines (MARS)
Different types of neural networks
Considerations in neural network modelling
Multivariate Adaptive Regression Splines
Feed-forward neural network

• Input layer
• Hidden layer(s)
• Output layer

[Figure: network diagram with input units x1 … xp, hidden units z1 … zM and output units f1 … fK]
Terminology
• Feed-forward network
  – Nodes in one layer are connected to the nodes in the next layer
• Recurrent network
  – Nodes in one layer may be connected to the ones in the previous layer or within the same layer
Multilayer perceptrons
• Any number of inputs
• Any number of outputs
• One or more hidden layers with any number of units
• Linear combinations of the outputs from one layer form inputs to the following layer
• Sigmoid activation functions in the hidden layers

[Figure: network diagram with input units x1 … xp, hidden units z1 … zM and output units f1 … fK]
Parameters in a multilayer perceptron

• C1, C2: combination functions
• σ, g: activation functions
• α_0m, β_0k: biases of the hidden and output units
• α_im, β_jk: weights of the connections

Hidden layer:  Z_m = σ(α_0m + α_m^T X),  m = 1, …, M

Output layer:  f_k(X) = g_k(β_0k + β_k^T Z),  k = 1, …, K

[Figure: network diagram with input units x1 … xp, hidden units z1 … zM and output units f1 … fK]
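The two layers above can be written out directly as code. The following is a minimal NumPy sketch of one forward pass through a multilayer perceptron with a sigmoid hidden layer and a softmax output layer; all sizes and weight values are illustrative, not taken from the lecture.

```python
import numpy as np

def sigmoid(t):
    # Hidden-layer activation sigma
    return 1.0 / (1.0 + np.exp(-t))

def softmax(t):
    # Output-layer activation g_k; shift by max(t) for numerical stability
    e = np.exp(t - t.max())
    return e / e.sum()

def mlp_forward(x, alpha0, alpha, beta0, beta):
    """Forward pass: Z = sigma(alpha0 + alpha @ x), f = softmax(beta0 + beta @ Z)."""
    z = sigmoid(alpha0 + alpha @ x)   # M hidden units
    return softmax(beta0 + beta @ z)  # K outputs summing to one

rng = np.random.default_rng(0)
p, M, K = 3, 4, 2                     # illustrative sizes: inputs, hidden units, outputs
f = mlp_forward(rng.normal(size=p),
                rng.normal(size=M), rng.normal(size=(M, p)),
                rng.normal(size=K), rng.normal(size=(K, M)))
```

With the softmax output, the K values of `f` are positive and sum to one, so they can be read as class probabilities.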
Least squares fitting of neural networks

Consider a simple perceptron (no hidden layer)

  f_k(X) = σ(α_0k + α_1k X_1 + … + α_pk X_p) = σ(α_0k + α_k^T X),  k = 1, …, K

Find the weights and biases minimizing the error function

  R(θ) = Σ_{k=1}^K Σ_{i=1}^N (y_ik − f_k(x_i))^2

[Figure: network diagram with input units x1 … xp connected directly to output units f1 … fK]
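To make the least-squares fit concrete, here is a small sketch that minimizes R(θ) by gradient descent for a single output with identity activation (the data, learning rate, and iteration count are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Synthetic data for f(X) = alpha0 + alpha^T X with a little noise
rng = np.random.default_rng(1)
N, p = 200, 2
X = rng.normal(size=(N, p))
alpha_true = np.array([1.5, -2.0])
y = 0.5 + X @ alpha_true + 0.1 * rng.normal(size=N)

# Gradient descent on R(theta) = sum_i (y_i - f(x_i))^2
alpha = np.zeros(p)
alpha0 = 0.0
lr = 0.1
for _ in range(500):
    resid = (alpha0 + X @ alpha) - y
    alpha -= lr * (2.0 / N) * (X.T @ resid)   # d R / d alpha, scaled by 1/N
    alpha0 -= lr * (2.0 / N) * resid.sum()    # d R / d alpha0
```

After training, `alpha` and `alpha0` should be close to the generating values 1.5, -2.0 and 0.5. With a sigmoid output the same loop applies, except the gradient picks up a factor σ′ by the chain rule.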
Alternative measures of fit
• For regression we normally use the sum-of-squared errors as the measure of fit

    R(θ) = Σ_{k=1}^K Σ_{i=1}^N (y_ik − f_k(x_i))^2

• For classification we use either squared errors or the cross-entropy (deviance)

    R(θ) = − Σ_{k=1}^K Σ_{i=1}^N y_ik log f_k(x_i)

  and the corresponding classifier is argmax_k f_k(x)

• The measure of fit can also be adapted to specific distributions, such as Poisson distributions
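Both measures of fit are one-liners in NumPy. The sketch below evaluates them for two observations and K = 3 classes, with one-hot targets y_ik and made-up fitted probabilities:

```python
import numpy as np

# One-hot targets y_ik and fitted class probabilities f_k(x_i), K = 3 classes
y = np.array([[1, 0, 0],
              [0, 1, 0]])
f = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])

sum_sq = ((y - f) ** 2).sum()          # sum-of-squared errors
cross_ent = -(y * np.log(f)).sum()     # cross-entropy (deviance)
labels = f.argmax(axis=1)              # classifier argmax_k f_k(x)
```

Note that the cross-entropy only looks at the probability assigned to the true class of each observation, while the squared error also penalizes how the remaining probability is spread over the wrong classes.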
Combination and activation functions

• Combination function
  – Linear combination:  α_0m + Σ_j α_jm x_j
  – Radial combination:  Σ_j (x_j − α_jm)^2
• Activation function in the hidden layer
  – Identity
  – Sigmoid
• Activation function in the output layer
  – Softmax:  g_k(T) = exp(T_k) / Σ_l exp(T_l),  where T_k = β_0k + β_k^T Z
  – Identity
Ordinary radial basis function networks (ORBF)

• Input and output layers and one hidden layer
• Hidden layer:
  – Combination function = radial
  – Activation function = exponential or softmax
• Output layer:
  – Combination function = linear
  – Activation function = any, normally identity

[Figure: network diagram with input units x1 … xp, hidden units z1 … zM and output units f1 … fK]
Issues in neural network modelling
• Preliminary training – learning with different initial weights (since multiple local minima are possible)
• Scaling of the inputs is important (standardization)
• The number of nodes in the hidden layer(s)
• The choice of activation function in the output layer
  – Interval – identity
  – Nominal – softmax
Overcoming over-fitting
1. Early stopping
2. Adding a penalty function
   Objective function = Error function + Penalty term

   Penalty term (weight decay):  λ (Σ_im α_im^2 + Σ_mk β_mk^2)
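The penalized objective is simple to compute. A minimal sketch of the weight-decay penalty, with toy weight values chosen purely for illustration:

```python
import numpy as np

def penalized_objective(error, alpha, beta, lam):
    """Error function plus the weight-decay penalty lam * (sum alpha_im^2 + sum beta_mk^2)."""
    return error + lam * ((np.asarray(alpha) ** 2).sum() + (np.asarray(beta) ** 2).sum())

# Toy values: error = 1.0, weights [1, 2] and [3], lam = 0.1
obj = penalized_objective(1.0, [1.0, 2.0], [3.0], 0.1)   # 1.0 + 0.1 * (1 + 4 + 9) = 2.4
```

Larger λ shrinks the weights harder, trading training error for smoother fitted functions; λ is normally chosen by validation.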
MARS: Multivariate Adaptive Regression Splines
An adaptive procedure for regression that can be regarded as a generalization of stepwise linear regression
Reflected pair of functions
with a knot at the value x1
[Figure: plot of the reflected pair (x − x1)_+ and (x1 − x)_+ over [0, 1], with the knot at x1]
Reflected pairs of functions
with knots at the values x1 and x2
[Figure: plot of the reflected pairs (x − x1)_+, (x1 − x)_+, (x − x2)_+ and (x2 − x)_+ over [0, 1], with knots at x1 and x2]
MARS with a single input X
taking the values x1, …, xN
Form the collection of basis functions

  C = { (X − t)_+ , (t − X)_+ ;  t ∈ {x1, x2, …, xN} }

Construct models of the form

  f(X) = β_0 + Σ_{m=1}^M β_m h_m(X)

where each h_m(X) is a function in C or a product of two or more such functions
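The reflected pairs and the resulting piecewise-linear model are easy to evaluate directly. A short sketch with a single knot at 0.5 and made-up coefficients:

```python
import numpy as np

def reflected_pair(x, t):
    """Reflected pair of basis functions (x - t)_+ and (t - x)_+ with knot t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
h_plus, h_minus = reflected_pair(x, 0.5)

# A MARS-style model f(X) = beta0 + beta1*(X - 0.5)_+ + beta2*(0.5 - X)_+
f = 1.0 + 2.0 * h_plus + 3.0 * h_minus
```

Each basis function is zero on one side of its knot, so adding a reflected pair changes the fitted slope only locally, which is what makes the procedure adaptive.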
MARS model with a single input X
taking the values x1, x2
[Figure: plot of E(Y), a piecewise-linear function of X with knots at x1 and x2]
MARS: Multivariate Adaptive Regression Splines
At each stage we consider, as a new basis function pair, all products of functions already in the model with one of the reflected pairs in the set C

Although each basis function depends only on a single Xj, it is considered as a function over the entire input space
MARS: Multivariate Adaptive Regression Splines
- model selection
MARS models typically overfit the data, so a backward deletion procedure is applied
The size of the model is determined by Generalized Cross Validation
An upper limit can be set on the order of interaction
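As a sketch of the model-size criterion, the generalized cross-validation score can be written as follows; charging each selected knot roughly three effective parameters is a common heuristic (an assumption here, not stated on the slide):

```python
def gcv(rss, n_obs, eff_params):
    """Generalized cross-validation: GCV = RSS / (N * (1 - M/N)^2),
    where M is the effective number of parameters of the MARS model
    (each selected knot is often charged about 3 parameters)."""
    return rss / (n_obs * (1.0 - eff_params / n_obs) ** 2)

score = gcv(rss=10.0, n_obs=100, eff_params=5)   # 10 / (100 * 0.95^2)
```

The denominator inflates the residual sum of squares as the model grows, so the backward deletion step keeps the model size that minimizes this score rather than the raw training error.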
The MARS model can be viewed as a generalization
of the classification and regression tree (CART)
[Figure: scatter plot of x2 against x1 illustrating a CART-style rectangular partition of the input space]
Some characteristics of different learning methods
Characteristic                                      | Neural networks | Trees | MARS
----------------------------------------------------|-----------------|-------|-----
Natural handling of data of “mixed” type            | Poor            | Good  | Good
Handling of missing values                          | Poor            | Good  | Good
Robustness to outliers in input space               | Poor            | Good  | Poor
Insensitive to monotone transformations of inputs   | Poor            | Good  | Poor
Computational scalability (large N)                 | Poor            | Good  | Good
Ability to deal with irrelevant inputs              | Poor            | Good  | Good
Ability to extract linear combinations of features  | Good            | Poor  | Poor
Interpretability                                    | Poor            | Fair  | Good
Predictive power                                    | Good            | Poor  | Fair
Separating hyperplane
[Figure: scatter plot of two classes in the (x1, x2) plane separated by a straight line]

  {x : β_0 + β^T x = 0}
Optimal separating hyperplane
- support vector classifier
[Figure: two classes in the (x1, x2) plane, the separating line β_0 + β^T x = 0, and the margin on either side]

Find the hyperplane that creates the biggest margin between the training points for class 1 and class -1
Formulation of the optimization problem

  max_{β_0, β, ||β|| = 1}  C

  subject to  y_i (β_0 + β^T x_i) ≥ C,  i = 1, …, N

Here y_i (β_0 + β^T x_i) is the signed distance from x_i to the decision border, with y = 1 for one of the groups and y = -1 for the other one
Two equivalent formulations of the optimization problem

  max_{β_0, β, ||β|| = 1}  C
  subject to  y_i (β_0 + β^T x_i) ≥ C,  i = 1, …, N

  min_{β_0, β}  ||β||
  subject to  y_i (β_0 + β^T x_i) ≥ 1,  i = 1, …, N
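In the second formulation, points whose constraint holds with equality lie on the margin boundaries, and the margin half-width equals 1 / ||β||. A small numerical check with hypothetical values (β, β_0 and the two points below are made up for illustration):

```python
import numpy as np

beta = np.array([2.0, 0.0])   # hypothetical separating direction
beta0 = -1.0
X = np.array([[1.0, 0.3],     # class +1 point on the margin boundary
              [0.0, 0.7]])    # class -1 point on the margin boundary
y = np.array([1.0, -1.0])

constraints = y * (beta0 + X @ beta)   # both equal 1: these are support vectors
margin = 1.0 / np.linalg.norm(beta)    # half-width of the margin
```

Here both constraints are active (equal to 1) and the margin half-width comes out as 0.5, i.e. the separating line x1 = 0.5 sits midway between the two points.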
Characteristics of the support vector classifier
Points well inside their class boundary do not play a big role in shaping the decision boundary

Cf. linear discriminant analysis (LDA), for which the decision boundary is determined by the covariance matrix of the class distributions and their centroids
Support vector machines
using basis expansions (polynomials, splines)
[Figure: two classes plotted in the transformed feature space (h1(x), h2(x)), separated by the line f(x) = 0]

  f(x) = β_0 + β^T h(x) = 0
Characteristics of support vector machines
The dimension of the enlarged feature space can be very large
Overfitting is prevented by a built-in shrinkage of beta coefficients
Irrelevant inputs can create serious problems
The SVM as a penalization method
Misclassification: f(x) < 0 when y = 1, or f(x) > 0 when y = -1

Loss function:

  Σ_{i=1}^N [1 − y_i f(x_i)]_+

Loss function + penalty:

  Σ_{i=1}^N [1 − y_i f(x_i)]_+ + λ ||β||^2
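The hinge loss plus penalty is straightforward to evaluate. A minimal sketch with three made-up fitted values, one of them misclassified:

```python
import numpy as np

def svm_objective(f_vals, y, beta, lam):
    """Hinge loss sum_i [1 - y_i f(x_i)]_+ plus the penalty lam * ||beta||^2."""
    hinge = np.maximum(1.0 - y * f_vals, 0.0).sum()
    return hinge + lam * (beta ** 2).sum()

y = np.array([1.0, -1.0, -1.0])
f_vals = np.array([2.0, -0.5, 0.5])   # third point is misclassified (f > 0, y = -1)
obj = svm_objective(f_vals, y, np.array([1.0, 2.0]), lam=0.1)
# hinge terms: 0 + 0.5 + 1.5 = 2.0; penalty: 0.1 * 5 = 0.5
```

Note that the first point contributes nothing (it is beyond the margin, y·f ≥ 1), the second is inside the margin but correctly classified, and only the third is actually misclassified; the hinge loss penalizes the last two.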
The SVM as a penalization method
Minimizing the loss function + penalty

  Σ_{i=1}^N [1 − y_i f(x_i)]_+ + λ ||β||^2

is equivalent to fitting a support vector machine to the data

The penalty factor λ is a function of the constant providing an upper bound of

  Σ_{i=1}^N ξ_i

where the ξ_i are the slack variables of the support vector classifier
Some characteristics of different learning methods
Characteristic                                      | Neural networks | Support vector machines | Trees | MARS
----------------------------------------------------|-----------------|-------------------------|-------|-----
Natural handling of data of “mixed” type            | Poor            | Poor                    | Good  | Good
Handling of missing values                          | Poor            | Poor                    | Good  | Good
Robustness to outliers in input space               | Poor            | Poor                    | Good  | Poor
Insensitive to monotone transformations of inputs   | Poor            | Poor                    | Good  | Poor
Computational scalability (large N)                 | Poor            | Poor                    | Good  | Good
Ability to deal with irrelevant inputs              | Poor            | Poor                    | Good  | Good
Ability to extract linear combinations of features  | Good            | Good                    | Poor  | Poor
Interpretability                                    | Poor            | Poor                    | Fair  | Good
Predictive power                                    | Good            | Good                    | Poor  | Fair