Chapter 3. Advanced Data Mining – Neural Networks

3.1 Supervised Prediction
    A. Introduction
    B. Univariate Function Estimation
    C. Multivariate Function Estimation

3.2 Network Architecture
    A. Multilayer Perceptrons
    B. Link Functions
    C. Normalized Radial Basis Functions

3.3 Training Model
    A. Estimation
    B. Optimization
    C. Issues in Training Model

3.4 Model Complexity
    A. Model Generalization
    B. Architecture Selection
    C. Pruning Inputs
    D. Regularization

3.5 Predictive Data Mining
    A. Overview
    B. Generalized Additive Neural Networks


3.1 Supervised Prediction

A. Introduction

A data set consists of n cases (observations, examples, instances). Each case contains a vector of input variables x and a target variable y (response, outcome).

Supervised prediction: given the target variable y, build a model to predict y using the inputs x.

Predictive data mining: predictive modeling methods applied to large operational (mainly corporate) databases.

Types of targets:
1. Interval measurement scale (amount of some quantity)
2. Binary response (indicator of an event)
3. Nominal measurement scale (classifications)
4. Ordinal measurement scale (grade)

What do we model?
1. The amount of some quantity (interval measurement)
2. The probability of the event (binary response)
3. The probabilities of each class or grade (nominal or ordinal measurement)


B. Univariate Function Estimation

Example 3.1. The EXPCAR data set contains simulated data relating a single target Y1 to a single input X. The variable EY represents the true functional relationship which, matching the PROC NLIN model statement below, has the form

E(y \mid x) = \frac{\theta_1}{\frac{\theta_3}{\theta_3 - \theta_4} e^{-\theta_4 x} + \left( \frac{\theta_1}{\theta_2} - \frac{\theta_3}{\theta_3 - \theta_4} \right) e^{\theta_3 x}}

This functional form is generally unknown in applications.

Nonlinear Regression:

Reference: Seber, G.A.F. and Wild, C.J. (1989). Nonlinear Regression. New York: John Wiley & Sons.


Main idea: assume that the functional form of the relationship between y and x is known, up to a set of unknown parameters. These unknown parameters can be estimated by fitting the model to the data. The SAS NLIN procedure performs nonlinear regression.

Model 1: the model is correctly specified, i.e., we fit the true functional form given above.

proc nlin data=Neuralnt.expcar;
   parms theta1=100 theta2=2 theta3=.5 theta4=-.01;
   model y1=theta1/((theta3/(theta3-theta4))*exp(-theta4*x)
            +(theta1/theta2-(theta3/(theta3-theta4)))*exp(theta3*x));
   output out=pred p=y1hat;
run;

The estimation procedure involves an iterative algorithm, so initial guesses for the parameter estimates are necessary.

The p= keyword in the OUTPUT statement requests the predicted values.

After fitting the model, we plot the predicted values:

goptions reset=all;
proc gplot data=pred;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=join;
   legend1 across=3 position=(top inside right) label=none mode=share;
   plot ey*x=1 y1*x=2 y1hat*x=3 / frame overlay legend=legend1;
run;
quit;


Clearly, the plot shows that the model fits the data very well. This is to be expected since the model is correctly specified.

By default, PROC NLIN finds the least-squares estimates using the Gauss-Newton method.

The least-squares estimate \hat{\theta} satisfies

\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \left( y_i - f(x_i; \theta) \right)^2

Note that the algorithm is sensitive to the initial guess; it may diverge with a bad initial value.
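For reference, a standard statement of the Gauss-Newton step (textbook form, not from the original notes): at iteration k the update is

\theta^{(k+1)} = \theta^{(k)} + \left( J_k^\top J_k \right)^{-1} J_k^\top r_k,

where J_k is the Jacobian of the model function f(x;\theta) with respect to \theta evaluated at \theta^{(k)}, and r_k is the residual vector with elements y_i - f(x_i;\theta^{(k)}). A poor starting value can leave J_k^\top J_k nearly singular, which is one way the iteration diverges.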


proc nlin data=Neuralnt.expcar;
   parms theta1=10 theta2=.2 theta3=.05 theta4=-.001;
   model y1=theta1/((theta3/(theta3-theta4))*exp(-theta4*x)
            +(theta1/theta2-(theta3/(theta3-theta4)))*exp(theta3*x));
   output out=pred p=y1hat;
run;

goptions reset=all;
proc gplot data=pred;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=join;
   legend1 across=3 position=(top inside right) label=none mode=share;
   plot ey*x=1 y1*x=2 y1hat*x=3 / frame overlay legend=legend1;
run;
quit;


Clearly, this is not the right estimate.

The NLIN Procedure
Iterative Phase
Dependent Variable Y1
Method: Gauss-Newton

                                                        Sum of
Iter      theta1      theta2      theta3      theta4    Squares
  88    -0.00848      1.3514     0.00315      0.5310     160349
  89    -0.00822      1.3591     0.00303      0.5295     159871
  90    -0.00791      1.3684     0.00289      0.5277     159633
  91    -0.00771      1.3741     0.00280      0.5266     159106
  92    -0.00746      1.3810     0.00269      0.5252     158662
  93    -0.00716      1.3895     0.00255      0.5236     158444
  94    -0.00697      1.3946     0.00247      0.5227     157959
  95    -0.00673      1.4009     0.00237      0.5215     157556
  96    -0.00643      1.4086     0.00225      0.5201     157371
  97    -0.00625      1.4133     0.00218      0.5192     156930
  98    -0.00602      1.4190     0.00208      0.5182     156572
  99    -0.00574      1.4260     0.00197      0.5169     156433
 100    -0.00556      1.4303     0.00190      0.5162     156037

WARNING: Maximum number of iterations exceeded.

WARNING: PROC NLIN failed to converge.

Sometimes alternative algorithms do better. The Marquardt (Levenberg-Marquardt) method is a modification of the Gauss-Newton method designed for ill-conditioned problems. In this case it is able to find the least-squares estimates in 12 iterations.
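In one common form (a standard reference equation, not from the original notes), each Marquardt step solves

\left( J_k^\top J_k + \lambda_k I \right) \delta_k = J_k^\top r_k, \qquad \theta^{(k+1)} = \theta^{(k)} + \delta_k,

where the damping parameter \lambda_k is adjusted at every iteration: a large \lambda_k gives a short, gradient-descent-like step that is safe on ill-conditioned problems, while \lambda_k \to 0 recovers the Gauss-Newton step.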

proc nlin data=Neuralnt.expcar method=marquardt;
   parms theta1=10 theta2=.2 theta3=.05 theta4=-.001;
   model y1=theta1/((theta3/(theta3-theta4))*exp(-theta4*x)
            +(theta1/theta2-(theta3/(theta3-theta4)))*exp(theta3*x));
   output out=pred p=y1hat;
run;

Parametric Model Fitting:

Parametric regression models can be used even when the mathematical mechanism of the data generation is unknown.

Empirical model building

If we look at the scatter plot carefully, we may find a trend: y is very small when x is close to zero; it increases quickly and then decreases to zero again as x increases.

This encourages us to try a parametric model with two parameters, \theta_1 and \theta_2 (the form below matches the PROC NLIN code that follows):

E(y \mid x) = \theta_1 x \, e^{-\theta_2 x}

proc nlin data=Neuralnt.expcar;
   parms theta1=1 theta2=.05;
   model y1=theta1*x*exp(-theta2*x);
   output out=pred p=y1hat;
run;

goptions reset=all;
proc gplot data=pred;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=join;
   legend1 across=3 position=(top inside right) label=none mode=share;
   plot ey*x=1 y1*x=2 y1hat*x=3 / frame overlay legend=legend1;
run;
quit;

This empirical model fits the data quite well.

Polynomial parametric models:

By the Weierstrass approximation theorem, any continuous function on a closed interval can be approximated arbitrarily well by a polynomial of sufficiently high degree.

Consider a polynomial model

E(y \mid x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d,

where d is the degree of the polynomial.


Polynomials are linear with respect to the regression parameters, so the least-squares estimates have a closed-form solution and do not require iterative algorithms. Polynomials can be fitted in a number of SAS procedures. Alternatively, polynomials (up to degree 3) can be fitted in GPLOT using the i=r<l|q|c> option on the SYMBOL statement.
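As a concrete sketch (not part of the original notes), the closed-form solution is the usual \hat{\beta} = (X^\top X)^{-1} X^\top y, and a cubic polynomial can be fitted to the EXPCAR data with PROC REG by building the power terms in a DATA step:

data poly;
   set Neuralnt.expcar;
   x2 = x**2;   /* quadratic term */
   x3 = x**3;   /* cubic term */
run;

proc reg data=poly;
   model y1 = x x2 x3;            /* linear in the beta coefficients */
   output out=predpoly p=y1hat;   /* predicted values */
quit;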

Linear regression:

Model: E(y \mid x) = \beta_0 + \beta_1 x (the straight line fitted by i=rl)

goptions reset=all;
proc gplot data=pred;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=rl;
   legend1 across=3 position=(top inside right) label=none mode=share;
   plot ey*x=1 y1*x=2 y1*x=3 / frame overlay legend=legend1;
run;
quit;


Cubic Regression:

Model: E(y \mid x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 (the cubic fitted by i=rc)

goptions reset=all;
proc gplot data=pred;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=rc;
   legend1 across=3 position=(top inside right) label=none mode=share;
   plot ey*x=1 y1*x=2 y1*x=3 / frame overlay legend=legend1;
run;
quit;


Apparently, low-order polynomials underfit this data set. Instead of continuing to increase the degree, a better strategy is to use modern smoothing methods.

Nonparametric Model Fitting:

A nonparametric smoother can fit data without having to specify a parametric functional form. Three popular methods are:

Loess: each predicted value results from a separate (weighted) polynomial regression on a subset of the data centered on that data point (a window).

Kernel regression: each predicted value is a weighted average of the data in a window around that data point.

Smoothing splines: made up of piecewise cubic polynomials, joined continuously and smoothly at a set of knots.

A smoothing spline can be fitted using GPLOT with the option i=sm<nn> on the SYMBOL statement.
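A loess fit can be obtained with PROC LOESS (a minimal sketch, not in the original notes; smooth=0.3, the fraction of the data in each local window, is an arbitrary illustrative value):

proc loess data=Neuralnt.expcar;
   model y1 = x / smooth=0.3;              /* local weighted polynomial regression */
   ods output OutputStatistics=loesspred;  /* table containing the Pred column */
run;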

Reference:


1. Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. New York: Chapman and Hall.

2. Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Applications. New York: Chapman and Hall.

Smoothing Splines

Goal: find the fitted function f minimizing

\sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \lambda \int \left( f''(x) \right)^2 \, dx,

where f ranges over piecewise cubic polynomials whose values, first derivatives, and second derivatives are continuous at each knot.

A smoothing spline is over-parameterized: there are more parameters than data points. But it does not necessarily interpolate (overfit) the data.

The parameters are estimated using penalized least squares. The penalty term favors curves where the average squared second derivative is low. This discourages a bumpy fit.

\lambda is the smoothing parameter. When it increases, the fit becomes smoother and less flexible. For smoothing splines in GPLOT, the amount of smoothing is specified by the constant after SM in the SYMBOL statement; this constant ranges from 0 to 99. Incorporating (roughness/complexity) penalties into the estimation criterion is called regularization.

A smoothing spline is actually a parametric model, but the interpretation of the parameters is uninformative.

Let’s fit this data using some smoothing splines.

goptions reset=all;
proc gplot data=Neuralnt.expcar;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=sm0;
   title1 h=1.5 c=r j=right 'Smoothing Spline, SM=0';
   plot ey*x=1 y1*x=2 y1*x=3 / frame overlay;
run;
quit;
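To see the effect of the smoothing parameter, the same plot can be rerun with a heavier smooth (SM=50 is an arbitrary illustrative value, not from the original notes):

goptions reset=all;
proc gplot data=Neuralnt.expcar;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=sm50;   /* larger SM: smoother, less flexible fit */
   title1 h=1.5 c=r j=right 'Smoothing Spline, SM=50';
   plot ey*x=1 y1*x=2 y1*x=3 / frame overlay;
run;
quit;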


C. Multivariate Function Estimation

Most of the modeling methods devised for supervised prediction problems perform multivariate function estimation. The expected value of the target can be thought of as a surface in the space spanned by the input variables.

It is uncommon to use parametric nonlinear regression for more than one, or possibly two, inputs.

Smoothing splines and loess suffer from the relative sparseness of data in higher dimensions (the curse of dimensionality).

Multivariate function estimation is a challenging analytical task!

Some notable works on multivariate function estimation:

Friedman, J.H. and Stuetzle, W. (1981). "Projection Pursuit Regression." Journal of the American Statistical Association, 76, 817-823.


Friedman, J.H. (1991). "Multivariate Adaptive Regression Splines." Annals of Statistics, 19, 1-141.

Example 3.2. The ECC2D data set has 500 cases, two inputs X1 and X2, and an interval-scaled target Y1. The ECC2DTE data set has an additional 3321 cases. The variable Y1 in the ECC2DTE set is the true, usually unknown, value of the target. ECC2DTE serves as a test set; the fitted values and residuals can be plotted and evaluated.

The true surface is shown below. [Figure: 3-D plot of the true surface of Y1 over the (X1, X2) plane]


Method 1:

proc reg data=neuralnt.ecc2d outest=betas;
   model y1=x1 x2;   /* linear surface in x1 and x2 */
quit;

proc score data=neuralnt.ecc2dte type=parms score=betas out=ste;
   var x1 x2;
run;

data ste;
   set ste;
   r_y1=model1-y1;   /* residual: fitted value minus true target */
run;

proc g3d data=ste;
   plot x1*x2=model1 / rotate=315;
   plot x1*x2=r_y1 / rotate=315 zmin=-50 zmax=50 zticknum=5;
run;
quit;
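Beyond the residual plot, the test-set residuals can be summarized numerically (a hedged sketch, not in the original notes; evalfit and sq_r are hypothetical names):

data evalfit;
   set ste;
   sq_r = r_y1**2;   /* squared residual on the test set */
run;

proc means data=evalfit mean;
   var r_y1 sq_r;    /* mean residual (bias) and mean squared error */
run;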


Method 2:

data a;
   set neuralnt.ecc2d;
   x11=x1**2;  x22=x2**2;  x12=x1*x2;      /* quadratic terms */
   x111=x1**3; x222=x2**3;                 /* cubic terms */
   x112=x1**2*x2; x221=x2**2*x1;           /* cross terms */
run;

data te;
   set neuralnt.ecc2dte;
   x11=x1**2;  x22=x2**2;  x12=x1*x2;
   x111=x1**3; x222=x2**3;
   x112=x1**2*x2; x221=x2**2*x1;
run;

proc reg data=a outest=betas;
   model y1=x1 x2 x11 x22 x12 x111 x222 x112 x221;
quit;

proc score data=te type=parms score=betas out=ste;
   var x1 x2 x11 x22 x12 x111 x222 x112 x221;
run;

data ste;
   set ste;
   r_y1=model1-y1;
run;

proc g3d data=ste;
   plot x1*x2=model1 / rotate=315;
   plot x1*x2=r_y1 / rotate=315 zmin=-50 zmax=50 zticknum=5;
run;
quit;


These multivariate polynomials apparently still lack fit!

Method 3: Neural Network Modeling

1. Input Data Source node (top):


   - Select the ECC2D data.
   - In the Variables tab, change the model role of Y1 to target.
   - Change the model role of all other variables except X1 and X2 to rejected.

2. Input Data Source node (bottom):
   - Select the ECC2DTE data.
   - In the Data tab, set the role to score.

3. Score node: select the button to apply training data score code to the score data set.

4. SAS Code node: type in the following program:

proc g3d data=&_score;
   plot x1*x2=p_y1 / rotate=315;
run;

5. Run the flow from the Score node.
