Selection and Validation

Model Selection and Validation. Model-Building Process 1. Data collection and preparation 2. Reduction of explanatory or predictor variables (for exploratory

Model Selection and Validation

Model-Building Process

1. Data collection and preparation

2. Reduction of explanatory or predictor variables (for exploratory observational studies)

3. Model refinement and selection

4. Model validation

Data collection and preparation

• Controlled Experiments with Covariates: Statistical design of experiments uses supplemental information, such as characteristics of the experimental units, in designing the experiment so as to reduce the variance of the experimental error terms in the regression model. Sometimes, however, it is not possible to incorporate this supplemental information into the design of the experiment. Instead, it may be possible for the experimenter to incorporate this information into the regression model and thereby reduce the error variance by including uncontrolled variables or covariates in the model.

• Confirmatory Observational Studies: For these studies, data are collected for explanatory variables that previous studies have shown to affect the response variable, as well as for the new variable or variables involved in the hypothesis. In this context, the explanatory variable(s) involved in the hypothesis are sometimes called the primary variables, and the explanatory variables that are included to reflect existing knowledge are called the control variables (known risk factors in epidemiology). The control variables here are not controlled as in an experimental study, but they are used to account for known influences on the response variable.

• Exploratory Observational Studies: After a lengthy list of potentially useful explanatory variables has been compiled, some of these variables can be quickly screened out. An explanatory variable (1) may not be fundamental to the problem, (2) may be subject to large measurement errors, and/or (3) may effectively duplicate another explanatory variable in the list. Explanatory variables that cannot be measured may either be deleted or replaced by proxy variables that are highly correlated with them.

• Data Preparation: Once the data have been collected, edit checks should be performed and plots prepared to identify gross data errors as well as extreme outliers.

• Preliminary Model Investigation: A variety of diagnostics should be employed to identify (1) the functional forms in which the explanatory variables should enter the regression model and (2) important interactions that should be included in the model. Scatter plots and residual plots are useful for determining relationships and their strengths.

Reduction of Explanatory Variables

• Controlled Experiments: The reduction of explanatory variables in the model-building phase is usually not an important issue for controlled experiments. The experimenter has chosen the explanatory variables for investigation.

• Controlled Experiments with Covariates: In studies of controlled experiments with covariates, some reduction of the covariates may take place because investigators often cannot be sure in advance that the selected covariates will be helpful in reducing the error variance.

• Confirmatory Observational Studies: No reduction of explanatory variables should take place in confirmatory observational studies. The control variables were chosen on the basis of prior knowledge and should be retained for comparison with earlier studies even if some of the control variables turn out not to lead to any error variance reduction in the study at hand.

• Exploratory Observational Studies: In exploratory observational studies, the number of explanatory variables that remain after the initial screening typically is still large. Further, many of these variables frequently will be highly intercorrelated.

Model Refinement and Selection

• Checking the tentative regression model, or the several "good" regression models, in detail for curvature and interaction effects.

• Residual plots are helpful in deciding whether one model is to be preferred over another.

• Identifying influential outlying observations, multicollinearity, etc.

Model Selection

Criteria for Model Selection

• $R_p^2$ or $SSE_p$ Criterion

• $R_{a,p}^2$ or $MSE_p$ Criterion

• Mallows' $C_p$ Criterion: concerned with the total mean squared error of the n fitted values for each subset regression model.

• $AIC_p$ and $SBC_p$ Criteria: model selection criteria that penalize models having large numbers of predictors.

• $PRESS_p$ Criterion: a measure of how well the use of the fitted values for a subset model can predict the observed responses $Y_i$.
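The criteria above can be computed side by side for any candidate subset. The following is a minimal sketch under stated assumptions: the helper names (`solve`, `fit`, `design`, `criteria`) and the small data set are hypothetical, the formulas are the usual textbook forms ($C_p = SSE_p / MSE_{full} - (n - 2p)$, $AIC_p = n \ln SSE_p - n \ln n + 2p$, $SBC_p = n \ln SSE_p - n \ln n + p \ln n$), and the prediction sum of squares is obtained by explicit leave-one-out refits rather than from leverages.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(b, row):
    return sum(bj * xj for bj, xj in zip(b, row))

def sse(X, y):
    b = fit(X, y)
    return sum((v - predict(b, r)) ** 2 for r, v in zip(X, y))

def design(data, cols):
    """Design matrix: intercept column plus the chosen predictor columns."""
    return [[1.0] + [row[c] for c in cols] for row in data]

def criteria(data, y, cols, all_cols):
    """Selection criteria for the subset `cols`; `all_cols` defines the full model."""
    n, p, P = len(y), len(cols) + 1, len(all_cols) + 1
    X = design(data, cols)
    ybar = sum(y) / n
    ssto = sum((v - ybar) ** 2 for v in y)
    sse_p = sse(X, y)
    mse_full = sse(design(data, all_cols), y) / (n - P)
    # PRESS_p: refit without case i, then predict case i
    press = 0.0
    for i in range(n):
        b = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(b, X[i])) ** 2
    return {
        "R2": 1 - sse_p / ssto,
        "R2a": 1 - (n - 1) / (n - p) * sse_p / ssto,
        "Cp": sse_p / mse_full - (n - 2 * p),
        "AIC": n * math.log(sse_p) - n * math.log(n) + 2 * p,
        "SBC": n * math.log(sse_p) - n * math.log(n) + math.log(n) * p,
        "PRESS": press,
    }

# Hypothetical data: two predictor columns, eight cases
data = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5], [7, 8], [8, 7]]
y = [3.2, 2.9, 7.1, 7.3, 10.8, 11.4, 14.9, 15.3]
for cols in ([0], [1], [0, 1]):
    print(cols, criteria(data, y, cols, [0, 1]))
```

For the full model, $C_p$ equals the number of parameters by construction, so the criterion is informative only for proper subsets.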

Automatic Search Procedures for Model Selection

• "Best" Subsets Algorithms: the best subsets according to a specified criterion are identified without requiring the fitting of all of the possible subset regression models.

Automatic Search Procedures for Model Selection

Stepwise Regression Methods:

• Forward Stepwise Regression

• Backward Stepwise Regression
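The idea of a forward search can be sketched as a greedy loop that adds one predictor at a time. This is plain forward selection scored by $AIC_p$, which is a simplification chosen for illustration: forward stepwise regression as usually described enters and removes variables using t or F tests against chosen thresholds. The function names and the data are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def sse(X, y):
    """Error sum of squares of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(p)]
    b = solve(XtX, Xty)
    return sum((v - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, v in zip(X, y))

def aic(data, y, cols):
    """AIC_p = n ln SSE_p - n ln n + 2p for the model with intercept plus `cols`."""
    n, p = len(y), len(cols) + 1
    X = [[1.0] + [row[c] for c in cols] for row in data]
    return n * math.log(sse(X, y)) - n * math.log(n) + 2 * p

def forward_select(data, y, candidates):
    """Add the predictor that lowers AIC_p the most; stop when none lowers it."""
    chosen = []
    best = aic(data, y, chosen)
    while True:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        score, c = min((aic(data, y, chosen + [c]), c) for c in remaining)
        if score >= best:
            break
        chosen.append(c)
        best = score
    return chosen

# Hypothetical data: y depends mainly on column 0
x0 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x1 = [5, 3, 8, 1, 9, 2, 7, 4, 6, 10]
x2 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
data = [list(row) for row in zip(x0, x1, x2)]
y = [2 * a + e for a, e in zip(x0, noise)]
selected = forward_select(data, y, [0, 1, 2])
print(selected)
```

A backward variant would start from the full model and drop the predictor whose removal improves the criterion the most.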

Model Refinement

F Test for Lack of Fit

• Assumptions: The lack of fit test assumes that the observations Y for given X are (1) independent and (2) normally distributed, and that (3) the distributions of Y have the same variance.

• Test statistic:
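In its usual textbook form for simple linear regression with replicate observations, the statistic is $F^* = \frac{SSLF/(c-2)}{SSPE/(n-c)}$, where $c$ is the number of distinct X levels, SSPE is the pure error sum of squares (variation of Y around the mean at each X level), and SSLF = SSE − SSPE. The sketch below computes it on hypothetical data; the function name and the numbers are illustrative assumptions.

```python
from collections import defaultdict

def lack_of_fit_F(x, y):
    """Return (F*, numerator df, denominator df) for a straight-line fit.

    Requires replicate observations at some X levels so that SSPE > 0.
    """
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

    # Pure error: squared deviations of Y around its mean at each X level
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)
    sspe = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups.values())

    sslf = sse - sspe          # lack-of-fit sum of squares
    p = 2                      # parameters in the straight-line model
    return (sslf / (c - p)) / (sspe / (n - c)), c - p, n - c

# Hypothetical data: two replicates at each of four X levels, curved mean response
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1.1, 0.9, 4.2, 3.8, 9.1, 8.9, 16.3, 15.7]
F, df1, df2 = lack_of_fit_F(x, y)
print(F, df1, df2)
```

A large value of F* relative to the F distribution with (c − 2, n − c) degrees of freedom leads to the conclusion that the linear regression function does not fit.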

• If the linear regression model is not appropriate for a data set:

1. Abandon regression model and develop and use a more appropriate model.

2. Employ some transformation on the data so that regression model is appropriate for the transformed data.

Diagnostics

Departures from Model to Be Studied by Residuals

1. The regression function is not linear.

2. The error terms do not have constant variance.

3. The error terms are not independent.

4. The model fits all but one or a few outlier observations.

5. The error terms are not normally distributed.

6. One or several important predictor variables have been omitted from the model.

Diagnostics for Residuals

1. Plot of residuals against predictor variable.

2. Plot of absolute or squared residuals against predictor variable.

3. Plot of residuals against fitted values.

4. Plot of residuals against time or other sequence.

5. Plots of residuals against omitted predictor variables.

6. Box plot of residuals.

7. Normal probability plot of residuals.

Nonlinearity of Regression Function

• Residual plot against the predictor variable

• Residual plot against the fitted values

• Scatter plot

Nonconstancy of Error Variance

• Residual plot against the predictor variable

• Residual plot against the fitted values

• Brown-Forsythe test and the Breusch-Pagan test
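As an illustration of the Brown-Forsythe idea, the sketch below uses the two-group form: cases are split into a low-X group and a high-X group, absolute deviations of the residuals from each group's median are formed, and the two sets of deviations are compared with a pooled two-sample t statistic. Splitting at the middle of the sorted X values, the function name, and the data are assumptions made for the example.

```python
import math
from statistics import median

def brown_forsythe_t(x, resid):
    """Two-group Brown-Forsythe statistic for constancy of error variance."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = len(x) // 2
    g1 = [resid[i] for i in order[:half]]      # residuals at low X
    g2 = [resid[i] for i in order[half:]]      # residuals at high X

    # Absolute deviations of residuals around their group median
    m1, m2 = median(g1), median(g2)
    d1 = [abs(e - m1) for e in g1]
    d2 = [abs(e - m2) for e in g2]
    n1, n2 = len(d1), len(d2)
    d1bar, d2bar = sum(d1) / n1, sum(d2) / n2

    # Pooled variance of the absolute deviations
    s2 = (sum((d - d1bar) ** 2 for d in d1)
          + sum((d - d2bar) ** 2 for d in d2)) / (n1 + n2 - 2)
    return (d1bar - d2bar) / (math.sqrt(s2) * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical residuals whose spread grows with X
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
resid = [0.1, -0.2, 0.1, -0.1, 0.2, 1.5, -2.0, 1.0, -1.8, 2.2]
t_bf = brown_forsythe_t(x, resid)
print(t_bf)
```

The statistic is compared with a t distribution with n − 2 degrees of freedom; a large absolute value suggests that the error variance is not constant.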

Presence of Outliers

• Residual plots against X or $\hat{Y}$

• Box plots

• Stem-and-leaf plots

• Dot plots of the residuals

• Plotting of semistudentized residuals
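Semistudentized residuals rescale each residual by the estimated error standard deviation, $e_i^* = e_i / \sqrt{MSE}$, so that unusually large values stand out. A minimal sketch for a straight-line fit follows; the function name and the data, which contain one deliberately shifted case, are hypothetical.

```python
import math

def semistudentized_residuals(x, y):
    """Residuals of a straight-line fit divided by sqrt(MSE)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(ei ** 2 for ei in e) / (n - 2)
    return [ei / math.sqrt(mse) for ei in e]

# Hypothetical data on the line y = 2x, with the fifth case shifted upward
x = list(range(1, 11))
y = [2.0 * xi for xi in x]
y[4] += 8.0
e_star = semistudentized_residuals(x, y)
largest = max(range(len(x)), key=lambda i: abs(e_star[i]))
print(largest, e_star[largest])
```

Because an outlier also inflates MSE, the semistudentized value understates how extreme the case is in small samples.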

Nonindependence of Error Terms

• sequence plot of the residuals

Nonnormality of Error Terms

• Box plot

• Histogram

• Dot plot

• Stem-and-leaf plot

• Normal probability plot of the residuals

• Goodness of fit tests: chi-square test or the Kolmogorov-Smirnov test and its modification, the Lilliefors test

Omission of Important Predictor Variables

• Residuals should be plotted against variables omitted from the model that might have important effects on the response.

Model Adequacy for a Predictor Variable: Added-Variable Plots
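An added-variable plot for a predictor $X_2$, given that $X_1$ is already in the model, plots the residuals from regressing Y on $X_1$ against the residuals from regressing $X_2$ on $X_1$. The sketch below computes the two residual series and the slope of the line through them, which equals the coefficient of $X_2$ in the model containing both predictors. Function names and data are hypothetical, and only the two-predictor case is shown.

```python
def simple_residuals(x, y):
    """Residuals from the straight-line regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def added_variable(y, x1, x2):
    """Coordinates of the added-variable plot for x2 given x1, plus its slope."""
    e_y = simple_residuals(x1, y)     # part of Y not explained by X1
    e_x2 = simple_residuals(x1, x2)   # part of X2 not explained by X1
    slope = sum(a * b for a, b in zip(e_x2, e_y)) / sum(a * a for a in e_x2)
    return e_x2, e_y, slope

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
e_x2, e_y, slope = added_variable(y, x1, x2)
print(slope)
```

A linear band with nonzero slope in the plotted points suggests that $X_2$ adds a linear term worth including; a curved band suggests a curvilinear effect.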

Remedial measures

Nonlinearity of Regression Function

• Transformation

• Modify regression model by altering the nature of the regression function. For instance:

Nonconstancy of Error Variance (Nonnormality of Error Terms)

• Weighted least squares

• Transformation: Box-Cox transformations, for which the maximum likelihood estimate of $\lambda$ is that value of $\lambda$ for which SSE is a minimum
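The search for the Box-Cox power can be sketched as a grid search. To make SSE comparable across powers, the sketch uses the standardized transform commonly given in textbooks: $W_i = K_1 (Y_i^{\lambda} - 1)$ for $\lambda \neq 0$ and $W_i = K_2 \ln Y_i$ for $\lambda = 0$, with $K_2$ the geometric mean of the $Y_i$ and $K_1 = 1 / (\lambda K_2^{\lambda - 1})$. The function names, the grid, and the data are assumptions for illustration; Y must be positive.

```python
import math

def sse_line(x, w):
    """SSE from the straight-line regression of w on x."""
    n = len(x)
    xbar, wbar = sum(x) / n, sum(w) / n
    b1 = (sum((xi - xbar) * (wi - wbar) for xi, wi in zip(x, w))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = wbar - b1 * xbar
    return sum((wi - b0 - b1 * xi) ** 2 for xi, wi in zip(x, w))

def box_cox_sse(x, y, lam):
    """SSE after regressing the standardized Box-Cox transform of y on x."""
    n = len(y)
    k2 = math.exp(sum(math.log(v) for v in y) / n)   # geometric mean of Y
    if lam == 0:
        w = [k2 * math.log(v) for v in y]
    else:
        k1 = 1.0 / (lam * k2 ** (lam - 1))
        w = [k1 * (v ** lam - 1) for v in y]
    return sse_line(x, w)

def best_lambda(x, y, grid):
    """Grid value of lambda with the smallest SSE."""
    return min(grid, key=lambda lam: box_cox_sse(x, y, lam))

# Hypothetical data in which sqrt(Y) is exactly linear in X
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [(1 + xi) ** 2 for xi in x]
grid = [k / 2 for k in range(-4, 5)]       # -2.0, -1.5, ..., 2.0
lam_hat = best_lambda(x, y, grid)
print(lam_hat)
```

In practice a finer grid or a numerical optimizer would be used, and a nearby convenient power (such as 0.5 or 0) is often chosen for interpretability.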

Weighted Least Squares

• Error Variances Known:

• Error Variances Unknown:

1. A residual plot against $X_1$ exhibits a megaphone shape. Regress the absolute residuals against $X_1$.

2. A residual plot against $\hat{Y}$ exhibits a megaphone shape. Regress the absolute residuals against $\hat{Y}$.

3. A plot of the squared residuals against $X_3$ exhibits an upward tendency. Regress the squared residuals against $X_3$.

4. A plot of the residuals against $X_2$ suggests that the variance increases rapidly with increases in $X_2$ up to a point and then increases more slowly. Regress the absolute residuals against $X_2$ and $X_2^2$.
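The first case in the list can be sketched end to end for a single predictor: fit by ordinary least squares, regress the absolute residuals on X to estimate the error standard deviation at each case, and refit with weights equal to the reciprocal of the squared estimate. The function names and data are hypothetical, and a single pass is shown; in practice the steps may be iterated until the coefficients stabilize.

```python
def weighted_line(x, y, w):
    """Weighted least-squares intercept and slope for a straight line."""
    sw = sum(w)
    xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
    yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x)))
    return yb - b1 * xb, b1

def wls_estimated_sd(x, y):
    """One pass of WLS with weights from a regression of |residual| on x."""
    ones = [1.0] * len(x)
    b0, b1 = weighted_line(x, y, ones)                 # ordinary least squares
    abs_res = [abs(yi - b0 - b1 * xi) for xi, yi in zip(x, y)]
    c0, c1 = weighted_line(x, abs_res, ones)           # standard deviation function
    s_hat = [c0 + c1 * xi for xi in x]
    if min(s_hat) <= 0:
        raise ValueError("estimated standard deviations must be positive")
    w = [1.0 / s ** 2 for s in s_hat]
    return weighted_line(x, y, w), w

# Hypothetical megaphone-shaped data around the line y = 2x
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.8, 6.3, 7.6, 10.5, 11.4, 14.7, 15.2, 18.9, 19.0]
(b0_w, b1_w), weights = wls_estimated_sd(x, y)
print(b0_w, b1_w)
```

Cases with larger estimated error variance receive smaller weights, so the fit is driven mainly by the more precise observations.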

Nonindependence of Error Terms

• When the error terms are correlated, a direct remedial measure is to work with a model that calls for correlated error terms

Multicollinearity Remedial Measures

• Ridge Regression: an estimator that has only a small bias and is substantially more precise than an unbiased estimator may be preferred, since it is more likely to be close to the true parameter value.

Remedial Measures for Influential Cases

Robust Regression:

• IRLS Robust Regression

• LMS Regression

Model Validation

1. Collection of new data to check the model and its predictive ability.

2. Comparison of results with theoretical expectations, earlier empirical results, and simulation results.

3. Use of a holdout sample to check the model and its predictive ability.

Collection of New Data to Check Model

Methods of Checking Validity:

• Reestimate the model form chosen earlier using the new data.

• Calibrate the predictive capability of the selected regression model.

Mean squared prediction error:
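In its usual textbook form, the mean squared prediction error averages the squared prediction errors over the new cases. Here $Y_i$ is the observed response for the $i$th new case, $\hat{Y}_i$ is its predicted value from the model fitted to the original data, and $n^*$ is the number of new cases; the notation is the conventional one and is assumed rather than taken from the slide.

```latex
MSPR = \frac{\sum_{i=1}^{n^*} \left( Y_i - \hat{Y}_i \right)^2}{n^*}
```

An MSPR close to the MSE of the fitted model suggests that MSE is not seriously biased as an indication of predictive ability; an MSPR much larger than MSE suggests relying on MSPR instead.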

Comparison with Theory, Empirical Evidence, or Simulation Results

• Comparisons of regression coefficients and predictions with theoretical expectations, previous empirical results, or simulation results

Data Splitting

• When the data set is large enough, an alternative to collecting new data is to split the data into two sets. The first set, called the model-building set or the training sample, is used to develop the model. The second data set, called the validation or prediction set, is used to evaluate the reasonableness and predictive ability of the selected model. This validation procedure is often called cross-validation.
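A minimal sketch of data splitting for a single predictor follows: the cases are shuffled and divided into a model-building set and a validation set, a straight line is fitted to the first, and the mean squared prediction error is computed on the second for comparison with the training MSE. The function names, the 50/50 split, the seeds, and the simulated data are all assumptions made for the example.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

def split_and_validate(x, y, frac=0.5, seed=0):
    """Fit on a random model-building set; report training MSE and holdout MSPR."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * frac)
    train, valid = idx[:cut], idx[cut:]
    b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
    mse_train = sum((y[i] - b0 - b1 * x[i]) ** 2 for i in train) / (len(train) - 2)
    mspr = sum((y[i] - b0 - b1 * x[i]) ** 2 for i in valid) / len(valid)
    return train, valid, mse_train, mspr

# Simulated data: a linear trend plus normal noise
rng = random.Random(1)
x = [i / 2 for i in range(40)]
y = [3 + 1.5 * xi + rng.gauss(0, 1) for xi in x]
train, valid, mse_train, mspr = split_and_validate(x, y)
print(mse_train, mspr)
```

Once the model has been validated, it is customary to refit it on the entire data set so that the final estimates use all available cases.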