# Statistics I - Introduction to ANOVA, Regression, And Logistic Regression

## Summary: Lesson 1: Introduction to Statistics

This summary contains topic summaries, syntax, and sample programs.

### Topic Summaries

#### Basic Statistical Concepts

Descriptive statistics organizes, describes, and summarizes data using numbers and graphical techniques. Inferential statistics is concerned with drawing conclusions about a population from the analysis of a random sample drawn from that population. It is also concerned with the precision and reliability of those inferences.

A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. The sample should be representative of the population. You can obtain a representative sample by collecting a simple random sample.

Parameters are numerical values that summarize characteristics of a population. Parameter values are typically unknown and are represented by Greek letters. Statistics summarize characteristics of a sample and are represented by letters from the English alphabet. You measure characteristics of your sample, compute numerical values that summarize them, and use those statistics to estimate parameters.

Variables are characteristics or properties of data that take on different values or amounts. A variable can be independent or dependent. In some contexts, you select the value of an independent variable in order to determine its relationship to the dependent variable. In other contexts, the independent variable's values are simply taken as given.

Variables are also classified according to their characteristics. They can be quantitative or categorical. Data that consists of counts or measurements is quantitative. Quantitative data is either discrete or continuous. Discrete data takes on only a finite, or countable, number of values. Continuous data has an infinite number of values and no breaks or jumps.

Categorical or attribute data consists of variables that denote groupings or labels. There are two main types: nominal and ordinal. A nominal categorical variable exhibits no ordering within its groups or categories. With ordinal categorical variables, the observed levels of the variable can be ordered in a meaningful way that implies differences due to magnitude.

A variable's classification is its scale of measurement. There are two scales of measurement for categorical variables: nominal and ordinal. There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered and has a sensible spacing of observations, so differences between measurements are meaningful. However, ratios between values on an interval scale are not meaningful because there is no true zero point. Data on a ratio scale includes a true zero point, so the ratio between two values on the scale is meaningful.

The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. Multivariate analysis examines two or more variables at the same time, in order to understand the relationships among them.

#### Descriptive Statistics

A data set's distribution tells you what values your data takes and how often it takes those values.

You can calculate descriptive statistics that measure locations in your data. Statistics that locate the center of the data are measures of central tendency. These include the mean, median, and mode.

Percentiles are descriptive statistics that give you reference points in your data. A percentile is the value of a variable below which a certain percentage of observations fall. The most commonly reported percentiles are quartiles, which break the data into quarters.

Several descriptive statistics measure the variability of your data: range, interquartile range (IQR), variance, standard deviation, and coefficient of variation (C.V.).

To summarize data and generate descriptive statistics, you use the MEANS procedure. PROC MEANS calculates a standard set of statistics, including the minimum, maximum, and mean data values, as well as the standard deviation and n. The PRINTALLTYPES option displays statistics for all requested combinations of class variables.
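As a reference for the measures of variability named above, the usual sample formulas can be written as follows. This is a sketch that assumes the n − 1 divisor (the PROC MEANS default); the symbols x_i, x̄, s, Q1, and Q3 are notation introduced here, not taken from the lesson.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
$$

$$
\text{range} = x_{\max} - x_{\min}, \qquad
\mathrm{IQR} = Q_3 - Q_1, \qquad
\mathrm{C.V.} = 100 \cdot \frac{s}{\bar{x}}
$$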

#### Confidence Intervals for the Mean

A point estimator is a sample statistic used to estimate a population parameter. A statistic that measures the variability of your estimator is the standard error. The standard error of the mean measures the variability of your sample mean. It's an estimate of how much you can expect the sample mean to vary from sample to sample.

The distribution of sample means is the distribution of all possible sample means from the population. The distribution of the mean is always less variable than the data.

An interval estimator is another way to estimate a population parameter. It incorporates the uncertainty that arises from random variability. Confidence intervals are a type of interval estimator used to estimate the population mean while taking into account the variability of the sample mean.

The central limit theorem states that the distribution of sample means is approximately normal, regardless of the shape of the population distribution, if the sample size is large enough.

You can use the MEANS procedure to generate a 95% confidence interval for the mean. The CLM option in the PROC MEANS statement calculates the confidence limits for the mean. Adding the ALPHA= option to the PROC MEANS statement constructs confidence intervals with a different confidence level.
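The standard error and the confidence limits described above can be sketched with the usual t-based formulas. The notation (s for the sample standard deviation, n for the sample size, α for one minus the confidence level) is assumed here rather than taken from the lesson.

$$
s_{\bar{x}} = \frac{s}{\sqrt{n}}, \qquad
\bar{x} \;\pm\; t_{1-\alpha/2,\;n-1} \cdot s_{\bar{x}}
$$

With α = 0.05 this gives a 95% interval. Because the standard error divides s by √n, the sample mean varies less than the individual observations whenever n > 1.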

#### Hypothesis Testing

A hypothesis test uses sample data to evaluate a question about a population. It provides a way to make inferences about a population, based on sample data.

There are four steps in conducting a hypothesis test:

1. Identify the population of interest and determine the null and alternative hypotheses. The null hypothesis, H₀, is what you assume to be true unless proven otherwise. It is usually a hypothesis of equality. The alternative hypothesis, Hₐ or H₁, is typically what you suspect or are attempting to demonstrate. It is usually a hypothesis of inequality.
2. Select the significance level. This is the amount of evidence needed to reject the null hypothesis. A common significance level is 0.05 (1 chance in 20).
3. Collect the data.
4. Use a decision rule to evaluate the data. You decide whether or not there is enough evidence to reject the null hypothesis.

If you reject the null hypothesis when it's actually true, you've made a Type I error. The probability of committing a Type I error is α, which is the significance level of the test. If you fail to reject the null hypothesis and it's actually false, you've made a Type II error. The probability of committing a Type II error is β. Type I and Type II errors are inversely related. The power of a statistical test is equal to 1 minus beta (1 − β).

The difference between the observed statistic and the hypothesized value is the effect size. A p-value measures the probability, assuming the null hypothesis is true, of observing a value as extreme as the one observed or more extreme. A p-value is affected not only by the effect size but also by the sample size.

The t statistic measures how far X-bar, the sample mean, is from the hypothesized mean, μ₀. If the t statistic is much higher or lower than 0 and has a small corresponding p-value, the sample mean is quite different from the hypothesized mean, and you would reject the null hypothesis.

You can use PROC UNIVARIATE to perform a statistical hypothesis test. You use the MU0= option to specify the value of the hypothesized mean, μ₀. You can use the ALPHA= option to change the significance level.
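For reference, the one-sample t statistic described above is commonly written as follows. The symbols s (sample standard deviation) and n (sample size) are notation assumed here; the statistic expresses the distance between the sample mean and the hypothesized mean in units of the standard error.

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad
\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), \qquad
\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ false}), \qquad
\text{power} = 1 - \beta
$$

For a fixed difference between the sample mean and μ₀, a larger n shrinks the denominator, which is one way to see why the p-value depends on the sample size as well as the effect size.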

### Syntax

    PROC MEANS DATA=SAS-data-set;
        CLASS variables;
        VAR variables;
    RUN;

    PROC UNIVARIATE DATA=SAS-data-set;
        VAR variables;
        ID variables;
        HISTOGRAM variables;
        PROBPLOT variables;
        INSET keywords;
    RUN;

    PROC SGPLOT DATA=SAS-data-set;
        DOT category-variable;
        HBAR category-variable;
        VBAR category-variable;
        HBOX response-variable;
        VBOX response-variable;
        HISTOGRAM response-variable;
        SCATTER X=variable Y=variable;
        NEEDLE X=variable Y=numeric-variable;
        REG X=numeric-variable Y=numeric-variable;
        REFLINE variable | value-1;
    RUN;

    ODS GRAPHICS ON;
    statistical procedure code
    ODS GRAPHICS OFF;

### Sample Programs

#### Using PROC MEANS to Generate Descriptive Statistics

    /* Report selected statistics for SATScore, overall and by Gender */
    proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
               n mean median std var q1 q3;
        class Gender;
        var SATScore;
        title 'Selected Descriptive Statistics for SAT Scores';
    run;
    title;

#### Using SAS to Picture Your Data

    /* Histogram and normal probability plot, each with an inset of shape statistics */
    proc univariate data=statdata.testscores;
        var SATScore;
        id idnumber;
        histogram SATScore / normal(mu=est sigma=est);
        inset skewness kurtosis / position=ne;
        probplot SATScore / normal(mu=est sigma=est);
        inset skewness kurtosis;
        title 'Descriptive Statistics Using PROC UNIVARIATE';
    run;
    title;

    /* Box plot of SATScore with a reference line at 1200 on the Y axis */
    proc sgplot data=statdata.testscores;
        refline 1200 / axis=y lineattrs=(color=blue);
        vbox SATScore / datalabel=IDNumber;
        format IDNumber 8.;
        title "Box Plots of SAT Scores";
    run;
    title;

#### Calculating a 95% Confidence Interval

    /* CLM requests confidence limits for the mean (95% by default) */
    proc means data=statdata.testscores maxdec=4 n mean stderr clm;
        var SATScore;
        title '95% Confidence Interval for SAT';
    run;
    title;
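The lesson notes that the ALPHA= option changes the confidence level, but the sample program shows only the default. A minimal sketch of a 90% interval follows. It assumes the SASHELP.CLASS sample table that ships with SAS is available; the table and its Height variable are stand-ins chosen for illustration and are not part of the course data.

```sas
/* Assumption: SASHELP.CLASS (shipped with SAS) is used as stand-in data. */
/* ALPHA=0.10 requests 90% confidence limits instead of the default 95%.  */
proc means data=sashelp.class maxdec=4 n mean stderr clm alpha=0.10;
    var Height;
    title '90% Confidence Interval for Mean Height';
run;
title;
```

Lowering the confidence level narrows the interval; raising it (for example, ALPHA=0.01) widens it.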

#### Using PROC UNIVARIATE to Perform a Hypothesis Test

    /* Show only the Tests for Location table; MU0= sets the hypothesized mean */
    ods select testsforlocation;
    proc univariate data=statdata.testscores mu0=1200;
        var SATScore;
        title 'Testing Whether the Mean of SAT Scores = 1200';
    run;
    title;

## Summary: Lesson 2: Analysis of Variance (ANOVA)

This summary contains topic summaries, syntax, and sample programs.

### Topic Summaries

#### Two-Sample t-Tests

The two-sample t-test is a hypothesis test for answering questions about the means of two populations. You can examine the differences between populations for one or more continuous variables and assess whether the means of the two populations are statistically different from each other.

The null hypothesis for the two-sample t-test is that the means for the two groups are equal. The alternative hypothesis is the logical opposite of the null and is typically what you suspect or are trying to show. It is usually a hypothesis of inequality. The alternative hypothesis for the two-sample t-test is that the means for the two groups are not equal.

The three assumptions for the two-sample t-test are independence, normality, and equal variances.

You use the F-test for equality of variances to evaluate the assumption of equal variances in the two populations. You calculate the F statistic, which is the ratio of the maximum sample variance of the two groups to the minimum sample variance of the two groups. If the p-value of the F-test is greater than your alpha, you fail to reject the null hypothesis and can proceed as if the variances are equal between the groups. If the p-value of the F-test is less than your alpha, you reject the null hypothesis and can proceed as if the variances are not equal.

With one-sided tests, you look for a difference in one direction. For instance, you can test whether the mean of one population is greater than or less than the mean of another population. An advantage of one-sided tests is that they can increase the power of a statistical test.

To perform the two-sample t-test and the one-sided test, you can use PROC TTEST. You add the PLOTS option to the PROC TTEST statement to control the plots that ODS produces. You add the SIDES=U or SIDES=L option to specify an upper or lower one-sided test.
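The two statistics discussed above can be sketched as follows. The first is the variance-ratio F statistic exactly as the text describes it; the second is the standard pooled (equal-variance) form of the two-sample t statistic. The subscripts 1 and 2 for the two groups and the symbol s_p for the pooled standard deviation are notation assumed here.

$$
F = \frac{\max(s_1^2,\; s_2^2)}{\min(s_1^2,\; s_2^2)}
$$

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

Because the larger variance is always in the numerator, F is at least 1, and values far above 1 are evidence against equal variances.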

#### One-Way ANOVA

You can use ANOVA to determine whether there are significant differences between the means of two or more populations. In this model, you have a continuous dependent, or response, variable and a categorical independent, or predictor, variable. With ANOVA, the null hypothesis is that all of the population means are equal. The alternative hypothesis is that not all of the population means are equal. In other words, at least one mean is different from the rest.

One way to represent the relationship between the response and predictor variables in ANOVA is with a mathematical ANOVA model.

ANOVA analyzes the variances of the data to determine whether there is a difference between the group means. You determine whether the variation of the means is large enough relative to the variation of observations within the groups. To do this, you calculate three types of sums of squares: between-group variation (SSM), within-group variation (SSE), and total variation (SST). The SSM and SSE represent pieces of the total variability. If the SSM is large enough relative to the SSE, as judged by the F test, you reject the null hypothesis that all of the group means are equal.

Before you perform the hypothesis test, you need to verify the three ANOVA assumptions: the observations are independent, the error terms are normally distributed, and the error terms have equal variances across groups.

The residuals that come from your data are estimates of the error term in the model. You calculate the residuals from ANOVA by taking each observation and subtracting its group mean. Then you verify the two assumptions regarding normality and equal variances of the errors.

To verify the ANOVA assumptions and perform the ANOVA test, you use PROC GLM. In the MODEL statement, you specify the dependent and independent variables for the analysis. The MEANS statement computes unadjusted means of the dependent variable for each value of the specified effect. You can add the HOVTEST option to the MEANS statement to perform Levene's test for homogeneity of variances. If the resulting p-value of Levene's test is greater than 0.05 (typically), you fail to reject the null hypothesis of equal variances.
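The one-way ANOVA model and the sums of squares described above can be sketched in the following standard form. The indexing is an assumption made here for illustration: i = 1, …, g indexes the groups, k indexes the n_i observations within group i, and N is the total number of observations.

$$
Y_{ik} = \mu + \tau_i + \varepsilon_{ik}
$$

$$
\underbrace{\sum_{i=1}^{g}\sum_{k=1}^{n_i} (Y_{ik} - \bar{Y})^2}_{\text{SST}}
= \underbrace{\sum_{i=1}^{g} n_i\,(\bar{Y}_i - \bar{Y})^2}_{\text{SSM}}
+ \underbrace{\sum_{i=1}^{g}\sum_{k=1}^{n_i} (Y_{ik} - \bar{Y}_i)^2}_{\text{SSE}}
$$

$$
F = \frac{\text{SSM} / (g - 1)}{\text{SSE} / (N - g)}, \qquad
e_{ik} = Y_{ik} - \bar{Y}_i
$$

The F statistic compares the two sums of squares after each is divided by its degrees of freedom, and the residuals e_ik are the quantities used to check the normality and equal-variance assumptions.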

#### ANOVA with Data from a Randomized Block Design

In a controlled experiment, you can design the analysis prospectively and control for other factors, called nuisance factors, that affect the outcome you're measuring. Nuisance factors can affect the outcome of your experiment but are not of interest in the experiment. In a randomized block design, you can use a blocking variable to control for the nuisance factors and reduce or eliminate their contribution to the experimental error.

One way to represent the relationship between the response and predictor variables in ANOVA is with a mathematical ANOVA model. You can also include a blocking variable in the model.

Along with the three original ANOVA assumptions of independent observations, normally distributed errors, and equal variances across treatments, you make two more assumptions when you include a blocking factor in the model. You assume that the treatments are randomly assigned within each block, and you assume that the effects of the treatment factor are constant across levels of the blocking factor.

You use PROC GLM to perform ANOVA with a blocking variable. You list the blocking variable in the CLASS statement and in the MODEL statement.
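One common way to write the ANOVA model with a blocking variable is shown below. The symbols are assumptions introduced here: α_i is the effect of block i, τ_j is the effect of treatment j, and ε_ij is the error term. The absence of an α × τ interaction term reflects the assumption that treatment effects are constant across blocks.

$$
Y_{ij} = \mu + \alpha_i + \tau_j + \varepsilon_{ij}
$$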

#### ANOVA Post Hoc Tests

A pairwise comparison examines the difference between two treatment means. If your ANOVA results suggest that you reject the null hypothesis that the means are equal across groups, you can conduct multiple pairwise comparisons in a post hoc analysis to learn which means differ.

The chance that you make a Type I error increases each time you conduct a statistical test. The comparisonwise error rate, or CER, is the probability of a Type I error on a single pairwise test. The experimentwise error rate, or EER, is the probability of making at least one Type I error when performing all of the pairwise comparisons. The EER increases as the number of pairwise comparisons increases.

You can use the Tukey method to control the EER. This test compares all possible pairs of means, so it can only be used when you make pairwise comparisons. Dunnett's method is a specialized multiple comparison test that enables you to compare a single control group to all other groups.

You request all of the multiple comparison methods with options in the LSMEANS statement in PROC GLM. You use the PDIFF=ALL option to request p-values for the differences between all of the means. With this option, SAS produces a diffogram. You use the ADJUST= option to specify the adjustment method for multiple comparisons. When you specify the ADJUST=DUNNETT option, SAS produces multiple comparisons using Dunnett's method and a control plot.
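To see why the EER grows with the number of comparisons, consider the simplified case in which the c pairwise tests are independent and each uses comparisonwise rate α. This independence is an assumption made here for illustration; pairwise comparisons on the same data are not independent in practice. With g groups:

$$
c = \frac{g\,(g - 1)}{2}, \qquad
\text{EER} = 1 - (1 - \alpha)^{c}
$$

For example, g = 4 groups give c = 6 comparisons, and with α = 0.05 the formula yields 1 − 0.95⁶ ≈ 0.265, far above the nominal 0.05.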

#### Two-Way ANOVA with Interactions

When you have two categorical predictor variables and a continuous response variable, you can analyze your data using two-way ANOVA. With two-way ANOVA, you can examine the effects of the two predictor variables concurrently. You can also determine whether they interact with respect to their effect on the response variable. An interaction means that the effect of one variable depends on the value of another variable. If there is no interaction, you can interpret the tests for the individual factor effects to determine their significance. If an interaction exists between any factors, the tests for the individual factor effects might be misleading because the interaction can mask these effects.

You can include more than one predictor variable and interactions in the ANOVA model.

You can graphically explore the relationship between the response variable and the effect of the interaction between the two predictor variables using PROC SGPLOT.

You can use PROC GLM to determine whether the effects of the predictor variables and the interaction between them are statistically significant.
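A standard way to write the two-way ANOVA model with an interaction is sketched below. The notation is assumed here: α_i and β_j are the main effects of the two predictors, (αβ)_ij is their interaction, and k indexes observations within each combination of levels.

$$
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
$$

In a PROC GLM MODEL statement, the three effect terms correspond to listing both predictors and their crossed term joined by an asterisk.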

### Syntax

    PROC TTEST DATA=SAS-data-set;
        CLASS variable;
        VAR variable(s);
    RUN;

Selected Options in PROC TTEST

| Statement | Option |
|---|---|
| PROC TTEST | PLOTS(SHOWNULL)=INTERVAL |
| | SIDES=U |
| | SIDES=L |

    PROC GLM DATA=SAS-data-set;
        CLASS variable(s);
        MODEL dependents=independents;
        MEANS effects;
        LSMEANS effects;
    RUN;
    QUIT;

Selected Options in PROC GLM

| Statement | Option |
|---|---|
| PROC GLM | PLOTS(ONLY) |
| | DIAGNOSTICS(UNPACK) |
| MEANS | HOVTEST |

### Sample Programs

#### Running PROC TTEST in SAS

    /* Two-sided test; SHOWNULL marks the null value on the interval plot */
    proc ttest data=statdata.testscores plots(shownull)=interval;
        class Gender;
        var SATScore;
        title 'Two-Sample t-Test Comparing Girls to Boys';
    run;
    title;

#### Performing a One-Sided t-Test

    /* SIDES=U requests an upper one-sided test of a difference greater than H0=0 */
    proc ttest data=statdata.testscores plots(shownull)=interval h0=0 sides=U;
        class Gender;
        var SATScore;
        title 'One-Sided t-Test Comparing Girls to Boys';
    run;
    title;

#### Examining Descriptive Statistics across Groups

    /* Default statistics for BulbWt, overall and by Fertilizer */
    proc means data=statdata.mggarlic printalltypes maxdec=3;
        var BulbWt;
        class Fertilizer;
        title 'Descriptive Statistics of Garlic Weight';
    run;

    /* One box per Fertilizer level; outliers are labeled with BedID */
    proc sgplot data=statdata.mggarlic;
        vbox BulbWt / category=Fertilizer datalabel=BedID;
        format BedID 5.;
        title 'Box Plots of Garlic Weight';
    run;
    title;

#### Using the GLM Procedure

    /* One-way ANOVA with diagnostic plots; HOVTEST runs Levene's test */
    proc glm data=statdata.mggarlic plots(only)=diagnostics(unpack);
        class Fertilizer;
        model BulbWt=Fertilizer;
        means Fertilizer / hovtest;
        title 'Testing for Equality of Means with PROC GLM';
    run;
    quit;
    title;

#### Performing ANOVA with Blocking

    /* Sector is the blocking variable: listed in both CLASS and MODEL */
    proc glm data=statdata.mggarlic_block plots(only)=diagnostics(unpack);
        class Fertilizer Sector;
        model BulbWt=Fertilizer Sector;
        title 'ANOVA for Randomized Block Design';
    run;
    quit;
    title;

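#### Performing a Post Hoc Pairwise Comparison

    /* Limit the output to the least squares means tables and plots */
    ods select lsmeans diff meanplot diffplot controlplot;

    proc glm data=statdata.mggarlic_block;
        class Fertilizer Sector;
        model BulbWt=Fertilizer Sector;
        /* Tukey adjustment for all pairwise differences */
        lsmeans Fertilizer / pdiff=all adjust=tukey;
        /* Dunnett upper one-sided comparisons against control level '4' */
        lsmeans Fertilizer / pdiff=controlu('4') adjust=dunnett;
        /* Unadjusted t tests for all pairwise differences */
        lsmeans Fertilizer / pdiff=all adjust=t;
        title 'Garlic Data: Multiple Comparisons';
    run;
    quit;
    title;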

#### Examining Your Data with PROC MEANS

    /* Display labels for the numeric dose codes */
    proc format;
        value dosef 1="Placebo"
                    2="100mg"
                    3="200mg"
                    4="500mg";
    run;

    /* OUTPUT writes the cell means to the data set MEANS for plotting */
    proc means data=statdata.drug mean var std printalltypes;
        class Disease DrugDose;
        var BloodP;
        output out=means mean=BloodP_Mean;
        format DrugDose dosef.;
        title 'Selected Descriptive Statistics for Drug Data Set';
    run;
    title;

#### Examining Your Data with PROC SGPLOT

    /* _TYPE_=3 keeps the rows for each Disease-by-DrugDose combination */
    proc sgplot data=means;
        where _TYPE_=3;
        scatter x=DrugDose y=BloodP_Mean / group=Disease markerattrs=(size=10);
        series x=DrugDose y=BloodP_Mean / group=Disease lineattrs=(thickness=2);
        xaxis integer;
        format DrugDose dosef.;
        title 'Plot of Stratified Means in Drug Data Set';
    run;
    title;

#### Performing Two-Way ANOVA with Interactions

    /* DrugDose*Disease adds the interaction term to the model */
    proc glm data=statdata.drug;
        class DrugDose Disease;
        model Bloodp=DrugDose Disease DrugDose*Disease;
        format DrugDose dosef.;
        title1 'Analyze the Effects of DrugDose and Disease';
        title2 'including Interactions';
    run;
    quit;
    title;

#### Performing a Post Hoc Pairwise Comparison

    proc format;
        value dosef 1="Placebo"
                    2="100mg"
                    3="200mg"
                    4="500mg";
    run;

    /* SLICE=Disease tests the DrugDose effect separately at each level of Disease */
    ods select meanplot lsmeans slicedanova;
    proc glm data=statdata.drug;
        class DrugDose Disease;
        model Bloodp=DrugDose Disease DrugDose*Disease;
        lsmeans DrugDose*Disease / slice=Disease;
        format DrugDose dosef.;
        title 'Analyze the Effects of DrugDose at Each Level of Disease';
    run;
    quit;
    title;

## Summary: Lesson 3: Regression

This summary contains topic summaries, syntax, and sample programs.

### Topic Summaries

To analyze continuous variables, you can use linear regression. To investigate your data before performing linear regression, you can use techniques for exploratory data analysis, including scatter plots and correlation analysis. In exploratory data analysis, you're simply trying to explore the relationships between variables and to screen for outliers.

Scatter plots are an important tool for describing the relationship between continuous variables. Plot your data! You can use scatter plots to examine the relationship between two continuous variables, to detect outliers, to identify trends in your data, to identify the range of X and Y values, and to communicate the results of a data analysis.

You can also use correlation analysis to quantify the relationship between two variables. Correlation statistics measure the strength of the linear relationship between two continuous variables. Two variables are correlated if there is a linear association between them. A common correlation statistic used for continuous variables is the Pearson correlation coefficient, which ranges from −1 to +1.

The population parameter that represents a correlation is ρ (rho). The null hypothesis for a test of a correlation coefficient is that ρ equals 0, and the alternative hypothesis is that ρ is not 0. Rejecting the null hypothesis means only that you can be confident that the true population correlation is not exactly 0. You need to avoid common mistakes when interpreting the correlation between variables.

To produce correlation statistics and scatter plots for your data, you use PROC CORR. To rank-order the absolute value of the correlations from highest to lowest, you add the RANK option to the PROC CORR statement. To produce scatter plots, you add the PLOTS= option in the PROC CORR statement. You can also add context-specific options in parentheses following the main option keyword, such as PLOTS or SCATTER.

To examine the correlations between the potential predictor variables, you produce a correlation matrix and scatter plot matrix by using the NOSIMPLE, PLOTS=MATRIX, and HISTOGRAM options. To specify tooltips for hovering over data points and seeing detailed information about the observations, you use the IMAGEMAP=ON option in the ODS GRAPHICS statement and an ID statement in the PROC CORR step.
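For reference, the sample Pearson correlation coefficient and the hypotheses for the test of a correlation can be written as follows. The symbols x_i and y_i for the paired observations and r for the sample statistic are notation assumed here.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \;\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
$$

$$
H_0\colon \rho = 0 \qquad \text{versus} \qquad H_a\colon \rho \ne 0
$$

Because r measures only linear association, a value near 0 does not rule out a strong curved relationship, which is one reason to plot the data as well.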

In correlation analysis, you determine the strength of the linear relationships between continuous response variables. In simple linear regression, you use the simple linear regression model to determine the equation for the straight line that defines the linear relationship between the response variable and the predictor variable.

To determine how much better the model that takes the predictor variable into account is than a model that ignores the predictor variable, you can compare the simple linear regression model to a baseline model. For your comparison, you calculate the explained, unexplained, and total variability in the simple linear regression model.

The null hypothesis for linear regression is that the regression model does not fit the data better than the baseline model. The alternative hypothesis is that the regression model does fit the data better than the baseline model. In other words, the slope of the regression line is not equal to 0, or the parameter estimate of the predictor variable is not equal to 0.

Before performing simple linear regression, you need to verify the four assumptions for linear regression: that the mean of the response variable is linearly related to the value of the predictor variable, that the error terms are normally distributed, that the error terms have equal variances, and that the error terms are independent at each value of the predictor variable.

To fit regression models to your data, you use PROC REG. The MODEL statement specifies the response variable and the predictor variable. To evaluate your model, you typically examine the p-value for the overall model, the R-square value, and the parameter estimates.

To assess the level of precision around the mean estimates of the response variable, you can produce confidence intervals around the means and construct prediction intervals for a single observation. To display confidence and prediction intervals, you can specify the CLM and CLI options in the MODEL statement.

To produce predicted values for small data sets using PROC REG, you create a new data set containing the values of the independent variable for which you want to make predictions, concatenate the new data set with the original data set, and fit a simple linear regression model to the combined data set.

To produce predicted values for large data sets, using PROC REG and PROC SCORE is more efficient. You can use the NOPRINT and OUTEST= options in the PROC REG statement to write the parameter estimates from PROC REG to an output data set. Then you score the new observations using PROC SCORE, with the SCORE= option specifying the data set containing the parameter estimates, the OUT= option specifying the data set that PROC SCORE creates, and the TYPE= option specifying what type of data the SCORE= data set contains.
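The simple linear regression model and the comparison with the baseline model can be sketched in the usual notation. The symbols are assumptions introduced here: β₀ is the intercept, β₁ is the slope, Ŷ_i is the fitted value from the regression line, and the baseline model predicts the overall mean Ȳ for every observation.

$$
Y = \beta_0 + \beta_1 X + \varepsilon
$$

$$
\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{\text{total (SST)}}
= \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{\text{explained (SSM)}}
+ \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{\text{unexplained (SSE)}}
$$

$$
R^2 = \frac{\text{SSM}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}, \qquad
H_0\colon \beta_1 = 0 \quad \text{versus} \quad H_a\colon \beta_1 \ne 0
$$

R-square is therefore the proportion of the total variability in the response that the regression line explains beyond the baseline model.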

In multiple regression, you can model the relationship between the response variable and more than one predictor variable. In a model with two predictor variables, you can model the relationship of the three variables (three dimensions) with a two-dimensional plane.

Multiple linear regression has advantages and disadvantages. Its biggest advantage is that it's more powerful than simple linear regression: you can determine whether a relationship exists between the response variable and several predictor variables at the same time. The disadvantages are that you have to decide which model to use, and that interpreting the model becomes more complicated as you add predictors.

You can use multiple regression in two ways: for analytical or explanatory analysis and for prediction. If you specify many terms, the model for multiple regression can become very complex.

The hypotheses for multiple regression are similar to those for simple linear regression. The null hypothesis is that the multiple regression model does not fit the data better than the baseline model. (All the slopes or parameter estimates are equal to 0.) The alternative hypothesis is that the regression model does fit the data better than the baseline model.

For multiple linear regression, the same four assumptions as for simple linear regression apply: that the mean of the response variable is linearly related to the value of the predictor variables, that the error terms are normally distributed, that the error terms have equal variances, and that the error terms are independent at each value of the predictor variables.

To compare multiple linear regression models, you typically examine the p-values for the overall models, the adjusted R-square values, and the parameter estimates. The adjusted R-square value takes into account the number of terms in the model and increases only if new terms significantly improve the model.
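The multiple regression model and the adjusted R-square can be sketched as follows. The notation is assumed here: k is the number of predictor variables, p = k + 1 is the number of parameters including the intercept, and n is the number of observations.

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon, \qquad
H_0\colon \beta_1 = \beta_2 = \cdots = \beta_k = 0
$$

$$
R^2_{\text{adj}} = 1 - \frac{(n - 1)\,(1 - R^2)}{n - p}
$$

Adding a term always leaves R-square unchanged or higher, but it also increases p, so the adjusted value rises only when the gain in fit outweighs the penalty for the extra parameter.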

Your first decision in model selection is whether to use a manual or automated approach. Automated model selection techniques in SAS fall into two general categories: the all-possible regressions method and stepwise selection methods. For a large number of potential predictor variables, the stepwise regression methods might be a better option. The all-possible regressions method produces more candidate models, which requires you to use your expertise to select a model.

In the all-possible regressions method, SAS calculates all possible regression models. To describe your model, you can add an optional label to the MODEL statement. You can reduce the number of models in the output by specifying the BEST= option in the MODEL statement. To help evaluate the models you produce, you can request Mallows' Cp statistic in the PLOTS= option in the PROC REG statement and in the SELECTION= option in the MODEL statement. To request statistics for each model, you can specify them in the SELECTION= option. To select the best model for prediction, you should use Mallows' criterion for Cp. To select the best model for parameter estimation, you should use Hocking's criterion for Cp.

Stepwise selection methods are another, less computer-intensive way to find good candidate models without having to generate all possible models. You can specify forward, backward, and stepwise methods in the SELECTION= option in the MODEL statement. Each method selects variables based on their p-values. To change the default p-values that PROC REG uses to select variables, you can use the SLENTRY= and SLSTAY= options in the MODEL statement. It's a good idea to always run all three stepwise selection methods and look for commonalities among the final models for all three methods.
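The advice to run all three stepwise methods can be sketched as a single PROC REG step with three labeled MODEL statements. This is an illustration only: it assumes the SASHELP.CLASS sample table that ships with SAS, and the response, the predictors, the labels (FWD, BWD, STEP), and the 0.05 thresholds are hypothetical choices, not values from the lesson.

```sas
/* Assumption: SASHELP.CLASS (shipped with SAS) is used as stand-in data. */
/* Labels FWD, BWD, and STEP are hypothetical names for the three models.  */
proc reg data=sashelp.class;
    /* Forward: SLENTRY= is the p-value a variable needs to enter */
    FWD:  model Weight = Height Age / selection=forward  slentry=0.05;
    /* Backward: SLSTAY= is the p-value a variable needs to stay */
    BWD:  model Weight = Height Age / selection=backward slstay=0.05;
    /* Stepwise: uses both an entry and a stay threshold */
    STEP: model Weight = Height Age / selection=stepwise slentry=0.05 slstay=0.05;
    title 'Comparing Forward, Backward, and Stepwise Selection';
run;
quit;
title;
```

Predictors that survive in all three final models are the commonalities the text suggests looking for.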

Syntax

PROC CORR DATA=SAS-data-set;
     VAR variable(s);
     WITH variable(s);
RUN;

Selected Options in PROC CORR

Statement       Option
PROC CORR       RANK, PLOTS=, NOSIMPLE

Selected ODS Option

Statement       Option
ODS GRAPHICS    IMAGEMAP=ON

PROC REG DATA=SAS-data-set;
     MODEL dependent=regressor(s);
     ID variable(s);
RUN;

Selected Options in PROC REG

Statement       Option
PROC REG        NOPRINT, OUTEST=
MODEL           CLM, CLI, P

PROC SCORE DATA=SAS-data-set SCORE=SAS-data-set OUT=SAS-data-set TYPE=name;
     VAR variable(s);
RUN;

Sample Programs

Producing Correlation Statistics and Scatter Plots

proc corr data=statdata.fitness rank
          plots(only)=scatter(nvar=all ellipse=none);
   var RunTime Age Weight Run_Pulse
       Rest_Pulse Maximum_Pulse Performance;
   with Oxygen_Consumption;
   title "Correlations and Scatter Plots with Oxygen_Consumption";
run;

title;

Examining Correlations between Predictor Variables

ods graphics on / imagemap=on;

proc corr data=statdata.fitness nosimple
          plots=matrix(nvar=all histogram);
   var RunTime Age Weight Run_Pulse
       Rest_Pulse Maximum_Pulse Performance;
   id name;
   title "Correlations with Oxygen_Consumption";
run;

title;

Performing Simple Linear Regression

proc reg data=statdata.fitness;
   model Oxygen_Consumption=RunTime;
   title 'Predicting Oxygen_Consumption from RunTime';
run;
quit;

title;

Viewing and Printing Confidence Intervals and Prediction Intervals

proc reg data=statdata.fitness;
   model Oxygen_Consumption=RunTime / clm cli;
   id name runtime;
   title 'Predicting Oxygen_Consumption from RunTime';
run;
quit;

title;

Producing Predicted Values of the Response Variable

data need_predictions;
   input RunTime @@;
   datalines;
9 10 11 12 13
;
run;

data predoxy;
   set need_predictions
       statdata.fitness;
run;

proc reg data=predoxy;
   model Oxygen_Consumption=RunTime / p;
   id RunTime;
   title 'Oxygen_Consumption=RunTime with Predicted Values';
run;
quit;

title;

Storing Parameter Estimates and Scoring

proc reg data=statdata.fitness noprint outest=estimates;
   model Oxygen_Consumption=RunTime;
run;
quit;

proc print data=estimates;
   title "OUTEST= Data Set from PROC REG";
run;

title;

proc score data=need_predictions score=estimates
           out=scored type=parms;
   var RunTime;
run;

proc print data=Scored;
   title "Scored New Observations";
run;

title;

Performing Multiple Linear Regression

proc reg data=statdata.fitness;
   model Oxygen_Consumption=Performance RunTime;
   title 'Multiple Linear Regression for Fitness Data';
run;
quit;

title;

Using Automatic Model Selection

ods graphics / imagemap=on;

proc reg data=statdata.fitness plots(only)=(cp);
   ALL_REG: model Oxygen_Consumption=
                  Performance RunTime Age Weight
                  Run_Pulse Rest_Pulse Maximum_Pulse
                  / selection=cp rsquare adjrsq best=20;
   title 'Best Models Using All-Regression Option';
run;
quit;

title;

Estimating and Testing Coefficients for Selected Models

proc reg data=statdata.fitness;
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   EXPLAIN: model Oxygen_Consumption=
                  RunTime Age Weight Run_Pulse Maximum_Pulse;
   title 'Check "Best" Two Candidate Models';
run;
quit;

title;

Performing Stepwise Regression

proc reg data=statdata.fitness plots(only)=adjrsq;
   FORWARD:  model Oxygen_Consumption=
                   Performance RunTime Age Weight
                   Run_Pulse Rest_Pulse Maximum_Pulse
                   / selection=forward;
   BACKWARD: model Oxygen_Consumption=
                   Performance RunTime Age Weight
                   Run_Pulse Rest_Pulse Maximum_Pulse
                   / selection=backward;
   STEPWISE: model Oxygen_Consumption=
                   Performance RunTime Age Weight
                   Run_Pulse Rest_Pulse Maximum_Pulse
                   / selection=stepwise;
   title 'Best Models Using Stepwise Selection';
run;
quit;

title;


Summary: Lesson 4: Regression Diagnostics

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Verifying the first assumption of linear regression, that the linear model fits the data adequately, is critical. You should always plot your data before producing a model.

The remaining three assumptions of linear regression relate to error terms, so you check these assumptions in terms of errors, not in terms of the values of the response variable. To verify these assumptions, you can use several different residual plots. You can plot the residuals versus the predicted values, plot the residuals versus the values of the independent variables, and produce a histogram or a normal probability plot of the residuals. To verify that model assumptions are valid, you can check that the residuals display a random scatter above and below the reference line at 0. If you see patterns or trends in the residual values, the assumptions might not be valid and the models might have problems. You can also use residual plots to detect outliers.

To create residual plots and other diagnostic plots, you use PROC REG, which creates a number of default plots. Specifying an identifier variable in the ID statement shows you that information when you hover your cursor over the data points in the graph. You can also request specific plots with the PLOTS= option in the PROC REG statement.

You should also identify any influential observations that strongly affect the linear model's fit to the data. To identify outliers and influential observations in your data, you can use several diagnostic statistics in PROC REG.

To detect outliers, you can use STUDENT residuals. To detect influential observations, you can use Cook's D statistics, RSTUDENT residuals, and DFFITS statistics. The Cook's D statistic is most useful for explanatory or analytic models, and DFFITS is most useful for predictive models. If you detect an influential observation, you can identify which parameter the observation is influencing most by using DFBETAS.

To detect influential observations in your model using PROC REG, you can produce diagnostic statistics as well as diagnostic plots. To control which plots are produced, you can use the PLOTS= option in the PROC REG statement. To request the diagnostic statistics used in creating the plots without producing the plots themselves, you can use the R and INFLUENCE options in the MODEL statement. When you use these options, PROC REG creates an ODS output object called OutputStatistics, which contains the residuals and influence statistics from the R and INFLUENCE model options. To add variables in the model to the OutputStatistics data object, you specify them in the ID statement. To save the statistics in an output data set, you use the ODS OUTPUT statement.

For very large data sets, viewing or printing all residuals and influence statistics quickly becomes unwieldy. To reduce the amount of output, you can use the cutoff values for each of the diagnostic criteria to detect influential observations. To do so, you can use macro variables and the DATA step to create a program that you can reuse.

You can handle influential observations in several ways. You can recheck for data entry errors, determine whether you have an adequate model, and determine whether the observation is valid but unusual. In your analysis, you should report the results of your model with and without the influential observation.
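
The cutoff values mentioned above can be written out explicitly. The expressions below are the ones computed in this lesson's sample program for detecting influential observations, where p is the number of parameters in the model (predictor variables plus 1) and n is the number of observations. They are suggested rules of thumb, not formal tests.

```latex
\[
|\mathrm{RSTUDENT}_i| > 3, \qquad
|\mathrm{DFFITS}_i| > 2\sqrt{\frac{p}{n}}, \qquad
\text{Cook's } D_i > \frac{4}{n}
\]
```

With the values used in the sample program (p = 5, n = 31), the DFFITS cutoff is 2*sqrt(5/31), about 0.80, and the Cook's D cutoff is 4/31, about 0.13.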

Collinearity, also called multicollinearity, occurs in multiple regression when two or more predictor variables are highly correlated with each other. Collinearity doesn't violate the assumptions of multiple regression, but it leads to instability in the regression model.

To detect collinearity, you can check your PROC REG output. To measure the magnitude of collinearity in a model, you can use the VIF option in the MODEL statement. If you detect collinearity, you can determine how to proceed and which model to select.

To review, effective modeling includes performing preliminary analyses, selecting candidate models, validating assumptions, detecting influential observations and collinearity, revising your model, and performing prediction testing.
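
The VIF option reports a variance inflation factor for each predictor. A standard definition, stated here as background rather than taken from the lesson text, uses R_j^2, the R-square obtained by regressing predictor j on all the other predictors in the model:

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]
```

A VIF of 1 means that predictor j is uncorrelated with the other predictors, and the value grows without bound as R_j^2 approaches 1. A commonly quoted rule of thumb treats values above 10 as a sign of collinearity, though that threshold is a convention rather than a formal test.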

Syntax

LIBNAME libref 'SAS-library';

ODS OUTPUT output-object-specification=data-set;

PROC REG DATA=SAS-data-set;
     MODEL dependent=regressor(s);
     ID variable(s);
RUN;

Selected Options in PROC REG

Statement       Option
PROC REG        PLOTS=
MODEL           R, INFLUENCE, VIF

%LET variable=value;

DATA SAS-data-set;
     SET SAS-data-set;
     variable=value;
     IF expression;
RUN;

PROC PRINT DATA=SAS-data-set;
     VAR variable(s);
RUN;

Sample Programs

Producing Default Diagnostic Plots

ods graphics / imagemap=on;

proc reg data=statdata.fitness;
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   id Name;
   title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;

title;

Requesting Specific Diagnostic Plots

ods graphics / imagemap=on;

proc reg data=statdata.fitness
         plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   id Name;
   title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;

title;

Using Diagnostic Plots to Identify Influential Observations

ods graphics / imagemap=on;

proc reg data=statdata.fitness
         plots(only)=(RSTUDENTBYPREDICTED(LABEL)
                      COOKSD(LABEL)
                      DFFITS(LABEL)
                      DFBETAS(LABEL));
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   id Name;
   title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;

title;

Writing Diagnostic Statistics to an Output Data Set

ods output outputstatistics=Check4Outliers;

proc reg data=statdata.fitness;
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_pulse
                  / r influence;
   id Name Oxygen_Consumption RunTime Age Run_Pulse Maximum_pulse;
   title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;

title;

Detecting Influential Observations Programmatically

%let dsname=check4outliers;   /*data set name*/
%let numparms=5;              /*# of predictor variables + 1*/
%let numobs=31;               /*# of observations*/
%let idvars=Name Oxygen_Consumption
            RunTime DFB_RunTime
            Age DFB_Age
            Run_Pulse DFB_Run_Pulse
            Maximum_pulse DFB_Maximum_Pulse;   /*relevant variable(s)*/

data influential;
   set &dsname;
   CutDFFits=2*(sqrt(&numparms/&numobs));
   CutCooksD=4/&numobs;
   RStud_i=(abs(RStudent)>3);
   DFits_i=(abs(DFFits)>CutDFFits);
   CookD_i=(CooksD>CutCooksD);
   Summary_i=compress(RStud_i||DFits_i||CookD_i);
   if Summary_i ne '000';
run;

proc print data=influential;
   var Summary_i &IDVars PredictedValue
       RStudent DFFits CutDFFits CooksD CutCooksD;
   title 'Observations Exceeding Suggested Cutoffs';
run;

title;

Detecting Collinearity

proc reg data=statdata.fitness;
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   FULL:    model Oxygen_Consumption=
                  Performance RunTime Age Weight
                  Run_Pulse Rest_Pulse Maximum_Pulse;
   title 'Collinearity: Full Model';
run;
quit;

title;

Calculating Collinearity Diagnostics

proc reg data=statdata.fitness;
   FULL: model Oxygen_Consumption=
               Performance RunTime Age Weight
               Run_Pulse Rest_Pulse Maximum_Pulse
               / vif;
   title 'Collinearity: Full Model with VIF';
run;
quit;

title;

Dealing with Collinearity

proc reg data=statdata.fitness;
   NOPERF: model Oxygen_Consumption=
                 RunTime Age Weight
                 Run_Pulse Rest_Pulse Maximum_Pulse
                 / vif;
   title 'Dealing with Collinearity';
run;
quit;

title;


Summary: Lesson 5: Categorical Data Analysis

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

A one-way frequency table displays frequency statistics for a categorical variable.

An association exists between two variables if the distribution of one variable changes when the value of the other variable changes. If there's no association, the distribution of the first variable is the same regardless of the level of the other variable.

To look for a possible association between two or more categorical variables, you can create a crosstabulation table. A crosstabulation table shows frequency statistics for each combination of values (or levels) of two or more variables.

To create frequency and crosstabulation tables in SAS, and request associated statistics and plots, you use the TABLES statement in the FREQ procedure. You can use the PLOTS= option in the TABLES statement to request specific plots for frequency and crosstabulation tables.

When ordinal values are ordered logically, you can use more powerful statistical tests that can detect linear (ordinal) associations instead of only general associations. To logically order the values of a variable for calculations and output, you can create a new variable or you can apply a temporary format to an existing variable. The ORDER=FORMATTED option in the PROC FREQ statement tells PROC FREQ to perform calculations and display output by using the formatted values instead of the stored values.

To perform a formal test of association between two categorical variables, you use the chi-square test. The Pearson chi-square test is the most commonly used of several chi-square tests. The chi-square statistic indicates the difference between observed frequencies and expected frequencies. Neither the chi-square statistic nor its p-value indicates the magnitude of an association.

Cramer's V statistic is one measure of the strength of an association between two categorical variables. Cramer's V statistic is derived from the Pearson chi-square statistic.

To measure the strength of the association between a binary predictor variable and a binary outcome variable, you can use an odds ratio. An odds ratio indicates how much more likely it is, with respect to odds, that a certain event, or outcome, occurs in one group relative to its occurrence in another group.

To perform a Pearson chi-square test of association and generate related measures of association, you specify the CHISQ option and other options in the TABLES statement in PROC FREQ.

For ordinal associations, the Mantel-Haenszel chi-square test is a more powerful test than the Pearson chi-square test. The Mantel-Haenszel chi-square statistic and its p-value indicate whether an association exists but not the magnitude of the association.

To measure the strength of the linear association between two ordinal variables, you can use the Spearman correlation statistic. The Spearman correlation is considered to be a rank correlation because it provides a degree of linearity between the ordinal variables.

To perform a Mantel-Haenszel chi-square test of association and generate related measures of association, you specify the CHISQ option and other options in the TABLES statement in PROC FREQ.
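
For reference, two of the measures above can be written as formulas. The notation is assumed here, not taken from the lesson: chi-square is the Pearson chi-square statistic, n is the total sample size, r and c are the numbers of rows and columns in the crosstabulation table, and p_A and p_B are the probabilities of the outcome in groups A and B.

```latex
\[
V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}
\qquad\qquad
\text{odds ratio} = \frac{p_A / (1 - p_A)}{p_B / (1 - p_B)}
\]
```

An odds ratio of 1 indicates no association between group and outcome. Values above 1 mean that the odds of the outcome are higher in group A, and values below 1 mean that they are lower.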

Logistic regression is a type of statistical model that you can use to predict a categorical response, or outcome, on the basis of one or more continuous or categorical predictor variables. You select one of three types of logistic regression (binary, nominal, or ordinal) based on your response variable.

Although linear and logistic regression models have the same structure, you can't use linear regression with a binary response variable.

Binary logistic regression uses a predictor variable to estimate the probability of a specific outcome. To directly model the relationship between a continuous predictor and the probability of an event or outcome, you must use a nonlinear function: the inverse logit function.

To model categorical data, you use the LOGISTIC procedure. The two required statements are the PROC LOGISTIC statement and the MODEL statement. Depending on the complexity of your analysis, you can use additional statements in PROC LOGISTIC. If your model has one or more categorical predictor variables, you must specify them in the CLASS statement. The MODEL statement specifies the response variable and can specify other information as well, such as the predictor variables. In the MODEL statement, the EVENT= option specifies the event category for a binary response model. To specify the type of confidence intervals you want to use, you add the CLODDS= option to the MODEL statement. PROC LOGISTIC computes Wald confidence intervals by default. You can use the PLOTS= option in the PROC LOGISTIC statement to request specific plots.

Instead of working directly with the categorical predictor variables in the CLASS statement, PROC LOGISTIC first parameterizes each predictor variable. The CLASS statement creates a set of one or more design variables that represent the information in each specified classification variable. PROC LOGISTIC uses the design variables, and not the original variables, in model calculations. Two common parameterization methods are effect coding (the method that PROC LOGISTIC uses by default) and reference cell coding. To specify a parameterization method other than the default, you use the PARAM= option in the CLASS statement. If you want to specify a reference level other than the default for a classification variable, you use the REF= variable option in the CLASS statement.

Akaike's information criterion (AIC) and the Schwarz criterion (SC) are goodness-of-fit measures that you can use to compare models. -2 Log L is a goodness-of-fit measure that is not commonly used to compare models.

Comparing pairs is another goodness-of-fit measure that you can use to compare models.

PROC LOGISTIC uses a 0.05 significance level and a 95% confidence interval by default. If you want to specify a different significance level for the confidence interval, you can use the ALPHA= option in the MODEL statement.

For a continuous predictor variable, the odds ratio measures the increase or decrease in odds associated with a one-unit difference of the predictor variable by default.
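
The logit link and its inverse, referred to above, can be written for a single continuous predictor x as follows. The symbols are notation assumed here: p is the probability of the event, and beta_0 and beta_1 are the intercept and slope.

```latex
\[
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\]
```

Under this model, a one-unit increase in x multiplies the odds p/(1 - p) by exp(beta_1), which corresponds to the default one-unit odds ratio described above for a continuous predictor.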

A multiple logistic regression model characterizes the relationship between a categorical response variable and multiple predictor variables.

One method of selecting a subset of predictor variables for a multiple logistic regression model is the backward elimination method. To specify the variable selection method in PROC LOGISTIC, you add the SELECTION= option to the MODEL statement. By default, for the backward elimination method, PROC LOGISTIC uses a 0.05 significance level to determine which variables remain in the model. If you want to change the significance level, you can use the SLSTAY= (or SLS=) option in the MODEL statement.

Multiple logistic regression uses adjusted odds ratios, which measure the effect of a single predictor variable on a response variable while holding all the other predictor variables constant.

In PROC LOGISTIC, the UNITS statement enables you to obtain customized odds ratio estimates for a specified unit of change in one or more continuous predictor variables.

In the CLASS statement, when you use the REF= option with a variable that has either a temporary or a permanent format assigned to it, you must specify the formatted value of the level instead of the stored value.

When you fit a multiple logistic regression model, the simplest approach is to consider only the main effects (the effect of each predictor individually) on the response. If you suspect that there are interactions between predictor variables, you can fit a more complex logistic regression model that includes interactions. When you use the backward elimination method with interactions in the model, PROC LOGISTIC must preserve the model hierarchy when eliminating main effects. You specify interactions in the MODEL statement.

By default, PROC LOGISTIC produces the odds ratio only for variables that are not involved in an interaction.

To tell PROC LOGISTIC to produce the odds ratios for each value of a variable that is involved in an interaction, you can use the ODDSRATIO statement. To specify whether PROC LOGISTIC computes the odds ratios for a categorical variable against the reference level or against all of its levels, you can use the DIFF= option. The AT option specifies fixed levels of one or more interacting variables (also called covariates). PROC LOGISTIC computes odds ratios at each of the specified levels.

To visualize the interaction between two categorical variables, you can produce an interaction plot.

Syntax

PROC FREQ DATA=SAS-data-set;
     TABLES table-request(s);
     additional statements;
RUN;

Selected Options in PROC FREQ

Statement        Option
PROC FREQ        ORDER=
TABLES           CELLCHI2, CHISQ (Pearson and Mantel-Haenszel), CL, EXPECTED, MEASURES, NOCOL, NOPERCENT, PLOTS=, RELRISK

PROC LOGISTIC DATA=SAS-data-set;
     CLASS variable ... ;
     MODEL response=predictors;
     UNITS independent1=list ... ;
     ODDSRATIO variable;
RUN;

Selected Options in PROC LOGISTIC

Statement        Option
PROC LOGISTIC    PLOTS=
CLASS            PARAM=, REF= (general usage and usage with a formatted variable)
MODEL            ALPHA=, CLODDS=, EVENT=, SELECTION=, SLSTAY= | SLS=
ODDSRATIO        AT, CL=, DIFF=

Sample Programs

Examining the Distribution of Variables

proc freq data=statdata.sales;
   tables Purchase Gender Income
          Gender*Purchase
          Income*Purchase / plots=(freqplot);
   format Purchase purfmt.;
   title1 'Frequency Tables for Sales Data';
run;

ods select histogram probplot;

proc univariate data=statdata.sales;
   var Age;
   histogram Age / normal (mu=est sigma=est);
   probplot Age / normal (mu=est sigma=est);
   title1 'Distribution of Age';
run;

title;

Ordering the Values of a Variable by Creating a New Variable

data statdata.sales_inc;
   set statdata.sales;
   if Income='Low' then IncLevel=1;
   else if Income='Medium' then IncLevel=2;
   else if Income='High' then IncLevel=3;
run;

proc freq data=statdata.sales_inc;
   tables IncLevel*Purchase / plots=freq;
   format IncLevel incfmt. Purchase purfmt.;
   title1 'Create variable IncLevel to correct Income';
run;

title;

Performing a Pearson Chi-Square Test of Association

proc freq data=statdata.sales_inc;
   tables Gender*Purchase
          / chisq expected cellchi2 nocol nopercent relrisk;
   format Purchase purfmt.;
   title1 'Association between Gender and Purchase';
run;

title;

Performing a Mantel-Haenszel Chi-Square Test

proc freq data=statdata.sales_inc;
   tables IncLevel*Purchase / chisq measures cl;
   format IncLevel incfmt. Purchase purfmt.;
   title1 'Ordinal Association between IncLevel and Purchase?';
run;

title;

Fitting a Binary Logistic Regression Model

proc logistic data=statdata.sales_inc plots(only)=(effect);
   class Gender (param=ref ref='Male');
   model Purchase(event='1')=Gender;
   title1 'LOGISTIC MODEL (1):Purchase=Gender';
run;

title;

Fitting a Multiple Logistic Regression Model

proc logistic data=statdata.sales_inc
              plots(only)=(effect oddsratio);
   class Gender (param=ref ref='Male')
         IncLevel (param=ref ref='1');
   units Age=10;
   model Purchase(event='1')=Gender Age IncLevel
         / selection=backward clodds=pl;
   title1 'LOGISTIC MODEL (2):Purchase=Gender Age IncLevel';
run;

title;

Fitting a Multiple Logistic Regression Model with Interactions

proc logistic data=statdata.sales_inc
              plots(only)=(effect oddsratio);
   class Gender (param=ref ref='Male')
         IncLevel (param=ref ref='1');
   units Age=10;
   model Purchase(event='1')=Gender | Age | IncLevel @2
         / selection=backward clodds=pl;
   title1 'LOGISTIC MODEL (3): Main Effects and 2-Way Interactions';
   title2 '/ sel=backward';
run;

title;

Fitting a Multiple Logistic Regression Model with All Odds Ratios

ods select OddsRatiosPL ORPlot;

proc logistic data=statdata.sales_inc plots(only)=(oddsratio);
   class Gender (param=ref ref='Male')
         IncLevel (param=ref ref='1');
   units Age=10;
   model Purchase(event='1')=Gender | IncLevel Age;
   oddsratio Age / cl=pl;
   oddsratio Gender / diff=ref at (IncLevel=all) cl=pl;
   oddsratio IncLevel / diff=ref at (Gender=all) cl=pl;
   title1 'LOGISTIC MODEL (3a): Significant Terms and All Odds Ratios';
   title2 '/ sel=backward';
run;

title;