
# ANOVA & Regression Selecting the Correct Statistical Test



Analysis of Variance

• Is used when you want to compare means for three or more groups.

• You have a normal distribution (random sample or population).

• It can be used to determine causation.

• It contains an independent variable that is nominal and a dependent variable that is interval/ratio.

Other properties of both t-test and ANOVA

• Assumes equal variances across groups (homogeneity of variance); roughly equal numbers of observations in each group help satisfy this assumption.

• Samples for both the t-test and ANOVA should be “independent”: separate groups should have different members, and memberships should not overlap between groups.

• Calculations are based on degrees of freedom. (You will see degrees of freedom on the SPSS printout.)

DF for the t-test is n – 1 (where n = the number of observations).

As with chi-square, degrees of freedom represent:

• The ability of numbers in the data set to vary.

• DF in ANOVA is a bit more complex.

Calculations are based on the difference in means between each group and within each group.

• Therefore, degrees of freedom between groups = k – 1, where k = the number of groups.

• Degrees of freedom within groups: take the number of observations in each group (n) minus 1, then add up those degrees of freedom across the groups.

For example, if we had three groups for whom we have scores on the Depression Test:

| AA | Individual Treatment | Group Counseling |
|----|----------------------|------------------|
| 30 | 60                   | 25               |
| 52 | 34                   | 49               |
| 24 | 56                   | 37               |
| 60 | 27                   | 52               |
| 19 | 42                   |                  |
| 57 | 51                   |                  |
| 45 |                      |                  |

Degrees of Freedom

• Between Groups = k – 1 = 3 – 1 = 2 (k = number of groups)

• Within Groups = sum of (n – 1) for each group:

(7 – 1) + (6 – 1) + (5 – 1) = 6 + 5 + 4 = 15
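The arithmetic above can be sketched directly; the group sizes below are the three hypothetical treatment groups from the Depression Test example:

```python
# Degrees of freedom for a one-way ANOVA, using the three
# hypothetical treatment groups from the Depression Test example.
group_sizes = {"AA": 7, "Individual Treatment": 6, "Group Counseling": 5}

k = len(group_sizes)                                  # number of groups
df_between = k - 1                                    # 3 - 1 = 2
df_within = sum(n - 1 for n in group_sizes.values())  # 6 + 5 + 4 = 15

print(df_between, df_within)  # → 2 15
```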

Reading the ANOVA printout

ANOVA: Highest Year of School Completed

|                | Sum of Squares | df   | Mean Square | F      | Sig. |
|----------------|----------------|------|-------------|--------|------|
| Between Groups | 240.725        | 2    | 120.362     | 13.746 | .000 |
| Within Groups  | 13195.99       | 1507 | 8.756       |        |      |
| Total          | 13436.72       | 1509 |             |        |      |

Report: Highest Year of School Completed by Race of Respondent

| Race of Respondent | Mean  | N    | Std. Deviation |
|--------------------|-------|------|----------------|
| White              | 13.06 | 1262 | 2.955          |
| Black              | 11.89 | 199  | 2.677          |
| Other              | 12.47 | 49   | 4.001          |
| Total              | 12.88 | 1510 | 2.984          |

Testing a Hypothesis with ANOVA

• Confidence level: .01

• Alternative hypothesis: Ethnicity is associated with years of education completed.

• Null hypothesis: There is no association between ethnicity and years of education completed.

• F = 13.746, p = .000

Do we confirm or reject the null hypothesis?
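The decision rule behind that question can be written out directly; the F and p values come from the SPSS printout, and the .01 confidence level from the slide:

```python
# Decision rule: reject the null hypothesis when p < alpha.
alpha = 0.01                      # confidence level from the slide
f_stat, p_value = 13.746, 0.000   # SPSS prints ".000", i.e. p < .0005
reject_null = p_value < alpha

print(reject_null)  # → True: ethnicity is associated with education
```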

Regression Analysis:

• Allows us to look at causation using two interval/ratio variables.

• Involves predicting the value of the dependent variable using the independent variable. Other control variables can be added to the regression analysis.

Calculation for Regression is based on:

• The concept of the regression line: which points in the association between the two variables fall on or off the regression line.

• For simple (two-variable) regression: Y = a + bX, where a = the y-intercept and b = the slope of the line. The slope is the amount Y increases for each one-unit increase in X.

• X = the independent variable value used to predict Y (the dependent variable value).
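As a sketch of how a and b are obtained by least squares, the following fits the line to a handful of (X, Y) pairs; the data are made up purely for illustration:

```python
# Least-squares fit of Y = a + b*X on hypothetical (x, y) pairs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope b = covariance(x, y) / variance(x)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x   # intercept: the line passes through the means

print(round(a, 3), round(b, 3))  # → 0.13 1.97
```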

Regression line when looking at the association between two variables

[Scatterplot of Current Salary against Beginning Salary, with the fitted regression line.]

Control Variables

Control variables are those variables that, when combined with the independent variable, may affect the value of the dependent variable.

For example, when we look at the association between beginning salary and current salary, both age and gender may affect salary amounts.

Regression SPSS printout

Model Summary

| Model | R     | R Square | Adjusted R Square | Std. Error of the Estimate |
|-------|-------|----------|-------------------|----------------------------|
| 1     | .881a | .776     | .775              | $8,096.337                 |

a. Predictors: (Constant), Minority Classification, Beginning Salary

ANOVA(b)

| Model 1    | Sum of Squares | df  | Mean Square | F       | Sig.  |
|------------|----------------|-----|-------------|---------|-------|
| Regression | 1.1E+11        | 2   | 5.352E+10   | 816.484 | .000a |
| Residual   | 3.1E+10        | 471 | 65550668.1  |         |       |
| Total      | 1.4E+11        | 473 |             |         |       |

a. Predictors: (Constant), Minority Classification, Beginning Salary
b. Dependent Variable: Current Salary

Coefficients(a)

| Model 1                 | B (Unstandardized) | Std. Error | Beta (Standardized) | t      | Sig. |
|-------------------------|--------------------|------------|---------------------|--------|------|
| (Constant)              | 2516.971           | 945.359    |                     | 2.662  | .008 |
| Beginning Salary        | 1.896              | .048       | .874                | 39.583 | .000 |
| Minority Classification | -1632.896          | 909.959    | -.040               | -1.794 | .073 |

a. Dependent Variable: Current Salary

Let’s check on what this means about minority classification and salary

Report: Current Salary by Minority Classification

| Minority Classification | Mean      | N   | Std. Deviation |
|-------------------------|-----------|-----|----------------|
| No                      | $36,023.3 | 370 | *********      |
| Yes                     | $28,713.9 | 104 | *********      |
| Total                   | $34,419.6 | 474 | *********      |

Hypothesis Testing:

• Confidence level: .05

• Alternative hypothesis: Controlling for minority status, beginning salary is associated with (or can predict) current salary.

• Null hypothesis: Controlling for minority status, beginning salary is not associated with (and cannot predict) current salary.

Analyzing regression

Three values can be used to interpret a regression:

(1) R² – the correlation between the independent and control variables combined and the dependent variable.

(2) F – the goodness of fit of the regression line, calculated based on the number of points off the line.

(3) b – a measure of the correlation between one variable in the regression model and the dependent variable. This is used when you include multiple independent or control variables in the model.

Hypothesis Test (continued)

Total correlation between the independent and control variables and the dependent variable: R² = .776 (note there is no p value, but the closer R² is to 1.00 the better). This means there is a high correlation between minority classification and beginning salary combined and current salary.

Total fit of the model to the regression line: F = 816, p = .00 (less than our confidence level of .05). The alternative hypothesis is confirmed.

Individual Beta values: beginning salary (.874 at p = .00) and minority status (–.040 at p = .073). At the .05 confidence level, only beginning salary is statistically significant, i.e., associated with current salary.
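Reading off the coefficients table, this screening step amounts to comparing each predictor's p value to the confidence level (the betas and p values below are taken from the SPSS printout):

```python
# Keep only predictors whose p-value is below the confidence level.
alpha = 0.05
predictors = {                     # standardized beta, p (from the printout)
    "Beginning Salary": (0.874, 0.000),
    "Minority Classification": (-0.040, 0.073),
}
significant = [name for name, (beta, p) in predictors.items() if p < alpha]

print(significant)  # → ['Beginning Salary']
```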

Review of statistical tests

| Statistical Test | Test statistic |
|------------------|----------------|
| Chi-square       | χ²             |
| T-test           | t              |
| ANOVA            | F              |
| Correlation      | r              |
| Regression       | R² or F for the fit of the model; b for the correlation between each independent or control variable and the dependent variable |

General rules for analyzing results

• The bigger the test statistic, the more likely there is a relationship between the independent and dependent variables. For every type of inferential statistic other than correlation, values greater than 3 are usually statistically significant.

• Relationships can be positive or negative. You need the p value to determine whether the test statistic is actually large enough to be statistically significant. You must always set a confidence level before determining whether the p value is small enough to be statistically significant.

• Findings from small samples are unlikely to be significant unless there is a very strong relationship between the two variables.

How do we write up test results

• We use the test statistic and the probability level.

• Correct procedure for professional journal articles also requires the use of degrees of freedom and number of observations.

• For Assignment #4 use the test statistic and the probability level.

Proper format for this class

• The confidence level is p = .05. Reject the null hypothesis and accept the alternative hypothesis. The correlation is r = .74 at p = .04.

• The confidence level is p = .10. Accept the null hypothesis and reject the alternative hypothesis. There is no association between years of education and salary, controlling for gender; F = .45, p = .70.

Criteria for Using Statistical Tests

• Independent samples

• Level of measurement

• Normal distribution

• Sample size (the minimum for quantitative research should be 30)

• Robustness (can the procedure be used when basic assumptions are violated?). The t-test, ANOVA, and chi-square are considered very robust.

Research note:

• Some types of ordinal data can be used as interval/ratio data in statistical analysis. Montcalm and Royse state that such data should be ranked at no fewer than five levels, come from a normal distribution, and result from a large sample.

• The most common type of ordinal data used as interval/ratio data in statistics is a Likert scale.

Example of a Likert scale

• 1 = Very satisfied

• 2 = Satisfied

• 3 = Neutral

• 4 = Unsatisfied

• 5 = Very unsatisfied

Usually presented as a ranking (1 to 5), which implies an equal distance among the categories.
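Treating such a scale as interval data means coding the categories 1 to 5 and doing arithmetic on the codes; a minimal sketch with made-up responses:

```python
# Code Likert categories as 1-5 and average them as if interval data.
scale = {"Very satisfied": 1, "Satisfied": 2, "Neutral": 3,
         "Unsatisfied": 4, "Very unsatisfied": 5}
responses = ["Satisfied", "Neutral", "Satisfied", "Very satisfied"]  # hypothetical
codes = [scale[r] for r in responses]

print(sum(codes) / len(codes))  # → 2.0
```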

It is proper to use nonparametric statistics when you have:

• A small sample size.

• No normal distribution or no random sampling.

• More than one mode.

• Many outliers in the data set.

• Ordinal or dichotomous dependent variables.

SPSS Instructions for Running ANOVA

• Select Analyze

• Select Compare Means

• Select One-Way ANOVA

• Highlight your dependent variable (must be ratio)

• Click on the arrow

• Highlight your factor (independent) variable (must be nominal with at least three categories)

• Click OK

SPSS instructions for running Regression

• Select Analyze

• Select Regression

• Select Linear

• Highlight your dependent variable (must be ratio)

• Highlight two or more independent or control variables

• Click on the arrow

• Click OK
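For readers working outside SPSS, the same kind of model can be fit with numpy's least squares; the salary numbers below are invented for illustration and are not the SPSS data set used above:

```python
import numpy as np

# Hypothetical data: beginning salary, minority flag (0/1), current salary.
begin    = np.array([20000., 25000., 30000., 35000., 40000.])
minority = np.array([0., 1., 0., 1., 0.])
current  = np.array([38000., 42000., 55000., 60000., 75000.])

# Design matrix with a constant column, mirroring SPSS's (Constant) term.
X = np.column_stack([np.ones_like(begin), begin, minority])
coef, *_ = np.linalg.lstsq(X, current, rcond=None)
intercept, b_begin, b_minority = coef

print(b_begin > 0)  # current salary rises with beginning salary
```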

SPSS Instructions for Running Means

• Select Analyze

• Select Compare Means

• Select Means

• Highlight Dependent (Ratio) Variable

• Highlight Independent (Nominal) Variable

• Click OK