ANOVA, Regression and Multiple Regression March 22-23


Page 1: ANOVA, Regression and Multiple Regression March 22-23

ANOVA, Regression and Multiple Regression

March 22-23

Page 2: ANOVA, Regression and Multiple Regression March 22-23

Why ANOVA

• ANOVA allows us to compare the means for groups and ask if they are sufficiently different from one another to say that such differences are statistically significant.

• In other words, there is a low probability that a difference of such magnitude would occur between the sample means if the actual difference between the populations were zero.

Page 3: ANOVA, Regression and Multiple Regression March 22-23

• Unlike “t” tests, ANOVA can be performed with more than two groups

• The statistic associated with ANOVA is the “F” statistic, evaluated with the F test.

Page 4: ANOVA, Regression and Multiple Regression March 22-23

As the book notes…

• “The details of ANOVA are a bit daunting (they appear in an optional section at the end of this chapter). The main idea of ANOVA is more accessible and much more important. Here it is: when we ask if a set of sample means gives evidence for differences among the population means, what matters is not how far apart the sample means are but how far apart they are relative to the variability of individual observations.”

Page 5: ANOVA, Regression and Multiple Regression March 22-23

In other words, let’s look both at the means and at the overlap of the distributions!

Page 6: ANOVA, Regression and Multiple Regression March 22-23

Statistically speaking

• The F statistic compares variation among the sample means to variation among individuals in the same sample

• Like “t”, the “F” statistic is very robust, so you should not worry too much about deviations from normality if your sample is large

F = (variation among the sample means) / (variation among individuals in the same sample)
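To make that ratio concrete, here is a minimal sketch (not from the book; the three groups are made-up numbers) that computes the one-way ANOVA F statistic by hand as the between-groups mean square over the within-groups mean square.

```python
import statistics

def f_statistic(groups):
    """One-way ANOVA F statistic: variation among the sample means
    over variation among individuals in the same sample."""
    k = len(groups)                        # number of groups
    n = sum(len(g) for g in groups)        # total observations
    grand = sum(sum(g) for g in groups) / n
    # Between-groups mean square (variation among the sample means)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ms_between = ss_between / (k - 1)
    # Within-groups mean square (variation among individuals in the same sample)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Three hypothetical groups with well-separated means
groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(f_statistic(groups))  # 27.0
```

A large F means the sample means are far apart relative to the spread of individual observations, which is exactly the idea the book emphasizes.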

Page 7: ANOVA, Regression and Multiple Regression March 22-23

One warning

• ANOVA assumes that the variability of observations, measured by the standard deviation, is the same in all populations

• In the real world, if you keep the sizes of the groups you are comparing roughly similar, few problems occur, but you must check.

Page 8: ANOVA, Regression and Multiple Regression March 22-23

The book gives this rule

• Results of the “F” test are usually okay if the largest sample standard deviation is no more than twice as large as the smallest sample standard deviation

• Another way to check is Levene’s equality-of-variance test.

• If it is significant (low p), you reject the hypothesis that the groups share a common standard deviation; in other words, the equal-variance assumption is violated and you have a problem
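Both checks can be sketched in a few lines. This is a rough illustration with made-up data, not the book’s example: the 2× rule of thumb, plus Levene’s statistic computed the classic way, as a one-way ANOVA F on the absolute deviations of each observation from its group mean.

```python
import statistics

def sd_ratio_ok(groups):
    """Book's rule of thumb: largest sample SD at most twice the smallest."""
    sds = [statistics.stdev(g) for g in groups]
    return max(sds) <= 2 * min(sds)

def levene_statistic(groups):
    """Levene's statistic: an ANOVA F computed on absolute deviations
    from each group's mean. Large values suggest unequal variances."""
    z = [[abs(x - statistics.mean(g)) for x in g] for g in groups]
    k = len(z)
    n = sum(len(g) for g in z)
    grand = sum(sum(g) for g in z) / n
    ms_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in z) / (k - 1)
    ms_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in z) / (n - k)
    return ms_between / ms_within

# Hypothetical groups with very different spreads
groups = [[4, 5, 6], [2, 8, 14], [1, 10, 19]]
print(sd_ratio_ok(groups))  # False: the SDs are 1, 6, 9
```

In practice you would read the p value for Levene’s test from your software rather than compute the statistic by hand.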

Page 9: ANOVA, Regression and Multiple Regression March 22-23
Page 10: ANOVA, Regression and Multiple Regression March 22-23

From Table 24.2 (richness of trees by Group)

Page 11: ANOVA, Regression and Multiple Regression March 22-23

Why Regression

• Regression is commonly used in the social sciences because it allows us to
– Explain
– Predict
which are two of the big goals of social science (along with describe)

Page 12: ANOVA, Regression and Multiple Regression March 22-23

Recall

• Regression involves mathematically describing a linear relationship between a response (or dependent) variable and an explanatory (or independent) variable

• That line is given in the form y = a + b(x), where:
– y is the response variable
– a is the y-axis intercept of the line
– b is the slope of the line
– x is the explanatory variable
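The coefficients a and b come straight from the standard least-squares formulas. Here is a small sketch (toy data invented so that y = 1 + 2x exactly, not an example from the book):

```python
from statistics import mean

def least_squares(xs, ys):
    """Fit y = a + b*x by least squares:
    b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²),  a = ȳ - b*x̄."""
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

# Toy data lying exactly on the line y = 1 + 2x
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```

With real data the points will not lie exactly on the line, and the same formulas give the best-fitting line in the least-squares sense.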

Page 13: ANOVA, Regression and Multiple Regression March 22-23

Requirements for use of regression

• Also recall that if the relationship between our response and explanatory variable is not linear, then regression will give misleading results. Therefore we always do a scatter plot before attempting regression. The mathematical notation for linearity is below.

• This is sometimes called the “least-squares regression line” because this regression procedure finds the line that minimizes the sum of squared differences from each data point.

μy = α + βx

Page 14: ANOVA, Regression and Multiple Regression March 22-23

Regression requirements continued

• For any value of x, the values of y are normally distributed, and repeated responses of y are independent of each other

• The standard deviation of y is the same for all values of x

Page 15: ANOVA, Regression and Multiple Regression March 22-23

Regression Analysis

• As well as estimating the regression line, we also estimate the goodness of fit between the line and the data by using a statistic known as Rsq

• Rsq (as the name implies) is the square of the correlation measure known as “r”.

• We also have to know the significance of the association between the explanatory and response variables (as well as the coefficient “a”) for the line we have found; we use a variation of the “t” test for this.
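As a quick sanity check of that relationship, this sketch (with made-up numbers, not the book’s data) computes Rsq from the residuals of the fitted line and compares it to the square of r:

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Correlation coefficient r."""
    xbar, ybar = mean(xs), mean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sqrt(sum((x - xbar) ** 2 for x in xs)
               * sum((y - ybar) ** 2 for y in ys))
    return num / den

def r_squared(xs, ys):
    """Rsq = 1 - SS_residual / SS_total for the least-squares line:
    the fraction of variation in y explained by the line."""
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 6]
print(round(r_squared(xs, ys), 6), round(pearson_r(xs, ys) ** 2, 6))  # 0.727273 0.727273
```

In simple OLS with one explanatory variable the two values agree exactly, which is the “happy coincidence” discussed later.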

Page 16: ANOVA, Regression and Multiple Regression March 22-23

A useful tool: Regression Standard Error

• The regression standard error is a useful tool that can help us diagnose whether we have met the various conditions needed to perform a regression (don’t worry, your software will do this).

s = √[ Σ(y − ŷ)² / (n − 2) ]
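Here is a minimal sketch of that formula; the observed and fitted values below are invented for illustration.

```python
from math import sqrt

def regression_standard_error(ys, y_hats):
    """s = sqrt( sum((y - ŷ)²) / (n - 2) ): roughly the typical size
    of a residual, using n - 2 degrees of freedom because the line
    uses up two estimates (intercept and slope)."""
    n = len(ys)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
    return sqrt(ss_res / (n - 2))

# Hypothetical observed vs. fitted values; each residual is ±1
print(regression_standard_error([3, 5, 7, 9], [2, 6, 6, 10]))  # ≈ 1.414
```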

Page 17: ANOVA, Regression and Multiple Regression March 22-23

So, looking at example 23.1 in your book, here is the scatter plot

Page 18: ANOVA, Regression and Multiple Regression March 22-23

Here is the regression control showing how I have selected the standard errors, called “residuals” in SPSS

Page 19: ANOVA, Regression and Multiple Regression March 22-23

Here is the dialogue box in Excel using the plug in

Page 20: ANOVA, Regression and Multiple Regression March 22-23

Here is a portion of the printout that was generated in SPSS

Page 21: ANOVA, Regression and Multiple Regression March 22-23

Here you can see the standard residuals or errors that were calculated

Page 22: ANOVA, Regression and Multiple Regression March 22-23

In Excel it looks like this

Page 23: ANOVA, Regression and Multiple Regression March 22-23

A happy coincidence

• As the book notes, Rsq is “closely related” to r.

• In fact, it is literally the square of r in a simple OLS regression with one explanatory variable.

• Therefore, when you test the null hypothesis of the regression line (that it is actually flat), you have also pretty much tested the correlation. However, most software also prints it out in case you want to see it.

Page 24: ANOVA, Regression and Multiple Regression March 22-23

Does it matter that the estimate of the intercept is insignificant?

• In practice no.

• What really matters is the estimate of the slope

Page 25: ANOVA, Regression and Multiple Regression March 22-23

Calculating the confidence interval for your line

• If you look back at our printout you will see the slope is given, as are the standard error of the slope and a t value. Put them together and you have the 95% confidence interval for the population slope.

b ± t* × SEb   (95% confidence)
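This sketch just plugs hypothetical printout values into the formula; the slope, standard error, and t* below are made up, and in practice you would read them off your software output or a t table.

```python
def slope_confidence_interval(b, se_b, t_star):
    """Confidence interval for the population slope: b ± t* × SE_b.
    t_star is the critical t value for n - 2 degrees of freedom at
    the chosen confidence level (e.g. 95%)."""
    margin = t_star * se_b
    return b - margin, b + margin

# Hypothetical printout values: slope 1.5, SE 0.2, t* = 2.306 (df = 8, 95%)
lo, hi = slope_confidence_interval(1.5, 0.2, 2.306)
print(round(lo, 4), round(hi, 4))  # 1.0388 1.9612
```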

Page 26: ANOVA, Regression and Multiple Regression March 22-23

Or you could have had the computer calculate the confidence intervals for you

Page 27: ANOVA, Regression and Multiple Regression March 22-23

As noted before, we can use our standardized errors to check our assumptions

• The y values vary normally for each x value: do a histogram of your residuals and check for relative normality of distribution

• Plot the residuals as the dependent variable with the x variable as independent to check for linearity and that the observations for y are independent of each other

• The standard deviation of responses can be checked by looking for a roughly symmetrical distribution above and below the zero point
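As a rough numeric stand-in for those visual checks (purely illustrative, not from the book, and no substitute for actually looking at the plots), one can at least confirm that the residuals center near zero and split roughly evenly above and below it:

```python
from statistics import mean, stdev

def residual_checks(residuals):
    """Crude numeric versions of the visual residual checks:
    residuals should center near zero and fall roughly
    symmetrically above and below the zero line."""
    above = sum(1 for r in residuals if r > 0)
    below = sum(1 for r in residuals if r < 0)
    return {
        "mean_near_zero": abs(mean(residuals)) < 0.1 * stdev(residuals),
        "roughly_symmetric": abs(above - below) <= max(1, len(residuals) // 5),
    }

# Hypothetical residuals from a well-behaved fit
print(residual_checks([0.5, -0.3, 0.2, -0.6, 0.1, -0.2, 0.4, -0.1]))
```

A histogram and a residuals-versus-x plot remain the primary tools; this only flags gross violations.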

Page 28: ANOVA, Regression and Multiple Regression March 22-23

Our previous example had too few cases to check residuals, so here is example 23.9 from the book on climate change

Page 29: ANOVA, Regression and Multiple Regression March 22-23
Page 30: ANOVA, Regression and Multiple Regression March 22-23
Page 31: ANOVA, Regression and Multiple Regression March 22-23

Moving from OLS to multiple OLS: three big changes

• 1. We have to use beta instead of b.

• 2. We have to be aware of multicollinearity and other multiple-variable impacts (in short, that we are not just piling on independent variables but that each independent variable is demonstrating a unique explanatory power).

• 3. The book gives you a third: we have to be aware of interaction terms and other factors that lead us to pick one model over another.

Page 32: ANOVA, Regression and Multiple Regression March 22-23

The equation is now changed to reflect the greater number of variables and the change from b to beta

μy = β0 + β1x1 + β2x2 + … + βnxn
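Under the hood, the betas come from solving the normal equations (XᵀX)β = Xᵀy. Here is a minimal sketch with Gaussian elimination and toy data constructed so that y = 1 + 2x₁ + 3x₂ exactly; real software layers significance tests and diagnostics on top of this.

```python
def fit_multiple_ols(X, y):
    """Multiple OLS via the normal equations (XᵀX)β = Xᵀy.
    X is a list of rows of explanatory values; a leading 1 is
    added to each row for the intercept β0."""
    rows = [[1.0] + list(r) for r in X]  # prepend intercept column
    p = len(rows[0])
    # Build XᵀX and Xᵀy
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda i: abs(A[i][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    return beta  # [β0, β1, β2, ...]

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2
X = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
y = [3, 4, 6, 8, 9]
print([round(v, 6) for v in fit_multiple_ols(X, y)])  # [1.0, 2.0, 3.0]
```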

Page 33: ANOVA, Regression and Multiple Regression March 22-23

How to do it?

• Start from the beginning and look at each variable separately using our descriptive and exploratory techniques

• Now look at our dependent variable in pairs with each independent variable using correlations to see which ones might have a big impact

• Fit different models. Pay attention to changes in explanatory power and also the t statistics

• If using stats software, use stepwise procedures. Stepwise adds and removes variables automatically based on a selection criterion (the change in the F statistic from the ANOVA test), rather than simply in the order you input them. In short, the computer tells you which model best fits with as few variables as possible.

Page 34: ANOVA, Regression and Multiple Regression March 22-23

In SPSS you can do more than one scatterplot at once

Page 35: ANOVA, Regression and Multiple Regression March 22-23

The data provided in Table 27.6 represent a random sample of 60 customers from a large clothing retailer. The manager of the store is interested in predicting how much a customer will spend on his or her next purchase. Our goal is to find a regression model for predicting the amount of a purchase from the available explanatory variables. A short description of each variable is provided below.

Page 36: ANOVA, Regression and Multiple Regression March 22-23
Page 37: ANOVA, Regression and Multiple Regression March 22-23

Here are the print outs for Ex 27.19 using SPSS

Page 38: ANOVA, Regression and Multiple Regression March 22-23
Page 39: ANOVA, Regression and Multiple Regression March 22-23

Let’s add a new variable

• Purchase 12 shows the total purchases each customer made over the last 12 months divided by the frequency of their visits to the store

Page 40: ANOVA, Regression and Multiple Regression March 22-23

As you will see, it changes things. Here is the OLS for it alone

Page 41: ANOVA, Regression and Multiple Regression March 22-23

• The last slide was basically an interaction of the two variables we previously identified as helpful. Let’s go back to when they were separate for a second and test whether each has a separate impact or whether multicollinearity is at play. Look for tolerances of .1 or less as evidence of multicollinearity

Page 42: ANOVA, Regression and Multiple Regression March 22-23

Finally let’s look at our residual plots

• Often you might have the chance to use more elaborate residuals than standardized ones, such as studentized residuals.

• As there is no pattern we assume the variance for y is the same for all values of x

Page 43: ANOVA, Regression and Multiple Regression March 22-23

• The sequence chart also tells us that the y values are independent of each other

Page 44: ANOVA, Regression and Multiple Regression March 22-23

• The QQ plot tells us the residuals are roughly normal meaning that the notion that values of y vary normally for each value of x might be met